
With its new archivist at the head, here’s how NARA is digitizing America’s documents

Colleen Shogan began her tenure earlier this year, with plans to make firm headway on digitizing the 13 billion records in the archives.

The National Archives. Photo by Flickr user NCinDC

This editorial article is a part of The Tech Behind Month of Technical.ly’s editorial calendar.

Full disclosure: This editorial article mentions Fearless, a Technical.ly Talent Builder client. That relationship had no impact on this report.
Update: This article has been updated to incorporate post-publication comment from LeAnne Matlach, the director of communications for Fearless. (10/11/2023, 9:31 a.m.) 
How do you turn a piece of onionskin paper into an online archive? Or a huge map? Or a piece of paper almost completely torn up? Or all of that combined, times a billion?

A few months into her tenure, Colleen Shogan, the current Archivist of the United States, already has plenty on her plate. Digitizing the archives’ holdings is high on the list, and it’s a little more complicated than just placing a document on a scanner.

President Joe Biden nominated Shogan to lead the National Archives and Records Administration (NARA) in August 2022. Since her confirmation in May, Shogan and the NARA staff have been hard at work digitizing the 13 billion records in the agency’s possession. That, according to Chief Innovation Officer Pamela Wright, requires a variety of scanners and other technology to make sure it’s done right.

As the archivist, Shogan is the steward and protector of all of those documents, which include the Declaration of Independence, the Constitution and the Bill of Rights. Digitization is also a key part of her plans for the agency, primarily because of the access it provides.

“I’m looking forward to serving as a passionate advocate for the work we do, namely strengthening our nation’s democracy through access and accountability,” Shogan told Technical.ly in an email.

To start, she’s focused on reducing the backlog of veterans’ requests at the National Personnel Records Center, which piled up during the pandemic. The requested records can help veterans and their families obtain the benefits they are entitled to. The agency has already made its way through a lot of the backlog, Shogan said, and is on track to eliminate it by January 2024.

Longer-term, NARA has committed to digitizing 500 million pages of records and making them available online to the public in the National Archives Catalog by Oct. 1, 2026. This will be achieved through a mix of in-house, contracted and public-private partnership-based digitization. Shogan also wants to improve the catalog’s search functionality, so the public has an easier time accessing what they need, and double down on providing documents and resources for educators to help improve student scores in history.

“In the long term, access is my top priority,” Shogan said. “We want to help as many people as possible connect with our records in our research rooms, through our educational programs for teachers and students, through our public programs and presentations, and in our museums and galleries.”

For digitization, projects are prioritized based on historical significance, high research use and whether they’re related to underserved communities, needed for exhibition or otherwise important for preservation. Shogan said stakeholder input is also key in the decision-making process.

The process itself, though, is much more involved than many realize, according to Wright.

“Most people think of the scanning of records as the digitization process, but it entails much more than just the scan,” Wright said via email.

The archives hold documents that are decades old, and many have been folded, stapled or printed on fragile paper. Before anything can even be scanned, records need to be unfolded, staples must be removed and torn or fragile papers need to be repaired. To ensure quality, most must be scanned one at a time, by hand, instead of with an automated feeder. Then, the staff needs to add metadata with a description featuring the date, title, scope and archival context of the document (noting that, say, it’s one of many letters sent to the president from one individual). Then the original records need to be returned and refiled.

The metadata is written based on agency-wide standards, so staff have to be trained in those standards and in the technical protocol for the scans. If the work is done in-house, it is also reviewed internally before it makes it to the online catalog. Wright noted that other agencies, contractors, partners and the public also help with getting records scanned and into the catalog.
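As a rough illustration of the description and review steps Wright outlines, a catalog entry could be modeled along the following lines. This is only a sketch under stated assumptions: the class name, field names and readiness rule are hypothetical and are not drawn from NARA’s actual standards or systems.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical list of fields a description must carry, based only on the
# elements named in the article (date, title, scope, archival context).
REQUIRED_FIELDS = ("title", "record_date", "scope", "archival_context")


@dataclass
class RecordDescription:
    """Illustrative metadata for one scanned record (not NARA's schema)."""
    title: str
    record_date: Optional[date]
    scope: str
    archival_context: str
    reviewed: bool = False  # set once an internal review has been done


def missing_fields(desc: RecordDescription) -> List[str]:
    """Return the names of required fields that are empty or unset."""
    return [name for name in REQUIRED_FIELDS if not getattr(desc, name)]


def ready_for_catalog(desc: RecordDescription) -> bool:
    """A record is ready only when fully described and reviewed."""
    return desc.reviewed and not missing_fields(desc)
```

In this sketch, a record with a blank scope or one that has not yet been reviewed would be held back, mirroring the idea that description and review both come before publication.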

One of those contractors is Baltimore-based company Fearless, which is involved in “comprehensive modernization” of the archives, its director of communications, LeAnne Matlach, told Technical.ly.

“Fearless took the lead in several crucial initiatives,” Matlach said via email. “These included streamlining the extract, transform, and load (ETL) process to accelerate the availability of digitized catalog items as well as their corresponding public contributions. We also restructured the data model and made the transition from SQL to Opensearch, thereby enabling more robust and high-performance search capabilities. Furthermore, we executed a full site redesign, adhering to best practices in web design, development, and operations.”

Matlach added that the Fearless staffers working with NARA “placed a strong emphasis on modernization, scalability, and performance, alongside our dedication to accessibility, usability, and data quality, ensuring the fulfillment of user requirements and the attainment of 508 compliance. This initiative necessitated a collaborative approach and a shift towards an agile-focused culture within the agency.”
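For readers unfamiliar with the term, the extract, transform and load pattern Matlach mentions can be sketched in a few lines. The example below is a generic illustration and not Fearless’s or NARA’s implementation: the row layout and function names are hypothetical, and a small in-memory inverted index stands in for a real search engine such as OpenSearch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Row = Tuple[int, str, str]  # hypothetical layout: (record_id, title, description)


def extract(rows: Iterable[Row]) -> Iterator[Row]:
    """Extract: read raw rows from a source (here, any iterable of tuples)."""
    for row in rows:
        yield row


def transform(row: Row) -> dict:
    """Transform: reshape one row into a flat, search-friendly document."""
    record_id, title, description = row
    clean_title = title.strip()
    return {
        "id": record_id,
        "title": clean_title,
        "text": f"{clean_title} {description}".lower(),
    }


def load(docs: Iterable[dict]) -> Dict[str, Set[int]]:
    """Load: build a toy inverted index mapping each word to record ids."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc in docs:
        for token in doc["text"].split():
            index[token].add(doc["id"])
    return index


def search(index: Dict[str, Set[int]], term: str) -> List[int]:
    """Look up a single word and return matching record ids in order."""
    return sorted(index.get(term.lower(), set()))


rows = [
    (1, " Letter to the President ", "Handwritten letter"),
    (2, "Map of Ohio", "Large survey map"),
]
index = load(transform(r) for r in extract(rows))
print(search(index, "letter"))  # [1]
print(search(index, "map"))     # [2]
```

The point of the pattern is that search runs against documents shaped for lookup rather than against the original relational rows.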

The agency has a number of scanners in various sizes to fit all the different kinds of records, such as maps, onionskin paper and bound volumes. Wright said that staff is looking into high-speed scanners for appropriate records, as well. Digital copies are stored in the cloud before being added to the catalog, and an internal system was created for input of the metadata. Wright did not share details of how the tech was developed, but said that the agency worked with contractors to create the online catalog.

So far, the online catalog has 230 million digital copies available, which is about 2% of the 13 billion records in the archives. The agency has descriptions available for about 95% of its records, though, so the public can still have some access to records waiting to be digitized.
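The share is easy to check from the two figures in this article. The variable names below are illustrative only.

```python
# Figures as reported in this article.
digitized_copies = 230_000_000
total_records = 13_000_000_000

share = digitized_copies / total_records
print(f"{share:.1%}")  # 1.8%, which rounds to the "about 2%" cited
```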

While there might still be a long way to go, Wright said she feels it’s an immensely important task.

“So many people today assume that everything is already online, and it simply is not,” Wright said. “Our goal is to get as much online as possible and, importantly, to make it as easy to access as possible, so that people who can’t afford the time or money to come to a particular archives facility can access the records online.”

Companies: Fearless
Series: The Tech Behind Month 2023 / The Tech Behind