Over the past three weeks, the new US presidential administration has taken down thousands of government web pages related to public health, environmental justice, and scientific research. The mass takedowns stem from the new administration’s push to remove government information related to diversity and “gender ideology,” as well as from its scrutiny of various government agencies’ practices.
USAID’s website is down. So are related sites like childreninadversity.gov, along with thousands of pages from the Census Bureau, the Centers for Disease Control and Prevention, and the Office of Justice Programs.
“We’ve never seen anything like this,” says David Kaye, professor of law at the University of California, Irvine, and the former UN Special Rapporteur on freedom of opinion and expression. “I don’t think any of us know exactly what is happening. What we can see is government websites coming down, databases of essential public interest. The entirety of the USAID website.”
But as government web pages go dark, a collection of organizations is trying to archive as much data and information as possible before it’s gone for good. The hope is to preserve a record of what has been lost so that scientists and historians can use it in the future.
Data archiving is generally considered nonpartisan, but the administration’s recent actions have spurred some in the preservation community to take a stand.
“I consider the actions of the current administration an assault on the entire scientific enterprise,” says Margaret Hedstrom, professor emerita of information at the University of Michigan.
Various organizations are working to save as much data as possible. One of the largest projects is the End of Term Web Archive, a nonpartisan coalition of organizations that aims to make a copy of all government data at the end of each presidential term. The EoT Archive allows individuals to nominate specific websites or data sets for preservation.
“All we can do is collect what has been published and archive it and make sure it’s publicly accessible for the future,” says James Jacobs, US government information librarian at Stanford University, who is one of the people running the EoT Archive.
Other organizations are taking a specific angle on data collection. For example, the Open Environmental Data Project (OEDP) is trying to capture data related to climate science and environmental justice. “We’re trying to track what’s getting taken down,” says Katie Hoeberling, director of policy initiatives at OEDP. “I can’t say with certainty exactly how much of what used to be up is still up, but we’re seeing, especially in the last couple weeks, an accelerating rate of data getting taken down.”
In addition to tracking what’s happening, OEDP is actively backing up relevant data. It began this work in November, to capture the data as it stood at the end of former president Biden’s term, but efforts have ramped up in the past couple of weeks. “Things were a lot calmer prior to the inauguration,” says Cathy Richards, a technologist at OEDP. “It was the second day of the new administration that the first platform went down. At that moment, everyone realized, ‘Oh, no—we have to keep doing this, and we have to keep working our way down this list of data sets.’”
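In practice, “working down a list of data sets” can begin with something as simple as a script that fetches each file and records a checksum, so a copy can later be verified against the original. The sketch below is illustrative only: the URLs are hypothetical placeholders, not OEDP’s actual worklist or tooling.

```python
# Minimal sketch of a bulk backup pass: download each data set on a
# list and store a SHA-256 checksum alongside it so future researchers
# can confirm the snapshot is intact. URLs here are hypothetical.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import requests

DATASETS = [
    "https://example.gov/data/air-quality-2024.csv",   # hypothetical
    "https://example.gov/data/ej-screening-tool.zip",  # hypothetical
]

# Group each day's snapshots into a dated directory.
archive_dir = Path("archive") / datetime.now(timezone.utc).strftime("%Y-%m-%d")
archive_dir.mkdir(parents=True, exist_ok=True)

for url in DATASETS:
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    destination = archive_dir / url.rsplit("/", 1)[-1]
    destination.write_bytes(response.content)
    # Record a checksum next to the file for later integrity checks.
    digest = hashlib.sha256(response.content).hexdigest()
    destination.with_name(destination.name + ".sha256").write_text(digest)
    print(f"saved {destination} ({digest[:12]}...)")
```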
This kind of work is crucial because the US government holds invaluable international and national data relating to climate. “These are irreplaceable repositories of important climate information,” says Lauren Kurtz, executive director of the Climate Science Legal Defense Fund. “So fiddling with them or deleting them means the irreplaceable loss of critical information. It’s really quite tragic.”
Like the OEDP, the Catalyst Cooperative is trying to make sure data related to climate and energy is stored and accessible for researchers. Both are part of the Public Environmental Data Partners, a collective of organizations dedicated to preserving federal environmental data. “We have tried to identify data sets that we know our communities make use of to make decisions about what electricity we should procure or to make decisions about resiliency in our infrastructure planning,” says Christina Gosnell, cofounder and president of Catalyst.
Archiving is a difficult task; there is no single, easy way to store all the US government’s data. “Various federal agencies and departments handle data preservation and archiving in a myriad of ways,” says Gosnell. And no one holds a complete list of every government website in existence.
This hodgepodge means that in addition to using web crawlers, tools that capture snapshots of websites and data, archivists often have to scrape data manually. Sometimes a data set sits behind a login page or a captcha that blocks automated tools. Crawlers can also miss key features of a site: pages often link to other material that a crawl fails to capture, or a crawl may fail outright because of how a site is structured. Having a person in the loop to double-check the crawler’s output, or to capture data by hand, is often the only way to ensure the information is properly collected.
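To make that trade-off concrete, here is a bare-bones sketch of what a crawler-based snapshot does and where it falls short: it fetches one page, saves the HTML exactly as served, and collects same-site links for a later pass. Anything behind a login or captcha, or rendered by JavaScript, simply never appears in the output. The starting URL is a hypothetical placeholder; this is not the pipeline any of the groups above actually run.

```python
# Fetch a page, save it as served, and gather same-site links for the
# next crawl pass. Content requiring a login, a captcha, or JavaScript
# rendering will be missing, which is why humans still check the output.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.gov/reports/"  # hypothetical

response = requests.get(START_URL, timeout=30)
response.raise_for_status()

# Save the snapshot exactly as served: no rendering, no retries.
with open("snapshot.html", "w", encoding="utf-8") as f:
    f.write(response.text)

# Collect links pointing at the same host; a crawler would queue these.
soup = BeautifulSoup(response.text, "html.parser")
host = urlparse(START_URL).netloc
links = {
    urljoin(START_URL, a["href"])
    for a in soup.find_all("a", href=True)
    if urlparse(urljoin(START_URL, a["href"])).netloc == host
}
for link in sorted(links):
    print(link)
```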
And there are questions about whether scraping the data will really be enough. Restoring websites and complex data sets is often not a simple process. “It becomes extraordinarily difficult and costly to attempt to rescue and salvage the data,” says Hedstrom. “It is like draining a body of blood and expecting the body to continue to function. The repairs and attempts to recover are sometimes insurmountable where we need continuous readings of data.”
“All of this data archiving work is a temporary Band-Aid,” says Gosnell. “If data sets are removed and are no longer updated, our archived data will become increasingly stale and thus ineffective at informing decisions over time.”
These effects may be long-lasting. “You won’t see the impact of that until 10 years from now, when you notice that there’s a gap of four years of data,” says Jacobs.
Many digital archivists stress the importance of understanding our past. “We can all think about our own family photos that have been passed down to us and how important those different documents are,” says Trevor Owens, chief research officer at the American Institute of Physics and former director of digital services at the Library of Congress. “That chain of connection to the past is really important.”
“It’s our library; it’s our history,” says Richards. “This data is funded by taxpayers, so we definitely don’t want all that knowledge to be lost when we can keep it, store it, potentially do something with it and continue to learn from it.”