tags : Storage, Scraping, peer-to-peer

Archiving formats

Web Archiving Workflows and Best Practices

Institutional Archiving

Large institutions typically employ a workflow that involves:

  1. Selection - Identifying content to preserve
  2. Acquisition - Using crawlers like Heritrix to collect content
  3. Storage - Preserving WARC files with redundancy
  4. Access - Providing replay through Wayback Machine-like interfaces
  5. Preservation - Ensuring long-term accessibility through format migration
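The WARC files produced in the storage step are conceptually simple: a sequence of records, each consisting of text headers, a blank line, and a payload. A minimal sketch of one record, heavily simplified (real tools like Heritrix, `wget --warc-file`, or the warcio library also handle gzip compression, digests, and record IDs; the fixed date here is a placeholder):

```python
def make_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Build one uncompressed WARC/1.0 response record (simplified sketch)."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "WARC-Date: 2024-01-01T00:00:00Z\r\n"  # placeholder capture time
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # A record is headers, a blank line, the payload, then two CRLFs
    # separating it from the next record in the file.
    return headers + payload + b"\r\n\r\n"

record = make_warc_record("https://example.com/", b"<html>hello</html>")
```

A whole WARC file is just many such records concatenated, which is what makes the format so durable and easy to stream.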

Personal Archiving

Individual users have different needs:

  1. On-the-fly capture - Browser extensions like ArchiveWeb.page or SingleFile
  2. Local storage - Managing personal collections with tools like ReplayWeb.page
  3. Format considerations - Balancing completeness vs. convenience
  4. Sharing capabilities - Using portable formats like WACZ

Quality Assurance in Web Archiving

Critical considerations for effective archiving:

  • Completeness - Capturing all required resources
  • Fidelity - How closely the archive resembles the original
  • Replayability - Whether interactive elements function
  • Longevity - Format sustainability and migration paths
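Completeness, at least, can be spot-checked mechanically: extract every resource URL a captured page references and compare against what actually made it into the archive. A rough sketch using only the standard library (the capture set passed in is hypothetical; a real checker would also resolve relative URLs and parse CSS):

```python
from html.parser import HTMLParser

class ResourceExtractor(HTMLParser):
    """Collect URLs from src/href attributes encountered in a page."""
    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.resources.add(value)

def missing_resources(html: str, captured: set) -> set:
    """Return resources the page references that are absent from the capture."""
    parser = ResourceExtractor()
    parser.feed(html)
    return parser.resources - captured

page = '<img src="logo.png"><a href="about.html">About</a>'
missing = missing_resources(page, {"logo.png"})  # about.html was not captured
```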

Use cases

| Category | Tool | Description |
| --- | --- | --- |
| Website Downloaders | wget, httrack | Standard tools for downloading entire sites (see offlinesavesite alias) |
| | Skallwar/suckit | Alternative to httrack |
| | Y2Z/monolith | Downloads assets as data URLs into a single HTML file |
| | WebMemex/freeze-dry | Library (not a tool) for freezing web pages; has a useful "how it works" page |
| | gildas-lormeau/SingleFile | Decent browser extension/CLI for saving web pages |
| Offline Browsing | dosyago/DownloadNet | Site downloading focused on offline browsing |
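Single-file tools like Y2Z/monolith work by replacing each external asset reference with a `data:` URL so everything lives in one HTML file. The core transformation is roughly the following sketch (greatly simplified; monolith itself is written in Rust and also handles fetching, CSS, and scripts, and the asset bytes here are a stand-in):

```python
import base64

def to_data_url(content: bytes, mime: str) -> str:
    """Encode raw bytes as a data: URL, as single-file tools do for assets."""
    encoded = base64.b64encode(content).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def inline_asset(html: str, src: str, content: bytes, mime: str) -> str:
    """Replace one src reference with its inlined data: URL equivalent."""
    return html.replace(f'src="{src}"', f'src="{to_data_url(content, mime)}"')

page = '<img src="logo.png">'
inlined = inline_asset(page, "logo.png", b"\x89PNG-placeholder", "image/png")
```

The resulting HTML no longer depends on any external file, which is exactly the trade-off the "completeness vs. convenience" point below describes: the file is larger, but fully self-contained.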

Tools

Enterprise/Traditional Tools

| Tool | Description | Link |
| --- | --- | --- |
| Archivematica | Open-source digital preservation system | https://github.com/artefactual/archivematica |
| Spotlight | Enables librarians, curators, and others to create attractive, feature-rich websites | https://github.com/projectblacklight/spotlight |

Wayback Machine Tools

| Tool | Description | Link |
| --- | --- | --- |
| wayback-machine-scraper | Tool for scraping the Internet Archive's Wayback Machine | https://github.com/sangaline/wayback-machine-scraper |
| muna | CLI tool for Internet Archive and Wayback Machine interaction | https://github.com/uriel1998/muna |
| waybackurls | Fetch all the URLs that the Wayback Machine knows about for a domain | https://github.com/tomnomnom/waybackurls |
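Tools like waybackurls ultimately query the Wayback Machine's public CDX API; the endpoint and parameters below are the documented ones, though the exact options a given tool uses may differ. A sketch that just builds the query URL (no network call; fetch it with any HTTP client to get one JSON row per unique captured URL):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain: str) -> str:
    """Build a CDX API URL listing captures the Wayback Machine
    holds for a domain and its subdomains."""
    params = {
        "url": f"{domain}/*",        # wildcard to cover all paths
        "matchType": "domain",       # include subdomains too
        "output": "json",
        "fl": "original,timestamp",  # only the fields we need
        "collapse": "urlkey",        # one row per unique URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

query = cdx_query_url("example.com")
```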

Miscellaneous Legacy Tools

| Tool | Description | Link |
| --- | --- | --- |
| mixtape | Self-hosted archiving tool | https://github.com/danderson/mixtape |

Other Archiving Solutions

| Tool | Description | Link |
| --- | --- | --- |
| rrweb | Record and replay debugger for the web | https://news.ycombinator.com/item?id=41030862 |
| ArchiveBox | Self-hosted internet archiving solution | https://news.ycombinator.com/item?id=41860909 |
| Perma.cc | Permanent link service | https://news.ycombinator.com/item?id=42972622 |

YouTube Archiving Tools

| Tool | Description | Link |
| --- | --- | --- |
| Tubearchivist | Your self-hosted YouTube media server | https://www.tubearchivist.com/ |
| YouTube archiving script | Script for archiving YouTube content | https://pastebin.com/s6kSzXrL |
| RSS feed for YouTube channels | Guide on creating RSS feeds for YouTube channels | https://danielmiessler.com/p/rss-feed-youtube-channel/ |
| ytdl-pvr | YouTube-DL based PVR | https://github.com/jchv/ytdl-pvr |
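The RSS trick from the guide above relies on YouTube's long-stable (though not officially documented) per-channel feed endpoint; a feed reader subscribed to this URL sees new uploads without touching the API:

```python
def channel_feed_url(channel_id: str) -> str:
    """Build the Atom feed URL for a YouTube channel.
    channel_id is the 'UC...' channel ID, not the @handle."""
    return f"https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

feed = channel_feed_url("UCxxxxxxxxxxxxxxxxxxxxxx")  # placeholder channel ID
```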

Digital Archiving Organizations and Tools

Major Digital Archives

| Organization | Founded | Description |
| --- | --- | --- |
| Internet Archive | 1996 | American digital library with the stated mission of "universal access to all knowledge." |
| Archive Team | 2009 | A loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. |
| Sci-Hub | - | Research paper repository providing free access to paywalled academic papers. |
| Z-Library | - | Book repository, initially a clone of LibGen with more accessible UX and monetization. |
| Anna's Archive | - | Open-source data library related to Z-Library. |

Regional Archives

| Organization | Website | Description |
| --- | --- | --- |
| Digital India Archiver | github.com/DigitalIndiaArchiver | Project focused on archiving digital content related to India. |

Smaller Archives & Tools

| Name | Website | Description |
| --- | --- | --- |
| Perma.cc | perma.cc | Service that creates permanent archived versions of web pages. |
| Megalodon | - | Web archiving tool. |
| Bitsavers | bitsavers.org | Archive focusing on historical computer software and documentation. |
| Bellingcat Auto-Archiver | github.com/bellingcat/auto-archiver | Automated archiving tool from Bellingcat (investigative journalism organization). |
Wikipedia Tools

| Component/Tool | Description | Link |
| --- | --- | --- |
| WikiText | The markup language that MediaWiki uses. | - |
| MediaWiki | Includes a parser that turns WikiText into HTML to create displayed pages. | - |
| mwoffliner | Tool for creating offline Wikipedia versions. | github.com/openzim/mwoffliner |
| Wikipedia QL | Query tool for Wikipedia. | github.com/zverok/wikipedia_ql |
| wtf_wikipedia | JavaScript parser for Wikipedia. | github.com/spencermountain/wtf_wikipedia |
| PlainTextWikipedia | Tool for converting Wikipedia to plain text. | github.com/daveshap/PlainTextWikipedia |
| Deletionpedia | Archive of deleted Wikipedia articles. | deletionpedia.dbatley.com |

Physical Archival

Other notes

  • Use the Webrecorder tool suite (https://webrecorder.net)! It uses a new packaging format for web archives called WACZ (Web Archive Zipped), which produces a single file that you can store anywhere and play back offline. It automatically indexes the different file formats contained on the website, such as PDFs or media files, and is versioned. You can record WACZ using the ArchiveWeb.page Chrome extension (https://archiveweb.page/), or use the Internet Archive's Save Page Now button to preserve a website and have the WACZ file sent to you via email: https://inkdroid.org/2023/04/03/spn-wacz/. There are also more sophisticated tools, like the in-browser crawler ArchiveWeb.page Express (https://express.archiveweb.page) and the command-line crawler Browsertrix (https://webrecorder.net/tools#browsertrix-crawler), but manually recording with the Chrome extension is definitely the easiest and most reliable way. To play back the WACZ file, just open it in the offline web app ReplayWeb.page (https://replayweb.page).
  • Slightly biased (I work with Webrecorder, haha), but yeah, our tools are really good at preserving complete webpages. u/CollapsedWave, give the ArchiveWeb.page browser extension a shot! If you're looking to save single pages as you come across them, it's a good tool! Every page you capture gets its text extracted for text search. I'll also add (because they mentioned file format standardization and longevity) that WACZ files are actually ZIP files which contain some indexing metadata that enables fast playback within a single portable file. The actual archived data is stored as a WARC within the WACZ, and it doesn't get much more standardized than that! Regardless of what you end up using, I'd really recommend capturing as WARC or WACZ for cross-compatibility with other software.
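Since a WACZ is just a ZIP file, you can peek inside one with nothing but the standard library. A sketch (the member paths follow the WACZ layout, with WARC data under archive/ and indexes under indexes/):

```python
import zipfile

def list_wacz_contents(path: str) -> list:
    """List the members of a WACZ file; because WACZ is plain ZIP,
    zipfile can read it directly."""
    with zipfile.ZipFile(path) as z:
        return z.namelist()

def extract_warc_names(path: str) -> list:
    """Return just the WARC members, stored under archive/ in a WACZ."""
    return [
        name for name in list_wacz_contents(path)
        if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
    ]
```

This portability is the point made above: the outer container is ZIP and the payload is WARC, both bog-standard formats that plenty of other software can read.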