tags: Storage, Scraping, peer-to-peer

38% of webpages that existed in 2013 are no longer accessible a decade later | Hacker News

Orgs and Projects

Big ones

IA (Internet Archive)

Archive Team

  • https://wiki.archiveteam.org/
  • Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage.
  • archive.is (a.k.a. archive.today) launched in 2012.

Z-library

sci-hub

bellingcat

Not really an archiving org, but it has related projects

Others

Indian

Wikipedia

  • Wikipedia needs a separate section of its own
  • Wikitext is the markup language that MediaWiki uses.
  • MediaWiki includes a parser that converts wikitext into HTML; this parser produces the HTML pages displayed in your browser (a rough API sketch follows this list).
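
As a rough sketch of that parser in action (assuming curl and jq are available; the page title and the English Wikipedia endpoint are just examples), the MediaWiki action API can be asked to render a page's wikitext to HTML:

    # Ask MediaWiki's action API to parse a page's wikitext into HTML.
    curl -s 'https://en.wikipedia.org/w/api.php?action=parse&page=Web_archiving&prop=text&format=json' \
      | jq -r '.parse.text["*"]' > web_archiving.html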

Archiving formats

Usecases

Website downloaders

  • We can use wget and httrack to download entire sites. See my offlinesavesite alias for more info (a rough sketch follows this list)
  • Skallwar/suckit: Alternative to httrack?
  • Y2Z/monolith: Downloads assets as data URLs (unlike wget -mpk) and bundles everything into a single HTML file
  • WebMemex/freeze-dry: Not a tool, but a library. Seems outdated, but still useful. Has a nice “how it works” page.
  • gildas-lormeau/SingleFile: Decent extension/CLI
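
A minimal sketch of what such a wget alias could look like (this is an assumption, not the actual offlinesavesite alias; -m mirror, -p page requisites, -k convert links, -E add .html extensions):

    # Hypothetical offlinesavesite-style alias; adjust flags to taste.
    alias offlinesavesite='wget -mpkE --no-parent --wait=1'
    # usage: offlinesavesite https://example.com/docs/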

Artifact extraction

  • simonw/shot-scraper
    • Can be used to take screenshots (full or partial; it can even modify the page with JS before taking the shot)
    • The screenshots are pixel-perfect and you can specify the size, so if nothing on the page changed, git diff will have no change to show either. Good for us.
    • It does not do change detection itself, but it can be used for that purpose. (See Image Compression for related tools)
    • The original use case was keeping the screenshots included in a documentation site up to date.
    • Can also be used to extract text data (see the sketch after this list)
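
A few illustrative shot-scraper invocations (URLs and selectors are placeholders; double-check the flags against the current docs):

    # Fixed-width screenshot; an unchanged page should produce an identical
    # PNG, so git diff stays quiet until the page actually changes.
    shot-scraper https://example.com -o example.png --width 1280

    # Run some JavaScript before the shot, e.g. to remove a cookie banner.
    shot-scraper https://example.com -o clean.png \
      --javascript "document.querySelector('.cookie-banner')?.remove()"

    # Text extraction: evaluate JavaScript on the page and print the result.
    shot-scraper javascript https://example.com "document.title"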

Offline browsing

  • dosyago/DownloadNet: Also downloads a site, much like the tools above, but is geared toward offline browsing

Tools

Traditional tools/enterprisey stuff

WaybackMachine

Misc old school tools

Others

Youtube