tags : Storage, Scraping, peer-to-peer
38% of webpages that existed in 2013 are no longer accessible a decade later | Hacker News
Orgs and Projects
Big ones
IA (Internet Archive)
- https://archive.org/
- American digital library with the stated mission of “universal access to all knowledge.”
- Operates the Wayback Machine.
- Founded in 1996; the Wayback Machine became publicly accessible in 2001.
- Making IIIF Official at the Internet Archive | Internet Archive Blogs
Archive Team
- https://wiki.archiveteam.org/
- Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage.
- Not to be confused with archive.today (archive.is), a separate snapshot service launched in 2012.
Z-Library / Anna's Archive
- Z-Library was essentially a clone of LibGen with a more accessible UX and added monetization, though I believe it eventually solicited new material that was not fed back into LibGen.
- Basically LibGen has stricter requirements (metadata quality, the fiction/non-fiction split), which makes meeting them difficult.
- https://annas-blog.org/help-seed-zlibrary-on-ipfs.html
- https://annas-blog.org/blog-3x-new-books.html
- The code for Anna’s Archive | Hacker News
- Anna’s Archive: Open-source data library | Hacker News
- https://annas-blog.org/putting-5,998,794-books-on-ipfs.html
- Anna’s Archive - Wikipedia
sci-hub
- Sci-Hub: knowledge as a human right | Hacker News
- How to circumvent Sci-Hub ISP block | Hacker News
- Archivists Are Trying To Save Sci-Hub
- Sci-Hub - Wikipedia
bellingcat
Not really an archiving org but has related projects
Others
- https://perma.cc/
- Megalodon
- https://bitsavers.org/
Indian
Wikipedia
- Wikipedia needs to have a separate section of its own
- Wikitext is the name of the markup language that MediaWiki uses.
- MediaWiki includes a parser that converts wikitext into HTML; that parser produces the pages displayed in your browser (see the sketch below).
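A minimal sketch of poking at that parser through the MediaWiki action API, which returns both the raw wikitext and the HTML the parser produces. The endpoint and parameters below are the standard ones exposed by en.wikipedia.org; the page title is just an example.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # every MediaWiki wiki exposes this API
PAGE = "Web archiving"                      # example page title

session = requests.Session()
session.headers["User-Agent"] = "archiving-notes-example/0.1"  # be polite to the API


def parse(prop):
    # action=parse returns the stored wikitext or runs MediaWiki's parser
    r = session.get(API, params={
        "action": "parse",
        "page": PAGE,
        "prop": prop,        # "wikitext" = raw markup, "text" = rendered HTML
        "format": "json",
        "formatversion": 2,
    })
    r.raise_for_status()
    return r.json()["parse"][prop]


wikitext = parse("wikitext")
html = parse("text")
print(wikitext[:200])
print(html[:200])
```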
Related tools
- https://github.com/openzim/mwoffliner
- https://github.com/zverok/wikipedia_ql
- https://github.com/spencermountain/wtf_wikipedia
- https://github.com/daveshap/PlainTextWikipedia
- Deletionpedia
Archiving formats
- WARC (Web ARChive; see the warcio sketch after this list)
- https://en.wikipedia.org/wiki/HAR_(file_format)
- WORM (write once, read many)?
- https://en.wikipedia.org/wiki/MHTML
- https://wiki.openzim.org/wiki/OpenZIM
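On the WARC side, a minimal sketch of writing one fetched page into a gzipped WARC file, assuming the warcio library; the URL and output filename are placeholders.

```python
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

url = "https://example.com/"  # placeholder page to capture

# identity encoding keeps the payload uncompressed so it matches the recorded headers
resp = requests.get(url, headers={"Accept-Encoding": "identity"}, stream=True)

with open("example.warc.gz", "wb") as fh:
    writer = WARCWriter(fh, gzip=True)  # .warc.gz is the usual on-disk form

    # Rebuild the HTTP response headers for the WARC 'response' record
    status_line = f"{resp.status_code} {resp.reason}"
    http_headers = StatusAndHeaders(status_line, resp.raw.headers.items(), protocol="HTTP/1.1")

    record = writer.create_warc_record(
        url,
        "response",
        payload=resp.raw,  # stream the raw body straight into the record
        http_headers=http_headers,
    )
    writer.write_record(record)
```

The resulting file can then be inspected or replayed with the usual WARC tooling (for example pywb, or warcio's own `warcio index`).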
Use cases
Website downloaders
- We can use `wget` and `httrack` for downloading entire sites. See my `offlinesavesite` alias for more info (and the sketch after this list).
- Skallwar/suckit: Alternative to httrack?
- Y2Z/monolith: Downloads assets as data URLs (unlike `wget -mpk`) into a single HTML file.
- WebMemex/freeze-dry: Not a tool, but a library. Seems outdated, but still useful. Has a nice “how it works” page.
- gildas-lormeau/SingleFile: Decent extension/CLI.
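For reference, a sketch of the wget invocation those flags refer to, driven from Python; the target URL and output directory are placeholders, and wget itself must be installed.

```python
import subprocess

url = "https://example.com/"  # placeholder: site to mirror
dest = "mirror"               # placeholder: output directory

# wget -mpk expands to the long options below:
#   --mirror          recursive download with timestamping, suited to mirroring
#   --page-requisites also fetch the images/CSS/JS needed to render each page
#   --convert-links   rewrite links in the saved files so they work offline
subprocess.run(
    [
        "wget",
        "--mirror",
        "--page-requisites",
        "--convert-links",
        "--directory-prefix", dest,
        url,
    ],
    check=True,
)
```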
Artifact extraction
- simonw/shot-scraper
- Takes screenshots (full page or partial, and can even modify the page with JS before the shot).
- The screenshots are pixel-perfect and you can specify the size, so if nothing on the page changed, git diff has no change to show either. Good for us.
- It does not do change detection itself, but can be used for that purpose (see Image Compression for related tools, and the sketch below).
- The original use case was keeping the screenshots included in documentation sites up to date.
- Can also be used to extract text data from pages.
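A sketch of that screenshot-plus-git-diff idea, assuming the shot-scraper CLI (and its Playwright browsers) is installed and the screenshot file is already tracked in git; URL, path, and viewport size are placeholders.

```python
import subprocess

url = "https://example.com/"  # placeholder page to watch
out = "shots/example.png"     # screenshot file tracked in git

# Fixed viewport size, so an unchanged page should produce byte-identical output
subprocess.run(
    ["shot-scraper", url, "-o", out, "--width", "1280", "--height", "800"],
    check=True,
)

# git diff --quiet exits non-zero when the tracked file differs from the index
changed = subprocess.run(["git", "diff", "--quiet", "--", out]).returncode != 0
print("page changed" if changed else "no visible change")
```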
Offline browsing
- dosyago/DownloadNet: Does similar stuff (downloads a site), but is geared more toward offline browsing.
Tools
Traditional tools/enterprisey stuff
Wayback Machine
- https://github.com/sangaline/wayback-machine-scraper
- https://github.com/uriel1998/muna
- https://github.com/tomnomnom/waybackurls
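Tools like these generally sit on top of the Wayback Machine's CDX API, which can also be queried directly; a minimal sketch, with the domain as a placeholder.

```python
import requests

domain = "example.com"  # placeholder

# The CDX API lists the captures the Wayback Machine holds for a URL pattern
rows = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": f"{domain}/*",  # everything under the domain
        "output": "json",
        "collapse": "urlkey",  # one row per unique URL
        "limit": 20,
    },
).json()

# First row is the header: urlkey, timestamp, original, mimetype, statuscode, digest, length
for row in rows[1:]:
    timestamp, original = row[1], row[2]
    print(f"https://web.archive.org/web/{timestamp}/{original}")
```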