This illustration was inspired by Jenny List.
Context
The internet is fragile.
Pipeline
Initial Acquisition
- Crawl
- Scrape
- Verify
Each stage typically involves an extraction operation: during crawling, this is the extraction of links; during scraping, it is the extraction of content. Network requests are expensive, so extraction is best run against local resources rather than repeated over the network.
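As a rough illustration of running the crawl-stage extraction locally, the sketch below pulls links out of HTML that has already been saved to disk or held in memory, so no further network requests are needed. It uses only the Python standard library; the names `LinkExtractor` and `extract_links` are made up for this example and are not part of any existing tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page was originally fetched from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an already-downloaded HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the function takes a string, the same stored page can be re-processed as often as needed (for example, after the extraction rules change) without fetching it again.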
Updates
Once content has been acquired, keeping it up to date is a challenge. At scale, some form of automation is required. Automation logic is fragile, so monitoring is critical.
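One possible shape for an automated update pass with basic monitoring is sketched below. It assumes a caller-supplied `fetch` function (hypothetical, taking a URL and returning bytes) and a dictionary of previously stored SHA-256 digests; the 20% failure threshold is an arbitrary placeholder, not a recommendation.

```python
import hashlib
import logging

logger = logging.getLogger("archive.updates")


def content_digest(data: bytes) -> str:
    """Fingerprint a page body so changes can be detected cheaply."""
    return hashlib.sha256(data).hexdigest()


def refresh(urls, fetch, known_digests, max_failure_rate=0.2):
    """Re-fetch each URL and report what changed and what failed.

    `fetch` is a hypothetical callable (url -> bytes) that may raise.
    `known_digests` maps url -> last seen digest and is updated in place.
    Returns (changed, failed, healthy).
    """
    changed, failed = [], []
    for url in urls:
        try:
            digest = content_digest(fetch(url))
        except Exception as exc:
            # A failed fetch is recorded rather than aborting the whole run.
            logger.warning("fetch failed for %s: %s", url, exc)
            failed.append(url)
            continue
        if known_digests.get(url) != digest:
            known_digests[url] = digest
            changed.append(url)

    # Monitoring signal: too many failures suggests the automation itself broke.
    healthy = not urls or len(failed) / len(urls) <= max_failure_rate
    if not healthy:
        logger.error("%d of %d fetches failed", len(failed), len(urls))
    return changed, failed, healthy
```

The `healthy` flag is the part meant for monitoring: a run that fails on a large share of its sources is more likely a sign of broken automation than of many sources changing at once.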
For certain sources, RSS is available and may be adapted to meet the needs of the archive.
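Where a source does publish RSS, a minimal sketch of using it to find new items might look like the following. It assumes an RSS 2.0 feed that has already been downloaded as text, and that `pubDate` values carry a timezone; the function names are illustrative only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_rss_items(xml_text):
    """Extract title, link, and publication date from RSS 2.0 <item> elements."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


def items_since(items, cutoff: datetime):
    """Keep only items published after `cutoff` (items without a date are skipped)."""
    return [i for i in items if i["published"] and i["published"] > cutoff]
```

The links returned by `items_since` could then be queued for scraping, which avoids re-crawling the whole source just to discover what is new. Feeds from untrusted sources may warrant a hardened XML parser rather than the standard library one.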