This illustration was inspired by Jenny List.

Context

The internet is fragile.

Pipeline

Initial Acquisition

  1. Crawl
  2. Scrape
  3. Verify

Each stage typically involves an extraction operation: during crawling, links are extracted, and during scraping, content is extracted. Network requests are expensive, so extraction operations are better run on local resources, against content that has already been fetched.
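As a minimal sketch of the crawl-stage extraction, the snippet below pulls links out of HTML that is already held locally, so no network request is made during extraction. The names `LinkExtractor` and `extract_links` are hypothetical and chosen for illustration; only the Python standard library is assumed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in an already-downloaded HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In practice the `html` argument would be read from wherever fetched pages are stored, which keeps the expensive network step separate from the cheap, repeatable extraction step.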

Updates

Once content has been acquired, keeping it up to date is a challenge. At scale, some form of automation is required. Automation logic is fragile, so monitoring is critical.
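One possible shape for an automated update check with a built-in monitoring signal is sketched below. It compares a SHA-256 digest of freshly fetched content against a stored digest and reports failures as an explicit status instead of raising. The function name `check_source`, the injected `fetch` callable, and the three status strings are assumptions made for illustration, not part of any described system.

```python
import hashlib
from typing import Callable, Optional, Tuple


def check_source(
    fetch: Callable[[], bytes], stored_digest: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (status, digest), where status is 'changed', 'unchanged' or 'error'.

    `fetch` is a hypothetical callable that returns the current bytes of a source.
    """
    try:
        body = fetch()
    except Exception:
        # Surface the failure as a status so a monitor can count and alert on it.
        return ("error", stored_digest)

    digest = hashlib.sha256(body).hexdigest()
    if digest == stored_digest:
        return ("unchanged", digest)
    return ("changed", digest)
```

Recording the status of every run, and not only the changes, is what makes the automation observable: a source that returns "error" repeatedly, or never returns "changed", is a prompt to inspect the logic.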

For certain sources, RSS is available and may be adapted to meet the needs of the archive.
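As a rough sketch of adapting RSS for updates, the snippet below reads an RSS 2.0 document and returns the item links that have not been seen before. It assumes plain, un-namespaced `<item>` and `<link>` elements (Atom feeds are structured differently), and the name `new_item_links` and the `seen` set are hypothetical.

```python
import xml.etree.ElementTree as ET


def new_item_links(rss_xml, seen):
    """Return links from RSS <item> elements that are not already in `seen`."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link is None:
            continue
        link = link.strip()
        if link and link not in seen:
            links.append(link)
    return links
```

The returned links can then be fed back into the scrape stage, so the feed acts as a cheap change signal instead of a full re-crawl.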