I wrote several scrapers before building one driven by sitemaps. I've been tending the sitemap version for several years now and treat it as a single long-running application on which I occasionally perform surgery. Here I explain how it works so I don't have to read so much code every time I go in.
See Making of How Scrape Works for authoring tools used to make this site.
wiki-plugin-graph wiki-plugin-html
We're experimenting with full text search of the federation. For the moment this is both easier to code and easier to host than searching the web. search
This is a rewrite of the first Ruby scraper. It saves text files useful for searching instead of whole sites in export format, and it incrementally refetches pages that have changed, based on dates in the sitemaps. github
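The incremental refetch can be pictured with a small sketch. This is not the scraper's actual code. It assumes each sitemap is a JSON array of entries carrying a `slug` and a `date` in milliseconds, and that the dates seen on the previous run are kept in a hash. The helper name `slugs_to_refetch` and the sample data are hypothetical.

```ruby
require 'json'

# Hypothetical helper: given a site's sitemap (as JSON text) and the dates
# recorded on the previous run, return the slugs that need refetching.
# Assumes entries look like {"slug": "...", "date": <milliseconds>}.
def slugs_to_refetch(sitemap_json, last_seen)
  JSON.parse(sitemap_json).select do |entry|
    date = entry['date']
    previous = last_seen[entry['slug']]
    # Refetch pages we have never seen, or whose sitemap date has advanced.
    previous.nil? || (date && date > previous)
  end.map { |entry| entry['slug'] }
end

# Made-up sample data for illustration only.
sitemap = JSON.generate([
  { 'slug' => 'welcome-visitors', 'date' => 1_500_000_000_000 },
  { 'slug' => 'how-scrape-works', 'date' => 1_600_000_000_000 },
  { 'slug' => 'new-page',         'date' => 1_600_000_000_500 }
])
last_seen = {
  'welcome-visitors' => 1_500_000_000_000,
  'how-scrape-works' => 1_550_000_000_000
}

changed = slugs_to_refetch(sitemap, last_seen)
puts changed.inspect
```

Only the unchanged page is skipped here. A real run would fetch each listed page, save its text, and then record the new dates so the next run skips it too.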
screenshots
We add restrictions to Scrape Pages so that it runs faster and finds more relevant content. We also add a JSON log for downstream visualization. github