
13 Jan 2014

I am very pleased to be helping the Common Crawl Organization

I am setting aside about ten hours a week to volunteer with CommonCrawl.org.

Much of the information in the world is now digitized and on the web. Search engines give people only a tiny view of the web, sort of like shining a low-powered flashlight around in the forest at night. The Common Crawl provides the data from billions of web pages as compressed web archive files in Amazon S3 storage, and thus allows individuals and organizations to inexpensively access much of the web for whatever information they need. It is like turning the lights on :-)

The crawl is now in a different file format. My first project is working on programming examples and how-to material for using this new format.