Originally published January 13, 2014
I am setting aside some of my time to volunteer with CommonCrawl.org.
Much of the information in the world is now digitized and on the web. Search engines give people only a tiny view of the web, sort of like shining a low-powered flashlight around in the forest at night. The Common Crawl provides the data from billions of web pages as compressed web archive files in Amazon S3 storage, and thus allows individuals and organizations to inexpensively access much of the web for whatever information they need - like turning the lights on :-)
The crawl data is now stored in a different file format than before. My first project is writing programming examples and how-to material for using this new format.