Note: On July 7, 2014, I moved my blog back to Blogspot: http://blog.markwatson.com

13 Jan 2014

I am very pleased to be helping the Common Crawl Organization

I am setting aside about ten hours a week to volunteer with CommonCrawl.org.

Much of the information in the world is now digitized and on the web. Search engines give people only a tiny view of the web, sort of like shining a low-powered flashlight around in the forest at night. The Common Crawl provides the data from billions of web pages as compressed web archive files in Amazon S3 storage, and thus allows individuals and organizations to inexpensively access much of the web for whatever information they need - like turning the lights on :-)

The crawl is now in a different file format. My first project is working on programming examples and how-to material for using this new format.


Do you want to comment on this blog article? Then please email your comment. I will add your comment with your name (I will not show your email address when publishing your comment).