I hosted a meetup.com meeting today to talk about Ocean Protocol and other data sources for machine learning, and to lead a group discussion of startup business ideas involving curating and selling data. The following is from a handout I created from material on the Ocean Protocol web site and other sources:

Data Trumps Software

Machine learning libraries like TensorFlow, Keras, and PyTorch, and people trained to use them, have become a commodity. What is not yet a commodity is high-quality, application-specific data.

Effective machine learning requires quality data

  • Ocean Protocol https://oceanprotocol.com - is an ecosystem based on blockchain for sharing data. It serves both data producers who want to monetize their data assets and data consumers who need specific data at an affordable price. This ecosystem is still under development, but portions of the infrastructure (which will all be open source) are already available. If you have Docker installed you can quickly run their data marketplace demonstration system https://docs.oceanprotocol.com/setup/quickstart/.

  • Common Crawl http://commoncrawl.org - is a free source of web crawl data that was previously only available to large search engine companies. There are many open source libraries to access and process crawl data. You can most easily get started by downloading a few WARC data segment files to your laptop. My open source Java and Clojure libraries for processing WARC files are at https://github.com/commoncrawl/example-warc-java

  • Amazon Public Dataset Program https://aws.amazon.com/opendata/public-datasets/ - is a free service for hosting public datasets. If you have data to share, AWS evaluates applications to contribute data quarterly. To find useful datasets, search using the form at https://registry.opendata.aws and then use the S3 bucket URIs (or ARNs) to access the data. Most data sources have documentation pages and example client libraries.
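
To give a feel for what processing crawl data involves, here is a minimal Python sketch of reading records from a WARC file. It is not one of the libraries mentioned above: the function name iter_warc_records and the helper make_record are my own illustrative names, and the parser assumes well-formed, uncompressed WARC/1.0 records with a Content-Length header. It uses only the standard library and runs against a small in-memory sample.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Illustrative sketch: assumes each record starts with a "WARC/..." version
    line and carries a Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", "0"))
        payload = stream.read(length)
        yield headers, payload


def make_record(record_type: str, uri: str, body: bytes) -> bytes:
    """Build a tiny synthetic WARC record for the demonstration below."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("response", "http://example.com/", b"hello")
    + make_record("response", "http://example.org/", b"crawl data")
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

Downloaded segment files are normally gzip compressed, so in practice you would pass the function a file object opened with gzip.open(path, "rb") rather than a BytesIO buffer.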

Overview of Ocean Protocol

Ocean Protocol is a decentralized data exchange protocol that lets people share and monetize data while providing control, auditing, transparency and compliance to both data providers and data consumers. The initial Ocean Protocol digital token sale ended in March 2018 and raised $22 million. Ocean Protocol tokens will be available in exchange for Ethereum Ether and can be used by data consumers to purchase access to data. Data providers will be able to trade tokens back to Ethereum Ether.

Terminology

  • Publisher: a service that provides access to data from data producers. Data producers will often also act as publishers of their own data.
  • Consumer: any person or organization who needs access to data. Access is via client libraries or web interfaces.
  • Marketplace: a service that lists assets and facilitates access to free datasets and datasets available for purchase.
  • Verifier: a software service that checks and validates steps in transactions for selling and buying data. A verifier is paid for this service.
  • Service Execution Agreement (SEA): a smart contract used by providers, consumers, and verifiers.
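
The way these roles fit together can be sketched as a toy model in Python. This is not Ocean Protocol code and does not use the Squid libraries: the class names ServiceAgreement and Ledger, the settle function, the party names, and the token amounts are all hypothetical. In particular, I am assuming for illustration that the consumer pays the verifier's fee, which the terminology above does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServiceAgreement:
    """Toy stand-in for a Service Execution Agreement (not the real contract)."""
    publisher: str
    consumer: str
    verifier: str
    price: int          # tokens owed to the publisher (hypothetical unit)
    verifier_fee: int   # tokens owed to the verifier (assumed paid by consumer)
    verified: bool = False
    settled: bool = False


@dataclass
class Ledger:
    """In-memory token balances; a real system would keep these on-chain."""
    balances: Dict[str, int] = field(default_factory=dict)

    def transfer(self, sender: str, receiver: str, amount: int) -> None:
        if self.balances.get(sender, 0) < amount:
            raise ValueError(f"{sender} has insufficient tokens")
        self.balances[sender] = self.balances.get(sender, 0) - amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount


def settle(agreement: ServiceAgreement, ledger: Ledger) -> None:
    """Pay the publisher and the verifier once the verifier has approved."""
    if not agreement.verified:
        raise RuntimeError("verifier has not validated the transaction")
    if agreement.settled:
        raise RuntimeError("agreement already settled")
    total = agreement.price + agreement.verifier_fee
    if ledger.balances.get(agreement.consumer, 0) < total:
        raise ValueError("consumer cannot cover price plus verifier fee")
    ledger.transfer(agreement.consumer, agreement.publisher, agreement.price)
    ledger.transfer(agreement.consumer, agreement.verifier, agreement.verifier_fee)
    agreement.settled = True


ledger = Ledger({"alice_consumer": 10})
sea = ServiceAgreement(
    publisher="bob_publisher",
    consumer="alice_consumer",
    verifier="vera_verifier",
    price=5,
    verifier_fee=1,
)
sea.verified = True  # in this sketch the verifier simply flips a flag
settle(sea, ledger)
print(ledger.balances)
```

The point of the sketch is only that payment is gated on the verifier's approval and that all three parties are named in the same agreement.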

Software Components

  • Aquarius: a service for storing and managing metadata for data assets; it uses the off-chain database OceanDB.
  • Brizo: a service used by publishers for managing interactions with marketplaces and data consumers.
  • Keeper: a service that runs a blockchain client and uses Ocean Protocol to process smart contracts.
  • Pleuston: an example/demo marketplace that you can run locally with Docker on your laptop.
  • Squid Libraries: client libraries to locate and access data (currently Python and JavaScript are supported).

Also of interest: SingularityNET

https://singularitynet.io is a decentralized service that supports creating, sharing, and monetizing AI services and hopes to be the world’s decentralized AI network. SingularityNET was started by my friend Ben Goertzel to create a marketplace for AI service APIs.