Data Sets and APIs

Though you are welcome to use any dataset you’d like, we’ve made it a bit easier to use a few.

Twitter’s Gardenhose

We’ve ingested the past three days’ worth of a random sample of all public tweets, approximately 4M tweets in total. The stream is still active, so it will continue to pull in tweets as they happen during the hackathon.

To use this dataset, take a look at the example topology here.

The data is raw JSON as described here. You can use the classes provided by twitter4j (see example) to handle the data as Java classes or parse the JSON yourself if you need to.
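If you’d rather parse the raw JSON yourself than use twitter4j, here is a minimal sketch in Python using only the standard library. The sample record is trimmed to a few well-known tweet fields (`id`, `text`, `user.screen_name`); real gardenhose records carry many more.

```python
import json
import re

# A trimmed raw tweet; real records from the stream have many more fields.
raw = '{"id": 123, "text": "Hello #hackathon!", "user": {"screen_name": "alice"}}'

tweet = json.loads(raw)
print(tweet["user"]["screen_name"], tweet["text"])

# If you skip twitter4j, hashtags can be pulled straight out of the text:
hashtags = re.findall(r"#(\w+)", tweet["text"])
```

Whether you go through twitter4j or parse by hand mostly depends on how many of the tweet’s fields you actually need.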

Semantria

Add text/sentiment analysis to your project using Semantria’s REST API. Simply register at http://semantria.com/register and let them know if you will be processing more than 10,000 documents.

  • All hackers will have unlimited transactions during the Hackathon (1 transaction processes 1 document), and will receive 100k free transactions after the Hackathon.
  • For SDKs & additional information, check out the developer page at http://semantria.com/developer
  • For support, please use the live chat on our website; our CTO/head developer will be answering all technical questions.
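The real request format and authentication scheme are defined by Semantria’s SDKs (see the developer page above). As a rough, hypothetical sketch of what submitting one document (one transaction) to a sentiment endpoint looks like, with the URL, auth header, and payload shape all placeholders rather than Semantria’s actual API:

```python
import json
import urllib.request

# Placeholder endpoint and key -- NOT Semantria's real API; use their SDKs.
API_URL = "https://example.invalid/sentiment"
API_KEY = "YOUR_KEY"

def build_request(doc_id, text):
    """Package one document (1 transaction) as a JSON POST request."""
    payload = json.dumps([{"id": doc_id, "text": text}]).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_KEY},
        method="POST",
    )

req = build_request("doc-1", "This hackathon is great!")
# urllib.request.urlopen(req) would then send it and return the response.
```

In practice the official SDKs handle batching and signing for you, so this shape only matters if you want to hit the REST API directly.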

Static datasets

We preloaded two static datasets from our Map/Reduce hackathons:

  • Wikipedia: a Wikipedia dump (XML format). See here for an example.
  • NYSE: daily highs of companies traded on the NYSE between 1970 and 2010. See here for an example usage.
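As a sketch of the kind of aggregation the NYSE set supports, here is a standard-library Python snippet computing each company’s all-time daily high. The CSV layout (symbol, date, high) and the sample rows are assumptions for illustration; check the example usage above for the dataset’s actual schema.

```python
import csv
import io

# Hypothetical row layout and sample data; the real dataset's schema may differ.
sample = """symbol,date,high
IBM,1999-03-01,182.5
IBM,1999-03-02,184.0
GE,1999-03-01,41.2
"""

# Keep the largest daily high seen per symbol.
highs = {}
for row in csv.DictReader(io.StringIO(sample)):
    h = float(row["high"])
    if h > highs.get(row["symbol"], float("-inf")):
        highs[row["symbol"]] = h
```

The same per-key maximum maps naturally onto a Map/Reduce job, which is what these preloaded datasets were originally used for.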

Other static datasets are available and can be loaded on demand (ask one of the mentors).

Real-time APIs

Transit

Financial

Miscellaneous