blog.infochimps.org – Organizing Huge Information Sources

Something about Everything about Something

Posts Tagged ‘big

Twittersong

Took the 50M twitter messages we saw between mid-November and mid-January and used Wordle to make a word cloud:  http://bit.ly/tweetcloud Fun!

(If you’re not familiar with a word cloud: the larger a word, the more often it was used. The colors & positions don’t mean anything, they’re just for fun. We stripped out the little words (a, the, with, …), leaving everything that appeared more than 10,000 times in the 50 million+ tweets we examined.)
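The counting and filtering described above can be sketched in a few lines of Python. This is only an illustration, not the code that produced the cloud: the function name `common_words`, the tokenizer regex, and the three-word stopword set are assumptions (the post names only "a, the, with" and elides the rest).

```python
import re
from collections import Counter

# Hypothetical stopword set: only the three words named in the post.
STOPWORDS = {"a", "the", "with"}

def common_words(tweets, min_count, stopwords=STOPWORDS):
    """Return (word, count) pairs for words seen more than min_count times,
    most frequent first, ignoring stopwords."""
    counts = Counter()
    for text in tweets:
        # Lowercase, then split into runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9'’]+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return [(w, c) for w, c in counts.most_common() if c > min_count]
```

On the full set of tweets the call would look like `common_words(tweets, 10_000)`, and the first forty entries of the result would be the forty most-used words.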

Then I looked again at the filtered list and noticed something… just awesome.

Here are the forty most-commonly used words, in their exact order of decreasing frequency:

It’s time, Twitter. Love/Christmas blog:

Home! Thanks, people…

Night post:

Getting happy
watching morning
that’s tonight.
Tomorrow: looking news, trying nice? Check.

2009: Hope.
Week: 2008.

Little video:

snow.

Live free. Life. Awesome days!

Doing:

Feel house ready.
Look cool.
Sleep.
Yeah world!

I like your poem, Twitter.
A lot.

Written by mrflip

22 Jan 2009 at 10:23 pm

Massive Scrape of Twitter’s Friend Graph

with 26 comments

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released; see below for technical info.) This is still a rough, rough draft, but this dataset is so amazingly rich we couldn’t help sharing it. We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev. We’ll also have a much larger dump of tweets off the public datamining feed.
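For readers unfamiliar with the term, the “giant component” is the largest group of users who are all linked to one another through some chain of friend edges. A minimal sketch of finding it from an edge list follows. It assumes edges arrive as `(user_a, user_b)` pairs and treats them as undirected; the function name and input format are illustrative, not the layout of the actual dump.

```python
from collections import defaultdict, deque

def largest_component(edges):
    """Return the set of users in the largest connected component,
    treating each (a, b) edge as undirected."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from start.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best
```

At 58M edges an in-memory dictionary of sets may be too heavy, so a real run would likely want a union-find structure or a graph library instead.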

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.

Written by mrflip

29 Dec 2008 at 8:55 pm

All of Wikipedia’s infoboxes & templates, in individual tables for each kind

with one comment

FINALLY — got the wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs. There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it. It takes about 8 hours start to finish to process this dataset, and while the tables aren’t perfect, they are perfectly usable.

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. Live, versioned metadata editing
  2. Uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together. (For complicated reasons, each dataset creates a new personal version of the field, so you can’t actually walk from one “stock price” field to other datasets with that tag.)

Then I’ll finally turn some intensive attention to the InfiniteMonkeywrench code. We need better tools to wrangle these huge datasets into shape.
