blog.infochimps.org – Organizing Huge Information Sources

Something about Everything about Something

Posts Tagged ‘data’

Twittersong

Took the 50M twitter messages we saw between mid-November and mid-January and used Wordle to make a word cloud:  http://bit.ly/tweetcloud Fun!

(If you’re not familiar with a word cloud: the larger a word, the more often it was used. The colors & positions don’t mean anything, they’re just for fun. We stripped out the little words (a, the, with, …), leaving everything that appeared more than 10,000 times in the 50 million+ tweets we examined.)
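For the curious, the filtering step can be sketched in a few lines of Ruby. This is only an illustration, not the script we ran: the stop-word list, the sample tweets and the method names are made up for the example, and the real run used a threshold of 10,000 rather than 1.

```ruby
# Hypothetical stop-word list; the real one was much longer.
STOP_WORDS = %w[a an the with and of to in is it].freeze

# Count how often each non-stop word appears across a list of messages.
def word_counts(messages)
  counts = Hash.new(0)
  messages.each do |msg|
    msg.downcase.scan(/[a-z']+/).each do |word|
      counts[word] += 1 unless STOP_WORDS.include?(word)
    end
  end
  counts
end

# Keep words used more than `min_count` times, most frequent first
# (ties broken alphabetically so the order is stable).
def frequent_words(messages, min_count)
  word_counts(messages)
    .select { |_word, n| n > min_count }
    .sort_by { |word, n| [-n, word] }
    .map(&:first)
end

SAMPLE_TWEETS = [
  "Watching the snow",
  "Snow is awesome",
  "Awesome snow with friends"
].freeze

p frequent_words(SAMPLE_TWEETS, 1) # => ["snow", "awesome"]
```

Feed the resulting list (with counts) to a word-cloud tool and you get the picture above.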

Then I looked again at the filtered list and noticed something… just awesome.

Here are the forty most-commonly used words, in their exact order of decreasing frequency:

It’s time, Twitter. Love/Christmas blog:

Home! Thanks, people…

Night post:

Getting happy
watching morning
that’s tonight.
Tomorrow: looking news, trying nice? Check.

2009: Hope.
Week: 2008.

Little video:

snow.

Live free. Life. Awesome days!

Doing:

Feel house ready.
Look cool.
Sleep.
Yeah world!

I like your poem, Twitter.
A lot.

Written by mrflip

22 Jan 2009 at 10:23 pm

Massive Scrape of Twitter’s Friend Graph

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released. See below for technical info).  This is still in rough, rough draft but this dataset is so amazingly rich we couldn’t help sharing it.  We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev.  We’ll also have a much larger dump of tweets off the public datamining feed.

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.

Written by mrflip

29 Dec 2008 at 8:55 pm

What’s Next: Infinite Monkeywrench starting to take form.

We’re starting beta testing of infochimps.org v1.0 — see the following post. In order to start really populating infochimps.org with dataset payloads, the Infinite Monkeywrench is about to get some major love. The following syntax is still evolving, but we’re already using it to do some really fun stuff: here’s a preview.

One of the data sets we’re proud to be liberating is the National Climatic Data Center’s global weather data. To use that data, you need the file describing each of the NCDC weather stations. (I’ll just describe the stations metadata file; the extraction cartoon for the main dataset is basically the same but like 10 feet wide.)

The weather station metadata is found at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.txt. It’s a flat file with a 17-line header; it contains fields describing each station’s latitude, longitude, call sign and all that, and has lines that look like

# USAF   WBAN  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV(.1M)
# 010014 99999 SOERSTOKKEN                   NO NO    ENSO  +59783 +005350 +00500

Here’s what a complete Infinite Monkeywrench script to download that file, spin each line into a table row, and export as CSV, YAML, and marked-up XML would look like:

    #!/usr/bin/env ruby
    require 'imw'; include IMW
    imw_components :datamapper, :flat_file_parser

    # Stage as an in-memory Sqlite3 connection:
    DataMapper.setup(:staging_db, 'sqlite3::memory:')

    # Load the infochimps schema -- this has table and field names including type info
    ncdc_station_schema = ICSSchema.load('ncdc_station_schema.icss.yaml')

    # Create the tables from the schema
    ncdc_station_schema.auto_migrate!

    # Parse the station info file
    stations = FlatFileParser.new({
      :database  => :staging_db,
      :schema    => ncdc_station_schema,
      :each_line => :station,
      :filepaths => [:ripd, ['ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.txt']],
      :skip_head => 17,
      :cartoon   => %q{
      # USAF   WBAN  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV(.1M)
        s6    .s5   .s30                           s2.s2.s2.s4  ..ci5   .ci6    .ci5
      },
    })

    # Dump as CSV, YAML and XML
    stations.dump_all :out_file => [:fixd, "weather_station_info"], :formats => [:csv, :xml, :yaml]

Almost all of that is setup and teardown. Once the infochimps schema has field names, the only part you really have to figure out is the cartoon:

      s6    .s5   .s30                           s2.s2.s2.s4  ..ci5   .ci6    .ci5

If you’ve used Perl’s unpack(), you’ll get the syntax: this says ‘take the USAF call sign from the initial 6-character string; ignore one junk character; … take one character as the latitude sign, and an integer of up to 5 digits as the scaled latitude, ….’
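To make the unpack() comparison concrete, here is roughly the same extraction done by hand with plain Ruby's String#unpack. This is a sketch and not Monkeywrench code: the template string, the method names, the field names and the assumption that latitude and longitude are stored in thousandths of a degree are mine. The elevation scale (tenths of a meter) comes from the file's own header.

```ruby
# Template mirroring the cartoon: A = space-padded string, x = skip one byte.
STATION_TEMPLATE = "A6 x A5 x A30 A2 x A2 x A2 x A4 x2 A1 A5 x A1 A6 x A1 A5".freeze

# Assumption: LAT and LON are thousandths of a degree.
DEGREE_SCALE = 1000.0

# The sample station line from above, rebuilt field by field so the
# column widths are explicit.
SAMPLE_LINE = ("010014 99999 " + "SOERSTOKKEN".ljust(30) +
               ["NO", "NO", "  ", "ENSO"].join(" ") +
               "  +59783 +005350 +00500").freeze

# Combine a one-character sign with a string of digits.
def signed(sign, digits)
  value = digits.to_i
  sign == "-" ? -value : value
end

# Split one fixed-width station line into a hash of named fields.
def parse_station(line)
  usaf, wban, name, code_a, code_b, state, call,
    lat_sign, lat, lon_sign, lon, elev_sign, elev = line.unpack(STATION_TEMPLATE)
  {
    usaf: usaf, wban: wban, name: name,
    country_codes: [code_a, code_b], state: state, call: call,
    lat: signed(lat_sign, lat) / DEGREE_SCALE,
    lon: signed(lon_sign, lon) / DEGREE_SCALE,
    elev_m: signed(elev_sign, elev) / 10.0 # header says ELEV(.1M)
  }
end

p parse_station(SAMPLE_LINE)
```

The cartoon says the same thing in one line, and the schema supplies the names and types, which is the whole point.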

Rather load it into a database? Leave the last line out, and stage right into your DB. (Any of MySQL 4.x+, Postgres 8.2+, SQLite3+ work.)

    # Load parsed files to the 'ncdc_weather' database in a remote MySQL DB store
    DataMapper.setup(:master_weather_db, 'mysql://remotedb.mycompany.com/ncdc_weather')

Surely a hand-tuned script will do this more thoroughly (and more quickly), but you can write this in a few minutes, set it loose on the gigabytes of data, and do all the rest from the comfort of your DB, your Hadoop cluster, or a script that starts with populated data structures given by a YAML file.

Another example. The US National Institute of Standards and Technology (NIST) publishes an authoritative guide to conversion factors for units of measurement. It is, unhelpfully, only available as an HTML table or a PDF file.

If we feed into the InfiniteMonkeywrench

  • The schema
    fields:
      - { name: unit_from,                  type: str }
      - { name: unit_to,                    type: str }
      - { name: conversion_mantissa,        type: float }
      - { name: conversion_exponent,        type: float }
      - { name: is_exact,                   type: boolean }
      - { name: footnotes,
          type: seq,
          sequence: str }
  • The cartoon
    { :each    => '//table.texttable/tr[@valign="top"]:not(:first-child)',
      :makes   => :unit_conversion, # a UnitConversion struct
      :mapping => [
        '/td'   => [:unit_from, :unit_to, :conversion_mantissa, :conversion_exponent],
        '/td/b' => :is_exact,
        '/td/a' => :footnotes,
      ]
    }

We’d get back something like

  - unit_from:           'dyne centimeter (dyn · cm)'
    unit_to:             ' newton meter (N · m)'
    conversion_mantissa: 1.0
    conversion_exponent: -7

  - unit_from:           'carat, metric'
    unit_to:             'gram (g)'
    conversion_mantissa: 2.0
    conversion_exponent: -1
    is_exact:            true

  - unit_from:           'centimeter of mercury (0 °C) <a href="http://physics.nist.gov/Pubs/SP811/footnotes.html#f13">13</a>'
    unit_to:             ' pascal (Pa)'
    conversion_mantissa: 1.33322
    conversion_exponent: 3
    footnotes:           [ '<a href="http://physics.nist.gov/Pubs/SP811/footnotes.html#f13">13</a>' ]
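
Once the records are in that shape, using them is a few lines in any language. A small Ruby sketch, assuming the output was saved as YAML with the field names above; the helper names are hypothetical and the unit strings are trimmed of their footnote markup for readability:

```ruby
require "yaml"

# Two of the records above, cleaned up by hand.
RECORDS_YAML = <<~YAML
  - unit_from: "carat, metric"
    unit_to: "gram (g)"
    conversion_mantissa: 2.0
    conversion_exponent: -1
    is_exact: true
  - unit_from: "centimeter of mercury (0 °C)"
    unit_to: "pascal (Pa)"
    conversion_mantissa: 1.33322
    conversion_exponent: 3
YAML

# The factor is mantissa x 10^exponent, as in NIST's table.
def conversion_factor(record)
  record["conversion_mantissa"] * 10.0**record["conversion_exponent"]
end

# Convert an amount in unit_from into unit_to.
def convert(amount, record)
  amount * conversion_factor(record)
end

carat, mercury = YAML.safe_load(RECORDS_YAML)
puts format("%.1f carats = %.2f g", 5, convert(5, carat))
puts format("%.1f cmHg = %.1f Pa", 76, convert(76, mercury))
```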

Now with some tweaking, you could do even more (and you’ll find you need to hand-correct a couple rows), but note:

  • Once one person’s done it nobody else has to.
  • This snippet gets you most of the way to a semantic dataset in your choice of universal formats.
  • In fact, there’s so little actual code left over we can eventually just take schema + url + cartoon as entered on the website, crawl the relevant pages, and provide each such dataset as CSV, XML, YAML, JSON, zip’d sqlite3 file … you get the idea — and we can do that without having to run code from strangers on our server.
  • Most importantly, for an end user this isn’t like trusting some random dude’s CSV file uploaded to a site named after a chimpanzee. The transformation from NIST’s data to something useful is so simple you can verify it by inspection. Of course, you can run the scripts yourself to check; or you can trace the Monkeywrench code itself; and once we have digital fingerprinting set up on infochimps.org anyone willing to stake their reputation on the veracity of a file can sign it — but it’s pretty easy to accept something this terse but expressive as valid. Our goal is to give transparent provenance of infochimps.org data to any desired degree.
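
To give a feel for how little code the "every format" step needs, here is a sketch that uses only Ruby's bundled csv, json and yaml libraries. It is not the Monkeywrench's dump_all; the method names and the FIELDS list are assumptions made for the example.

```ruby
require "csv"
require "json"
require "yaml"

# Column order for the CSV export (hypothetical; taken from the schema above).
FIELDS = %w[unit_from unit_to conversion_mantissa conversion_exponent].freeze

# One parsed row, as a plain hash with string keys.
SAMPLE_ROWS = [
  { "unit_from" => "carat, metric", "unit_to" => "gram (g)",
    "conversion_mantissa" => 2.0, "conversion_exponent" => -1 }
].freeze

# Render the rows as CSV text with a header line.
def to_csv(rows)
  CSV.generate do |csv|
    csv << FIELDS
    rows.each { |row| csv << FIELDS.map { |field| row[field] } }
  end
end

# Render the same rows in each supported format, keyed by format name.
def export_all(rows)
  { csv: to_csv(rows), json: JSON.pretty_generate(rows), yaml: rows.to_yaml }
end

export_all(SAMPLE_ROWS).each do |format_name, text|
  puts "--- #{format_name}"
  puts text
end
```

Because the rows are plain hashes, adding another output format is one more entry in that hash.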

Written by mrflip

10 Sep 2008 at 1:09 pm

Vote for our SxSW Panel Talk, Get People Thinking about how the Web will help tame the Data Flood

Aaron Swartz of get.theinfo.org and watchdog.org, Kurt Bollacker from freebase.com, Shawn O’Connor from timepedia.org, and we infochimps have each put in panel proposals for the SxSWi 2009 conference.  Please consider clicking through to rate (and comment!) on these talks:

By my cursory count, there are about three times as many proposals this year as last that center on using the web for large-scale data exploration, data mashups, visualization, etc. Even if you are not attending, though, your vote will help get more people learning about the current state and future possibilities of massive data exploration on the web.

Descriptions of those talks:

Beyond Mashup: Weaving the Global Data Tapestry

http://panelpicker.sxsw.com/ideas/view/1500

Data mashups of not a few but a few thousand sources become possible as community efforts, enabled by new tools and Creative Commons licensing, unify the world’s exploding store of free, open data. Come find out what’s awesome, what’s hard, and what’s possible when you discover there’s really only one dataset. (P Kromer, infochimps.org)

How the Internet is Transforming Governance

http://panelpicker.sxsw.com/ideas/view/1038

The Internet is starting to revolutionize everything about politics and governance. Panelists will discuss new initiatives that harness the power of the Web to engage citizens in online activism, collaborative governance and oversight in ways that are radically shifting political power structures and fostering more transparency and accountability by elected officials. (Gabriela Schneider, Sunlight Foundation)

Petabyte as Platform – Building “Everything about Something” Sites

http://panelpicker.sxsw.com/ideas/view/1449

Find a topic some audience cares deeply about: their neighborhood, our government, every motorcycle ever made; and let visitors see, explore and understand it, and you make the world a better place. We’ll discuss how participating in the open, global data commons can beneficially transform our culture and economy. (Kurt Bollacker, Freebase.com)

Powers of Often: Powers of Ten in Time

http://panelpicker.sxsw.com/ideas/view/1649

In 1977, Charles & Ray Eames made a fascinating short film, Powers of Ten, showing the relative scales in the universe: from picnic, to city, to solar system, to galaxy, and so on, back to cells, molecules, and atomic nuclei. In the same spirit, Powers of Often will explore relative scales in time using real data and hard estimates: patterns of daily life, demographics, census data, generations, long term trends, forecasts, historical cycles, high-frequency finance, and solar cycles. (Shawn O’Connor, Timepedia.org)

Among all talks with “data” in the description, these also look interesting:

If you see any other worthwhile topics please reply.

Thanks!
flip

Written by mrflip

8 Aug 2008 at 4:43 am

Austin Data Nerds RIGHT JOIN

Do you work with huge datasets?  Are you interested in the opportunities that abound when the world’s free open data are integrated and the Semantic Web becomes a reality?  Come discuss data mashups, visualizations, and tools to organize, discover and explore rich information streams.

Join us at Mangia pizza on Guadalupe (3016 Guadalupe #100: free wifi, beer & great pizza) Thursday July 10th at 6:30pm and meet your fellow Austin data nerds.

(Also at Upcoming)

Written by mrflip

22 Jun 2008 at 3:26 pm

The gems of our collection — The best of what’s to come

Hooray! The infochimps have been waxy’ed.  Let’s see how the server bonobos stand up.

It’s been suggested that I highlight some of the “gems” of our collection, which we’re going to spend the whole weekend shoveling into the pile. These first few are really deep, and somewhat hard to get / not widely known:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, *pitch by pitch* trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets – the chemical and safety information required by OSHA
  • 100 years of Hourly weather data; from 1973 on there’s about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings and weighs in at ~15 GB compressed.

(Incidentally, many of those datasets sell for inexcusable and malicious prices.  For those with a commercial bent, something tells me there’s room in the market if you’re willing to accept a markup of less than 10,000 times).

These are a bit silly but interesting for their ridiculous depth:

  • A variety of mathematical constants (pi, e, Catalan’s constant, the Golden Ratio, others) calculated to in some cases a preposterous 100 billion decimal places (I’ll probably chop them off at a still-ludicrous 500 million).
  • 5000 years of solar eclipse times, 6000 years of precise lunar phase, 6000 years of Venus transits.
  • Odds of Dying for every Cause of Death listed in the US in a given year.

There are also, of course, the well-known collections: IMDB.com, musicbrainz, dbpedia, CIA factbook, geonames, citeseer, census, statistical abstract and the like.  So let’s see how much of the low-hanging fruit we can toss up there this weekend (the hard parts are adding metadata, and getting the non-copyrightable data out of the copyrighted screenscrapes, so what you’ll see are minimal metadata and the non-screenscraped datasets — still beats paying $1200+/GB though.)

[edit: dates for holidays by country, year-by-year odds of dying for all causes of death from the most recent 8 years, NIST values for physical and chemical constants, mechanical properties of common engineering materials, and the spoken and written word frequencies for ~800,000 word tokens datasets should be up later today -- if the site is down briefly we're pushing that update to the server.  (If the site is down not-briefly we've been del.waxyslashdiggdotted)  Thanks to my friend Ned for helping do some drudge work to get those out.]

Written by mrflip

4 Apr 2008 at 3:03 pm

All of Wikipedia’s infoboxes & templates, in individual tables for each kind

FINALLY — got the wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs.  There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it: it takes about 8 hours start to finish to process this dataset. They’re not perfect, but they are perfectly usable.

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. live, versioned metadata editing
  2. uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together.  (For complicated reasons, each dataset creates a new personal version of the field so you can’t actually walk from one “stock price” field to other datasets with that tag.)

Then I’ll turn some intensive attention finally to the InfiniteMonkeywrench code.  We need better tools to wrangle these huge datasets into shape.

Good Neighbors and Open Grazing: Datasets, Creative Works and Copyright

Many people don’t know how broad our rights to factual data actually are.  Unlike the mishegaas that reigns in copyright land, the world of data is largely open (and rightfully so).  To arrive at the age of ubiquitous information with a sound policy, however, we have to exercise those rights assertively, respectfully and prudently.

Let me start with the traditional IANAL and point out that if you take legal advice from a chimpanzee you deserve what you get. Instead, read iusmentis on database law and bitlaw on compilations and databases. (In which case you can probably skip the rest of this post.) (Also, the following only applies to the US, where the database laws are actually more liberal than elsewhere; I have no idea what the situation is outside the US)

In general, a comprehensive assemblage of facts cannot be copyrighted. Copyright only applies where there is creative content. A comprehensive list of cars and retail prices cannot be copyrighted; a comprehensive collection of reviews of those cars can be copyrighted. A list of all the musical albums released each year is data; the lyrics and music within them are creative. A list of word tokens sorted by artist, genre, release date and song length is data, and a list of the top-100 selling albums by year is data. This is the important Feist Publications v. Rural Telephone Service case:

“Facts, whether alone or as part of a compilation, are not original and therefore may not be copyrighted. A factual compilation is eligible for copyright if it features an original selection or arrangement of facts, but the copyright is limited to the particular selection or arrangement. In no event may copyright extend to the facts themselves.” — Sandra Day O’Connor for the Supreme Court

“Collections of facts are not copyrightable per se … A compilation, like any other work, is copyrightable only if it satisfies the originality requirement (“an original work of authorship”). Facts are never original, so the compilation author can claim originality, if at all, only in the way the facts are presented. The facts must be selected, coordinated, or arranged “in such a way” as to render the work as a whole original.” — Sandra Day O’Connor for the Supreme Court

A presentation of data can be creative — you can’t xerox the blue book and hand that out. However, a conversion of otherwise unrestricted data into your own creative presentation satisfies this restriction. So would a presentation (original or converted) that did not arise from a creative act — you couldn’t claim copyright on a .CSV file of some dataset.

Besides “presentation” and a couple edge cases (“hot news”, “selection and arrangement”), the main restriction to be aware of is “Terms of Service”. If you have to agree to terms of service that restrict the data, but you take it anyway, you can be guilty of trespass. My understanding there is that if you can a) access the site by robot (no person clicks anything) AND b) there is no robots.txt, they shouldn’t be able to sustain a claim that it’s a restricted resource.

I personally go by balancing two principles:

  1. It’s our world, and we deserve access to the information that describes it.  Besides our legal rights, we have an even stronger moral claim to the chronicle of our collective story.  And we all stand to benefit: there have to be incentives to gather and organize data, but the modest benefits of making a data provider a lot richer don’t stand against the much larger marginal benefit of making the world a tiny bit smarter.
  2. Be a good neighbor.  A lot of work goes into gathering, processing, verifying and distributing an interesting dataset.  If we infochimps run around ignoring people’s requests for modest usage conditions, we’ll have a bit extra of open data and a lot extra of pissed-off ex-kindred souls who feel like we stole their cake.  Inevitably, this will mean that people won’t put data online at all for public access.

The best approach is

  • Scrupulously credit contributions, make clear that their efforts are recognized, and that we’ll link back to them for their ultimate benefit.
  • Clearly state the usage restrictions requested by the contributor, adhere to them, and ask that recipients of the data do the same.
  • Make clear the benefits to the world for making this data available.
  • Make clear the benefits to the contributor — this data will, for free, be enhanced with metadata, converted for use by diverse tools, interlinked with other rich datasets, and power interesting projects.  If your mission statement is “build reliable and exciting cars” or “make powerful music”, then your mission statement isn’t “explore and explain unexpected correlations among disparate rich information pools”.  Let someone else do it for you, and let them build the tools to do so around your data.  Consider how much Baseball has benefitted from its statistical revolution — fed by its incredibly rich ecosystem of open data.
  • Finally, as far as scientific or government prepared data that’s otherwise rights-free: gloves off, we’re taking that data.  If you’re a researcher, and you’re not openly sharing your data, you’re not only a bad scientist but also a bad person.  Ditto for data collected at taxpayer expense.

Written by mrflip

2 Apr 2008 at 4:38 pm

Stock Market dataset is up

40 Years of data on every NYSE, AMEX and NASDAQ listed stock:

These links were busted before but should be worky now.

Written by mrflip

20 Mar 2008 at 7:45 am
