SXSW Data Panels
We are especially excited to announce that big data is coming to SXSW. Here are the panels we like:
Pete Skomoroch of DataWrangling: Petabyte As Platform, Making Big Data Accessible Online – We have long been fans of Pete Skomoroch’s work, and this is your chance to hear from him about web applications built on massive datasets.
Our own mrflip: Scraping the Social Web – Flip has done extensive work building massive datasets from social media sites. Hear him talk about the nuances involved and ask him about best practices.
Michael Driscoll of Dataspora: Cloud Crunching Big Data with HIVE/Hadoop and R and Become a Sexy Data Geek in One Week – Another friend of ours, Michael, will be talking about how to use the right tools to massage and produce results from big datasets, and will profile what you need to do to be a data geek.
Stu Hood of Rackspace: Using Hadoop to Manage a Ton of Data – Hadoop might be the most important tool to know for working with terabytes and terabytes of data.
Ian Davis of Talis: Set Your Data Free – Talis does great work. Listen to Ian cover topics very relevant to Infochimps.org’s collection: data copyright and licensing.
Dave Bowker of Designing the News: Engaging Data Visualizations and Infographic Communication – Glad to see some data viz stuff at SXSW.
Casey Caplowe of GOOD: Interactive Infographics – More visualizations, GOOD stuff.
Leave a comment if you know of any other good ones.
Infochimps receives a donation from SmartBear
Smart Bear Software is an Austin-based company whose founder, Jason Cohen, is one of our favorite people. Jason grew Smart Bear from the ground up, and he has helped the Infochimps team in the past with practical advice. Jason blogs about marketing and small business at http://blog.asmartbear.com/ and he is well worth reading.
The Infochimps rely on agile methods to build Infochimps.org, a process that can benefit from a code review tool. Smart Bear’s product, Code Collaborator, is a well-known online peer code review tool that simplifies and expedites code reviews, helping teams produce higher-quality, tested and done code more efficiently.
Smart Bear’s latest promotion offered 5 seats of one of their code review tools for $5. As a part of this promotion, they selected a start-up company to receive the funds collected from the promotion. Infochimps won! Smart Bear has graciously donated $2220 to Infochimps to help our mission of increasing the world’s access to data. We appreciate their acknowledgment of our work and we know we can put the funds to good use.
To see how we reacted to the news, check out the video below:
Open a banana like a Monkey does
Open a Banana like a Monkey – most human primates do it wrong!
To go with open banana here is open banana data:
- The USDA Nutrient Database will help you find the nutritional value of a banana (online search | infochimps entry)
- Per Capita Consumption of Major Food Commodities: 1980 to 2005
- Fresh Fruits and Vegetables–Supply and Use: 2000 to 2006
It’s Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames.
It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.
I pulled the NCDC weather for Austin from 1948-present (see infochimps.org link for details) and got my Tufte on.
This temperature cycle is hotter than, but comparable to, the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion, that 2000, 2008 and this year so far were damn hot, stands up well.
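For anyone who wants to repeat the tally on their own copy of the weather data, here is a minimal sketch of the counting step. It assumes the daily records have already been parsed into (date, max temperature in °F) pairs; the function name `hot_days_per_year` and the sample readings are made up for illustration and are not the actual NCDC values or the code used for the chart.

```python
from collections import Counter
from datetime import date


def hot_days_per_year(readings, threshold_f=100.0):
    """Count, per year, the days whose max temperature is at or above threshold_f.

    `readings` is an iterable of (date, max_temp_fahrenheit) pairs.
    Years with no qualifying days are simply absent from the result.
    """
    counts = Counter()
    for day, tmax_f in readings:
        if tmax_f >= threshold_f:
            counts[day.year] += 1
    # Return a plain dict ordered by year for easy printing or plotting.
    return dict(sorted(counts.items()))


# Hypothetical sample readings, NOT real Austin measurements.
sample = [
    (date(2000, 7, 14), 101.0),
    (date(2000, 7, 15), 99.0),
    (date(2000, 9, 4), 104.0),
    (date(2008, 6, 2), 100.0),
]

print(hot_days_per_year(sample))
```

Feeding the per-year counts into any plotting tool gives the same kind of bar-per-year view as the linked visualization.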
Congrats Retrosheet – another decade of rich Baseball data online
Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008). As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsource efforts of the Retrosheet team.
Infochimps metadata entries for these datasets:
- Box Scores
- Game Logs (play-by-play)
- Ballparks, 1903-current
- Transactions, 1873-current
- Awards and Honors
Freebase Hack Day & Updates
Our friends at Freebase are having another Hack Day in San Francisco this July. It’s only two weeks away now and the remaining tickets can go fast, so get involved: http://blog.freebase.com/2009/06/26/two-weeks-til-freebase-hack-day-sign-up-now/.
Learn about the many cool things that Freebase is doing with their data, and the tools that can be built using their platform.
On a side note, http://infochimps.org/ has gotten a facelift. We’d love feedback on it: info@infochimps.org. We hope your browsing experience is better, and we will be happy to roll out new features soon!
What’s New
Infochimps has been acknowledged as a finalist by the Capital Factory for 2009.
Infochimps is also a finalist in PepsiCo’s pitch competition.
Infochimps has a Facebook page! Become a fan.
Katherine at The New Civilization is aiding us in UX design for our Beta, to be launched at the end of May. Eve Simon in Washington DC is helping us with the site design. Our two big goals for the Beta are:
1) Improved browseability of the datasets, including a search bar and better surfing through tags, categories, and collections.
2) Uploading capability. Users will be able to create accounts and upload datasets, as well as edit the descriptions of other data on the site.
Drop us a line anytime at info@infochimps.org
@mrflip’s OpenGov Talk: Data Commons and Transparent Government
Here is my (mrflip’s) SxSW OpenGov talk, “How Open Data will help build Open Government”:
There is nothing more painful than watching yourself talk. So I haven’t gone all the way through this video — if you see me don’t give away the ending. Huge thanks to Silona Bonewald (League of Technical Voters) for organizing this, and to Terry Walhus (spring.net) for taping and copying and editing and uploading the videos.
I love it when a plan comes together…
So Simon Willison (@simonw), one of the architects of The Guardian’s Open Platform and co-creator of a modestly popular web framework, is here at SxSW and gave an informal talk (on Zeppelins, of course – what else?). Freebase community manager Kirrily Robert (@skud) saw my tweet and proposed a meetup. After iteratively solving the three body problem, we put out the word on Sunday morning for a meetup on Sunday evening… SemWebAustin @juansequeda and Freebase @jameshome each pinged their 1-neighborhood and next thing you know I’m sitting next to Jure Cuhalev of Zemanta and machine learning machine @Nikete trying to orchestrate overflow seating for 25+ data geeks.
The reason for the gossip-column style of this post is to show the size and breadth of the data geek crowd. James Home and I agree that we need to turn out this Cyrus’ army of data geeks to take over a much larger part of SxSW next year. We need talks on column-store databases and hadoop, linked data and the construction of the data commons, how NLP and machine learning can power inspiring audience-driven websites, on the developing grammar of Information Visualization, on Processing and Prefuse and R. Pete Skomoroch, Mike Driscoll and Christian Chabot all ended up skipping SxSW this year; we need them leading a panel discussion on how to visualize >10M point datasets with limited-bandwidth desktop and web interfaces. I’d like to hear Deepak Singh and one of the @cloudera’ns drop science about scalable cloud computing.
The evening was just informal mingling and conversation, but at the request of @mndoci and @dataspora, here is our name-droppy slice of the whirlwind:
@mrflip: Learned about how Zemanta is already putting Linked Data and NLP together to make blogging better. Jure is excited about infinite monkeywrench and might be brave enough to pre-alpha its inchoate HTML munger. Got to hear what Blaine Cook of Osmosoft is doing to solve the fractured twitter/facebook/identi.ca/500M-person-strong-local-social-networks-you’ve-never-heard-of ecosystem, and he gave some great feedback on our upcoming Twitter Census. Also got to learn, after pontificating that OAuth is hard, that I was talking to its architect; a great discussion with Blaine and ENTP Uruguay Evan Henshaw-Plath followed about the Rails authorization/identity/authentication stack.
Mike Migurski of @stamen is going to get together with infochimp @dhruvbansal to push the Open Street Maps dataset into Amazon Public Data Sets collection. Harper Reed of Threadless was running off for a 6am (ugh) flight to babysit servers in Chicago by the time we chatted, but pointed towards his Chicago Transit API project. His post on Hidden APIs is a great read BTW. Ran into @Slicehost Matt Tanase at a party after; Rackspace is getting much Cloud-ier, including a 1.5cents/hour pay-as-you-go 256MB slice offering. I’m hoping to talk later about our MachetEC2 project and get his thoughts about how to put open data on tap in the cloud. Jon Pierce and I discussed the Mets’ chances this year and what he sees for big data startup possibilities. Only got to briefly intersect with Andrew Turner about open geocommons, and was chagrined to learn I was shoulder to shoulder with one of gnip but didn’t get to chat. Hope to fix that later.
This meeting alone made SxSW worth it, and I’m looking forward to more discussion later. You can stalk me on twitter as @mrflip or at http://sxsw2009.sched.org/flip. By the way, I’m giving a lightning talk on Open Data in government at Fiddler’s Hearth, 301 Barton Springs Rd at 12:30 — drop by or catch the webcast later.
Amazon Web Services hosts DBpedia, Freebase data sets
The Infochimps.org community played a part in pushing the DBpedia and Freebase data sets to Amazon Web Services. This is an auxiliary effort by Infochimps.org to increase access to data. It is important to have the data in places where people have the right tools to use it. AWS is that place: look at creating an Amazon Machine Image to start working with the new data sets. Our MachetEC2 can help; please let us know how your experience with it goes.
Thanks to Kingsley Idehen with Linked Open Data for being a good point of contact.
We will upload more data sets to AWS in the near future. Any requests?
Start hacking: machetEC2 released!
machetEC2, the Infochimps Amazon Machine Image (AMI) designed for data processing, analysis, and visualization, has been released!
Amazon’s Cloud Computing services give you transformatively cheap and scalable computing power, and their Public Data Sets (AWS/PDS) collection (which infochimps is contributing to) is helping to put the world of free, open data at your fingertips. MachetEC2 lets you summon a “batteries included” computer — or a hundred computers — from the cloud. As soon as it loads, you’re ready to start crunching and transforming and visualizing data, whether from AWS/PDS, or infochimps.org, or your own pool.
When you SSH into an instance of machetEC2 (brief instructions after the jump), check the README files: they describe what’s installed, how to deal with volumes and Amazon Public Datasets, and how to use X11-based applications. You can also visit the machetEC2 GitHub page to see the full list of packages installed, the list of gems, and the list of programs installed from source.
This machete is only as sharp as it is complete. If there’s software that you find indispensable, we encourage you to suggest it here, or even better to help add it to the toolkit (instructions are within).
Hacking through the Amazon with a shiny new MachetEC2
Hold on to your pith helmets: the Infochimps are releasing an Amazon Machine Image designed for data processing, analysis, and visualization.
Amazon’s Elastic Compute Cloud (EC2) allows users to instantiate a virtual computer with a pre-installed operating system, software packages, and up to 1 TB of data loaded on disk, ready to work with, from a shared image (an “Amazon Machine Image”, or AMI).
MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, analysis, and visualization. If you create an instance of MachetEC2, you’ll have an environment with tools designed for working with data ready to go. You can load in your own data, grab one of our datasets, or try grabbing the data from one of Amazon’s Public Data Sets. No matter what, you’ll be hacking in minutes.
We’re taking suggestions for what software the community would be most interested in having installed on the image (peek inside to see what we’ve thought of so far…)
The Asdrubal Cabrera Hall of Fame
Prompted by my friend’s skepticism that the ballplayer Milton Bradley is really so named, I’m exhuming this old post from elsewhere. — flip
During the 2007 baseball playoffs, announcer Tim McCarver perspicaciously observed that “Asdrubal Cabrera is the only player in the majors with that first name”. Thus inspired, I present The Asdrubal Cabrera Hall of Fame: Major League ballplayers in unique possession of their particular first name. (Some are nicknames, many are not — but these are their official names, as used in newspapers and the rolls of history. F’reals.)
You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard recounted the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith? Mul Holland, Sixto Lezcano, Welcome Gaston or Mox McQuery? There’s a bunch more after the jump, and a complete listing here, including links to each player’s baseball reference page.
For some dinnertime fun over the holidays, discuss the relative merits of naming your next child after Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, or Buttercup Dickerson. (Unfortunately, 12 other “Rusty”s keep fan favorite Rusty Kuntz off this list, and believe it or not two other “Stubby”s bar the way for Stubby Clapp. I apologize to anyone whose internet filter has or has not prevented reading this apology.)
Thanks to the Baseball Databank and Retrosheet, I had this dataset on hand, and thanks to a monastic life of nerdity I had the SQL chops to pull up this query between innings. But I should be able to do this with anything, whether or not I know a SQL Query from a Queer-Eye Sequel, for silly stunts and for changing lives alike.
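For the curious, the between-innings query boils down to a GROUP BY with a HAVING clause. Below is a rough sketch of that idea, not the original query: the table name `players`, its columns `name_first` and `name_last`, and the helper `unique_first_names` are all assumptions for illustration, and the handful of rows is typed in by hand rather than pulled from the Baseball Databank.

```python
import sqlite3


def unique_first_names(rows):
    """Return (first, last) pairs for players who alone hold their first name.

    `rows` is an iterable of (name_first, name_last) tuples. The schema is
    hypothetical; a real player table will have its own column names.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE players (name_first TEXT, name_last TEXT)")
    conn.executemany("INSERT INTO players VALUES (?, ?)", rows)
    # Keep only first names that occur exactly once. MIN(name_last) is safe
    # because each surviving group contains a single row.
    query = """
        SELECT name_first, MIN(name_last)
        FROM players
        GROUP BY name_first
        HAVING COUNT(*) = 1
        ORDER BY name_first
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hand-typed illustrative rows, not an extract from the real dataset.
sample_players = [
    ("Asdrubal", "Cabrera"),
    ("Rusty", "Kuntz"),
    ("Rusty", "Staub"),
    ("Stubby", "Clapp"),
    ("Stubby", "Overmire"),
    ("Yogi", "Berra"),
]

for first, last in unique_first_names(sample_players):
    print(first, last)
```

With these sample rows the Rustys and Stubbys cancel each other out, which is exactly the fate described above.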
Imagine instead I were a public health expert, interested in the effects of limiting medical residents to an 80-hour work week. Might lives be saved if I could effortlessly pull up historical data on rates of doctor-induced complications, board of medicine complaints, relative rates of med school and law school applications, and open-government data on medical regulations?
The long-term mission of infochimps.org is to democratize this: to put the world’s analytic data at our fingertips, supporting tools that let anyone manipulate, interrogate, visualize and explore that data. Giving baseball geeks a chance to show up Tim McCarver isn’t much of a start, but here we are.
More awesome first names after the jump….
