Big Data, Small Effort

Paul Ford on 2017-11-16

An awful lot of really interesting public data is made available in CSV format. This is good and bad. If I give you a CSV file, you can open it up in Excel and sort and search—but if it’s a 10-gigabyte CSV, that will be slow and intractable. So you probably want to import that big CSV into a database. Except now you’re doing some database programming, and everything goes wrong, and you’ll probably give up and stop exploring the data. You’ll probably just drift back to Twitter. Eventually you erase what you downloaded to make more room on your hard drive.

A couple years ago I wrote about how I wish more public data was made available in SQLite, not CSV. That was one of those things that you write and people get in touch and go, “I had that idea too!” Eventually I even went down to the Library of Congress and said it to a group of librarians and data professionals from all over, and no one threw anything at me. However, I didn’t do much besides complain, which makes sense because I’m management.

On the other hand, Simon Willison, an open-source programmer of note, recently launched something that made me very happy, and I want to make sure you know about it. It’s called Datasette. (Simon’s writeup is here.)

Datasette is an app that you run from the command line. You just point it to a SQLite database and run datasette mydb.db. And what it does is set up a simple, secure, publicly-available read-only web application for exploring that data. That’s basically it. It takes database files and makes them really easy to explore on the Web. Here it is for dogs.

If you’re a public data provider—and many large NGOs, government organizations, cultural organizations, historical archives, media organizations, medical orgs, and academic institutions are exactly that—you can publish gigabytes of data, and make it available as an API, and make it easy to browse on the web, too, with extremely low effort. Put it into SQLite, point this little guy at it, and you’ve just radically increased the accessibility and utility of your data. Because messing around in SQL from a web browser is orders of magnitude more immediately useful than downloading a CSV, processing it, and figuring out what comes next.

As sample data, Simon used some CSV files from FiveThirtyEight, including a list of congresspeople and their ages when serving. You can explore that data with regular SQL. But! I can also find out the ages at which the most congresspeople were elected—your 50s, it turns out, is the most common age to get elected to congress. And you can revise that query further and further and share links to it. So you can get a nice list of the most common decades of their lives that congresspeople are elected, should you ever want such a thing:

And you can just link to that query and it’s all really easy. I would never have done any of that if you’d given me just a CSV. It’s this mini-platform that makes it possible and fun.

There are lots of reasons why Datasette is cool, so I’ll give you five.

  1. It’s open. SQLite is public domain and all of the code used to support Datasette is open-sourced. And the entire database is available for download by default. SQLite is super-fast and reliable for read-only workloads and can scale to terabytes.
  2. It’s a real API. You can explore the results with a mouse—and you can export them as JSON. You could use this to power “live” apps and visualizations. It’s open to everyone by default. No API access keys. Just start hacking.
  3. It’s tweetable. Interesting queries are just links, and they’ll fit right into tweets. So it’s super-easy to share the results of your explorations with humans.
  4. It’s read-only and limits CPU usage. The whole system is constructed so that people can do anything they want but not write anything to your system. And there’s only so long a query will run before the system times out (but you can adjust that, too). This means that it’s pretty hard for people to take advantage of you and your servers—good behavior is enforced.
  5. It’s decentralized. Anyone can publish any data from anywhere. It’s not one big monolithic data mart or an attempt to consolidate all data into one big platform. Rather, it’s a way to take your data and share it with anyone on the Internet, or just explore it yourself, with almost no effort.

Anyway, Datasette! It is a cool thing that solves a real problem in a very flexible way. Thanks to Simon Willison for making it and all the other open-source people and data providers who make the web into such an interesting place.

Hope this was useful!