Data done wrong: The only-most-recent data model

It’s not uncommon to encounter a database that stores only the most recent state of things. For example, say the database has one row per Danaus plexippus individual, with a column called stage that tells you whether an individual is currently a caterpillar or a butterfly.

This kind of design might seem fine for some applications, but it gives you no way of seeing what happened in the past. When did that individual become a butterfly? (Conflate, for the moment, the time of the change in the real world and the time the change is made in the database – and say that the change is instantaneous.) Disturbingly often, after running a timeless database for a while, you find that you actually do need to know how the database changed over time – but you haven’t got that information.

There are at least two approaches to this problem. One is to store transactional data. In the plexippus example this could mean storing one row per life event per individual, with the date-time at which the record was entered. The current known state of each individual can still be extracted (or maintained as a separate table). Another approach is to use a database that tracks all changes; the idea is something like version control for databases, and one implementation with this philosophy is Datomic.
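
To make the event-log idea concrete, here’s a minimal sketch in R – the table, column names, and dates are all made up for illustration:

# Hypothetical event log: one row per life event per individual,
# with the date-time at which the record was entered
events <- data.frame(
  individual = c(1, 1, 2),
  stage      = c("caterpillar", "butterfly", "caterpillar"),
  recorded   = as.POSIXct(c("2014-03-01 09:00", "2014-04-04 10:00",
                            "2014-03-15 12:00"))
)

# The current known state of each individual is just its latest event
latest  <- events[order(events$recorded, decreasing = TRUE), ]
current <- latest[!duplicated(latest$individual), ]
current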

With a record of transactional data or a database that stores all transactions, you can query back in time: what was the state of the database at such-and-such time in the past? This is much better than our original setup. We don’t forget what happened in the past, and we can reproduce our work later even if the data is added to or changed. Of course this requires that the historical records not be themselves modified – the transaction logs must be immutable.

This is where simple transactional designs on traditional databases fail. If someone erroneously enters on April 4th that an individual became a butterfly on April 3rd, when really the transformation occurred on April 2nd, and this mistake is only realized on April 5th, there has to be a way of adding another transaction to indicate the update – not altering the record entered on April 4th. This is the distinction between when something happened in the world (“valid time”) and when it was recorded in the database (“transaction time”). It can quickly become confusing – it’s a little mind-bending to think about data about dates which itself changes over time. The update problem is a real headache, and I would like to find a good solution to it.
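
One common fix, sketched here with made-up data, is to store both times on every row: when the change happened in the world and when the row was entered. Corrections are then just more rows, and the current belief is whatever was entered last:

# Hypothetical append-only log with both times on each row
entries <- data.frame(
  individual = c(7, 7),
  stage      = c("butterfly", "butterfly"),
  event_date = as.Date(c("2014-04-03", "2014-04-02")), # when we think it happened
  entered_on = as.Date(c("2014-04-04", "2014-04-05"))  # when we said so
)

# Current belief: for each individual and stage, the latest-entered row wins;
# the April 4th row is superseded but never altered
entries <- entries[order(entries$entered_on, decreasing = TRUE), ]
belief  <- entries[!duplicated(entries[c("individual", "stage")]), ]
belief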


Double pie chart re-design implemented!

Back in October I mocked up a re-design for a pair of pie charts in connection with a DC Action for Children project. I haven’t been contributing to the project much since, but I was very excited to see that Margo Smith got involved through DataKind and implemented (and improved!) my mock-up using d3. The project is still in progress, but Margo’s work has already been incorporated.

The original double pie chart:

[screenshot of the original double pie chart]

My mock-up:

[screenshot of the mock-up]

Margo’s implementation:

[screenshot of Margo’s implementation]

And it’s not just better as a static viz – the live version also animates and adaptively adjusts label positions, really making it fun and revealing to interact with the DC Action for Children map. Very cool to see this come to life!

If you’d like to contribute to this project, check the github and/or get in touch with @nickmcclellan!

And Another Thing… from the Hitchhiker’s Guide

Somehow I hadn’t known about Eoin Colfer’s addendum to Douglas Adams’ Hitchhiker’s Guide to the Galaxy series until just recently. Maybe I hadn’t heard about it because it wasn’t terribly good. I don’t know a lot about fan fiction, but I imagine that on the fan fiction spectrum it rates pretty well.


My little sister is reading the James Potter series, which is a fan fiction extension of the Harry Potter universe, naturally. That one has gotten so popular that over a million people have read it, apparently. For both of these series, were the originals so mind-shatteringly good that they defy imitation? I think it may be that people (including me) fell in love with the originals for reasons not limited to the isolated merits of the works.

I watched Star Wars as a kid on the floor in my grandparents’ living room. It was warm and comfortable and amazingly good. But I don’t think I like the Star Wars movies because they’re the best films out. And I have a hard time hating the newer Star Wars movies. I feel instead a sort of impossible nostalgia for the pasts that children watching them now might recall years hence.

This is all to say: I was aware that Eoin Colfer was not Douglas Adams, but I enjoyed what Colfer did with Adams’ universe.

Popularity Contest

As seems to happen when I have a lot to do, I got the itch this weekend to do something else. So I threw together a quick node app on heroku, using the twit module from npm and bootstrap with the superhero theme from bootswatch. It’s at popular.herokuapp.com, which I’m frankly amazed wasn’t taken.

It started as a sort of joke about social media analytics and the silliness of judging things by the noise on twitter, but it’s actually pretty fun. Compare whatever you want to the current Bieber rate of 198,542 tweets per day, and so on!

How does it work? It just gets the most recent 100 tweets (or all of them, if there are fewer) for a search and uses the time since the oldest of those, together with the count (usually 100), to get a time-per-tweet, which is then translated to tweets-per-day. This is the part that I think is funny: it seems like a lot of people/groups essentially make up whatever crazy calculation they want and try to sell it as social media analytics. “Well, ‘impressions’ is ‘total reach’ times a number we made up, because probably people look at the tweet that many times, right?”
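
In R, the same back-of-the-envelope arithmetic looks something like this (the app itself is node.js, and these timestamps are simulated rather than fetched from twitter):

# Pretend we fetched 100 tweets, the oldest about an hour old
tweet_times <- Sys.time() - runif(100, min = 0, max = 3600)

# Time elapsed since the oldest tweet, in days
elapsed_days <- as.numeric(difftime(Sys.time(), min(tweet_times),
                                    units = "days"))

# Number of tweets divided by elapsed time gives tweets-per-day
tweets_per_day <- length(tweet_times) / elapsed_days
tweets_per_day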

Anyway, it was fun to throw this together this morning. It’s liable to break, because I haven’t done anything clever to ensure that my twitter API credentials don’t get messed up and so on. And of course it can only handle a small rate of queries before it’ll hit twitter’s limits, I think.

Bayes’ Rule for Ducks

You look at a thing.

duck?

Is it a duck?

Rephrase: What is the probability that it’s a duck, if it looks like that?

Bayes’ rule says that the probability of it being a duck, if it looks like that, is the same as the probability of any old thing being a duck, times the probability of a duck looking like that, divided by the probability of a thing looking like that.

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

This makes sense:

  • If ducks are mythical beasts, then Pr(duck) (our “prior” on ducks) is very low, and the thing would have to be very duck-like before we’d believe it’s a duck. On the other hand, if we’re at some sort of duck farm, then Pr(duck) is high and anything that looks even a little like a duck is probably a duck.
  • If it’s very likely that a duck would look like that (Pr(looks|duck) is high) then we’re more likely to think it’s a duck. This is the “likelihood” of a duck looking like that thing. In practice it’s based on how the ducks we’ve seen before have looked.
  • The denominator Pr(looks) normalizes things. After all, we’re in some sense portioning out the probabilities of this thing being whatever it could be. If 1% of things look like this, and 1% of things look like this and are ducks, then 100% of things that look like this are ducks. So Pr(looks) is the total we’re dividing by – it’s the denominator (worked in symbols just below this list).
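
To put that one-percent example in symbols:

\displaystyle Pr(duck | looks) = \frac{Pr(duck \cap looks)}{Pr(looks)} = \frac{0.01}{0.01} = 1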

Here’s an example of a strange world to test this in:

[diagram of the ten things described below]

There are ten things. Six of them are ducks. Five of them look like ducks. Four of them both look like ducks and are ducks. One thing looks like a duck but is not a duck. Maybe it’s a fake duck? Two ducks do not look like ducks. Ducks in camouflage. Test the equality of the two sides of Bayes’ rule:

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

\displaystyle \frac{4}{5} = \frac{\frac{6}{10} \cdot \frac{4}{6}}{\frac{5}{10}}
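
You can also check the arithmetic directly, for instance in R:

# Counts from the ten-thing world above
pr_duck       <- 6/10  # things that are ducks
pr_looks      <- 5/10  # things that look like ducks
pr_looks_duck <- 4/6   # ducks that look like ducks
pr_duck_looks <- 4/5   # duck-looking things that are ducks

all.equal(pr_duck_looks, pr_duck * pr_looks_duck / pr_looks)

## [1] TRUE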

It’s true here, and it’s not hard to show that it must be true, using two ways of expressing the probability of being a duck and looking like a duck. We have both of these:

\displaystyle Pr(duck \cap looks) = Pr(duck|looks) \cdot Pr(looks)

\displaystyle Pr(duck \cap looks) = Pr(looks|duck) \cdot Pr(duck)

Check those with the example as well, if you like. Using the equality, we get:

\displaystyle Pr(duck|looks) \cdot Pr(looks) = Pr(looks|duck) \cdot Pr(duck)

Then dividing by Pr(looks) we have Bayes’ rule, as above.

\displaystyle Pr(duck | looks) = \frac{Pr(duck) \cdot Pr(looks | duck)}{Pr(looks)}

This is not a difficult proof at all, but for many people the result feels very unintuitive. I’ve tried to explain it once before in the context of statistical claims. Of course there’s a Wikipedia page and many other resources. I wanted to try to do it with a simple unifying example that makes the equations easy to parse, and this is what I’ve come up with.

A micro-intro to ggmap

This describes what we did in the break-out session I facilitated for the illustrious Max Richman‘s Open Mapping workshop at Open Data Day DC. For more detail, I recommend the original paper on ggmap.

ggmap is an R package that does two main things to make our lives easier:

  • It wraps a number of APIs (chiefly the Google Maps API) to conveniently facilitate geocoding and raster map access in R.
  • It operates together with ggplot2, another R package, which means all the power and convenience of the Grammar of Graphics is available for maps.

To install ggmap in R:

install.packages("ggmap")

Then you can load the package.

library(ggmap)

## Loading required package: ggplot2

One thing that ggmap offers is easy geocoding with the geocode function. Here we get the latitude and longitude of The World Bank:

address <- "1818 H St NW, Washington, DC 20433"
(addressll <- geocode(address))

##      lon  lat
## 1 -77.04 38.9

The ggmap package makes it easy to get quick maps with the qmap function. There are a number of options available from various sources:

# A raster map from Google
qmap("Washington, DC", zoom = 13)

[map: Google raster of Washington, DC]

# An artistic map from Stamen
qmap("Washington, DC", zoom = 13, source = "stamen",
     maptype = "watercolor")

[map: Stamen watercolor of Washington, DC]

Since we were at The World Bank, here’s a quick map showing where we were. This shows for the first time how ggplot2 functions (geom_point here) work with ggmap.

bankmap <- qmap(address, zoom = 16, source = "stamen",
                maptype = "toner")
bankmap + geom_point(data = addressll,
                     aes(x = lon, y = lat),
                     color = "red",
                     size = 10)

[map: Stamen toner with a red point at The World Bank]

To connect with Max’s demo, we can load in his data about cities in Ghana.

ghana_cities <- read.csv("ghana_city_pop.csv")

We’ll pull in a Google map of Ghana and then put dots for the cities, sized based on estimated 2013 population.

ghanamap <- qmap("Ghana", zoom = 7)
ghanamap + geom_point(data = ghana_cities,
  aes(x = longitude, y = latitude,
      size = Estimates2013), color = "red") +
  theme(legend.position = "none")

[map: Ghana, with city points sized by 2013 population estimates]

Another useful feature to note is the gglocator function, which lets you click on a map and get the latitude and longitude of where you clicked.

gglocator()

This is all the tip of the iceberg. You’ll probably want to know more about ggplot2 if you’re going to make extensive use of ggmap. rMaps is another (and totally different) great way to do maps in R.

This document is also available on RPubs.

doge coding: much wow

I have recently come across two more or less doge-titled educational resources for coding. This definitely constitutes a trend.


First up is Learn You a Haskell for Great Good!. I’m pretty sure the title includes the exclamation point. It’s a free book about Haskell, of course. (You can also buy it if you want.)

Last up is Learn You The Node.js For Much Win!. Same deal with the exclamatory title. This one is a command-line interactive tutorial about node.js that runs on workshopper. I found out about this after first hearing about a similar thing for git called git-it.

I, for one, would love to see these somehow form the basis for an entire line of amusingly titled “Learn you” books (and so on).