Peeps Be Askin’ Me

Peeps be all the time askin’ me, what are the excellent tech/data things to do in DC? Where are the cool people to be found? What’s good?

Well look:

For data talks and socializing, get with the meetups listed on Data Community DC’s “Speaking Events” page.

For hanging out and hacking on projects that help the world be more awesome, you know you have to be down with Code for DC.

And of course there’s Hack and Tell, but you already know.

Scraping GNU Mailman Pipermail Email List Archives

I worked with Code for Progress fellow Casidy at a recent Code for DC civic hacknight on migrating old email list archives for the Commotion mesh network project to a new system. The source system was GNU Mailman with its Pipermail web archives for several email lists such as commotion-discuss.

We used Python’s lxml for the first-pass scraping of all the archive file URLs. The process was then made more interesting by the gzip’ing of most monthly archives. Instead of saving the gzip’ed files to disk and then gunzip’ing them, we used Python’s gzip and StringIO modules. The result is the full text history of a specified email list, ready for further processing. Here’s the code we came up with:

#!/usr/bin/env python

# Python 2: the print statement and StringIO date this script.
import requests
from lxml import html
import gzip
from StringIO import StringIO

listname = 'commotion-discuss'
url = 'https://lists.chambana.net/pipermail/' + listname + '/'

# The Pipermail index page links each month's archive
# in the third column of its table.
response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    # Download one monthly archive, gunzip'ing in memory if needed.
    print filename
    response = requests.get(url + filename)
    if filename.endswith('.gz'):
        contents = gzip.GzipFile(fileobj=StringIO(response.content)).read()
    else:
        contents = response.content
    return contents

# The index lists newest months first; reverse for chronological order.
contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = "\n\n\n\n".join(contents)

with open(listname + '.txt', 'w') as filehandle:
    filehandle.write(contents)
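(For what it’s worth: under Python 3, where response.content is bytes and StringIO is gone, the same in-memory gunzip’ing could be sketched with io.BytesIO. The helper name here is made up, just to isolate the idea:)

```python
import gzip
import io

def maybe_decompress(filename, content):
    # Same in-memory gunzip'ing as in the scraper, but for Python 3,
    # where content is bytes rather than a str.
    if filename.endswith('.gz'):
        return gzip.GzipFile(fileobj=io.BytesIO(content)).read()
    return content

# Round-trip check with made-up archive content.
original = b'From someone  Mon Jan  6 2014\nHello, list!\n'
assert maybe_decompress('2014-January.txt.gz', gzip.compress(original)) == original
assert maybe_decompress('2014-January.txt', original) == original
```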

The Information: a History, a Theory, a Flood

This is a really good book.

[cover image: The Information]

James Gleick is excellent. The history is beautifully researched and explained; there is so much content, and it is all fitted together very nicely.

The core topic is information theory, with the formalism of entropy, but perhaps it’s better summarized as the story of human awakening to the idea of what information is and what it means to communicate. It is a new kind of awareness. Maybe the universe is nothing but information! I’m reminded of the time I met Frederick Kantor.

I’m not sure if The Information pointed me to it, but I’ll also mention Information Theory, Inference, and Learning Algorithms by David J.C. MacKay. This book can be read in PDF for free. I haven’t gone all through it, but it seems to be a good, more advanced reference.

The Information: Highly recommended for all!

Dataclysm: There’s another book

[cover image: Dataclysm]

Dataclysm is a nicely made book. In the Coda (p. 239) we learn something of why:

Designing the charts and tables in this book, I relied on the work of the statistician and artist Edward R. Tufte. More than relied on, I tried to copy it.

The book is not unpleasant to read, and it goes quickly. It may be successful as a popularization. I rather wish it had more new interesting results. Perhaps the author agrees with me; often the cheerleading for the potential of data reads like disappointment with the actuality of the results so far.

The author’s voice was occasionally quite insufferable. He describes himself “photobombing before photobombing was a thing” in a picture with Donald Trump and Mikhail Gorbachev, for example. This anecdote is around an eighth of the text in the second chapter; perhaps more. The chapter is about the value of being polarizing, so if he alienated me there it may count as a success.

In conclusion: the OkTrends blog is fun; there’s also a book version now.

I want an edit button on everything

My blog does not have an edit button for people who are not me. This means it takes a bunch of work to fix a typo, for example: you’d have to tell me about it, describing where the typo is, somehow, and then I would have to go find the spot, and make the change. In practice, this pretty much doesn’t happen.

Wikipedia has edit buttons on everything, and so does github. I’m not entirely sure what is best between allow-all-edits-immediately and require-review-for-all-edits. Some mix is also possible, I guess. Wikipedia has ways to lock down articles, and must have corresponding permission systems for who can do/undo that. Github lets you give other people full edit permissions, so you can spread the editor pool at least. Git by itself can support even more fine-grained control, I believe.

I’d like to move my blog to something git-backed, like github pages. It’s a little work, but you can put an edit button on the rendered HTML views shown by github pages too. Advanced R has a beautiful “Edit this page” button on every page. Three.js has one in their documentation. Eric points out Ben’s blog as well, and also the trouble with comments.

Ideally I’d prefer not to be github-bound, I guess, or bound to some comment service. But I also kind of prefer to have everything text-based, so what do you do for comments then? And also I’d like to be able to do R markdown (etc.) and have that all render automagically. But also something serverless. I’m drawn to this javascript-only static site generator, but that also seems to be A Bad Idea.

So: that solves that.

How To Sort By Average Rating

Evan Miller’s well-known How Not To Sort By Average Rating points out problems with ranking by “wrong solution #1” (by differences, upvotes minus downvotes) and “wrong solution #2” (by average ratings, upvotes divided by total votes). Miller’s “correct solution” is to use the lower bound of a Wilson score confidence interval for a Bernoulli parameter. I think it would probably be better to use Laplace smoothing, because:

  • Laplace smoothing is much easier
  • Laplace smoothing is not always negatively biased

This is the Wilson scoring formula given in Miller’s post, which we’ll use to get 95% confidence interval lower bounds:

(\hat{p} + z^2/2n ± z \sqrt{\hat{p}(1 - \hat{p})/n + z^2/4n^2}) / (1 + z^2/n)

(Use minus where it says plus/minus to calculate the lower bound.) Here \hat{p} is the observed fraction of positive ratings, z_{\alpha/2} is the (1-\alpha/2) quantile of the standard normal distribution, and n is the total number of ratings.

Now here’s the formula for doing Laplace smoothing instead:

(upvotes + \alpha) / (total votes + \beta)

Here \alpha and \beta are parameters that represent our estimation of what rating is probably appropriate if we know nothing else (cf. Bayesian prior). For example, \alpha = 1 and \beta = 2 means that a post with no votes gets treated as a 0.5.

The Laplace smoothing method is much simpler to calculate – there’s no need for statistical libraries, or even square roots!
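Here’s a sketch of both scores in Python (the function names are mine; z = 1.96 gives the 95% bound used throughout this post):

```python
from math import sqrt

def wilson_lower(upvotes, downvotes, z=1.96):
    # Lower bound of the Wilson score interval; z = 1.96 for 95%.
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p_hat = float(upvotes) / n
    return ((p_hat + z * z / (2 * n)
             - z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)))
            / (1 + z * z / n))

def laplace(upvotes, downvotes, alpha=1.0, beta=2.0):
    # Laplace smoothing: alpha joins the upvotes, beta the total votes,
    # so an unrated item scores alpha / beta (here, 0.5).
    return (upvotes + alpha) / (upvotes + downvotes + beta)

# Values from the tables below:
assert round(wilson_lower(1, 0), 2) == 0.21
assert round(laplace(1, 0), 2) == 0.67
assert round(laplace(209, 50), 2) == 0.80
assert wilson_lower(118, 25) > wilson_lower(209, 50)  # second item ranks higher
assert laplace(118, 25) > laplace(209, 50)            # and likewise for Laplace
```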

Does it successfully solve the problems of “wrong solution #1” and “wrong solution #2”? First, the problem with “wrong solution #1”, which we might summarize as “the problem with large sample sizes”:

             upvotes  downvotes  wrong #1  wrong #2  Wilson  Laplace
first item       209         50       159      0.81  0.7545     0.80
second item      118         25        93      0.83  0.7546     0.82

All the methods except “wrong solution #1” agree that the second item should rank higher.

Then there’s the problem with “wrong solution #2″, which we might summarize as “the problem with small sample sizes”:

             upvotes  downvotes  wrong #1  wrong #2  Wilson  Laplace
first item         1          0         1       1.0    0.21     0.67
second item      534         46       488      0.92    0.90     0.92

All the methods except “wrong solution #2” agree that the second item should rank higher.

How similar are the results for the Wilson method and the Laplace method overall? Take a look: here color encodes the score, with blue at 0.0, white at 0.5, and red at 1.0:

[plot of Wilson and Laplace methods]

They’re so similar, you might say, that you would need a very good reason to justify the complexity of the calculation for the Wilson bound. But also, the differences favor the Laplace method! The Wilson method, because it’s a lower bound, is negatively biased everywhere. It’s certainly not symmetrical. Let’s zoom in:

[plot of Wilson and Laplace methods, zoomed]

With the Wilson method, you could have three upvotes, no downvotes and still rank lower than an item that is disliked by 50% of people over the long run. That seems strange.
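That three-upvote case is easy to check by hand, using z = 1.96 for the 95% bound:

```python
from math import sqrt

# Wilson 95% lower bound for 3 upvotes, 0 downvotes: p_hat = 1, n = 3.
z = 1.96
p_hat, n = 1.0, 3.0
lower = ((p_hat + z * z / (2 * n)
          - z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)))
         / (1 + z * z / n))
assert round(lower, 2) == 0.44  # below 0.5, despite zero downvotes
```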

The Laplace method does have its own biases. By choosing \alpha=1 and \beta=2, the bias is toward 0.5, which I think is reasonable for a ranking problem like this. But you could change it: \alpha=0 with \beta=1 biases toward zero, \alpha=1 with \beta=0 biases toward one. And \alpha=100 with \beta=200 biases toward 0.5 very strongly. With the Wilson method you can tune the size of the interval, adjusting the confidence level, but this only adjusts how strongly you’re biased toward zero.

Here’s another way of looking at the comparison. How do the two methods compare for varying numbers of upvotes with a constant number (10) of downvotes?

[plot of Wilson and Laplace methods, again]

Those are similar curves. Not identical – but is there a difference to justify the complexity of the Wilson score?

In conclusion: Just adding a little bit to your numerators and denominators (Laplace smoothing) gives you a scoring system that is as good or better than calculating Wilson scores.

[code for this post]

Action and Prioritization, Advertising and Intervention

Amazon can easily show you a product of their choice while you’re on their site. This is their action. Since it’s so easy to show you things, it makes sense to work a lot on choosing carefully what to show. This is their prioritization. Refer to this class of ranking problem as the advertising type.

It is fairly difficult to send food aid to a village (action) or to support and improve a challenged school (action). A deficiency of both knowledge and resources motivates a need to choose where to give attention (prioritization). Refer to this type of ranking problem as the intervention type.

Advertising problems are essentially scattershot, and we only care about whether we hit something, anything. All you need is one good display, perhaps, and you make a sale. These prioritization choices also happen incredibly frequently; they demand automation and preclude individual human evaluation. An individual miss doesn’t matter, because you only need to succeed on average.

Intervention problems, on the other hand, have a million challenges. Even a perfect solution to the prioritization problem does not guarantee success. Careful action will be required of multiple people after a prioritization is complete, and meaningful reasons for prioritization choices will be helpful, likely required. It is inhumane to think of “success on average” for these problems.

These two types of problems are different and demand different approaches. However, the focus and success of “advertisers” on the prioritization part of their work is influencing how problems of both types are approached.

We’re spending too much effort on prioritization, sometimes even mistaking prioritization for action. Why?

1. Prioritization as a way to maximize impact. Certainly it’s good to maximize impact, but it also reflects an abhorrent reality: that we lack the capability to impact everywhere that has need. While it’s good to direct aid to the neediest villages, the need to do that prioritization is a sign that we are choosing to leave other needy villages without aid. We should not forget that we are not solving the problem but only addressing a part of it.

2. Selfish prioritization. Realizing a lack of resources (good schools, good housing, etc.) we wish to identify the best for ourselves. This can appear in guises that sound well-intentioned, but it is fundamentally about some people winning in a zero-sum game while others lose.

3. Prioritization because we don’t know how to take action. This is dangerous because we could let prioritization become our only hope while no resources are directed to solving the problem. While information can drive action eventually, there are lots of problems for which the only thing that will help is a solution (action), not just articulating and re-articulating the problem (prioritization).

I think we need to work more on actions. We need to develop solutions that do not perpetuate a zero-sum game but that improve conditions for everyone. We still need prioritization, but we should be aware of how it fits into solving problems. Important actions are hard to figure out.