I want an edit button on everything

My blog does not have an edit button for people who are not me. This means it takes a bunch of work to fix a typo, for example: you’d have to tell me about it, somehow describing where the typo is, and then I would have to go find the spot and make the change. In practice, this pretty much doesn’t happen.

Wikipedia has edit buttons on everything, and so does github. I’m not entirely sure what is best between allow-all-edits-immediately and require-review-for-all-edits. Some mix is also possible, I guess. Wikipedia has ways to lock down articles, and must have corresponding permission systems for who can do/undo that. Github lets you give other people full edit permissions, so you can spread the editor pool at least. Git by itself can support even more fine-grained control, I believe.

I’d like to move my blog to something git-backed, like github pages. It’s a little work, but you can put an edit button on the rendered HTML views shown by github pages too. Advanced R has a beautiful “Edit this page” button on every page. Three.js has one in their documentation. Eric points out Ben’s blog as well, and also the trouble with comments.

Ideally I’d prefer not to be github-bound, I guess, or bound to some comment service. But I also kind of prefer to have everything text-based, so what do you do for comments then? And also I’d like to be able to do R markdown (etc.) and have that all render automagically. But also something serverless. I’m drawn to this javascript-only static site generator, but that also seems to be A Bad Idea.

So: that solves that.

How To Sort By Average Rating

Evan Miller’s well-known How Not To Sort By Average Rating points out problems with ranking by “wrong solution #1” (by differences, upvotes minus downvotes) and “wrong solution #2” (by average ratings, upvotes divided by total votes). Miller’s “correct solution” is to use the lower bound of a Wilson score confidence interval for a Bernoulli parameter. I think it would probably be better to use Laplace smoothing, because:

  • Laplace smoothing is much easier
  • Laplace smoothing is not always negatively biased

This is the Wilson scoring formula given in Miller’s post, which we’ll use to get 95% confidence interval lower bounds:

\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + \frac{z_{\alpha/2}^2}{n}}

(Use minus where it says plus/minus to calculate the lower bound.) Here \hat{p} is the observed fraction of positive ratings, z_{\alpha/2} is the (1-\alpha/2) quantile of the standard normal distribution, and n is the total number of ratings.
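For concreteness, here’s a minimal Python sketch of that lower bound, with z fixed at 1.96 for a 95% interval (the function name and structure are mine, not from Miller’s post):

    import math

    def wilson_lower(upvotes, downvotes, z=1.96):
        """Lower bound of the Wilson score interval; z=1.96 gives ~95% confidence."""
        n = upvotes + downvotes
        if n == 0:
            return 0.0
        phat = upvotes / n
        center = phat + z * z / (2 * n)
        spread = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
        return (center - spread) / (1 + z * z / n)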

Now here’s the formula for doing Laplace smoothing instead:

\frac{\text{upvotes} + \alpha}{\text{total votes} + \beta}

Here \alpha and \beta are parameters that represent our estimation of what rating is probably appropriate if we know nothing else (cf. Bayesian prior). For example, \alpha = 1 and \beta = 2 means that a post with no votes gets treated as a 0.5.

The Laplace smoothing method is much simpler to calculate – there’s no need for statistical libraries, or even square roots!
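To make the comparison concrete, here’s the Laplace version as a sketch, plus a quick spot-check (using the wilson_lower sketch above) that reproduces the Wilson and Laplace columns of the tables below:

    def laplace(upvotes, downvotes, alpha=1, beta=2):
        """Laplace-smoothed score: (upvotes + alpha) / (total votes + beta)."""
        return (upvotes + alpha) / (upvotes + downvotes + beta)

    # Spot-check against the first table below:
    print(wilson_lower(209, 50), laplace(209, 50))  # ~0.7545, ~0.80
    print(wilson_lower(118, 25), laplace(118, 25))  # ~0.7546, ~0.82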

Does it successfully solve the problems of “wrong solution #1” and “wrong solution #2”? First, the problem with “wrong solution #1”, which we might summarize as “the problem with large sample sizes”:

               upvotes   downvotes   wrong #1   wrong #2   Wilson   Laplace
first item         209          50        159       0.81   0.7545      0.80
second item        118          25         93       0.83   0.7546      0.82

All the methods except “wrong solution #1” agree that the second item should rank higher.

Then there’s the problem with “wrong solution #2”, which we might summarize as “the problem with small sample sizes”:

               upvotes   downvotes   wrong #1   wrong #2   Wilson   Laplace
first item           1           0          1        1.0     0.21      0.67
second item        534          46        488       0.92     0.90      0.92

All the methods except “wrong solution #2” agree that the second item should rank higher.

How similar are the results for the Wilson method and the Laplace method overall? Take a look: here color encodes the score, with blue at 0.0, white at 0.5, and red at 1.0:

plot of Wilson and Laplace methods

They’re so similar, you might say, that you would need a very good reason to justify the complexity of the calculation for the Wilson bound. But also, the differences favor the Laplace method! The Wilson method, because it’s a lower bound, is negatively biased everywhere. It’s certainly not symmetrical. Let’s zoom in:

plot of Wilson and Laplace methods - zoomed

With the Wilson method, you could have three upvotes, no downvotes and still rank lower than an item that is disliked by 50% of people over the long run. That seems strange.
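Here’s a quick check of that claim using the sketches above (z = 1.96 again):

    print(wilson_lower(3, 0))      # ~0.44: three upvotes, no downvotes
    print(wilson_lower(500, 500))  # ~0.47: the 50%-disliked item ranks higher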

The Laplace method does have its own biases. By choosing \alpha=1 and \beta=2, the bias is toward 0.5, which I think is reasonable for a ranking problem like this. But you could change it: \alpha=0 with \beta=1 biases toward zero, \alpha=1 with \beta=0 biases toward one. And \alpha=100 with \beta=200 biases toward 0.5 very strongly. With the Wilson method you can tune the size of the interval, adjusting the confidence level, but this only adjusts how strongly you’re biased toward zero.
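For instance (my arithmetic, not from the original comparison): with \alpha=100 and \beta=200, an item with 10 upvotes and no downvotes scores (10 + 100) / (10 + 200) ≈ 0.52, barely nudged above the 0.5 prior.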

Here’s another way of looking at the comparison. How do the two methods compare for varying numbers of upvotes with a constant number (10) of downvotes?

Wilson and Laplace methods again

Those are similar curves. Not identical – but is there a difference to justify the complexity of the Wilson score?

In conclusion: Just adding a little bit to your numerators and denominators (Laplace smoothing) gives you a scoring system that is as good as or better than calculating Wilson scores.

[code for this post]

Action and Prioritization, Advertising and Intervention

Amazon can easily show you a product of their choice while you’re on their site. This is their action. Since it’s so easy to show you things, it makes sense to work a lot on choosing carefully what to show. This is their prioritization. Refer to this class of ranking problem as the advertising type.

It is fairly difficult to send food aid to a village (action) or to support and improve a challenged school (action). A deficiency of both knowledge and resources motivates a need to choose where to give attention (prioritization). Refer to this type of ranking problem as the intervention type.

Advertising problems are essentially scattershot, and we only care about whether we hit something, anything. All you need is one good display, perhaps, and you make a sale. These prioritization choices also happen incredibly frequently; they demand automation and preclude individual human evaluation. Individual misses don’t matter because you only need to succeed on average.

Intervention problems, on the other hand, have a million challenges. Even a perfect solution to the prioritization problem does not guarantee success. Careful action will be required of multiple people after a prioritization is complete, and meaningful reasons for prioritization choices will be helpful, likely required. It is inhumane to think of “success on average” for these problems.

These two types of problems are different and demand different approaches. However, the prevalence of advertising-type problems, and the focus and success of “advertisers” on the prioritization part of their work, are influencing how problems of both types are approached.

We’re spending too much effort on prioritization, sometimes even mistaking prioritization for action. Why?

1. Prioritization as a way to maximize impact. Certainly it’s good to maximize impact, but this also reflects an abhorrent reality: we lack the capacity to help everywhere there is need. While it’s good to direct aid to the neediest villages, the need to do that prioritization is a sign that we are choosing to leave other needy villages without aid. We should not forget that we are not solving the problem but only addressing a part of it.

2. Selfish prioritization. Realizing a lack of resources (good schools, good housing, etc.) we wish to identify the best for ourselves. This can appear in guises that sound well-intentioned, but it is fundamentally about some people winning in a zero-sum game while others lose.

3. Prioritization because we don’t know how to take action. This is dangerous because we could let prioritization become our only hope while no resources are directed to solving the problem. While information can drive action eventually, there are lots of problems for which the only thing that will help is a solution (action), not just articulating and re-articulating the problem (prioritization).

I think we need to work more on actions. We need to develop solutions that do not perpetuate a zero-sum game but that improve conditions for everyone. We still need prioritization, but we should be aware of how it fits in to solving problems. Important actions are hard to figure out.

Genocide Data

I recently became interested in preventing genocide with data. I think this is not an easy thing to do. I undertook to identify data sources that might be relevant, and thanks to many wonderful people, I can present the following results!

#1. Karen Payne’s “GIS Data Repositories”, “Conflict” section.

Karen has assembled a phenomenal collection of references to datasets, curated within a publicly accessible Google spreadsheet. I’m sure many of the other things I’ll mention are also included in her excellent collection!

This list of data repositories was compiled by Karen Payne of the University of Georgia’s Information Technologies Outreach services, with funding provided by USAID, to point to free downloadable primary geographic datasets that may be useful in international humanitarian response. The repositories are grouped according to the tabs at the bottom

#2. The Global Database of Events, Language, and Tone (GDELT)

Kalev Leetaru of Georgetown University is super helpful and runs this neat data munging effort. There is a lot of data available. The GDELT Event Database uses CAMEO codes; in this scheme, there is code “203: Engage in ethnic cleansing”. There’s also the Global Knowledge Graph (GKG), which may be better for identifying genocide, because one can identify “Material Conflict events that are connected to the genocide theme in the GKG.”
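Purely as an illustration (not something from the sources above): a sketch of pulling code-203 events out of a GDELT 1.0 daily event export with pandas. The file name is hypothetical, the exports are tab-separated with no header row, and the position of the EventCode column should be checked against the GDELT event codebook:

    import pandas as pd

    # Hypothetical daily export; GDELT 1.0 event files are tab-separated with no header row.
    events = pd.read_csv("20140101.export.CSV", sep="\t", header=None, dtype=str)

    # Column position of EventCode per the GDELT 1.0 codebook -- verify before relying on it.
    EVENT_CODE = 26
    ethnic_cleansing = events[events[EVENT_CODE] == "203"]
    print(len(ethnic_cleansing), "events coded 203 (engage in ethnic cleansing)")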

GDELT is now listed on data.gov in coordination with the World Wide Human Geography Data working group.

Jay Yonamine did some work using GDELT to forecast violence in Afghanistan.

#3. The Humanitarian Data Exchange

This new project seems very promising – Javier Teran was very helpful in describing what’s currently available: “datasets on refugees, asylum seekers and other people of concern in our HDX repository that may be useful for your research”. By the time you read this, there may be even more genocide-related data!

#4. Uppsala conflict database

The Uppsala Conflict Data Program (UCDP) offers a number of datasets on organised violence and peacemaking, all of which can be downloaded for free

#5. USHMM / Crisis in Darfur

Max writes:

the National Holocaust Museum has done quite a bit about collecting and visualizing this kind of data. In particular, a few years back they led a large mapping project around Darfur

#6. AAAS Geospatial Technology and Human rights topic

The American Association for the Advancement of Science has a collection of research related to Geospatial Technology and Human rights. Start reading!

#7. Amnesty International

I haven’t looked into what data they might have and make available, but it seems like a relevant organization.

#8. Tech Challenge for Atrocity Prevention

USAID and Humanity United ran a group of competitions in 2013 broadly around fighting atrocities against civilians. You can read about it via PR Newswire and Fast Company. I found the modeling challenge particularly interesting – it was hosted by TopCoder, as I understand it, and the winners came up with some interesting approaches for predicting atrocities with existing data.

#9. elva.org

This is a tip I haven’t followed up on, but it could be good:

Hi, I would reach out to Jonne Catshoek of elva.org, they have an awesome platform and body of work that is really unappreciated. They also have a very unique working relationship with the nation of Georgia that could serve as a model for other work.

#10. The CrisisMappers community

“The humanitarian technology network” – this group is full of experts in the field, organizes the International Conference of Crisis Mappers, and has an active and helpful Google Group. The group is closed membership but welcoming; connecting there is how I found many of the resources here. Thanks CrisisMappers!

#11. CrisisNET

The illustrious Chris Albon introduced me to CrisisNET, “the firehose of global crisis data”:

CrisisNET finds, formats and exposes crisis data in a simple, intuitive structure that’s accessible anywhere. Now developers, journalists and analysts can skip the days of tedious data processing and get to work in minutes with only a few lines of code.

Examples of what you can do with it:

Tutorials and documentation on how to do things with it:

#12. PITF

All I know is that PITF could be some sort of relevant dataset; I haven’t had time to investigate.


This document

I’ll post this on my blog, where it’s easy to leave comments with additions, corrections, and so on without knowing git/github, but the “official” version of this document will live on github and any updates will be made there. Document license is Creative Commons Share-Alike, let’s say.


More thanks:

  • Thanks of course to everyone who helped with the resources they provide and curate – I tried to give that credit as much as possible above!
  • Special thanks to Sandra Moscoso and Johannes Kiess of the World Bank for providing pointers to number 2 and more!
  • Special thanks to Max Richman of GeoPoll for providing numbers 4, 5, 6, and 7.
  • Special thanks to Minhchau “MC” Dinh of USAID for extra help with number 8!
  • Number 9 was provided via email; all I have is an email address, and I thought people probably wouldn’t want their email addresses listed. Thanks, person!
  • Special thanks to Patrick Meier of iRevolution for connecting me first to number 10!

Wrap-up from DC Hack and Tell #4

I’ve been putting together these wrap-ups from Hack and Tell in DC for a while now. They go out to the meetup list and they’re archived on github, but I like them so much I thought I’d put them here too. Working through the back-catalog:

DC Hack and Tell

Round 4: The Christmas Invasion

Time to wrap up the Christmas Invasion and put a bow on it… Here are all the good things we saw, in non-random order!

  • Aaron talked about lots of graphs made from NYC test scores.
  • Rick showed a really neat Medicare visualization that he made, which started as a National Day of Civic Hacking project. (Cool!)
  • Julian demoed the next big programming language, MyCoolLang aka Lebowski, rich with Python and LLVM goodness.
  • Chris fought the good fight against light switches, automating his home via his lightbulbs’ port 80 (duh).
  • In addition to inventing languages, Julian also improves existing ones – he showed how he became a Python core dev and improved performance (timing).
  • “So your friendly neighborhood bikeshare station is out of bikes again. What are the odds?” CHRIS WILL SHOW YOU THE ODDS.
  • And Joseph showed some of the magic of salt and vagrant, and of course salty vagrant.

Happy solstice, everybody! See you on January 13, 2014!

Wrap-up from DC Hack and Tell #3

I’ve been putting together these wrap-ups from Hack and Tell in DC for a while now. They go out to the meetup list and they’re archived on github, but I like them so much I thought I’d put them here too. Working through the back-catalog:

DC Hack and Tell

Round 3: Hack… to the Future!

And now, a wrap-up… in random order!

  • Mike showed the excellent audioverb for all your language in situ needs – and it even has a youtube explanation too!
  • Aaron talked about rjstat, his R package for reading and writing the JSON-stat data format.
  • Fearless leader Jonathan shared a classic Hack and Tell hack for decoding cryptograms using simple language models and SIMULATED ANNEALING! (I know, right?) It’s called cryptosolver. We miss you, Jonathan!
  • Bayan showed how to simulate fantasy football drafts/seasons in R to test theories and impress your friends! With a Prezi!
  • Tom presented not just the JS live-coding mistakes.io but also super fun interactive statistics and simple-statistics!
  • Aaron also showed this Guess the Letter thing. Oh my gosh there’s a blog post.

And there will be even more good stuff coming soon… to the future!

Unnatural Causes

Unnatural Causes is “a seven-part documentary series exploring racial & socioeconomic inequalities in health” from 2008. In a horrible irony, the episodes are not available to the public. The cheapest way to see it is to pay $24.95 to FORA.tv for streaming. There doesn’t seem to be an option for buying the DVD from the main site unless you are an organization. I believe this is a mistake. The apparent goals of the producers would be better served by making the complete materials publicly available at no cost. You can watch some clips on their YouTube channel, which is good, but why not release everything, together with information on actions to take or links to further information? I don’t even remember where I heard about the series, and it wasn’t particularly easy to track down viewing options. The audience would be so much bigger if energy was devoted to spreading the videos rather than locking them up.

As I have now been lucky enough to see the complete series, here is a brief summary of the episodes:

1. “In sickness and in wealth”: The Whitehall Study, which is frequently referenced throughout the series, is introduced: it found that health is associated with wealth, not just in a binary poor-vs.-rich way, but in gradations all along the levels of wealth. The importance of a sense of control and the corresponding stress of social subordination are pointed to as people at varying levels of health and wealth are introduced in an American city. Also, apparently there was some experiment that gave everybody colds by putting virus right into their noses – is that seriously an experimental technique that people use?

2. “When the bough breaks”: The stress of institutional and persistent racism is identified as a determinant of health. The example of low birth weights for babies born to black mothers is given. Also I noticed that the series is dedicated to the memory of Judy Crichton.

3. “Becoming American”: It is noted that Latino immigrants to America are initially healthier than other Americans, and tight families are given as a potential explanation. Also the Pennsylvania town that hosts the examples has some community center, and a youth center, which seem nice. Then it’s brought to light that immigrant health is much worse after five years, and also there’s some mention of mental illness.

4. “Bad sugar”: A community of Native Americans is the example of the episode, relevant because of very high levels of diabetes. The stress of being displaced by US forces, not dealt with fairly, and essentially forced onto a radically different and inferior diet, along with the attendant problems of poverty, all contribute.

5. “Place matters”: Biggest takeaway was learning about the original redlining, which gave good home loans almost exclusively to white people from around 1934 to 1962. Grrr. The episode then talks about how bad neighborhoods are stressful; violence, mold, and asthma all suck. Everything is health policy. It also points to the failure of private developers to provide what people really need.

6. “Collateral damage”: This episode centers on the Marshall Islands, where US military involvement no longer sends showers of nuclear fall-out, but a base still dominates the economy to ill effect. Overcrowding on the adjacent island, which is essentially a slum compared to the island of the US base, leads to tuberculosis and other ailments. The people of the Marshall Islands can leave their homes and move to the US (Arkansas is a popular destination, it seems), but health problems can continue there.

7. “Not just a paycheck”: Electrolux is a Swedish company that moved one factory from Michigan to Mexico and another from Sweden to Hungary. In Michigan this ruined a lot of lives, while in Sweden it was a comparatively small problem. Americans are less well protected by their government and their unions than the Swedes are by theirs, and the Americans have worse health outcomes. The American setting also illustrates increasing inequality as a family laid off from the factory lives on an old family farm that is increasingly surrounded by huge second homes of the rich.

This post was made possible through the generous support of the B. R. Schumacher Foundation.