mergic

Thursday May 14, 2015

A lightning talk at the May meeting (registration) of the PyData NYC meetup group, introducing mergic.


planspace.org

@planarrowspace


Hi! I'm Aaron. This is my blog and my twitter handle. You can get from one to the other. This presentation and a corresponding write-up (you're reading it) are on my blog (which you're on).



Down to business!


Lukas Lacko             F Pennetta
Leonardo Mayer          S Williams
Marcos Baghdatis        C Wozniacki
Santiago Giraldo        E Bouchard
Juan Monaco             N.Djokovic
Dmitry Tursunov         S.Giraldo
Dudi Sela               Y-H.Lu
Fabio Fognini           T.Robredo
...                     ...

The problem often looks like this: You have either two columns with slightly different versions of identifiers, or one long list of things that you need to resolve to common names. These problems are fundamentally the same.

Do you see the match here? (It's Santiago!)


workflow support for reproducible deduplication and merging


This is what mergic is for. mergic is a simple tool designed to make it less painful when you need to merge things that don't yet merge.



A quick disclaimer!

This is John Langford's slide, about what big data is. He says that small data is data for which O(n2) algorithms are feasible. Currently mergic is strictly for this kind of "artisanal" data, where we want to ensure that our matching is correct but want to reduce the amount of human work to ensure that. And we are about to get very O(n2).


Santiago Giraldo,Leonardo Mayer
Santiago Giraldo,Dudi Sela
Santiago Giraldo,Juan Monaco
Santiago Giraldo,S Williams
Santiago Giraldo,C Wozniacki
Santiago Giraldo,S.Giraldo
Santiago Giraldo,Marcos Baghdatis
Santiago Giraldo,Y-H.Lu
...

So we make all possible pairs of identifiers! This is annoying for a computer, and awful for humans. The computer can calculate a lot of pairwise distances, but I don't want to look at all the pairs.

Do you see the match here? (It's Santiago again!)


INFO:dedupe.training:1.0
name : stanislas wawrinka

name : stanislas wawrinka

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

O.o?

The is a "screen shot" of the csvdedupe interface, which is based on the Python dedupe project, which is very cool. It could be exactly what you want for larger amounts of more complex data. There's even work on getting learnable edit distances implemented now, which would be great to see. But for very simple data sets, dedupe can be overkill. Also, you don't get much sense of the big picture of your data set, and it's still very pair-oriented.


Karolina Pliskova,K Pliskova

Aside from being a drag to look at, there's a bigger problem with verifying equality on a pairwise basis.

Do these two records refer to the same person? (Tennis fans may see where I'm going with this.)


Kristyna Pliskova,K Pliskova

Karolina has a twin sister, and Kristyna also plays professional tennis! This may well not be obvious if you only look at pairs individually. What matters is the set of names that are transitively judged as equal.


sets > pairs


Both perceptually and logically, it's better to think in sets than in a bunch of individual pairs.



Open Refine is quite good. Their interface shows you some useful diagnostics, and you can see sets of things. There's even some idea of repeatable transformations. But there's so much functionality wrapped up in a mostly graphical interface that it's hard to make it part of an easily repeatable workflow. And while there are a bunch of built-in distance functions, I'm not sure whether it's possible to use a custom distance function in Open Refine.



So the goals of mergic are to be:


demo


Here's a quick run-through of the mergic workflow. It's similar to the one in the README.

pew new pydata

I'll start by making a new virtual environment using pew.

pip install mergic

mergic is very new (version 0.0.4) and it currently installs with no extra dependencies.

mergic -h

mergic includes a command-line script based on argparse that uses a default string distance function.

usage: mergic [-h] {calc,make,check,diff,apply,table} ...

positional arguments:
  {calc,make,check,diff,apply,table}
    calc                calculate all partitions of data
    make                make a JSON partition from data
    check               check validity of JSON partition
    diff                diff two JSON partitions
    apply               apply a patch to a JSON partition
    table               make merge table from JSON partition

optional arguments:
  -h, --help            show this help message and exit

The command line script has a number of sub-commands that expose its functionality.

head -4 RLdata500.csv

We'll try mergic out with an example data set from R's RecordLinkage package.

CARSTEN,,MEIER,,1949,7,22
GERD,,BAUER,,1968,7,27
ROBERT,,HARTMANN,,1930,4,30
STEFAN,,WOLFF,,1957,9,2

The data is fabricated name and birth date from a hypothetical German hospital. It has a number of columns, but for mergic we'll just treat the rows of CSV as single strings.

mergic calc RLdata500.csv

The calc subcommand calculates all the pairwise distances and provides diagnostics about possible groupings that could be produced.

num groups, max group, num pairs, cutoff
----------------------------------------
       500,         1,         0, -0.982456140351
       497,         2,         3, 0.0175438596491

With a cutoff lower than any actual encountered string distance, every item stays separate, the maximum group size is one, and there are no pairs within those groups to evaluate.

         2,       499,    124251, 0.416666666667
         1,       500,    124750, 0.418181818182

On the other extreme, we could group every item together in a giant mega-group.

       451,         2,        49, 0.111111111111
       450,         2,        50, 0.115384615385
       449,         3,        52, 0.125

mergic gives you a choice about how big the groups it will produce will be. In this case, there's a cutoff of about 0.12 that will produce 50 groups of two items, which looks promising.

mergic make RLdata500.csv 0.12

We can make a grouping with that cutoff, and the result is a JSON-formatted partition.

{
    "MATTHIAS,,HAAS,,1955,7,8": [
        "MATTHIAS,,HAAS,,1955,7,8",
        "MATTHIAS,,HAAS,,1955,8,8"
    ],
    "HELGA,ELFRIEDE,BERGER,,1989,1,18": [
        "HELGA,ELFRIEDE,BERGER,,1989,1,18",
        "HELGA,ELFRIEDE,BERGER,,1989,1,28"
    ],

In this example, the partition at a cutoff of 0.12 happens to be exactly right and we correctly group everything. (This says something about how realistic this example data set is, something about your tool of choice if it can't easily get perfect performance on this example data set, and also something about information leakage.)

{
    "MATTHIAS,,HAAS,,1955,7,8": [
        "MATTHIAS,,HAAS,,1955,8,8"
    ],
    "HELGA,ELFRIEDE,BERGER,,1989,1,18": [
        "MATTHIAS,,HAAS,,1955,7,8",
        "HELGA,ELFRIEDE,BERGER,,1989,1,18",
        "HELGA,ELFRIEDE,BERGER,,1989,1,28"
    ],

The above would be a strange change to make, but you could make such a change and save your changed version as a new file.

mergic diff base.json edited.json > diff.json
mergic apply base.json diff.json

mergic includes functionality for creating and applying diffs that compare two partitions. You can preserve just the changes that you make by hand, which provides a record of the changes that had a human in the loop versus the changes that were computer-generated.

mergic table edited.json

To actually accomplish the desired merge or deduplication after creating a good grouping in JSON, mergic will generate a two-column merge table in CSV that can be used with most any data system.

"HANS,,SCHAEFER,,2003,6,22","HANS,,SCHAEFER,,2003,6,22"
"HARTMHUT,,HOFFMSNN,,1929,12,29","HARTMHUT,,HOFFMSNN,,1929,12,29"
"HARTMUT,,HOFFMANN,,1929,12,29","HARTMHUT,,HOFFMSNN,,1929,12,29"

These merge tables are awful to work with by hand, which is why mergic leaves their generation as a final step after humans work with the more understandable JSON groupings.



It's easy to write a script with a custom distance function and immediately use it with all the workflow support of the mergic script.

Often, a custom distance function makes or breaks your effort. It's worth thinking about and experimenting with, and mergic makes it easy!



If you're interested in this kind of thing, I'll be doing a longer talk at the New York City Open Statistical Programming Meetup next week Wednesday.



I also hope to see you at Open Data Science Con in Boston!


Thanks!


Thank you!


planspace.org

@planarrowspace


This is just me again.