Practical Mergic

planspace.org

@planarrowspace

Open Data Science Conference title slide for Practical Mergic with Aaron Schumacher

Metis

problem

your data sucks

our data sucks

   dbn grade year category num_tested
01M019     3 2010   Female         16
01M019     3 2010     Male         20
01M019     3 2010     Male          2
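
A quick check for this kind of trouble is to count how often each supposed key occurs. This is a minimal sketch, assuming the first four columns (dbn, grade, year, category) are meant to identify a row uniquely; that assumption is not stated in the data itself.

```python
from collections import Counter

# Rows copied from the table above: (dbn, grade, year, category, num_tested).
rows = [
    ("01M019", 3, 2010, "Female", 16),
    ("01M019", 3, 2010, "Male", 20),
    ("01M019", 3, 2010, "Male", 2),
]

# Assumption: the first four columns together should be a unique key.
key_counts = Counter(row[:4] for row in rows)
duplicate_keys = [key for key, count in key_counts.items() if count > 1]

print(duplicate_keys)
```

Any key that shows up here will multiply rows in a later join.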

problems when names are the same

merge

How many rows do you get when you outer join two tables?

> nrow(first)
## [1] 3
> nrow(second)
## [1] 3
> result <- merge(first, second)
> nrow(result)
## [1] 3

 first          second        result
 x    y1        x   y2        x    y1   y2
 1 looks        1 good        1 looks good
 2    oh        2  boy        2    oh  boy
 3  well        2   no        2    oh   no
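
The row count depends on both the join type and on repeated keys. R's merge keeps only matching keys by default (an inner join), which is why x = 3 is missing above. The sketch below is a hand-rolled join in plain Python, written only to make the counting explicit; the function name and the (key, value) row layout are illustrative choices.

```python
from collections import defaultdict

def join(left, right, how="inner"):
    """Join two lists of (key, value) pairs on key; how is 'inner' or 'outer'."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    left_keys = {key for key, _ in left}

    result = []
    for key, value in left:
        matches = right_by_key.get(key)
        if matches:
            # One output row per matching right row, so repeated keys multiply.
            result.extend((key, value, other) for other in matches)
        elif how == "outer":
            result.append((key, value, None))
    if how == "outer":
        # Right rows whose key never appears on the left.
        result.extend((key, None, value)
                      for key, value in right if key not in left_keys)
    return result

first = [(1, "looks"), (2, "oh"), (3, "well")]
second = [(1, "good"), (2, "boy"), (2, "no")]

inner = join(first, second)
outer = join(first, second, how="outer")
print(len(inner), len(outer))
```

Three rows in, three rows out looks reassuring, but one row was dropped and another was doubled. The outer join keeps x = 3 and returns four rows.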

problems when names are the same

problems when names aren't the same

let's play tennis

$ head -2 AusOpen-women-2013.csv | cut -c 1-40
## Player1,Player2,Round,Result,FNL1,FNL2,F
## Serena Williams,Ashleigh Barty,1,1,2,0,5

$ head -2 USOpen-women-2013.csv | cut -c 1-40
## Player 1,Player 2,ROUND,Result,FNL.1,FNL
## S Williams,V Azarenka,7,1,2,1,57,44,43,2

$ wc -l names.txt
## 1886

$ sort names.txt | uniq | wc -l
## 669

$ sort names.txt | uniq -c | sort -nr | head -5
##  21 Rafael Nadal
##  17 Stanislas Wawrinka
##  17 Novak Djokovic
##  17 David Ferrer
##  15 Roger Federer
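
The same counts can be had in Python with collections.Counter. This is a minimal sketch; the sample list is made up for illustration and is not the contents of names.txt.

```python
from collections import Counter

def summarize(names):
    """Return (total count, distinct count, names ranked by frequency)."""
    counts = Counter(names)
    return len(names), len(counts), counts.most_common()

# Hypothetical sample, not the real names.txt.
sample = ["Rafael Nadal", "R Nadal", "Rafael Nadal", "Roger Federer"]

total, distinct, ranked = summarize(sample)
print(total, distinct, ranked[0])
```

Note that "R Nadal" counts as its own distinct name here, which is exactly the problem deduplication has to solve.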

single field deduplication

Lukas Lacko             F Pennetta
Leonardo Mayer          S Williams
Marcos Baghdatis        C Wozniacki
Santiago Giraldo        E Bouchard
Juan Monaco             N.Djokovic
Dmitry Tursunov         S.Giraldo
Dudi Sela               Y-H.Lu
Fabio Fognini           T.Robredo
...                     ...

Cluster ID,name
1,Lukas Lacko
2,Leonardo Mayer
3,Marcos Baghdatis
...

{
    "Ann's group": [
        "Ann"
    ],
    "Bob's group": [
        "Bob",
        "Robert"
    ]
}

new,original
Ann's group,Ann
Bob's group,Bob
Bob's group,Robert
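
Going from the JSON partition to the two-column table is a simple flattening. The sketch below shows one way to do it with the standard library; the function name is hypothetical and this is not necessarily how mergic itself does it.

```python
import csv
import io
import json

partition_json = """
{
    "Ann's group": ["Ann"],
    "Bob's group": ["Bob", "Robert"]
}
"""

def partition_to_rows(partition):
    """Flatten {new_name: [original, ...]} into (new, original) rows."""
    return [(new, original)
            for new, originals in partition.items()
            for original in originals]

rows = partition_to_rows(json.loads(partition_json))

# Write to an in-memory buffer so the example needs no files.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["new", "original"])
writer.writerows(rows)
print(buffer.getvalue(), end="")
```

The resulting table can be joined onto the original data like any other lookup table, so the merge itself needs no special tool.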

the tool disappears

any distance function

really reproducible

Santiago Giraldo,Leonardo Mayer
Santiago Giraldo,Dudi Sela
Santiago Giraldo,Juan Monaco
Santiago Giraldo,S Williams
Santiago Giraldo,C Wozniacki
Santiago Giraldo,S.Giraldo
Santiago Giraldo,Marcos Baghdatis
Santiago Giraldo,Y-H.Lu
...
Karolina Pliskova,K Pliskova
Kristyna Pliskova,K Pliskova
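
Pair-by-pair judgments do not add up to consistent groups. The sketch below assumes, hypothetically, that each of the three pairs shown was marked as a match, and follows the links transitively with a small union-find. The function name is illustrative.

```python
def groups_from_pairs(pairs):
    """Merge 'same' pairs into groups by following links transitively."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in pairs:
        parent[find(left)] = find(right)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())

# Assumption: all three pairs were judged to be the same player.
pairs = [
    ("Santiago Giraldo", "S.Giraldo"),
    ("Karolina Pliskova", "K Pliskova"),
    ("Kristyna Pliskova", "K Pliskova"),
]

for group in groups_from_pairs(pairs):
    print(group)
```

Each Pliskova pair looks fine on its own, yet together they put two different players in one group. Looking at whole sets makes that visible in a way that individual pairs do not.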

sets > pairs

workflow support for reproducible deduplication and merging

demo: mergic tennis

distance matters
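
A small comparison shows why. The sketch below sets plain Levenshtein edit distance against a hypothetical "first initial plus surname" distance invented for this illustration. Neither is claimed to be the distance used in the demo.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def initial_surname_distance(a, b):
    """Hypothetical: 0 if first initial and last word agree, else 1."""
    def key(name):
        words = name.replace(".", " ").split()
        return (words[0][0].lower(), words[-1].lower())
    return 0 if key(a) == key(b) else 1

short = "S.Giraldo"
for other in ["Santiago Giraldo", "T.Robredo"]:
    print(other, levenshtein(short, other), initial_surname_distance(short, other))
```

Edit distance rates "S.Giraldo" as closer to "T.Robredo" than to "Santiago Giraldo", while the name-aware distance gets it the right way round. The price is that it would also equate the two Pliskovas.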

extension to multiple fields

name
----
Bob
Rob
Robert

name, name
----------
Bob, Rob
Bob, Robert
Rob, Robert

name,    hometown
-----------------
Bob,     New York
Rob,     NYC
Robert,  "NY, NY"

real clustering?

dog, doge, kitten, kitteh

       dog doge kitten kitteh
   dog   0    1      6      6
  doge   1    0      5      5
kitten   6    5      0      1
kitteh   6    5      1      0
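
The values in this matrix are consistent with Levenshtein edit distance. That the slide used that metric is an assumption, but the sketch below reproduces the same numbers.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

words = ["dog", "doge", "kitten", "kitteh"]
distances = [[levenshtein(a, b) for b in words] for a in words]

for word, row in zip(words, distances):
    print(f"{word:>6}", *(f"{d:>6}" for d in row))
```

Any symmetric distance function could fill the matrix the same way.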
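
import numpy as np
from sklearn.manifold import MDS
distances = np.array([[0, 1, 6, 6], [1, 0, 5, 5], [6, 5, 0, 1], [6, 5, 1, 0]])  # table above
mds = MDS(dissimilarity='precomputed')
coords = mds.fit_transform(distances)  # one 2-D point per word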

open

Thanks!

planspace.org

@planarrowspace