Practical Mergic


“There are only two hard things in computer science: cache invalidation and naming things.”

What are we talking about?

when names are the same

when names aren't the same

demo: DOE data


How many rows do you get when you outer join two tables?

> nrow(first)
## [1] 3
> nrow(second)
## [1] 3
> result <- merge(first, second)
> nrow(result)
## [1] 3

    first       second              result
 x    y1        x   y2        x    y1   y2
 1 looks        1 good        1 looks good
 2    oh        2  boy        2    oh  boy
 3  well        2   no        2    oh   no
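
To show why the row count is not obvious, here is a minimal pure-Python sketch of inner and outer joins on the same data. The `join` helper is hypothetical, written only for this illustration; it is not R's `merge`. For reference, `merge(first, second)` with default arguments keeps only keys present in both tables, while `all = TRUE` requests the outer join.

```python
from collections import defaultdict


def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is 'inner' or 'outer'.

    Hypothetical helper for illustration only.
    """
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)

    left_keys = set()
    result = []
    for row in left:
        left_keys.add(row[key])
        matches = right_by_key.get(row[key], [])
        # Every matching right row produces one output row,
        # so duplicate keys multiply the row count.
        for match in matches:
            result.append({**row, **match})
        if not matches and how == "outer":
            result.append(dict(row))  # unmatched left row is kept

    if how == "outer":
        for row in right:
            if row[key] not in left_keys:
                result.append(dict(row))  # unmatched right row is kept
    return result


first = [{"x": 1, "y1": "looks"}, {"x": 2, "y1": "oh"}, {"x": 3, "y1": "well"}]
second = [{"x": 1, "y2": "good"}, {"x": 2, "y2": "boy"}, {"x": 2, "y2": "no"}]

inner = join(first, second, "x")
outer = join(first, second, "x", how="outer")
print(len(inner), len(outer))  # prints: 3 4
```

The duplicated key 2 in `second` repeats the "oh" row, and only the outer join keeps the unmatched "well" row.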

when names are the same

when names aren't the same

demo: let's play tennis

single field deduplication

Lukas Lacko             F Pennetta
Leonardo Mayer          S Williams
Marcos Baghdatis        C Wozniacki
Santiago Giraldo        E Bouchard
Juan Monaco             N.Djokovic
Dmitry Tursunov         S.Giraldo
Dudi Sela               Y-H.Lu
Fabio Fognini           T.Robredo
...                     ...

demo: OpenRefine

the tool disappears

any distance function

really reproducible
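
The "any distance function" point can be sketched as a function that takes the distance as a parameter and lists every pair of names under a cutoff. This is not mergic's own code; `ratio_distance`, `close_pairs`, and the cutoff of 0.4 are assumptions chosen for illustration, built on the standard library's `difflib`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio_distance(a, b):
    """1 minus difflib's similarity ratio: 0.0 for identical strings."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def close_pairs(items, distance, cutoff):
    """Return sorted (distance, a, b) for every pair with distance <= cutoff."""
    pairs = []
    for a, b in combinations(sorted(set(items)), 2):
        d = distance(a, b)
        if d <= cutoff:
            pairs.append((d, a, b))
    return sorted(pairs)


names = ["Santiago Giraldo", "S.Giraldo", "Leonardo Mayer", "Dudi Sela"]
result = close_pairs(names, ratio_distance, cutoff=0.4)  # cutoff is arbitrary
for d, a, b in result:
    print(f"{d:.2f}  {a}  {b}")
```

Because the input is sorted and deduplicated first and the function has no hidden state, the same inputs always give the same pairs, which is the property a reproducible workflow needs.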

Santiago Giraldo,Leonardo Mayer
Santiago Giraldo,Dudi Sela
Santiago Giraldo,Juan Monaco
Santiago Giraldo,S Williams
Santiago Giraldo,C Wozniacki
Santiago Giraldo,S.Giraldo
Santiago Giraldo,Marcos Baghdatis
Santiago Giraldo,Y-H.Lu
Karolina Pliskova,K Pliskova
Kristyna Pliskova,K Pliskova

sets > pairs
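
One way to read "sets > pairs": a list of matched pairs implies groups once matches are treated as transitive. The sketch below collects pairs into sets with a small union-find. The `group_pairs` name is hypothetical, and this is an illustration of the idea rather than how any particular tool stores its groupings.

```python
def group_pairs(pairs):
    """Collect linked pairs into sets (connected components) via union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


pairs = [
    ("Santiago Giraldo", "S.Giraldo"),
    ("Karolina Pliskova", "K Pliskova"),
    ("Kristyna Pliskova", "K Pliskova"),
]
groups = group_pairs(pairs)
for group in groups:
    print(sorted(group))
```

Written as sets, the grouping makes it obvious that two distinct full names have landed together through the shared "K Pliskova" link, which is easy to miss in a list of pairs.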

workflow support for reproducible deduplication and merging

demo: mergic tennis

distance matters

extension to multiple fields

name, name
Bob, Bobby
Bob, Robert
Bobby, Robert

name,    hometown
Bob,     New York
Rob,     NYC
Robert,  "NY, NY"
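
A minimal sketch of one way to extend a single-field distance to several fields: compute a distance per field and take a weighted average. The equal weights, the `difflib`-based field distance, and the helper names are all assumptions for illustration; real record linkage tools typically learn or tune how fields are combined.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_distance(a, b):
    """1 minus difflib's similarity ratio for one pair of field values."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(rec_a, rec_b, weights):
    """Weighted average of per-field distances (weights are hypothetical)."""
    total = sum(weights.values())
    weighted = sum(
        weight * field_distance(rec_a[field], rec_b[field])
        for field, weight in weights.items()
    )
    return weighted / total


records = [
    {"name": "Bob", "hometown": "New York"},
    {"name": "Rob", "hometown": "NYC"},
    {"name": "Robert", "hometown": "NY, NY"},
]
weights = {"name": 0.5, "hometown": 0.5}  # assumed, not taken from the slides

for rec_a, rec_b in combinations(records, 2):
    d = record_distance(rec_a, rec_b, weights)
    print(f"{rec_a['name']:>6} vs {rec_b['name']:<6} {d:.2f}")
```

A plain string distance sees "New York" and "NYC" as fairly different, which is one reason multi-field matching is harder than the single-field case.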

R: RecordLinkage

demo: mergic on RecordLinkage data


demo: csvdedupe

real clustering?

dog, doge, kitten, kitteh

       dog doge kitten kitteh
   dog   0    1      6      6
  doge   1    0      5      5
kitten   6    5      0      1
kitteh   6    5      1      0
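
import numpy as np
from sklearn.manifold import MDS
# distances from the table above, in the order dog, doge, kitten, kitteh
distances = np.array([[0, 1, 6, 6], [1, 0, 5, 5], [6, 5, 0, 1], [6, 5, 1, 0]])
mds = MDS(dissimilarity='precomputed')
coords = mds.fit_transform(distances)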
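
The distance table for dog, doge, kitten, and kitteh is consistent with Levenshtein edit distance, although the slides do not name the metric. Assuming that is the metric, the sketch below recomputes the table with a standard dynamic-programming implementation.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


words = ["dog", "doge", "kitten", "kitteh"]
distances = [[levenshtein(a, b) for b in words] for a in words]
for word, row in zip(words, distances):
    print(f"{word:>6}", *row)
```

The two tight pairs (distance 1) sit far from each other (distance 5 or 6), which is the structure a clustering or an MDS projection should recover.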

deep thoughts

questions for discussion