@planarrowspace
problem
your data sucks
our data sucks
dbn grade year category num_tested
01M019 3 2010 Female 16
01M019 3 2010 Male 20
01M019 3 2010 Male 2
problems when names are the same
merge
How many rows do you get when you join two tables? (merge defaults to an inner join)
> nrow(first)
## [1] 3
> nrow(second)
## [1] 3
> result <- merge(first, second)
> nrow(result)
## [1] 3
first        second       result
x  y1        x  y2        x  y1     y2
1  looks     1  good      1  looks  good
2  oh        2  boy       2  oh     boy
3  well      2  no        2  oh     no
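A minimal sketch of the same join in plain Python, to show where the three result rows come from. The helper names (inner_join, duplicate_keys, unmatched_keys) are made up for this illustration; they are not part of R or of any tool in this talk.

```python
from collections import Counter

# The two tables from the slide, as (x, value) rows.
first = [(1, "looks"), (2, "oh"), (3, "well")]
second = [(1, "good"), (2, "boy"), (2, "no")]

def inner_join(left, right):
    """Pair every left row with every right row that shares its key."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def duplicate_keys(rows):
    """Keys that occur more than once; each one multiplies joined rows."""
    counts = Counter(key for key, _ in rows)
    return sorted(key for key, n in counts.items() if n > 1)

def unmatched_keys(left, right):
    """Keys in left with no partner in right; an inner join drops them."""
    return sorted({k for k, _ in left} - {k for k, _ in right})

result = inner_join(first, second)
print(len(result))                    # 3
print(duplicate_keys(second))         # [2]
print(unmatched_keys(first, second))  # [3]
```

Three rows in, three rows out, but only by coincidence: key 2 is duplicated and key 3 is silently dropped. Checking for duplicate and unmatched keys before merging is one way to catch this.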
problems when names are the same
problems when names aren't the same
let's play tennis
$ head -2 AusOpen-women-2013.csv | cut -c 1-40
## Player1,Player2,Round,Result,FNL1,FNL2,F
## Serena Williams,Ashleigh Barty,1,1,2,0,5
$ head -2 USOpen-women-2013.csv | cut -c 1-40
## Player 1,Player 2,ROUND,Result,FNL.1,FNL
## S Williams,V Azarenka,7,1,2,1,57,44,43,2
$ wc -l names.txt
## 1886
$ sort names.txt | uniq | wc -l
## 669
$ sort names.txt | uniq -c | sort -nr | head -5
## 21 Rafael Nadal
## 17 Stanislas Wawrinka
## 17 Novak Djokovic
## 17 David Ferrer
## 15 Roger Federer
single field deduplication
Lukas Lacko         F Pennetta
Leonardo Mayer      S Williams
Marcos Baghdatis    C Wozniacki
Santiago Giraldo    E Bouchard
Juan Monaco         N.Djokovic
Dmitry Tursunov     S.Giraldo
Dudi Sela           Y-H.Lu
Fabio Fognini       T.Robredo
...                 ...
Cluster ID,name
1,Lukas Lacko
2,Leonardo Mayer
3,Marcos Baghdatis
...
{
"Ann's group": [
"Ann"
],
"Bob's group": [
"Bob",
"Robert"
]
}
new,original
Ann's group,Ann
Bob's group,Bob
Bob's group,Robert
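The JSON grouping and the two-column table above carry the same information. A rough sketch of the conversion, using only the Python standard library; the function names are hypothetical and this is not necessarily how any particular tool does it.

```python
import csv
import io
import json

# The grouping from the slide: new name -> list of original names.
partition_json = """
{
  "Ann's group": ["Ann"],
  "Bob's group": ["Bob", "Robert"]
}
"""

def partition_to_rows(partition):
    """Flatten {new: [original, ...]} into (new, original) rows."""
    return [(new, original)
            for new, members in partition.items()
            for original in members]

def rows_to_csv(rows):
    """Write the rows as CSV text with a 'new,original' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["new", "original"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = partition_to_rows(json.loads(partition_json))
print(rows_to_csv(rows), end="")
```

The resulting table can then be joined onto the original data with an ordinary merge on the original column.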
the tool disappears
any distance function
really reproducible
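"Any distance function" can be sketched as a grouping routine that takes the distance as a parameter. Everything below is an assumption made for illustration: the threshold-linking rule, the function names, and the toy initial_distance are not taken from the tool itself.

```python
import re

def name_key(name):
    # Hypothetical normalisation: first initial plus last name, lower-cased.
    parts = [p for p in re.split(r"[ .]+", name.strip()) if p]
    return (parts[0][0].lower(), parts[-1].lower())

def initial_distance(a, b):
    """Toy distance: 0 if first initial and last name agree, else 1."""
    return 0 if name_key(a) == name_key(b) else 1

def group_by_distance(items, distance, threshold):
    """Link an item to every existing group containing something within threshold."""
    clusters = []
    for item in items:
        linked, rest = [], []
        for cluster in clusters:
            close = any(distance(item, other) <= threshold for other in cluster)
            (linked if close else rest).append(cluster)
        # Groups the new item touches are fused into one.
        merged = [member for cluster in linked for member in cluster] + [item]
        clusters = rest + [merged]
    return clusters

names = ["Santiago Giraldo", "S.Giraldo", "Dudi Sela", "Y-H.Lu"]

groups = group_by_distance(names, initial_distance, threshold=0)
print(groups)
# [['Santiago Giraldo', 'S.Giraldo'], ['Dudi Sela'], ['Y-H.Lu']]

# Swapping in a different (deliberately bad) distance changes the answer.
by_length = group_by_distance(names, lambda a, b: abs(len(a) - len(b)), threshold=0)
print(by_length)
# [['Santiago Giraldo'], ['S.Giraldo', 'Dudi Sela'], ['Y-H.Lu']]
```

The grouping code never looks inside the strings; only the distance does, which is why the choice of distance decides the result.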
Santiago Giraldo,Leonardo Mayer
Santiago Giraldo,Dudi Sela
Santiago Giraldo,Juan Monaco
Santiago Giraldo,S Williams
Santiago Giraldo,C Wozniacki
Santiago Giraldo,S.Giraldo
Santiago Giraldo,Marcos Baghdatis
Santiago Giraldo,Y-H.Lu
...
Karolina Pliskova,K Pliskova
Kristyna Pliskova,K Pliskova
sets > pairs
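A small sketch of why pairs are awkward, assuming matched pairs are chained together transitively (a union-find here; merge_pairs is a name invented for this example). The explicit partition at the end is one possible choice, shown only for illustration.

```python
def merge_pairs(pairs):
    """Chain matched pairs into groups (transitive closure via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())

# The two pairs from the slide: each looks reasonable on its own.
pairs = [
    ("Karolina Pliskova", "K Pliskova"),
    ("Kristyna Pliskova", "K Pliskova"),
]

chained = merge_pairs(pairs)
print(chained)
# [['K Pliskova', 'Karolina Pliskova', 'Kristyna Pliskova']]

# With explicit sets, the intended grouping is stated directly.
# This particular split is a hypothetical decision, not a fact from the data.
partition = [{"Karolina Pliskova"}, {"Kristyna Pliskova"}, {"K Pliskova"}]
```

Accepting both pairs fuses two different players into one group. A partition written as sets says exactly which names belong together, with no chaining left to interpretation.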
workflow support for reproducible deduplication and merging
demo: mergic tennis
distance matters
extension to multiple fields
name
----
Bob
Rob
Robert
name, name
----------
Bob, Bobby
Bob, Robert
Bobby, Robert
name, hometown
-----------------
Bob, New York
Rob, NYC
Robert, "NY, NY"
real clustering?
dog, doge, kitten, kitteh
        dog  doge  kitten  kitteh
dog       0     1       6       6
doge      1     0       5       5
kitten    6     5       0       1
kitteh    6     5       1       0
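The numbers in the matrix match plain Levenshtein (edit) distance, so that is assumed here. A minimal standard-library sketch that rebuilds the matrix:

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

words = ["dog", "doge", "kitten", "kitteh"]
distances = [[levenshtein(a, b) for b in words] for a in words]

for word, row in zip(words, distances):
    print(word.ljust(8), row)
# dog      [0, 1, 6, 6]
# doge     [1, 0, 5, 5]
# kitten   [6, 5, 0, 1]
# kitteh   [6, 5, 1, 0]
```

A list of lists like this is square and symmetric with a zero diagonal, which is the shape a precomputed-dissimilarity method expects.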
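import numpy as np
from sklearn.manifold import MDS
# pairwise distances for dog, doge, kitten, kitteh (the matrix above)
distances = np.array([[0, 1, 6, 6], [1, 0, 5, 5], [6, 5, 0, 1], [6, 5, 1, 0]])
mds = MDS(dissimilarity='precomputed')  # 2-D embedding by default
coords = mds.fit_transform(distances)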
open
Thanks!
@planarrowspace