@planarrowspace
“There are only two hard things in computer science: cache invalidation and naming things.”
What are we talking about?
when names are the same
when names aren't the same
demo: DOE data
merge
How many rows do you get when you outer join two tables?
> nrow(first)
## [1] 3
> nrow(second)
## [1] 3
> result <- merge(first, second)
> nrow(result)
## [1] 3
x  y1         x  y2         x  y1     y2
1  looks      1  good       1  looks  good
2  oh         2  boy        2  oh     boy
3  well       2  no         2  oh     no
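A minimal sketch of why the row count is easy to misjudge, using the three toy tables above. R's `merge` does an inner join unless `all = TRUE` is passed, so the 3 rows shown are an inner-join result: key 2 matches twice and key 3 drops out. The helper names `inner_join` and `outer_join` are hypothetical, written in plain Python for illustration only.

```python
# Toy tables from the slide: the first element of each row is the key x.
first = [(1, "looks"), (2, "oh"), (3, "well")]
second = [(1, "good"), (2, "boy"), (2, "no")]


def inner_join(left, right):
    """Pair every left row with every right row that shares its key."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]


def outer_join(left, right):
    """Inner join plus unmatched rows from either side, padded with None."""
    rows = inner_join(left, right)
    left_keys = {k for k, _ in left}
    right_keys = {k for k, _ in right}
    rows += [(k, v, None) for k, v in left if k not in right_keys]
    rows += [(k, None, v) for k, v in right if k not in left_keys]
    return rows


inner = inner_join(first, second)
outer = outer_join(first, second)
print(len(inner), len(outer))
```

The inner join has 3 rows by coincidence: one duplicated, one lost. The outer join has 4, because the unmatched `3 well` row is kept.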
when names are the same
when names aren't the same
demo: let's play tennis
single field deduplication
Lukas Lacko          F Pennetta
Leonardo Mayer       S Williams
Marcos Baghdatis     C Wozniacki
Santiago Giraldo     E Bouchard
Juan Monaco          N.Djokovic
Dmitry Tursunov      S.Giraldo
Dudi Sela            Y-H.Lu
Fabio Fognini        T.Robredo
...                  ...
demo: Open Refine
the tool disappears
any distance function
really reproducible
Santiago Giraldo,Leonardo Mayer
Santiago Giraldo,Dudi Sela
Santiago Giraldo,Juan Monaco
Santiago Giraldo,S Williams
Santiago Giraldo,C Wozniacki
Santiago Giraldo,S.Giraldo
Santiago Giraldo,Marcos Baghdatis
Santiago Giraldo,Y-H.Lu
...
Karolina Pliskova,K Pliskova
Kristyna Pliskova,K Pliskova
sets > pairs
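One way to turn matched pairs into sets, sketched with a small union-find. This is an illustration, not necessarily how mergic does it; `group_pairs` is a hypothetical name and the input pairs are taken from the list above as if they had all been accepted as matches.

```python
def group_pairs(pairs):
    """Merge matched pairs into sets: names linked by any chain of pairs share a set."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    # Sort for a stable, reviewable order.
    return sorted(groups.values(), key=lambda group: sorted(group))


pairs = [
    ("Karolina Pliskova", "K Pliskova"),
    ("Kristyna Pliskova", "K Pliskova"),
    ("Santiago Giraldo", "S.Giraldo"),
]
groups = group_pairs(pairs)
for group in groups:
    print(sorted(group))
```

Both full Pliskova names land in the same set because each pairs with `K Pliskova`. Looking at the set as a whole makes that visible; looking at the pairs one at a time does not.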
workflow support for reproducible deduplication and merging
demo: mergic tennis
distance matters
extension to multiple fields
name
----
Bob
Rob
Robert
name, name
----------
Bob, Bobby
Bob, Robert
Bobby, Robert
name, hometown
-----------------
Bob, New York
Rob, NYC
Robert, "NY, NY"
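With more than one field, the records have to be parsed before any comparison, and the quoted hometown is the part that trips up naive splitting on commas. A small sketch using Python's standard `csv` module on the table above (the variable names are illustrative):

```python
import csv
import io

# The two-field table from the slide, as comma-separated text.
text = '''name, hometown
Bob, New York
Rob, NYC
Robert, "NY, NY"
'''

# skipinitialspace=True drops the space after each comma, so the quote
# that opens "NY, NY" is seen at the start of its field.
rows = list(csv.DictReader(io.StringIO(text), skipinitialspace=True))
for row in rows:
    print(row["name"], "|", row["hometown"])
```

Each row is then a record with two fields, and a distance can be computed per field before deciding how to combine them.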
R: RecordLinkage
demo: mergic on RecordLinkage data
dedupe
demo: csvdedupe
real clustering?
dog, doge, kitten, kitteh
        dog  doge  kitten  kitteh
dog       0     1       6       6
doge      1     0       5       5
kitten    6     5       0       1
kitteh    6     5       1       0
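The values in this matrix are consistent with Levenshtein edit distance. The slide does not name the distance function, so treat that as an assumption. A sketch that rebuilds the matrix from the four words:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


words = ["dog", "doge", "kitten", "kitteh"]
distances = [[levenshtein(a, b) for b in words] for a in words]
for word, row in zip(words, distances):
    print(word.ljust(6), row)
```

The result is a symmetric square matrix with zeros on the diagonal, which is the shape a precomputed-dissimilarity method expects.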
from sklearn.manifold import MDS
mds = MDS(dissimilarity='precomputed')
coords = mds.fit_transform(distances)
deep thoughts
questions for discussion
Thanks!
@planarrowspace