

Large Scale Entity Matching for theAdvisor

Davis Spradling

Dr. Saule


Motivations

The goal of this project is to merge datasets on a common key so that attributes from one dataset can be attached to records in another.
Beyond merging the datasets, we want to find out whether the merge can be done at the lowest possible cost.
For example, DBLP provides high-quality metadata for papers but does not contain citations, while MAG contains citations but has unreliable metadata. Merging the two datasets on a common key gives us reliable metadata together with citations.

Background: Dataset

DBLP (Digital Bibliography & Library Project) and MAG (Microsoft Academic Graph) are datasets of scholarly publications that store information such as a publication's title, authors, publication year, DOI, and paper ID.
Both datasets are large: DBLP has around 3 million papers and MAG has around 217 million papers.
Comparing every DBLP paper against every MAG paper with a standard linear search would therefore take 3 million times 217 million, or roughly 650 trillion, comparisons.
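The comparison count can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the paper counts are the rounded figures given above, and the rate of one million comparisons per second is a hypothetical number chosen for illustration, not a measurement from this project.

```python
# Rounded dataset sizes taken from the slide.
dblp_papers = 3_000_000
mag_papers = 217_000_000

# An all-pairs search compares every DBLP paper with every MAG paper.
comparisons = dblp_papers * mag_papers
print(f"{comparisons:.2e} comparisons")

# Hypothetical throughput, assumed only to give a sense of scale.
assumed_comparisons_per_second = 1_000_000
years = comparisons / assumed_comparisons_per_second / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years at the assumed rate")
```

Even at that assumed rate the all-pairs approach would run for years, which is why a cheaper candidate-selection step is needed.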

Kmer Hashing

Querying on Kmer

Our hash table maps each kmer to the IDs of the papers whose titles contain it. Looking up the kmers of a given title returns entries such as:
“ash” = [id1, id9, id167]
A query produces an obvious candidate match, which indicates that the query function and the hash table work as intended.
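The idea can be sketched in Python as follows. This is a minimal illustration and not the project's implementation: the function names, the choice of k = 3, the lowercase normalization, the vote-counting ranking, and the three sample titles are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def kmers(title, k=3):
    """Return the distinct k-character substrings of a lowercased title."""
    text = title.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_index(papers, k=3):
    """Map each kmer to the list of paper IDs whose title contains it."""
    index = defaultdict(list)
    for paper_id, title in papers.items():
        for mer in kmers(title, k):
            index[mer].append(paper_id)
    return index


def query(index, title, k=3):
    """Rank indexed paper IDs by the number of kmers shared with the title."""
    votes = Counter()
    for mer in kmers(title, k):
        votes.update(index.get(mer, []))
    return votes.most_common()


# Made-up titles, chosen so that all three contain the kmer "ash".
papers = {
    "id1": "Hashing for Entity Matching",
    "id9": "A Crash Course in Graphs",
    "id167": "Flash Storage Systems",
}

index = build_index(papers)
print(index["ash"])

candidates = query(index, "hashing for entity matching")
print(candidates[0])
```

In this sketch the paper that shares the most kmers with the query title ranks first, so only a short list of candidates needs a full comparison instead of every paper in the dataset.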

Current Progress

Measuring memory allocation.
Measuring the time needed to run queries for MAG papers.
Optimizing and improving our current implementation of kmer hashing and kmer querying.

Future Plans

Thank You

Q & A

dspradl1@uncc.edu