1 of 8

Large Scale Entity Matching for theAdvisor

Davis Spradling

Dr. Saule

2 of 8

Motivations

  • This project aims to merge datasets through a common key so that attributes from one dataset can be combined with those of another.
  • Beyond merging the datasets, we want to find the least expensive way to do it.
  • For example, DBLP provides reliable metadata for papers but contains no citations, while MAG has citations but unreliable metadata. Merging the two through a common key gives us both.

3 of 8

Background: Dataset

  • DBLP (Digital Bibliography & Library Project) and MAG (Microsoft Academic Graph) are datasets of scholarly publications that record each publication's title, authors, publication year, DOI, paper ID, etc.
  • These datasets are far larger than average: DBLP has around 3 million papers and MAG has around 217 million.
  • Comparing every DBLP paper against every MAG paper would therefore take 3 million × 217 million ≈ 650 trillion comparisons.
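The size of the naive all-pairs comparison can be checked with a few lines of arithmetic. This is a minimal sketch: the paper counts are the rounded figures from the slide, and the rate of one million comparisons per second is purely an assumption chosen to give a sense of scale, not a measurement.

```python
# Rounded dataset sizes taken from the slide.
dblp_papers = 3_000_000
mag_papers = 217_000_000

# Comparing every DBLP paper against every MAG paper.
comparisons = dblp_papers * mag_papers
print(f"{comparisons:,} comparisons")

# ASSUMPTION: a hypothetical rate of one million title comparisons per second.
assumed_rate_per_second = 1_000_000
seconds = comparisons / assumed_rate_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"about {years:.1f} years at the assumed rate")
```

Even at that optimistic assumed rate the all-pairs approach would run for decades, which is why a cheaper way to generate candidates is needed.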

4 of 8

Kmer Hashing

  • The histogram shows the most common 3-character k-mers (3-mers).
  • What are k-mers?
  • A k-mer is a substring of length k taken from a string.
  • For example, the 3-mers of “Programming” are:
  • “Pro”, “rog”, “ogr”, “gra”, “ram”, “amm”, “mmi”, “min”, “ing”
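The 3-mer example above can be reproduced with a sliding window over the string. This is a minimal sketch; the function name `kmers` is hypothetical and is not taken from the project's code.

```python
def kmers(text, k=3):
    """Return every substring of length k, in order of position.

    A string shorter than k yields an empty list.
    """
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(kmers("Programming"))
# ['Pro', 'rog', 'ogr', 'gra', 'ram', 'amm', 'mmi', 'min', 'ing']
```

A string of length n has n - k + 1 k-mers, so the 11-character word "Programming" produces 9 of them.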

5 of 8

Querying on Kmer

  • Our hash table maps each k-mer to the IDs of the papers whose titles contain it. Searching the table for a given title produces the following visual.
  • This shows that our query function and hash table work: the query yields an obvious candidate.
  • “ash” = [id1, id9, id167]
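One way such a table and query could look is sketched below. Everything here is an assumption made for illustration: the function names, the lowercasing of titles, the vote-counting rule (one vote per shared k-mer), and the three sample titles, which were invented so that “ash” maps to the same three IDs as in the example above.

```python
from collections import Counter, defaultdict


def kmers(text, k=3):
    """Return every substring of length k (hypothetical helper)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def build_index(titles, k=3):
    """Map each k-mer to the set of paper IDs whose title contains it."""
    index = defaultdict(set)
    for paper_id, title in titles.items():
        for mer in kmers(title.lower(), k):
            index[mer].add(paper_id)
    return index


def query(index, title, k=3, top=3):
    """Rank paper IDs by how many distinct k-mers they share with the title."""
    votes = Counter()
    for mer in set(kmers(title.lower(), k)):
        for paper_id in index.get(mer, ()):
            votes[paper_id] += 1
    return votes.most_common(top)


# Invented sample titles, not real DBLP or MAG records.
titles = {
    "id1": "Hashing for entity matching",
    "id9": "A crash course in graphs",
    "id167": "Flash storage",
}
index = build_index(titles)
print(sorted(index["ash"]))
print(query(index, "Hashing for Entity Matching"))
```

In this sketch the paper that shares the most k-mers with the query stands out as the obvious candidate, while unrelated papers that happen to share a single k-mer such as “ash” receive only a vote or two.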

6 of 8

Current Progress

  • Measuring memory allocation.
  • Measuring query time for MAG papers.
  • Optimizing and improving our current implementation of k-mer hashing and k-mer querying.

7 of 8

Future Plans

  • Evaluate further whether k-mer querying gives us good candidate matches.
  • What would a successful candidate look like?
  • Build a high-quality algorithm to decide whether two papers are a match.
  • This will probably be done using more expensive string comparisons.
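As an illustration of what a more expensive comparison might look like, the sketch below scores a candidate pair with `difflib.SequenceMatcher` from the Python standard library. The choice of this particular measure, the normalization step, the function names, and the 0.9 threshold are all assumptions made for the example; the slides do not say which comparison will be used.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles after simple normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_match(a, b, threshold=0.9):
    """ASSUMPTION: titles at or above the threshold count as the same paper."""
    return title_similarity(a, b) >= threshold


print(is_match("Large Scale Entity Matching", "large scale entity matching "))
print(is_match("Large Scale Entity Matching", "Flash storage"))
```

Because a comparison like this is much slower than a hash lookup, it would only be applied to the few candidates returned by the k-mer query and not to every pair of papers.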

8 of 8

Thank You

Q & A

dspradl1@uncc.edu