The motivation behind the research of this project is to merge datasets through a common key so that we can combine attributes from one dataset to another.
Not only do we want to merge these datasets but figure out if we can do it a way that is least expensive.
For example: DBLP gives great meta data for paper but does not contain citations. However, MAG has unreliable metadata but citations. Therefore merging these datasets through a common key will help give us this data.
3 of 8
Background: Dataset
DBLP (Digital Bibliography & Library Project) and MAG (Microsoft Academic Graph) are datasets that store scholarly publications with information about a publications title, author, publication year, DOI number, paper ID, etc.
Overall these datasets are much bigger than your average dataset with DBLP having around 3 million papers and MAG having around 217 million papers.
Therefore using a standard linear search algorithm, 3 million papers times 217 million papers will give us 600 trillion comparisons.
4 of 8
Kmer Hashing
Histogram represents the most common 3 character “mers.”
What are k-mers?
It is the length “k” of a substring within a string of repeated sequence.