1 of 15

Data Cleaning, Disambiguation and Reconciliation

Aims for this session:

short demos of different reconciliation tools
capture current issues in data cleaning and disambiguation for innovation data
start building a shared repository of scripts (catalog)

Link to these slides: http://j.mp/i3-recon-slides
Shared notes: http://j.mp/i3-datathon-notes

႞³

2 of 15

Common Data Cleaning Cycle

Merge
Reconcile
Deduplicate
Enrich
Export & name derived dataset

This process can be made much easier using Tidy data principles!
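A minimal sketch of one pass through the cycle above, using only the Python standard library. The records, field names, and the region lookup are made-up illustrations, and the normalization rule is deliberately naive; real reconciliation would match against an authority file.

```python
import csv
import io

# Toy records from two hypothetical sources; all values are illustrative.
source_a = [
    {"name": "Acme Corp.", "country": "US"},
    {"name": "Globex GmbH", "country": "DE"},
]
source_b = [
    {"name": "ACME CORP", "country": "US"},
    {"name": "Initech", "country": "US"},
]


def merge(*sources):
    """Merge: stack rows from all sources, remembering where each came from."""
    return [dict(row, source=i) for i, rows in enumerate(sources) for row in rows]


def normalize(name):
    """Reconcile (naive): lowercase and drop punctuation to get a comparable key."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()


def deduplicate(rows):
    """Deduplicate: keep the first row seen for each (key, country) pair."""
    seen = {}
    for row in rows:
        seen.setdefault((row["key"], row["country"]), row)
    return list(seen.values())


def enrich(rows, lookup):
    """Enrich: attach an attribute from an auxiliary table."""
    return [dict(row, region=lookup.get(row["country"], "unknown")) for row in rows]


merged = merge(source_a, source_b)
reconciled = [dict(row, key=normalize(row["name"])) for row in merged]
deduped = deduplicate(reconciled)
derived = enrich(deduped, {"US": "Americas", "DE": "Europe"})

# Export: write the derived dataset as CSV (to memory here; a named file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(derived[0]))
writer.writeheader()
writer.writerows(derived)
print(buffer.getvalue())
```

Keeping each step as a small named function makes it easier to record which transformation produced which derived dataset.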

Diagram adapted from here

֍ data cleaning


3 of 15

Tidy metadata

Keep a record of transformations
Link data, code, and schemas
Document the process
Determines how your dataset plays w/ others

Named scripts and steps

Glue code + mappings
Auxiliary + derived datasets
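One possible way to keep a record of transformations is to log each named step together with a fingerprint of its input and output. This is a sketch under assumptions: the helper names (`fingerprint`, `apply_step`) and the log layout are hypothetical, not taken from the SSDE guidelines.

```python
import hashlib
import json


def fingerprint(rows):
    """Short content hash so a derived dataset can be linked to its input."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def apply_step(rows, step_name, func, log):
    """Run one named transformation and append a provenance entry to the log."""
    result = func(rows)
    log.append({
        "step": step_name,
        "input": fingerprint(rows),
        "output": fingerprint(result),
        "rows_in": len(rows),
        "rows_out": len(result),
    })
    return result


def strip_whitespace(rows):
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_empty_names(rows):
    return [row for row in rows if row["name"]]


log = []
rows = [{"name": " Acme "}, {"name": "  "}]
rows = apply_step(rows, "strip_whitespace", strip_whitespace, log)
rows = apply_step(rows, "drop_empty_names", drop_empty_names, log)

# The log can be saved next to the derived dataset as its processing record.
print(json.dumps(log, indent=2))
```

Because each entry's output hash equals the next entry's input hash, the log doubles as a chain linking data, code steps, and derived datasets.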

Diagram adapted from SSDE guidelines here


4 of 15

Tools and Projects (incomplete list)

OpenRefine

Keeps a record of data transformations, which can be replayed to automate cleaning

Whole Tale

Captures reproducible environment and stages of running scripts

GitHub (see Cyril's session)

Allows for versioning of both code and data

Frictionless Data (beta)

Sharing schemas, packaging data for easy sharing + transformation

+++
add yours here!

༕ tools + projects


5 of 15

Context: steps in data cleaning + integration

※ schema alignment, entity resolution [ER], data fusion : each has a pipeline.

※ We're discussing the pipeline for ER: filtering, matching, clustering

※ Current sources for globally harmonized data, via Deyun Yin (1):


※ entity resolution / reconciliation
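The three ER stages named on this slide (filtering, matching, clustering) can be sketched in a few lines. This is a toy illustration, not any particular tool's method: the blocking key (first letter), the `difflib` similarity, the 0.8 threshold, and the sample names are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Made-up sample names.
names = ["Jon Smith", "John Smith", "Jane Smyth",
         "Maria Garcia", "Maria Garcias", "Wei Zhang"]


def block_key(name):
    """Filtering: only names that share a first letter get compared."""
    return name[0].lower()


def similarity(a, b):
    """Matching: a cheap string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(names, threshold=0.8):
    # Filtering: group names into blocks to avoid comparing every pair.
    blocks = {}
    for name in names:
        blocks.setdefault(block_key(name), []).append(name)

    # Union-find structure used for clustering matched pairs.
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Matching: compare pairs inside each block only.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    # Clustering: connected components of the match graph.
    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


for cluster in resolve(names):
    print(cluster)
```

Connected components is the simplest clustering choice; it can chain weak matches together, which is one reason production pipelines use more careful clustering.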


6 of 15

Current tools and services:

※ Recent overview: USPTO workshop https://patentsview.org/entityres

※ PatentsView: Inventor, Assignee, Location disambiguation
 - Documentation + code; clustering w/ overlapping canopies. Great rOpenSci package.

※ Lens.org: metarecord (LensID) + API (see the fields here)

※ Firmani 2021: Alaska benchmark for data integration tasks.
 - Needed: better benchmarks, interpretable results from deep-learning models (CorDEL)

※ Name disambiguation: many focused datasets (Yin, Callaert, ++)

※ Other datasets: Morrison 2017 (Assignee + Inventor Disambig)
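For readers unfamiliar with the "overlapping canopies" mentioned above, here is a generic sketch of canopy clustering with a cheap trigram similarity. It illustrates the general technique only and is not the PatentsView implementation; the thresholds and sample strings are assumptions.

```python
def trigram_set(text):
    """Character trigrams of a padded, lowercased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Cheap similarity: overlap of trigram sets, in [0, 1]."""
    set_a, set_b = trigram_set(a), trigram_set(b)
    return len(set_a & set_b) / len(set_a | set_b)


def canopies(items, loose=0.3, tight=0.7):
    """Group items into possibly overlapping canopies.

    Every item at least `loose`-similar to a center joins that canopy.
    Items at least `tight`-similar to the center stop being candidate
    centers, but they may still appear in later canopies.
    """
    remaining = list(items)
    result = []
    while remaining:
        center = remaining[0]
        result.append([x for x in items if jaccard(center, x) >= loose])
        remaining = [x for x in remaining if jaccard(center, x) < tight]
    return result


# Made-up assignee strings.
assignees = ["IBM Corp", "IBM Corporation",
             "Intl Business Machines", "Siemens AG"]
for canopy in canopies(assignees):
    print(canopy)
```

The expensive pairwise matching then only needs to run inside each canopy, which is what makes the approach scale.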


7 of 15

Setting up or using a reconciliation service

Submit a string, get back a list of possible IDs, w/ context and similarity score.

※ OpenRefine: the canonical tool for this.
 - 35 popular Entity Reconciliation endpoints; host your own!

※ Datasette-reconcile: quick data processing in SQLite
 "A service [using datasette + openrefine] could be started with something like
 datasette-reconcile --canonical-data mysource.csv --search-column searchCol --id-col idCol --use-plugin datasette-jellyfish --scoring levenshtein_distance --port ${RECONCILE_SERVICE_PORT}"
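To make "submit a string, get back a list of possible IDs w/ similarity score" concrete, here is a local stand-in for a reconciliation service that scores candidates by Levenshtein distance. It runs no server and calls no real endpoint. The canonical table, its IDs, and the 95-point match threshold are hypothetical; the result fields (`id`, `name`, `score`, `match`) are loosely modeled on what reconciliation services return.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def score(query, candidate):
    """Similarity on a 0-100 scale, derived from the edit distance."""
    q, c = query.lower(), candidate.lower()
    longest = max(len(q), len(c)) or 1
    return round(100 * (1 - levenshtein(q, c) / longest), 1)


# Hypothetical canonical data: ID -> preferred label.
CANONICAL = {
    "Q1": "Massachusetts Institute of Technology",
    "Q2": "Stanford University",
    "Q3": "University of Oxford",
}


def reconcile(query, limit=3, match_threshold=95):
    """Return ranked candidates for one query string."""
    candidates = [
        {"id": ident, "name": label, "score": score(query, label)}
        for ident, label in CANONICAL.items()
    ]
    candidates.sort(key=lambda c: c["score"], reverse=True)
    for candidate in candidates:
        candidate["match"] = candidate["score"] >= match_threshold
    return {"result": candidates[:limit]}


print(reconcile("Stanford Univrsity"))
```

Edit distance handles typos well but not abbreviations or reordered words, so a real service would usually combine it with other signals.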

※ entity resolution + reconciliation


8 of 15

Are there public reconciliation services?

- Yes. Example: http://refine.codefork.com/

- Catalog of 35: https://reconciliation-api.github.io/testbench/

Can you resolve against multiple ontologies at once?

Is there a shared list of scripts and benchmarks?


9 of 15

※ reconciliation tools

Are there public reconciliation services?

- Yes! (incomplete list)

Can you resolve against multiple ontologies at once?

- Yes, in theory. DBpedia's Databus searches many endpoints.
 - Here is a public demo (no slick frontend; you can run your own)
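As a rough illustration of resolving one string against several authorities at once, the sketch below fans a query out to two in-memory lookup tables and merges the ranked hits. The tables, source names, and IDs are invented stand-ins; this is not the Databus API, and a real setup would call remote endpoints instead.

```python
from difflib import SequenceMatcher

# Hypothetical authority files: source name -> {ID: label}.
SOURCES = {
    "authority_a": {"a:1": "Siemens AG", "a:2": "Samsung Electronics"},
    "authority_b": {"b:9": "Siemens Aktiengesellschaft", "b:7": "Sony Corporation"},
}


def query_source(source_name, query):
    """Score every label in one source against the query."""
    for ident, label in SOURCES[source_name].items():
        ratio = SequenceMatcher(None, query.lower(), label.lower()).ratio()
        yield {"source": source_name, "id": ident,
               "name": label, "score": round(ratio, 3)}


def resolve_everywhere(query, min_score=0.5):
    """Query all sources and return one merged list, best hits first."""
    hits = [
        hit
        for source_name in SOURCES
        for hit in query_source(source_name, query)
        if hit["score"] >= min_score
    ]
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


for hit in resolve_everywhere("Siemens AG"):
    print(hit)
```

Keeping the source name on every hit matters: scores from different authorities are not directly comparable, so the caller needs to see where each candidate came from.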

Is there a shared list of scripts and benchmarks?


10 of 15


Are there public reconciliation services?

- Yes! (incomplete list)

Can you resolve against multiple ontologies at once?

- Yes, but it needs more interface work to be widely used.

Is there a shared list of scripts and benchmarks?

- Some smaller collections exist; we can do better!

- Contribute the code you use to the I3 Data Processing Scripts repository!


11 of 15

Discussion

How do data cleaning issues come up in your work? What perils are there?
How do you find publicly available benchmarks, and which do you use?
For ML approaches: where do you look for trained models and training data?
What do you rely on to measure similarity? (co-invention, spelling, location, citation network, classification, body text)
