Data Cleaning, Disambiguation and Reconciliation
Aims for this session:
Link to these slides: http://j.mp/i3-recon-slides
Shared notes: http://j.mp/i3-datathon-notes
Common Data Cleaning Cycle
Merge → Reconcile → Deduplicate → Enrich → Export & name derived dataset
This process can be made much easier using Tidy data principles!
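The cycle above can be sketched in a few lines once the inputs are tidy (one row per entity, one column per variable). This is a minimal illustration using only the standard library; the tables, the `country_names` lookup, and the helper names `normalise`, `clean`, and `export` are all hypothetical, and the crude key normalisation stands in for a real reconciliation step.

```python
import csv
import io

# Hypothetical tidy inputs: one row per organisation, one column per variable.
source_a = [
    {"name": "Acme Corp.", "country": "US"},
    {"name": "Globex", "country": "DE"},
]
source_b = [
    {"name": "ACME Corp", "country": "US"},
    {"name": "Initech", "country": "US"},
]
# Hypothetical lookup table used in the enrich step.
country_names = {"US": "United States", "DE": "Germany"}


def normalise(name):
    """Reduce spelling variants to one comparison key (lowercase, alphanumerics only)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def clean(*sources):
    # Merge: stack all rows from all sources.
    merged = [dict(row) for source in sources for row in source]
    # Reconcile: attach a shared key to each row.
    for row in merged:
        row["key"] = normalise(row["name"])
    # Deduplicate: keep the first row seen for each key.
    deduplicated = {}
    for row in merged:
        deduplicated.setdefault(row["key"], row)
    # Enrich: add a column from the lookup table.
    rows = list(deduplicated.values())
    for row in rows:
        row["country_name"] = country_names.get(row["country"], "")
    return rows


def export(rows, handle):
    # Export: write the derived dataset with an explicit column order.
    fields = ["key", "name", "country", "country_name"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


rows = clean(source_a, source_b)
buffer = io.StringIO()  # stands in for a clearly named output file
export(rows, buffer)
```

Because every step takes and returns rows of the same shape, each one can live in its own named script and be rerun independently.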
Diagram adapted from here
֍ data cleaning
Tidy metadata
Named scripts and steps
Diagram adapted from SSDE guidelines here
Tools and Projects (incomplete list)
༕ tools + projects
Context: steps in data cleaning + integration
※ Schema alignment, entity resolution (ER), and data fusion: each has its own pipeline.
※ This session covers the ER pipeline: filtering, matching, clustering.
※ Current sources for globally harmonized data, via Deyun Yin (1):
※ entity resolution / reconciliation
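The three ER stages named above (filtering, matching, clustering) can be sketched as follows. This is a toy illustration under stated assumptions, not any particular project's method: the records, the blocking key (first letter of the last name token), the `difflib` similarity, and the 0.85 threshold are all hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: id -> name string.
records = {
    1: "Jon Smith",
    2: "John Smith",
    3: "Jane Doe",
    4: "Jane  Doe",
    5: "Joan Smythe",
}


def block_key(name):
    """Filtering: a cheap key; only records sharing it are compared."""
    return name.split()[-1][0].lower()


def candidate_pairs(records):
    blocks = defaultdict(list)
    for record_id, name in records.items():
        blocks[block_key(name)].append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def similarity(a, b):
    """Matching: a string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(records, threshold=0.85):
    return [
        (a, b)
        for a, b in candidate_pairs(records)
        if similarity(records[a], records[b]) >= threshold
    ]


def cluster(records, matches):
    """Clustering: connected components over matched pairs (union-find)."""
    parent = {record_id: record_id for record_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for record_id in records:
        groups[find(record_id)].append(record_id)
    return sorted(sorted(group) for group in groups.values())


matches = match(records)
clusters = cluster(records, matches)
```

On this toy input the filter reduces ten possible pairs to four comparisons, and the two near-identical spellings end up clustered while "Joan Smythe" stays on its own.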
Current tools and services:
※ Recent overview: USPTO workshop https://patentsview.org/entityres
※ PatentsView: Inventor, Assignee, Location disambiguation
  - Documentation + code; clustering w/ overlapping canopies. Great rOpenSci package.
※ Lens.org: metarecord (LensID) + API (see the fields here)
※ Firmani 2021: Alaska benchmark for data integration tasks.
  - Needed: better benchmarks, interpretable results from deep-learning models (CorDEL)
※ Name disambiguation: many focused datasets (Yin, Callaert, ++)
※ Other datasets: Morrison 2017 (Assignee + Inventor Disambig)
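To make "clustering w/ overlapping canopies" concrete, here is a minimal sketch of the general canopy idea: a cheap similarity groups records into overlapping candidate sets, and expensive matching then runs only inside each set. It is not the PatentsView implementation; the names, the token-set Jaccard similarity, and the `loose` and `tight` thresholds are assumptions chosen for illustration.

```python
def tokens(name):
    return set(name.lower().split())


def jaccard(a, b):
    """Cheap similarity: shared tokens over all tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canopies(names, loose=0.45, tight=0.6):
    """Build overlapping canopies.

    loose: minimum similarity to a centre for a name to join its canopy.
    tight: minimum similarity at which a name can no longer become a centre.
    """
    token_sets = {name: tokens(name) for name in names}
    remaining = list(names)
    result = []
    while remaining:
        centre = remaining[0]
        centre_tokens = token_sets[centre]
        # Members are drawn from ALL names, so canopies may overlap.
        result.append(
            [n for n in names if jaccard(centre_tokens, token_sets[n]) >= loose]
        )
        # Drop the centre and anything tightly bound to it from future centres.
        remaining = [
            n
            for n in remaining
            if n != centre and jaccard(centre_tokens, token_sets[n]) < tight
        ]
    return result


# Hypothetical assignee strings.
names = [
    "International Business Machines",
    "International Business Machines Corp",
    "Business Machines Inc",
    "Acme Machines Inc",
    "Acme Inc",
]
groups = canopies(names)
```

Here "Business Machines Inc" lands in every canopy, which is the point: a borderline record is compared against several neighbourhoods instead of being forced into one block.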
Setting up or using a reconciliation service
Submit a string, get back a list of possible IDs, w/ context and similarity score.
※ OpenRefine: the canonical tool for this.
  - 35 popular Entity Reconciliation endpoints; host your own!
※ Datasette-reconcile: quick data processing in SQLite
  "A service [using datasette + openrefine] could be started with something like
  datasette-reconcile --canonical-data mysource.csv --search-column searchCol --id-col idCol --use-plugin datasette-jellyfish --scoring levenshtein_distance --port ${RECONCILE_SERVICE_PORT}"
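The core contract ("submit a string, get back a list of possible IDs with a similarity score") can be sketched locally without any service. This is an illustration only: the `reconcile` function, the `org-00x` identifiers, and the 0 to 100 score scaling are hypothetical, and the candidate fields (`id`, `name`, `score`, `match`) are loosely modelled on what reconciliation endpoints typically return. Levenshtein distance is used because the quoted command names it as the scoring method.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def reconcile(query, canonical, limit=3):
    """Return the best candidates for a query string, highest score first."""
    q = query.strip().lower()
    candidates = []
    for entity_id, name in canonical.items():
        n = name.lower()
        distance = levenshtein(q, n)
        longest = max(len(q), len(n)) or 1
        candidates.append(
            {
                "id": entity_id,
                "name": name,
                "score": round(100 * (1 - distance / longest), 1),
                "match": distance == 0,  # only exact matches are auto-accepted
            }
        )
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:limit]


# Hypothetical canonical data: id -> preferred label.
canonical = {
    "org-001": "Harvard University",
    "org-002": "Howard University",
    "org-003": "Stanford University",
}
results = reconcile("Harvard Univeristy", canonical)
```

The misspelled query ranks "Harvard University" first but does not flag it as an automatic match, leaving the final decision to a reviewer, which mirrors how reconciliation candidates are usually confirmed by hand.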
※ entity resolution + reconciliation
- yes. ex: http://refine.codefork.com/
- catalog of 35: https://reconciliation-api.github.io/testbench/
※ reconciliation tools
- Yes! (incomplete list)
- Yes, in theory. DBpedia's Databus searches many endpoints.
  - Here is a public demo (no slick frontend; you can run your own)
- Yes! (incomplete list)
- Yes, but it needs more interface work to be widely used.
- Some smaller collections; we can do better!
- Contribute code you use to the I3 Data Processing Scripts repository!
Discussion
OpenRefine Demo: cleaning + transformation
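One cleaning step that is easy to reproduce outside the tool is key-collision clustering: values that reduce to the same normalised key are grouped as likely variants. The sketch below approximates the fingerprint idea (trim, lowercase, fold accents, strip punctuation, sort unique tokens). It is not OpenRefine's exact algorithm, and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalise a string so that trivially different spellings collide."""
    value = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation, then sort the unique tokens.
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))


def key_collision_clusters(values):
    """Group values sharing a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical messy column.
values = [
    "Université de Genève",
    "universite de geneve",
    "Genève, Université de",
    "ETH Zürich",
    "ETH Zurich ",
    "EPFL",
]
variant_groups = key_collision_clusters(values)
```

Each resulting group is a suggestion to review and merge, not an automatic decision; values with no colliding variant (such as "EPFL" here) are left alone.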
࿃ reconciliation example: OpenRefine
OpenRefine Demo: disambiguation + entity resolution
OpenRefine Documents + Tutorials: