1 of 15

Data Cleaning, Disambiguation and Reconciliation

Aims for this session:

  • short demos of different reconciliation tools
  • capture current issues in data cleaning and disambiguation for innovation data
  • start building a shared repository of scripts (catalog)

Link to these slides: http://j.mp/i3-recon-slides
Shared notes: http://j.mp/i3-datathon-notes


2 of 15

Common Data Cleaning Cycle

Merge → Reconcile → Deduplicate → Enrich → Export & name derived dataset

This process can be made much easier using Tidy data principles!
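As a rough illustration of why tidy data helps with merging and deduplication, here is a minimal sketch that reshapes a "wide" table into one observation per row. The table, the field names, and the to_tidy helper are all hypothetical, made up for this example.

```python
# Hypothetical wide table: one row per assignee, one column per year.
wide_rows = [
    {"assignee": "Acme Corp", "2019": 4, "2020": 7},
    {"assignee": "Globex", "2019": 1, "2020": 0},
]


def to_tidy(rows, id_field, var_name, value_name):
    """Reshape wide rows into long ("tidy") rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every tidy row
            tidy.append({id_field: row[id_field], var_name: column, value_name: value})
    return tidy


tidy_rows = to_tidy(wide_rows, "assignee", "year", "patent_count")
for row in tidy_rows:
    print(row)
```

Once every row is a single (assignee, year, count) observation, merging with another source becomes a plain join on shared keys.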

Diagram adapted from here

֍ data cleaning


3 of 15

Tidy metadata

  • Keep a record of transformations
  • Link data, code, and schemas
  • Document the process
  • Determines how your dataset plays w/ others

Named scripts and steps

  • Glue code + mappings
  • Auxiliary + derived datasets
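One possible way to "keep a record of transformations" with named steps is sketched below. The run_step helper, the log fields, and the two example steps are assumptions for illustration; they are not taken from the SSDE guidelines or from any tool listed in these slides.

```python
import hashlib
import json


def fingerprint(rows):
    """Short stable hash of a list of dict rows, to detect changes between steps."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_step(name, func, rows, log):
    """Apply one named cleaning step and append a record of it to the log."""
    result = func(rows)
    log.append({
        "step": name,
        "rows_in": len(rows),
        "rows_out": len(result),
        "input_hash": fingerprint(rows),
        "output_hash": fingerprint(result),
    })
    return result


def strip_names(rows):
    """Remove leading and trailing whitespace from the name field."""
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, kept = set(), []
    for r in rows:
        key = json.dumps(r, sort_keys=True)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


raw = [{"name": " Acme Corp"}, {"name": "Acme Corp "}, {"name": "Globex"}]
log = []
clean = run_step("strip_names", strip_names, raw, log)
clean = run_step("drop_exact_duplicates", drop_exact_duplicates, clean, log)
print(json.dumps(log, indent=2))
```

The log can be saved next to the derived dataset so that data, code, and process notes stay linked.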

Diagram adapted from SSDE guidelines here


4 of 15

Tools and Projects (incomplete list)

  • OpenRefine
    • Keeps a record of data transformations, can use to automate
  • Wholetale
    • Captures reproducible environment and stages of running scripts
  • GitHub (see Cyril's session)
    • Allows for versioning of both code and data
  • Frictionless Data (beta)
    • Sharing schemas, packaging data for easy sharing + transformation
  • +++
  • add yours here!

༕ tools + projects


5 of 15

Context: steps in data cleaning + integration

schema alignment, entity resolution [ER], data fusion : each has a pipeline.

※ We're discussing the pipeline for ER: filtering, matching, clustering
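The three ER stages named above (filtering, matching, clustering) can be sketched in a few lines. This is a toy illustration under stated assumptions: the records are invented, the blocking key (first three letters) and the 0.6 threshold are arbitrary choices, and difflib's ratio stands in for whatever similarity measure a real pipeline would use.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: ID -> raw name string.
records = {
    1: "International Business Machines",
    2: "International Business Machine Corp",
    3: "Intel Corporation",
    4: "Intel Corp.",
    5: "Siemens AG",
}


def block_key(name):
    """Filtering: only names sharing this cheap key are ever compared."""
    return name.lower()[:3]


def similar(a, b, threshold=0.6):
    """Matching: pairwise decision from a string-similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def resolve(records):
    # 1. Filtering (blocking): group record IDs by blocking key.
    blocks = {}
    for rid, name in records.items():
        blocks.setdefault(block_key(name), []).append(rid)

    # 2. Matching: compare pairs within each block only.
    matches = [
        (a, b)
        for ids in blocks.values()
        for a, b in combinations(ids, 2)
        if similar(records[a], records[b])
    ]

    # 3. Clustering: connected components of the match graph (union-find).
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(c) for c in clusters.values())


clusters = resolve(records)
print(clusters)
```

Blocking keeps the number of comparisons manageable; the cost is that records placed in different blocks can never be matched.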

Current sources for globally harmonized data, via Deyun Yin (1):

entity resolution / reconciliation


6 of 15

Current tools and services:

※ Recent overview: USPTO workshop https://patentsview.org/entityres

PatentsView: Inventor, Assignee, Location disambiguation
 - Documentation + code; clustering w/ overlapping canopies. Great rOpenSci package.

Lens.org: metarecord (LensID) + API (see the fields here)

※ Firmani 2021: Alaska benchmark for data integration tasks.
 - Needed: better benchmarks, interpretable results from deep-learning models (CorDEL)

※ Name disambiguation: many focused datasets (Yin, Callaert, ++)

※ Other datasets: Morrison 2017 (Assignee + Inventor Disambig)
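To give a feel for the "clustering w/ overlapping canopies" idea mentioned above, here is a minimal sketch of generic canopy clustering with a cheap token-overlap similarity. It is not the PatentsView implementation; the names, the Jaccard measure, and both thresholds are assumptions chosen for the example.

```python
def tokens(name):
    """Lowercased word set with periods and commas removed."""
    return set(name.lower().replace(".", "").replace(",", "").split())


def jaccard(a, b):
    """Cheap similarity: shared tokens divided by all tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def canopies(names, loose=0.25, tight=0.6):
    """Group names into overlapping canopies.

    Every name at least `loose`-similar to a center joins its canopy;
    names at least `tight`-similar are also removed from the list of
    future centers. Names between the two thresholds can reappear in
    later canopies, which is where the overlap comes from.
    """
    remaining = list(names)
    result = []
    while remaining:
        center = remaining[0]
        result.append([n for n in names if n == center or jaccard(center, n) >= loose])
        remaining = [n for n in remaining[1:] if jaccard(center, n) < tight]
    return result


# Hypothetical organisation names.
names = [
    "Acme Robotics",
    "Acme Robotics Inc",
    "Robotics Research Lab",
    "Research Lab Zurich",
    "Globex",
]
groups = canopies(names)
for group in groups:
    print(group)
```

Expensive pairwise matching then only needs to run inside each canopy rather than across the whole dataset.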


7 of 15

Setting up or using a reconciliation service

Submit a string, get back a list of possible IDs, w/ context and similarity score.
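The "submit a string, get back candidates" contract can be sketched locally without any server. The canonical table, the IDs, the reconcile function, and the 90-point match threshold below are all hypothetical; a real reconciliation endpoint defines its own request and response format.

```python
from difflib import SequenceMatcher

# Hypothetical canonical data: ID, name, and a bit of context per entity.
CANONICAL = [
    {"id": "org:001", "name": "Acme Robotics Inc", "context": "Zurich"},
    {"id": "org:002", "name": "Acme Robotics GmbH", "context": "Berlin"},
    {"id": "org:003", "name": "Globex Corporation", "context": "Springfield"},
]


def reconcile(query, limit=3, match_threshold=90.0):
    """Return up to `limit` candidates, best first, each with a 0-100 score."""
    q = query.strip().lower()
    candidates = []
    for entry in CANONICAL:
        ratio = SequenceMatcher(None, q, entry["name"].lower()).ratio()
        score = round(100 * ratio, 1)
        # "match" flags candidates the service considers safe to accept.
        candidates.append({**entry, "score": score, "match": score >= match_threshold})
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:limit]


results = reconcile("acme robotics inc.")
for candidate in results:
    print(candidate)
```

Returning several scored candidates, rather than a single answer, leaves the final accept or reject decision to the user or to a downstream rule.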

OpenRefine: the canonical tool for this.
 - 35 popular Entity Reconciliation endpoints; host your own!

Datasette-reconcile: quick data processing in SQLite
"A service [using datasette + openrefine] could be started with something like
  datasette-reconcile --canonical-data mysource.csv --search-column searchCol --id-col idCol --use-plugin datasette-jellyfish --scoring levenshtein_distance --port ${RECONCILE_SERVICE_PORT}"
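For readers unfamiliar with the levenshtein_distance scoring named in the quoted command, here is a minimal sketch of the textbook edit-distance algorithm. It is a plain-Python illustration, not the code used by datasette-jellyfish; normalized_similarity is a hypothetical helper added to turn the distance into a 0-1 score.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Map edit distance to a 0-1 similarity (1.0 means identical strings)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_similarity("Acme Corp", "Acme Corp."))
```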

entity resolution + reconciliation


8 of 15

  • Are there public reconciliation services?

- Yes. Example: http://refine.codefork.com/

- catalog of 35: https://reconciliation-api.github.io/testbench/

  • Can you resolve against multiple ontologies at once?

  • Is there a shared list of scripts and benchmarks?


9 of 15

reconciliation tools

  • Are there public reconciliation services?

- Yes! (incomplete list)

  • Can you resolve against multiple ontologies at once?

- Yes, in theory. DBpedia's Databus searches many endpoints.
 - Here is a public demo (no slick frontend; you can run your own)

  • Is there a shared list of scripts and benchmarks?


10 of 15

reconciliation tools

  • Are there public reconciliation services?

- Yes! (incomplete list)

  • Can you resolve against multiple ontologies at once?

- Yes, needs more interface work to be widely used.

  • Is there a shared list of scripts and benchmarks?

- Some smaller collections, we can do better!

- contribute code you use to the I3 Data Processing Scripts repository!


11 of 15

Discussion

  • How do data cleaning issues come up in your work? What perils are there?
  • How do you find publicly available benchmarks, and which do you use?
  • For ML approaches: where do you look for trained models and training data?
  • What do you rely on to measure similarity? (co-invention, spelling, location, citation network, classification, body text)
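As a starting point for the last question, the sketch below combines three of the listed signals (spelling, co-invention, location) into one score. The record fields, the weights, and the similarity function are hypothetical; in practice the weights would be tuned or learned against a benchmark.

```python
from difflib import SequenceMatcher

# Hypothetical weights; they sum to 1 so the combined score stays in [0, 1].
WEIGHTS = {"spelling": 0.5, "co_inventors": 0.3, "location": 0.2}


def jaccard(a, b):
    """Overlap between two collections, as shared items over all items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(r1, r2):
    """Return the weighted score and the individual signals for two inventor records."""
    signals = {
        "spelling": SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio(),
        "co_inventors": jaccard(r1["co_inventors"], r2["co_inventors"]),
        "location": 1.0 if r1["location"] == r2["location"] else 0.0,
    }
    score = sum(WEIGHTS[key] * value for key, value in signals.items())
    return score, signals


# Hypothetical inventor records.
a = {"name": "J. Smith", "co_inventors": ["L. Chen", "M. Rossi"], "location": "Boston"}
b = {"name": "John Smith", "co_inventors": ["L. Chen"], "location": "Boston"}
c = {"name": "J. Smith", "co_inventors": ["P. Novak"], "location": "Oslo"}

score_ab, signals_ab = similarity(a, b)
score_ac, signals_ac = similarity(a, c)
print(round(score_ab, 3), signals_ab)
print(round(score_ac, 3), signals_ac)
```

Returning the individual signals alongside the score makes it easier to see why two records were, or were not, judged similar.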


12 of 15

OpenRefine Demo: cleaning + transformation

  • Data cleanup and transformation software
  • Used for:
    • cleaning data
    • transforming data
    • extending datasets
    • automating processes
  • Advantages:
    • Doesn’t edit the original dataset
    • Takes in a wide range of files
    • More powerful processes than spreadsheets


࿃ reconciliation example: OpenRefine

13 of 15

OpenRefine Demo: disambiguation + entity resolution

  • Sample US foreign aid data:
  • Basic string operations
  • Clustering and faceting
  • Transformations
  • Creating plots
  • Reconciliation
  • Exporting
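The clustering step in the demo can be approximated outside OpenRefine with a key-collision approach. The sketch below is loosely modeled on the "fingerprint" keying idea described in OpenRefine's clustering documentation; it is not OpenRefine's own code, and the sample values are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivially different spellings share one key."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))    # drop punctuation
    return " ".join(sorted(set(text.split())))                         # sort + dedupe tokens


def cluster(values):
    """Group values whose fingerprints collide; singletons are left out."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical recipient names.
values = [
    "Acme Relief Fund",
    "acme relief fund.",
    "Fund, Acme Relief",
    "Acmé Relief Fund",
    "Globex Aid",
]
clusters = cluster(values)
print(clusters)
```

Key collision is fast but only catches variation in case, accents, punctuation, and word order; misspellings need a distance-based method instead.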


14 of 15

OpenRefine Documents + Tutorials:


15 of 15
