Journey towards the JeMPI
Jembi’s MPI Solution
Acknowledgements
Presentation breakdown
Intro (2min)
How it started (3min)
Use cases
JeMPI Overview (25min)
Architecture (5min)
Tech Stack (5min)
Engine (?)
Info flow
Strategies: Probabilistic and Deterministic (3min)
Blocking (3min)
Machine learning & Fellegi Sunter (10min)
Performance Evaluation (25min)
Synthetic Data (3min)
Recall & Precision (3min)
Pairwise and Golden Record derivation (5min)
Accuracy & Run time (5min)
SanteMPI (reference point)
JeMPI performance
Fully automated and Semi-automated (2min)
Further testing and analysis (same system, real life data accuracy) (3min)
Future plans (5min)
UI
Biometrics
Improvements to the linking
EM optimisations w/ flexible matching
NN for matching accuracy comparison
Security
Synthetic dataset generator Improvements
MPI name
Demo (15min)
Q&A (10min)
2+3+25+25+5+15 = 75
Overview
How it started
MPI Toolkit:
How it started
Summary report to evaluate which system best suits the DISI requirements:
Non-functional
Functional
Where it is now
Source: OpenHIE Partners Page
JeMPI in comparison
JeMPI
JeMPI in context
JeMPI Architecture: JeMPI Engine
1. Data is entered via the input component in batch form
2. Staging takes the batch import, applies data cleaning and sends individual records to the controller
3. Controller receives records, validates and makes decisions, and controls the flow rate of information
4. The controller also sends updates to the Expectation-Maximisation algorithm for parameter refinement
5. Records go to the linker. It fetches blocked candidate records from the persistent storage, and either links query records to existing golden records or creates new golden records
6. Golden records are stored in the persistent storage
7. The journal records every transaction applied to the persistent storage, serving as a disaster-recovery mechanism
8. The API allows queries to the persistent storage. Search, update, delete, etc.
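The flow above can be sketched in miniature. This is an illustrative sketch only: `stage`, `similarity`, `link`, the field names and the golden-record id format are our assumptions, not JeMPI's actual API, and the comparator is a naive field-equality score.

```python
# Sketch of the batch flow above; names and comparator are illustrative, not JeMPI's API.
def stage(batch):
    # 2. staging: clean each record (trim, lowercase), then hand records on individually
    return [{k: v.strip().lower() for k, v in rec.items()} for rec in batch]

def similarity(a, b):
    # toy comparator: fraction of fields that agree exactly
    return sum(a[k] == b.get(k) for k in a) / len(a)

def link(record, golden_records, threshold=0.9):
    # 5. linker: compare against candidate golden records,
    #    link to the best match or create a new golden record
    best_id, best_score = None, 0.0
    for gid, gr in golden_records.items():
        score = similarity(record, gr)
        if score > best_score:
            best_id, best_score = gid, score
    if best_id is not None and best_score >= threshold:
        return best_id                             # link to an existing golden record
    new_id = f"GR{len(golden_records):04d}"        # 6. store a new golden record
    golden_records[new_id] = dict(record)
    return new_id
```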
JeMPI Architecture: JeMPI Product
JeMPI Architecture: JeMPI in relation
JeMPI Architecture: Tech Stack
JeMPI Architecture: Mid-Level (synchronous)
JeMPI Architecture: Mid-Level (asynchronous)
Features: Blocking
Query record + Blocking rules
Candidate List
Features: Blocking
Sex.equals("male")
Features: Blocking
Sex.equals("male") && Age.between(25,30)
Features: Blocking
Sex.equals("male") && Age.between(25,30) && City.fuzzy_equals("kampala",0.8)
Features: Blocking
Sex.equals("male") && Age.between(25,30) && City.fuzzy_equals("kampala",0.8) && FirstName.startsWith("s")
Features: Blocking
Sex.equals("male") && Age.between(25,30) && City.fuzzy_equals("kampala",0.8) && FirstName.startsWith("s") && LastName.startsWith("m")
Features: Blocking
If blocking returns an empty list, a second round of blocking with broader rules applies:
Features: Blocking
Sex.equals("male") && ( FirstName.fuzzy_equals("solomon",0.75) || LastName.fuzzy_equals("mujuni",0.75) )
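The two-round blocking above can be sketched as follows. `fuzzy_equals` here is a stand-in built on Python's `difflib`, not JeMPI's actual comparator, and the rule values are taken from the example query on the slides.

```python
from difflib import SequenceMatcher

def fuzzy_equals(a, b, threshold):
    # stand-in for JeMPI's fuzzy comparator (SequenceMatcher ratio, not Jaro-Winkler)
    return SequenceMatcher(None, a, b).ratio() >= threshold

def block(records):
    # Round 1: strict conjunction of blocking rules
    candidates = [r for r in records
                  if r["sex"] == "male"
                  and 25 <= r["age"] <= 30
                  and fuzzy_equals(r["city"], "kampala", 0.8)
                  and r["first_name"].startswith("s")
                  and r["last_name"].startswith("m")]
    if candidates:
        return candidates
    # Round 2: if round 1 returns nothing, fall back to broader disjunctive rules
    return [r for r in records
            if r["sex"] == "male"
            and (fuzzy_equals(r["first_name"], "solomon", 0.75)
                 or fuzzy_equals(r["last_name"], "mujuni", 0.75))]
```

The fallback trades precision for recall: a too-strict first round would otherwise return no candidates at all.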
Features: Deterministic & Probabilistic Matching
Deterministic | Probabilistic
Exact Match (intersection rules): A ⋂ B ⋂ C ⋂ D | Fellegi-Sunter Sim-Sum
Exact Match (union rules): (A ⋂ B ⋂ C) ⋃ D | Weighted Averages
Fuzzy Matching |
Trade-off: Speed vs Accuracy | Trade-off: Speed vs Accuracy
Features: Deterministic & Probabilistic Matching
?
National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Features: Deterministic & Probabilistic Matching
?
National ID | First name | Last name | Date of birth | Sex | City | Phone number |
{ "deterministic": {
    "conditions": {
      "nationalID": "exactMatch",
      "firstName": ["jarowinkler", 0.85],
      "lastName": ["jarowinkler", 0.85]
    }
  }
}
GR0803
GR0975
GR1006
GR1101
GR1305
GR1389
GR1629
GR2045
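A sketch of how such a deterministic configuration might be evaluated against one candidate golden record. The AND-combination of the three conditions and the use of `difflib`'s ratio as a stand-in for the Jaro-Winkler comparator named in the config are our assumptions:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for the "jarowinkler" comparator named in the config
    return SequenceMatcher(None, a, b).ratio()

def deterministic_match(query, candidate):
    # all conditions from the config must hold (assumed AND semantics)
    return (query["nationalID"] == candidate["nationalID"]                     # exactMatch
            and similarity(query["firstName"], candidate["firstName"]) >= 0.85
            and similarity(query["lastName"], candidate["lastName"]) >= 0.85)
```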
Features: Deterministic & Probabilistic Matching
?
National ID | First name | Last name | Date of birth | Sex | City | Phone number |
{ "probabilistic": {
    "conditions": {
      "attributes": "all",
      "algorithm": "fellegiSunter",
      "threshold": 0.65
    }
  }
}
0.4562
0.5983
0.8892
0.3254
0.2812
0.6940
0.4839
0.6329
The Landscape - Open Source Systems Reviewed
Features: Fellegi-Sunter
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
Features: Fellegi-Sunter
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
Jaro-Winkler | 1.000 | 0.952 | 0.967 | 0.979 | 1.000 | 0.562 | 0.502 |
Features: Fellegi-Sunter
Matching probabilities (m) represent the quality of the data
m = Pr(attrM | sameRec)
Unmatching probabilities (u) represent the uniqueness of the data
u = Pr(attrM | diffRec)
Features: Machine Learning - Expectation Maximisation
Features: Fellegi-Sunter
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
m ; u | 0.964 ; 0.001 | 0.698 ; 0.003 | 0.772 ; 0.004 | 0.965 ; 0.050 | 0.999 ; 0.501 | 0.999 ; 0.045 | 0.985 ; 0.004 |
Matching | 13.943 | 7.812 | 7.707 | 4.280 | 0.996 | 4.462 | 7.981 |
Unmatching | -4.809 | -1.723 | -2.216 | -4.771 | -16.439 | -4.186 | -6.097 |
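The weights in the table follow directly from m and u: log2(m/u) when an attribute agrees, log2((1-m)/(1-u)) when it disagrees, and the record-level score is the sum of one weight per attribute. A minimal sketch (the slide's figures can differ in the last digits because m and u are shown rounded):

```python
from math import log2

def fs_weights(m, u):
    # Fellegi-Sunter: agreement weight if the attribute matches,
    # disagreement weight if it does not; a record pair's score is
    # the sum of the applicable weight for each attribute.
    return log2(m / u), log2((1 - m) / (1 - u))

# sex attribute from the table: m=0.999, u=0.501
sex_agree, _ = fs_weights(0.999, 0.501)
# first-name attribute from the table: m=0.698, u=0.003
_, name_disagree = fs_weights(0.698, 0.003)
# sex_agree ≈ 0.996, name_disagree ≈ -1.723 (as in the table)
```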
Attribute comparison strategies
Oversimplification
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
Jaro-Winkler | 1.000 | 0.952 | 0.967 | 0.979 | 1.000 | 0.562 | 0.502 |
Jaro String Similarity
S | O | L | O | M | O | N |
S | A | M | U | E | L |
s1 = 7
s2 = 6
m = 2
t = 0
Jaro-Winkler String Similarity
simj = Jaro Similarity = 0.5397
l = the length of the common prefix at the start of the two strings, up to a maximum of 4 characters
p = a constant scaling factor for how much the score is adjusted upwards for common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum prefix length considered), otherwise the similarity could exceed 1. The standard value in Winkler's work is 0.1
simw = simj + l p ( 1 - simj )
simw = 0.5397 + 1 * 0.1 ( 1 - 0.5397 ) = 0.5857
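A direct implementation of the two formulas, assuming the standard Jaro/Jaro-Winkler definitions; it reproduces the SOLOMON/SAMUEL values from the worked example:

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    # characters match if equal and within this window of each other
    window = max(len(s1), len(s2)) // 2 - 1
    m1 = [False] * len(s1)
    m2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    sim_j = jaro(s1, s2)
    l = 0                         # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)

# jaro("SOLOMON", "SAMUEL")         → 0.5397 (rounded to 4 places)
# jaro_winkler("SOLOMON", "SAMUEL") → 0.5857 (rounded to 4 places)
```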
Levenshtein Edit Distance
| _ | S | O | L | O | M | O | N |
_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
S | 1 | | | | | | | |
A | 2 | | | | | | | |
M | 3 | | | | | | | |
U | 4 | | | | | | | |
E | 5 | | | | | | | |
L | 6 | | | | | | | |
Levenshtein counts the number of edits required to transform one string into another using insertions, deletions and substitutions
Cell neighbourhood (DUL = diagonal upper-left, U = up, L = left):
DUL | U |
L | i, j |
if (s1[i] == s2[j]):
    grid[i][j] = DUL
else:
    grid[i][j] = min(U, L, DUL) + 1
0 < i ≤ len(s1) ; 0 < j ≤ len(s2)
Levenshtein Edit Distance
| _ | S | O | L | O | M | O | N |
_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
A | 2 | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
M | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 |
U | 4 | 3 | 3 | 3 | 3 | 4 | 4 | 5 |
E | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 5 |
L | 6 | 5 | 5 | 4 | 5 | 5 | 5 | 6 |
Levenshtein counts the number of edits required to transform one string into another using insertions, deletions and substitutions
Cell neighbourhood (DUL = diagonal upper-left, U = up, L = left):
DUL | U |
L | i, j |
if (s1[i] == s2[j]):
    grid[i][j] = DUL
else:
    grid[i][j] = min(U, L, DUL) + 1
0 < i ≤ len(s1) ; 0 < j ≤ len(s2)
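The recurrence can be implemented directly; rows correspond to SAMUEL and columns to SOLOMON, as in the grid above:

```python
def levenshtein(s1, s2):
    # grid[i][j] = edits to turn s1[:j] into s2[:i] (rows = s2, columns = s1)
    grid = [[0] * (len(s1) + 1) for _ in range(len(s2) + 1)]
    for j in range(len(s1) + 1):
        grid[0][j] = j                 # build s1 prefix from empty string
    for i in range(len(s2) + 1):
        grid[i][0] = i                 # delete everything down to empty string
    for i in range(1, len(s2) + 1):
        for j in range(1, len(s1) + 1):
            if s2[i - 1] == s1[j - 1]:
                grid[i][j] = grid[i - 1][j - 1]            # characters agree: copy DUL
            else:
                grid[i][j] = min(grid[i - 1][j],           # U: delete
                                 grid[i][j - 1],           # L: insert
                                 grid[i - 1][j - 1]) + 1   # DUL: substitute
    return grid[len(s2)][len(s1)]

# levenshtein("SOLOMON", "SAMUEL") → 6, the bottom-right cell of the grid above
```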
SoundEx
Source: https://www.researchgate.net/publication/215529387_Cross-language_Phonetic_Similarity_Measure_on_Terms_Appeared_in_Asian_Languages
SoundEx is a phonetic matching algorithm for Anglicised names. It groups similar-sounding letters together and assigns each group a code, so that homophones receive the same representation.
SOLOMON | S455 |
MOHAMMED | M530 |
TIM | T500 |
JOHN | J500 |
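A minimal SoundEx sketch following the standard coding rules (H and W are transparent, vowels separate duplicate codes); this is our illustration, not JeMPI's implementation:

```python
def soundex(name):
    # digit groups for similar-sounding consonants
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.upper()
    result = name[0]                    # keep the first letter as-is
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue                    # H and W do not separate duplicate codes
        digit = codes.get(ch)
        if digit is None:
            prev = ""                   # vowels reset, allowing a repeated code
        elif digit != prev:
            result += digit
            prev = digit
    return (result + "000")[:4]         # pad/truncate to letter + 3 digits

# soundex("SOLOMON") → "S455"; soundex("TIM") → "T500"
# soundex("JOHN")    → "J500"; soundex("MOHAMMED") → "M530"
```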
Double Metaphone
JeMPI Performance Evaluations
Accuracy and Runtime
Synthetic Datasets
Type of Error | Examples |
Missing Value | “” ; Null ; na |
Name misspelling | Camilla >> Camyla |
Edit | Chloe >> Chloé |
OCR error | cl > d ; O > 0 ; rn > m |
Keyboard error | Lisa >> Lida |
Phonetic error | Sean >> Shaun |
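A few of these error types can be sketched as small corruption helpers. These are illustrative only; the generator actually used to build the synthetic datasets is more elaborate:

```python
# OCR confusions from the table above
OCR_SUBS = {"cl": "d", "O": "0", "rn": "m"}

def drop_value(v):
    # Missing Value: the field is simply emptied
    return ""

def edit(v, i):
    # Edit: drop the character at index i
    return v[:i] + v[i + 1:]

def ocr_error(v):
    # OCR error: apply the first applicable confusion once
    for src, dst in OCR_SUBS.items():
        if src in v:
            return v.replace(src, dst, 1)
    return v
```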
Synthetic Datasets
Name of dataset file | Shortened name | Number of Original Records | Total Records | Duplication Ratio |
dataset-test-32-d-005000-020000-dcab-1.csv | 5k:20k | 5 000 | 25 000 | 1:4 |
dataset-test-32-d-020000-080000-dcab.csv | 20k:80k | 20 000 | 100 000 | 1:4 |
dataset-test-32-d-040000-160000-dcab-1.csv | 40k:160k | 40 000 | 200 000 | 1:4 |
dataset-test-32-d-060000-240000-dcab.csv | 60k:240k | 60 000 | 300 000 | 1:4 |
dataset-test-32-d-001000-019000-dcab-1.csv | 1k:19k | 1 000 | 20 000 | 1:19 |
dataset-test-32-d-005000-095000-dcab.csv | 5k:95k | 5 000 | 100 000 | 1:19 |
dataset-test-32-d-020000-380000-dcab-1.csv | 20k:380k | 20 000 | 400 000 | 1:19 |
dataset-test-32-d-050000-950000-dcab.csv | 50k:950k | 50 000 | 1 000 000 | 1:19 |
dataset-test-32-d-005000-200000-dcab-1.csv | 5k:200k | 5 000 | 205 000 | 1:40* |
*
Performance Analyzer
Linkage Scores
True Positives
True Negatives
Recall and Precision
| Ground Truth \ Model Prediction | Non-match (Negative) | Match (Positive) |
| True match | False Negative | True Positive |
| True non-match | True Negative | False Positive |
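From the counts in the confusion matrix, recall, precision and the F-score follow directly. A minimal sketch of the standard definitions:

```python
def recall_precision_f1(tp, fp, fn):
    # recall: share of true matches the model actually found
    recall = tp / (tp + fn)
    # precision: share of predicted matches that are correct
    precision = tp / (tp + fp)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# recall_precision_f1(90, 10, 10) → (0.9, 0.9, 0.9)
```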
Pairwise vs Golden Record Derivation
Pairwise vs Golden Record Derivation
Pairwise vs Golden Record Derivation
Evaluation Results
Pairwise comparisons
Golden record derivations
Strict approach vs Lenient approach
Recall & Precision (False negatives and False positives)
F-score
*Fully automated vs Semi-automated
Evaluation Comparison - Accuracy: Pairwise
Accuracy of SanteMPI for pairwise matching
Evaluation Comparison - Accuracy: Pairwise
Accuracy of JeMPI for pairwise matching
Evaluation Comparison - Accuracy: Golden Records
Accuracy of SanteMPI for Golden Record Derivation employing a semi-automated matching strategy
Evaluation Comparison - Accuracy: Golden Records
Accuracy of JeMPI for Golden Record Derivation employing a fully automated matching strategy
Evaluation Comparison - Computational Time
Name of dataset file | SanteMPI | JeMPI |
dataset-test-32-d-005000-020000-dcab-1.csv | 0:44:23 | 0:10:17 |
dataset-test-32-d-020000-080000-dcab.csv | 3:11:56 | 0:40:10 |
dataset-test-32-d-040000-160000-dcab-1.csv | 7:00:50 | 1:56:31 |
dataset-test-32-d-060000-240000-dcab.csv | 16:57:09 | 3:46:23 |
dataset-test-32-d-001000-019000-dcab-1.csv | 0:43:16 | 0:03:28 |
dataset-test-32-d-005000-095000-dcab.csv | 18:23:31 | 0:35:59 |
dataset-test-32-d-020000-380000-dcab-1.csv | 35:57:19 | 2:25:59 |
dataset-test-32-d-050000-950000-dcab.csv | 174:59:25 | 9:48:01 |
dataset-test-32-d-005000-200000-dcab-1.csv | 39:30:00 | 0:54:11 |
Evaluation Comparison - Computational Time
JeMPI Development Roadmap
EM 2.0
UI: Front page log in
UI: Simple Search
UI: Import
UI: Custom search
UI: Adjudicate records
JeMPI Demo
Questions