1 of 66

Journey towards the JeMPI

Jembi’s MPI Solution

2 of 66

Acknowledgements

3 of 66

Presentation breakdown

Intro (2min)

How it started (3min)

Use cases

JeMPI Overview (25min)

Architecture (5min)

Tech Stack (5min)

Engine (?)

Info flow

Strategies: Probabilistic and Deterministic (3min)

Blocking (3min)

Machine learning & Fellegi Sunter (10min)

Performance Evaluation (25min)

Synthetic Data (3min)

Recall & Precision (3min)

Pairwise and Golden Record derivation (5min)

Accuracy & Run time (5min)

SanteMPI (reference point)

JeMPI performance

Fully automated and Semi automated (2min)

Further testing and analysis (same system, real life data accuracy) (3min)

Future plans (5min)

UI

Biometrics

Improvements to the linking

EM optimisations w/ flexible matching

NN for matching accuracy comparison

Security

Synthetic dataset generator Improvements

MPI name

Demo (15min)

Q&A (10min)

2+3+25+25+5+15 = 75

4 of 66

Overview

  • This creates a comprehensive, longitudinal view of patient healthcare records that allows for better patient estimations and better decision making at national and individual levels.
  • Patient matching is the process of identifying and linking patient data within and across different health systems.
  • A master patient index performs patient matching and stores the links between patients and all their associated records in a centralised database for querying.

5 of 66

How it started

6 of 66

How it started

MPI Toolkit:

  • Synthetic Data Generator
  • Blocking Notebook
  • FastLink R Notebook
  • MPI Interim Solution
  • Custom MPI Performance Analyser

7 of 66

How it started

Summary report to evaluate which system best suits the DISI requirements:

Non-functional

  • Software architecture
  • Interoperability/ Integration
  • Development
  • Interfaces
  • Security
  • Auditing
  • Error management
  • Performance

Functional

  • CR core functions (search, add, update)
  • Matching algorithms and configurability of rules
  • Resolving duplicates
  • Linking/unlinking
  • Reporting

8 of 66

Where it is now

  • In-house designed MPI engine that is currently in development
  • Open-Source standards-compliant MPI
  • Designed using key architecture elements, established through analysis of current DISI focused use cases
    • Unique patient matching methods
    • Utilization of a graph database and other key tooling
    • Incorporates machine learning

9 of 66

Where it is now

  • In-house designed MPI engine that is currently in development
  • Open-Source standards-compliant MPI
  • Pioneered upon afro-centric business requirements
  • Designed using key architecture elements, established through analysis of current African focused use cases
    • Unique patient matching methods
    • Utilization of a graph database and other key tooling
    • Incorporates machine learning

10 of 66

JeMPI in comparison

JeMPI

11 of 66

JeMPI in context

12 of 66

JeMPI Architecture: JeMPI Engine

1. Data is entered via the input component in batch form

2. Staging takes the batch import, applies data cleaning and sends individual records to the controller

3. Controller receives records, validates and makes decisions, and controls the flow rate of information

4. Updates to the Expectation-Maximisation algorithm for parameter refinement

5. Records go to the linker. It fetches blocked candidate records from the persistent storage, and either links query records to existing golden records or creates new golden records

6. Golden records are stored in the persistent storage

7. The journal stores all transactions by the persistent storage as a disaster recovery method

8. The API allows queries to the persistent storage. Search, update, delete, etc.

13 of 66

JeMPI Architecture: JeMPI Product

  • View records
  • User input
  • Record updates
  • Configuration
  • Statistics
  • Notifications
  • Admin control Monitoring

14 of 66

JeMPI Architecture: JeMPI in relation

15 of 66

JeMPI Architecture: Tech Stack

16 of 66

JeMPI Architecture: Mid-Level (synchronous)

17 of 66

JeMPI Architecture: Mid-Level (asynchronous)

18 of 66

Features: Blocking

Query record + Blocking rules

Candidate List

19 of 66

Features: Blocking

Sex.equals(“male”)

20 of 66

Features: Blocking

Sex.equals(“male”) && Age.between(25,30)

21 of 66

Features: Blocking

Sex.equals(“male”) && Age.between(25,30) && City.fuzzy_equals(“kampala”,0.8)

22 of 66

Features: Blocking

Sex.equals(“male”) && Age.between(25,30) && City.fuzzy_equals(“kampala”,0.8) && FirstName.startsWith(“s”)

23 of 66

Features: Blocking

Sex.equals(“male”) && Age.between(25,30) && City.fuzzy_equals(“kampala”,0.8) && FirstName.startsWith(“s”) && LastName.startsWith(“m”)

24 of 66

Features: Blocking

IF blocking returns an empty list, a second round of blocking with broader rules applies:

25 of 66

Features: Blocking

Sex.equals(“male”) && ( FirstName.fuzzy_equals(“solomon”,0.75) | | LastName.fuzzy_equals(“mujuni”,0.75) )

26 of 66

Features: Deterministic & Probabilistic Matching

Deterministic

Probabilistic

Fellegi-Sunter�Sim-Sum

Weighted Averages

Exact Match (intersection rules)�A ⋂ B ⋂ C ⋂ D

Exact Match �(union rules)�(A ⋂ B ⋂ C) ⋃ D

Fuzzy Matching

Speed

Accuracy

Speed

Accuracy

27 of 66

Features: Deterministic & Probabilistic Matching

?

National ID

First name

Last name

Date of birth

Sex

City

Phone number

28 of 66

Features: Deterministic & Probabilistic Matching

?

National ID

First name

Last name

Date of birth

Sex

City

Phone number

{deterministic: {

conditions: {

nationalID: “exactMatch”,

firstName: [“jarowinkler”, 0.85],

lastName: [“jarowinkler”, 0.85]

}

}

}

GR0803

GR0975

GR1006

GR1101

GR1305

GR1389

GR1629

GR2045

29 of 66

Features: Deterministic & Probabilistic Matching

?

National ID

First name

Last name

Date of birth

Sex

City

Phone number

{probabilistic: {

conditions: {

attributes: “all”,

algorithm: “fellegiSunter”,

threshold: 0.65

}

}

}

0.4562

0.5983

0.8892

0.3254

0.2812

0.6940

0.4839

0.6329

30 of 66

The Landscape - Open Source Systems Reviewed

Features: Fellegi-Sunter

National ID

First name

Last name

Date of birth

Sex

City

Phone number

Query Record

9101065679172

Solomun

Mujhuni

1991-06-01

Male

Kampala

+256 123 4567

GR1006

9101065679172

Solomon

Mujuni

1991-01-06

Male

Mbale

+256 987 6543

31 of 66

The Landscape - Open Source Systems Reviewed

Features: Fellegi-Sunter

National ID

First name

Last name

Date of birth

Sex

City

Phone number

Query Record

9101065679172

Solomun

Mujhuni

1991-06-01

Male

Kampala

+256 123 4567

GR1006

9101065679172

Solomon

Mujuni

1991-01-06

Male

Mbale

+256 987 6543

32 of 66

The Landscape - Open Source Systems Reviewed

Features: Fellegi-Sunter

National ID

First name

Last name

Date of birth

Sex

City

Phone number

Query Record

9101065679172

Solomun

Mujhuni

1991-06-01

Male

Kampala

+256 123 4567

GR1006

9101065679172

Solomon

Mujuni

1991-01-06

Male

Mbale

+256 987 6543

Jaro-�Winkler

1.000

0.952

0.967

0.979

1.000

0.562

0.502

33 of 66

The Landscape - Open Source Systems Reviewed

Features: Fellegi-Sunter

Matching values (m) represent the quality of the data

m = Pr(attrM | sameRec)

Unmatching values (u) represent the uniqueness of the data

u = Pr(attrM | diffRec)

34 of 66

The Landscape - Open Source Systems Reviewed

Features: Machine Learning - Expectation Maximisation

  • Unsupervised machine learning
  • An efficient, iterative procedure
  • Adapts parameters according to input data, and provides updates if more data is provided
  • Low accuracy loss when using a randomised subset of only 10% of total population data
  • Reduced accuracy when using very low quality data (original to duplicate ratio of up to 1:40)
  • Computationally intensive

35 of 66

The Landscape - Open Source Systems Reviewed

Features: Fellegi-Sunter

National ID

First name

Last name

Date of birth

Sex

City

Phone number

Query Record

9101065679172

Solomun

Mujhuni

1991-06-01

Male

Kampala

+256 123 4567

GR1006

9101065679172

Solomon

Mujuni

1991-01-06

Male

Mbale

+256 987 6543

m ;�u

0.964 ; 0.001

0.698 ; 0.003

0.772 ; 0.004

0.965 ; 0.050

0.999 ; 0.501

0.999 ; 0.045

0.985 ; 0.004

Matching

13.943

7.812

7.707

4.280

0.996

4.462

7.981

Unmatching

-4.809

-1.723

-2.216

-4.771

-16.439

-4.186

-6.097

36 of 66

Attribute comparison strategies

37 of 66

The Landscape - Open Source Systems Reviewed

Oversimplification

National ID

First name

Last name

Date of birth

Sex

City

Phone number

Query Record

9101065679172

Solomun

Mujhuni

1991-06-01

Male

Kampala

+256 123 4567

GR1006

9101065679172

Solomon

Mujuni

1991-01-06

Male

Mbale

+256 987 6543

Jaro-�Winkler

1.000

0.952

0.967

0.979

1.000

0.562

0.502

38 of 66

The Landscape - Open Source Systems Reviewed

Jaro String Similarity

  • s1 and s2 is the length of string 1 and string 2
  • m is the number of matching* characters
  • t is the number of transpositions

S

O

L

O

M

O

N

S

A

M

U

E

L

s1 = 7

s2 = 6

*

m = 2

t = 0

39 of 66

The Landscape - Open Source Systems Reviewed

Jaro-Winkler String Similarity

simj = Jaro Similarity = 0.5397

l = the length of common prefix at the start of the string up to a maximum of 4 characters

p = constant scaling factor for how much the score is adjusted upwards for having common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum length of the prefix being considered), otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is 0.1

simw = simj + l p ( 1 - simj )

simw = 0.5397 + 1 * 0.1 ( 1 - 0.5397 ) = 0.5857

40 of 66

The Landscape - Open Source Systems Reviewed

Levenshtein Edit Distance

_

S

O

L

O

M

O

N

_

0

1

2

3

4

5

6

7

S

1

A

2

M

3

U

4

E

5

L

6

Levenshtein counts the number of edits required to transform one string into another using inserts, deletes and swaps

  • Tim -> Tom (swap ‘i’ for ‘o’)
  • Tim -> Timmy (insert ‘m’ and ‘y’)
  • Timmy -> Tim (delete an ‘m’ and ‘y’)

DUL

U

L

i, j

if (s1[i] == s2[j]):

gridij = min(U, L, DUL)

else:

gridij = min(U, L, DUL) + 1

0 < i ≤ len(s1) ; 0 < j ≤ len(s2)

41 of 66

The Landscape - Open Source Systems Reviewed

Levenshtein Edit Distance

_

S

O

L

O

M

O

N

_

0

1

2

3

4

5

6

7

S

1

0

1

2

3

4

5

6

A

2

1

1

2

3

4

5

6

M

3

2

2

2

3

3

4

5

U

4

3

3

3

3

4

4

5

E

5

4

4

4

4

4

5

5

L

6

5

5

4

5

5

5

6

Levenshtein counts the number of edits required to transform one string into another using inserts, deletes and swaps

  • Tim -> Tom (swap ‘i’ for ‘o’)
  • Tim -> Timmy (insert ‘m’ and ‘y’)
  • Timmy -> Tim (delete an ‘m’ and ‘y’)

DUL

U

L

i, j

if (s1[i] == s2[j]):

gridij = min(U, L, DUL)

else:

gridij = min(U, L, DUL) + 1

0 < i < len(s1) ; 0 < j < len(s2)

42 of 66

The Landscape - Open Source Systems Reviewed

SoundEx

https://www.researchgate.net/publication/215529387_Cross-language_Phonetic_Similarity_Measure_on_Terms_Appeared_in_Asian_Languages

SoundEx is a phonetic matching algorithm for Anglicised names that groups similar sounding letters together and assigns them a code in order for homophones to be assigned the same representation.

  1. Keep the first letter of the name and drop all other occurrences of group 0
  2. Replace consonants with digits according to the table
    1. If multiple letters with the same digit are adjacent in the original name, only retain the first letter
  3. If there are too few letters in the word to assign three numbers, append zeros until there are three numbers. If there are four or more numbers, retain only the first three.

SOLOMON

S455

MOHAMMED

M053

TIM

T500

JOHN

J500

43 of 66

The Landscape - Open Source Systems Reviewed

Double Metaphone

  1. Drop duplicate adjacent letters, except for C.
  2. If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
  3. Drop 'B' if after 'M' at the end of the word.
  4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
  5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
  6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' and is at the end.
  7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'.
  8. Drop 'H' if after vowel and not before a vowel.
  9. 'CK' transforms to 'K'.
  10. 'PH' transforms to 'F'.
  11. 'Q' transforms to 'K'.
  12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
  13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
  14. 'V' transforms to 'F'.
  15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
  16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
  17. Drop 'Y' if not followed by a vowel.
  18. 'Z' transforms to 'S'.
  19. Drop all vowels unless it is the beginning.

44 of 66

JeMPI Performance Evaluations

Accuracy and Runtime

45 of 66

Synthetic Datasets

  • Used to generate sample datasets to help aid development and assess in-country name spaces and attributes.
  • Advantages of this tooling is:
    • the ability to corrupt patient identifiers to imitate real life data quality issues
    • produce synthetic datasets at scale
    • establish a reliable ground truth with minimal effort
    • no privacy laws involved
  • Can be tailored to the country name space and key attribute requirements

Type of Error

Examples

Missing Value

“” ; Null ; na

Name Mis-spell

Camilla >> Camyla

Edit

Chloe >> Chloé

OCR error

cl > d ; O > 0 ; rn > m

Keyboard error

Lisa >> Lida

Phonetic error

Sean >> Shaun

46 of 66

Synthetic Datasets

Name of dataset file

Shortened name

Number of Original Records

Total Records

Duplication Ratio

dataset-test-32-d-005000-020000-dcab-1.csv

5k:20k

5 000

25 000

1: 4

dataset-test-32-d-020000-080000-dcab.csv

20k:80k

20 000

100 000

1: 4

dataset-test-32-d-040000-160000-dcab-1.csv

40k:160k

40 000

200 000

1: 4

dataset-test-32-d-060000-240000-dcab.csv

60k:240k

60 000

300 000

1: 4

dataset-test-32-d-001000-019000-dcab-1.csv

1k:19k

1 000

20 000

1:19

dataset-test-32-d-005000-095000-dcab.csv

5k:95k

5 000

100 000

1:19

dataset-test-32-d-020000-380000-dcab-1.csv

20k:380k

20 000

400 000

1:19

dataset-test-32-d-050000-950000-dcab.csv

50k:950k

50 000

1 000 000

1:19

dataset-test-32-d-005000-200000-dcab-1.csv

5k:200k

5 000

205 000

1:40*

*

47 of 66

Performance Analyzer

  • Validates the accuracy of the MPI and ability to link patient records
  • Can be tailored to validate the results of other MPI Solutions
  • Compares the unique identifiers assigned to each patient record by the synthetic dataset generator

48 of 66

Linkage Scores

True Positives

True Negatives

Recall and Precision

Model Prediction

Non-match (Negative)

Match (Positive)

Ground Truth

True match

False Negative

True Positive

True non-match

True Negative

False Positive

Model positive

Model negative

49 of 66

Pairwise vs Golden Record Derivation

50 of 66

Pairwise vs Golden Record Derivation

51 of 66

Pairwise vs Golden Record Derivation

52 of 66

Evaluation Results

Pairwise comparisons

Golden record derivations

Strict approach vs Lenient approach

Recall & Precision (False negatives and False positives)

F-score

*Fully automated vs Semi-automated

53 of 66

Evaluation Comparison - Accuracy: Pairwise

Accuracy of SanteMPI for pairwise matching

54 of 66

Evaluation Comparison - Accuracy: Pairwise

Accuracy of JeMPI for pairwise matching

55 of 66

Evaluation Comparison - Accuracy: Golden Records

Accuracy of SanteMPI for Golden Record Derivation employing a semi-automated matching strategy

56 of 66

Evaluation Comparison - Accuracy: Golden Records

Accuracy of JeMPI for Golden Record Derivation employing a fully automated matching strategy

57 of 66

Evaluation Comparison - Computational Time

Name of dataset file

SanteMPI

JeMPI

dataset-test-32-d-005000-020000-dcab-1.csv

0:44:23

0:10:17

dataset-test-32-d-020000-080000-dcab.csv

3:11:56

0:40:10

dataset-test-32-d-040000-160000-dcab-1.csv

7:00:50

1:56:31

dataset-test-32-d-060000-240000-dcab.csv

16:57:09

3:46:23

dataset-test-32-d-001000-019000-dcab-1.csv

0:43:16

0:03:28

dataset-test-32-d-005000-095000-dcab.csv

18:23:31

0:35:59

dataset-test-32-d-020000-380000-dcab-1.csv

35:57:19

2:25:59

dataset-test-32-d-050000-950000-dcab.csv

174:59:25

9:48:01

dataset-test-32-d-005000-200000-dcab-1.csv

39:30:00

0:54:11

58 of 66

Evaluation Comparison - Computational Time

59 of 66

JeMPI Development Roadmap

EM 2.0

60 of 66

UI: Front page log in

61 of 66

UI: Simple Search

62 of 66

UI: Import

63 of 66

UI: Custom search

64 of 66

UI: Adjudicate records

65 of 66

JeMPI Demo

66 of 66

Questions