Journey towards the JeMPI
Jembi’s MPI Solution
Acknowledgements
Presentation breakdown
Intro (2min)
How it started (3min)
Use cases
JeMPI Overview (25min)
Architecture (5min)
Tech Stack (5min)
Engine (?)
Info flow
Strategies: Probabilistic and Deterministic (3min)
Blocking (3min)
Machine learning & Fellegi Sunter (10min)
Performance Evaluation (25min)
Synthetic Data (3min)
Recall & Precision (3min)
Pairwise and Golden Record derivation (5min)
Accuracy & Run time (5min)
SanteMPI (reference point)
JeMPI performance
Fully automated and Semi-automated (2min)
Further testing and analysis (same system, real life data accuracy) (3min)
Future plans (5min)
UI
Biometrics
Improvements to the linking
EM optimisations w/ flexible matching
NN for matching accuracy comparison
Security
Synthetic dataset generator Improvements
MPI name
Demo (15min)
Q&A (10min)
2+3+25+25+5+15 = 75
Overview
How it started
MPI Toolkit:
How it started
Summary report to evaluate which system best suits the DISI requirements:
Non-functional
Functional
Where it is now
Source: OpenHIE Partners Page
JeMPI in comparison
JeMPI
JeMPI in context
JeMPI Architecture: JeMPI Engine
1. Data is entered via the input component in batch form
2. Staging takes the batch import, applies data cleaning and sends individual records to the controller
3. Controller receives records, validates and makes decisions, and controls the flow rate of information
4. The controller also sends updates to the Expectation-Maximisation algorithm for parameter refinement
5. Records go to the linker. It fetches blocked candidate records from the persistent storage, and either links query records to existing golden records or creates new golden records
6. Golden records are stored in the persistent storage
7. The journal records every transaction applied to the persistent storage, serving as a disaster-recovery mechanism
8. The API allows queries to the persistent storage. Search, update, delete, etc.
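The flow above can be sketched in miniature. This is an illustrative sketch only: `stage`, `similarity`, `link`, the field names and the golden-record id format are our assumptions, not JeMPI's actual API, and the comparator is a naive field-equality score.

```python
# Sketch of the batch flow above; names and comparator are illustrative, not JeMPI's API.
def stage(batch):
    # 2. staging: clean each record (trim, lowercase), then hand records on individually
    return [{k: v.strip().lower() for k, v in rec.items()} for rec in batch]

def similarity(a, b):
    # toy comparator: fraction of fields that agree exactly
    return sum(a[k] == b.get(k) for k in a) / len(a)

def link(record, golden_records, threshold=0.9):
    # 5. linker: compare against candidate golden records,
    #    link to the best match or create a new golden record
    best_id, best_score = None, 0.0
    for gid, gr in golden_records.items():
        score = similarity(record, gr)
        if score > best_score:
            best_id, best_score = gid, score
    if best_id is not None and best_score >= threshold:
        return best_id                             # link to an existing golden record
    new_id = f"GR{len(golden_records):04d}"        # 6. store a new golden record
    golden_records[new_id] = dict(record)
    return new_id
```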
JeMPI Architecture: JeMPI Product
JeMPI Architecture: JeMPI in relation
JeMPI Architecture: Tech Stack
JeMPI Architecture: Mid-Level (synchronous)
JeMPI Architecture: Mid-Level (asynchronous)
Features: Blocking
Query record + Blocking rules
Candidate List
Features: Blocking
Sex.equals("male")
Features: Blocking
Sex.equals("male") && Age.between(25,30)
Features: Blocking
Sex.equals("male") && Age.between(25,30) && City.fuzzy_equals("kampala",0.8)
Features: Blocking
Sex.equals("male") && Age.between(25,30) && City.fuzzy_equals("kampala",0.8) && FirstName.startsWith("s")
Features: Blocking
Sex.equals("male") && Age.between(25,30) && City.fuzzy_equals("kampala",0.8) && FirstName.startsWith("s") && LastName.startsWith("m")
Features: Blocking
If blocking returns an empty list, a second round of blocking with broader rules applies:
Features: Blocking
Sex.equals("male") && ( FirstName.fuzzy_equals("solomon",0.75) || LastName.fuzzy_equals("mujuni",0.75) )
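The two-round blocking above can be sketched as follows. `fuzzy_equals` here is a stand-in built on Python's `difflib`, not JeMPI's actual comparator, and the rule values are taken from the example query on the slides.

```python
from difflib import SequenceMatcher

def fuzzy_equals(a, b, threshold):
    # stand-in for JeMPI's fuzzy comparator (SequenceMatcher ratio, not Jaro-Winkler)
    return SequenceMatcher(None, a, b).ratio() >= threshold

def block(records):
    # Round 1: strict conjunction of blocking rules
    candidates = [r for r in records
                  if r["sex"] == "male"
                  and 25 <= r["age"] <= 30
                  and fuzzy_equals(r["city"], "kampala", 0.8)
                  and r["first_name"].startswith("s")
                  and r["last_name"].startswith("m")]
    if candidates:
        return candidates
    # Round 2: if round 1 returns nothing, fall back to broader disjunctive rules
    return [r for r in records
            if r["sex"] == "male"
            and (fuzzy_equals(r["first_name"], "solomon", 0.75)
                 or fuzzy_equals(r["last_name"], "mujuni", 0.75))]
```

The fallback trades precision for recall: a too-strict first round would otherwise return no candidates at all.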
Features: Deterministic & Probabilistic Matching
Deterministic | Probabilistic
Exact Match (intersection rules): A ⋂ B ⋂ C ⋂ D | Fellegi-Sunter Sim-Sum
Exact Match (union rules): (A ⋂ B ⋂ C) ⋃ D | Weighted Averages
Fuzzy Matching |
Trade-off: Speed vs Accuracy | Trade-off: Speed vs Accuracy
Features: Deterministic & Probabilistic Matching
?
National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Features: Deterministic & Probabilistic Matching
?
National ID | First name | Last name | Date of birth | Sex | City | Phone number |
{ "deterministic": {
    "conditions": {
      "nationalID": "exactMatch",
      "firstName": ["jarowinkler", 0.85],
      "lastName": ["jarowinkler", 0.85]
    }
  }
}
GR0803
GR0975
GR1006
GR1101
GR1305
GR1389
GR1629
GR2045
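A sketch of how such a deterministic configuration might be evaluated against one candidate golden record. The AND-combination of the three conditions and the use of `difflib`'s ratio as a stand-in for the Jaro-Winkler comparator named in the config are our assumptions:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for the "jarowinkler" comparator named in the config
    return SequenceMatcher(None, a, b).ratio()

def deterministic_match(query, candidate):
    # all conditions from the config must hold (assumed AND semantics)
    return (query["nationalID"] == candidate["nationalID"]                     # exactMatch
            and similarity(query["firstName"], candidate["firstName"]) >= 0.85
            and similarity(query["lastName"], candidate["lastName"]) >= 0.85)
```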
Features: Deterministic & Probabilistic Matching
?
National ID | First name | Last name | Date of birth | Sex | City | Phone number |
{ "probabilistic": {
    "conditions": {
      "attributes": "all",
      "algorithm": "fellegiSunter",
      "threshold": 0.65
    }
  }
}
0.4562
0.5983
0.8892
0.3254
0.2812
0.6940
0.4839
0.6329
The Landscape - Open Source Systems Reviewed
Features: Fellegi-Sunter
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
Features: Fellegi-Sunter
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
Jaro-Winkler | 1.000 | 0.952 | 0.967 | 0.979 | 1.000 | 0.562 | 0.502 |
Features: Fellegi-Sunter
Matching probabilities (m) represent the quality of the data
m = Pr(attrM | sameRec)
Unmatching probabilities (u) represent the uniqueness of the data
u = Pr(attrM | diffRec)
Features: Machine Learning - Expectation Maximisation
Features: Fellegi-Sunter
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
m ; u | 0.964 ; 0.001 | 0.698 ; 0.003 | 0.772 ; 0.004 | 0.965 ; 0.050 | 0.999 ; 0.501 | 0.999 ; 0.045 | 0.985 ; 0.004 |
Matching | 13.943 | 7.812 | 7.707 | 4.280 | 0.996 | 4.462 | 7.981 |
Unmatching | -4.809 | -1.723 | -2.216 | -4.771 | -16.439 | -4.186 | -6.097 |
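The weights in the table follow directly from m and u: log2(m/u) when an attribute agrees, log2((1-m)/(1-u)) when it disagrees, and the record-level score is the sum of one weight per attribute. A minimal sketch (the slide's figures can differ in the last digits because m and u are shown rounded):

```python
from math import log2

def fs_weights(m, u):
    # Fellegi-Sunter: agreement weight if the attribute matches,
    # disagreement weight if it does not; a record pair's score is
    # the sum of the applicable weight for each attribute.
    return log2(m / u), log2((1 - m) / (1 - u))

# sex attribute from the table: m=0.999, u=0.501
sex_agree, _ = fs_weights(0.999, 0.501)
# first-name attribute from the table: m=0.698, u=0.003
_, name_disagree = fs_weights(0.698, 0.003)
# sex_agree ≈ 0.996, name_disagree ≈ -1.723 (as in the table)
```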
Attribute comparison strategies
Oversimplification
| National ID | First name | Last name | Date of birth | Sex | City | Phone number |
Query Record | 9101065679172 | Solomun | Mujhuni | 1991-06-01 | Male | Kampala | +256 123 4567 |
GR1006 | 9101065679172 | Solomon | Mujuni | 1991-01-06 | Male | Mbale | +256 987 6543 |
Jaro-Winkler | 1.000 | 0.952 | 0.967 | 0.979 | 1.000 | 0.562 | 0.502 |
Jaro String Similarity
S | O | L | O | M | O | N |
S | A | M | U | E | L |
s1 = 7
s2 = 6
m = 2
t = 0
Jaro-Winkler String Similarity
simj = Jaro Similarity = 0.5397
l = the length of the common prefix at the start of the two strings, up to a maximum of 4 characters
p = a constant scaling factor for how much the score is adjusted upwards for common prefixes. p should not exceed 0.25 (i.e. 1/4, with 4 being the maximum prefix length considered), otherwise the similarity could exceed 1. The standard value in Winkler's work is 0.1
simw = simj + l p ( 1 - simj )
simw = 0.5397 + 1 * 0.1 ( 1 - 0.5397 ) = 0.5857
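A direct implementation of the two formulas, assuming the standard Jaro/Jaro-Winkler definitions; it reproduces the SOLOMON/SAMUEL values from the worked example:

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    # characters match if equal and within this window of each other
    window = max(len(s1), len(s2)) // 2 - 1
    m1 = [False] * len(s1)
    m2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    sim_j = jaro(s1, s2)
    l = 0                         # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim_j + l * p * (1 - sim_j)

# jaro("SOLOMON", "SAMUEL")         → 0.5397 (rounded to 4 places)
# jaro_winkler("SOLOMON", "SAMUEL") → 0.5857 (rounded to 4 places)
```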
Levenshtein Edit Distance
| _ | S | O | L | O | M | O | N |
_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
S | 1 | | | | | | | |
A | 2 | | | | | | | |
M | 3 | | | | | | | |
U | 4 | | | | | | | |
E | 5 | | | | | | | |
L | 6 | | | | | | | |
Levenshtein counts the number of edits required to transform one string into another using insertions, deletions and substitutions
Cell neighbourhood (DUL = diagonal upper-left, U = up, L = left):
DUL | U |
L | i, j |
if (s1[i] == s2[j]):
    grid[i][j] = DUL
else:
    grid[i][j] = min(U, L, DUL) + 1
0 < i ≤ len(s1) ; 0 < j ≤ len(s2)
Levenshtein Edit Distance
| _ | S | O | L | O | M | O | N |
_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
A | 2 | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
M | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 |
U | 4 | 3 | 3 | 3 | 3 | 4 | 4 | 5 |
E | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 5 |
L | 6 | 5 | 5 | 4 | 5 | 5 | 5 | 6 |
Levenshtein counts the number of edits required to transform one string into another using insertions, deletions and substitutions
Cell neighbourhood (DUL = diagonal upper-left, U = up, L = left):
DUL | U |
L | i, j |
if (s1[i] == s2[j]):
    grid[i][j] = DUL
else:
    grid[i][j] = min(U, L, DUL) + 1
0 < i ≤ len(s1) ; 0 < j ≤ len(s2)
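The recurrence can be implemented directly; rows correspond to SAMUEL and columns to SOLOMON, as in the grid above:

```python
def levenshtein(s1, s2):
    # grid[i][j] = edits to turn s1[:j] into s2[:i] (rows = s2, columns = s1)
    grid = [[0] * (len(s1) + 1) for _ in range(len(s2) + 1)]
    for j in range(len(s1) + 1):
        grid[0][j] = j                 # build s1 prefix from empty string
    for i in range(len(s2) + 1):
        grid[i][0] = i                 # delete everything down to empty string
    for i in range(1, len(s2) + 1):
        for j in range(1, len(s1) + 1):
            if s2[i - 1] == s1[j - 1]:
                grid[i][j] = grid[i - 1][j - 1]            # characters agree: copy DUL
            else:
                grid[i][j] = min(grid[i - 1][j],           # U: delete
                                 grid[i][j - 1],           # L: insert
                                 grid[i - 1][j - 1]) + 1   # DUL: substitute
    return grid[len(s2)][len(s1)]

# levenshtein("SOLOMON", "SAMUEL") → 6, the bottom-right cell of the grid above
```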
SoundEx
Source: https://www.researchgate.net/publication/215529387_Cross-language_Phonetic_Similarity_Measure_on_Terms_Appeared_in_Asian_Languages
SoundEx is a phonetic matching algorithm for Anglicised names. It groups similar-sounding letters together and assigns each group a code, so that homophones receive the same representation.
SOLOMON | S455 |
MOHAMMED | M530 |
TIM | T500 |
JOHN | J500 |
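A minimal SoundEx sketch following the standard coding rules (H and W are transparent, vowels separate duplicate codes); this is our illustration, not JeMPI's implementation:

```python
def soundex(name):
    # digit groups for similar-sounding consonants
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.upper()
    result = name[0]                    # keep the first letter as-is
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue                    # H and W do not separate duplicate codes
        digit = codes.get(ch)
        if digit is None:
            prev = ""                   # vowels reset, allowing a repeated code
        elif digit != prev:
            result += digit
            prev = digit
    return (result + "000")[:4]         # pad/truncate to letter + 3 digits

# soundex("SOLOMON") → "S455"; soundex("TIM") → "T500"
# soundex("JOHN")    → "J500"; soundex("MOHAMMED") → "M530"
```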
Double Metaphone
JeMPI Performance Evaluations
Accuracy and Runtime
Synthetic Datasets
Type of Error | Examples |
Missing Value | “” ; Null ; na |
Name misspelling | Camilla >> Camyla |
Edit | Chloe >> Chloé |
OCR error | cl > d ; O > 0 ; rn > m |
Keyboard error | Lisa >> Lida |
Phonetic error | Sean >> Shaun |
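A few of these error types can be sketched as small corruption helpers. These are illustrative only; the generator actually used to build the synthetic datasets is more elaborate:

```python
# OCR confusions from the table above
OCR_SUBS = {"cl": "d", "O": "0", "rn": "m"}

def drop_value(v):
    # Missing Value: the field is simply emptied
    return ""

def edit(v, i):
    # Edit: drop the character at index i
    return v[:i] + v[i + 1:]

def ocr_error(v):
    # OCR error: apply the first applicable confusion once
    for src, dst in OCR_SUBS.items():
        if src in v:
            return v.replace(src, dst, 1)
    return v
```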
Synthetic Datasets
Name of dataset file | Shortened name | Number of Original Records | Total Records | Duplication Ratio |
dataset-test-32-d-005000-020000-dcab-1.csv | 5k:20k | 5 000 | 25 000 | 1:4 |
dataset-test-32-d-020000-080000-dcab.csv | 20k:80k | 20 000 | 100 000 | 1:4 |
dataset-test-32-d-040000-160000-dcab-1.csv | 40k:160k | 40 000 | 200 000 | 1:4 |
dataset-test-32-d-060000-240000-dcab.csv | 60k:240k | 60 000 | 300 000 | 1:4 |
dataset-test-32-d-001000-019000-dcab-1.csv | 1k:19k | 1 000 | 20 000 | 1:19 |
dataset-test-32-d-005000-095000-dcab.csv | 5k:95k | 5 000 | 100 000 | 1:19 |
dataset-test-32-d-020000-380000-dcab-1.csv | 20k:380k | 20 000 | 400 000 | 1:19 |
dataset-test-32-d-050000-950000-dcab.csv | 50k:950k | 50 000 | 1 000 000 | 1:19 |
dataset-test-32-d-005000-200000-dcab-1.csv | 5k:200k | 5 000 | 205 000 | 1:40* |
*
Performance Analyzer
Linkage Scores
True Positives
True Negatives
Recall and Precision
| Ground Truth \ Model Prediction | Non-match (Negative) | Match (Positive) |
| True match | False Negative | True Positive |
| True non-match | True Negative | False Positive |
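From the counts in the confusion matrix, recall, precision and the F-score follow directly. A minimal sketch of the standard definitions:

```python
def recall_precision_f1(tp, fp, fn):
    # recall: share of true matches the model actually found
    recall = tp / (tp + fn)
    # precision: share of predicted matches that are correct
    precision = tp / (tp + fp)
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# recall_precision_f1(90, 10, 10) → (0.9, 0.9, 0.9)
```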
Pairwise vs Golden Record Derivation
Pairwise vs Golden Record Derivation
Pairwise vs Golden Record Derivation
Evaluation Results
Pairwise comparisons
Golden record derivations
Strict approach vs Lenient approach
Recall & Precision (False negatives and False positives)
F-score
*Fully automated vs Semi-automated
Evaluation Comparison - Accuracy: Pairwise
Accuracy of SanteMPI for pairwise matching
Evaluation Comparison - Accuracy: Pairwise
Accuracy of JeMPI for pairwise matching
Evaluation Comparison - Accuracy: Golden Records
Accuracy of SanteMPI for Golden Record Derivation employing a semi-automated matching strategy
Evaluation Comparison - Accuracy: Golden Records
Accuracy of JeMPI for Golden Record Derivation employing a fully automated matching strategy
Evaluation Comparison - Computational Time
Name of dataset file | SanteMPI | JeMPI |
dataset-test-32-d-005000-020000-dcab-1.csv | 0:44:23 | 0:10:17 |
dataset-test-32-d-020000-080000-dcab.csv | 3:11:56 | 0:40:10 |
dataset-test-32-d-040000-160000-dcab-1.csv | 7:00:50 | 1:56:31 |
dataset-test-32-d-060000-240000-dcab.csv | 16:57:09 | 3:46:23 |
dataset-test-32-d-001000-019000-dcab-1.csv | 0:43:16 | 0:03:28 |
dataset-test-32-d-005000-095000-dcab.csv | 18:23:31 | 0:35:59 |
dataset-test-32-d-020000-380000-dcab-1.csv | 35:57:19 | 2:25:59 |
dataset-test-32-d-050000-950000-dcab.csv | 174:59:25 | 9:48:01 |
dataset-test-32-d-005000-200000-dcab-1.csv | 39:30:00 | 0:54:11 |
Evaluation Comparison - Computational Time
JeMPI Development Roadmap
EM 2.0
UI: Front page log in
UI: Simple Search
UI: Import
UI: Custom search
UI: Adjudicate records
JeMPI Demo
Questions