Analysis of the Genealogy Process in Forensic Genetic Genealogy
Lawrence M. Wein
Mine Su Erturk
Graduate School of Business, Stanford University
Colleen Fitzpatrick
Identifinders
Margaret Press
DNA Doe Project
CRIME RESEARCH AGENDA
Reduce Type I errors with respect to incarceration
- Reduce overcrowding in jails, PLOS ONE, NY Times 2015
- Effect of judge rotation on criminal sentencing
CRIME RESEARCH AGENDA
Reduce Type I errors with respect to incarceration
- Reduce overcrowding in jails, PLOS ONE, NY Times 2015
- Effect of judge rotation on criminal sentencing
Reduce Type II errors with respect to incarceration
- Ballistic imaging: Journal of Forensic Sciences 2017
AFTE Journal 2018
IMPACT OF BALLISTIC IMAGING PAPER
s
Starting in July 2018, the U.S. Bureau of Alcohol, Tobacco,
Firearms and Explosives (ATF) is requiring all NIBIN sites to:
- Enter 100% of cartridges into NIBIN
- Prioritize entry based on evidence, crime code and caliber
- Use LCFS
CRIME RESEARCH AGENDA
Reduce Type I errors with respect to incarceration
- Reduce overcrowding in jails, PLOS ONE, NY Times 2015
- Effect of judge rotation on criminal sentencing
Reduce Type II errors with respect to incarceration
- Ballistic imaging: Journal of Forensic Sciences 2017
AFTE Journal 2018
- Sexual assault kits:
Testing the backlog: Journal of Forensic Sciences 2018
cnn.com 2018
How to test the backlog: PNAS 2020
CRIME RESEARCH AGENDA
Reduce Type I errors with respect to incarceration
- Reduce overcrowding in jails, PLOS ONE, NY Times 2015
- Effect of judge rotation on criminal sentencing
Reduce Type II errors with respect to incarceration
- Ballistic imaging: Journal of Forensic Sciences 2017
AFTE Journal 2018
- Sexual assault kits:
Testing the backlog: Journal of Forensic Sciences 2018
cnn.com 2018
How to test the backlog: PNAS 2020
- Forensic Genetic Genealogy
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
Golden State Killer Case 2018
STEPS IN THE FGG PROCESS
1. Obtain DNA sample from crime scene or unidentified
remains and perform genotyping (SNPs)
STEPS IN THE FGG PROCESS
1. Obtain DNA sample from crime scene or unidentified
remains and perform genotyping (SNPs)
2. Upload SNP data to third-party service to obtain relatives
of target
STEPS IN THE FGG PROCESS
1. Obtain DNA sample from crime scene or unidentified
remains and perform genotyping (SNPs)
2. Upload SNP data to third-party service to obtain relatives
of target
3. Perform genealogy research to identify target
STEPS IN THE FGG PROCESS
1. Obtain DNA sample from crime scene or unidentified
remains and perform genotyping (SNPs)
2. Upload SNP data to third-party service to obtain relatives
of target
3. Perform genealogy research to identify target
4. Obtain confirmatory DNA sample from identified target or
family member
FGG ECOSYSTEM IS EVOLVING
GEDmatch switched to opt in (1.4M 🡪 260k in database)
FGG ECOSYSTEM IS EVOLVING
GEDmatch switched to opt in (1.4M 🡪 260k in database)
Field pioneered by four women (Colleen Fitzpatrick,
CeCe Moore, Margaret Press, Barbara Rae-Venter)
FGG ECOSYSTEM IS EVOLVING
GEDmatch switched to opt in (1.4M 🡪 260k in database)
Field pioneered by four women (Colleen Fitzpatrick,
CeCe Moore, Margaret Press, Barbara Rae-Venter)
Some small FGG companies bought by large companies
(Bode, Parabon)
FGG ECOSYSTEM IS EVOLVING
GEDmatch switched to opt in (1.4M 🡪 260k in database)
Field pioneered by four women (Colleen Fitzpatrick,
CeCe Moore, Margaret Press, Barbara Rae-Venter)
Some small FGG companies bought by large companies
(Bode, Parabon)
Most police departments outsource FGG
FGG ECOSYSTEM IS EVOLVING
GEDmatch switched to opt in (1.4M 🡪 260k in database)
Field pioneered by four women (Colleen Fitzpatrick,
CeCe Moore, Margaret Press, Barbara Rae-Venter)
Some small FGG companies bought by large companies
(Bode, Parabon)
Most police departments outsource FGG
Several hundred cold cases solved in 2018-21 (50% success)
4. Obtain confirmatory DNA sample from identified suspect
FGG ECOSYSTEM IS EVOLVING
GEDmatch switched to opt in (1.4M 🡪 260k in database)
Field pioneered by four women (Colleen Fitzpatrick,
CeCe Moore, Margaret Press, Barbara Rae-Venter)
Some small FGG companies bought by large companies
(Bode, Parabon)
Most police departments outsource FGG
Several hundred cold cases solved in 2018-21 (50% success)
DNA Doe Project is nonprofit, uses crowdsourced
volunteers (retired women), and identifies unidentified
human remains (rather than solving criminal cases)
4. Obtain confirmatory DNA sample from identified suspect
RESEARCH QUESTIONS
First mathematical analysis of the backend of the FGG process
(i.e., GEDmatch/FTDNA output 🡪 identify target)
RESEARCH QUESTIONS
First mathematical analysis of the backend of the FGG process
(i.e., GEDmatch/FTDNA output 🡪 identify target)
Performance analysis:
Given the GEDmatch/FTDNA output, compute:
Probability of identifying target
Expected workload (= size of final family tree)
RESEARCH QUESTIONS
First mathematical analysis of the backend of the FGG process
(i.e., GEDmatch/FTDNA output 🡪 identify target)
Performance analysis:
Given the GEDmatch/FTDNA output, compute:
Probability of identifying target
Expected workload (= size of final family tree)
Optimization:
How many and which matches to investigate?
When/if to descend from (possible) MRCAs
(MRCA = Most Recent Common Ancestor)
17 CASES FROM DNA DOE PROJECT
GEDmatch OUTPUT
23
Jane Doe
24
Jane Doe
Total cM
PROBABILISTIC MAPPING FROM TOTAL cM TO RELATIONSHIPS
25
https://thednageek.com/science-the-heck-out-of-your-dna-part-3/
Ball et al. (2016), Ancestry DNA Matching White Paper.
26
Jane Doe
Bob
Total cM
https://thednageek.com/science-the-heck-out-of-your-dna-part-3/
Ball et al. (2016), Ancestry DNA Matching White Paper.
182.2
AUTO CLUSTER TOOL IN GEDmatch
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
BENCHMARK STRATEGY
Search for common ancestors between two matches
BENCHMARK STRATEGY
Search for common ancestors between two matches
Descend from these ancestors to look for an intersection
(marriage) between both sides of target’s family tree
BENCHMARK STRATEGY
Search for common ancestors between two matches
Descend from these ancestors to look for an intersection
(marriage) between both sides of target’s family tree
Investigate n matches prioritized by highest total cM
BENCHMARK STRATEGY
Search for common ancestors between two matches
Descend from these ancestors to look for an intersection
(marriage) between both sides of target’s family tree
Investigate n matches prioritized by highest total cM
Vary n to generate Pr(identify target) vs. workload curve
PROPOSED vs. BENCHMARK STRATEGY
PROPOSED vs. BENCHMARK STRATEGY
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
PARAMETERS
PARAMETERS
Pr(can identify someone’s child) = 0.98 (also considered 0.90)
Pr(can identify someone’s parents) = 0.60 (by simulation)
PARAMETERS
Pr(can identify someone’s child) = 0.98 (also considered 0.90)
Pr(can identify someone’s parents) = 0.60 (by simulation)
Number of children per couple
NUMBER OF CHILDREN PER COUPLE
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
ASCENDING STAGE
ASCENDING STAGE
Target
2nd cousin
MOST RECENT COMMON ANCESTORS
Target
2nd cousin
MOST RECENT COMMON ANCESTORS
Cluster = ancestral couple of target�7 clusters in this example
ASCENDING STAGE
MOST RECENT COMMON ANCESTORS
Target
2nd cousin
List size L = 0, 1, 2, 3 or 4�
MOST RECENT COMMON ANCESTORS
Target
2nd cousin
List size L = 0, 1, 2, 3 or 4�P=0 if L=0, and P=1 if L=4�
MOST RECENT COMMON ANCESTORS
Target
2nd cousin
List size L = 0, 1, 2, 3 or 4�P=0 if L=0, and P=1 if L=4�Other P’s ≠ 1/4, 1/2, 3/4
MOST RECENT COMMON ANCESTORS
Target
2nd cousin
1st cousin
MOST RECENT COMMON ANCESTORS
Target
2nd cousin
1st cousin
Now state changes from (L=4,P=1) to (L=1,P=1)
DESCENDING STAGE
Given: State of system during ascending stage:
Goal: Find intersection of (i.e., marriage between) maternal
and paternal family trees
FIND INTERSECTION OF FAMILY TREES
Target
1st cousin
2nd cousin
FIND INTERSECTION OF FAMILY TREES
Target
1st cousin
2nd cousin
DESCENDING STAGE
Given: State of system at end of ascending stage:
Goal: Find intersection of (i.e., marriage between) maternal
and paternal family trees
Compute: 1) probability of finding intersection of family trees
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
PROPOSED vs. BENCHMARK STRATEGY
STOCHASTIC DYNAMIC PROGRAMMING
Observe state, take action, observe probabilistic transition to
new state, take new action,…to maximize objective
STOCHASTIC DYNAMIC PROGRAMMING
Observe state, take action, observe probabilistic transition to
new state, take new action,…to maximize objective
State: 1) list of uninvestigated matches with cM and cluster
2) for each cluster and generation, Pr(list contains correct
MRCA of target) and size of list of MRCAs
3) list of cluster-generation pairs for which a descending
search has been performed
STOCHASTIC DYNAMIC PROGRAMMING
Observe state, take action, observe probabilistic transition to
new state, take new action,…to maximize objective
State: 1) list of uninvestigated matches with cM and cluster
2) for each cluster and generation, Pr(list contains correct
MRCA of target) and size of list of MRCAs
3) list of cluster-generation pairs for which a descending
search has been performed
Actions: 1) start an ascending search of a new match with cM
and cluster
2) start a descending search from cluster-generation
3) end the search
STOCHASTIC DYNAMIC PROGRAMMING
Observe state, take action, observe probabilistic transition to
new state, take new action,…to maximize objective
State: 1) list of uninvestigated matches with cM and cluster
2) for each cluster and generation, Pr(list contains correct
MRCA of target) and size of list of MRCAs
3) list of cluster-generation pairs for which a descending
search has been performed
Actions: 1) start an ascending search of a new match with cM
and cluster
2) start a descending search from cluster-generation
3) end the search
Objective: maximize Pr(target identified) – (cost x E[workload])
PROPOSED STRATEGY
SDP problem was too hard to solve (huge state space)
PROPOSED STRATEGY
SDP problem was too hard to solve (huge state space)
Derive state transition probabilities for SDP
PROPOSED STRATEGY
SDP problem was too hard to solve (huge state space)
Derive state transition probabilities for SDP
Derive five structural results
PROPOSED STRATEGY
SDP problem was too hard to solve (huge state space)
Derive state transition probabilities for SDP
Derive five structural results
Construct threshold strategy that is consistent with structural results
PROPOSED STRATEGY
SDP problem was too hard to solve (huge state space)
Derive state transition probabilities for SDP
Derive five structural results
Construct threshold strategy that is consistent with structural results
Based on cost-effectiveness:
- compute ΔP (increase in probability of identifying target) and
ΔW (increase in mean workload) for each action
- cost-effectiveness of action = ΔP/ ΔW
PROPOSED STRATEGY
Use a myopic approach with parameter c
1) Given current state, compute ΔP (increase in probability of identifying target) and ΔWd(increase in E[workload]) for each undescended cluster-generation. If max ΔP/ ΔWd >c, then descend.
PROPOSED STRATEGY
Use a myopic approach with parameter c
1) Given current state, compute ΔP (increase in probability of identifying target) and ΔWd(increase in E[workload]) for each undescended cluster-generation. If max ΔP/ ΔWd >c, then descend.
2) Otherwise, for each cluster-generation, assume that we ascend n* times (= min n s.t. max ΔP/ ΔWd >c) and then descend.
PROPOSED STRATEGY
Use a myopic approach with parameter c
1) Given current state, compute ΔP (increase in probability of identifying target) and ΔWd(increase in E[workload]) for each undescended cluster-generation. If max ΔP/ ΔWd >c, then descend.
2) Otherwise, for each cluster-generation, assume that we ascend n* times (= min n s.t. max ΔP/ ΔWd >c) and then descend. If max ΔP(n*)/ [ΔWa(n*)+ ΔWd(n*)]>c, then ascend from max cM match in cluster-generation.
PROPOSED STRATEGY
Use a myopic approach with parameter c
1) Given current state, compute ΔP (increase in probability of identifying target) and ΔWd(increase in E[workload]) for each undescended cluster-generation. If max ΔP/ ΔWd >c, then descend.
2) Otherwise, for each cluster-generation, assume that we ascend n* times (= min n s.t. max ΔP/ ΔWd >c) and then descend. If max ΔP(n*)/ [ΔWa(n*)+ ΔWd(n*)]>c, then ascend from max cM match in cluster-generation.
3) Otherwise, stop.
PROPOSED VS. BENCHMARK STRATEGY
The Proposed strategy:
1) Ascends to exactly the generation of the correct MRCA
between the match and the target
PROPOSED VS. BENCHMARK STRATEGY
M1, M2, T = Match 1, Match 2, Target
g(x,y) = generation of MRCA between x and y
PROPOSED VS. BENCHMARK STRATEGY
M1, M2, T = Match 1, Match 2, Target
g(x,y) = generation of MRCA between x and y
Case 1: g(M1,M2) = min{g(M1,T), g(M2,T)}
equivalent outcomes
PROPOSED VS. BENCHMARK STRATEGY
M1, M2, T = Match 1, Match 2, Target
g(x,y) = generation of MRCA between x and y
Case 1: g(M1,M2) = min{g(M1,T), g(M2,T)}
equivalent outcomes
Case 2: g(M1,M2) < min{g(M1,T), g(M2,T)}
MRCA of 2 Matches cannot be ancestor of Target!
PROPOSED VS. BENCHMARK STRATEGY
M1, M2, T = Match 1, Match 2, Target
g(x,y) = generation of MRCA between x and y
Case 1: g(M1,M2) = min{g(M1,T), g(M2,T)}
equivalent outcomes
Case 2: g(M1,M2) < min{g(M1,T), g(M2,T)}
MRCA of 2 Matches cannot be ancestor of Target!
Case 3: g(M1,M2) > min{g(M1,T), g(M2,T)}
Overshoot: inefficient descents from g(M1,M2)
PROPOSED VS. BENCHMARK STRATEGY
The Proposed strategy:
1) Ascends to exactly the generation of the correct MRCA
between the match and the target
2) Uses the Autocluster tool to prioritize among ascending
actions based on cost-effectiveness
3) Aggressively descends from ancestral couples based on their
cost-effectiveness
BEHAVIOR OF PROPOSED STRATEGY
It aggressively descends from clusters: the average probability
that the list of potential MRCAs contains the correct MRCA is 0.38
BEHAVIOR OF PROPOSED STRATEGY
It aggressively descends from clusters: the average probability
that the list of potential MRCAs contains the correct MRCA is 0.38
Mean list size at time of descent = 5.5 (maximum = 76)
BEHAVIOR OF PROPOSED STRATEGY
It aggressively descends from clusters: the average probability
that the list of potential MRCAs contains the correct MRCA is 0.38
Mean list size at time of descent = 5.5 (maximum = 76)
12.6 descents per case
BEHAVIOR OF PROPOSED STRATEGY
It aggressively descends from clusters: the average probability
that the list of potential MRCAs contains the correct MRCA is 0.38
Mean list size at time of descent = 5.5 (maximum = 76)
12.6 descents per case
1.9 descents per cluster
BEHAVIOR OF PROPOSED STRATEGY
It aggressively descends from clusters: the average probability
that the list of potential MRCAs contains the correct MRCA is 0.36
Mean list size at time of descent = 5.5 (maximum = 76)
12.6 descents per case
1.9 descents per cluster
Great majority of work is in descents
BEHAVIOR OF PROPOSED STRATEGY
It aggressively descends from clusters: the average probability
that the list of potential MRCAs contains the correct MRCA is 0.36
Mean list size at time of descent = 5.5 (maximum = 76)
12.6 descents per case
1.9 descents per cluster
Great majority of work is in descents
Sequence of actions depends on detailed network structure in
complicated way
PROPOSED STRATEGY FOR CASE #1
Probability for
△
Distance for △
Action
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
LIMITATIONS
Cases: small sample size and not chosen randomly
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
No half-relationships
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
No half-relationships
No geographical information
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
No half-relationships
No geographical information
No ethnicity information
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
No half-relationships
No geographical information
No ethnicity information
AutoCluster information is perfect
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
No half-relationships
No geographical information
No ethnicity information
AutoCluster information is perfect
No Y-STR data to infer surname (Gymrek, Science 2013)
LIMITATIONS
Cases: small sample size and not chosen randomly
No endogamy
No half-relationships
No geographical information
No ethnicity information
AutoCluster information is perfect
No Y-STR data to infer surname (Gymrek, Science 2013)
Search probabilities do not depend on generation
OUTLINE
Introduction: research questions and data
Main results
Parameter estimation
Modeling the two-stage genealogy process
Proposed strategy
Limitations
Conclusions
CONCLUSIONS
Pr(identify target) and E[workload] are useful only in relative terms
- But police departments and FGG companies need to
assess solvability and workload upfront
CONCLUSIONS
Pr(identify target) and E[workload] are useful only in relative terms
- But police departments and FGG companies need to
assess solvability and workload upfront
Hard cases appear to be solvable but require high workload
- Tradeoff curves allow for identification of sweet spot
CONCLUSIONS
Pr(identify target) and E[workload] are useful only in relative terms
- But police departments and FGG companies need to
assess solvability and workload upfront
Hard cases appear to be solvable but require high workload
- Tradeoff curves allow for identification of sweet spot
Proposed Strategy solves cases faster by:
- looking for MRCAs between the target and each match
- aggressively descending from possible MRCAs
CONCLUSIONS
Pr(identify target) and E[workload] are useful only in relative terms
- But police departments and FGG companies need to
assess solvability and workload upfront
Hard cases appear to be solvable but require high workload
- Tradeoff curves allow for identification of sweet spot
Proposed Strategy solves cases faster by:
- looking for MRCAs between the target and each match
- aggressively descending from possible MRCAs
Proposed Strategy is meant to aid, not to replace, genealogists’
decisions
NEXT STEPS
Talks at ISHI (Sept 21), SWGDAM (Oct 21) and AAFS (Feb 22)
NEXT STEPS
Talks at ISHI (Sept 21), SWGDAM (Oct 21) and AAFS (Feb 22)
New project: Privacy (targeted testing) vs. Performance tradeoff
NEXT STEPS
Talks at ISHI (Sept 21), SWGDAM (Oct 21) and AAFS (Feb 22)
New project: Privacy (targeted testing) vs. Performance tradeoff
We are testing performance analysis tool (compute Pr(solve case)
and W50 given GEDmatch output) with 908 cases from GEDmatch
NEXT STEPS
Talks at ISHI (Sept 21), SWGDAM (Oct 21) and AAFS (Feb 22)
New project: Privacy (targeted testing) vs. Performance tradeoff
We are testing performance analysis tool (compute Pr(solve case)
and W50 given GEDmatch output) with 908 cases from GEDmatch
We are testing optimization tool on an open cold case