1 of 63

Hereditary Stratigraphy:

Genome Annotations to Enable Phylogenetic Inference over

Distributed Digital Evolution Populations

Matthew Andres Moreno @MorenoMatthewA

Emily Dolson @emilyldolson

Charles Ofria @charlesofria

Evolution Conference

June, 2022

2 of 63

Digital Evolution

10011011110100

Computer models with evolving agents (O'Neill, 2003)

  • bitstrings on a NK-Fitness landscape (Stuart and Levin, 1987)
  • Avida computer programs solving logic tasks (Lenski et al., 2003)

Idea: instantiate evolutionary processes in a computer program.

Why? Unique capabilities;

  • Fast generation turnover for small populations
  • Complete observability
  • Arbitrary experimental manipulations

@MorenoMatthewA

3 of 63

Phylogenetic Analysis in Digital Evolution

Why?

  • Anecdotal account of evolutionary history to better understand instances of evolutionary innovation (Lenski et al. 2003)
  • Detect ecological dynamics (Dolson and Ofria, 2018)
  • Characterize selection pressure (Hagstrom et al., 2004)
  • Study spatial aspects of evolutionary innovation (Dolson and Ofria, 2017)

@MorenoMatthewA

4 of 63

Perfect Phylogenetic Tracking in Digital Evolution

death event

birth event

perfect record

@MorenoMatthewA

5 of 63

Scaling Up Perfect Tracking: Challenges

6 of 63

Scaling Up Perfect Tracking: Challenges

vs.

7 of 63

Scaling Up Perfect Tracking: Challenges

vs.

8 of 63

Scaling Up Perfect Tracking: Challenges

vs.

lost data

9 of 63

Alternate Strategy: Reconstruction

10 of 63

Alternate Approach: Reconstruction

extant

organisms

11 of 63

Alternate Approach: Reconstruction

extant

organisms

12 of 63

Alternate Approach: Reconstruction

inference

extant

organisms

13 of 63

Alternate Approach: Reconstruction

❗estimated❗

not ground truth

extant

organisms

14 of 63

Alternate Approach: Reconstruction

❗estimated❗

not ground truth

extant

organisms

in digital evolution, we control how these work

15 of 63

Research Question:

How to design a genomes to maximize phylogenetic reconstructability?

@MorenoMatthewA

16 of 63

A Naive Approach

@MorenoMatthewA

17 of 63

A Naive Approach

gen 0

100110111

@MorenoMatthewA

18 of 63

A Naive Approach

gen 0

gen 1

100110111

100110011

@MorenoMatthewA

19 of 63

A Naive Approach

gen 0

gen 1

100110111

100110011

@MorenoMatthewA

20 of 63

A Naive Approach

gen 0

gen 1

gen 2

100110111

100110011

110110011

@MorenoMatthewA

21 of 63

A Naive Approach

gen 0

gen 1

gen 2

100110111

100110011

110110011

@MorenoMatthewA

22 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

100110111

100110011

110110011

110010011

110110010

@MorenoMatthewA

23 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

100110111

100110011

110110011

110010011

110110010

@MorenoMatthewA

24 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

gen 4

100110111

100110011

110110011

110010011

110000011

110110010

111110010

@MorenoMatthewA

25 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

gen 4

100110111

100110011

110110011

110010011

110000011

110110010

111110010

@MorenoMatthewA

26 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

100110111

100110011

110110011

110010011

110000011

110110010

111110010

111100010

010000011

@MorenoMatthewA

27 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

100110111

100110011

110110011

110010011

110000011

110110010

111110010

111100010

010000011

extant

@MorenoMatthewA

28 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

100110111

100110011

110110011

110010011

110000011

110110010

111110010

111100010

010000011

extant

extant

010000011

111100010

29 of 63

A Naive Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

100110111

100110011

110110011

110010011

110000011

110110010

111110010

111100010

010000011

extant

extant

010000011

111100010

measure distance => infer estimate of generations elapsed most recent common ancestor

Challenges:

  • Back mutations
  • Mutational saturation
  • Mutation rate
  • Uneven branch lengths

30 of 63

A More Clever Approach

“Hereditary Stratigraph”

@MorenoMatthewA

31 of 63

A More Clever Approach

gen 0

@MorenoMatthewA

“Fingerprint”: randomly generated packet of data

32 of 63

A More Clever Approach

gen 0

@MorenoMatthewA

33 of 63

A More Clever Approach

gen 0

gen 1

@MorenoMatthewA

34 of 63

A More Clever Approach

gen 0

gen 1

gen 2

@MorenoMatthewA

35 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

@MorenoMatthewA

36 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

gen 4

@MorenoMatthewA

37 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

@MorenoMatthewA

38 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

extant

@MorenoMatthewA

39 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

extant

extant

40 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

extant

extant

different

same

41 of 63

A More Clever Approach

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

extant

extant

different

same

end of common ancestry

42 of 63

A More Clever Approach

“hereditary stratigraph”

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

@MorenoMatthewA

43 of 63

A More Clever Approach

“hereditary stratigraph”

gen 0

gen 1

gen 2

gen 3

gen 4

gen 5

attach as neutral

“annotation” to any digital genome

@MorenoMatthewA

arbitrary digital genome

44 of 63

Hangup: genome size bloat

If we add a 64 bit “fingerprint” to the genome each generation then,

1 Million generations => 8mb per genome, 8gb total for 1k population

1 Billion generations => 8gb per genome, 8tb total for 1k population

Genome size grows linearly with number of generations elapsed!

@MorenoMatthewA

45 of 63

Solution: pruning

@MorenoMatthewA

46 of 63

Solution: pruning

⬇️ pruning ⬇️

@MorenoMatthewA

47 of 63

Solution: pruning

Tradeoff: space vs MRCA estimate uncertainty

⬇️ pruning ⬇️

Uncertainty when estimating MRCA

@MorenoMatthewA

48 of 63

Solution: pruning

Tradeoff: space vs MRCA estimate uncertainty

⬇️ pruning ⬇️

Uncertainty when estimating MRCA

the subtle & interesting part! ⏩

@MorenoMatthewA

49 of 63

Pruning: distribution

more recent

more ancient

⬆️ evenly pruned vs. recency-proportional pruned ⬇️

more retention

less retention

@MorenoMatthewA

50 of 63

Results ⏩

@MorenoMatthewA

51 of 63

Results

unweighted pair group method with arithmetic mean (UPGMA) reconstruction

interactive viz https://thrul.ink/bi

52 of 63

Conclusion

Takeaways:

  • interesting “backwards” problem: how to design genome to maximize phylogenetic reconstructibility
  • aspects of digital evolution systems making trade-offs that mirror wet lab systems

Future Work:

  • sexual populations
  • better understand theoretical expectations for tree reconstruction quality

@MorenoMatthewA

53 of 63

Images

@MorenoMatthewA

54 of 63

Bibliography

O'Neill, Bill. "Digital evolution." PLoS Biology 1.1 (2003): e18.

Dolson, Emily, and Charles Ofria. "Spatial resource heterogeneity creates local hotspots of evolutionary potential." ECAL 2017, the Fourteenth European Conference on Artificial Life. MIT Press, 2017.

Dolson, Emily, and Charles Ofria. "Ecological theory provides insights about evolutionary computation." Proceedings of the Genetic and Evolutionary Computation Conference Companion. 2018.

Hagstrom, George I., et al. "Using Avida to test the effects of natural selection on phylogenetic reconstruction methods." Artificial life 10.2 (2004): 157-166.

Lenski, R. E., Ofria, C., Pennock, R. T., & Adami, C. (2003). The evolutionary origin of complex features. Nature, 423(6936), 139–144.

Kauffman, Stuart, and Simon Levin. "Towards a general theory of adaptive walks on rugged landscapes." Journal of theoretical Biology 128.1 (1987): 11-45.

@MorenoMatthewA

55 of 63

Acknowledgement

  • We are the Digital Evolution Laboratory and Evolutionary Control of Digital Ecologies Laboratory at the Michigan State University Department of Computer Science and Engineering, in association with the NSF BEACON Center for the Study of Evolution in Action.
  • This research was supported in part by NSF grants DEB-1655715 and DBI-0939454. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1424871.
  • Thanks to Dr. Kevin Liu for input on tree reconstruction and comparison methods.

@MorenoMatthewA @emilyldolson @charlesofria

Matthew

Emily

Charles

56 of 63

Materials are Available!

  • these slides https://thrul.ink/bl
  • hstrat Python library on pypi (`pip install hstrat`)
  • hstrat library on github https://github.com/mmore500/hstrat
  • upcoming conference paper at Artificial Life Conference 2022
  • dm or email me if you’d like a copy!
  • or just to chat!
  • mmore500@msu.edu

@MorenoMatthewA

57 of 63

58 of 63

Different Pruning Options

evenly spaced vs recency-poroportionsl

how much to prune

how to do it so you have good data no matter where you pull the stop simulation

59 of 63

60 of 63

https://cerebras.net/blog/wafer-scale-processors-the-time-has-come/

61 of 63

Scaling Up Digital Evolution Compute

Why?

  • larger, more diverse digital populations and ecosystems
  • more sophisticated digital organisms and environments

vs.

@MorenoMatthewA

62 of 63

Pruning: intensity

⬆️ lightly pruned vs. heavily pruned ⬇️

@MorenoMatthewA

63 of 63

Results

validate pairwise comparison

interactive viz https://hopth.ru/bh

true MRCA

MRCA estimate bounds

genome 1

genome 2