1 of 21

pangolin --usher

Angie Hinrichs

StaPH-B - Nov. 19, 2021

2 of 21

Outline

  • Background
    • UShER
    • UCSC’s big tree built by UShER
    • Pango lineages
  • Pangolin v3
  • pangolin --usher
    • How does it work?
    • How is it different from pangoLEARN?
  • Looking ahead

3 of 21

Ultrafast Sample placement on Existing tRees (UShER)

https://github.com/yatisht/usher/ Yatish Turakhia, UCSD

  • Precomputed Mutation Annotated Tree (MAT) data structure
  • Place new sequence in tree by Maximum Parsimony
  • Fast! (just seconds to place on tree of >5M sequences)
  • Web interface, matUtils, matOptimize, workflows, ...
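
To make the placement idea concrete, here is a toy sketch of maximum-parsimony placement on a mutation-annotated tree. It is not UShER's implementation: the `Node` class, the `place` function, and the example mutations are hypothetical. It also assumes each branch only adds mutations (no back-mutations) and that a sample can only attach at an existing node, not partway along a branch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy mutation-annotated tree node (hypothetical, not UShER's MAT format)."""
    name: str
    mutations: frozenset = frozenset()  # mutations on the branch leading to this node
    children: list = field(default_factory=list)


def place(root, sample_mutations):
    """Return (best_score, sorted node names) for the most parsimonious placements.

    The score for attaching the sample below a node is the number of mutations
    that differ between the sample and the node's accumulated genotype.
    """
    sample = frozenset(sample_mutations)
    best_score, best_nodes = None, []
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        genotype = inherited | node.mutations  # simplification: no back-mutations
        score = len(sample ^ genotype)  # extra mutations needed on the new branch
        if best_score is None or score < best_score:
            best_score, best_nodes = score, [node.name]
        elif score == best_score:
            best_nodes.append(node.name)  # equally parsimonious placement
        for child in node.children:
            stack.append((child, genotype))
    return best_score, sorted(best_nodes)


# Small example tree with made-up branch mutations.
n2 = Node("n2", frozenset({"G28881A"}))
n3 = Node("n3", frozenset({"C21618G"}))
n1 = Node("n1", frozenset({"C241T", "A23403G"}), [n2, n3])
n4 = Node("n4", frozenset({"G11083T"}))
root = Node("root", frozenset(), [n1, n4])

unique_placement = place(root, {"C241T", "A23403G", "G28881A", "T100C"})
tied_placement = place(root, {"C241T"})
```

The second call illustrates why a sample can have more than one equally parsimonious placement: attaching at `root` or at `n1` costs one mutation either way.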

https://www.nature.com/articles/s41588-021-00862-7

4 of 21

UCSC’s Big Trees

5 of 21

UCSC’s Big Trees

>5M: GISAID, GenBank, COG-UK

Not publicly shareable 😒

>2.5M: GenBank, COG-UK

Public downloads 😁

(colored by Pango lineage)

6 of 21

UCSC’s Big Trees

  • Daily update
    • Aggregate & deduplicate sequences and metadata
    • QC: remove sequences with <20000 non-N bases
    • Align new sequences to reference
    • Mask Problematic Sites
    • Use UShER to add new sequences to yesterday’s tree
    • Light optimization with matOptimize
    • QC: remove sequences with too many equally parsimonious placements or very long branches (>35 mutations)
    • Extract public tree
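
The two QC steps above can be sketched as simple filters. This is a minimal illustration, not the actual UCSC pipeline code: the function names are hypothetical, "non-N bases" is assumed to mean every character other than N or an alignment gap, and the limit on equally parsimonious placements is left as a required parameter because the slide does not give a number.

```python
MIN_NON_N_BASES = 20000   # from the slide: drop sequences with <20000 non-N bases
MAX_BRANCH_LENGTH = 35    # from the slide: drop branches with >35 mutations


def count_non_n_bases(sequence):
    """Count characters that are neither N nor a gap (assumed definition)."""
    return sum(1 for base in sequence.upper() if base not in "N-")


def passes_sequence_qc(sequence, min_non_n=MIN_NON_N_BASES):
    """First QC step: require enough informative bases before alignment."""
    return count_non_n_bases(sequence) >= min_non_n


def passes_placement_qc(num_placements, branch_length, max_placements,
                        max_branch_length=MAX_BRANCH_LENGTH):
    """Second QC step: reject ambiguous placements and very long branches.

    max_placements has no default because the threshold is not stated
    in the slide; callers must choose one.
    """
    return num_placements <= max_placements and branch_length <= max_branch_length


good_sequence = "ACGT" * 5000            # 20000 informative bases
poor_sequence = "ACGT" * 4999 + "NNNN"   # 19996 informative bases
```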

7 of 21

Browse the public tree with Taxonium

Theo Sanderson

Francis Crick Institute /

Wellcome Sanger Institute

8 of 21

Pango lineages

9 of 21

What defines a Pango lineage?

Not a set of mutations!

lineages.csv in the pango-designation github repository (>1M lines):

...

India/GJ-ICMR-NIV-INSACOG-GSEQ-3045/2021,B.1.617.2

India/PY-SEQ_294_S22_R1_001/2021,B.1.617.2

Malaysia/IMR_682164/2021,B.1.617.2

Japan/IC-1175/2021,B.1.617.2

USA/TX-CDC-ASC210037740/2021,B.1.617.2

England/WSFT-25C6539/2021,B.1.1.7

USA/MI-UM-10039543606/2021,AY.3

USA/KS-KHEL-1922/2021,AY.3

USA/KS-KHEL-1923/2021,AY.3

USA/MO-MSPHL-002099/2021,AY.3

USA/MO-MSPHL-002132/2021,AY.3

...
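
Because a lineage is defined by its designated sequences rather than by a mutation list, the designation file is effectively a lookup table from sequence name to lineage. A minimal sketch of reading it, assuming a two-column CSV whose header row is `taxon,lineage` (the header is an assumption, and `load_designations` is a hypothetical helper):

```python
import csv
import io
from collections import Counter


def load_designations(handle):
    """Map sequence name -> designated lineage from a lineages.csv-style file."""
    designations = {}
    for row in csv.reader(handle):
        # Skip blank or malformed lines and the assumed "taxon,lineage" header.
        if len(row) != 2 or row == ["taxon", "lineage"]:
            continue
        name, lineage = row
        designations[name] = lineage
    return designations


# A few rows copied from the slide, wrapped in an in-memory file.
example = io.StringIO(
    "taxon,lineage\n"
    "Japan/IC-1175/2021,B.1.617.2\n"
    "England/WSFT-25C6539/2021,B.1.1.7\n"
    "USA/KS-KHEL-1922/2021,AY.3\n"
    "USA/KS-KHEL-1923/2021,AY.3\n"
)
designations = load_designations(example)
lineage_counts = Counter(designations.values())
```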

10 of 21

A Brief History of Pangolin

  • v1.0 (April 29, 2020): phylogenetic model (iqtree… not fast enough)
  • v2.0 (July 22, 2020): pangoLEARN model (fast! sensitive to noise)
  • v3.0 (May 27, 2021): pangoLEARN + scorpio/constellations + --usher option

11 of 21

How does pangoLEARN work?

Figure 2, Áine O’Toole, Emily Scher, et al., Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool, Virus Evolution, Volume 7, Issue 2, November 2021, veab064, https://doi.org/10.1093/ve/veab064

12 of 21

How does pangoLEARN work?

Training:

  • SARS-CoV-2 genome sequences (aligned to genome, masked)
  • Binary vectors (0 = ref or N, 1 = alt)
  • pangoLEARN training, using pango-designation/lineages.csv
  • Decision tree model (branches labeled 1 / 0)
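
The binary encoding on this slide can be sketched in a few lines. This follows only the rule stated above (0 = ref or N, 1 = alt); the function name is hypothetical, and treating every other character (including gaps) as "alt" is an assumption, not a description of pangoLEARN's actual code.

```python
def encode_binary(aligned_sequence, reference):
    """Encode an aligned sequence as 0 (reference base or N) / 1 (anything else)."""
    if len(aligned_sequence) != len(reference):
        raise ValueError("sequence must be aligned to the reference (same length)")
    return [
        0 if base == ref_base or base == "N" else 1
        for base, ref_base in zip(aligned_sequence.upper(), reference.upper())
    ]


# Toy 5-base "genome": one N (uninformative) and one substitution at the end.
vector = encode_binary("ACNTG", "ACGTA")
```

Vectors like this, paired with the lineage labels from lineages.csv, would be the training input for a decision tree classifier.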

13 of 21

How does pangoLEARN work?

Running pangolin:

  • User SARS-CoV-2 sequences
  • Align to reference, mask
  • Binary vectors
  • Decision tree model (branches labeled 1 / 0)
  • Assigned lineage

14 of 21

How does pangolin --usher work?

Making lineage-annotated tree:

pango-designation/lineages.csv

matUtils annotate

UCSC big tree

matUtils reroot, matUtils mask -m…, matUtils extract -r..., matUtils mask -s...

(tree annotated with lineages: A, B, B.1.1.7, B.1.617.2, AY.4)

15 of 21

How does pangolin --usher work?

Running pangolin:

User SARS-CoV-2 sequences

Align to reference, mask

Assigned lineage

16 of 21

What’s the difference?

  • pangoLEARN is ~16x faster
  • UShER uses a mutation-annotated phylogenetic tree

17 of 21

Not all assignments come from pangoLEARN/UShER

  • Designated sequences: directly assigned, no pangoLEARN/UShER

1002005561,AY.44,,,,,,PANGO-v1.2.93,3.1.16,2021-11-09,v1.2.93,passed_qc,Assigned from designation hash.

  • Scorpio/constellations: overrides pangoLEARN/UShER

2000051407,B.1.617.2,0.0,0.9288622754491018,Delta (B.1.617.2-like),0.384600,0.076900,PLEARN-v1.2.93,3.1.16,2021-11-09,v1.2.93,passed_qc,scorpio call: Alt alleles 5; Ref alleles 1; Amb alleles 6; Oth alleles 1; scorpio replaced lineage assignment B.1.1.7

3000136426,None,,,,,,PLEARN-v1.2.93,3.1.16,2021-11-09,v1.2.93,passed_qc,pangoLEARN lineage assignment AY.4.5 was not supported by scorpio

3000137678,B.1.617.2,0.5,,Delta(B.1.617.2-like),1.000000,0.000000,PUSHER-v1.2.93,3.1.16,,v1.2.93,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; scorpio replaced lineage assignment AY.4; Usher placements: AY.4(1/2) B.1.617.2(1/2)

7000000606,None,,,,,,PUSHER-v1.2.93,3.1.16,,v1.2.93,passed_qc,usher lineage assignment AY.13 was not supported by scorpio; Usher placements: AY.13(5/6) B.1.617.2(1/6)
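
The "Usher placements" text in the final (note) field of the rows above can be pulled apart programmatically. A minimal sketch, assuming the note keeps the `Lineage(n/total)` pattern shown in these examples; `parse_usher_placements` is a hypothetical helper, not part of pangolin.

```python
import re

# Matches items like "AY.4(1/2)": lineage name, then count/total in parentheses.
PLACEMENT_RE = re.compile(r"([^\s(]+)\((\d+)/(\d+)\)")
MARKER = "Usher placements:"


def parse_usher_placements(note):
    """Return {lineage: (count, total)} from a pangolin note, or {} if absent."""
    if MARKER not in note:
        return {}
    tail = note.split(MARKER, 1)[1]
    return {
        lineage: (int(count), int(total))
        for lineage, count, total in PLACEMENT_RE.findall(tail)
    }


# Note fields copied from the example rows on this slide.
split_note = (
    "scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; "
    "scorpio replaced lineage assignment AY.4; "
    "Usher placements: AY.4(1/2) B.1.617.2(1/2)"
)
unsupported_note = (
    "usher lineage assignment AY.13 was not supported by scorpio; "
    "Usher placements: AY.13(5/6) B.1.617.2(1/6)"
)
split_placements = parse_usher_placements(split_note)
unsupported_placements = parse_usher_placements(unsupported_note)
```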

18 of 21

Looking forward...

  • Definitely: Ongoing updates with new lineages
  • Probably: Precomputed assignments
  • Maybe?: Expanded use of Scorpio

19 of 21

Acknowledgements

  • UCSD: Yatish Turakhia, Cheng Ye (UShER, matOptimize)
  • U. of Edinburgh: Áine O’Toole, Emily Scher, Rachel Colquhoun, Andrew Rambaut (pangolin)
  • UCSC: Russ Corbett-Detig, Jakob McBroome, Bryan Thornlow, Alex Kramer, Marc Perry (matUtils, evaluation)

20 of 21

21 of 21

UCSC’s Big Trees

>5M: GISAID, GenBank, COG-UK

Not publicly shareable

>2.5M: GenBank, COG-UK

Public downloads

(colored by country)