1 of 13

pangolin v4.0

Angie Hinrichs, UCSC Genomics Institute

Theiagen Office Hours

April 12, 2022

2 of 13

Overview

  • What’s new in pangolin v4.0
    • --analysis-mode: UShER is the default, pangoLEARN uses random forest
    • Output columns & version descriptors
    • Data repository consolidation
    • Optional cache mode for UShER
  • Questions!

3 of 13

--analysis-mode usher|pangolearn|accurate|fast|scorpio

  • UShER is default
  • pangoLEARN is still available via --analysis-mode pangolearn
  • Aliases:
    • --analysis-mode fast = pangolearn
    • --analysis-mode accurate = usher

And in pangolin v4.0.3:

  • --analysis-mode scorpio
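
The alias relationships on this slide can be written down as a small lookup. This is an illustrative sketch only: `resolve_analysis_mode`, `ALIASES` and `MODES` are hypothetical names, not pangolin's internal code, and the sketch assumes that omitting the flag selects the UShER default.

```python
# Hypothetical helper: maps an --analysis-mode value to the engine it selects.
ALIASES = {"fast": "pangolearn", "accurate": "usher"}
MODES = {"usher", "pangolearn", "scorpio"}  # scorpio was added in v4.0.3


def resolve_analysis_mode(value=None):
    """Return the engine name for a given --analysis-mode value."""
    if value is None:
        return "usher"  # UShER is the default in v4
    mode = ALIASES.get(value, value)
    if mode not in MODES:
        raise ValueError(f"unknown analysis mode: {value!r}")
    return mode
```

With this mapping, `fast` resolves to `pangolearn` and `accurate` resolves to `usher`, matching the aliases listed above.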

4 of 13

PangoLEARN model change

  • v3 pangoLEARN model: decision tree
  • v4 pangoLEARN model: random forest of decision trees
    • Hopefully more robust
    • Requires more RAM (>= 16GB or 32GB or … depending on how many sequences)
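
To illustrate why a forest can be more robust than a single tree, here is a toy sketch of the voting step only. The three hand-written "trees", the site names and the lineage labels "X" and "Y" are all hypothetical; this is not the pangoLEARN model, and a real random forest also trains each tree on resampled data.

```python
from collections import Counter


# Hypothetical toy "trees": each inspects one made-up site and votes for a lineage.
def tree_a(seq):
    return "X" if seq["site1"] == "T" else "Y"


def tree_b(seq):
    return "X" if seq["site2"] == "G" else "Y"


def tree_c(seq):
    return "X" if seq["site3"] == "A" else "Y"


def forest_predict(trees, seq):
    """Majority vote over the individual tree predictions."""
    votes = Counter(tree(seq) for tree in trees)
    return votes.most_common(1)[0][0]


# site1 is an ambiguous base here, so tree_a on its own votes "Y",
# but the other two trees outvote it.
sample = {"site1": "N", "site2": "G", "site3": "A"}
single = tree_a(sample)
ensemble = forest_predict([tree_a, tree_b, tree_c], sample)
```

Holding many trees in memory at once is also why the forest needs more RAM than a single tree.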

5 of 13

Changes to output columns

pangolin v3          pangolin v4             Example value (v4)
-------------------  ----------------------  ------------------------------------------
taxon                taxon
lineage              lineage
conflict             conflict
ambiguity_score      ambiguity_score
scorpio_call         scorpio_call
scorpio_support      scorpio_support
scorpio_conflict     scorpio_conflict
-                    scorpio_notes           scorpio call: Alt alleles 21; Ref alleles…
version              version                 (PUSHER|PLEARN|PANGO)-v1.3
pangolin_version     pangolin_version        4.0.4
pangoLEARN_version   -
pango_version        -
-                    scorpio_version         0.3.16
-                    constellation_version   v0.1.4
status               -
-                    is_designated           True|False
-                    qc_status               pass|fail
-                    qc_notes                Ambiguous_content:0.02
note                 note                    Usher placements: BA.1(1/1); scorpio replaced…

A dash marks a column that is absent from that version.
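
For scripts that parse pangolin output, it may help to compute which columns disappeared and which are new. The column lists below are transcribed from the comparison above; the variable names are illustrative.

```python
# Output columns as listed on this slide.
V3_COLUMNS = [
    "taxon", "lineage", "conflict", "ambiguity_score",
    "scorpio_call", "scorpio_support", "scorpio_conflict",
    "version", "pangolin_version", "pangoLEARN_version",
    "pango_version", "status", "note",
]
V4_COLUMNS = [
    "taxon", "lineage", "conflict", "ambiguity_score",
    "scorpio_call", "scorpio_support", "scorpio_conflict",
    "scorpio_notes", "version", "pangolin_version",
    "scorpio_version", "constellation_version",
    "is_designated", "qc_status", "qc_notes", "note",
]

# Columns a v3-era parser expects but v4 no longer writes.
removed = [c for c in V3_COLUMNS if c not in V4_COLUMNS]
# Columns that only exist in v4 output.
added = [c for c in V4_COLUMNS if c not in V3_COLUMNS]
```

Any downstream code that reads `status`, `pangoLEARN_version` or `pango_version` will need updating for v4 output.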

6 of 13

Version descriptors

In pangolin v3:

  • pangoLEARN_version = date on which pangoLEARN training started
  • pango_version = pango-designation release used to train pangoLEARN

In pangolin v4:

  • No more date: the version column carries only a method prefix and the pango-designation release, e.g. (PUSHER|PLEARN|PANGO)-v1.3
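
A v4 version string can be split into its method prefix and pango-designation release with a short regular expression. This is a sketch assuming the format is exactly the one shown above; `parse_version` is a hypothetical helper. As an assumption not stated on the slide, PUSHER and PLEARN appear to identify the UShER and pangoLEARN modes, and PANGO a direct match to a designated sequence.

```python
import re

# Prefix, a hyphen, then a release such as v1.3 (extra dotted parts are allowed).
VERSION_RE = re.compile(r"^(PUSHER|PLEARN|PANGO)-(v\d+(?:\.\d+)*)$")


def parse_version(value):
    """Split a v4 'version' field into (prefix, release)."""
    match = VERSION_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised version string: {value!r}")
    return match.group(1), match.group(2)
```

For example, `parse_version("PUSHER-v1.3")` yields the prefix `PUSHER` and the release `v1.3`.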

7 of 13

Data dependencies (github repositories)

In pangolin v3:

  • pangoLEARN: Trained pangoLEARN model and UShER tree
  • pango-designation: Alias file (e.g. AY = B.1.617.2, BA = B.1.1.529, …)
  • Confusing because pangoLEARN also depends on pango-designation and often lags it by weeks

In pangolin v4:

  • pangolin-data: Trained pangoLEARN model, UShER tree, alias file
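
The alias file is what lets a short name such as AY.3 stand in for its full lineage path. The sketch below expands an alias using only the two examples given above; `expand_alias` is a hypothetical helper, and the real alias file contains many more entries.

```python
# The two aliases quoted on this slide; the real file has many more.
ALIASES = {"AY": "B.1.617.2", "BA": "B.1.1.529"}


def expand_alias(lineage, aliases=ALIASES):
    """Replace a leading alias (e.g. 'AY') with the lineage it abbreviates."""
    prefix, sep, rest = lineage.partition(".")
    full_prefix = aliases.get(prefix, prefix)  # unknown prefixes pass through
    return full_prefix + sep + rest
```

So `AY.3` expands to `B.1.617.2.3`, while an unaliased name such as `B.1.1.7` is returned unchanged.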

8 of 13

Optional cache mode for UShER assignments

UShER is slower than pangoLEARN, but UCSC precomputes UShER assignments for all GISAID sequences.

The cache replaces GISAID sequence names with a hash of each sequence, giving a lookup table from sequence hash to lineage.

Big file (~200MB) – not included in pangolin by default.

To install the cache:

pangolin --add-assignment-cache

To use the cache:

pangolin --use-assignment-cache
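
The idea behind the cache can be sketched as a dictionary keyed by sequence hash. This is an illustration, not pangolin's implementation: the choice of SHA-256, the upper-casing step, the function names and the toy sequences are all assumptions.

```python
import hashlib


def sequence_hash(seq):
    """Hash a sequence after trimming whitespace and upper-casing it."""
    normalized = seq.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_cache(assignments):
    """Map sequence hash -> lineage; no sequence names are stored."""
    return {sequence_hash(seq): lineage for seq, lineage in assignments}


def lookup(cache, seq):
    """Return the cached lineage, or None if the sequence must be placed afresh."""
    return cache.get(sequence_hash(seq))


# Toy sequences standing in for full genomes.
cache = build_cache([("ACGTACGT", "B.1.1.7"), ("ACGTTCGT", "AY.3")])
```

A query whose sequence is identical to a cached one gets its lineage without running UShER; anything else falls through to normal placement.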

9 of 13

Resources

10 of 13

Acknowledgements

  • U. of Edinburgh: Áine O’Toole, Emily Scher, Rachel Colquhoun, Andrew Rambaut (pangolin)

  • UCSD: Yatish Turakhia, Cheng Ye (UShER, matOptimize)

  • UCSC: Russ Corbett-Detig, Jakob McBroome, Bryan Thornlow, Adriano de Bernardi Schneider, Alex Kramer, Marc Perry (matUtils, evaluation)

11 of 13

What defines a Pango lineage?

Not a set of mutations!

lineages.csv in the pango-designation github repository (>1M lines):

...

India/GJ-ICMR-NIV-INSACOG-GSEQ-3045/2021,B.1.617.2

India/PY-SEQ_294_S22_R1_001/2021,B.1.617.2

Malaysia/IMR_682164/2021,B.1.617.2

Japan/IC-1175/2021,B.1.617.2

USA/TX-CDC-ASC210037740/2021,B.1.617.2

England/WSFT-25C6539/2021,B.1.1.7

USA/MI-UM-10039543606/2021,AY.3

USA/KS-KHEL-1922/2021,AY.3

USA/KS-KHEL-1923/2021,AY.3

USA/MO-MSPHL-002099/2021,AY.3

USA/MO-MSPHL-002132/2021,AY.3

...
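
Because a lineage is defined by its list of designated sequences, the file can be read as a plain name-to-lineage mapping. The sketch below parses a few of the rows shown above from an in-memory string. `load_designations` is a hypothetical helper, and the sketch assumes two-column rows with no header; a header row in the real file would need to be skipped.

```python
import csv
import io
from collections import Counter

# A handful of the rows shown above.
SAMPLE = """\
India/GJ-ICMR-NIV-INSACOG-GSEQ-3045/2021,B.1.617.2
Japan/IC-1175/2021,B.1.617.2
England/WSFT-25C6539/2021,B.1.1.7
USA/KS-KHEL-1922/2021,AY.3
USA/KS-KHEL-1923/2021,AY.3
"""


def load_designations(handle):
    """Read 'sequence name,lineage' rows into a dict."""
    return {row[0]: row[1] for row in csv.reader(handle) if len(row) == 2}


designations = load_designations(io.StringIO(SAMPLE))
# Number of designated sequences per lineage.
counts = Counter(designations.values())
```

Nothing in this mapping mentions mutations: membership in the list is what defines the lineage.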

12 of 13

13 of 13

When new lineages are added…

  • Some incongruous changes with UShER…
  • Many incongruous changes with pangoLEARN

Use UShER!