1 of 17

Short variant discovery in M. tuberculosis

Peter van Heusden pvh@sanbi.ac.za

South African National Bioinformatics Institute

2 of 17

M. tuberculosis genome

  • 10 million cases in 2018, 1.4 million deaths
  • Reference H37Rv is a ‘laboratory strain’
  • About ¼ of proteins annotated in H37Rv are “hypothetical”
  • Slow growing (1 doubling/day)
  • No horizontal gene transfer in M. tuberculosis
  • In the genus Mycobacterium and the group Mycobacterium tuberculosis complex (MTBC)

3 of 17

The Mycobacteria

  • Diverse genus of pathogenic and non-pathogenic bacteria
  • NTM (nontuberculous mycobacteria) are all mycobacteria except those causing TB and leprosy
  • NTM are found in soil and water and occasionally cause human disease

4 of 17

M. tuberculosis diversity

5 of 17

Features of the M. tuberculosis genome

  • single 4.4 megabase circular chromosome
  • 4018 coding sequences (in NC_000962.3)
  • 56 insertion sequence (IS) sites
  • Direct Repeat (DR) region of 36 bp repeats
  • PE/PPE/PGRS families of repetitive proteins

6 of 17

7 of 17

Genotyping M. tuberculosis

  • IS6110-RFLP
  • MIRU-VNTR
  • Spoligotyping
  • Review article: DOI 10.3346/jkms.2016.31.11.1673
  • Strains decades or even hundreds of years apart in transmission can share a genotype: DOI 10.1016/j.ebiom.2018.10.013
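
To make the spoligotyping entry above more concrete, here is a minimal sketch of the conventional octal encoding of a spoligotype: the 43 spacer presence/absence calls are read as 14 binary triplets plus one final digit, giving a 15-digit code. The function name is hypothetical, and the test pattern (first 34 spacers absent, last 9 present) is used only as an illustrative input.

```python
def spoligotype_to_octal(binary):
    """Convert a 43-character presence/absence string ('1'/'0') to a 15-digit octal code."""
    if len(binary) != 43 or set(binary) - {"0", "1"}:
        raise ValueError("expected exactly 43 characters, each '0' or '1'")
    # Spacers 1-42 form 14 triplets; each triplet is read as one octal digit.
    triplets = [binary[i:i + 3] for i in range(0, 42, 3)]
    # Spacer 43 is appended unchanged as the final digit (0 or 1).
    return "".join(str(int(t, 2)) for t in triplets) + binary[42]


# Illustrative pattern: spacers 1-34 absent, spacers 35-43 present.
pattern = "0" * 34 + "1" * 9
print(spoligotype_to_octal(pattern))
```

Because the code summarises only 43 binary calls, unrelated isolates can easily share one, which is consistent with the point above about distant strains sharing a genotype.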

8 of 17

WGS of M. tuberculosis

  • Infer transmission and gene flow
    • no clear “SNP-threshold” though
  • Allows for GWAS to explore genotype/phenotype links
  • Perform in-silico drug resistance testing
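
As a rough sketch of how transmission inference from WGS is often approached, the snippet below counts pairwise SNP differences between isolates over the same set of variable sites. The function names and toy allele strings are hypothetical and not taken from these slides. No cutoff is applied, in keeping with the point above that there is no clear SNP threshold.

```python
from itertools import combinations


def snp_distance(a, b, missing="N-"):
    """Count sites where two aligned allele strings differ, skipping missing calls."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x not in missing and y not in missing
    )


def pairwise_distances(samples):
    """Return {(name1, name2): distance} for every unordered pair of samples."""
    return {
        (s, t): snp_distance(samples[s], samples[t])
        for s, t in combinations(sorted(samples), 2)
    }


# Toy data: alleles at ten variable sites (hypothetical isolates).
samples = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "TCGAACNTGT",
}
distances = pairwise_distances(samples)
for pair, d in distances.items():
    print(pair, d)
```

Sites with a missing call in either isolate are skipped rather than counted as differences, which is one design choice among several; the distances are best read alongside epidemiological data rather than against a fixed cutoff.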

9 of 17

M. tuberculosis WGS vs other bacteria

  • Horizontal gene transfer and recombination are common in many other pathogenic bacteria
  • Complicates phylogenies
    • requires identification and masking of recombination hotspots
  • Requires analysis of genes and gene flow (including on plasmids)
  • Difficult to use single “reference sequence” for species
  • Antimicrobial resistance is typically on the level of genes not point mutations

10 of 17

Challenges in M. tuberculosis variant discovery

  • Typical Illumina reads are < 250bp
    • limits ability to discover insertions/deletions (indels)
  • Reads are shorter than length of repetitive structures (e.g. IS, PE/PPE/PGRS genes)
  • H37Rv genome is not a “neutral target”
    • Lineage 4 genome
    • Lab isolate from 1930s, origin in 1905 patient
    • ESAT-6 system different to many clinical strains
    • RvD5 region deleted in H37Rv relative to many clinical strains
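
The read-length limitation above can be illustrated with a small check: for a read to place a repeat unambiguously it needs to cover the whole repeat plus some unique flanking sequence on each side. This is a simplified sketch; the function name, the 20 bp anchor, and the IS6110 length of roughly 1.35 kb are assumptions made for illustration.

```python
def read_can_span(read_length, repeat_length, min_anchor=20):
    """True if a read can cover the repeat plus min_anchor unique bases on each side."""
    return read_length >= repeat_length + 2 * min_anchor


IS6110_LENGTH = 1355  # approximate length in bp; an assumption for illustration

for name, length in [("short read", 250), ("long read", 10000)]:
    print(name, read_can_span(length, IS6110_LENGTH))
```

Under these assumptions a 250 bp read cannot span the element while a 10 kb read can.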

11 of 17

Use of inferred ancestral reference in M. tuberculosis SNV calling

12 of 17

Challenges in M. tuberculosis variant discovery (part 2)

  • Contamination of M. tb samples is common
    • especially in direct-from-sputum sequencing
    • Taxonomic filter recommended prior to variant analysis, however:
      • Commonly used tools (Kraken, Kraken2) are memory intensive; consider Centrifuge
  • Most variant calling software is tuned for human data
  • Regions chosen for masking differ between research groups
  • Overview: DOI 10.1099/mgen.0.000418
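
The masking step mentioned above can be sketched as follows: variants that fall inside a set of masked regions are dropped before downstream analysis. This is a minimal illustration that assumes BED-style intervals (0-based, half-open) and VCF-style positions (1-based). The coordinates are made up and do not correspond to any published mask.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def filter_masked(variants, mask):
    """Keep variants (pos, ref, alt) whose 1-based pos lies outside every mask interval."""
    merged = merge_intervals(mask)
    starts = [s for s, _ in merged]
    kept = []
    for pos, ref, alt in variants:
        zero_based = pos - 1
        # Find the last interval starting at or before this position.
        i = bisect_right(starts, zero_based) - 1
        inside = i >= 0 and zero_based < merged[i][1]
        if not inside:
            kept.append((pos, ref, alt))
    return kept


# Hypothetical mask and variants, for illustration only.
mask = [(100, 200), (150, 250), (1000, 1100)]
variants = [(50, "A", "G"), (101, "C", "T"), (250, "G", "A"),
            (251, "T", "C"), (1100, "A", "C"), (1101, "G", "T")]
kept = filter_masked(variants, mask)
print(kept)
```

Since groups differ in which regions they mask, keeping the mask as an explicit input file makes it easier to report and compare results.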

13 of 17

PE/PPE/PGRS gene clustering

PE/PPE/PGRS genes, edges in graph show where there is greater than 70% identity of region aligned with BLAST
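
A graph like the one described above could be built along these lines: each gene is a node, an edge is added when a BLAST hit between two genes exceeds 70% identity, and clusters are the connected components. This is a sketch under assumptions: the hits are given as simple (query, subject, percent identity) tuples, the gene names are placeholders, and alignment length is ignored, which a real analysis would probably also consider.

```python
def build_graph(hits, min_identity=70.0):
    """Build an undirected adjacency map from (query, subject, percent_identity) hits."""
    graph = {}
    for query, subject, pident in hits:
        graph.setdefault(query, set())
        graph.setdefault(subject, set())
        # Ignore self-hits; add an edge only above the identity cutoff.
        if query != subject and pident > min_identity:
            graph[query].add(subject)
            graph[subject].add(query)
    return graph


def connected_components(graph):
    """Return clusters as sorted lists of node names, using an iterative depth-first search."""
    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(graph[current] - seen)
        components.append(sorted(component))
    return components


# Placeholder hits, not real BLAST output.
hits = [
    ("gene_A", "gene_B", 85.0),
    ("gene_B", "gene_C", 72.5),
    ("gene_C", "gene_D", 65.0),
    ("gene_E", "gene_E", 100.0),
]
clusters = connected_components(build_graph(hits))
print(clusters)
```

Genes that end up in the same cluster are the ones most likely to attract ambiguous short-read mappings.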

14 of 17

IS impact on mapping

Mapping errors visible around IS6110 insertion sequence (reads from same genome)

15 of 17

Long reads for M. tuberculosis WGS

  • Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) produce long but noisy reads
  • Error rate for long reads is high (but dropping)
    • ONT error rate fell from 20% to under 5% in 5 years
  • Consensus error rate for alignments is lower
    • homopolymer errors (e.g. GG -> GGG) remain
  • Combined long and short read technologies allow for rapid de-novo genome assembly
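
To illustrate the homopolymer point above, one common trick is homopolymer compression: collapse every run of identical bases to a single base, then compare. If two sequences differ before compression but match after it, the difference is explained by run lengths alone (e.g. GG vs GGG). The function names are hypothetical and this is only a sketch, not how any particular consensus tool works.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base, e.g. 'ACGGTTTA' -> 'ACGTA'."""
    return "".join(base for base, _ in groupby(seq))


def differs_only_in_homopolymers(a, b):
    """True if a and b differ, but only in the lengths of their homopolymer runs."""
    return a != b and homopolymer_compress(a) == homopolymer_compress(b)


print(homopolymer_compress("ACGGTTTA"))
print(differs_only_in_homopolymers("ACGGT", "ACGGGT"))
print(differs_only_in_homopolymers("ACGGT", "ACGCT"))
```

A check like this can help flag candidate indels in long-read consensus sequences that deserve confirmation with short reads.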

16 of 17

Some useful tools for M. tb Bioinformatics on usegalaxy.eu

  • TB variant filter
    • Applies common filtering operations to predicted variants
  • TB VCF report
  • TB Profiler
    • Drug resistance and lineage prediction tool from Jody Phelan (LSHTM)
      • note: there are other tools for this job, e.g. PhyResSE and UVP, just not on usegalaxy.eu

17 of 17

Acknowledgements

  • Iñaki Comas et al. for the ancestral reference genome and work on contamination in samples
  • Conor Meehan for all-round excellent papers and M. tuberculosis bioinformatics Twitter
  • Caroline Colijn for transmission modelling
  • Jody Phelan for TB Profiler
  • The COMBAT TB group at SANBI (Thoba Lose, Ziphozakhe Mashologu)
  • Torsten Seemann for commenting on these slides (and Snippy, and shovill, etc.)
  • South African National Research Foundation and Medical Research Council (SANBI funders)
  • Many many more I forgot to name
