1 of 202

Immunoinformatics: immunology meets computing

Yana Safonova

Center for Information Theory and Applications

University of California at San Diego

2 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

2

3 of 202

Reconstructing Strings from Random Traces

3

Input. A collection of random subsequences (traces) of a string t, where each trace is obtained by deleting each symbol in the string with probability q

Output. The string t

Batu, Kannan, Khanna, McGregor, SODA 2004
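
The deletion channel in this problem statement is easy to simulate. The sketch below is illustrative only: the function names, the fixed seed, and the choice q = 0.2 are assumptions and do not come from the cited paper.

```python
import random

def make_trace(t, q, rng):
    """Delete each symbol of t independently with probability q."""
    return "".join(ch for ch in t if rng.random() >= q)

def is_subsequence(trace, t):
    """Check that trace can be obtained from t by deletions only."""
    it = iter(t)
    return all(ch in it for ch in trace)

rng = random.Random(0)          # fixed seed, for reproducibility only
t = "TCGGGGGTTTTT"              # the example string used on the slides
traces = [make_trace(t, q=0.2, rng=rng) for _ in range(7)]
for trace in traces:
    print(trace)
```

Every trace is a subsequence of t, and the reconstruction task is to recover t from such a collection without knowing which symbols were deleted.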

4 of 202

Reconstructing Strings from Random Traces

4

Input. A collection of random subsequences (traces) of a string t, where each trace is obtained by deleting each symbol in the string with probability q

Output. The string t

Batu, Kannan, Khanna, McGregor, SODA 2004

T C G G G G G T T T T T

T C G G G G G T T T T T

T C G G G G G T T T T T

T C G G G G G T T T T T

T C G G G G G T T T T T

T C G G G G G T T T T T

T C G G G G G T T T T T

5 of 202

Reconstructing Strings from Random Traces

5

Input. A collection of random subsequences (traces) of a string t, where each trace is obtained by deleting each symbol in the string with probability q

Output. The string t

Batu, Kannan, Khanna, McGregor, SODA 2004

T G G G G T T T T

C G G G G T T T T

T G G G T T T T

G G G G T T T

T C G G G T T T

T G G G G G T T T

C G G G T T T T T

7 of 202

Reconstructing Strings from Random Traces

7

Input. A collection of random subsequences (traces) of a string t, where each trace is obtained by deleting each symbol in the string with probability q

Output. The string t

DNA sequencing

  • Ion Torrent: error 2%, length 500 nt

  • Pacific Biosciences: error 10%, length 15 Kb

  • Oxford Nanopore: error 14%, length 30 Kb

8 of 202

Information Theory Meets Cancer Biology

8

Input. A collection of random subsequences (traces) of a string t, where each trace is obtained by deleting, inserting, or substituting each symbol in the string according to a complex probabilistic model with poorly understood parameters.

Output. The string t

tumor

development

Blackburn & Langenau. Disease Models & Mechanisms, 2014

9 of 202

Information Theory Meets Immunology

9

Input. A collection of random subsequences (traces) derived from a set of strings T, where each trace is obtained by deleting, inserting, or substituting each symbol in one of the strings in T according to a complex probabilistic model with poorly understood parameters.

Output. The set of strings T

10 of 202

Information Theory Meets Immunology

10

Input. A collection of random subsequences (traces) derived from a set of strings T, where each trace is obtained by deleting, inserting, or substituting each symbol in one of the strings in T according to a complex probabilistic model with poorly understood parameters.

Output. The set of strings T

The reality is even more complex: traces are derived not from a set of strings, but rather from concatenations of some strings in an unknown set of strings T.

11 of 202

Information Theory Meets Immunology

11

Input. A collection of random subsequences (traces) derived from a set of strings T, where each trace is obtained by deleting, inserting, or substituting each symbol in one of the strings in T according to a complex probabilistic model with poorly understood parameters.

Output. The set of strings T

The reality is even more complex: traces are derived not from a set of strings, but rather from concatenations of some strings in an unknown set of strings T.

Time to learn Immunology 101!

12 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

12

13 of 202

Adaptive immune system

  • Variety of threats to human body is huge and unpredictable

  • Genome is too small to encode defences against all these threats

  • Immune system has an ability to adapt to various threats using agents (e.g., antibodies) that are not encoded in the genome.

13

14 of 202

Antibodies

  • Antibodies are proteins that bind to a specific threat (called an antigen) and neutralize it

  • Immune system generates millions of different antibodies (repertoire) to neutralize various antigens

14

Specificity rule:

one antibody – one antigen

(not necessarily true)

Antibody

Antigen

15 of 202

Generation of antibodies

Before recombination, the genome of an antibody-producing cell (B cell) looks exactly like genomes of all other cells:

15

Immunoglobulin locus (Chr 14), length ~1.25 Mb

V

165-305 nt

avg. 291 nt

D

11-37 nt

avg. 24 nt

J

48-63 nt

avg. 54 nt

16 of 202

Selection of J segment...

16

17 of 202

Left cleavage of J segment...

17

18 of 202

Selection of D segment...

18

19 of 202

Right cleavage of D segment...

19

20 of 202

Concatenation of D and J segments...

20

Newly created unique genomic region

21 of 202

Left cleavage of DJ fragment...

21

22 of 202

Selection of V segment...

22

23 of 202

Right cleavage of V segment...

23

24 of 202

VDJ concatenation (variable region of antibody)

24

360 nt of VDJ + 1000 nt of constant region

instead of original 1.25 Mb

Constant region

25 of 202

Variable region of antibodies contains antigen binding sites

25

Constant region

360 nt of VDJ + 1000 nt of constant region

instead of original 1.25 Mb

26 of 202

Why are antibodies so diverse if there are only 55×23×6 VDJ recombinations?

26

27 of 202

Why are antibodies so diverse if there are only 55×23×6 VDJ recombinations?

27

Recombination process is imperfect and includes many random processes:

  • Palindromic insertions

28 of 202

Why are antibodies so diverse if there are only 55×23×6 VDJ recombinations?

28

Recombination process is imperfect and includes many random processes:

  • Palindromic insertions
  • Segment cleavage

30 of 202

Why are antibodies so diverse if there are only 55×23×6 VDJ recombinations?

30

Recombination process is imperfect and includes many random processes:

  • Palindromic insertions
  • Segment cleavage
  • Non-genomic insertions

31 of 202

Why are antibodies so diverse if there are only 55×23×6 VDJ recombinations?

31

Recombination process is imperfect and includes many random processes:

  • Palindromic insertions
  • Segment cleavage
  • Non-genomic insertions

Recombined antibodies may undergo somatic mutagenesis
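
A toy simulation shows how segment cleavage and non-genomic insertions inflate diversity far beyond the number of plain V-D-J combinations. This is a minimal sketch under stated assumptions: the segment strings, the uniform cut and insertion lengths, and the function name `recombine` are hypothetical, and palindromic insertions and somatic mutagenesis are left out.

```python
import random

N_V, N_D, N_J = 55, 23, 6           # segment counts quoted on the slide
print(N_V * N_D * N_J)              # 7590 plain V-D-J combinations

def recombine(v, d, j, rng, max_cut=3, max_insert=4):
    """Toy model: trim the joined ends, then add random non-genomic bases."""
    def cut(seg, left, right):
        lo = rng.randint(0, max_cut) if left else 0
        hi = rng.randint(0, max_cut) if right else 0
        return seg[lo:len(seg) - hi]
    def insertion():
        return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))
    return (cut(v, False, True) + insertion() +
            cut(d, True, True) + insertion() +
            cut(j, True, False))

rng = random.Random(1)
v, d, j = "ACGTACGTAC", "GGGTTTAAACCC", "TTGACCGGTA"   # hypothetical toy segments
junctions = {recombine(v, d, j, rng) for _ in range(1000)}
print(len(junctions))   # many distinct sequences from a single V-D-J choice
```

Even with one fixed choice of V, D, and J, the random junctions yield many different sequences, which is the point the slide makes.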

33 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

33

34 of 202

From biological problems to computational challenges

34

VDJ classification problem. Given an antibody generated from a known set of V, D, and J segments, identify what specific V, D, and J segments generated this antibody
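
As a rough illustration of the classification task, the sketch below assigns to a read the germline segment that shares the longest exact run with it. This is a deliberately crude stand-in and not the method used by the tools cited later in the talk, which rely on alignment or hidden Markov models. The toy database and the names `longest_shared_run` and `classify` are assumptions.

```python
from difflib import SequenceMatcher

def longest_shared_run(segment, read):
    """Length of the longest exact substring shared by segment and read."""
    # autojunk=False: the default heuristic would ignore frequent symbols
    # in long sequences, which is harmful for a four-letter alphabet.
    m = SequenceMatcher(None, segment, read, autojunk=False)
    return m.find_longest_match(0, len(segment), 0, len(read)).size

def classify(read, germline):
    """germline: {"V": {name: seq}, "D": {...}, "J": {...}} (toy database)."""
    return {kind: max(segs, key=lambda name: longest_shared_run(segs[name], read))
            for kind, segs in germline.items()}

germline = {                                   # hypothetical segments
    "V": {"V1": "ACGTTGCAAGGCTA", "V2": "TTGACCGGTATACC"},
    "D": {"D1": "GGGTAC", "D2": "CATCAT"},
    "J": {"J1": "TGGGGCCAAG", "J2": "CTTTGACTAC"},
}
# A read made of trimmed V2, D1, and trimmed J1 with one inserted base at each junction.
read = germline["V"]["V2"][:-2] + "A" + germline["D"]["D1"] + "C" + germline["J"]["J1"][1:]
print(classify(read, germline))
```

Short D segments and heavy mutation make this exact-match score unreliable in practice, which is one reason dedicated tools exist.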

35 of 202

From biological problems to computational challenges

35

Model organisms in immunology with still unknown sets of V, D, and J segments

VDJ classification problem. Given an antibody generated from a known set of V, D, and J segments, identify what specific V, D, and J segments generated this antibody

36 of 202

From biological problems to computational challenges

36

VDJ reconstruction problem. Given antibodies generated from an unknown set of V, D, and J segments, reconstruct these sets

VDJ classification problem. Given an antibody generated from a known set of V, D, and J segments, identify what specific V, D, and J segments generated this antibody

37 of 202

From biological problems to computational challenges

37

VDJ reconstruction problem. Given error-prone reads representing antibodies generated from an unknown set of V, D, and J segments, reconstruct these sets

VDJ classification problem. Given an antibody generated from a known set of V, D, and J segments, identify what specific V, D, and J segments generated this antibody

38 of 202

VDJ classification problem is solved!

38

IMGT/V-QUEST

Brochet et al, Nucleic Acids Res, 2008

IgBlast

Ye et al, Nucleic Acids Res, 2013

iHMMune-align

Gaeta et al, Bioinformatics, 2007

Antibody graph

Bonissone and Pevzner, RECOMB 2015

VDJ classification problem. Given an antibody generated from a known set of V, D, and J segments, identify what specific V, D, and J segments generated this antibody

includes database of V, D, J segments

39 of 202

VDJ classification problem is solved!

39

VDJ classification problem. Given an antibody generated from a known set of V, D, and J segments, identify what specific V, D, and J segments generated this antibody

Only if VDJ segments do not vary widely between individuals!

40 of 202

How do VDJ segments vary across the population?

40

VDJ variants problem. Given reference V, D, and J segments and antibody repertoire from an individual, reconstruct how V, D, and J segments in this individual differ from the reference and discover new V, D, and J segments.

41 of 202

How accurate is the database of human V, D, and J segments?

  • Annotation of Ig loci in the human genome contains long tandem repeats and is likely assembled with errors
  • D segments were manually (!) extracted in the 1990s (!) from antibodies sequenced using old technology and without an assembled human genome

41

Watson et al, AJHG, 2013

Watson et al, Genes & Immunity, 2014

42 of 202

Finding novel V segments

42

Germline V segment

43 of 202

Finding novel V segments

43

Germline V segment

✪ ❃ ❇

✪ ❃ ❇

Novel V segments:

44 of 202

Finding novel V segments

44

Germline V segment

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

Novel V segments:

45 of 202

Finding novel V segments

45

Germline V segment

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

✪ ❃ ❇

Novel V segments:

Our analysis revealed:

  • > 100 segments with < 4 mismatches

  • 19 segments with 4-6 mismatches

  • 17 segments with 7-13 mismatches

  • 6 segments with 14-89 mismatches

Novel segments in rhesus macaques were discovered by

Corcoran et al., Nat Communications, 2017

46 of 202

Crossover causes genetic diversity

46

father chr.

mother chr.

child chr.

child chr.

47 of 202

Crossover causes genetic diversity

47

father chr.

mother chr.

48 of 202

Unequal crossover may produce new segments

48

father chr.

mother chr.

49 of 202

Unequal crossover may produce new segments

Length of Ig loci: ~1.25 Mbp

Frequency of crossing over: 1 point per 1 Mbp

49

father chr.

mother chr.

child chr.

child chr.

50 of 202

Variations in Ig loci: alternative method for population analysis?

  • In contrast to protein-coding regions, the Ig loci may accumulate many mutations since they are not under direct evolutionary pressure

  • Unequal crossover may be responsible for Ig loci diversification

50

Ig loci reconstruction problem. Given error-prone reads representing antibodies and reads sampled from the genome, assemble the Ig loci

51 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

51

52 of 202

Antibody repertoire sequencing (Rep-seq)

52

V

D

J

Length: ~360 nt

Left read

Right read

VDJ from DNA or RNA

Error-prone immunosequencing reads

Turchaninova et al, Nat Protocols, 2016

53 of 202

Repertoire construction problem

53

× 7

× 3

× 2

× 2

Antibody repertoire

× 1

Rep-seq reads

Antibody repertoire is the set of VDJ sequences with their abundances

54 of 202

Repertoire construction problem

54

× 7

× 3

× 2

× 2

Antibody repertoire

Antibody repertoire is the set of V(D)J sequences with their abundances

× 1

Rep-seq reads

Colors of reads and positions of errors are unknown!

55 of 202

Repertoire construction problem

55

Repertoire construction problem combines

clustering and error-correction
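
The error-correction half of this statement can be sketched as a column-wise majority vote inside each cluster. The clustering itself is assumed to be given here, reads in a cluster are assumed to have equal length, and the function names are hypothetical.

```python
from collections import Counter

def consensus(reads):
    """Column-wise majority vote over equal-length reads of one cluster."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def build_repertoire(clusters):
    """Map each cluster of reads to (consensus sequence, abundance)."""
    return [(consensus(reads), len(reads)) for reads in clusters]

clusters = [["ACGT", "ACGT", "ACGA"],   # one read carries an error in the last base
            ["TTGT", "TTGT"]]
print(build_repertoire(clusters))
```

The hard part of the problem is obtaining the clusters, since neither their number nor the error positions are known in advance.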

× 7

× 3

× 2

× 2

Antibody repertoire

× 1

Rep-seq reads

56 of 202

Repertoire construction problem

56

pRESTO

Vander-Heiden et al,

Bioinformatics 2014

MiXCR

Bolotin et al,

Nat Methods 2015

IgRepertoireConstructor

Safonova et al,

Bioinformatics 2015

× 7

× 3

× 2

× 2

Antibody repertoire

× 1

Rep-seq reads

57 of 202

What makes this problem difficult?

57

x 1 ~ 1018

Similarity of antibody sequences: different antibodies may share V

Difficult to distinguish between errors and natural variations

Number of clusters is unknown

Uneven distribution of abundances

58 of 202

Difficulty of clustering problem

58

62 of 202

Small antibody repertoire

Six antibodies with large abundances and many singleton antibodies (real data)

62

63 of 202

Each antibody forms a dense subgraph in the ideal Hamming graph

63

Reads derived from the same antibody differ from each other due to sequencing errors (Hamming distance=3)
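
A Hamming graph over reads can be built with a quadratic all-pairs comparison, as in the sketch below. It assumes that only reads of equal length are compared and that the distance threshold `tau` is chosen by the user; real tools avoid the all-pairs scan, so this is for intuition only.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def hamming_graph(reads, tau):
    """Adjacency sets over read indices; edge iff Hamming distance <= tau."""
    adj = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if len(reads[i]) == len(reads[j]) and hamming(reads[i], reads[j]) <= tau:
            adj[i].add(j)
            adj[j].add(i)
    return adj

reads = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTTTTTTT"]
print(hamming_graph(reads, tau=1))
```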

64 of 202

Real Hamming graph

107 nodes, 1426 edges

64

Similarity of natural variation and sequencing errors makes clusters highly connected

65 of 202

Actually, the real Hamming graph is not colored

65

Repertoire construction problem. Given error-prone Rep-seq reads representing antibodies, reconstruct sequences of antibodies

66 of 202

Repertoire construction problem

66

Repertoire construction problem. Given error-prone Rep-seq reads representing antibodies, reconstruct sequences of antibodies

67 of 202

Repertoire construction problem

67

Repertoire construction problem. Given a Hamming graph constructed from error-prone Rep-seq reads, find its dense subgraphs

68 of 202

Finding dense subgraphs in the Hamming graph

68

Hamming graph HG

Triangulated Hamming graph THG

triangulated Hamming graph

original Hamming graph

Each cycle of length > 3 in a triangulated graph has a chord
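
Whether a graph is triangulated (chordal) can be tested by repeatedly removing a simplicial vertex, that is, a vertex whose neighbours form a clique. The sketch below uses this simple, non-optimized test; the function names are assumptions, and faster linear-time methods exist.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given nodes is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def elimination_order(adj):
    """Return a perfect elimination ordering, or None if the graph is not chordal."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    order = []
    while adj:
        simplicial = next((u for u in adj if is_clique(adj, adj[u])), None)
        if simplicial is None:
            return None                            # a chordless cycle remains
        order.append(simplicial)
        for v in adj.pop(simplicial):
            adj[v].discard(simplicial)
    return order

square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}                # 4-cycle, no chord
square_with_chord = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(elimination_order(square))             # None
print(elimination_order(square_with_chord))  # an ordering exists
```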

69 of 202

Finding dense subgraphs in the Hamming graph

69

Hamming graph HG

Triangulated Hamming graph THG

Our Hamming graphs are “almost” triangulated

Minimum Triangulation Problem: find the minimum number of edge additions to convert a graph into a triangulated graph

triangulated Hamming graph

70 of 202

Finding dense subgraphs in the Hamming graph

70

Hamming graph HG

Triangulated Hamming graph THG

The minimum triangulation problem has effective approximate solutions, e.g., METIS by Karypis and Kumar, SIAM J Sci Comput, 1999

triangulated Hamming graph

71 of 202

Finding dense subgraphs in the Hamming graph

Maximal cliques in a triangulated graph can be computed in polynomial time (Galinier et al, LNCS, 1995)
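
The polynomial-time claim follows from a standard property of triangulated graphs: given a perfect elimination ordering, every maximal clique consists of some vertex together with its neighbours that come later in the ordering. The sketch below illustrates that property; it is not the algorithm of the cited paper, and the function names are assumptions.

```python
from itertools import combinations

def perfect_elimination_order(adj):
    """Greedy simplicial-vertex removal; raises if the graph is not chordal."""
    work = {u: set(vs) for u, vs in adj.items()}
    order = []
    while work:
        for u, nbrs in work.items():
            if all(b in work[a] for a, b in combinations(nbrs, 2)):
                break                      # u is simplicial
        else:
            raise ValueError("graph is not triangulated (chordal)")
        order.append(u)
        for v in work.pop(u):
            work[v].discard(u)
    return order

def maximal_cliques(adj):
    """Maximal cliques of a chordal graph, from one candidate per vertex."""
    position = {u: i for i, u in enumerate(perfect_elimination_order(adj))}
    candidates = {frozenset([u] + [v for v in adj[u] if position[v] > position[u]])
                  for u in adj}
    # Keep only candidates that are not strictly contained in another one.
    return [c for c in candidates if not any(c < other for other in candidates)]

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}   # a triangle with a pendant edge
print(maximal_cliques(adj))
```

There is at most one candidate per vertex, so a triangulated graph on n vertices has at most n maximal cliques.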

71

Hamming graph HG

Triangulated Hamming graph THG

Maximal cliques in THG

triangulated Hamming graph

72 of 202

Finding dense subgraphs in the Hamming graph

72

Triangulated Hamming graph THG

Maximal cliques in THG

clique graph

triangulated Hamming graph

Hamming graph HG

77 of 202

Finding dense subgraphs in the Hamming graph

77

Hamming graph HG

Triangulated Hamming graph THG

Maximal cliques in THG

clique graph

triangulated Hamming graph

79 of 202

Finding dense subgraphs in the Hamming graph

79

Hamming graph HG

Triangulated

Hamming graph THG

Maximal cliques in THG

Dense subgraphs in THG

clique graph

82 of 202

Finding dense subgraphs in the Hamming graph

82

Hamming graph HG

Triangulated

Hamming graph THG

Maximal cliques in THG

Dense subgraphs in THG

Dense subgraphs in HG

original Hamming graph

83 of 202

IgRepertoireConstructor in action

83

adjacency matrix

84 of 202

IgRepertoireConstructor in action

84

85 of 202

A closer look at a dense subgraph

85

Is it a single antibody or multiple similar antibodies?

90 of 202

Dense subgraphs in more detail

Dense subgraphs correspond to identical or very similar antibodies

Detection of variations within dense subgraphs helps to partition dense subgraphs into individual antibodies

90

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CCGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTTTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCTGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

*.******.********.**********************.*******

Problem: how can we distinguish edges corresponding to variations from edges corresponding to sequencing errors?

91 of 202

Dense subgraphs in more detail

91

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CCGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTTTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCTGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

*.******.********.**********************.*******

Each edge in the dense subgraph corresponds to either an error or to a variation. Errors correspond to spurious positions in the multiple alignment, while variations correspond to solid positions
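
One simple way to separate solid from spurious positions is to look at how many reads support the second most frequent symbol in each alignment column. The sketch below applies this to the 14 reads shown on the slide, rebuilt from the first read by substitutions. The threshold `min_support = 2` and the function names are assumptions, not the rule used by IgRepertoireConstructor.

```python
from collections import Counter

def classify_columns(reads, min_support=2):
    """Split variable alignment columns into solid and spurious ones.

    min_support is an assumed threshold: a column is called solid when its
    second most frequent symbol occurs in at least min_support reads.
    """
    solid, spurious = [], []
    for pos, column in enumerate(zip(*reads)):
        counts = Counter(column).most_common()
        if len(counts) == 1:
            continue                      # invariant column
        (solid if counts[1][1] >= min_support else spurious).append(pos)
    return solid, spurious

def mutate(seq, changes):
    """Return seq with {position: base} substitutions applied."""
    return "".join(changes.get(i, ch) for i, ch in enumerate(seq))

base = "CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT"
variant = mutate(base, {40: "G"})
reads = ([base] * 6 + [mutate(base, {1: "C"})] +
         [variant] * 5 + [mutate(variant, {8: "T"}), mutate(variant, {17: "T"})])
print(classify_columns(reads))   # ([40], [1, 8, 17])
```

Position 40 (0-based) splits the reads seven to seven and is called solid, while positions 1, 8, and 17 are each supported by a single read and are called spurious, matching the four dots in the alignment summary line.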

92 of 202

Dense subgraph can combine similar antibodies

92

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CCGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTACCTATGAT

CTGAGACTTTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCTGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

CTGAGACTCTCCTGTTCAGCCTCTGGATTCACCTTCAGTAGCTATGAT

*.******.********.**********************.*******

93 of 202

Final decomposition of the Hamming graph

93

35 trivial and 7 non-trivial clusters

94 of 202

94

Tool for constructing antibody repertoire from Rep-seq reads:

  • Safonova et al., Bioinformatics 2015;
  • Shlemov et al., RECOMB 2017

Works on both barcoded and non-barcoded polyclonal datasets

Performs well for both antibody and TCR repertoires

Available at Illumina BaseSpace

Works slowly on very complex Rep-seq datasets

IgRepertoireConstructor / IgReC

95 of 202

IgQUAST: quality assessment of antibody repertoires

  • IgReC improved upon state-of-the-art tools

  • Works well in terms of both sensitivity and precision

  • Works well for different types of Rep-seq data

95

2017

96 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

96

97 of 202

Antibody maturation

97

target antigen

mutation process

98 of 202

Antibody maturation

98

Antibody gains better specificity to the target antigen

target antigen

mutation process

99 of 202

Antibody maturation

99

target antigen

mutation process

Good! Positive selection

100 of 202

Antibody maturation

100

target antigen

mutation process

101 of 202

Antibody maturation

101

Antibody gains specificity to an antigen of the host (self-reactivity)

target antigen

mutation process

102 of 202

Antibody maturation

102

target antigen

mutation process

Bad: Negative selection

103 of 202

Antibody maturation

103

target antigen

mutation process

104 of 202

Antibody maturation

104

Antibody loses specificity to the target antigen

target antigen

mutation process

105 of 202

Antibody maturation

105

target antigen

mutation process

Bad: Negative selection

106 of 202

Turning antibodies into vertices...

A directed edge connects parental antibody and its mutated copy

107 of 202

Antibody expansion...

The higher the specificity of an antibody, the more likely it is to be expanded

108 of 202

Antibody expansion...

109 of 202

Evolutionary tree!

109

110 of 202

Evolutionary tree = clonal tree

110

  • Vertices correspond to mutated clones of ancestral B-cell

111 of 202

Evolutionary tree of antibodies = clonal tree

111

  • Vertices correspond to mutated clones of ancestral B-cell

  • All vertices have the same VDJ regions and differ only by mutations

112 of 202

Evolutionary tree of antibodies = clonal tree

112

  • Vertices correspond to mutated clones of ancestral B-cell

  • All vertices have the same VDJ regions and differ only by mutations

  • Some vertices are missing!

113 of 202

Evolution of antibody repertoire

Initially, antibody repertoire consists of many “naive” VDJ recombinations (without mutations)

113

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

114 of 202

Binding to antigens initiates selection

114

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

V

D

J

115 of 202

Clonal expansion and mutagenesis turn single antibodies into lineages

115

V

D

J

V

D

J

V

D

J

117 of 202

Each lineage is described by its clonal tree

117

V

D

J

V

D

J

V

D

J

118 of 202

Antibody repertoire is described by a set of clonal trees

Standard phylogenetics algorithms do not work here!

118

119 of 202

Two faces of clonal reconstruction

Clonal lineage assignment

Clonal tree construction

Change-O

Gupta et al, Bioinformatics 2015

Clonify

Briney et al,

Sci Rep 2016

partis

Ralph and Matsen,

PLOS CB 2016

repgenHMM

Elhanati et al,

Bioinformatics 2016

119

Given antibody repertoire, decompose it into clonal lineages

Given clonal lineage, construct its clonal tree

Can we apply existing phylogenetic tree construction algorithms to solve this problem?

120 of 202

AntEvolo: clonal tree constructor

120

121 of 202

Application of clonal analysis

121

HIV, 94th week, largest tree

  • Design of drugs and vaccines

  • Monitoring adaptive immune response and analysis of treatment efficiency

Haynes et al., Nat Biotech, 2012

Liao et al., Nature, 2013

Laserson et al., PNAS, 2014

Galson et al., Genome Med, 2016

During immune response, antibodies specific to an active antigen are expanded into huge clonal trees

122 of 202

Antibodies gain mutations during maturation

122

V

D

J

124 of 202

Antibodies gain mutations during maturation

124

98% – random substitutions; 2% – short random indels

V

D

J

125 of 202

VDJ classification helps to reveal mutations

125

V

D

J

V

D

J

Closest germline segments from database

Mutations

126 of 202

Who is the ancestor here?

126

V

D

J

V

D

J

Closest germline segments from database

V

D

J

127 of 202

Nested mutations define evolutionary direction

127

V

D

J

V

D

J

Closest germline segments from database

V

D

J

New mutations

128 of 202

Who is the ancestor here?

128

V

D

J

V

D

J

Closest germline segments from database

V

D

J

129 of 202

Who is the ancestor here?

129

V

D

J

V

D

J

Closest germline segments from database

V

D

J

Individual mutations 1

Individual mutations 2

130 of 202

Ancestral antibody is missing in the repertoire

130

V

D

J

V

D

J

Closest germline segments from database

V

D

J

Individual mutations 1

Individual mutations 2

V

D

J

131 of 202

What is the evolutionary tree?

131

132 of 202

Nested mutations define directions

132

133 of 202

Repeated mutations complicate construction of the clonal tree

133

135 of 202

Homoplasy in antibodies

What is the structure of evolutionary tree connecting a, b, and c?

135

136 of 202

Homoplasy in antibodies

136

Parallel evolution of

138 of 202

Homoplasy in antibodies

Reverse of

139 of 202

We do not know which tree is correct

Reverse of

Parallel evolution of

Parallel evolution of

140 of 202

Analyzing homoplasy in antibodies using clonal graph

  • Vertices – sequences from the same clonal lineage
  • Vertices a and b are connected by an edge if mutations of a are nested into mutations of b

  • Transitive edges are removed
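
The three rules above can be sketched directly, assuming each sequence is summarized by the set of mutations it carries relative to the germline. The mutation labels and the function name `clonal_graph` are hypothetical, and edges are directed from the sequence with fewer mutations to the one with more.

```python
def clonal_graph(mutations):
    """mutations: {sequence id: set of mutations relative to the germline}."""
    nested = {(a, b) for a in mutations for b in mutations
              if mutations[a] < mutations[b]}            # proper subset
    # Drop an edge a->b when some c sits strictly between a and b.
    return {(a, b) for (a, b) in nested
            if not any((a, c) in nested and (c, b) in nested for c in mutations)}

mutations = {                      # hypothetical lineage
    "a": {"C8T"},
    "b": {"C8T", "A17T"},
    "c": {"C8T", "A17T", "C40G"},
    "d": {"C8T", "G5A"},
}
print(sorted(clonal_graph(mutations)))
```

The edge from a to c is dropped because b lies between them, so only the direct parent-to-child steps remain.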

140

Connected component of clonal graph constructed from lymphoma repertoire

142 of 202

Counting parallel mutations

142

                               # non-trivial SHMs   max frequency   # synonymous non-trivial SHMs   # RGYW/WRCY non-trivial SHMs
Lymphoma                       172 (72.52%)         20              47                              50
HIV, 94th week                 544 (82.05%)         79              120                             132
Plasma cells, positive to flu  90 (90%)             7               37                              15
Plasma cells, negative to flu  3 (3.3%)             2               3                               1

  • List all unique mutations using the clonal graph
  • Compute the frequency of a unique mutation as the number of edges on which it appears
  • Mutation is non-trivial if its frequency > 1
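
The counting procedure above can be written as a short sketch. It assumes the clonal graph is given as a list of (parent, child) edges together with the mutation set of every vertex; the example lineage and all names are hypothetical.

```python
from collections import Counter

def mutation_frequencies(edges, mutations):
    """Count, for each mutation, the number of clonal-graph edges that add it."""
    freq = Counter()
    for parent, child in edges:
        freq.update(mutations[child] - mutations[parent])
    return freq

def non_trivial(freq):
    """Mutations that appear on more than one edge."""
    return {m for m, n in freq.items() if n > 1}

mutations = {
    "root": set(),
    "u": {"G5A"}, "v": {"G5A", "A17T"},
    "x": {"C40G"}, "y": {"C40G", "A17T"},
}
edges = [("root", "u"), ("u", "v"), ("root", "x"), ("x", "y")]
freq = mutation_frequencies(edges, mutations)
print(freq, non_trivial(freq))
```

Here A17T arises independently on two branches and is therefore counted as non-trivial, while G5A and C40G each appear on a single edge.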

143 of 202

Are non-trivial mutations important?

143

Hypothesis 1

Non-trivial mutations are fixed in antibodies after multiple encounters with antigens

Conclusion: Such mutations are very important for affinity and thus for drug design

144 of 202

Are non-trivial mutations important?

144

Hypothesis 2

Non-trivial mutations occurred during multiple cell divisions after a single encounter with antigen

Conclusion: Such mutations are important for mutagenesis modeling

Hypothesis 1

Non-trivial mutations are fixed in antibodies after multiple encounters with antigens

Conclusion: Such mutations are very important for affinity and thus for drug design

145 of 202

145

Largest tree for HIV, 94th week

146 of 202

146

Largest tree for HIV, 94th week

Probably, products of a single encounter with an antigen

147 of 202

147

Largest tree for HIV, 94th week

Probably, products of multiple encounters with antigens

148 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

148

149 of 202

Three types of antibody chains

Human antibodies have one type of heavy chain (IGH) and two types of light chains:

  • κ encoded by IGK locus (chr 2)
  • λ encoded by IGL locus (chr 22)

149

150 of 202

VDJ recombination randomly selects between IGK and IGL

150

151 of 202

Resulting antibody may be self-reactive

151

In this case, immune system gives B cell producing self-reactive antibody a second chance

152 of 202

Receptor editing process is intended to fix self-reactive chains

152

153 of 202

Newly recombined antibody may be helpful

153

154 of 202

Sometimes receptor editing affects another light chain locus (isotypic inclusion)

154

B cell produces 2 antibodies

155 of 202

Self-reactive chain still works, but it is suppressed by immune system

155

156 of 202

Receptor editing may also affect alternative allele (allelic inclusion)

156

157 of 202

In the worst case, a B cell may produce 6 different chains

157

It is still unknown how many antibodies can be produced by such a B cell and how many of them are self-reactive
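The count of 6 chains can be reproduced by simple enumeration. This is a minimal sketch, assuming the worst case means both alleles of all three loci (IGH, IGK, IGL) are expressed; the allele labels are hypothetical. The pairing count is only a combinatorial upper bound under the further assumption that any heavy chain can pair with any light chain, which the slides leave open.

```python
from itertools import product

# Assumption: three Ig loci, each with two alleles (labels are illustrative).
LOCI = {"IGH": "heavy", "IGK": "light", "IGL": "light"}
ALLELES = ("allele 1", "allele 2")

# Every (locus, allele) combination is one chain the cell could express.
chains = [(locus, allele) for locus in LOCI for allele in ALLELES]
heavy = [chain for chain in chains if LOCI[chain[0]] == "heavy"]
light = [chain for chain in chains if LOCI[chain[0]] == "light"]

# Combinatorial upper bound: each heavy chain paired with each light chain.
pairings = list(product(heavy, light))

print(len(chains), "chains,", len(pairings), "possible heavy-light pairings")
```

The bound says nothing about which pairings are actually assembled or self-reactive.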

160 of 202

Multichain effect corrupts results of clonal analysis

  • If an antibody binds to an antigen, the genome of the B cell undergoes mutagenesis

  • If a B cell produces several chains, all of them undergo mutagenesis, including self-reactive ones

  • As a result of the multichain effect, huge clonal trees may correspond to non-functional or even self-reactive chains

160

Without information about correspondence between chains, clonal analysis may be useless

161 of 202

  • From information theory to immunology
  • Introduction to immunology
  • Population-level analysis of immune system
    • VDJ classification problem
    • VDJ reconstruction problem
    • VDJ variants problem
    • VDJ modeling problem
    • Ig loci reconstruction problem
  • Repertoire construction problem
  • Clonal tree construction problem
  • Multi-chain effect in lymphocytes
  • Paired repertoire construction problem

161

162 of 202

Single-cell RNA-seq of adaptive immune repertoires

162

McDaniel et al., Nat Protocols, 2016

165 of 202

Single-cell RNA-seq of adaptive immune repertoires

165

McDaniel et al., Nat Protocols, 2016

> 97% precision, 3% collisions (a single cell barcode corresponds to several cells)

166 of 202

Clonal analysis for paired antibody repertoire

166

167 of 202

AntEvolo → PairAntEvolo

167

168 of 202

PairAntEvolo

168

Information about correspondence between chains improves quality of clonal trees

169 of 202

Paired data & multi-chain effect

169

Cell collision

170 of 202

Paired data & multi-chain effect

170

Antibody tree construction problem. Given an antibody repertoire, mutagenesis model, and cell barcoding, resolve cell collisions and construct antibody trees
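A first step toward resolving cell collisions is to flag suspicious barcodes. The sketch below is a naive illustration, not the method behind PairAntEvolo: the record layout `(barcode, locus, sequence)`, the function name, the threshold, and the toy sequences are all hypothetical. It flags barcodes carrying more distinct heavy chains than expected. Because of the multi-chain effect, a flagged barcode may also be a single cell with allelic inclusion, so separating the two cases would need the mutagenesis model, which this sketch does not use.

```python
from collections import defaultdict

def candidate_collisions(records, max_heavy_chains=1):
    """Return barcodes with more distinct heavy chains than max_heavy_chains.

    records: iterable of (barcode, locus, sequence) tuples (hypothetical layout).
    """
    heavy_by_barcode = defaultdict(set)
    for barcode, locus, sequence in records:
        if locus == "IGH":
            heavy_by_barcode[barcode].add(sequence)
    return {
        barcode
        for barcode, sequences in heavy_by_barcode.items()
        if len(sequences) > max_heavy_chains
    }

# Toy input: sequences are invented for illustration only.
records = [
    ("bc1", "IGH", "ACGTACGT"),
    ("bc1", "IGK", "TTGACCAA"),
    ("bc2", "IGH", "ACGTACGT"),
    ("bc2", "IGH", "GGCATTCA"),
    ("bc2", "IGL", "CCATGGTA"),
]
flagged = candidate_collisions(records)
```

Raising `max_heavy_chains` to 2 would tolerate allelic inclusion of the heavy chain at the cost of missing some collisions.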

171 of 202

Paired data & multi-chain effect

171

Antibody prediction problem. Given an antibody tree, predict the pairing of chains that produces a functional antibody

172 of 202

How accurate is the database of human V, D, and J segments?

TGTGCGGGGGGTAGCAGTGGCTGGATTGACTACTGG

Fragments matching D segments: AGTGGCT, TAGCAGTGGCTGG, TAGCAG

[Figure: the sequence above aligned against V, J, and three D segments (D1, D2, D3)]

  • The Ig loci in the human genome contain long tandem repeats and are likely assembled and annotated with errors
  • D segments were manually (!) extracted in the mid-1990s (!) from antibodies sequenced using old technology

172

176 of 202

How accurate is the database of human V, D, and J segments?

TGTGCGGGGGGTAGCAGTGGCTGGATTGACTACTGG

Fragments matching D segments: AGTGGCT, TAGCAGTGGCTGG, TAGCAG

[Figure: the sequence above aligned against V, J, and three D segments (D1, D2, D3)]

Cropped versions of D2?

  • The Ig loci in the human genome contain long tandem repeats and are likely assembled and annotated with errors
  • D segments were manually (!) extracted in the mid-1990s (!) from antibodies sequenced using old technology

176
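Whether some fragments are cropped versions of another can be checked with plain substring search. A minimal sketch over the sequence and the three fragments shown on the slide; the function name `contained_in` is hypothetical, and the code makes no claim about which fragment is labeled D1, D2, or D3.

```python
SEQUENCE = "TGTGCGGGGGGTAGCAGTGGCTGGATTGACTACTGG"
FRAGMENTS = ["AGTGGCT", "TAGCAGTGGCTGG", "TAGCAG"]

def contained_in(fragments):
    """Map each fragment to the other fragments that contain it as a substring."""
    return {
        fragment: [other for other in fragments
                   if other != fragment and fragment in other]
        for fragment in fragments
    }

# Position of each fragment in the antibody sequence (-1 would mean "absent").
positions = {fragment: SEQUENCE.find(fragment) for fragment in FRAGMENTS}
nesting = contained_in(FRAGMENTS)

for fragment in FRAGMENTS:
    print(fragment, "at", positions[fragment], "inside", nesting[fragment])
```

Both short fragments sit inside the longest one, so a sequence match alone cannot tell three distinct D segments from one segment and two cropped copies of it.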

177 of 202

Two approaches to finding new V, D, J segments

177

  • Genomes: Watson et al., AJHG, 2013; Watson et al., Genes & Immunity, 2014

  • Antibodies (Rep-seq): antibody repertoires from 5 pairs of twins, Rubelt et al., Nat Communications, 2016

178 of 202

CDRs (complementarity determining regions) are antigen-binding sites of an antibody

  • CDR3 covers the V-D-J junction

  • CDR3 is the most variable segment of the variable region

[Figure: an antibody and its heavy chain, with CDR1, CDR2, and CDR3 marked along the V, D, and J segments]

180 of 202

CDR3 plots

180

Dots correspond to consecutive 10-mers extracted from CDR3

Frequency of a 10-mer: the number of times this 10-mer appeared in unique CDR3s from the 5 × 2 twins

[Plot: 10-mer frequency vs. 10-mer position]

Points corresponding to 10-mers from D segments are painted red

ATTACTATGG, TTACTATGGT, TACTATGGTT, ACTATGGTTC, CTATGGTTCG, TATGGTTCGG

181 of 202

Peaks in CDR3 plots reveal (parts of) D segments!

181

[Plot: 10-mer frequency vs. 10-mer position, with points corresponding to 10-mers from D segments painted red]

Peak: ATTACTATGGTTCGG

Germline D: GTATTACTATGGTTCGGGGAGTTATTATAAC
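The two steps behind a CDR3 plot can be sketched as k-mer counting followed by merging consecutive k-mers at a peak. This is an illustrative sketch, not the original analysis code: the function names are hypothetical, and it assumes each unique CDR3 contributes at most once to the frequency of a given k-mer.

```python
from collections import Counter

def kmer_counts(cdr3s, k=10):
    """Count, for every k-mer, the number of unique CDR3s that contain it."""
    counts = Counter()
    for cdr3 in set(cdr3s):
        counts.update({cdr3[i:i + k] for i in range(len(cdr3) - k + 1)})
    return counts

def merge_consecutive(kmers):
    """Merge k-mers in which each one overlaps the previous one by k - 1 symbols."""
    merged = kmers[0]
    for kmer in kmers[1:]:
        if merged[-(len(kmer) - 1):] != kmer[:-1]:
            raise ValueError(f"{kmer} does not extend {merged}")
        merged += kmer[-1]
    return merged

# The six consecutive 10-mers and the germline D segment from the slides.
PEAK_KMERS = ["ATTACTATGG", "TTACTATGGT", "TACTATGGTT",
              "ACTATGGTTC", "CTATGGTTCG", "TATGGTTCGG"]
GERMLINE_D = "GTATTACTATGGTTCGGGGAGTTATTATAAC"

peak = merge_consecutive(PEAK_KMERS)
print(peak, "found in germline D at offset", GERMLINE_D.find(peak))
```

Merging the six 10-mers gives the 15-nt peak sequence, which is a substring of the germline D segment, consistent with the slide's claim that peaks reveal parts of D segments.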

182 of 202

Toward modeling recombination process

182

Some fragments tend to be cropped on the right side

183 of 202

Toward modeling recombination process

183

Some fragments tend to be cropped on both sides

184 of 202

Toward modeling recombination process

184

Some fragments tend not to be cropped

185 of 202

Finding novel segments

185

ACCACAGATTATATCGAGAGGGGATATGATGAAGGGGACTAC

186 of 202

Aligning antibody against V, D, and J segments

186

ACCACAGATTATATCGAGAGGGGATATGATGAAGGGGACTAC

[Figure: IgBlast alignment of the sequence, with the end of V, the D segment, and the start of J marked]

The D alignment generated by IgBlast looks suspiciously short

187 of 202

187

ACCACAGATTATATCGAGAGGGGATATGATGAAGGGGACTAC

[Plot: 7-mer frequency vs. 7-mer position]

188 of 202

Novel D segment?

188

ACCACAGATTATATCGAGAGGGGATATGATGAAGGGGACTAC

GGATAT

Our prediction does not match the D alignment computed by IgBlast and may correspond to a novel D segment

190 of 202

One more CDR3 and VDJ alignment

190

ACAAGCGGGGGCGAGAGGTACTCTCATACTAATGGTTATCCAAACTACTTTGACTAC

[Figure: alignment of the sequence, with the end of V, the VD insertion, the D segment, the DJ insertion, and the start of J marked]

191 of 202

191

ACAAGCGGGGGCGAGAGGTACTCTCATACTAATGGTTATCCAAACTACTTTGACTAC

[Figure: the sequence with the VD insertion and the DJ insertion marked]

Two D segments in a single CDR3?

Double D segments were reported in HIV-specific antibodies

Larimore et al., J. Immunol., 2012

193 of 202

193

ACAAGCGGGGGCGAGAGGTACTCTCATACTAATGGTTATCCAAACTACTTTGACTAC

TACTAATGGT

The first peak falls in the VD junction; the second peak confirms the D alignment computed by IgBlast

[Figure: the sequence with the VD insertion, D1, the D1D2 insertion, and D2 marked]

Open problem: Are double D segments common in healthy individuals?

194 of 202

Extension of a known D segment?

194

195 of 202

False positive reference D segment?

195

196 of 202

Triple D segment???

196

197 of 202

Modeling VDJ recombination

197

Probability of segment usage: p1, p2, p3, p4 for segments V1, V2, V3, V4

Probability of cleavage / palindromic insertion of length l, shown for V1: p(V1, –2), p(V1, –1), p(V1, 0), p(V1, 1), p(V1, 2), p(V1, 3), p(V1, 4) for l = –2, –1, 0, 1, 2, 3, 4

Probability of VD/DJ insertion: p(1), p(2), p(3), p(4), p(5) for insertions of 1, 2, 3, 4, 5 nt

Previous studies used an incomplete database of V, D, and J segments:

Murugan et al., PNAS, 2012; Elhanati et al., Phil Trans R Soc B, 2015; Ralph and Matsen, PLOS Comput Biol, 2016

198 of 202

Modeling VDJ recombination

198

[Figure: model parameters: probabilities of segment usage (p1, p2, p3, p4 for V1, V2, V3, V4), of cleavage / palindromic insertion of length l (p(V1, l) for l from –2 to 4), and of VD/DJ insertion (p(1) to p(5) for insertions of 1 to 5 nt)]

VDJ modeling problem. Given multiple antibody repertoires generated from a set of known and unknown V, D, and J segments, develop a more accurate statistical model of VDJ recombination
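The three parameter families of the model (segment usage, cleavage / palindromic insertion, VD/DJ insertion) can be illustrated by sampling the V side of a junction. This is a toy sketch under stated assumptions, not a fitted model: every segment sequence and probability below is invented, inserted nucleotides are drawn uniformly, and the sign convention (l > 0 means l nucleotides are cleaved, l < 0 means a palindromic insertion of |l| nucleotides) is an assumption the slides do not state.

```python
import random

# Hypothetical toy parameters; none of these numbers come from the slides.
V_SEGMENTS = {"V1": "GCGAGAGA", "V2": "GCGAAAGA"}
V_USAGE = {"V1": 0.7, "V2": 0.3}                       # segment usage p_i
V_END = {                                              # p(V_i, l)
    "V1": {-2: 0.05, -1: 0.10, 0: 0.25, 1: 0.30, 2: 0.20, 3: 0.10},
    "V2": {-1: 0.10, 0: 0.40, 1: 0.30, 2: 0.20},
}
INSERTION = {1: 0.3, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.1}   # VD insertion length p(n)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def process_end(segment, l):
    """Apply the end event: cleave l nt (l > 0) or add |l| palindromic nt (l < 0)."""
    if l > 0:
        return segment[:-l]
    if l < 0:
        # Palindromic nucleotides are the reverse complement of the segment end.
        return segment + segment[l:].translate(_COMPLEMENT)[::-1]
    return segment

def _draw(rng, distribution):
    """Draw one key of a {value: probability} dictionary."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def sample_v_side(rng):
    """Sample a V segment, its end event, and a random VD insertion."""
    name = _draw(rng, V_USAGE)
    l = _draw(rng, V_END[name])
    n = _draw(rng, INSERTION)
    insertion = "".join(rng.choice("ACGT") for _ in range(n))
    sequence = process_end(V_SEGMENTS[name], l) + insertion
    return {"segment": name, "l": l, "n": n, "sequence": sequence}

print(sample_v_side(random.Random(0)))
```

Fitting such a model means estimating these tables from repertoires, which is why an incomplete database of segments biases the estimates: reads from an unknown segment get explained as cleavage plus insertion of a known one.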

199 of 202

Finding novel V segments

199

[Figure: a germline V segment shown above rows marked with the symbols ✪ ❃ ❇]

Novel V segments:

Our analysis revealed:

  • > 100 segments with < 4 mismatches

  • 19 segments with 4-6 mismatches

  • 17 segments with 7-13 mismatches

  • 6 segments with 14-89 mismatches

Novel segments in rhesus macaques were discovered by

Corcoran et al., Nat Communications, 2017

200 of 202

Crossover causes genetic diversity

200

[Figure: crossover between a father chromosome and a mother chromosome produces two child chromosomes]

201 of 202

Unequal crossover may produce new segments

Length of Ig loci: ~1.25 Mbp

Frequency of crossing over: 1 point per 1 Mbp

201

202 of 202

Variations in Ig loci: alternative method for population analysis?

  • In contrast to protein-coding regions, the Ig loci may accumulate many mutations since they are not under direct evolutionary pressure

  • Unequal crossover may be responsible for Ig loci diversification

202

Ig loci reconstruction problem. Given error-prone reads representing antibodies and reads sampled from the genome, assemble the Ig loci