Bacteriocin detection with distributed biological sequence representation
Md Nafiz Hamid, Iddo Friedberg
Bioinformatics and Computational Biology program
Department of Veterinary Microbiology and Preventive Medicine
1
2
Antibiotic Resistance
“A problem so serious that it threatens the achievements
of modern medicine... A post-antibiotic era, in which common infections and
minor injuries can kill, far from being an apocalyptic fantasy,
is instead a very real possibility for the 21st century.”
--WHO (2014)
3
Antibiotic Production history
4
Combating Antibiotic Resistance
5
Bacteriocins
Figure: Phylogenetic breadth of bacteriocin killing.
KO, Klebsiella oxytoca; KP, Klebsiella pneumoniae; EB, Enterobacter cloacae; CF, Citrobacter freundii; EC, E. coli; SM, Serratia marcescens; HA, Hafnia alvei; VC, Vibrio cholerae
Image from Riley et al., Annual Review of Microbiology, 2002
Bacteriocin Gene Clusters
6
Figure: Examples of bacteriocin gene clusters
Image from Fields et al., Biochemical Pharmacology, 2016
Context genes
Figure: gene cluster A, B, T, C, I, P, R, K, F, E, G, with context genes highlighted step by step (slides 7-15)
Gene products labeled in the build: NisB, NisC, NisT, NisP, NisF, NisE, NisG, NisI, NisK, NisR
Legend: Bacteriocin, Immunity, Transporter, Modifier, Regulator
16
Computational work on detecting Bacteriocins
1: Heel et al. (2013); 2: Hammami et al. (2007); 3: Morton et al. (2015)
Multiple sequence alignment
Hidden Markov model
17
How BOA works
Jamie Morton
18
How BOA works
19
?
How BOA works
20
Our work - Motivation
?
21
How to represent the protein sequences?
MKK = [-0.1211, -0.0381, -0.0228, -0.0833, ..., -0.0945]
Distributed representation
Figure: 3-grams
MKKAVIVENKGCATCSIGAACGLFGLWG
22
word2vec in Natural Language Processing (NLP)
“You shall know a word by the company it keeps”
-- J.R. Firth
Unsupervised Learning
Especially helpful when labeled data is scarce
23
Cool word2vec effects
Company-CEO
Image from lecture slides of Socher et al.
24
Cool word2vec effects
Analogies
Image from lecture slides of Socher et al.
Applications in Natural Language Processing (NLP)
25
Words: This, movie, subtle, yet, imposing, short, hurried, taxing, neither, nor
26
How does training happen? - Training samples for the neural network
Corpus: Bacteriocins are much fun and important.
Context window: 2
Input | Output |
Bacteriocins | are |
Bacteriocins | much |
are | Bacteriocins |
are | much |
are | fun |
much | Bacteriocins |
much | are |
much | fun |
much | and |
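The input/output pairs on this slide can be generated mechanically. The sketch below is a minimal illustration, not the word2vec implementation itself; the function name `skipgram_pairs` is hypothetical, and it assumes whitespace tokenization with the trailing period dropped.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input, output) pairs: each token paired with every
    neighbour at most `window` positions to its left or right."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs


# Toy corpus from the slide (assumption: split on whitespace).
corpus = "Bacteriocins are much fun and important".split()
pairs = skipgram_pairs(corpus, window=2)

# The first nine pairs correspond to the inputs "Bacteriocins", "are", "much".
for inp, out in pairs[:9]:
    print(inp, out)
```

Each pair becomes one training sample for the network: the first word is the input and the second is the word to predict.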
27
How does training happen? - Neural network
Vocabulary: Bacteriocins, are, much, fun, and, important (6 words)
Training sample: (fun, bacteriocins)
Input, one-hot vector of length 6 for "fun": [0, 0, 0, 1, 0, 0]
Output, one-hot vector of length 6 for "bacteriocins": [1, 0, 0, 0, 0, 0]
Hidden layer weight matrix
Output layer weight matrix
Dense word vector
The thing we want
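Why the hidden layer weight matrix yields the dense word vector: multiplying a one-hot input by that matrix simply selects one of its rows. The sketch below shows this with plain Python lists. The embedding size `dim = 4` and the random weights are assumptions for illustration only; in real training the weights are learned.

```python
import random

vocab = ["Bacteriocins", "are", "much", "fun", "and", "important"]
dim = 4  # hypothetical embedding size, chosen only for this example

random.seed(0)
# Hidden layer weight matrix: one row per vocabulary word, one column per dimension.
# Random values stand in for trained weights.
hidden = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]


def one_hot(word):
    """One-hot vector of length len(vocab) with a 1 at the word's index."""
    return [1 if w == word else 0 for w in vocab]


def row_times_matrix(x, matrix):
    """Multiply a row vector x (length V) by a V x D matrix; returns length D."""
    n_cols = len(matrix[0])
    return [sum(x[i] * matrix[i][d] for i in range(len(x))) for d in range(n_cols)]


x = one_hot("fun")
dense = row_times_matrix(x, hidden)

print(x)                   # one-hot input for "fun"
print(dense == hidden[3])  # the product is exactly row 3 of the hidden matrix
```

After training, the rows of the hidden layer weight matrix are kept as the word vectors.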
From NLP to biology
28
Asgari et al. (2015) showed the effectiveness of word2vec in protein family classification.
Asgari et al., PLOS ONE, 2015
Figure: t-SNE plot of all 3-grams (words) by their biochemical and biophysical properties.
Our word2vec training - UniProt TrEMBL
29
MKLIGPTFTNTST
3-gram
1st sequence of words: MKL, IGP, TFT, NTS
2nd sequence of words: KLI, GPT, FTN, TST
3rd sequence of words: LIG, PTF, TNT
1st sequence start
2nd sequence start
3rd sequence start
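The three shifted "sentences" of 3-grams shown above can be produced as follows. This is a minimal sketch; the function name `shifted_kmer_sentences` is hypothetical, and leftover residues that do not fill a complete 3-gram are dropped, as on the slide.

```python
def shifted_kmer_sentences(seq, k=3):
    """Split a sequence into k lists of non-overlapping k-mers,
    one list per starting offset 0, 1, ..., k-1."""
    sentences = []
    for offset in range(k):
        words = [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        sentences.append(words)
    return sentences


sentences = shifted_kmer_sentences("MKLIGPTFTNTST")
for n, words in enumerate(sentences, start=1):
    print(n, words)
```

Each list is then treated like a sentence of words when training word2vec.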
Our word2vec training - UniProt TrEMBL
30
MKL = [-0.1211, -0.0381, -0.0228, -0.0833, ..., -0.0945, -0.0145]
Machine learning classification - Training
31
MKLIGPTFTNTST
MKL + KLI + LIG + IGP + ... + TST
  [-0.1211, -0.0381, -0.0228, -0.0833, ..., -0.0945, -0.0145]
+ [0.0784, 0.0262, 0.0529, -0.1338, ..., -0.0988, -0.0381]
+ [-0.1222, -0.0069, -0.2823, -0.2173, ..., -0.3728, -0.1597]
+ [0.3240, -0.0407, -0.3316, 0.1062, ..., -0.0260, -0.0282]
+ [0.2565, -0.0558, 0.0221, 0.0971, ..., -0.1413, -0.1344]
+ ...
= [1.0211, 3.6549, -0.3279, -0.9871, ..., 5.7712, -3.7610]
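Building a sequence vector by summing the vectors of its overlapping 3-grams can be sketched as below. The embedding table here is a toy stand-in with made-up values, not the trained word2vec model, and skipping 3-grams missing from the table is an assumption of this sketch.

```python
def overlapping_kmers(seq, k=3):
    """All overlapping k-mers of a sequence, in order: MKL, KLI, LIG, ..."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_vector(seq, embedding, k=3):
    """Element-wise sum of the embedding vectors of all overlapping k-mers."""
    dim = len(next(iter(embedding.values())))
    total = [0.0] * dim
    for kmer in overlapping_kmers(seq, k):
        vec = embedding.get(kmer)
        if vec is None:  # assumption: k-mers without a vector are skipped
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


seq = "MKLIGPTFTNTST"
kmers = overlapping_kmers(seq)

# Toy embedding (hypothetical values): every 3-gram maps to the same 2-d vector,
# so the sum is easy to check by hand.
toy_embedding = {kmer: [1.0, 0.5] for kmer in kmers}

print(kmers)
print(sequence_vector(seq, toy_embedding))
```

The resulting fixed-length vector is what a classifier receives as input, regardless of the length of the protein.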
Training - building a negative set
32
Result on the whole dataset
33
Nested 10-fold cross-validation on 692 samples
Methods | Avg. Precision | Avg. Recall | Avg. F1 |
Support vector machines | 0.844 | 0.843 | 0.842 |
Logistic Regression | 0.845 | 0.831 | 0.837 |
Decision Tree | 0.759 | 0.759 | 0.757 |
Random Forest | 0.811 | 0.820 | 0.813 |
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
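As a quick check of the three definitions, the sketch below computes them from raw counts. The counts in the usage lines are invented for illustration and are not results from this work; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positive,
    false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))
```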
ROC curves - SVM
34
Applying the trained model to GenBank protein data
35
Potential new bacteriocins found!!!
36
Lactobacillus acidophilus La-14
Putative bacteriocin
Regulator
Transporter
Regulator
Potential new bacteriocins found!!!
37
Lactobacillus acidophilus 30SC
Putative bacteriocin
Regulator
Regulator
Transporter
Potential new bacteriocins found!!!
38
Lactobacillus acidophilus NCFM
Putative bacteriocin
Transporter
Immunity
Transporter
Transporter
Coming up
39
Thanks!!
40
Iddo Friedberg
Stefan Freed
Shaun Lee
Follow me
@naf1zh
Questions!!!
41
43
Lactobacillus helveticus CNRZ32 (Locus: NC_021744, protein id: YP_008235573.1)
Putative bacteriocin
Lactobacillus acidophilus NCFM (Locus: NC_006814, protein id: YP_193080.1)
Putative bacteriocin
Lactobacillus acidophilus NCFM (Locus: NC_006814, protein id: YP_193019.1)
Putative bacteriocin
46
Bacteriocin
Transporter
Immunity
Regulator
Modifier
Other
Corpus: MKLIGPTFTNTSTLSNSV
3-grams: MKL | IGP | TFT | NTS | TLS | NSV |
Context window: 2
Input | Output |
MKL | IGP |
MKL | TFT |
IGP | MKL |
IGP | TFT |
IGP | NTS |
TFT | MKL |
TFT | IGP |
TFT | NTS |
TFT | TLS |
Corpus: MKL IGP TFT NTS TLS NSV (6 3-grams)
Example training sample: (MKL, IGP)
Input, one-hot vector of length 6 for MKL: [1, 0, 0, 0, 0, 0]
Output, one-hot vector of length 6 for IGP: [0, 1, 0, 0, 0, 0]
Hidden layer weight matrix
Output layer weight matrix
Dense word vector
53
Methods | word2vec Avg. Precision | word2vec Avg. Recall | word2vec Avg. F1 | k-mer Avg. Precision | k-mer Avg. Recall | k-mer Avg. F1 |
Support vector machines | 0.877 | 0.863 | 0.869 | 0.875 | 0.808 | 0.838 |
Logistic Regression | 0.850 | 0.823 | 0.834 | 0.869 | 0.839 | 0.850 |
Decision Tree | 0.731 | 0.747 | 0.736 | 0.755 | 0.719 | 0.733 |
Random Forest | 0.820 | 0.813 | 0.814 | 0.833 | 0.773 | 0.799 |
Nested 10-fold cross-validation on 692 samples, repeated 50 times
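For readers unfamiliar with nested cross-validation, the sketch below shows only how the index splits are organized: an outer 10-fold loop holds out a test fold for performance estimation, and an inner 10-fold loop on the remaining samples is typically used to tune hyperparameters. The function names and the seeding scheme are assumptions of this sketch; the slides do not state how the splits were actually implemented, and no model is trained here.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=10, inner_k=10, seed=0):
    """Yield (train_idx, test_idx, inner_splits) for each outer fold.
    inner_splits is a list of (fit_idx, val_idx) pairs drawn from train_idx."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for o, test_idx in enumerate(outer_folds):
        train_idx = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        inner_folds = k_fold_indices(train_idx, inner_k, rng)
        inner_splits = []
        for v, val_idx in enumerate(inner_folds):
            fit_idx = [i for f, fold in enumerate(inner_folds) if f != v for i in fold]
            inner_splits.append((fit_idx, val_idx))
        yield train_idx, test_idx, inner_splits


# One repetition on 692 samples; repeating 50 times could use seeds 0..49.
splits = list(nested_cv_splits(692, seed=0))
for train_idx, test_idx, inner_splits in splits[:2]:
    print(len(train_idx), len(test_idx), len(inner_splits))
```

Hyperparameters would be chosen using only the inner splits, so the outer test fold never influences model selection.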