1 of 53

Bacteriocin detection with distributed biological sequence representation

Md Nafiz Hamid, Iddo Friedberg

Bioinformatics and Computational Biology program

Department of Veterinary Microbiology and Preventive Medicine

2 of 53

Antibiotic Resistance

“A problem so serious that it threatens the achievements of modern medicine... A post-antibiotic era, in which common infections and minor injuries can kill, far from being an apocalyptic fantasy, is instead a very real possibility for the 21st century.”

-- WHO (2014)

3 of 53

Antibiotic Production history

4 of 53

Combating Antibiotic Resistance

  • Harvest more natural compounds
  • Narrow spectrum

5 of 53

Bacteriocins

  • Genetically encoded antimicrobial peptides
  • Narrow killing spectrum - ‘designer drug’
  • Truly diverse - more potential

Figure: Phylogenetic breadth of bacteriocin killing.

KO, Klebsiella oxytoca; KP, Klebsiella pneumoniae; EB, Enterobacter cloacae; CF, Citrobacter freundii; EC, E. coli; SM, Serratia marcescens; HA, Hafnia alvei; VC, Vibrio cholerae

Image from Riley et al., Annual Review of Microbiology, 2002

6 of 53

Bacteriocin Gene Clusters

Figure: Examples of bacteriocin gene clusters

Image from Fields et al., Biochemical Pharmacology, 2016

Context genes

7 of 53

Figure: bacteriocin gene cluster with genes A, B, T, C, I, P, R, K, F, E, G.

Legend (gene roles): Bacteriocin, Immunity, Transporter, Modifier, Regulator

9 of 53

Figure: gene cluster (genes A, B, T, C, I, P, R, K, F, E, G) with NisB, NisC, NisT, and NisP labeled.

Legend (gene roles): Bacteriocin, Immunity, Transporter, Modifier, Regulator

13 of 53

Figure: gene cluster (genes A, B, T, C, I, P, R, K, F, E, G) with NisB, NisC, NisT, NisP, NisF, NisE, NisG, and NisI labeled.

Legend (gene roles): Bacteriocin, Immunity, Transporter, Modifier, Regulator

14 of 53

Figure: gene cluster (genes A, B, T, C, I, P, R, K, F, E, G) with NisB, NisC, NisT, NisP, NisF, NisE, NisG, NisI, NisK, and NisR labeled.

Legend (gene roles): Bacteriocin, Immunity, Transporter, Modifier, Regulator

16 of 53

Computational work on detecting Bacteriocins

  • BAGEL [1]: Database. Also a bacteriocin-finding tool that mines genomes with homology-based methods.
  • Bactibase [2]: Database. Also offers limited identification facilities.
  • BOA [3]: Uses context genes to find candidate regions for bacteriocins and applies profile hidden Markov models.

Figure labels: Multiple sequence alignment; Hidden Markov model

[1] van Heel et al. (2013); [2] Hammami et al. (2007); [3] Morton et al. (2015)

17 of 53

How BOA works

Jamie Morton

20 of 53

Our work - Motivation

  • Identification of novel bacteriocins from only primary protein sequences using machine learning classification.

21 of 53

How to represent the protein sequences?

  • Distributed representation for each word (in our case - each 3-gram)

Distributed representation: MKK = [-0.1211, -0.0381, -0.0228, -0.0833, ..., -0.0945]

Figure: 3-grams of the sequence MKKAVIVENKGCATCSIGAACGLFGLWG

22 of 53

word2vec in Natural Language Processing (NLP)

You shall know a word by the company it keeps

-- J.R. Firth

  • Goal: Represent each word as a dense vector learned by training a neural network. Words that appear in similar ‘contexts’ get similar vectors.
  • Example: ‘Sophisticated’ and ‘Complicated’. ‘Android’ and ‘iOS’.

Unsupervised learning

Especially helpful when labeled data are scarce

23 of 53

Cool word2vec effects

Company-CEO

Image from lecture slides of Socher et al.

24 of 53

Cool word2vec effects

Analogies

Image from lecture slides of Socher et al.

25 of 53

Applications in Natural Language Processing (NLP)

  • Sentiment classification:
    • Positive example: This movie is subtle yet imposing, short yet not hurried.
    • Negative example: This movie is short yet taxing, neither subtle nor imposing.

Words: This, movie, subtle, yet, imposing, short, hurried, taxing, neither, nor

26 of 53

How does training happen? - Training samples for the neural network

Corpus: Bacteriocins are much fun and important.

Context window: 2

Input          Output
Bacteriocins   are
Bacteriocins   much
are            Bacteriocins
are            much
are            fun
much           Bacteriocins
much           are
much           fun
much           and
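
The (input, output) training samples for a context window of 2 can be generated mechanically. The following is a minimal sketch of that step, not the word2vec implementation itself; the function name `skipgram_pairs` is a hypothetical helper introduced here for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


corpus = "Bacteriocins are much fun and important".split()
pairs = skipgram_pairs(corpus, window=2)

# Print the samples for the first three input words of the toy corpus.
for inp, out in pairs[:9]:
    print(inp, out)
```

Words near the ends of the corpus have fewer neighbors, so they contribute fewer training samples than words in the middle.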

27 of 53

How does training happen? - Neural network

Vocabulary (6 words): Bacteriocins, are, much, fun, and, important

For training sample (fun, bacteriocins), each word is a one-hot vector of length 6:

Input: fun = [0, 0, 0, 1, 0, 0]
Output: bacteriocins = [1, 0, 0, 0, 0, 0]

Hidden layer weight matrix: its rows are the dense word vectors (the thing we want)
Output layer weight matrix
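
To make the role of the two weight matrices concrete, here is a toy forward pass for the training sample (fun, Bacteriocins). It is a sketch under stated assumptions: the embedding size of 3, the random initial weights, and the plain softmax output are illustrative choices, not details taken from the talk, and no weight update is performed.

```python
import math
import random

vocab = ["Bacteriocins", "are", "much", "fun", "and", "important"]
index = {word: i for i, word in enumerate(vocab)}
dim = 3  # toy embedding size, chosen for readability

random.seed(0)
# Hidden layer weights: one row (the dense word vector) per vocabulary word.
W_hidden = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
# Output layer weights: one column of scores per vocabulary word.
W_output = [[random.uniform(-0.5, 0.5) for _ in vocab] for _ in range(dim)]


def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def forward(word):
    x = one_hot(word)
    # Multiplying a one-hot vector by W_hidden simply selects the row for `word`.
    hidden = [sum(x[i] * W_hidden[i][d] for i in range(len(vocab)))
              for d in range(dim)]
    scores = [sum(hidden[d] * W_output[d][j] for d in range(dim))
              for j in range(len(vocab))]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return hidden, probs


hidden, probs = forward("fun")
# Cross-entropy loss against the one-hot target for "Bacteriocins".
loss = -math.log(probs[index["Bacteriocins"]])
print(one_hot("fun"), hidden, loss)
```

Training adjusts both matrices to lower this loss over all samples; afterwards the rows of the hidden layer matrix are kept as the word vectors.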

28 of 53

From NLP to biology

Asgari et al. (2015) showed the effectiveness of word2vec for protein family classification.

Figure: t-SNE plot of all 3-grams (words) by their biochemical and biophysical properties.

Asgari et al., PLOS ONE, 2015

29 of 53

Our word2vec training - UniProt TrEMBL

  • Corpus: UniProt TrEMBL database - 55,899,422 sequences
  • Vocabulary: All 3-grams (20 * 20 * 20 = 8000 words)
  • Generating 3 sequences of words from each sequence in the database, one per starting offset

Example: MKLIGPTFTNTST

1st sequence of words (starts at residue 1): MKL, IGP, TFT, NTS
2nd sequence of words (starts at residue 2): KLI, GPT, FTN, TST
3rd sequence of words (starts at residue 3): LIG, PTF, TNT
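
The three word sequences can be produced by reading non-overlapping 3-grams from each of the three starting offsets. A minimal sketch follows; `shifted_kmer_sequences` is a hypothetical helper name, and dropping trailing residues that do not fill a complete 3-gram is inferred from the example on the slide.

```python
def shifted_kmer_sequences(seq, k=3):
    """Split `seq` into k lists of non-overlapping k-mers, one per starting offset."""
    result = []
    for offset in range(k):
        # Step by k so consecutive words do not overlap; stop before a partial k-mer.
        words = [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        result.append(words)
    return result


for n, words in enumerate(shifted_kmer_sequences("MKLIGPTFTNTST"), start=1):
    print(n, words)
```

Each of the three lists is then treated as one "sentence" of the training corpus.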

30 of 53

Our word2vec training - UniProt TrEMBL

  • At the end of training, each 3-gram has a dense vector of size 100.

MKL = [-0.1211, -0.0381, -0.0228, -0.0833, ..., -0.0945, -0.0145]

31 of 53

Machine learning classification - Training

  • 346 positive bacteriocins from the BAGEL dataset.
  • Each sequence is represented by a size-100 dense vector: the sum of the size-100 dense vectors of all its overlapping 3-grams.

MKLIGPTFTNTST = MKL + KLI + LIG + IGP + ... + TST

  [-0.1211, -0.0381, -0.0228, -0.0833, ..., -0.0945, -0.0145]
+ [ 0.0784,  0.0262,  0.0529, -0.1338, ..., -0.0988, -0.0381]
+ [-0.1222, -0.0069, -0.2823, -0.2173, ..., -0.3728, -0.1597]
+ [ 0.3240, -0.0407, -0.3316,  0.1062, ..., -0.0260, -0.0282]
+ ...
+ [ 0.2565, -0.0558,  0.0221,  0.0971, ..., -0.1413, -0.1344]
= [ 1.0211,  3.6549, -0.3279, -0.9871, ...,  5.7712, -3.7610]
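
The sequence representation described above, a sum over all overlapping 3-grams, can be sketched as follows. The 2-dimensional toy embeddings are made-up values for illustration (the talk uses size-100 vectors), and skipping 3-grams that have no trained vector is an assumption, since the slides do not say how such cases are handled.

```python
def overlapping_kmers(seq, k=3):
    """All k-mers of `seq`, advancing one residue at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_vector(seq, embeddings, dim, k=3):
    """Sum the embedding vectors of every overlapping k-mer in `seq`."""
    total = [0.0] * dim
    for kmer in overlapping_kmers(seq, k):
        vec = embeddings.get(kmer)
        if vec is None:  # assumption: ignore k-mers without a trained vector
            continue
        for d in range(dim):
            total[d] += vec[d]
    return total


# Toy embeddings with invented values, only to show the arithmetic.
toy = {"MKL": [1.0, 0.5], "KLI": [-0.5, 0.25], "LIG": [0.25, -1.0]}

print(overlapping_kmers("MKLIGPTFTNTST"))
print(sequence_vector("MKLIG", toy, dim=2))
```

Because the vectors are summed rather than averaged, longer sequences tend to produce larger-magnitude representations.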

32 of 53

Training - building a negative set

  • Use the Swiss-Prot database (manually reviewed).
  • Run CD-HIT to reduce redundancy.
  • Select 346 sequences with the same length distribution as the 346 positive bacteriocins.
  • Exclude sequences annotated as antimicrobial, antibiotic, or located on a plasmid.
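
The slides do not say how the length distribution was matched, so the following is only one possible approach: for each positive sequence, draw an unused candidate of the same length, falling back to the nearest available length. The function name `sample_length_matched` and the fallback rule are assumptions made for this sketch; it also presumes the candidates have already been filtered and de-duplicated.

```python
import random
from collections import defaultdict


def sample_length_matched(positives, candidates, seed=0):
    """Pick one unused candidate per positive, matching its length as closely as possible."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for seq in candidates:
        by_length[len(seq)].append(seq)
    for pool in by_length.values():
        rng.shuffle(pool)  # random choice within each length bin

    negatives = []
    for pos in positives:
        available = [n for n, pool in by_length.items() if pool]
        if not available:
            break  # ran out of candidates
        # Nearest length wins; ties go to the shorter length.
        nearest = min(available, key=lambda n: (abs(n - len(pos)), n))
        negatives.append(by_length[nearest].pop())
    return negatives


positives = ["AAAA", "CCCCCC"]
candidates = ["GGGG", "TTTTTT", "AAAAAAAAAA"]
print(sample_length_matched(positives, candidates))
```

Matching lengths keeps a classifier from separating the two classes on sequence length alone.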

33 of 53

Result on the whole dataset

Nested 10-fold cross-validation on 692 samples

Methods                   Avg. Precision   Avg. Recall   Avg. F1
Support vector machines   0.844            0.843         0.842
Logistic Regression       0.845            0.831         0.837
Decision Tree             0.759            0.759         0.757
Random Forest             0.811            0.820         0.813

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)
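
As a worked example of the three formulas, the sketch below computes them from binary labels (1 = bacteriocin, 0 = not). The label vectors are invented for illustration, and the averaging across folds or classes reported in the table is not shown.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```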

34 of 53

ROC curves - SVM

35 of 53

Applying the trained model to GenBank protein data

  • 11 high-probability putative bacteriocins from Lactobacillus species
  • Collaborators are working on experimental validation

36 of 53

Potential new bacteriocins found!!!

Figure: Lactobacillus acidophilus La-14 - putative bacteriocin with neighboring genes labeled Regulator, Transporter, Regulator

37 of 53

Figure: Lactobacillus acidophilus 30SC - putative bacteriocin with neighboring genes labeled Regulator, Regulator, Transporter

38 of 53

Figure: Lactobacillus acidophilus NCFM - putative bacteriocin with neighboring genes labeled Transporter, Immunity, Transporter, Transporter

39 of 53

Coming up

  • Paper
  • Software for the community

40 of 53

Thanks!!

Iddo Friedberg

Stefan Freed

Shaun Lee

Follow me

@naf1zh

41 of 53

Questions!!!

43 of 53

Lactobacillus helveticus CNRZ32 (Locus: NC_021744, protein id: YP_008235573.1)

Putative bacteriocin

Lactobacillus acidophilus NCFM (Locus: NC_006814, protein id: YP_193080.1)

Putative bacteriocin

Lactobacillus acidophilus NCFM (Locus: NC_006814, protein id: YP_193019.1)

Putative bacteriocin

46 of 53

Legend (gene roles): Bacteriocin, Transporter, Immunity, Regulator, Modifier, Other

47 of 53

Corpus: MKLIGPTFTNTSTLSNSV

Words (3-grams): MKL, IGP, TFT, NTS, TLS, NSV

Context window: 2

Input   Output
MKL     IGP
MKL     TFT
IGP     MKL
IGP     TFT
IGP     NTS
TFT     MKL
TFT     IGP
TFT     NTS
TFT     TLS

48 of 53

Corpus: MKL IGP TFT NTS TLS NSV (6 3-grams)

Example training sample: (MKL, IGP), each word a one-hot vector of length 6:

Input: MKL = [1, 0, 0, 0, 0, 0]
Output: IGP = [0, 1, 0, 0, 0, 0]

Hidden layer weight matrix (yields the dense word vector)
Output layer weight matrix

53 of 53

                          word2vec                                       k-mer
Methods                   Avg. Precision   Avg. Recall   Avg. F1   Avg. Precision   Avg. Recall   Avg. F1
Support vector machines   0.877            0.863         0.869     0.875            0.808         0.838
Logistic Regression       0.850            0.823         0.834     0.869            0.839         0.850
Decision Tree             0.731            0.747         0.736     0.755            0.719         0.733
Random Forest             0.820            0.813         0.814     0.833            0.773         0.799

Nested 10-fold cross-validation on 692 samples, repeated 50 times