1 of 16

Non-canonical start codons and where to find them

Students:

Valeria Rubinova (BI)

Bogdan Sotnikov (KRSU, BI)

Supervisor:

Olga Bochkareva (ISTA)

2 of 16

Introduction

Most bacterial genes begin with the canonical ATG start codon, but some genes use the non-canonical start codons GTG or TTG instead. We set out to find which genes prefer non-canonical start codons.

Is there a relationship between the function of a gene and its start codon?

Or do bacteria from different ecological niches prefer different start codons?

3 of 16

Aims and objectives of the project:

Aims:

To study the occurrence of non-canonical start codons:

  • in genes with various functions
  • in different organisms

and to assess the reversibility of start-codon changes.

Objectives:

  1. Build a pipeline that runs from downloading genomes from GenBank to calculating statistics, assigning COG categories to genes, and visualizing trees
  2. Compare the distribution of start codons in genes of bacteria from different ecological niches
  3. Study the associations between gene function (COG category) and its start codons

4 of 16

Pipeline:

Fig. 1 Schematic overview of our workflow

5 of 16

GenBank parsing:

We downloaded whole-genome and scaffold assemblies of 34 bacterial species (generalists and specialists) from GenBank.

List of bacteria studied in the project:

Generalists:

Bacillus pumilus, Bacillus subtilis, Bacillus thuringiensis, Lactococcus lactis, Enterococcus faecalis, Agrobacterium tumefaciens, Klebsiella pneumoniae, Escherichia coli, Acinetobacter baumannii, Stenotrophomonas maltophilia, Pseudomonas aeruginosa, Pseudomonas fluorescens, Pseudomonas putida

Specialists:

Salinibacter ruber, Mycoplasma bovis, Corynebacterium pseudotuberculosis, Bifidobacterium bifidum, Staphylococcus haemolyticus, Staphylococcus simulans, Weissella cibaria, Lactobacillus salivarius, Chlamydia trachomatis, Burkholderia cenocepacia, Neisseria gonorrhoeae, Neisseria meningitidis, Histophilus somni, Vibrio anguillarum, Vibrio campbellii, Vibrio cholerae, Enterobacter hormaechei, Xanthomonas campestris, Xylella fastidiosa, Francisella tularensis, Leptospira interrogans

6 of 16

Assembly filtering:

Next, we divided all bacteria into 3 groups:

  1. bacteria with more than 100 whole-genome assemblies in NCBI;
  2. bacteria with fewer than 100 whole-genome assemblies in NCBI, but more than 100 whole-genome plus scaffold assemblies;
  3. bacteria with fewer than 100 whole-genome plus scaffold assemblies in NCBI.

The list of group 1 bacteria was passed to the PanACoTA pipeline (the PanACoTA prepare step) to filter out closely related assemblies using the Mash genetic distance.

Fig. 2 Group 1

Fig. 3 Group 3
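The three-way grouping by assembly counts can be sketched as a small function. This is only an illustration: the function name, argument names, and example counts are hypothetical. The slides do not say how a count of exactly 100 is handled, so the sketch assumes that "more than 100" is strict and that everything else falls into the next group.

```python
def assign_group(n_whole_genome, n_scaffold, threshold=100):
    """Assign a species to group 1, 2, or 3 by its number of NCBI assemblies.

    Assumption (not stated in the slides): a count of exactly `threshold`
    is treated as not exceeding it.
    """
    if n_whole_genome > threshold:
        return 1  # enough whole-genome assemblies on their own
    if n_whole_genome + n_scaffold > threshold:
        return 2  # enough only when scaffold assemblies are added
    return 3      # too few assemblies even with scaffolds


# Hypothetical counts, for illustration only.
example_counts = {
    "species A": (250, 10),
    "species B": (40, 90),
    "species C": (10, 20),
}
groups = {name: assign_group(*counts) for name, counts in example_counts.items()}
```

Only species that land in group 1 would then go on to the Mash-based filtering step.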

7 of 16

Re-annotation and construction of orthologous groups:

We re-annotated the sequences with Prokka to avoid batch effects in annotation.

We constructed orthologous groups with Proteinortho.

8 of 16

Sequence alignment:

We tried two aligners: PRANK and MUSCLE.

Complete results were obtained with MUSCLE.

Fig. 4 Fragment of a Vibrio campbellii alignment (screenshot from AliView)

9 of 16

Descriptive statistics:

Fig. 5 Proportion of Escherichia coli orthologous groups with each start codon across pangenome fractions

Fig. 6 E. coli gene frequency spectrum

10 of 16

Testing for differences between start codons within COGs:

The unit of observation in this case is an assembly with a given start codon in a given COG.

We used the Kruskal-Wallis test to detect differences among the three start-codon groups.

Post-hoc analysis was done using the Mann-Whitney U test with Bonferroni correction.

The frequency of each start codon in each COG was downsampled.

Table columns: COG, start codon, weighted frequency.
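The testing procedure above can be illustrated with a minimal standard-library sketch. It is not the project's code: the function names and the numbers in the usage lines are hypothetical. It computes the Kruskal-Wallis H statistic with tie correction for exactly three samples (one per start codon) and uses the chi-square approximation with 2 degrees of freedom, whose survival function is exp(-H/2). The pairwise Mann-Whitney p-values are assumed to come from a statistics library (for example `scipy.stats.mannwhitneyu`); only the Bonferroni adjustment is shown.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_three(a, b, c):
    """Return (H, p) for three samples (assumes not all values are identical)."""
    groups = [a, b, c]
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    p = math.exp(-h / 2)  # chi-square survival function for 2 degrees of freedom
    return h, p


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical weighted frequencies for ATG, GTG and TTG in one COG.
h_stat, p_value = kruskal_wallis_three([1, 2, 3], [4, 5, 6], [7, 8, 9])
# Hypothetical raw p-values for the three pairwise post-hoc comparisons.
adjusted = bonferroni([0.01, 0.04, 0.5])
```

With three start codons there are three pairwise comparisons, so each raw post-hoc p-value is multiplied by 3.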

11 of 16

Results:

Distribution of start codons

We tested three hypotheses, namely that the distribution of start codons depends on:

  1. the size of the genome
  2. phylogeny
  3. lifestyle (specialist or generalist)

Hypotheses 1 and 3 were not confirmed: the representation of start codons correlated neither with genome size nor with whether the bacterium is a generalist or a specialist.

Fig. 7 Distribution of the proportion of non-canonical start codons (SCs)

In the figure, bacteria are grouped by taxon. Among the bacteria considered in this work, representatives of Pseudomonadota had a lower percentage of genes with non-canonical start codons than representatives of Bacillota.

12 of 16

Distribution of start codons in genes with different COG categories:

Within each organism, we analyzed how genes are distributed across COG categories: we took all genes with a given start codon (ATG, GTG, or TTG) and calculated the relative percentage of them that belongs to each COG category.

Fig. 8 Proportion of Vibrio campbellii start codons (SCs) by COG function

The y-axis shows the proportion of genes with a given start codon that have a given function. For example, 5% of all genes with the GTG start codon are responsible for cell wall biogenesis.
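The relative percentage described here can be computed as in the following minimal sketch. The function name, the input format (pairs of start codon and COG category), and the toy gene list are assumptions made for illustration, not the project's actual data structures.

```python
from collections import Counter, defaultdict


def cog_share_by_start_codon(genes):
    """For each start codon, return the percentage of its genes in each COG category.

    `genes` is an iterable of (start_codon, cog_category) pairs for one organism.
    """
    counts = defaultdict(Counter)
    for codon, cog in genes:
        counts[codon][cog] += 1

    shares = {}
    for codon, cog_counts in counts.items():
        total = sum(cog_counts.values())
        shares[codon] = {cog: 100 * n / total for cog, n in cog_counts.items()}
    return shares


# Toy input, for illustration only (COG letters used as category labels).
toy_genes = [
    ("ATG", "E"), ("ATG", "E"), ("ATG", "M"), ("ATG", "C"),
    ("GTG", "M"), ("GTG", "J"),
]
shares = cog_share_by_start_codon(toy_genes)
```

Because each start codon is normalized by its own gene count, the percentages for one start codon sum to 100, which makes ATG, GTG, and TTG comparable despite their very different abundances.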

13 of 16

Distribution of start codons in genes with different COG categories:

In genes responsible for:

  • amino acid metabolism
  • carbohydrate metabolism
  • energy production and conversion

the relative percentage of the canonical ATG start codon is greater than that of non-canonical ones.

In genes responsible for:

  • the central dogma of molecular biology (replication, transcription, translation)
  • the cell cycle
  • motility
  • the cell wall
  • intracellular transport
  • vesicle formation

the relative percentage of the non-canonical start codons GTG or TTG is greater than that of the canonical one.

14 of 16

Future plans:

  1. Quality control at different stages of the pipeline: genome assembly, annotation, building orthologous groups, filtering low-quality assemblies, etc.
  2. Identification of individual COGs and genes that prefer non-canonical start codons for further experimental verification
  3. Study of the distribution of non-canonical start codons in a larger number of bacteria from different ecological niches

15 of 16

VL wrote a function for filtering “bad” alignments

VL filtered alignments using the PanACoTA pipeline

BS worked on the statistical analysis

The authors contributed equally to all other objectives

Contacts:

GitHub:

Media:

Valeria Rubinova: Telegram, e-mail

Bogdan Sotnikov: Telegram, e-mail

16 of 16

Acknowledgements:

  • Daria Nikolaeva: list of generalists and specialists with data about their ecological niches
  • Lavrentii Danilov: help with finding a formal criterion for evaluating differences in start-codon preference between COGs