1 of 19

SNP grouping and evaluation of epistatic interactions for phenotypic traits in agricultural plants

Anastasia Zolotar

Ann Balan

Sergei Volkov

Supervisors: Elena Grigoreva, NOVA PLANT; Lavrentiy Danilov, NOVA PLANT

2 of 19

Background

By analyzing the interaction graph between SNPs and genes, it is possible to identify the genes with the greatest number of links and to prioritize them for modification. Modifying several significant genes at once makes it possible to change complex traits.

Such data can also help to identify the most promising lines and varieties. If an organism carries a large number of genes from such a graph, the phenotypic trait is likely to be more pronounced or more likely to appear. This approach will be of great use to breeders.
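The idea of picking the most connected genes can be illustrated with a minimal sketch. It assumes the interaction graph is available as a list of gene pairs; the gene names and the function `rank_genes_by_degree` are hypothetical and not part of the original work.

```python
def rank_genes_by_degree(edges):
    """Return (gene, degree) pairs, most connected genes first.

    `edges` is an iterable of (gene_a, gene_b) pairs; the degree of a gene
    is the number of distinct partner genes it is linked to.
    """
    neighbours = {}
    for gene_a, gene_b in edges:
        if gene_a == gene_b:
            continue  # ignore self-links
        neighbours.setdefault(gene_a, set()).add(gene_b)
        neighbours.setdefault(gene_b, set()).add(gene_a)
    # Sort by descending degree, then by name for a stable order.
    return sorted(
        ((gene, len(partners)) for gene, partners in neighbours.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical toy graph: geneA is linked to three other genes.
edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]
ranking = rank_genes_by_degree(edges)
```

In this toy graph `geneA` comes first in `ranking`, so it would be the first candidate for modification under the reasoning above.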

3 of 19

The aim and objectives

Aim: To find approaches and tools for SNP grouping that are applicable to genomic variant analysis of agricultural plants, and to apply them to soybean data.

Objectives:

  1. Conduct a literature analysis and choose the best tools and algorithms for our goal.
  2. Preprocess our data and apply the tools.
  3. Analyze the output, annotate each SNP, and map the SNPs to genes.
  4. Build the graphs and analyze gene-gene interactions, as well as interactions between metabolic pathways.

4 of 19

Datasets

| Dataset properties | Free-access | Commercial |
|---|---|---|
| # SNPs | 23278 | 37676 |
| # samples | 248 | 97 |
| Phenotype | a/a concentrations | some complex trait |

We ran all tool tests in our study on two datasets for Glycine max (soybean):

  1. an open-access dataset with one phenotypic trait, the alanine content of the soybeans;
  2. a commercial dataset that Nova Plant is currently working on, with a complex quantitative phenotypic trait (a Nova Plant trade secret).

5 of 19

Distribution of phenotypic traits

[Figure: distributions of the two phenotypic traits. Panels: alanine content; commercial trait.]

6 of 19

The tools

In our review, we found information on more than 22 tools for this task and identified those that seemed most relevant.

The most convenient (in terms of analysis potential, amount of data produced, and successful launch) were MIDESP, AntEpiSeeker, Martini, and SHEsisPlus.

The full table can be seen here

7 of 19

AntEpiSeeker

Ant colony optimization (ACO) algorithms, first proposed by Dorigo and Gambardella, are tools for solving difficult optimization problems. In ACO, artificial ants work as parallel units that communicate through a probability distribution function (PDF), which is updated by weights. As the PDF is updated, "paths" that perform better are sampled at higher rates by subsequent artificial ants and in turn receive more weight.

AntEpiSeeker was developed to search for epistatic interactions in large-scale association studies. The main algorithm consists of two stages:

  1. a search for SNP sets of sufficient size using ACO;
  2. an exhaustive search for epistatic interactions within the highly suspected SNP sets and within the reduced set of SNPs with top-ranking weight levels.
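The first stage can be illustrated with a toy sketch of ACO-style sampling. This is not AntEpiSeeker's implementation: the function `aco_sample_sets`, its parameters, and the scoring function are assumptions made for illustration, and the score only needs to be non-negative (a real analysis would use an association statistic).

```python
import random


def aco_sample_sets(n_snps, set_size, n_ants, n_iters, score,
                    evaporation=0.1, seed=0):
    """Toy ACO loop: ants sample SNP sets in proportion to pheromone weights."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_snps
    best_set, best_score = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            # Draw `set_size` distinct SNPs, each with probability
            # proportional to its current pheromone weight.
            candidates = list(range(n_snps))
            weights = pheromone[:]
            chosen = []
            for _ in range(set_size):
                idx = rng.choices(range(len(candidates)), weights=weights)[0]
                chosen.append(candidates.pop(idx))
                weights.pop(idx)
            value = score(chosen)
            # Better-scoring sets deposit more weight on their SNPs.
            for snp in chosen:
                pheromone[snp] += value
            if value > best_score:
                best_set, best_score = sorted(chosen), value
        # Evaporation keeps old, unreinforced weights from dominating.
        pheromone = [(1.0 - evaporation) * p for p in pheromone]
    return pheromone, best_set


# Hypothetical score: only the set containing SNPs 2 and 5 is "interesting".
def toy_score(snp_set):
    return 1.0 if {2, 5} <= set(snp_set) else 0.0


pheromone, best_set = aco_sample_sets(
    n_snps=10, set_size=2, n_ants=20, n_iters=50, score=toy_score
)
```

After the run, the SNPs from the rewarded set carry more weight than the rest, and the highest-weighted SNPs would be handed to the exhaustive second stage.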

8 of 19

AntEpiSeeker

The main limitation of the utility:

Unfortunately, this utility can analyze only qualitative (binary) phenotypes. To work around this, we converted the quantitative phenotypes into binary ones: we calculated the median value and assigned “0” to samples with a value below the median and “1” otherwise.

The AntEpiSeeker input should contain the genotypes in “012” format and a binary phenotype.
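The median split described above can be sketched in a few lines. The function name and the example values are hypothetical; only the rule itself (below the median gives 0, otherwise 1) comes from the slide.

```python
from statistics import median


def binarize_by_median(values):
    """Map a quantitative phenotype to 0/1 labels using the median as cutoff."""
    cutoff = median(values)
    # Samples strictly below the median get 0; all others get 1.
    return [0 if value < cutoff else 1 for value in values]


# Hypothetical phenotype values for five samples (median is 4.0).
labels = binarize_by_median([3.1, 4.7, 2.9, 5.2, 4.0])
```

Note that a sample exactly at the median lands in class “1”, so the two classes are not necessarily the same size.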

9 of 19

AntEpiSeeker: Alanine content results

Two-loci interaction mode: 96 interacting SNP pairs ⟶ 25 gene pairs with probable epistatic interactions;

Three-loci interaction mode: 8 interacting SNP triplets ⟶ 1 gene triplet with previously described genes

GO and KEGG annotation of each SNP in the pairs was performed using the ShinyGO 0.77 web tool.

10 of 19

AntEpiSeeker: commercial trait results

High-level GO terms (two-loci interaction mode)

Two-loci interaction mode: 198 interacting SNP pairs ⟶ 42 gene pairs with probable epistatic interactions

11 of 19

AntEpiSeeker: commercial trait results

Three-loci interaction mode: 8 interacting SNP triplets ⟶ 2 gene triplets with previously described genes

High-level GO terms (three-loci interaction mode)

12 of 19

MIDESP

The algorithm implemented in MIDESP is based on information theory. The basic idea is to calculate the amount of mutual information (MI) that is shared between SNPs and phenotype.
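The quantity in question can be illustrated with a plug-in estimate of mutual information for discrete data. This is a sketch only: it assumes genotypes in 012 coding and a discrete phenotype, and it is not claimed to be the estimator MIDESP itself uses. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def pair_mutual_information(snp_a, snp_b, phenotype):
    """MI between the joint genotype of two SNPs and the phenotype."""
    return mutual_information(list(zip(snp_a, snp_b)), phenotype)


# Toy example: the phenotype depends on the two SNPs only jointly (XOR).
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
single = mutual_information(snp_a, phenotype)
joint = pair_mutual_information(snp_a, snp_b, phenotype)
```

In the toy data each SNP alone carries no information about the phenotype, while the pair determines it completely. This is the kind of purely epistatic signal that a pairwise MI search is meant to pick up.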

It also normalizes the mutual information values and corrects them with the average product correction (APC) to reduce background or noise interactions.
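The correction step can be sketched using the standard APC formula: each pairwise value is reduced by the product of the two row means divided by the overall mean. How exactly MIDESP applies it is not shown on the slide, so the matrix layout and the function name here are assumptions.

```python
def average_product_correction(mi):
    """Apply APC to a symmetric matrix of pairwise MI values.

    corrected[i][j] = mi[i][j] - mean_i * mean_j / overall_mean,
    where the means are taken over off-diagonal entries only.
    """
    n = len(mi)
    row_mean = [
        sum(mi[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)
    ]
    overall = sum(row_mean) / n
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            background = row_mean[i] * row_mean[j] / overall if overall else 0.0
            corrected[i][j] = mi[i][j] - background
    return corrected


# Hypothetical MI matrix for three SNPs; pair (0, 1) stands out.
mi_matrix = [
    [0.0, 4.0, 1.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
apc_matrix = average_product_correction(mi_matrix)
```

A matrix in which every pair has the same value is reduced to zeros, which shows that the correction removes a uniform background level and keeps only the pairs that exceed it.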

13 of 19

Initial filtering of the secret phenotype data via PLINK

Step 1. Filter by minor allele frequency and recode to transposed (TPED/TFAM) format:

  --vcf INPUT.vcf --pheno INPUT.tsv --allow-no-sex --allow-extra-chr --double-id --maf 0.05 --prune --recode transpose --chr-set 20 --out TPED and TFAM files

Result: 22214 variants and 248 samples pass filters and QC; 28847 variants and 97 samples pass filters and QC.

Step 2. Compute the list of variants to prune:

  --tfile TPED and TFAM files --allow-no-sex --indep-pairwise 10000 5 0.99 --make-founders --allow-extra-chr --double-id --chr-set 20 --out Filtered_PruneInfo

Result: 5662 of 22214 variants removed; 4967 of 28847 variants removed.

Step 3. Keep only the retained variants:

  --tfile TPED and TFAM files --allow-no-sex --allow-extra-chr --double-id --extract Filtered_PruneInfo.prune.in --make-founders --recode transpose --chr-set 20 --out Filtered_Pruned

Result: 16552 variants and 248 samples pass filters and QC; 23880 variants and 97 samples pass filters and QC.
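The three PLINK calls above could be scripted roughly as follows. This sketch only builds the argument lists and does not run anything. The executable name `plink`, the function `plink_commands`, and the prefix `Filtered` are assumptions, because the slide does not name the intermediate TPED/TFAM prefix. Every flag is taken from the slide.

```python
def plink_commands(vcf, pheno, prefix):
    """Build the argument lists for the three filtering steps."""
    common = ["--allow-no-sex", "--allow-extra-chr", "--double-id",
              "--chr-set", "20"]
    # Step 1: MAF filter and recode to transposed format.
    step1 = ["plink", "--vcf", vcf, "--pheno", pheno, "--maf", "0.05",
             "--prune", "--recode", "transpose", *common, "--out", prefix]
    # Step 2: compute the prune list.
    step2 = ["plink", "--tfile", prefix,
             "--indep-pairwise", "10000", "5", "0.99",
             "--make-founders", *common, "--out", "Filtered_PruneInfo"]
    # Step 3: keep only the variants listed in the .prune.in file.
    step3 = ["plink", "--tfile", prefix,
             "--extract", "Filtered_PruneInfo.prune.in",
             "--make-founders", "--recode", "transpose", *common,
             "--out", "Filtered_Pruned"]
    return [step1, step2, step3]


# Hypothetical prefix for the intermediate TPED/TFAM files.
commands = plink_commands("INPUT.vcf", "INPUT.tsv", "Filtered")

# The reported counts are consistent with the pruning step:
remaining_open = 22214 - 5662        # 16552
remaining_commercial = 28847 - 4967  # 23880
```

Each list could then be passed to `subprocess.run(cmd, check=True)` on a machine where PLINK is installed.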

14 of 19

MIDESP: Alanine content results

Using the mutual information between SNP pairs and the phenotype, we built a gene-gene interaction network.

To do this, we annotated each SNP, mapped the SNPs to genes, and built a weighted graph after filtering by the z-score of the mutual information. The most significant interactions are marked in red in the figure.
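The conversion from SNP pairs to a weighted gene graph can be sketched as below. The z-score cutoff of 2.0, the rule of keeping the highest z-score per gene pair, and all names are assumptions for illustration; the slide does not state the thresholds actually used.

```python
from statistics import mean, pstdev


def gene_edges_from_mi(pair_mi, snp_to_gene, z_cutoff=2.0):
    """Turn SNP-pair MI values into weighted gene-gene edges.

    Keeps SNP pairs whose MI z-score is at least `z_cutoff`, maps both SNPs
    to genes, and uses the highest z-score per gene pair as the edge weight.
    """
    values = list(pair_mi.values())
    mu, sigma = mean(values), pstdev(values)
    edges = {}
    for (snp_a, snp_b), mi in pair_mi.items():
        z = (mi - mu) / sigma if sigma > 0 else 0.0
        gene_a, gene_b = snp_to_gene.get(snp_a), snp_to_gene.get(snp_b)
        if z < z_cutoff or gene_a is None or gene_b is None:
            continue  # weak interaction or SNP outside annotated genes
        if gene_a == gene_b:
            continue  # both SNPs fall into the same gene
        key = tuple(sorted((gene_a, gene_b)))
        edges[key] = max(edges.get(key, 0.0), z)
    return edges


# Hypothetical MI values for six SNP pairs; one pair clearly stands out.
pair_mi = {
    ("s1", "s2"): 0.9, ("s1", "s3"): 0.1, ("s1", "s4"): 0.1,
    ("s2", "s3"): 0.1, ("s2", "s4"): 0.1, ("s3", "s4"): 0.1,
}
snp_to_gene = {"s1": "g1", "s2": "g2", "s3": "g3", "s4": "g4"}
gene_edges = gene_edges_from_mi(pair_mi, snp_to_gene)
```

The resulting dictionary maps gene pairs to weights and can be loaded into any graph library for drawing.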

15 of 19

MIDESP: Alanine content results

KEGG enrichment analysis of these genes showed the most represented metabolic pathways.

This information can help to identify the metabolic processes that matter most for the expression of the trait.
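Pathway enrichment is commonly assessed with a hypergeometric (over-representation) test, and a minimal version is sketched here. Whether the enrichment tool used in this work applies exactly this test is an assumption, and the gene counts in the example are made up.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_background: annotated genes in total
    n_in_pathway: background genes that belong to the pathway
    n_selected:   genes in the interaction graph
    n_overlap:    selected genes that belong to the pathway
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_pathway, n_selected)
    tail = sum(
        comb(n_in_pathway, k)
        * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 2 selected genes, both in a 5-gene pathway,
# out of a background of 10 genes.
p_value = enrichment_p_value(10, 5, 2, 2)
```

In practice the p-values of all tested pathways would also be corrected for multiple testing before ranking the pathways.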

16 of 19

MIDESP: Commercial trait results

This graph represents the gene-gene interactions.

Metabolic pathways are shown by node color, and edge thickness shows the weight (or significance) of each interaction. The most significantly interacting genes are marked in red.

A few nodes are drawn as squares. These are genes found by both MIDESP and AntEpiSeeker.

17 of 19

MIDESP: Commercial trait results

These are the most represented metabolic pathways.

The metabolic processes most significant for the expression of the trait turned out to be very diverse.

18 of 19

BHIT: instrument based on MCMC

The BHIT tool is based on a Markov chain Monte Carlo search (the Metropolis-Hastings algorithm). It supports the analysis of datasets with continuous phenotypic traits.

We followed the pipeline described in the paper:

  1. Imputation of the data with Beagle. The data had already been filtered by MAF. The commercial dataset phenotype did not follow a normal distribution.
  2. Combining the genotype (012 format) and the phenotype with PLINK and the Python3 library pandas.
  3. Filtering SNPs with a feature selection method, namely LASSO regression, performed with a function from the sklearn Python3 package.
  4. Running the BHIT tool on the selected SNPs.

After applying LASSO regression, only 14 SNPs were left for the commercial data and none for the open data. The final BHIT analysis of the commercial dataset SNPs led to 2 pairs of SNPs.

All in all, although the algorithm behind this tool seems promising, filtering by LASSO regression leaves so few SNPs that the output cannot be analyzed properly. In addition, the requirement that the phenotype be normally distributed cannot always be met.
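The LASSO-based feature selection step can be illustrated with a small coordinate-descent sketch. The work itself used a function from sklearn. The pure-Python version below is a stand-in so that the example has no dependencies. It assumes centered genotype columns and a centered phenotype (no intercept), and it minimizes (1/(2n))·||y − Xw||² + alpha·||w||₁. All names and data are hypothetical.

```python
def soft_threshold(value, alpha):
    """Shrink `value` towards zero by `alpha` (the LASSO proximal step)."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_select(X, y, alpha, n_sweeps=100):
    """Coordinate-descent LASSO; returns (selected column indices, weights)."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    residual = list(y)  # residual = y - X @ weights, with weights = 0
    for _ in range(n_sweeps):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0:
                continue  # monomorphic SNP carries no signal
            # Correlation of column j with the residual that excludes j.
            rho = sum(
                column[i] * (residual[i] + weights[j] * column[i])
                for i in range(n)
            ) / n
            new_weight = soft_threshold(rho, alpha) / scale
            delta = new_weight - weights[j]
            if delta:
                residual = [residual[i] - delta * column[i] for i in range(n)]
            weights[j] = new_weight
    selected = [j for j, w in enumerate(weights) if w != 0.0]
    return selected, weights


# Toy data: the phenotype depends on column 0 only.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
selected, weights = lasso_select(X, y, alpha=0.5)
```

A larger `alpha` zeroes out more coefficients, so the penalty strength directly controls how many SNPs survive. This is one possible reason why so few SNPs remained after this step.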

19 of 19

Conclusions

  1. In the course of this work, we analyzed various tools for grouping SNPs. We found MIDESP to be the most promising: it can quickly analyze large datasets with continuous phenotypic traits.
  2. AntEpiSeeker performed well overall; its only downside is the lack of support for continuous phenotypes. We are considering creating our own algorithm based on it to analyze quantitative traits.
  3. The results allowed us to construct interaction graphs of soybean genes.

Many tools have been developed for this kind of analysis of binary phenotypic traits. Continuous traits, which are especially common in plants (protein content, degree of resistance to pathogens, yield, etc.), are much harder: few tools can perform this kind of analysis for them.

Plans

We are planning to create our own easy-to-use pipeline or tool for analyzing the data, integrating other omics resources, and visualizing the results as a system of graphs.