SNP Grouping and evaluation of epistatic interaction for phenotypic traits in agricultural plants
Anastasia Zolotar
Ann Balan
Sergei Volkov
Supervisors:
Elena Grigoreva, NOVA PLANT
Lavrentiy Danilov, NOVA PLANT
Background
By analyzing the interaction graph between SNPs and genes, it is possible to identify the genes with the greatest number of links and to prioritize them for modification. Modifying several significant genes at once makes it possible to change complex traits.
Such data can also help to identify the most promising lines and varieties. If an organism carries a large number of genes from such a graph, the phenotypic trait is likely to be more pronounced or more likely to appear. This approach should be of great use to breeders.
The aim and objectives
Aim: To find approaches and tools for SNP grouping that are applicable to genomic variant analysis of agricultural plants, and to apply them to soybean data.
Objectives:
Datasets
We carried out all tool tests in this study on two datasets for Glycine max (soybean):
Dataset properties | Free-access | Commercial |
# SNPs | 23278 | 37676 |
# samples | 248 | 97 |
Phenotype | amino acid concentrations | a complex trait |
Distribution of phenotypic traits
[Figure: phenotype distributions for the two datasets; panels: Alanine content, Commercial trait]
The tools
In our review, we found information on more than 22 tools for this task and identified those that seemed most relevant.
The most convenient (in terms of analysis potential, amount of output data, and successful launch) were MIDESP, AntEpiSeeker, Martini, and SHEsisPlus.
The full table is available here
AntEpiSeeker
Ant colony optimization (ACO) algorithms, first proposed by Dorigo and Gambardella, are tools for solving difficult optimization problems. In ACO, artificial ants work as parallel units that communicate through a probability distribution function (PDF), which is updated by weights. As the PDF is updated, "paths" that perform better are sampled at higher rates by subsequent artificial ants, which in turn deposit more weight on them.
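The sampling-and-deposit loop described above can be sketched as follows. This is a generic, minimal ACO illustration and not AntEpiSeeker's actual implementation: the function name `aco_sample_snp_sets`, the parameters (`n_ants`, `n_iter`, `evaporation`) and the toy scoring function are all assumptions made for the example.

```python
import random


def aco_sample_snp_sets(n_snps, set_size, score, n_ants=50, n_iter=20,
                        evaporation=0.1, seed=0):
    """Generic ACO sketch: ants sample SNP sets from a weight vector (the PDF).

    `score` is a hypothetical user-supplied function that returns a
    non-negative number for a list of SNP indices (higher is better).
    """
    rng = random.Random(seed)
    pheromone = [1.0] * n_snps          # unnormalized PDF over SNPs
    best_set, best_score = None, float("-inf")

    for _ in range(n_iter):
        deposits = [0.0] * n_snps
        for _ in range(n_ants):
            # Each ant draws `set_size` distinct SNPs, proportionally to pheromone.
            candidates = list(range(n_snps))
            chosen = []
            for _ in range(set_size):
                weights = [pheromone[i] for i in candidates]
                pick = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(pick)
                chosen.append(pick)
            s = score(chosen)
            if s > best_score:
                best_set, best_score = sorted(chosen), s
            # Better-scoring sets deposit more weight on their SNPs.
            for i in chosen:
                deposits[i] += s
        # Evaporate old weights, then add the new deposits.
        pheromone = [(1.0 - evaporation) * p + d
                     for p, d in zip(pheromone, deposits)]

    return best_set, best_score, pheromone


def toy_score(snps):
    # Toy stand-in for an association statistic: SNPs 3 and 7 "interact".
    return len({3, 7} & set(snps))


best_set, best_score, pheromone = aco_sample_snp_sets(10, 2, toy_score)
```

In a real analysis the toy score would be replaced by an association statistic computed on genotype and phenotype data; the weights then concentrate on SNPs that repeatedly appear in well-scoring sets.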
AntEpiSeeker has been developed to search for epistatic interactions in large-scale association studies. The main algorithm consists of:
AntEpiSeeker
The main limitation of the utility:
Unfortunately, this utility can analyze only binary (qualitative) phenotypes. To work around this, we converted quantitative phenotypes into binary ones by calculating the median value and assigning “0” to samples with a value below the median and “1” otherwise.
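A minimal sketch of this median split, assuming the phenotype values are available as a plain list of numbers (the function name `binarize_by_median` is hypothetical):

```python
from statistics import median


def binarize_by_median(values):
    """Return 0 for values below the median and 1 otherwise."""
    cutoff = median(values)
    return [0 if v < cutoff else 1 for v in values]


labels = binarize_by_median([1.0, 2.0, 3.0, 4.0])
```

Note that samples exactly equal to the median fall into class “1”, so the two classes are not necessarily the same size.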
The AntEpiSeeker input must contain genotypes in “012” format and a binary phenotype.
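As an illustration of the “012” encoding, the sketch below converts VCF-style genotype strings into counts of non-reference alleles. It assumes diploid, biallelic calls; the function name `gt_to_012` is hypothetical, and how missing calls should be handled for AntEpiSeeker is left open here (they are returned as `None`).

```python
def gt_to_012(gt):
    """Convert a VCF GT string such as '0/1' or '1|1' to 0, 1 or 2.

    Assumes a diploid, biallelic call; returns None for missing calls.
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    # Count alleles that differ from the reference allele "0".
    return sum(1 for allele in alleles if allele != "0")


encoded = [gt_to_012(gt) for gt in ["0/0", "0|1", "1/1", "./."]]
```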
AntEpiSeeker: Alanine content results
Two-loci interaction mode: 96 interacting SNP pairs ⟶ 25 gene pairs with probable epistatic interactions;
Three-loci interaction mode: 8 interacting SNP triplets ⟶ 1 gene triplet with previously described genes
GO and KEGG annotation of each SNP in the pairs was performed using the ShinyGO 0.77 web tool
AntEpiSeeker: commercial trait results
High level GO terms (two-loci interaction mode)
Two-loci interaction mode: 198 interacting SNP pairs ⟶ 42 gene pairs with probable epistatic interactions
AntEpiSeeker: commercial trait results
Three-loci interaction mode: 8 interacting SNP triplets ⟶ 2 gene triplets with previously described genes
High-level GO terms (three-loci interaction mode)
MIDESP
The algorithm implemented in MIDESP is based on information theory. The basic idea is to calculate the amount of mutual information (MI) that is shared between SNPs and phenotype.
It also normalises the mutual information values, and corrects them with APC (the average product correction) to reduce background or noise interactions.
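The idea can be sketched as below. This is not MIDESP's code: it assumes discrete (categorical) genotypes and phenotypes, uses a plain plug-in MI estimate, and applies the standard APC formula (each pairwise value minus the product of the two SNPs' mean values divided by the overall mean). All function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def pair_phenotype_mi(snp_a, snp_b, phenotype):
    """MI between the joint genotype of a SNP pair and the phenotype."""
    return mutual_information(list(zip(snp_a, snp_b)), phenotype)


def apc_correct(mi):
    """Average product correction for a symmetric matrix of pairwise MI values.

    The diagonal is ignored and set to 0.0 in the result.
    """
    n = len(mi)
    row_mean = [sum(mi[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    overall = sum(mi[i][j] for i in range(n) for j in range(n)
                  if i != j) / (n * (n - 1))
    corrected = [[0.0] * n for _ in range(n)]
    if overall == 0:
        return corrected
    for i in range(n):
        for j in range(n):
            if i != j:
                corrected[i][j] = mi[i][j] - row_mean[i] * row_mean[j] / overall
    return corrected


# Toy data: the phenotype is the XOR of two SNPs, so neither SNP is
# informative alone, but the pair is.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
single_mi = mutual_information(snp_a, phenotype)
pair_mi_value = pair_phenotype_mi(snp_a, snp_b, phenotype)
```

The XOR example shows why pairwise MI is useful for epistasis: the single-SNP MI is zero while the pair carries one full bit of information about the phenotype.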
Initial filtering of genotype and phenotype data with PLINK

Step 1: MAF filtering and conversion to transposed format
--vcf INPUT.vcf
--pheno INPUT.tsv
--allow-no-sex
--allow-extra-chr
--double-id
--maf 0.05
--prune
--recode transpose
--chr-set 20
--out TPED and TFAM files
Free-access dataset: 22214 variants and 248 samples pass filters and QC
Commercial dataset: 28847 variants and 97 samples pass filters and QC

Step 2: LD-based selection of variants to prune
--tfile TPED and TFAM files
--allow-no-sex
--allow-extra-chr
--double-id
--indep-pairwise 10000 5 0.99
--make-founders
--chr-set 20
--out Filtered_PruneInfo
Free-access dataset: 5662 of 22214 variants removed
Commercial dataset: 4967 of 28847 variants removed

Step 3: Extraction of the retained variants
--tfile TPED and TFAM files
--allow-no-sex
--allow-extra-chr
--double-id
--extract Filtered_PruneInfo.prune.in
--make-founders
--recode transpose
--chr-set 20
--out Filtered_Pruned
Free-access dataset: 16552 variants and 248 samples pass filters and QC
Commercial dataset: 23880 variants and 97 samples pass filters and QC
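The three PLINK runs listed above can be assembled programmatically, for example as argument lists for `subprocess.run`. The sketch below uses only the flags shown on the slide; the executable name `plink`, the function name `build_plink_steps` and the intermediate prefix passed as `prefix` are assumptions.

```python
def build_plink_steps(vcf, pheno, prefix):
    """Return the three PLINK command lines as argument lists.

    `prefix` is a hypothetical name for the intermediate TPED/TFAM files.
    """
    common = ["--allow-no-sex", "--allow-extra-chr", "--double-id",
              "--chr-set", "20"]
    # Step 1: MAF filter, drop samples without phenotype, write TPED/TFAM.
    step1 = ["plink", "--vcf", vcf, "--pheno", pheno, *common,
             "--maf", "0.05", "--prune",
             "--recode", "transpose", "--out", prefix]
    # Step 2: compute the LD-based list of variants to keep and to prune.
    step2 = ["plink", "--tfile", prefix, *common,
             "--indep-pairwise", "10000", "5", "0.99",
             "--make-founders", "--out", "Filtered_PruneInfo"]
    # Step 3: keep only the variants listed in the .prune.in file.
    step3 = ["plink", "--tfile", prefix, *common,
             "--extract", "Filtered_PruneInfo.prune.in",
             "--make-founders", "--recode", "transpose",
             "--out", "Filtered_Pruned"]
    return [step1, step2, step3]


steps = build_plink_steps("INPUT.vcf", "INPUT.tsv", "Filtered")
```

Each list can then be passed to `subprocess.run(cmd, check=True)` on a machine where PLINK is installed.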
MIDESP: Alanine content results
Using the mutual information between SNP pairs and the phenotype, we built a gene-gene interaction network.
For this, we annotated each SNP, mapped the SNPs to genes, and built a weighted graph after filtering by the z-score of the mutual information. The most significant interactions are highlighted in red in the figure.
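The conversion from SNP pairs to a weighted gene graph could look roughly like the sketch below. The z-score cutoff of 2.0, the use of the z-score as edge weight, and the names `build_gene_graph`, `pair_mi` and `snp_to_gene` are assumptions for illustration; the slide does not state the threshold that was actually used.

```python
from statistics import mean, pstdev


def build_gene_graph(pair_mi, snp_to_gene, z_cutoff=2.0):
    """Build a weighted gene-gene graph from SNP-pair mutual information.

    pair_mi: dict mapping (snp_a, snp_b) to an MI value.
    snp_to_gene: dict mapping SNP IDs to gene IDs (unannotated SNPs absent).
    Returns a dict mapping a sorted gene pair to its largest z-score.
    """
    values = list(pair_mi.values())
    mu, sigma = mean(values), pstdev(values)
    graph = {}
    for (snp_a, snp_b), mi in pair_mi.items():
        z = (mi - mu) / sigma if sigma > 0 else 0.0
        gene_a, gene_b = snp_to_gene.get(snp_a), snp_to_gene.get(snp_b)
        # Skip weak pairs, unannotated SNPs and pairs inside one gene.
        if z < z_cutoff or gene_a is None or gene_b is None or gene_a == gene_b:
            continue
        edge = tuple(sorted((gene_a, gene_b)))
        graph[edge] = max(graph.get(edge, 0.0), z)
    return graph


# Toy input: one SNP pair stands out from five background pairs.
pair_mi = {("s1", "s2"): 0.9, ("s1", "s3"): 0.1, ("s1", "s4"): 0.1,
           ("s2", "s3"): 0.1, ("s2", "s4"): 0.1, ("s3", "s4"): 0.1}
snp_to_gene = {"s1": "GeneA", "s2": "GeneB", "s3": "GeneC", "s4": "GeneD"}
graph = build_gene_graph(pair_mi, snp_to_gene)
```

Counting how many edges touch each gene in the resulting dictionary gives the number of links per gene, which is the quantity used to rank candidate genes.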
MIDESP: Alanine content results
KEGG enrichment analysis of these genes showed the most represented metabolic pathways.
This information can help identify the metabolic processes most significant for trait expression.
MIDESP: Commercial trait results
This graph represents the gene-gene interactions.
Metabolic pathways are shown by node color, and edge thickness shows the weight (significance) of an interaction. The most significantly interacting genes are highlighted in red.
A few nodes are drawn as squares. These are genes found by both MIDESP and AntEpiSeeker.
MIDESP: Commercial trait results
These are the most represented metabolic pathways.
The most significant metabolic processes for trait expression turned out to be very diverse.
BHIT: a tool based on MCMC
The BHIT tool is based on Markov chain Monte Carlo search (the Metropolis-Hastings algorithm). It supports analysis of datasets with continuous phenotypic traits.
We went through the pipeline described in the paper:
LASSO regression was performed with a function from the sklearn Python 3 package.
After applying LASSO regression, only 14 SNPs were left for the commercial data and none for the open data. The final BHIT analysis of the commercial dataset SNPs led to 2 pairs of SNPs.
All in all, even though the algorithm behind this tool seems promising, filtering by LASSO regression leaves so few SNPs that the output cannot be analyzed properly. In addition, the requirement that the phenotype be normally distributed cannot always be met.
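To illustrate why LASSO filtering can leave very few (or no) SNPs, here is a small dependency-free sketch of LASSO by coordinate descent. It is a hand-written stand-in, not the sklearn function used in the study; the function names and the `alpha` values are assumptions, and the objective is the usual (1/2n)·||y − Xw||² + alpha·||w||₁.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_select(genotypes, phenotype, alpha, n_iter=200):
    """Return indices of SNPs with non-zero LASSO coefficients.

    genotypes: list of samples, each a list of 0/1/2 codes.
    phenotype: list of continuous trait values, one per sample.
    """
    n, p = len(phenotype), len(genotypes[0])
    y_mean = sum(phenotype) / n
    residual = [v - y_mean for v in phenotype]   # all weights start at zero
    columns = []
    for j in range(p):                            # center each SNP column
        col = [row[j] for row in genotypes]
        col_mean = sum(col) / n
        columns.append([v - col_mean for v in col])
    weights = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            scale = sum(c * c for c in col) / n
            if scale == 0.0:                      # monomorphic SNP
                continue
            # Covariance of the column with the partial residual.
            rho = sum(c * (r + c * weights[j])
                      for c, r in zip(col, residual)) / n
            new_weight = soft_threshold(rho, alpha) / scale
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                weights[j] = new_weight
    return [j for j, w in enumerate(weights) if w != 0.0]


# Toy data: the trait depends only on the first SNP.
snp0 = [0, 1, 2, 0, 1, 2, 0, 1]
snp1 = [0, 0, 0, 0, 1, 1, 1, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]
genotypes = [list(row) for row in zip(snp0, snp1, snp2)]
trait = [2.0 * g for g in snp0]
kept_mild = lasso_select(genotypes, trait, alpha=0.1)
kept_strong = lasso_select(genotypes, trait, alpha=10.0)
```

With a mild penalty only the causal SNP survives, and with a strong penalty every coefficient is shrunk to zero, which mirrors the situation where no SNPs were left for the open data.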
Plans
We are planning to create our own handy, easy-to-use pipeline or tool for analyzing the data, integrating other omics resources, and visualizing the results as a system of graphs.
Conclusions
Many tools have been developed for this kind of analysis with binary phenotypic traits. Continuous traits, which are particularly common in plants (protein content, degree of resistance to pathogens, yield, etc.), are much more difficult: few tools can analyze them.