SCS2016 poster numbers
KU Leuven
Evolutionary conservation of Ebola virus proteins predicts important functions at residue level
Ahmed Arslan
Ebola virus (EBOV), a member of the Filoviridae family, causes a severe hemorrhagic fever known as Ebola virus disease (EVD), with a mortality rate of up to 90%. Due to the recent devastation and lack of medication, EBOV has attracted renewed interest as a model for virus evolution. Recent literature on EBOV has improved our understanding of the underlying genetics and its scope with reference to the 2014 outbreak. However, no study has yet focused on the conservation patterns of EBOV proteins that can contribute to virus fitness and the onset of EVD. Our first aim was to identify the parts of proteins that have been evolutionarily conserved since the first outbreak, in order to gain more insight into the molecular biology of EVD. Our second aim was to map functional information to those conserved residues with computational biology tools. We (i) created a large collection of protein sequences based on EBOV genomes sequenced during recent and previous outbreaks and correlated protein conservation with function; (ii) collected known and predicted novel post-translational modifications (PTMs) on EBOV proteins; (iii) collected protein-protein interactions (PPIs) between virus and host proteins; (iv) mapped conserved residues onto three-dimensional structures and protein complexes to identify modified residues present at interaction interfaces; and (v) searched for motifs that may mediate protein-protein interactions. We identified the most conserved residues in EBOV proteins and protein complexes and explored their functional attributes by predicting (a) 78 post-translationally modified sites, (b) the presence of eight conserved PTMs in protein-protein interactions and (c) two linear motifs. Phosphorylation is the most frequent PTM type in our analysis, and we predicted three potential kinases responsible for these modification events in EBOV.
The presence of ATM kinase motifs in all EBOV proteins is the most important finding of our analyses. ATM is a DNA damage response (DDR) element, and the presence of its recognition motifs in EBOV proteins suggests that pathways closely associated with EVD, such as the PI3K/Akt and MAPK signalling pathways, intersect with DDR pathways. Based on our results and the current understanding of the pathways activated in EVD, we anticipate an association of ATM kinase with these pathways and with the kinases through which Ebola pathology is achieved.
University of Cambridge
Great Britain
An interactive three-dimensional eukaryotic model - A comparative tool for the evolution of the core components of the exon junction complex and intron density in eukaryotes.
Bridget Bannerman Richard Dorrell, Mark Carrington
Background: The exon junction complex (EJC) has a central role in marking splice sites in eukaryotic mRNA transcripts. Previous comparative genomics studies of the four core components of the EJC, Magoh, Y14, eIF4AIII and MLN51, have been performed mainly within the Opisthokonta and Archaeplastida eukaryotic super-groups. Many eukaryotic pathogens, such as African trypanosomes and Plasmodium (the causative agents of human trypanosomiasis and malaria, respectively), fall outside the super-groups containing animals, fungi and plants; trypanosomes are in the Excavata super-group and Plasmodium falls in the SAR group. Both are single-celled parasites whose unique ability to survive in their hosts can be attributed to their versatile gene expression strategies. Identifying unique differences in mRNA metabolism pathways between trypanosomatid and apicomplexan parasites and their free-living relatives would provide insight into the evolution of parasites in eukaryotes. Description: I have performed comparative genomic analyses and phylogenetic studies of the core components of the exon junction complex (EJC) with homologues from organisms of the eukaryotic super-groups: Archaeplastida, Opisthokonta, Amoebozoa, SAR, Excavata and CCTH. Several trypanosomes and related free-living organisms in the Excavata eukaryotic super-group as well as parasitic apicomplexans and their free-living relatives from the SAR super-group were included. I have demonstrated that a core protein of the exon junction complex, eIF4AIII, is conserved in all eukaryotes and was present in the last eukaryotic common ancestor (LECA). Magoh and Y14 were present in the LECA, but were selectively lost in intron-poor species. Y14 has undergone a founder effect within the trypanosome lineage. Conclusions: I have designed an interactive 3-D model/eukaryotic map illustrating (1) the distribution of the EJC amongst the six eukaryotic super-groups and (2) the correlation of the core components of the EJC to intron density amongst both parasites and non-parasites.
The interactive feature of the model can be used to illustrate the variation of intron densities in comparison to the presence of the different EJC core units amongst different eukaryotic species, and to display instantly the position of any newly sequenced genome of a parasitic or non-parasitic eukaryote on the intron density scale.
Texas Tech University
United States of America
Computational Transposable element annotation using de novo base repeat identification.
Laura Blanco-Berdugo
Transposable elements (TEs) are genetic elements that have the ability to replicate and relocate themselves around the host genome. The number of reference genomes has increased at a faster rate than the effort to annotate transposable elements from non-model species. Methods for identifying these elements vary significantly from project to project, which increases the variation in TE annotation when less than optimal methods are used. It was found that, across a variety of taxa, it becomes more difficult to identify TEs based only on homology as the phylogenetic distance between the queried genome and the reference genome increases. We annotated the repeats using both homology alone, as is usually done with new genome analyses, and a combination of homology and de novo methods with an additional manual curation step. The use of these methods revealed a substantial number of new TE subfamilies in genomes that were previously characterized, recognized a higher proportion of the genome as repetitive, and decreased the average genetic distance within TE families, implying recent TE accumulation. Lastly, the findings were confirmed via analysis of the postman butterfly (Heliconius melpomene). These observations imply that complete TE annotation relies on a combination of homology and de novo based repeat identification, manual curation and classification, and that relying on simple, homology-based methods is insufficient to accurately describe the TE landscape of newly sequenced genomes.
Universidad Pablo de Olavide
cICB: a modular high-throughput computational pipeline for the annotation of proteins of unknown function by Integrative Cell Biology
Nicola Bordin Juan Carlos Gonzalez Sanchez, Damien Devos
During the last decade, the gap between sequence determination and functional annotation has increased dramatically, resulting in an incomplete understanding of the data we have generated. Automatic annotation pipelines ease the burden of manual annotation, but are limited in scope and coverage. Computational tools for proteome annotation are intrinsically conservative in assigning a definitive function (76% of proteins in UniProt/TrEMBL are annotated as “unknown” or “uncharacterized”) and tend to focus on specific aspects of the protein such as functional domains, signal peptide prediction, or the presence of transmembrane helices. Compartmentalizing the annotation gives a very specific characterization of one aspect of the protein, at the cost of losing the general overview of the protein's function and role in its environment. Integrating results from several databases and tools allows us to simultaneously question several related aspects of a protein's function. We have created a computational pipeline that combines the advantages of manual curation with the speed and power of bioinformatics. The pipeline allows the characterization of whole proteomes as well as single proteins. We applied the Integrative Cell Biology (ICB) pipeline to 40 bacterial proteomes belonging to the PVC superphylum and were able to increase the average proportion of annotated proteins from 54% to 78%. The pipeline is modular, open, and can be installed locally or run on our server. We illustrate the advantages of ICB with detailed results. The system and results will be available at
University of South Florida
United States of America
Quantifying Conformational Ensemble Changes in Proteins Using Inverse Machine Learning
Mohsen Botlani Mohsen Botlani, Ahnaf Siddiqui, Sameer Varma
Background: Protein activities are regulated tightly in biological environments. An understanding of their regulatory mechanisms entails assessment of their various states, including active and inactive states. For many proteins, these states can be distinguished based on their minimum-energy conformations, since the magnitudes of thermal fluctuations, or dynamics, are negligible compared to the differences in minimum-energy structures. This approximation, however, breaks down for several other proteins. The states of these proteins can only be distinguished categorically from each other when their finite-temperature conformational ensembles are considered alongside their minimum-energy structures. The list of such proteins has grown rapidly in the last decade and now includes GPCRs, PDZ domains, nuclear transcription factors, heat shock proteins, T-cell receptors and viral attachment proteins. Applying molecular simulations to understand mechanisms in this latter category of proteins requires the development of new methods that can deal with high-dimensional conformational ensemble data. Description: The traditional approach to comparing protein conformational ensembles is to compare their respective summary statistics. However, if a subset of the summary statistics from the two ensembles is found to be identical, it does not imply that the remaining summary statistics will also be identical. The general problem of finding and choosing a feature that appropriately distinguishes ensembles can be overcome by comparing ensembles directly against each other, prior to any dimensionality reduction. We have developed a method to accomplish just that; it performs excellently for both Gaussian and non-Gaussian distributions. The difference between ensembles is computed by solving the inverse machine learning problem, in terms of a metric that satisfies the conditions set forth by the zeroth law of thermodynamics.
Conclusions: Such a quantification permits statistical analyses and quantitative data mining necessary for establishing causality in protein functional regulation. We have applied this method to (a) quantitatively understand the effect of ligand binding on the structure and dynamics of a viral protein whose function is controlled by dynamic allostery; (b) understand the role of water in the inception of allosteric signals; (c) determine intersecting signaling pathways. This method is available under standard GNU license on SimTk (
Jennie L. Catlett
University of Nebraska-Lincoln
United States of America
Modeling and Verification of a Syntrophic Relationship Between Human Gut Microbes
Jennie L. Catlett Jennie L. Catlett, Mikaela Cashman, Megan D. Smith, Mary Walter, Jonathan Catazaro, Zahmeeth Sakkaff, Robert Powers, Myra B. Cohen, Massimiliano Pierobon, Christine Kelley, Nicole R. Buan
Background: The human gut microbiota is a diverse community of bacteria, archaea, and eukaryotic cells. Close interactions within this community are hypothesized to form complex and interdependent metabolisms (syntrophy) that are integral to maintaining human digestive and overall health. To study these interactions, we combine metabolomics, genetics and microbial growth studies with software engineering techniques and metabolic modeling to explore the limits of a proposed syntrophic relationship between two members of the gut microbiota: the Gram-negative bacterium Bacteroides thetaiotaomicron and the methane-producing archaeon Methanobrevibacter smithii. Within the anaerobic conditions of the small intestines, B. thetaiotaomicron partially oxidizes dietary polysaccharides through fermentation to produce small-chain carboxylic acids, carbon dioxide and hydrogen gas. M. smithii utilizes the fermentation products formate and acetate with hydrogen gas to reduce the carbon dioxide to methane. Description: To characterize this potential relationship, we use classification trees from machine learning in combination with variable coverage analysis from software engineering to predict a syntrophic relationship through analysis of in vitro high-throughput growth studies and in silico flux balance analysis of uncurated models. We verify in vitro that B. thetaiotaomicron fermentation products are sufficient for M. smithii growth, and that a build-up of fermentation products alters B. thetaiotaomicron production of small-chain carboxylic acids and metabolites to benefit the growth of M. smithii. Growth studies further suggest that B. thetaiotaomicron grows better in the absence of fermentation products or in the presence of M. smithii. Conclusions: A combination of in vitro and in silico techniques indicates a mutually beneficial syntrophic relationship between B. thetaiotaomicron and M. smithii.
By treating the metabolisms of living organisms as highly-configurable software systems, we were able to guide and assist laboratory experimentation. These results will be used to refine metabolic models and analyses to increase the accuracy of future predictions without the need for full-factorial experimentation or manual curation of metabolic models.
Institut Curie
Urszula Czerwinska Czerwinska Urszula, Barillot Emmanuel, Soumelis Vassili, Zinovyev Andrei
Background: In many fields of science, observations of a studied system represent complex mixtures of signals of various origins. Tumors are engulfed in a complex tumor microenvironment (TME) that critically impacts progression and response to therapy. It includes tumor cells, fibroblasts, and a diversity of immune cells. It is known that, under some assumptions, it is possible to separate complex signal mixtures using classical and advanced methods of source separation and dimension reduction. Our recent large-scale analysis of more than 6500 tumor transcriptomes, applying classical blind source separation methods, showed that we can reliably separate signals coming from the tumor microenvironment from tumor-specific signals and various technical artefacts. Description: In this work, we apply independent component analysis (ICA) to decipher the sources of signals shaping the transcriptomes (global quantitative profiling of mRNA molecules) of tumor samples, with a particular focus on immune system-related signals. We use ICA iteratively, decomposing signals into sub-signals that can be interpreted using pre-existing immune signatures through correlation or enrichment analysis. Our analysis revealed that signals related to groups of immune cell types can be identified with an unsupervised learning approach in a breast cancer dataset. Using Fisher's exact test, we identified significant groups corresponding to three out of five sub-signals: (1) T-cells, (2) DC/Macrophages, (3) Monocytes/Macrophages/Eosinophils/Neutrophils. The T-cell metagene correlates well with tumor grade (Kruskal-Wallis test p-value=0.003). Conclusions: Our work describes the most important underlying factors of the tumor microenvironment transcriptome, with a focus on immune infiltration in breast cancer. Ongoing analysis aims to evaluate the robustness of the represented groups and possible differences between several types of cancer.
We aim to characterize the degree of immune infiltration in the cancer transcriptome dataset and to correlate it with patient survival and tumor characteristics. If successful, the results will be used in cancer diagnosis and therapy, especially immunotherapies.
United States of America
Differential Expression Analysis for Highly Related Samples
Natalie Davidson Kjong-van Lehmann, Gunnar Rätsch
Background: Large-scale efforts to measure genomic and transcriptomic patterns across several cancer types have helped to identify the genetic diversity of cancer. The degree of intra-cancer variability greatly complicates the analysis of differences between cancer types. Identifying differentially expressed genes between cancer types or normal samples is especially confounded, since expression variability within a single cancer type is large. When the variance within a single condition is large, or the samples are highly correlated, the typical fixed-effects model can lead to an increase in genes falsely identified as differentially expressed. Description: Our method identifies differentially expressed genes between cancer types with high expression variability by using a mixed-effects model that incorporates relatedness between samples to account for variance within cancer types. Relatedness is calculated from the correlation of somatic and germline variants. We validated the method on simulated and real data that exhibited high relatedness between samples, comparing our model against a baseline fixed-effects model. Our simulation shows that, in cases where samples are highly correlated and there are more than 20 samples within a sample group, the mixed-effects model achieves a false positive rate of 0.012 and a false negative rate of 0.19. In comparison, the fixed-effects model has a false positive rate of 0.02 and a false negative rate of 0.63. This clearly shows that a mixed-model approach is able to account for structured variability within cancer types and identifies fewer false positives and significantly fewer false negatives. We applied this approach to TCGA samples of uterine carcinosarcoma and uterine corpus endometrial carcinoma. Our mixed-effects model identifies 5505 genes that are significantly different between the cancers, while also accounting for cancer relatedness.
A GO analysis revealed they are enriched for processes related to cell adhesion, a known difference between epithelial cancers and carcinosarcomas. Conversely, the top differentially-expressed genes from the fixed-effects model were not enriched for any cell adhesion related processes. Conclusion: We validated our model on simulated data and successfully applied it to real data to identify differentially expressed genes between cancer types. We recommend accounting for sample relatedness in differential-expression analysis, especially in the context of cancer.
University of Arizona
United States of America
Adaptive local realignment via parameter advising
Dan DeBlasio John Kececioglu
Mutation rates can vary across the residues of a protein, but when multiple sequence alignments are computed for protein sequences, typically the same choice of values for the substitution score and gap penalty parameters is used across the entire protein. We provide for the first time a new method called adaptive local realignment, which computes protein multiple sequence alignments that automatically use diverse alignment parameter settings in different regions of the input sequences. This allows the aligner’s parameter settings to locally adapt across a protein to more closely follow varying mutation rates. Our method builds on the Facet alignment accuracy estimator, and our prior work on global alignment parameter advising. In a computed alignment, for each region that has low estimated accuracy, a collection of candidate realignments is generated using a set of alternate parameter choices. If one of these alternate realignments has higher estimated accuracy than the original subalignment, the original subalignment is replaced by it. Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter choice. In particular, local realignment, when combined with existing methods for global parameter advising, boosts alignment accuracy by almost 24% over the best default parameter setting on the hardest-to-align benchmarks. A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment, using Facet for parameter advising, is available free for non-commercial use at This site also contains the benchmarks from our experiments, and optimal sets of parameter choices.
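The realignment loop described in this abstract can be sketched as follows. This is only an illustration of the control flow, not the Opal or Facet implementation: the function names, the region representation (plain strings), the accuracy threshold and the toy estimator and realigner are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def adaptive_local_realign(
    regions: Sequence[str],
    parameter_choices: Sequence[int],
    estimate: Callable[[str], float],
    realign: Callable[[str, int], str],
    threshold: float,
) -> List[str]:
    """Replace each low-accuracy region by its best-scoring realignment.

    `estimate` stands in for an accuracy estimator and `realign` for an
    aligner run with one parameter choice; both are hypothetical here.
    """
    result = []
    for region in regions:
        best, best_score = region, estimate(region)
        if best_score < threshold:  # only revisit regions estimated as poor
            for params in parameter_choices:
                candidate = realign(region, params)
                score = estimate(candidate)
                if score > best_score:  # keep a candidate only if it improves
                    best, best_score = candidate, score
        result.append(best)
    return result


# Toy stand-ins (assumptions): accuracy = fraction of non-gap characters,
# and "realigning" with a gap penalty of 2 or more simply drops the gaps.
def toy_estimate(region: str) -> float:
    return 1.0 - region.count("-") / len(region) if region else 0.0


def toy_realign(region: str, gap_penalty: int) -> str:
    return region.replace("-", "") if gap_penalty >= 2 else region


realigned = adaptive_local_realign(
    ["ACGT", "A--T"], [1, 2], toy_estimate, toy_realign, threshold=0.9
)
```

Passing the estimator and the aligner in as callables keeps the sketch independent of any particular tool; a region is replaced only when some candidate scores strictly higher than the original.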
Sam Higginbottom Institute of Agriculture, Technology and Sciences
Structure-based virtual screening studies for the inhibition of polyamine biosynthesis by targeting ornithine decarboxylase of Serratia marcescens strain WW4
Kalyani Dhusia Pramod K. Yadav, Rohit Farmer and Pramod W. Ramteke
The enzyme ornithine decarboxylase (ODC) catalyzes the decarboxylation of ornithine to form putrescine, which is a committed step in the biosynthesis of polyamines. Polyamines are essential for growth, cell proliferation and differentiation, but are toxic when produced in excess. Because ornithine is the immediate precursor of polyamines and ODC plays the central role in this biosynthetic pathway, ODC is a key target for studies of polyamine biosynthesis inhibition. The present work focuses on the inhibition of polyamine production in Serratia marcescens, a plant growth-promoting rhizobacterium (PGPR) that helps in biological control against P. nicotianae, an important soil-borne phytopathogenic fungus, and also promotes increased shoot length, shoot dry weight, root length, and root dry weight. In this study, the structure of the ODC protein was generated using MODELLER 9v8. After modeling, the protein structure was validated and subjected to molecular docking studies. A total of 142 natural products from the Indofine herbal ingredient set of the ZINC database were screened using AutoDock Vina to identify leading herbal inhibitors. Docking results for the top-ranking compounds were analyzed using the consensus scoring function of X-Score to calculate binding affinity, and Ligplot was used to examine protein–ligand interactions. The docking results showed that Conessine, Sumaresinolic acid, DNC, Exolone, Naringenin, Hesperidin and Baicalin were the top inhibitor candidates, with docking affinities of -9.7, -9.2, -9.0, -8.9, -8.8, -8.8 and -8.2 kcal/mol, respectively. Ligplot showed hydrogen bonds (Gln680, Ala650 and Gln649) and a hydrophobic interacting residue (Val647) with conessine as the herbal ligand in the binding site of the ODC protein.
These herbal inhibitors may prove important in controlling the toxicity caused by excess production of polyamines by PGPRs. According to the literature, no similar approach to the inhibition of polyamine biosynthesis has been reported yet. Key words: Ornithine decarboxylase, Herbal Inhibitor, Virtual Screening, AutoDock Vina, Docking.
DLab, Fundacion Ciencia y Vida
Modeling multiscale complex biological systems using PISKa
Ignacio Fuenzalida Ignacio Fuenzalida, Alberto J.M. Martin, Alejandro Bernardin, Tomas Perez-Acle
Background: The Stochastic Simulation Algorithm (SSA) is a method to model the dynamical behavior of complex systems. Despite its increasing relevance, most implementations cannot deal properly with the combinatorial diversity and spatial heterogeneity of biological systems. This work describes the development of PISKa, a multiscale modeling tool to perform agent-based simulation of spatially explicit biological systems. Description: We have expanded the Kappa language to include explicit compartment and diffusion definitions. To execute simulations, we have mathematically defined a new approximate algorithm based on SSA, which can run over distributed machines. Three models have been used to demonstrate the properties of PISKa: a well-studied signaling network of the circadian clock, a model to study the influence of information on the spread of Ebola fever, and a predator-prey system in which two protozoans avoid extinction through changes in the spatial distribution of their habitat. By comparing simulation averages with the sequential algorithm, we have tested the accuracy and performance of PISKa. We showed how the speed-up and the correlation with sequential simulations behave for different parameter configurations, including synchronization step and number of computing nodes and cores. The speed-up was measured based on events and biological time, while for the accuracy we used several correlation methods. Conclusions: PISKa allows massive simulations composed of up to millions of agents distributed in different compartments that can be controlled by up to thousands of reactions or rules. By using the Kappa language, PISKa model files are simpler and easier to write and understand than those employed by other implementations. The divide-and-conquer approach used to solve the parallel compartments allows higher speed-up because rules or reactions will also be split between compartments.
Considering our results, we propose PISKa as a fast and versatile simulation tool to study the dynamic behavior of complex systems.
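For readers unfamiliar with the Stochastic Simulation Algorithm mentioned above, a minimal sketch of the standard sequential (direct-method) SSA follows. It is not PISKa's approximate distributed algorithm and does not use the Kappa language; the function name, the reaction representation and the toy A → B reaction are assumptions chosen for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
# A reaction is a propensity function plus the change it applies to the state.
Reaction = Tuple[Callable[[State], float], Dict[str, int]]


def gillespie(initial: State, reactions: List[Reaction], t_end: float,
              rng: random.Random) -> List[Tuple[float, State]]:
    """Direct-method SSA: returns the list of (time, state) snapshots."""
    state = dict(initial)
    t = 0.0
    trajectory = [(t, dict(state))]
    while True:
        propensities = [rate(state) for rate, _ in reactions]
        total = sum(propensities)
        if total <= 0.0:  # nothing can fire any more
            break
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t > t_end:
            break
        # Pick one reaction with probability proportional to its propensity.
        index = rng.choices(range(len(reactions)), weights=propensities)[0]
        for species, delta in reactions[index][1].items():
            state[species] += delta
        trajectory.append((t, dict(state)))
    return trajectory


# Toy model (assumption): irreversible conversion A -> B with rate constant 1.0.
toy_reactions: List[Reaction] = [
    (lambda s: 1.0 * s["A"], {"A": -1, "B": +1}),
]
trajectory = gillespie({"A": 50, "B": 0}, toy_reactions, t_end=1e9,
                       rng=random.Random(0))
```

Each iteration draws an exponentially distributed waiting time from the total propensity and then selects which reaction fires; a spatial or distributed variant would additionally have to synchronize compartments, which this sketch does not attempt.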
Pacific Northwest National Lab (PNNL)
United States of America
First-Principles Modeling of Metabolism using Statistical Thermodynamics and Maximum Entropy
Garrett Goh Jeremy Zucker, Douglas Baxter, William Cannon
Historically, modeling metabolism has fallen under two major approaches. Ideally, a well-parameterized kinetic model would provide detailed insight into metabolic processes. However, owing to the challenge of obtaining large-scale kinetic measurements, this approach is not tractable for larger systems. In contrast, constraint-based methods, such as flux-balance analysis (FBA), use the law of mass action to predict possible solutions of metabolic fluxes. However, the underdetermined nature of FBA requires the use of empirically determined objective functions to “select” an appropriate prediction. Here, we present the theoretical framework of using statistical thermodynamics to model metabolism. This approach combines the simplicity of constraint-based methods, where no kinetic data is required, with the ability to model metabolism dynamically, as in traditional kinetic models. Metabolism is modeled as a series of states and, using only standard free energies of reaction as input parameters, a stochastic simulation that propagates based on the principles of thermodynamics and maximum entropy is achieved. Therefore, no empirically determined objective function is needed to select the optimal solution. Metabolic pathways, such as glycolysis, were simulated, and the predicted metabolite concentrations agree with experimental measurements to within 0.5 log concentration units, with a correlation coefficient of over 0.9. Future directions of scaling up to encompass central metabolism and genome-scale models are also discussed.
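The abstract states that standard free energies of reaction are the only input parameters. As a small illustration of why that information is useful, the sketch below applies the textbook relation K = exp(−ΔG°/RT) to convert a standard free energy into an equilibrium constant. It is not the authors' simulation; the function name, the kJ/mol units and the default temperature of 298.15 K are assumptions.

```python
import math

GAS_CONSTANT_KJ = 8.314462618e-3  # kJ / (mol * K)


def equilibrium_constant(delta_g0_kj_per_mol: float,
                         temperature_k: float = 298.15) -> float:
    """Equilibrium constant from a standard free energy: K = exp(-dG0 / RT)."""
    return math.exp(-delta_g0_kj_per_mol / (GAS_CONSTANT_KJ * temperature_k))


# A reaction with dG0 = 0 has K = 1; each -RT*ln(10) (about -5.71 kJ/mol at
# 298.15 K) of standard free energy shifts K by one log10 unit.
k_neutral = equilibrium_constant(0.0)
k_favourable = equilibrium_constant(-5.708)
```

Expressing K on a log10 scale makes it directly comparable with agreement quoted in log concentration units.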
University of Oklahoma Health Science Center
United States of America
Sex-mutual and sexually-dimorphic alterations in hippocampal DNA methylation with aging
Niran Hadad Dustin R. Masser, Nicholas W. Clark, David R. Stanford, Willard M. Freeman
Introduction: Aging is associated with a plethora of diseases such as cancer, cardiovascular disorders and neurodegeneration. Aberrant DNA methylation is a key process in many age-related diseases; however, there is limited understanding of how DNA methylation changes with the aging process. In this study we investigated age-related changes in DNA methylation in the hippocampus in a base-, strand- and sex-specific manner. Methods: To determine reproducible, age-related alterations to the hippocampal methylome in both males and females, two independent sets of male and female mice (young, 3 months; aged, 24 months) were collected. One set of animals was subjected to whole genome bisulfite sequencing while the other was subjected to bisulfite oligonucleotide capture sequencing for 110 Mb of gene promoters, CG islands, and other gene regulatory regions. Results: There was no evidence of genome-wide hyper- or hypomethylation with age in CpG and non-CpG contexts. Differentially methylated cytosines (DMCs) and differentially methylated regions (DMRs) with age common to both sexes were evident throughout the genome. Sex-specific, age-related DMCs and DMRs were also observed. Changes in DNA methylation were dispersed across the genome with no preference for a particular genomic element. Additionally, DNA methylation variance increases with age in a sex-specific manner. Conclusions: The longstanding hypothesis of genomic hypomethylation with aging is not supported by base-specific quantitation of DNA methylation. Instead, DNA methylation at specific genomic loci is regulated according to age and sex in both CpG and non-CpG contexts. Males, but not females, show an increase in methylation variance, suggesting a loss of epigenetic regulation in the aged male methylome. The significant differences between males and females may underlie known sex differences in the aging process.
Rostock University
Constructing and analyzing disease-specific transcription factor and miRNA co-regulatory networks
Mohamed Hamed Christian Spaniol, Maryam Nazarie, Volkhard Helms
TFmiR is a freely available web server for deep and integrative analysis of combinatorial regulatory interactions between transcription factors, microRNAs and target genes that are involved in disease pathogenesis. Since the inner workings of cells rely on the correct functioning of an enormously complex system of activating and repressing interactions that can be perturbed in many ways, TFmiR helps to elucidate cellular mechanisms at the molecular level from a network perspective. The topological and functional analyses it provides make TFmiR a reliable systems biology tool for researchers across the life science communities. The TFmiR web server is accessible through the following URL:
Washington State University
United States of America
DEScan: A novel strategy for the analysis of epigenomic data with multiple biological replicates
John Koberstein John Koberstein, Shane Poplawski, Charlly Kao, Hakon Hakonarson, Robert Schultz, Nancy Zhang, Ted Abel, Lucia Peixoto
A common goal in epigenomic sequencing studies is to identify differences between conditions, i.e., differential enrichment. Strategies to do so fall into two general categories: peak-based and window-based. While window-based strategies risk testing too many regions in which there is no signal, peak-based strategies can introduce biases if peak calling is not done properly. An additional challenge for differential enrichment is encountered when the location of the epigenetic signal varies between replicates, as is the case for histone modification and chromatin accessibility data. Here we introduce DEScan, an R-based integrated peak and differential caller, specifically designed for broad epigenomic signals. DEScan first calls peaks on individual replicates using an adaptive window scan and the surrounding 10 kb as background. It then integrates peak calls among replicates by requiring a user-defined number of carriers (2 minimum) among replicates. The resulting reproducible peaks are tested for differential enrichment using RNA-seq based strategies. We use DEScan to analyze chromatin accessibility sequencing data following contextual fear conditioning (FC) in the mouse hippocampus. We show that FC increases activation of 2,101 regulatory regions. These regions are disproportionately associated with known ASD genes (p-value<0.004) and enriched in CHD8 binding sites. Using genotyping, we show that one of those regions, promoter 6 within the Shank3 gene, contains a genetic variant significantly associated with ASD (SNP rs6010065, 422 ASD cases, 182 controls, p-value=0.03). Our results suggest that DEScan can identify relevant regulatory regions for genetic association studies in clinical populations.
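The replicate-integration step, in which a region is kept only if enough replicates carry a peak there, can be illustrated with a simple sweep over interval endpoints. DEScan itself is an R package and its actual procedure may differ; the function name, the half-open (start, end) peak representation and the example coordinates below are assumptions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


def reproducible_regions(replicate_peaks: Sequence[Sequence[Interval]],
                         min_carriers: int = 2) -> List[Interval]:
    """Return regions covered by peaks from at least `min_carriers` replicates.

    Assumes peaks within a single replicate do not overlap each other.
    """
    events = []
    for peaks in replicate_peaks:
        for start, end in peaks:
            events.append((start, +1))
            events.append((end, -1))
    # Sorting puts an end (-1) before a start (+1) at the same position,
    # so peaks that merely touch are not counted as overlapping.
    events.sort()

    regions: List[Interval] = []
    depth = 0
    region_start = 0
    for position, delta in events:
        previous = depth
        depth += delta
        if previous < min_carriers <= depth:      # support reaches threshold
            region_start = position
        elif depth < min_carriers <= previous:    # support drops below it
            if position > region_start:
                regions.append((region_start, position))
    return regions


# Hypothetical peak calls from three replicates on one chromosome.
replicates = [
    [(100, 200), (500, 600)],
    [(150, 250)],
    [(180, 220), (590, 700)],
]
shared_by_two = reproducible_regions(replicates, min_carriers=2)
shared_by_all = reproducible_regions(replicates, min_carriers=3)
```

Raising `min_carriers` trades sensitivity for reproducibility: with three replicates, requiring all three leaves only the core region where every peak call agrees.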
University of Antwerp
Functional subgraph enrichments for node sets in interaction networks
Pieter Meysman Yvan Saeys, Ehsan Sabaghian, Wout Bittremieux, Yves van de Peer, Bart Goethals, Kris Laukens
Frequent subgraph mining (FSM) is a common but complex problem within the data mining field that has gained in importance as more graph data has become available. However, traditional FSM finds all frequent subgraphs within the graph dataset, while often a more interesting query is to find the subgraphs that are most associated with a specific set of nodes. Nodes of interest might be those that are associated with a specific disease, or those that are differentially expressed in an omics experiment. We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given set of nodes. The association between a subgraph pattern and a set of nodes is defined by its significant enrichment based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it to three distinct problems, namely identification of gene ontology network motifs associated with duplicated genes in yeast, network motifs enriched for PhoR transcription factor orthologs across seven transcriptional networks, and amino acid labeled subgraphs associated with manganese-binding residues in protein structure networks. These applications could all be tackled with the exact same algorithm despite their diversity, and the results show that in each case we can find relevant functional subgraphs enriched for the selected nodes.
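The enrichment test mentioned in this abstract can be sketched as follows. This is a generic hypergeometric upper-tail test with a Bonferroni correction, not the authors' implementation; the function names and the mapping of arguments (all graph nodes as the population, nodes of interest as successes, nodes covered by a subgraph pattern as draws) are assumptions for illustration.

```python
from math import comb


def hypergeom_sf(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Assumed mapping: population = all nodes in the graph, successes = nodes
    of interest, draws = nodes covered by the pattern, k = covered nodes
    that are also nodes of interest.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def bonferroni_enrichment(
    k: int, population: int, successes: int, draws: int, n_tests: int
) -> float:
    """Bonferroni-corrected enrichment p-value, capped at 1."""
    return min(1.0, hypergeom_sf(k, population, successes, draws) * n_tests)


if __name__ == "__main__":
    # 10 nodes, 3 of interest; a pattern covers 4 nodes, 2 of them of interest.
    print(hypergeom_sf(2, 10, 3, 4))
    print(bonferroni_enrichment(2, 10, 3, 4, n_tests=2))
```

Here `n_tests` stands for the number of subgraph patterns evaluated, which is what the Bonferroni correction multiplies by.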
Universidad de Buenos Aires
From Sequence to 3D-model: an efficient use of Homology Modeling, Molecular Dynamics and Ligand Docking techniques to predict Protein-Carbohydrate complexes.
Carlos Modenutti
Proteins that bind carbohydrates are responsible for numerous important biological functions, such as signal transduction and cell adhesion, among many others. Although the number of reported structures of protein-carbohydrate complexes (PCCs) is constantly increasing, achieving accurate predictions of the protein-carbohydrate interaction by means of Structural Homology Models (SHM) and ligand docking remains one of the biggest challenges in glycobiology. This is mainly because the residues that form the Carbohydrate Binding Site (CBS) can differ from their ideal binding rotamers in the SHM, which can significantly affect the performance of docking algorithms. In addition, while generally available docking programs work reasonably well for most drug-like compounds, carbohydrates and carbohydrate-like molecules are often problematic, because force fields and scoring functions are typically designed to reproduce structures of protein-drug complexes. In this work, we present an integrated approach that combines conformational-space sampling of the SHM using Molecular Dynamics (MD) simulations with docking experiments. To obtain the most plausible binding structure of receptor and ligand, clustering analysis was applied to identify different conformations. Finally, the Water-Site Bias Docking Method (WSBDM, an Autodock-based protocol) was performed to generate a diversity of structures, and energy-population parameters were used to rank each of them. Here, we used human Pulmonary Surfactant-associated protein D (hSP-D) as a case study. The results show that this approach emerges as a promising tool to build reliable 3D models, which can then be used for rational design or optimization of glycomimetic drugs.
University of Liverpool
Great Britain
Elaborating deconvolution of immune cell sub-populations from PBMC transcriptomic data of a Singaporean cohort
Gianni Monaco
The cell composition of the hematopoietic tissue is highly heterogeneous, and the characterization of the phenotype and functionality of all immune cell types has a high impact on the treatment of diseases and on increasing life expectancy. Deconvolution is a promising approach to define gene expression levels and cell proportions of specific subsets from transcriptomic data of a heterogeneous sample. Several deconvolution algorithms have already been proposed; nevertheless, there is still no consensus on the optimal methodology or on which cell types are most suitable for this approach. Here, transcriptomic and flow cytometry analyses were performed on PBMC extracted from fresh blood samples of a cohort of young and old Singaporean individuals. We used basic linear regression and support vector machines for the deconvolution of transcriptomic data, and we validated the computed percentages of cell sub-populations by flow cytometry. The prediction of the proportions of NK cells and monocytes was satisfactory for both methods, while the results for B cells and T cells were not in good agreement. In conclusion, we believe that deconvolving blood expression data is a promising approach even though it is still in its early stages. Further investigation is necessary in order for immunological research to greatly profit from this methodology.
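The linear-regression flavour of deconvolution mentioned in this abstract can be sketched in a few lines. The abstract does not describe the signature genes or the exact regression setup, so everything here is an assumption: a bulk expression vector is modelled as a signature matrix times a vector of cell-type proportions, the proportions are estimated by ordinary least squares through the normal equations, and negative estimates are clipped and renormalized. The function names are hypothetical.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("signature matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / aug[i][i]
    return x


def deconvolve(signatures: List[List[float]], mixture: List[float]) -> List[float]:
    """Estimate cell-type proportions from a bulk expression profile.

    signatures[g][c] is the expression of gene g in pure cell type c and
    mixture[g] is the bulk expression of gene g (both assumed inputs).
    """
    n_types = len(signatures[0])
    # Normal equations of least squares: (S^T S) p = S^T m
    sts = [
        [sum(row[i] * row[j] for row in signatures) for j in range(n_types)]
        for i in range(n_types)
    ]
    stm = [
        sum(row[i] * m for row, m in zip(signatures, mixture))
        for i in range(n_types)
    ]
    raw = solve_linear_system(sts, stm)
    clipped = [max(0.0, v) for v in raw]   # proportions cannot be negative
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped


if __name__ == "__main__":
    signatures = [[10.0, 0.0, 1.0], [0.0, 8.0, 1.0], [2.0, 2.0, 9.0], [5.0, 1.0, 0.0]]
    mixture = [5.2, 2.6, 3.4, 2.8]
    print(deconvolve(signatures, mixture))
```

Estimated proportions from such a model are what would be compared against flow cytometry percentages in a validation like the one the abstract describes.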
Novartis Institutes for BioMedical Research
United States of America
Proteochemometric Machine Learning Models for Predictive Drug Discovery, Target Identification, and Polypharmacology Deconvolution
Libere Ndacayisaba Yuan Wang, Jeremy Jenkins
Systematic investigation of compound-protein interactions is crucial to the rational design of therapeutic drugs and understanding of compound selectivity and polypharmacology. Quantitative structure-activity relationship (QSAR) methods have been developed to model and predict drug targets based on ligand similarity. However, conventional QSAR methods may be unaware of lurking compound promiscuity that results in poor drug selectivity. In addition, machine learning models based on chemical descriptors alone are widely applied for virtual screening and target prediction, but they are limited to liganded target space, and thus cannot extrapolate to unknown targets. The proliferation of bioactivity databases and advanced statistical methods and the need for drug target deconvolution and reduction of promiscuity has led to the development of proteochemometrics (PCM), a computational technique that models bioactivity by correlating descriptors of both protein and chemical spaces, concurrently. In the present study, we use PCM to develop, optimize, and validate 3 binary classifiers: Naïve Bayes, Random Forest, and a fusion model thereof. Modeling was based on a dataset of 5 million compound-target pairs comprised of 2.5 million low molecular weight compounds and 3.5 thousand proteins across 13 different target classes. To our knowledge, this is the largest data set in PCM modeling, thus providing optimal coverage of compound and target space. Preliminary findings indicate that Random Forests outperform Naïve Bayes models and suggest that model performance increases with training and test set size. We further investigate target class-specific modeling to illustrate the models’ domain of applicability and capacity for extrapolation to other ligand-protein interactions. 
PCM machine learning models with strong predictive power enhance understanding of ligand-protein interactions and thus enable hit finding for novel proteins, target identification for novel compounds, and deconvolution of targets and pathways in phenotypic drug discovery.
S Mohammad H
The University of Melbourne
The Impact of Sequence Ambiguity on Read Mapping Accuracy
S Mohammad H Oloomi Thomas C Conway, Justin Zobel
Background: Resolution of ambiguity when a read can be aligned to more than one location in a reference sequence is a key problem in the processing of short read data. Despite the importance of ambiguity resolution in read mapping, its impact has not been extensively investigated. In this research, we examine sources of ambiguity together with the shortcomings of existing read mapping methods with regard to ambiguous reads. Description: The factors that cause ambiguity in read mapping can be categorized into five major groups: repetitive regions in the genome, genetic variability, sequencing errors, algorithmic limitations, and multiple-organism samples. Existing read mappers either ignore ambiguous reads, report the best match, or report all mapping locations. We investigate the impact that ambiguous mappings have on the interpretation of short read data, using several k-mer analyses on two different bacterial genomes: Mycobacterium tuberculosis (MTB) and Orientia tsutsugamushi (OT). The difficulties posed by ambiguous mappings are illustrated by limitations of a read mapper (Bowtie2) when applied to a set of synthetic reads. While less than 3% of reads map to multiple locations for the MTB experiment, in the case of OT, about 45% of reads map to more than one location using single-end mapping, and this only drops to 35% when the paired-end information is exploited. Conclusions: The accuracy of read mapping can be considerably affected by multi-reads even when using the best existing mapping tools. While some genomes are reasonably well behaved for read mapping, others, such as OT, are pathological. In the case of OT, paired-end mapping can help to resolve only about a quarter of ambiguous mappings. The results show the importance of ambiguity resolution in read mapping, and our approach to analysis can be used to discover the extent to which ambiguity affects mapping accuracy for a given genome and pave the way for more sophisticated mapping techniques.
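The abstract does not specify which k-mer analyses were run, so the following is only one plausible illustration of the idea: for a given read length k, measure the fraction of k-mer start positions in a genome whose k-mer also occurs elsewhere. An error-free read drawn from such a position would map to more than one location. The function name is hypothetical, and the sketch ignores reverse complements, mismatches, and paired-end information.

```python
from collections import Counter


def repeated_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    n_positions = len(genome) - k + 1
    if n_positions <= 0:
        raise ValueError("k is longer than the sequence")
    counts = Counter(genome[i:i + k] for i in range(n_positions))
    # Every occurrence of a repeated k-mer is an ambiguous start position.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n_positions


if __name__ == "__main__":
    toy_genome = "ACGTACGTTT"
    for k in (2, 4, 6):
        print(k, repeated_kmer_fraction(toy_genome, k))
```

Plotting this fraction against k for a real genome would give a rough picture of how much inherent ambiguity a given read length faces, before any mapper is involved.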
R. Gonzalo
Protein Physiology Lab, Dep. de Química Biológica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina.
Protein Frustratometer 2: a tool to localize energetic frustration in protein molecules, now with electrostatics
R. Gonzalo Parra Nicholas P. Schafer, Leandro Radusky, Min-Yeh Tsai, A. Brenda Guzovsky, Peter G. Wolynes, and Diego U. Ferreiro
Natural protein molecules are highly evolved complex systems. Spontaneous folding of individual proteins and recognition between polypeptides leading to well-defined structural ensembles are fundamental concepts in the biology of macromolecules, the specificity of which is explained by the "principle of minimal frustration". This insight has led to multiple developments in the understanding of protein folding and function. The minimal frustration principle does not rule out that some energetic frustration may be present in a folded protein. Moreover, it may not be a random occurrence but an evolved characteristic, facilitating motion of the protein around its native basin and binding to appropriate partners, and it is thought to be fundamental to protein function. We have developed theoretical methods for spatially localizing and quantifying the energetic frustration present in native proteins. These have proven useful in the study of binding interfaces, allosteric transitions, aggregation and ligand binding, and conformational dynamics, and have been related to evolutionary patterns and disease-related polymorphisms. The new Protein Frustratometer server is based on the associative memory, water mediated, structure and energy model (AWSEM). AWSEM provides a transferable, coarse-grained, non-additive force field that is able to predict the native structures of many proteins and protein complexes from sequence information. Recently, electrostatic forces have been included in the AWSEM suite and have been shown to play a role in modulating the asperities of the folding and binding landscapes. Along with a significant speed-up of the calculations, this new server allows for the possibility of analyzing the local frustration that arises from electrostatic interactions.
Saberi Ansari
Institute for Research in Fundamental Science (IPM)
Islamic Republic of Iran
Significant Random Signatures Have Information
Elnaz Saberi Ansari
Determining cancer signatures is a challenging task in personalized patient care. In 2011, Venet proposed that random signatures unrelated to cancer have a high probability of being associated with breast cancer outcome. The aim of this research is to show that significant random signatures have information. Based on these informative random signatures, a score is assigned to each gene representing its significance. To identify important genes for which either the gene itself or its interaction network neighborhood has a high score, meta-analysis is applied and functional relationships between genes are considered by exploiting the PPI network from the STRING database and applying the diffusion kernel approach proposed by Kondor and Lafferty. Gene scores are assigned to related protein nodes and diffused through the network depending on the network topology. To determine the significance of diffused scores, a random permutation procedure is used. Genes with empirical p-value<0.05 are considered significant. It is shown that these genes are enriched for cancer-related genes. Based on the assumption that genes do not act in isolation and cancers are caused by perturbation of various pathways, the significant genes are tested for pathway enrichment to identify signatures. Through an extensive literature search, it is shown that the enriched pathways are putatively involved in cancer. The enriched genes within these pathways that are associated with breast cancer outcome are considered as potential signatures. As in Venet et al., the association with outcome is computed using the 295 patients of the NKI cohort and their overall survival endpoints. It is shown that the suggested signatures are not random. To evaluate the performance and consistency of the suggested signatures, the ACES dataset, a cohort of 1606 breast cancer samples collected from 12 studies in NCBI's Gene Expression Omnibus, is considered.
A signature score is defined and assigned to each patient based on the signature in such a way that the populations of scores for patients in phenotypes Normal and Cancer can be differentiated using a suitable statistical test, e.g. a t-test. Results show that the predicted signatures can significantly separate poor-outcome patients from good-outcome patients.
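The permutation step in this abstract assigns each gene an empirical p-value by comparing its observed diffused score with scores obtained under random permutation. A minimal sketch of that comparison follows. The function name is hypothetical, the null scores are assumed to have been computed already by re-running the diffusion on permuted inputs, and the add-one correction is a common convention that the abstract does not confirm.

```python
from typing import Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """One-sided empirical p-value of an observed score against a null sample.

    Counts how many permutation scores are at least as large as the observed
    score. The +1 in numerator and denominator (an assumed convention) keeps
    the estimate strictly positive.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))


if __name__ == "__main__":
    # Toy example: one gene's diffused score against four permuted scores.
    print(empirical_p_value(0.9, [0.1, 0.95, 0.3, 0.2]))
```

With this convention, reaching the 0.05 threshold mentioned in the abstract requires at least 19 permutations, and the smallest attainable p-value shrinks as more permutations are run.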
Iowa State University
United States of America
Studying Phenotypic impact of non-synonymous single nucleotide variants in LOC_Os05g26040 and LOC_Os05g27960 in Oryza sativa.
Sayane Shome Rakesh Kumar Meena
The prime objective of the study is to find the phenotypic consequences of single nucleotide variants in the coding regions of LOC_Os05g26040 and LOC_Os05g27960, with Oryza sativa japonica as the species of interest. LOC_Os05g27960 encodes Endo Beta N-Acetylglucosaminidase, an enzyme exhibiting hydrolase activity, whereas a Pumilio-family RNA-binding protein is predicted as a conserved domain in the coding regions of LOC_Os05g26040. Pumilio-family RNA-binding proteins play a role in controlling gene expression at the post-transcriptional level by promoting RNA decay and repressing translation in other organisms, including humans, yeasts and plants like Arabidopsis. We implemented the SIFT sequence, PANTHER and I-mutant tools to classify the single nucleotide variants lying in the coding regions of both loci. Further, the impact of detrimental non-synonymous variants was studied on the translated protein product of LOC_Os05g26040. The theoretical model of the Pumilio-family RNA-binding protein in LOC_Os05g26040 was built via homology modelling. Further, single point mutations were induced in the modelled protein product, and comparative docking studies were carried out on native as well as mutation-induced protein structures. In LOC_Os05g27960, we determined two nonsense variants in the nucleotide regions coding for the hydrolase domain, which suggests that their presence leads to a truncated protein product, finally leading to a dysfunctional hydrolase. The detrimental non-synonymous SNPs (single nucleotide polymorphisms) located in the Pumilio protein encoded by LOC_Os05g26040 lead to changes in binding energies with RNA at the binding site. This can be considered one of the key consequences of the presence of the variants in the RNA-binding domain. Further insights into the effects of the mutations were determined based on structural features. Previous studies suggest that Pumilio proteins participate in cell development and differentiation in other organisms, including lower vertebrates and invertebrates.
Hence, the variants affecting its RNA-binding activity have the potential to hinder cell development. Experimental studies of the impact of these variants will be useful to find out whether their presence is harmful to the crops. In addition, the computational pipeline devised during the study can be utilised to determine phenotypic consequences of SNPs in organisms where tools have not been developed for phenotypic characterization of single nucleotide variants.
The Broad Institute
United States of America
A comprehensive comparison of the Connectivity Map and The Cancer Genome Atlas: predicting patient survival and applications for drug discovery
Benjamin Siranosian Joseph Nasser, Rajiv Narayan, Aravind Subramanian, Todd Golub
The Connectivity Map (CMap) is a comprehensive catalog of transcriptional profiles representing systematic perturbation of cancer cell lines with genetic and pharmacologic reagents. We correlated differential gene expression signatures derived from patient tumors in The Cancer Genome Atlas (TCGA) with CMap profiles to uncover transcriptional programs in each tumor sample. We first confirmed that CMap in vitro perturbations can accurately mirror tumor biology. As expected, tumor samples with loss of function mutations, copy number loss or aberrantly low transcript levels correlated with CMap signatures of shRNA knock-down of the same gene. Samples with gain of function mutations, copy number amplification or aberrantly high transcript levels are correlated with CMap signatures of genetic overexpression. Next, we mined the dataset for interesting connections between tumor samples and CMap perturbagens. For example, mRNAseq data from lung adenocarcinoma tumors with mutations in KEAP1 are correlated with the CMap signature of KEAP1 knock-down and anti-correlated with the signature of knock-down of NRF2 (which is negatively regulated by KEAP1). KEAP1 mutant samples are also strongly anti-correlated with the CMap signature of STK11 knock-down, potentially indicating a link between KEAP1 mutations and STK11/AMPK signalling, which we are testing experimentally. Finally, we examined whether correlations to CMap profiles could predict which pathways had been activated in tumor samples by looking for correlations with patient survival. In lung adenocarcinoma, patients with signatures of HDAC inhibition (correlated to HDAC inhibitor and HDAC gene knock-down signatures) survived significantly longer than the rest of the cohort, while patients with signatures of HDAC activation had significantly shorter survival (p < 10^-5). Several HDAC inhibitors are currently in clinical trials for cancer treatment; this result indicates they could be applied to a subset of lung adenocarcinoma patients.
We created a series of public web-based applications to allow computational and bench scientists to explore the results of the CMap-TCGA comparison, as well as apply our methods to any transcriptional dataset. These methods and tools can also be used as a first-pass method for drug discovery - we have identified a potential ADAM-protease inhibitor using these tools and are following the hypothesis up experimentally.
Comprehensive analysis of chromatin landscape in filamentous fungus Aspergillus nidulans
Xin Wang Djordje Djordjevic, Zhengqiang Miao, Chirag Parsania, Kaeling Tan, Joshua W. K. Ho, Koon Ho Wong
Chromatin organization, such as the deposition of active or repressive histone modifications, plays an important role in regulating gene expression. Advances in ChIP-seq and associated bioinformatics analytical techniques have enabled genome-wide analysis of histone modifications and transcription regulation dynamics. The chromatin landscapes of several model organisms have been widely studied, but other medically and biotechnologically important species are not yet fully studied. Here we present a genome-wide chromatin landscape of an important filamentous fungal model organism, Aspergillus nidulans. Using ChIP-seq, we generated genome-wide profiles of H3K4me3, H3K4me2, H3K4me1, H3Ac, H3K9ac, H3K27ac, H3K36me3, H3K79me1, H3K79me2, H3K79me3, H3K9me3, H4K20me1, H2A.Z, Pol II and transcription factors in A. nidulans in the presence or absence of the nitrogen source NH4. All sequences were mapped to the reference genome and processed using the standard ChIP-seq analysis pipeline SPP. We used a recently developed hierarchically linked infinite hidden Markov model (hiHMM) to systematically discover and characterize the most prevalent combinations of histone modifications, i.e., chromatin states. Our analysis revealed interesting chromatin states that are associated with gene expression or repression. We investigated the association of these chromatin states with chromosomal location, gene ontology, and other genomic features that are important in fungal biology. Furthermore, we found that around 15% of the genome is marked by classical enhancer chromatin marks, suggesting that these previously uncharacterized non-coding regions may have potential regulatory functions. Our work represents a valuable resource in A. nidulans that opens new avenues for investigation of the dynamic chromatin organization and gene regulation in fungi.
KIXBASE: A KIX domain database and web server for prediction and analysis.
Archana Yadav Jitendra K. Thakur, Gitanjali Yadav
In today’s era of ever-increasing biological data, it is essential to maintain databases that not only integrate and cluster the data as a warehouse but also help analyse it. Such resources are invaluable tools for researchers as they provide a useful platform to retrieve biological information like sequences, structures, structure classes, pathways etc., thus aiding research to a great extent. KIXBASE is a global repository as well as a prediction tool for KIX domains. These domains are present in coactivators and play a central role in the transcription process. KIXBASE has two main parts, a web server and a database. To date, there is no other web resource that provides information on KIX domains. The KIX prediction program detects KIX domains in any organism on the basis of profile hidden Markov models developed through alignment of known KIX sequences, along with additional filtering criteria for improving accuracy. KIXBASE incorporates widely used programs like PSIPRED-3.5 and CLUSTALW2 in the web server for further annotation and quality assessment of the predictions. Users can upload batch entries for protein, genomic or EST sequences in FASTA format and carry out the detection of potential KIX domains, generate the secondary structure for the domain of interest and examine conservation with other KIX domains. The backend prediction algorithm is also used for development of the KIX database, which contains 1891 KIX proteins representing 427 organisms spanning metazoans, fungi and plants, comprising the largest online collection of KIX sequences. The prediction algorithm is highly accurate, robust and very efficient in handling multiple input types for KIX prediction.
Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL.
United States of America
Predicting Symptom Severity and Contagiousness of Respiratory Viral Infections
Medhini Narasimhan, Giuseppe Vietri, Arpit Mehta, Farid Rajabli, Vanessa Aguiar-Pulido, Kalai Mathee, Giri Narasimhan
Infections due to respiratory viruses affect millions of people across the world and have a huge economic impact. While the process of immune clearance allows most people to combat these infections, for many others viral exposure causes a variety of symptoms including runny nose, cough, sore throat, nasal congestion, headache, fever, myalgia and general malaise. These symptoms can vary in severity and have different onset and recovery times. To make matters worse, the viruses reproduce and "shedding" ensues, whereby the viral progeny are expelled, making the host contagious. The goal of this work is to build predictive models for both severity of symptoms and contagiousness, given gene expression time series data recorded over a multi-day period starting prior to exposure and measured at different intervals following exposure. Previous studies have shown that gene expression profiles from blood samples can aid in detecting pre-symptomatic viral infections. Here we build predictive models using machine learning techniques to compute the symptom score for each subject given time series gene expression data. Our data was obtained from multiple NCBI GEO Datasets that consist of expression data from pre- and post-infection by 4 viruses: RSV, H3N2, H1N1 and Rhinovirus. The dataset also contained clinical information such as symptom scores (using the Jackson Scoring scheme) and information on viral shedding. Feature selection was performed for each time point and for each virus using Partial Least Squares Discriminant Analysis (PLS-DA), which ranked the genes based on their importance in the model. In each case, the most significant genes were filtered by thresholding and were then used to build a Random Forest classifier. Surprisingly, our models built on data from prior to exposure performed nearly as well as reported models with data from 29 hours post-infection. Performance rose to 100% for H1N1 and H3N2 using data from later time points.
We have identified several virus-specific biomarkers, which appear to play a role in an early host-genomic response.