Prediction of Novel Autism Risk Genes by Genomic Data Mining
Abbegail King*, Anvita Pudipeddi*, Landon Ethredge*, Anqi Wei, Snehal Shah, and Liangjiang Wang
Department of Genetics and Biochemistry, Clemson University
Introduction (refs. 1-6)
Methods
Results
Autism Spectrum Disorder (ASD)
LncRNAs
Machine Learning
Acknowledgements
We would like to thank Clemson University and Clemson University Honors College for their support of the EUREKA! program.
References
1. Gudenas et al. J Zhejiang Univ-Sci B 2019.
2. Lord et al. Primer 2020.
3. Cogill et al. Bioinformatics 2016.
4. Wang et al. Bioinformatics 2020.
5. Wainburg et al. Nature Biotechnology 2018.
6. Yang et al. Molecular Psychiatry 2020.
*Co-equal Contributors
Receiver Operating Characteristic (ROC) Curve
The Venn diagram on the left shows the number of candidate lncRNAs that each of the three models predicted to be ASD-associated. All three models (ANN, SVM, and RF) predicted 624 lncRNAs in common; these are designated high-confidence candidate lncRNAs.
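The overlap counted in the Venn diagram can be sketched as a set intersection. This is a minimal illustration, not the authors' pipeline: the function name `consensus_candidates` and the toy identifiers are hypothetical placeholders, and the only assumption carried over from the poster is that each model outputs a list of lncRNAs it predicts to be ASD-associated.

```python
def consensus_candidates(predictions):
    """Return (union, consensus) of the gene IDs predicted by each model.

    predictions maps a model name to an iterable of predicted gene IDs.
    The consensus is the set of IDs predicted by every model.
    """
    sets = [set(ids) for ids in predictions.values()]
    union = set().union(*sets)
    consensus = set.intersection(*sets)
    return union, consensus


# Hypothetical toy predictions; these identifiers are placeholders,
# not results from the study.
toy = {
    "RF": ["lnc1", "lnc2", "lnc3"],
    "SVM": ["lnc2", "lnc3", "lnc4"],
    "ANN": ["lnc3", "lnc2", "lnc5"],
}
union, consensus = consensus_candidates(toy)
print(len(union), sorted(consensus))
```

In this toy case the union plays the role of the full candidate list, and the consensus plays the role of the high-confidence subset.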
Overlap of Candidate lncRNAs and Known ASD Risk Genes
Abstract
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that affects social development and general behavior. Its symptoms and severity vary, but it is often characterized by rigid, repetitive behavior patterns, sensory issues, intense interests, and communication delays. While the majority of ASD risk genes discovered so far are protein-coding, non-coding RNA genes are a promising source of novel autism risk genes because non-coding regions make up the bulk of the human genome. This study focused on long non-coding RNAs (lncRNAs), defined as non-coding transcripts longer than 200 nucleotides, and aimed to identify potential ASD risk genes by means of machine learning, a subset of artificial intelligence. We trained three machine learning models, Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN), to learn relevant features from genomic data on previously discovered ASD risk genes and to predict novel risk genes from those features. The ANN model showed the highest performance of the three. In total, 1,174 candidate lncRNAs were identified, of which 624 were predicted by all three models and are considered high-confidence candidates. Further analysis showed that many candidate lncRNAs are located in close proximity to known ASD risk genes on their respective chromosomes.
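Because the poster compares model performance and includes an ROC curve panel, the sketch below shows one common way such a comparison is summarized: the area under the ROC curve (AUC). The poster does not state which metric ranked the models, so AUC is an assumption here. The function names, labels, and scores are hypothetical toy values, not data from the study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold over the scores.

    labels are 1 (known risk gene) or 0 (non-risk gene); a higher score
    means the model is more confident that the gene is a risk gene.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Handle tied scores together so they yield a single ROC point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and scores for two models on the same six genes.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_b = [0.9, 0.3, 0.7, 0.6, 0.4, 0.2]

auc_a = auc(roc_points(labels, scores_a))
auc_b = auc(roc_points(labels, scores_b))
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

Under this metric, the model with the larger AUC ranks known risk genes above non-risk genes more consistently. In practice a library routine such as scikit-learn's `roc_auc_score` would normally replace the hand-written functions.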