Prediction of Novel Autism Risk Genes by Genomic Data Mining

Abbegail King*, Anvita Pudipeddi*, Landon Ethredge*, Anqi Wei, Snehal Shah, and Liangjiang Wang

Department of Genetics and Biochemistry, Clemson University

Introduction¹⁻⁶

Methods

Results

Autism Spectrum Disorder (ASD)

Autism Spectrum Disorder (ASD) is a highly heritable neurodevelopmental condition that affects social development and general behavior.
ASD is often associated with rigid, repetitive behavior patterns, intense interests, sensory issues, and communication delays.
Almost all known ASD risk genes are protein-coding genes.

LncRNAs

Long non-coding RNAs (lncRNAs) are transcripts over 200 nucleotides in length that do not code for proteins.
Evidence suggests that lncRNAs are involved in gene regulation.

Machine Learning

Machine learning is the study of computer algorithms that automatically improve their performance through experience.
Supervised machine learning is a form of machine learning that learns from a set of input-output training examples.
Widely used supervised learning algorithms include Artificial Neural Network (ANN), Support Vector Machine (SVM), and Random Forest (RF).
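
To make "learning from input-output training examples" concrete, the sketch below trains a single artificial neuron (a perceptron, the simplest building block of an ANN) on a made-up toy dataset. It is only an illustration under stated assumptions: the two features, the labels, and the function names are hypothetical, and this is not the ANN, SVM, or RF used in this project.

```python
import random


def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features exceeds zero, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(examples, epochs=100, learning_rate=0.1, seed=0):
    """Fit one artificial neuron to (features, label) pairs with labels in {0, 1}."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            # error is 0 when the prediction is right, so nothing changes;
            # otherwise the weights move toward the correct answer.
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: two made-up numeric features per gene,
# label 1 = known risk gene, label 0 = non-risk gene.
training_examples = [
    ((0.9, 0.8), 1), ((0.8, 0.7), 1), ((0.7, 0.9), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
]

weights, bias = train_perceptron(training_examples)
predictions = [predict(weights, bias, x) for x, _ in training_examples]
print(predictions)
```

Each pass over the examples adjusts the weights only when a prediction is wrong, which is the sense in which performance improves with experience.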

Acknowledgements

We would like to thank Clemson University and the Clemson University Honors College for their support of the EUREKA! program.

References

¹Gudenas et al. J Zhejiang Univ-Sci B 2019.
²Lord et al. Nature Reviews Disease Primers 2020.
³Cogill et al. Bioinformatics 2016.
⁴Wang et al. Bioinformatics 2020.
⁵Wainberg et al. Nature Biotechnology 2018.
⁶Yang et al. Molecular Psychiatry 2020.

*Co-equal Contributors

Receiver Operating Characteristic (ROC) Curve

The closer the curve is to the upper-left corner of the graph, the better the model performs.
The closer the area under the curve (AUC) is to 1.0, the better the classifier. Based on our results, the Artificial Neural Network (ANN) showed the highest performance of the three models.
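
As a rough illustration of how AUC compares classifiers, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The labels and scores are made up for the example and are not results from this project.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive score with every negative score (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = risk gene) and scores from two imaginary models.
labels = [1, 1, 1, 0, 0, 0]
scores_model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
scores_model_b = [0.6, 0.3, 0.4, 0.5, 0.7, 0.1]

auc_a = roc_auc(labels, scores_model_a)
auc_b = roc_auc(labels, scores_model_b)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

A model that ranks every positive above every negative scores 1.0, and a model that ranks at random scores about 0.5, so in this toy comparison model A is the better classifier.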

Predicted novel ASD risk lncRNA from this project.

Known ASD risk gene (HIVEP2 mutations are associated with intellectual disability and behavioral abnormalities).

The Venn diagram shows the number of candidate lncRNAs that each of the three models predicted to be ASD-associated. ANN, SVM, and RF all predicted the same 624 lncRNAs, which we designate high-confidence candidate lncRNAs.
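
The consensus step behind such a Venn diagram can be sketched with set operations. This is a minimal illustration, assuming each model's predictions are available as a collection of lncRNA identifiers; the identifiers and the function name below are placeholders, not data or code from this project.

```python
def consensus_candidates(predictions_by_model):
    """Return (predicted by every model, predicted by at least one model)."""
    prediction_sets = [set(ids) for ids in predictions_by_model.values()]
    predicted_by_all = set.intersection(*prediction_sets)
    predicted_by_any = set.union(*prediction_sets)
    return predicted_by_all, predicted_by_any


# Placeholder identifiers standing in for real lncRNA IDs.
predictions_by_model = {
    "ANN": {"lnc_A", "lnc_B", "lnc_C", "lnc_D"},
    "SVM": {"lnc_A", "lnc_B", "lnc_E"},
    "RF": {"lnc_A", "lnc_B", "lnc_C", "lnc_F"},
}

high_confidence, any_model = consensus_candidates(predictions_by_model)
print(sorted(high_confidence))
print(len(any_model))
```

The intersection corresponds to the central region of the Venn diagram, and the union counts every candidate that appears anywhere in it.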

Overlap of Candidate lncRNA and Known ASD Risk Gene
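
Whether a candidate lncRNA overlaps or lies near a known risk gene can be checked by comparing genomic intervals. The sketch below is an assumption-laden illustration: the coordinates, gene names, and the 100 kb distance cutoff are all hypothetical, and the poster does not state how proximity was actually measured.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def gap_between(a, b):
    """Gap in base pairs between two intervals: 0 if they overlap or
    are directly adjacent, None if they are on different chromosomes."""
    if a.chrom != b.chrom:
        return None
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def nearby_genes(lncrna, genes, max_gap=100_000):
    """List (gene name, gap) for genes within max_gap bp, nearest first.
    max_gap is an arbitrary cutoff chosen for this example."""
    hits = []
    for name, gene in genes.items():
        gap = gap_between(lncrna, gene)
        if gap is not None and gap <= max_gap:
            hits.append((name, gap))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up coordinates and placeholder gene names.
candidate_lncrna = Interval("chr1", 1_000_000, 1_005_000)
known_risk_genes = {
    "GENE_X": Interval("chr1", 1_004_000, 1_200_000),  # overlaps the lncRNA
    "GENE_Y": Interval("chr1", 1_050_000, 1_060_000),  # 45 kb downstream
    "GENE_Z": Interval("chr1", 5_000_000, 5_100_000),  # far away
    "GENE_W": Interval("chr2", 1_000_000, 1_005_000),  # other chromosome
}

hits = nearby_genes(candidate_lncrna, known_risk_genes)
print(hits)
```

A gap of 0 marks an overlap like the one named in this panel title, while larger gaps within the cutoff mark neighbors on the same chromosome.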

Abstract

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that affects social development and general behavior. ASD varies in its symptoms and severity but is often characterized by rigid, repetitive behavior patterns, sensory issues, intense interests, and communication delays. While the majority of currently known ASD genes are protein-coding, non-coding RNA genes are a promising source of novel autism risk genes because they make up the bulk of the human genome. Specifically, this study focused on long non-coding RNAs (lncRNAs), which are defined as transcripts longer than 200 nucleotides. This study aimed to identify potential ASD risk genes by means of machine learning, a subset of artificial intelligence. We trained three machine learning models, Random Forest, Support Vector Machine, and Artificial Neural Network, to mine genomic data on previously discovered ASD risk genes for relevant features and to use those features to predict novel risk genes. The Artificial Neural Network model showed the highest performance of the three. All three models identified the same 624 lncRNAs, which we designate high-confidence candidates, and 1,174 candidate lncRNAs were identified in total. Upon further analysis of the results, many candidate lncRNAs were found to lie close to known ASD risk genes on their respective chromosomes.