HNSE-P4-8. Predicting Variant Pathogenicity with Machine Learning

Zachary FitzHugh1, 2
Fatma Nasoz, Ph.D.3
Faculty Mentor: Martin Schiller, Ph.D.4
1Howard R. Hughes College of Engineering, Department of Computer Science
2Lee Business School, Department of Economics
3The Lincy Institute
4College of Sciences, Nevada Institute of Personalized Medicine

The human genome contains roughly 22,000 protein-coding genes, many of which play important roles in biological function. The encoded proteins fold into three-dimensional structures, and this folding is usually necessary for function. A genetic variant can disrupt a protein's secondary structure (one aspect of its overall structure) or eliminate a site important for protein-protein interaction or post-translational modification. The resulting loss of function or deregulation can cause disease. Thus, there is great biomedical interest in identifying disease-causing single-nucleotide variants.
We hypothesize that variant pathogenicity can be predicted accurately with machine learning. To test this, we used machine learning to predict the pathogenicity of a set of 28,369 single-nucleotide variants across 10 genes. The data were acquired from publicly available saturation mutagenesis data sets, which cover every possible amino acid substitution at every position in a protein. Our approach employs a support vector machine with linear, polynomial, and radial basis function (RBF) kernels. We frame the task as binary classification, where a label of 1 indicates a disease-causing variant and a label of 0 indicates a benign variant. The model predicts pathogenicity from amino acid, post-translational modification, and secondary structure information. We cleaned and analyzed the data with custom Python scripts. Average balanced accuracy for classifying pathogenicity was approximately 57.9%, 60.3%, and 60.3% for the linear, polynomial, and RBF kernels, respectively. The model therefore outperforms random guessing, which yields a balanced accuracy of 50% in binary classification, but leaves considerable room for improvement.
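For readers unfamiliar with the metric, the sketch below shows how balanced accuracy can be computed for 0/1 labels and why 50% is the chance-level baseline. It is a minimal illustration in plain Python, not the scripts used in this project; the function name `balanced_accuracy` and the toy label lists are hypothetical and contain no project data.

```python
def balanced_accuracy(y_true, y_pred):
    """Return the mean of sensitivity and specificity for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    # True positives: pathogenic variants (1) predicted as pathogenic.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    # True negatives: benign variants (0) predicted as benign.
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / n_pos  # recall on the pathogenic class
    specificity = tn / n_neg  # recall on the benign class
    return 0.5 * (sensitivity + specificity)


# Hypothetical toy labels (not project data): 2 pathogenic, 8 benign.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A trivial classifier that calls every variant benign.
always_benign = [0] * len(y_true)
plain_accuracy = sum(t == p for t, p in zip(y_true, always_benign)) / len(y_true)

print(plain_accuracy)                            # 0.8
print(balanced_accuracy(y_true, always_benign))  # 0.5
```

On this imbalanced toy set, the trivial classifier reaches 80% plain accuracy but only 50% balanced accuracy, which is why balanced accuracy is a reasonable yardstick when benign and pathogenic variants are not equally represented.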


Nov 15 - 19, 2021


All Day


HNSE: Poster Session 4
The Office of Undergraduate Research


One Reply to “HNSE-P4-8. Predicting Variant Pathogenicity with Machine Learning”

  1. Please feel free to ask questions about what we have done here and provide feedback. I would love to hear people’s thoughts!
