Enhancing genomic annotation through convolutional neural networks: development and evaluation of splice site and exon prediction models
The accurate prediction of splice sites and exon boundaries in genomic sequences is critical for understanding gene structure and function, with implications for biological research and clinical applications. This project presents the development and validation of computational models, specifically convolutional neural networks (CNNs), aimed at improving the prediction accuracy of splice sites and, consequently, of exon sequences. The primary focus was on two models: SpliceSiteCNNs and ExonCNN.
The SpliceSiteCNNs were developed to predict splice donor and acceptor sites. The training set was built from the human genome, with the intention that the trained models could later be applied to other species. Despite achieving high true positive rates, the models returned a considerable number of false positives, highlighting the need for further refinement. To address this, ExonCNN was introduced specifically to reduce the false positive exon predictions generated by SpliceSiteCNNs. The training data for ExonCNN comprised true exon sequences and their 'DiShuffled' counterparts, which preserve the dinucleotide composition of each exon without retaining its function.
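As an illustration of how a dinucleotide-preserving negative example might be produced, the following is a minimal sketch. It assumes that 'DiShuffled' refers to a shuffle that keeps every dinucleotide count intact; the abstract does not state which tool or algorithm was used, so the Euler-path approach below (in the style of Altschul and Erickson) and the function name `dinucleotide_shuffle` are assumptions, not the thesis's implementation.

```python
import random
from collections import Counter


def _reaches(v, target, last_edge):
    """Follow the chosen 'last edges' from v and report whether target is reached."""
    seen = set()
    while v != target:
        if v in seen:  # cycle that never reaches the target
            return False
        seen.add(v)
        v = last_edge[v]
    return True


def dinucleotide_shuffle(seq, rng=None):
    """Return a random permutation of seq with identical dinucleotide counts.

    Hypothetical helper: an Euler-path shuffle, assumed here as one way to
    build 'DiShuffled' sequences. First and last bases are preserved.
    """
    rng = rng or random.Random()
    if len(seq) < 3:
        return seq

    # successors[b] lists every base that follows b somewhere in seq.
    successors = {}
    for a, b in zip(seq, seq[1:]):
        successors.setdefault(a, []).append(b)

    last = seq[-1]
    vertices = [v for v in successors if v != last]

    # Pick one final outgoing edge per base until all of them lead to `last`.
    while True:
        last_edge = {v: rng.choice(successors[v]) for v in vertices}
        if all(_reaches(v, last, last_edge) for v in vertices):
            break

    # Shuffle the remaining edges of each base and put the final edge at the end.
    order = {}
    for v, succ in successors.items():
        succ = list(succ)
        if v in last_edge:
            succ.remove(last_edge[v])
            rng.shuffle(succ)
            succ.append(last_edge[v])
        else:
            rng.shuffle(succ)
        order[v] = succ

    # Walk the edges in the chosen order, starting from the original first base.
    out = [seq[0]]
    used = {v: 0 for v in order}
    cur = seq[0]
    for _ in range(len(seq) - 1):
        nxt = order[cur][used[cur]]
        used[cur] += 1
        out.append(nxt)
        cur = nxt
    return "".join(out)


if __name__ == "__main__":
    exon = "ATGGCGTACGTTAGCCGATAGGCTAACG"  # made-up sequence for illustration
    decoy = dinucleotide_shuffle(exon, random.Random(0))
    print(decoy)
    print(Counter(zip(exon, exon[1:])) == Counter(zip(decoy, decoy[1:])))
```

Because each decoy keeps the same dinucleotide counts as its source exon, a classifier trained on such pairs cannot separate the two classes on dinucleotide composition alone and has to rely on higher-order sequence features.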
The effectiveness of ExonCNN was tested in two phases. Initially, the model showed a modest reduction in false positives, indicating room for improvement. Subsequent retraining with an expanded dataset led to a significant reduction in false positives.
Future work outlined in this thesis includes testing the combined model on other species, exploring alternative machine learning strategies such as support vector machines for exon prediction, developing models for untranslated region (UTR) prediction, and retraining on larger, more diverse datasets. These steps are crucial for moving towards more reliable and comprehensive gene annotation tools, ultimately contributing to better genomic insights and applications in personalised medicine.
Faculty
- Faculty of Science and Engineering
Degree
- Master (Research)
First supervisor
- Virag Sharma
Department or School
- Chemical Sciences