On the Use of Machine Learning in Genomic Prediction
Citation:
Kelly, Ciaran, On the Use of Machine Learning in Genomic Prediction, Trinity College Dublin, School of Genetics & Microbiology, Genetics, 2023
Abstract:
This thesis explores the use of machine learning in the context of genomic prediction and the
issue of confounding in such approaches. Traditionally, genomic prediction methods have
made use of linear models with much success, although there remains some debate as to the
extent to which modeling non-additivity may improve polygenic scoring. As machine learning
algorithms offer an increased ability to model complexity, interest in their use has grown in
recent years, although the exact magnitude of the increase in predictive performance one can
expect from such models has remained unclear.
This thesis makes use of an Arabidopsis thaliana dataset as well as a human case/control ALS
cohort to investigate the ability of machine learning methods to statistically out-compete
traditional methods of genomic prediction. It finds that, across both datasets, the baseline
linear models provide a powerful reference with which to gauge the success of machine
learning models and that any improvement in predictive accuracy is generally modest.
However, certain model types such as random forests and feed-forward neural networks
were often able to provide increased performance, and so continued research into their use
is recommended.
Much care must be taken when conducting genetic association studies and applying genomic
prediction methods, as many confounding factors such as population structure are known to
bias these efforts. However, although many tools exist for mitigating bias in a linear context,
consensus methods for reducing confounding when utilizing machine learning models for
genomic prediction remain lacking. As such, this thesis also explores the benefits of measuring
bias in the form of the distance correlation between outputted predictions and confounder
variable(s) during model training. It also investigates the ability of several recently proposed
adversarial approaches to actively reduce bias in a deep-learning context. It finds that some
methods do indeed have the ability to output predictions that retain high accuracy, while
keeping bias low relative to baseline methods. However, adversarial training was not
guaranteed to lead to a decrease in bias across all confounding variables.
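
To illustrate the bias measure described above, the sample distance correlation between a vector of model predictions and a confounder (for instance, a variable capturing population structure) could be computed along the following lines. This is a minimal sketch using the standard sample estimator, not the implementation used in the thesis; the function names and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def _centered_distances(values):
    """Return the double-centered pairwise Euclidean distance matrix."""
    v = np.asarray(values, dtype=float)
    if v.ndim == 1:
        v = v[:, None]  # treat a 1-D vector as n samples of one feature
    d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    # Subtract row and column means, then add back the grand mean.
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation(predictions, confounder):
    """Sample distance correlation in [0, 1]; 0 if either input is constant."""
    a = _centered_distances(predictions)
    b = _centered_distances(confounder)
    dcov2 = (a * b).mean()                    # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()   # product of squared distance variances
    if dvar2 <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


# Hypothetical usage: simulated predictions that partly track a confounder.
rng = np.random.default_rng(0)
confounder = rng.normal(size=200)
biased_predictions = 0.8 * confounder + rng.normal(scale=0.5, size=200)
unrelated_predictions = rng.normal(size=200)

print("biased:   ", distance_correlation(biased_predictions, confounder))
print("unrelated:", distance_correlation(unrelated_predictions, confounder))
```

Tracking a quantity like this across training epochs gives a single number that reflects both linear and non-linear dependence between predictions and a confounder, which is why it suits non-linear models where a plain correlation coefficient could miss the association.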
As interest in the use of machine learning increases, and the clinical implementation of
polygenic prediction models becomes a reality, this thesis provides a framework for
implementing best practices to ensure fair comparisons between learning methods. In line with
other work in this area, the use of machine learning is not
expected to greatly increase predictive accuracy, although some gains in performance may
be achievable. Importantly, a heavy emphasis is placed on the issue of confounding in
non-linear genomic prediction settings, and some techniques to detect and reduce bias during
training are offered.
Sponsor:
Science Foundation Ireland (SFI)
Description:
APPROVED
Author: Kelly, Ciaran
Advisor:
McLaughlin, Russell
Publisher:
Trinity College Dublin. School of Genetics & Microbiology. Discipline of Genetics
Type of material:
Thesis
Availability:
Full text available
Keywords:
Genomics, Machine Learning, Genomic Prediction, ALS, MND, Polygenic, Complex Traits, Arabidopsis thaliana