On the Use of Machine Learning in Genomic Prediction

Kelly, Ciaran

This thesis explores the use of machine learning in genomic prediction and the issue of confounding in such approaches. Traditionally, genomic prediction methods have relied on linear models with much success, although there remains some debate as to the extent to which modeling non-additivity may improve polygenic scoring. Because machine learning algorithms offer an increased ability to model complexity, interest in their use has grown in recent years, although the magnitude of the increase in predictive performance one can expect from such models has remained unclear. This thesis uses an Arabidopsis thaliana dataset and a human case/control ALS cohort to investigate whether machine learning methods can statistically outperform traditional methods of genomic prediction. It finds that, across both datasets, the baseline linear models provide a powerful reference against which to gauge the success of machine learning models, and that any improvement in predictive accuracy is generally modest. However, certain model types, such as random forests and feed-forward neural networks, were often able to provide increased performance, and so continued research into their use is recommended.

Much care must be taken when conducting genetic association studies and when applying genomic prediction methods, as many confounding factors, such as population structure, are known to bias these efforts. Although many tools exist for mitigating bias in a linear context, consensus methods for reducing confounding when using machine learning models for genomic prediction remain lacking. As such, this thesis also explores the benefits of measuring bias, in the form of the distance correlation between model predictions and confounder variable(s), during model training. It also investigates the ability of several recently proposed adversarial approaches to actively reduce bias in a deep-learning context. It finds that some methods can indeed output predictions that retain high accuracy while keeping bias low relative to baseline methods. However, adversarial training was not guaranteed to decrease bias across all confounding variables.

As interest in the use of machine learning increases, and the clinical implementation of polygenic prediction models becomes a reality, this thesis provides a framework for implementing best practices to ensure fair comparisons between learning methods. In line with other work in this area, the use of machine learning is not expected to greatly increase predictive accuracy, although some gains in performance may be achievable. Importantly, a heavy emphasis is placed on the issue of confounding in non-linear genomic prediction settings, and some techniques to detect and reduce bias during training are offered.
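As an illustration of the bias measure mentioned in the abstract, the following is a minimal sketch of a sample distance correlation between a vector of predictions and a confounder. It assumes the standard biased (V-statistic) estimator computed with NumPy; the function names and the toy data are hypothetical, and the estimator actually used in the thesis may differ.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered pairwise Euclidean distance matrix of the samples in x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of one feature
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return (
        d
        - d.mean(axis=0, keepdims=True)
        - d.mean(axis=1, keepdims=True)
        + d.mean()
    )


def distance_correlation(x, y):
    """Biased sample distance correlation between x and y (value in [0, 1])."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()      # squared sample distance covariance
    dvar_x = (a * a).mean()     # squared sample distance variance of x
    dvar_y = (b * b).mean()     # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0              # one of the inputs is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


if __name__ == "__main__":
    # Toy data (hypothetical): a confounder, e.g. a principal component of
    # population structure, and predictions that partly depend on it.
    rng = np.random.default_rng(0)
    confounder = rng.normal(size=300)
    predictions = 0.5 * confounder + rng.normal(size=300)
    print(distance_correlation(predictions, confounder))
```

Tracking a value like this on held-out samples during training gives one number per confounder: values near 0 suggest the predictions carry little information about the confounder, while values near 1 suggest strong dependence, including non-linear dependence that a Pearson correlation could miss.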
