Research News
Hi, on this page (more of a blog) I give short summaries of some of our current research themes, in case they are of interest. Note that this webpage isn't updated regularly! Although my background is in machine learning, biomedical applications of machine learning have been a draw over the last few years.
Firstly, the biomedical sciences are increasingly generating very large datasets (think of the human genome). In addition, there is often a wide variety of different types of data coming
from the same patient. This suggests the use of data integration algorithms, which were a particular interest when we were working more on algorithm development. Secondly, of course, some of these types of project are potentially very high impact.
An ongoing theme of our work has been the development of classifiers for predicting the functional impact of human genetic variation, i.e. whether the variation is pathogenic (a disease-driver)
or neutral. Some of this work concerns single nucleotide variants (SNVs, e.g. A->C or C->T in the human genome sequence), but we are also interested in predicting the pathogenic status of indels (short insertions or deletions of genetic
code) and related problems, such as discovering functionally significant combinations of variants involved in disease. We are further interested in the development of
disease-specific predictors, e.g. for cancer. Prediction in this context involves using over 30 different types of data and
thus we use data integration algorithms extensively. This project has involved collaborations with Mark Rogers, Amy Francis,
Tom Gaunt and many others. It is described more fully on the Available Software tab to the left
(under FATHMM-MKL and CScape) and in technical papers on my bioinformatics pages (see tab to left).
From 2015 to 2022 FATHMM-MKL was the mutation impact predictor at the
COSMIC cancer database, though it has since been superseded by later, more
accurate methods. The wider literature using these methods can be found by searching 'FATHMM' and 'FATHMM-MKL'
in Google Scholar.
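To give a flavour of what data integration means for a classifier of this kind, here is a minimal sketch in Python. It is not the FATHMM-MKL code: the data type names, the feature values and the fixed weights are all invented for illustration, and the classifier is a deliberately simple comparison of mean similarities. The idea shown is only that each data type gets its own similarity measure (a kernel), and that these are combined as a weighted sum before a prediction is made.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) similarity between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def combined_kernel(u, v, weights):
    """Weighted sum of one RBF kernel per data type.

    u and v map a data-type name to that variant's feature vector.
    """
    return sum(w * rbf_kernel(u[name], v[name]) for name, w in weights.items())

def predict(variant, training, weights):
    """Label a variant by its mean combined similarity to each class."""
    sims = {"pathogenic": [], "neutral": []}
    for example, label in training:
        sims[label].append(combined_kernel(variant, example, weights))
    score = (sum(sims["pathogenic"]) / len(sims["pathogenic"])
             - sum(sims["neutral"]) / len(sims["neutral"]))
    return "pathogenic" if score > 0 else "neutral"

# Toy data: two hypothetical data types per variant (all values invented).
training = [
    ({"conservation": [0.9, 0.8], "histone": [0.7]}, "pathogenic"),
    ({"conservation": [0.8, 0.9], "histone": [0.6]}, "pathogenic"),
    ({"conservation": [0.1, 0.2], "histone": [0.2]}, "neutral"),
    ({"conservation": [0.2, 0.1], "histone": [0.3]}, "neutral"),
]
# Assumption: weights are fixed by hand here; a learning method would fit them.
weights = {"conservation": 0.7, "histone": 0.3}

query = {"conservation": [0.85, 0.75], "histone": [0.65]}
label = predict(query, training, weights)
print(label)
```

In a realistic setting the weights would be learned from data, so that more informative data types contribute more to the final decision.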
A talk at the PharmaTec Congress (2019) gives an overview of
this research programme, and a review is:
Mark F Rogers, Tom R Gaunt and Colin Campbell. Prediction of driver variants
in the cancer genome via machine learning methodologies. Briefings in Bioinformatics (OUP), bbaa250 (2020), https://doi.org/10.1093/bib/bbaa250.
We remain interested in this research theme and published an extended resource for building these prediction tools,
called
DrivR-Base:
Amy Francis, Colin Campbell, Tom R Gaunt,
DrivR-Base: a feature extraction toolkit for variant effect prediction model construction,
Bioinformatics, Volume 40, Issue 4, April 2024, btae197, https://doi.org/10.1093/bioinformatics/btae197.
Our most recent paper covers a set of 50 cancer-specific classifiers, called
CanDrivR-CS:
Amy Francis, Colin Campbell, Tom R Gaunt.
CanDrivR-CS: A Cancer-Specific Machine Learning Framework for Distinguishing Recurrent and Rare Variants, Bioinformatics Advances (OUP, 2026)
For a number of cancer types, and restricting to SNVs in coding regions, these are the most accurate prediction tools to date. However, these algorithms
depend on features related to the physical stability of the DNA. Thus, a mutation predicted to be recurrently observed in
cancer genomes, yet absent from the genomes of healthy individuals, may simply lie in a
mutational hotspot,
i.e. a site that is frequently mutated but not necessarily a cancer-driver. In line with this comment, the highest test accuracy, at 90%, is for
skin cutaneous melanoma (SKCM). With SKCM, benign mutational hotspots can be UV-induced due to localised weakness of the double helix. Even with this caveat, however,
mutations recurrently observed in cancer genomes, and absent from the genomes of healthy individuals, form a class of mutations likely enriched for drivers of unregulated clonal expansion.
20% of men get prostate cancer during their lifetime and 3% die from the disease. Thus the disease is largely not life-threatening, with most individuals dying with the disease and not from it.
Treatment has associated risks and should ideally be targeted at those for whom the disease would be life-threatening. Stated this way, this is a classic machine learning problem in which
previously catalogued genomic and clinical data are used to train an algorithm which predicts, at diagnosis, whether the tumor is aggressive or benign. Alas, prostate cancer tumors are
typically very heterogeneous, most often having different genetic signatures in different regions of the tumor, so this straightforward approach is not readily tractable.
In the period 2005-2009, we developed an algorithmic approach called Latent Process Decomposition (LPD): most of the papers concerned
are on the Publications:Bioinformatics tab to the left. The LPD algorithm attempts to maximise the probability of the model given the data.
An interesting aspect of the method is that it is a mixed membership model, that is, a biopsy sample from a patient is represented as a combinatorial mixture over a number of underlying active states or processes. It is, in short, well suited to the analysis of prostate cancer samples and, indeed, many other cancers, because of the
heterogeneities involved. For this reason we used LPD on prostate cancer in an earlier study with Colin Cooper's group, then at the Institute of Cancer Research, London
(CS Cooper, C Campbell, S Jhavar.
Nature Reviews Urology 4 (12), 677-687 (2007)). Colin Cooper's group has since acquired improved data, and LPD has given substantially
improved resolution of the disease (see, e.g. national, international coverage, cancer news, and even The Sun (look at the video here!)). For more details there is a technical paper (2017).
Specifically, it has indicated a 55-gene signature (the DESNT signature) which plays a core role in the ability of the tumor to progress towards an aggressive outcome. We are working with
Colin Cooper and his team to advance understanding of this signature, via the use of novel data and alterations
to the algorithmic framework. A recent paper is:
Bogdan Luca, Vincent Moulton, Christopher Ellis, Dylan R Edwards, Colin Campbell, Rosalin Cooper, Jeremy Clark,
Daniel Brewer and Colin Cooper. A Novel Stratification Framework for Predicting Outcome in Patients with
Prostate Cancer. British Journal of Cancer (Nature). Volume 122, pages: 1467–1476 (2020).
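For readers unfamiliar with the mixed membership idea mentioned above, the following is a minimal generative sketch in Python. It is not the LPD implementation: the number of genes and processes, the mean expression values, the noise level and the Dirichlet parameters are all invented for illustration. It shows only the central point, namely that a single sample is not assigned to one cluster but carries its own mixing weights over several processes, and each gene's measured value is drawn from one of those processes.

```python
import random

def sample_memberships(alpha, rng):
    """Draw mixing weights over processes from a Dirichlet via normalised gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_sample(memberships, means, sigma, rng):
    """Generate one expression profile: each gene picks a process, then a value."""
    profile = []
    for gene_means in means:  # gene_means[k] is this gene's mean level under process k
        k = rng.choices(range(len(memberships)), weights=memberships)[0]
        profile.append(rng.gauss(gene_means[k], sigma))
    return profile

rng = random.Random(0)
# Hypothetical mean expression for 4 genes under 2 processes (values invented).
means = [[0.0, 2.0], [1.0, -1.0], [0.5, 0.5], [-2.0, 3.0]]
memberships = sample_memberships([1.0, 1.0], rng)
profile = generate_sample(memberships, means, sigma=0.1, rng=rng)
print(memberships, profile)
```

A heterogeneous tumor sample corresponds to weights spread across several processes, whereas a hard clustering method would force the weights to be 0 or 1.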
With Prof. Moin Saleem's group (University of Bristol, project PI) we were awarded circa 2.5 million pounds from the Medical Research Council
(Stratified Medicine initiative, MR/R013942/1, to December 2024) as part of an investigation of idiopathic nephrotic syndrome and chronic kidney disease.
Here, machine learning methods can be used to resolve clinically distinct disease subtypes (via unsupervised learning) or
to discover those genetic variants in the human genome which act as disease-drivers (see above).
Dr. Amy Osborne, working in our group, has recently used methods from computational statistics to identify a
number of previously unreported variants in the human genome which predispose an individual to chronic kidney disease,
which affects about 700 million persons globally and is responsible for more than a million deaths per annum. As its primary resource,
the project used data from the Northern Care Alliance (the Salford Kidney Study) and Bristol-based NURTuRE-CKD.
The research programme was initially supported by the above Medical Research Council stratified medicine grant (MR/R013942/1, MICA: NURTuRE)
and is currently supported by Kidney Research UK. The paper is: Amy Osborne, Agnieszka Bierzynska, Elizabeth Colby, Uwe Andag, Philip Kalra,
Olivier Radresa, Philipp Skroblin, Maarten Taal, Gavin Welsh, Moin Saleem and Colin Campbell,
"Multivariate canonical correlation analysis identifies additional genetic variants for chronic kidney disease",
NPJ Syst. Biol. Appl. (Nature) 10(1):28 (2024). The Nature SharedIt is here.