Research News



Hi, on this page (more of a blog) I give short summaries of some of our current research themes, in case they are of interest. Note that this webpage isn't updated regularly! Although my background is in machine learning, biomedical applications of machine learning have been a draw over the last few years. Firstly, the biomedical sciences are increasingly generating very large datasets (think of the human genome). In addition, there is often a wide variety of different types of data coming from the same patient. This suggests the use of algorithms for data integration, which was a particular interest when we were working more on algorithm development. Secondly, of course, some of these projects are potentially very high impact.

  • An ongoing theme of our work has been the development of classifiers for predicting the functional impact of human genetic variation, i.e. whether a variant is pathogenic (a disease-driver) or neutral. Some of this work concerns single nucleotide variants (A->C or C->T, for example, in the human genome sequence), but we are also interested in predicting the pathogenic status of indels (short insertions or deletions of genetic code) and in related problems, such as discovering functionally significant combinations of variants involved in disease. We are further interested in the development of disease-specific predictors, e.g. for cancer. Prediction in this context involves about 30 different types of data, and thus we use data integration algorithms extensively. This project involves Mark Rogers (ISL), Tom Gaunt (Social Medicine) and others. It is more fully described on the Available Software tab to the left (under FATHMM-MKL and CScape) and in the technical papers on my bioinformatics pages (see tab to left). We are members of the Functional Effects GeCiP of the Genomics England (100,000 genomes) project, where we expect to use these methods on rare disease genomes and cancer genomes.


  • 20% of men get prostate cancer during their lifetime and 3% die from the disease. Thus the disease is mostly not life-threatening: most individuals die with the disease rather than from it. Treatment has associated risks and should ideally be targeted at those for whom the disease would be life-threatening. Stated this way, this is a classic machine learning problem, in which previously catalogued genomic and clinical data is used to train an algorithm which predicts, at diagnosis, whether the tumor is aggressive or benign. Alas, prostate cancer tumors are typically very heterogeneous, most often having different genetic signatures in different regions of the tumor, which makes the problem much less tractable in practice. In the period 2005-2009, we developed an algorithmic approach called Latent Process Decomposition (LPD): most of the papers concerned are on the Publications:Bioinformatics tab to the left. The LPD algorithm attempts to maximise the probability of the model given the data. An interesting aspect of the method is that it is a mixed membership model, that is, a biopsy sample from a patient is represented as a combinatorial mixture over a number of underlying active states or processes. It is, in short, well suited to the analysis of prostate cancer samples and, indeed, of many other cancers, because of the heterogeneities involved. For this reason we used LPD on prostate cancer in an earlier study with Colin Cooper's group, then at the Institute of Cancer Research, London (CS Cooper, C Campbell, S Jhavar. Nature Reviews Urology 4 (12), 677-687 (2007)). Colin Cooper's group has since acquired improved data, and LPD has given substantially improved resolution of the disease (see, e.g. national, international coverage, cancer news or the technical paper). Specifically, it has indicated a 45 gene signature (the DESNT signature) which plays a core role in the ability of the tumor to progress towards an aggressive outcome.
We are working with Colin Cooper and his team to further the understanding of this signature, using novel data and alterations to the algorithmic framework.
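
To make the mixed membership idea concrete, here is a deliberately simplified sketch. It is not the published LPD algorithm: it only evaluates the likelihood of one sample for a given set of mixing proportions, and it assumes each process has its own Gaussian expression distribution per gene. The function names, the toy means and the sample values are all hypothetical and chosen purely for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_log_likelihood(expression, theta, means, stds):
    """Log-likelihood of one sample under a mixed membership model.

    expression[g] : expression level of gene g in this sample
    theta[k]      : proportion of process k in this sample (sums to 1)
    means[g][k], stds[g][k] : assumed Gaussian parameters of gene g
                              under process k (illustrative only)
    """
    if abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    total = 0.0
    for g, x in enumerate(expression):
        # Each gene may be generated by any process, weighted by theta.
        mix = sum(
            theta[k] * gaussian_pdf(x, means[g][k], stds[g][k])
            for k in range(len(theta))
        )
        total += math.log(mix)
    return total


# Toy example: two hypothetical processes (A and B) and three genes.
means = [[0.0, 3.0], [1.0, -2.0], [0.5, 2.5]]
stds = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# A heterogeneous sample: genes 0 and 2 resemble process A, gene 1 resembles B.
sample = [0.1, -1.9, 0.4]

pure_a = sample_log_likelihood(sample, [1.0, 0.0], means, stds)
pure_b = sample_log_likelihood(sample, [0.0, 1.0], means, stds)
mixed = sample_log_likelihood(sample, [0.5, 0.5], means, stds)

print("pure A:", pure_a)
print("pure B:", pure_b)
print("mixed :", mixed)
```

In this toy case the mixed representation explains the heterogeneous sample better than assigning it wholly to either process, which is the intuition behind using a mixed membership model for heterogeneous tumor samples.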


  • With Prof. Moin Saleem's group (University of Bristol, project PI) we have recently been awarded circa 2.5 million pounds from the Medical Research Council (Stratified Medicine initiative, MR/R013942/1) as part of an investigation of idiopathic nephrotic syndrome and chronic kidney disease. Here, machine learning methods can be used to resolve clinically distinct disease subtypes (via unsupervised learning) or to discover those genetic variants in the human genome which act as disease-drivers (see above).


  • With Prof. David Murphy's group (University of Bristol) we used a graphical lasso algorithm to look for the main regulators of hypertension, which is prevalent but remains poorly understood. With control vs disease-trait expression array data, nodes in the graph represent genes, and a node with large fan-out may indicate that the expressed product of that gene has a significant regulatory influence. This analysis identified CAPRIN2 as a gene of interest. As CI, with David as PI, we have recently been awarded 1.3 million from BBSRC (BB/R016879/) for further investigation of this topic.
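
The fan-out idea in the graphical lasso item above can be sketched as follows. This is a minimal illustration, not our analysis pipeline: it assumes a sparse precision (inverse covariance) matrix has already been estimated by some graphical lasso solver (for example, scikit-learn's `GraphicalLasso` exposes one as `precision_`), and it simply counts the non-zero off-diagonal entries per gene. The function name `fan_out`, the gene labels and the matrix values are hypothetical.

```python
def fan_out(precision, genes, tol=1e-8):
    """Rank genes by the number of graph neighbours (fan-out).

    precision : square nested list, the sparse precision matrix; a non-zero
                off-diagonal entry (i, j) is read as an edge between genes
                i and j
    genes     : gene labels, in the same order as the matrix rows
    tol       : entries with absolute value at or below tol count as zero
    """
    n = len(genes)
    degrees = {}
    for i in range(n):
        degrees[genes[i]] = sum(
            1 for j in range(n) if j != i and abs(precision[i][j]) > tol
        )
    # Highest fan-out first; ties keep the original gene order.
    return sorted(degrees.items(), key=lambda item: item[1], reverse=True)


# Hypothetical 5-gene precision matrix (values invented for illustration).
genes = ["gene_A", "gene_B", "gene_C", "gene_D", "gene_E"]
precision = [
    [2.0, -0.6, -0.5, -0.4, 0.0],
    [-0.6, 1.5, 0.0, 0.0, 0.0],
    [-0.5, 0.0, 1.4, 0.0, 0.0],
    [-0.4, 0.0, 0.0, 1.3, -0.3],
    [0.0, 0.0, 0.0, -0.3, 1.2],
]

ranking = fan_out(precision, genes)
for gene, degree in ranking:
    print(gene, degree)
```

A gene at the top of such a ranking is only a candidate regulator: the graph is undirected, so a high degree suggests where to look rather than establishing regulatory influence.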