Available Software

1. FATHMM-MKL and CScape

1.1. The FATHMM family of predictors

The FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:

Hashem A. Shihab, Julian Gough, David N. Cooper, Peter D. Stenson, Gary L.A. Barker, Keith J. Edwards, Ian N.M. Day, Tom R. Gaunt. Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. (2013), 34:57-65

Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, the focus was on the observation that SNVs in regions of the genome that are highly conserved across species are more likely to be deleterious than variants in regions with high variability across species. Sequence conservation across species is an indirect indicator of functional significance.

As an improvement, we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources for predicting the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source. The method used multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines, now published by Springer Nature). The algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:
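The idea of weighting data sources by informativeness can be illustrated with a minimal sketch. This is not the FATHMM-MKL algorithm: it assumes each data source is already represented by a kernel (Gram) matrix, and it swaps in a simple kernel-target alignment heuristic in place of a full multiple kernel learning optimiser. The function names and the toy "conservation" and "noise" features are hypothetical.

```python
import math


def linear_kernel(X):
    """Gram matrix of inner products between the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def alignment(K, y):
    """Kernel-target alignment of Gram matrix K with labels y in {-1, +1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    # The Frobenius norm of y y^T equals n when every label is +1 or -1.
    return num / (norm_k * n) if norm_k > 0 else 0.0


def combine_kernels(kernels, y):
    """Weight each kernel by its non-negative alignment and sum them.

    Heuristic stand-in for learned kernel weights (an assumption of this
    sketch, not the published method).
    """
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total > 0:
        weights = [s / total for s in scores]
    else:
        # No kernel is aligned with the labels: fall back to equal weights.
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(y)
    combined = [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, combined


if __name__ == "__main__":
    y = [1, 1, -1, -1]                            # toy labels
    conservation = [[0.9], [0.8], [0.1], [0.2]]   # hypothetical informative source
    noise = [[0.5], [-0.5], [0.5], [-0.5]]        # hypothetical uninformative source
    weights, K = combine_kernels(
        [linear_kernel(conservation), linear_kernel(noise)], y
    )
    print(weights)
```

In this toy case the informative source receives the larger weight. The combined matrix could then be passed to any kernel classifier that accepts a precomputed kernel.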

Hashem Shihab, Mark Rogers, Julian Gough, Matthew Mort, David Cooper, Ian Day, Tom Gaunt and Colin Campbell. An Integrative Approach to Predicting the Functional Effects of Non-Coding and Coding Sequence Variation. Bioinformatics Vol. 31, No. 10, 2015, pages 1536-1543.

We later improved the method using sequential learning:

Mark Rogers, Hashem Shihab, Tom Gaunt, Matthew Mort, David Cooper, and Colin Campbell, Sequential Data Selection for Predicting the Pathogenic Effects of Sequence Variation, Proceedings, 2015 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2015, B394)

FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups. A search for FATHMM-MKL or FATHMM in Google Scholar gives the broader literature using this method. FATHMM scores are a variant filter option with Ion Torrent genome sequencing machines. From 2015 to 2022, FATHMM-MKL was the mutation impact predictor at the COSMIC cancer database.

Joeri van der Velde et al. proposed GAVIN, a variant predictor which used variable thresholding to give a substantive performance gain over other methods, such as CADD. We pursued a similar approach to give a competitive performance gain with FATHMM-XF:

Mark F. Rogers, Hashem A. Shihab, Matthew Mort, David N. Cooper, Tom R. Gaunt and Colin Campbell. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics btx536, https://doi.org/10.1093/bioinformatics/btx536 (2017).

The FATHMM-XF server for GRCh37/hg19 (ENSEMBL release 87) is available here.

FATHMM-XF can achieve an overall test accuracy of 89.0% on approximately balanced (50:50) unseen test data for SNVs in coding regions, rising to 94% if restricted to high-confidence predictions. These test accuracies look impressive and are roughly in line with the performance of other predictors, such as CADD. With this level of test accuracy, one might expect good concordance between FATHMM-MKL, FATHMM-XF, CADD and other classifiers. However, this is not necessarily the case. Different investigators make different assumptions about which variants qualify as neutral and which act as drivers. Ambiguity about the status of disease-drivers is obvious. However, there is ambiguity with neutral SNVs too: for example, do we use germline SNVs found in healthy individuals, or randomly created neutral examples (both assumptions are used in practice)? Such assumptions will be common to both training and test data (for a machine learning approach, the test data are assumed IID: independent and identically distributed). So, in principle, a classifier can learn a distinction from a training set and achieve a good test accuracy, yet not necessarily agree with another classifier for which different assumptions have been made about drivers and neutrals.
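The gap between test accuracy and concordance can be made concrete with a small sketch. It assumes two classifiers whose scores have already been thresholded into binary calls (1 for pathogenic, 0 for neutral) on the same set of variants; the function names are hypothetical, and Cohen's kappa is used here only as one common chance-corrected agreement measure, not one taken from the papers above.

```python
def call_labels(scores, threshold=0.5):
    """Turn scores in [0, 1] into binary calls (1 = pathogenic, 0 = neutral)."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(pred, truth):
    """Fraction of calls that match the reference labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two lists of binary calls."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Both classifiers make the same constant call everywhere;
        # kappa is undefined, so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy example: each classifier is perfect under its own label definition,
    # but the two definitions of driver and neutral differ.
    labels_a = [1, 1, 1, 0, 0, 0]
    labels_b = [1, 0, 1, 0, 1, 0]
    calls_a, calls_b = labels_a[:], labels_b[:]
    print(accuracy(calls_a, labels_a), accuracy(calls_b, labels_b))
    print(cohen_kappa(calls_a, calls_b))
```

Both toy classifiers score perfectly against their own labels, yet their agreement with each other is far from perfect, which is the situation described above.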

We devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome, and which also gives a visual comparison of classifier similarities. Peaks near unity in the depicted plots indicate probable pathogenic SNVs and peaks near zero indicate neutral. Other prediction methods are presented, e.g. CADD, some as optional tracks:

Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.

Subsequently we developed an indel predictor, FATHMM-indel, for estimating the pathogenic impact of short insertions or deletions of genetic code (aside: carriage returns (CRs) inserted by Windows text editors can cause some issues with input files; using tab-delimited input will resolve this). This predictor can handle indels in non-coding regions of the human genome:
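As a precaution against the carriage-return issue mentioned above, an input file can be normalised before submission. The following is a minimal sketch, assuming only that the input is plain tab-delimited text; the function name is hypothetical and the example columns are illustrative, not the server's documented input format.

```python
def parse_tab_delimited(text):
    """Normalise Windows/old-Mac line endings and split rows on tabs."""
    # Convert CRLF first, then any remaining lone CR, to a plain newline.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    rows = []
    for line in normalised.split("\n"):
        if line.strip() == "":
            continue  # skip blank lines, including a trailing one
        rows.append(line.split("\t"))
    return rows


if __name__ == "__main__":
    # Hypothetical two-row input saved by a Windows editor (CRLF endings).
    raw = "1\t100\tA\tAT\r\n2\t200\tG\t-\r\n"
    for row in parse_tab_delimited(raw):
        print("\t".join(row))  # re-emitted without carriage returns
```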

Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Tom R Gaunt, Matthew Mort, David N Cooper, Colin Campbell. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. BMC Bioinformatics 18:442 (2017).

1.2. CScape: cancer specific prediction.

A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. Thus we are devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor CScape uses a wide variety of data sources to predict if a single nucleotide variant is potentially a disease-driver for cancer:

Mark Rogers, Hashem Shihab, Tom Gaunt, and Colin Campbell. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Scientific Reports (Nature) 7, article number: 11597, doi:10.1038/s41598-017-11746-4, (2017) (main paper and supplementary).

Predictions are based on reference GRCh37/hg19 (ENSEMBL release 87) of the human genome, with the option of the GRCh38 build via conversion through lift-over.

Our baseline predictor appears more accurate than competing methods and was based on data from COSMIC. The method can use up to 30 different types of genomic data, though only a small subset of these are actually informative and are used by the classifier. This method was benchmarked on independent data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), in addition to other databases and leave-one-chromosome-out cross validation. It is able to make predictions in both the coding and non-coding regions of the human cancer genome, though it is much more accurate in coding regions. We introduced a confidence measure for the predicted class label: by restricting prediction to the highest confidence instances, the resultant classifier can perform at approximately 91% test accuracy (in coding regions, approximately balanced 50:50 unseen test data), though it is only able to achieve this level of accuracy at a minority of nucleotide positions (about 17% of nucleotide positions). The overall test accuracy (i.e. all nucleotide positions, coding regions) is closer to 73% on approximately balanced (50:50) unseen test data. However, bear in mind the earlier comment that we necessarily use a proxy for the true distinction (of drivers versus neutrals) and that, given large amounts of data that are at least partly informative, the machine learning method can learn the proxy distinction and generalise effectively, but the overlap between the proxy and the true distinction is less clear. The test accuracy scores achieved with CScape are lower, overall, than for non-cancer diseases resulting from variants in the genome. This likely stems from the observation that variants driving unregulated cell proliferation lie on a functional range from weak drivers (limited functional impact) through to strong drivers (significant functional impact), and this ambiguity complicates any definition of driver and neutral.
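The trade-off between accuracy and coverage under a confidence restriction can be sketched as follows. This is an illustration under stated assumptions, not the confidence measure defined in the CScape paper: it assumes scores in [0, 1] with a decision boundary at 0.5, and it treats the distance of a score from 0.5 as the confidence. The function name and the toy data are hypothetical.

```python
def accuracy_at_confidence(scores, labels, min_margin):
    """Accuracy and coverage when only confident predictions are kept.

    A prediction is kept if its score lies at least `min_margin` away from
    the 0.5 decision boundary (an assumption of this sketch).
    Returns (accuracy, coverage); accuracy is None if nothing is kept.
    """
    kept = [(s, y) for s, y in zip(scores, labels) if abs(s - 0.5) >= min_margin]
    coverage = len(kept) / len(scores) if scores else 0.0
    if not kept:
        return None, coverage
    correct = sum((s >= 0.5) == bool(y) for s, y in kept)
    return correct / len(kept), coverage


if __name__ == "__main__":
    # Toy scores and labels (1 = driver, 0 = neutral).
    scores = [0.95, 0.60, 0.45, 0.05, 0.55, 0.10]
    labels = [1, 0, 1, 0, 1, 0]
    for margin in (0.0, 0.3):
        print(margin, accuracy_at_confidence(scores, labels, margin))
```

Raising the margin typically increases accuracy while reducing the fraction of positions at which a prediction is made, which mirrors the pattern reported above.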

Subsequently, we replaced the neutrals class (germline variants in healthy individuals) with rarely occurring somatic variants drawn purely from cancer genome data, to further understand the ambiguities associated with defining drivers and neutrals. This gives an alternative view of predicted drivers and was published here:

Mark F Rogers, Tom R Gaunt and Colin Campbell. CScape-somatic: distinguishing driver and passenger point mutations in the cancer genome. Bioinformatics. Volume 36, Issue 12, pages: 3637–3644 (2020). The CScape-somatic predictor is located here.

Subsequently we explored the view of the cancer genome, based on usage of CScape and taken across 25 types of cancer, in this paper:

Madeleine Darbyshire, Zachary du Toit, Mark F. Rogers, Tom Gaunt, and Colin Campbell. Estimating the Frequency of Single Point Driver Mutations across Common Solid Tumours. Scientific Reports (Nature) 9, article number: 13452, (2019) (main paper and supplementary).

This paper focusses on predicting the disease-driver status of single nucleotide variants driving cell proliferation (we call them SNV-drivers) within coding regions of the cancer genome, though there is some discussion of, and results for, SNV-drivers in non-coding regions. The term driver is potentially a misnomer: enabler may be more accurate, since some predicted variants (e.g. in the tumour suppressor gene TP53) enable unregulated cell proliferation via loss of function.

The driver-genes in which predicted coding SNV-drivers are embedded are generally individualised to a given patient or tumour but ... one or more high-confidence predicted SNV-drivers are commonly located in certain genes, in certain contexts (high confidence equates to a false discovery rate of 5% or less, equivalent to a cutoff of 0.88 of the p-score CScape uses). For example, pancreatic cancer has at least one such high-confidence SNV-driver in the gene KRAS in 86.5% of samples in our data. The gene BRAF has such SNV-drivers in skin cutaneous melanoma (42.3% of cases) and thyroid cancer (55.8%). The gene APC is a top ranked driver for colon adenocarcinoma (49.0% of cases) and colorectal cancer (39.2%). Some genes have significant mutations across multiple types of cancer. Thus TP53 is placed in the top five driver genes across 17 of the 25 types of cancer considered. Other genes entering the top five ranked driver-gene list across multiple cancers are PIK3CA (6 cancer types), KRAS (5) and CTC-297N7.11 (4). The long non-coding RNA gene TTN-AS1 is transcribed from the opposite strand to the TTN gene, which expresses Titin, and it is indicated as prospectively relevant across multiple cancers, though TTN itself belongs to a small class of genes, e.g. mucins (MUC) and ryanodine receptors (RYR), that are commonly mutated but not regarded as cancer-drivers (since this was written, TTN-AS1 has been implicated in multiple cancer contexts; a review is here). Plots of frequency of occurrence for some illustrative genes with embedded high-confidence SNV-drivers are given here.
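Per-gene percentages of the kind quoted above can be computed along the following lines. This is a minimal sketch, not the pipeline used in the paper: it assumes a flat list of (sample, gene, p-score) records for one cancer type, takes the 0.88 cutoff from the text, and treats a score equal to the cutoff as high confidence. The function name and the toy records are hypothetical.

```python
from collections import defaultdict

HIGH_CONFIDENCE_CUTOFF = 0.88  # p-score cutoff quoted in the text


def driver_gene_frequency(records, cutoff=HIGH_CONFIDENCE_CUTOFF):
    """Fraction of samples with at least one high-confidence SNV per gene.

    `records` is an iterable of (sample_id, gene, p_score) tuples.
    Genes with no high-confidence SNV in any sample are omitted.
    """
    all_samples = set()
    samples_by_gene = defaultdict(set)
    for sample_id, gene, p_score in records:
        all_samples.add(sample_id)
        if p_score >= cutoff:  # inclusive comparison is an assumption
            samples_by_gene[gene].add(sample_id)
    if not all_samples:
        return {}
    total = len(all_samples)
    return {gene: len(ids) / total for gene, ids in samples_by_gene.items()}


if __name__ == "__main__":
    # Toy records: four samples, invented scores.
    records = [
        ("s1", "KRAS", 0.95),
        ("s1", "KRAS", 0.91),  # second hit in the same sample counts once
        ("s2", "KRAS", 0.50),
        ("s2", "TP53", 0.90),
        ("s3", "TP53", 0.89),
        ("s4", "BRAF", 0.10),
    ]
    ranked = sorted(driver_gene_frequency(records).items(),
                    key=lambda item: item[1], reverse=True)
    print(ranked)
```

Counting each sample at most once per gene matches the "at least one such SNV-driver" phrasing used above; ranking the resulting fractions gives a top-ranked driver-gene list for the cancer type.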

We have published a review paper and introduction, with some additional plots of driver genes here:

Mark F Rogers, Tom R Gaunt and Colin Campbell. Prediction of driver variants in the cancer genome via machine learning methodologies. Briefings in Bioinformatics (OUP). Volume 122, pages: 1467-1476 (2020), bbaa250, https://doi.org/10.1093/bib/bbaa250.

and for a comment: Colin Campbell, Amy Francis and Tom Gaunt. Predicting pathogenicity from non-coding mutations, Nature Biomedical Engineering, https://doi.org/10.1038/s41551-022-00996-x (2022).

1.3. Other Tools

We have also developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:

Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics, (2017) 33 (12): 1751-1757. https://doi.org/10.1093/bioinformatics/btx028.

The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of the gene product, leading to a disease trait.

2. Learning with indefinite kernels (SVM classification)

Reference:

Yiming Ying, Colin Campbell and Mark Girolami. Analysis of SVM with Indefinite Kernels. Advances in Neural Information Processing Systems (NIPS) 22, 2009, p. 2214-2222.

Download the MATLAB code

3. Variational Bayesian approach to LPD cluster analysis

References:

Y. Ying, Peng Li and C. Campbell, A marginalized variational Bayesian approach to the analysis of array data, BMC Proceedings 2008, 2 (suppl 4):S7.

S. Rogers, M. Girolami, C. Campbell, and R. Breitling, The Latent Process Decomposition of cDNA Microarray Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2 (2005) 143-156.

Download the MATLAB code