Available Software
1. FATHMM-MKL and CScape
1.1. The FATHMM family of predictors
The FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:
Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, they focused on the observation that SNVs in regions of the genome which are highly conserved across species are more likely to be deleterious, relative to variants in regions with high variability across species. Sequence conservation across species is an indirect indicator of functional significance.
As an improvement, we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources to predict the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source. The method used multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines, now published by Springer Nature). The algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:
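As a rough illustration of weighting data sources by informativeness, a combined kernel can be written as a weighted sum of one kernel per data source. The sketch below is not the FATHMM-MKL implementation: it assumes a simple kernel-target alignment heuristic for the weights, toy one-dimensional features, and hypothetical function names.

```python
def linear_kernel(X):
    """Gram matrix of dot products for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(x, z)) for z in X] for x in X]


def alignment(K, y):
    """Kernel-target alignment <K, yy^T> / (||K||_F * ||yy^T||_F), labels in {-1, +1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = sum(K[i][j] ** 2 for i in range(n) for j in range(n)) ** 0.5
    # For +/-1 labels the Frobenius norm of yy^T is exactly n.
    return num / (norm_k * n) if norm_k > 0 else 0.0


def combine_kernels(kernels, y):
    """Weight each kernel by its non-negative alignment; weights sum to 1."""
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total == 0:
        weights = [1.0 / len(kernels)] * len(kernels)
    else:
        weights = [s / total for s in scores]
    n = len(y)
    combined = [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, combined


# Toy data (assumed): +1 = pathogenic, -1 = neutral.
y = [1, 1, -1, -1]
conservation = [[0.9], [0.8], [-0.7], [-0.9]]  # tracks the labels closely
noise = [[0.1], [-0.2], [0.15], [-0.1]]        # largely uninformative

kernels = [linear_kernel(conservation), linear_kernel(noise)]
weights, combined = combine_kernels(kernels, y)
print(weights)  # the conservation kernel receives most of the weight
```

In a full multiple kernel learning method the weights are learned jointly with the classifier rather than set by a fixed heuristic; the sketch only shows how an informative source comes to dominate the combined kernel.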
FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups. A search using FATHMM-MKL or FATHMM in Google Scholar gives the broader literature utilising this method. FATHMM scores are a variant filter option with Ion Torrent genome sequencing machines. From 2015-2022 FATHMM-MKL was the mutation impact predictor at the COSMIC cancer database.
Joeri van der Velde et al. proposed GAVIN, a variant predictor that used variable thresholding to give a substantive performance gain over other methods, such as CADD. We pursued a similar approach to give a competitive performance gain with FATHMM-XF:
The FATHMM-XF server for GRCh37/hg19 (ENSEMBL release 87) is available here.
FATHMM-XF can achieve an overall test accuracy of 89.0% on approximately balanced (50:50) unseen test data for SNVs in coding regions, rising to 94% if restricted to high-confidence predictions. These test accuracies look impressive and are roughly in accordance with the performance of other predictors, such as CADD. With this level of test accuracy, the expectation may be that there should be good concordance between FATHMM-MKL, FATHMM-XF, CADD and other classifiers. However, this is not necessarily the case. Different investigators use different assumptions about which variants qualify as neutrals and which act as drivers. Ambiguity about the status of disease-drivers is obvious. However, there is ambiguity with neutral SNVs too: for example, do we use germline SNVs found in healthy individuals, or randomly created neutral examples (both assumptions are used in practice)? Such assumptions will be common to both training and test data (for a machine learning approach, the test data is assumed IID: independent and identically distributed). So, in principle, a classifier can learn a distinction from a training set and achieve a good test accuracy, but not necessarily accord with another classifier where different assumptions have been made for driver and neutral.
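The trade-off between overall accuracy and accuracy on high-confidence predictions can be sketched as follows. This is a minimal illustration on made-up scores, not FATHMM-XF output: the function name is hypothetical, and confidence is assumed here to be the distance of a predicted probability from 0.5, rescaled to [0, 1], which may differ from the definition used by the predictor.

```python
def accuracy_at_confidence(scores, labels, min_confidence=0.0):
    """Return (accuracy, coverage) over predictions passing a confidence cutoff.

    scores: predicted probability of pathogenicity in [0, 1]
    labels: 1 = pathogenic, 0 = neutral
    Confidence (an assumption of this sketch) is |score - 0.5| * 2.
    """
    kept = [
        (s, l) for s, l in zip(scores, labels)
        if abs(s - 0.5) * 2 >= min_confidence
    ]
    if not kept:
        return None, 0.0
    correct = sum(1 for s, l in kept if (s >= 0.5) == (l == 1))
    return correct / len(kept), len(kept) / len(scores)


# Toy, balanced test set (assumed values).
scores = [0.95, 0.90, 0.60, 0.55, 0.45, 0.40, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

overall = accuracy_at_confidence(scores, labels)
restricted = accuracy_at_confidence(scores, labels, min_confidence=0.6)
print(overall)     # accuracy and coverage over all predictions
print(restricted)  # accuracy and coverage over confident predictions only
```

Restricting to confident predictions raises accuracy at the cost of coverage: fewer positions receive a prediction at all.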
We devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome, and which also gives a visual comparison of classifier similarities. Peaks near unity in the depicted plots indicate probable pathogenic SNVs and peaks near zero indicate neutral. Other prediction methods are presented, e.g. CADD, some as optional tracks:
Subsequent to this, we developed an indel predictor, FATHMM-indel, for estimating the pathogenic impact of short insertions or deletions of genetic code (aside: carriage returns inserted by Windows text editors can cause some issues; using tab-delimited input should resolve this). This predictor can handle indels in non-coding regions of the human genome:
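For the carriage-return issue mentioned in the aside, one possible precaution is to normalise line endings before submitting a file. The sketch below is an assumption about how this could be done, not part of FATHMM-indel; the function names and the four-column layout in the example are hypothetical.

```python
def normalise_line_endings(text):
    """Convert Windows (CRLF) and bare CR line endings to Unix (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def parse_tab_delimited(text):
    """Split cleaned text into rows of tab-separated fields, skipping blank lines."""
    rows = []
    for line in normalise_line_endings(text).split("\n"):
        if line.strip():
            rows.append(line.split("\t"))
    return rows


# Hypothetical input saved by a Windows editor (note the \r\n endings).
raw = "1\t12345\tA\tAT\r\n2\t67890\tGC\tG\r\n"
rows = parse_tab_delimited(raw)
print(rows)
```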
1.2. CScape: cancer specific prediction.
A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. Thus we are devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor CScape uses a wide variety of data sources to predict if a single nucleotide variant is potentially a disease-driver for cancer:
Predictions are based on reference GRCh37/hg19 (ENSEMBL release 87) of the human genome, with the option of the GRCh38 build via conversion through lift-over.
Our baseline predictor appears more accurate than competitors and was based on data from COSMIC. The method can use up to 30 different types of genomic data, though only a small subset of these are actually informative and are used by the classifier. This method was benchmarked on independent data from The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), in addition to other databases and leave-one-chromosome-out cross validation. It is able to make predictions in both the coding and non-coding regions of the human cancer genome, though it is much more accurate in coding regions. We introduced a confidence measure for the predicted class label: by restricting prediction to the highest confidence instances, the resultant classifier can perform at approximately 91% test accuracy (in coding regions, approximately balanced 50:50 unseen test data), though it is only able to achieve this level of accuracy at a minority of nucleotide positions (about 17% of nucleotide positions). The overall test accuracy (i.e. all nucleotide positions, coding regions) is closer to 73% on approximately balanced (50:50) unseen test data. However, bear in mind the earlier comment that we necessarily use a proxy distinction to the true distinction (of drivers versus neutrals) and that, given large amounts of data which is at least partly informative, the machine learning method can learn the proxy distinction and generalise effectively, but the overlap of proxy to true is less clear. The test accuracy scores achieved with CScape are lower, overall, than for non-cancer diseases resulting from variants in the genome. This likely stems from the observation that variants driving unregulated cell proliferation are on a functional range of weak drivers (limited functional impact) through to strong drivers (significant functional impact), and this ambiguity complicates any definition of driver and neutral.
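Leave-one-chromosome-out cross validation holds out all variants on one chromosome per fold, so that training and test examples never share a chromosome. The following is a minimal sketch of how such folds could be generated; it is not the CScape code, and the function name and toy chromosome labels are assumptions.

```python
from collections import defaultdict


def leave_one_chromosome_out(chromosomes):
    """Yield (held_out, train_idx, test_idx), holding out one chromosome per fold."""
    by_chrom = defaultdict(list)
    for idx, chrom in enumerate(chromosomes):
        by_chrom[chrom].append(idx)
    for held_out in sorted(by_chrom):
        test_idx = by_chrom[held_out]
        train_idx = sorted(
            i for chrom, idxs in by_chrom.items() if chrom != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Chromosome label for each variant in a toy data set (assumed).
chroms = ["1", "1", "2", "X", "2", "X"]
folds = list(leave_one_chromosome_out(chroms))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

A classifier would be trained on `train_idx` and evaluated on `test_idx` in each fold, with the per-fold accuracies averaged.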
Subsequently, we replaced the neutrals class (germline variants in healthy individuals) with rarely occurring somatic variants drawn purely from cancer genome data, to further understand the ambiguities associated with defining drivers and neutrals. This gives an alternative view of predicted drivers and was published here:
The driver-genes in which predicted coding SNV-drivers are embedded are generally individualised to a given patient or tumour but ... one or more high-confidence predicted SNV-drivers are commonly located in certain genes, in certain contexts (high confidence equates to a false discovery rate of 5% or less, equivalent to a cutoff of 0.88 of the p-score CScape uses). For example, pancreatic cancer has at least one such high-confidence SNV-driver in the gene KRAS in 86.5% of samples in our data. The gene BRAF has such SNV-drivers in skin cutaneous melanoma (42.3% of cases) and thyroid cancer (55.8%). The gene APC is a top ranked driver for colon adenocarcinoma (49.0% of cases) and colorectal cancer (39.2%). Some genes have significant mutations across multiple types of cancer. Thus TP53 is placed in the top five driver genes across 17 of the 25 types of cancer considered. Other genes entering the top five ranked driver-gene list across multiple cancers are PIK3CA (6 cancer types), KRAS (5) and CTC-297N7.11 (4). The long non-coding RNA gene TTN-AS1 is transcribed from the opposite strand to the TTN gene, expressing Titin, and it is indicated as prospectively relevant across multiple cancers, though TTN itself belongs to a small class of genes, e.g. mucins (MUC) and ryanodine receptors (RYR), commonly mutated, but not regarded as cancer-drivers. Plots of frequency of occurrence for some illustrative genes with embedded high confidence SNV-drivers are given here.
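Per-gene percentages of this kind can be obtained by counting, for each gene, the samples that carry at least one SNV scoring at or above the high-confidence cutoff. The sketch below uses the 0.88 p-score cutoff quoted above, but everything else is an assumption for illustration: the tuple layout, the function name, the toy data, and the choice of an inclusive comparison.

```python
from collections import defaultdict

HIGH_CONFIDENCE_CUTOFF = 0.88  # p-score cutoff quoted in the text


def driver_gene_frequencies(predictions, n_samples, cutoff=HIGH_CONFIDENCE_CUTOFF):
    """Fraction of samples with at least one high-confidence SNV-driver per gene.

    predictions: iterable of (sample_id, gene, p_score) tuples (hypothetical layout).
    Returns (gene, fraction) pairs ranked by decreasing fraction, then gene name.
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene, p_score in predictions:
        if p_score >= cutoff:  # inclusive cutoff is an assumption
            samples_by_gene[gene].add(sample_id)
    freqs = {gene: len(s) / n_samples for gene, s in samples_by_gene.items()}
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy predictions for four samples (made-up values).
predictions = [
    ("S1", "KRAS", 0.95), ("S1", "TP53", 0.91), ("S2", "KRAS", 0.93),
    ("S2", "KRAS", 0.97), ("S3", "TP53", 0.70), ("S3", "KRAS", 0.89),
    ("S4", "APC", 0.90),
]
ranked = driver_gene_frequencies(predictions, n_samples=4)
print(ranked)
```

Using a set of sample identifiers per gene means that several high-confidence SNVs in the same gene and sample are counted once.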
We have published a review paper and introduction, with some additional plots of driver genes here:
and for a comment: Colin Campbell, Amy Francis and Tom Gaunt. Predicting pathogenicity from non-coding mutations, Nature Biomedical Engineering, https://doi.org/10.1038/s41551-022-00996-x (2022).
1.3. Other Tools
We have also developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:
The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of a gene product, leading to a disease trait.
2. Learning with indefinite kernels (SVM classification)
Reference:
Download the MATLAB code
3. Variational Bayesian approach to LPD cluster analysis
References:
Download the MATLAB code