Available Software



1. FATHMM-MKL and CScape

The FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:

  • Hashem A. Shihab, Julian Gough, David N. Cooper, Peter D. Stenson, Gary L.A. Barker, Keith J. Edwards, Ian N.M. Day, Tom R. Gaunt. Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. (2013), 34:57-65

    Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, the starting point was the observation that SNVs in regions of the genome that are highly conserved across species are more likely to be deleterious than variants in regions with high variability across species.

    As an improvement we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources for predicting the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source of information. The method used multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines): the algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:

  • Hashem Shihab, Mark Rogers, Julian Gough, Matthew Mort, David Cooper, Ian Day, Tom Gaunt and Colin Campbell. An Integrative Approach to Predicting the Functional Effects of Non-Coding and Coding Sequence Variation. Bioinformatics Vol. 31, No. 10, 2015, pages 1536-1543.

    We later refined the method:

  • Mark Rogers, Hashem Shihab, Tom Gaunt, Matthew Mort, David Cooper, and Colin Campbell, Sequential Data Selection for Predicting the Pathogenic Effects of Sequence Variation, Proceedings, 2015 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2015, B394)

    FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups.
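
    As a rough illustration of the multiple kernel idea behind FATHMM-MKL, the sketch below builds one base kernel per data source and combines them as a weighted sum, so that more informative sources receive larger weights. The weights here come from kernel-target alignment, a simple heuristic chosen for brevity; it is not the optimisation used in the published method. The feature groups, toy values and function names are all hypothetical.

```python
import math

def linear_kernel(X):
    """Gram matrix of inner products between the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def frobenius(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K, y):
    """Kernel-target alignment: cosine similarity between K and y y^T."""
    Y = [[yi * yj for yj in y] for yi in y]
    return frobenius(K, Y) / math.sqrt(frobenius(K, K) * frobenius(Y, Y))

def combine_kernels(kernels, y):
    """Weight each kernel by its non-negative alignment (weights sum to 1)."""
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total == 0.0:
        # No kernel is aligned with the labels: fall back to uniform weights.
        weights = [1.0 / len(kernels)] * len(kernels)
    else:
        weights = [s / total for s in scores]
    n = len(y)
    combined = [[sum(w * K[i][j] for w, K in zip(weights, kernels))
                 for j in range(n)] for i in range(n)]
    return weights, combined

# Toy data (hypothetical): labels are +1 (pathogenic) and -1 (neutral).
y = [1, 1, -1, -1]
conservation = [[0.9], [0.8], [-0.7], [-0.9]]  # tracks the labels closely
other_source = [[0.5], [-0.5], [0.5], [-0.5]]  # unrelated to the labels

kernels = [linear_kernel(conservation), linear_kernel(other_source)]
weights, combined = combine_kernels(kernels, y)
print(weights)
```

    The combined matrix can then be passed to any kernel classifier that accepts a precomputed kernel. In this toy case the conservation kernel receives the larger weight, in line with the remark above that conservation was the most informative data source.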

    Recently, Joeri van der Velde et al. proposed GAVIN, a variant predictor that used variable thresholding to give a substantive performance gain over other methods, such as CADD. We pursued a similar approach to give a competitive performance gain with FATHMM-XF:

  • Mark F. Rogers, Hashem A. Shihab, Matthew Mort, David N. Cooper, Tom R. Gaunt and Colin Campbell. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics btx536, https://doi.org/10.1093/bioinformatics/btx536 (2017).
  • The FATHMM-XF server for GRCh37/hg19 (ENSEMBL release 87) is available here.

    We have also devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome. Peaks near unity in the depicted plots indicate probable pathogenic SNVs and peaks near zero indicate neutral variants. Other prediction methods, e.g. CADD, are also presented, some as optional tracks:

  • Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.

    Subsequent to this we developed an indel predictor, FATHMM-indel, for estimating the pathogenic impact of short insertions or deletions of genetic code (aside: carriage returns inserted by Windows text editors can cause problems with input files; using tab-delimited input should resolve this). This predictor can handle indels in non-coding regions of the human genome:

  • Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Tom R Gaunt, Matthew Mort, David N Cooper, Colin Campbell. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. BMC Bioinformatics 18:442 (2017).
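
    For the carriage-return issue mentioned in the aside above, a small clean-up step along the following lines may help before submitting a file. This is only a sketch: the four-column layout (chromosome, position, reference, alternative) and the function names are hypothetical, and the input format expected by the server should be checked on the server page itself.

```python
def normalise_lines(text):
    """Convert Windows (CRLF) and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def to_tab_delimited(text):
    """Split each non-empty line on commas or whitespace and rejoin with tabs."""
    rows = []
    for line in normalise_lines(text).split("\n"):
        fields = line.replace(",", " ").split()
        if fields:
            rows.append("\t".join(fields))
    return "\n".join(rows) + ("\n" if rows else "")

# Hypothetical input saved from a Windows editor, with mixed delimiters.
raw = "1,12345,A,AT\r\n2 67890 GC G\r\n"
clean = to_tab_delimited(raw)
print(repr(clean))
```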

    A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. We are therefore devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor, CScape, uses a wide variety of data sources to predict whether a single nucleotide variant is potentially a disease-driver for cancer:

  • Mark Rogers, Hashem Shihab, Tom Gaunt, and Colin Campbell. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Scientific Reports (Nature) 7, article number: 11597, doi:10.1038/s41598-017-11746-4, (2017) (main paper and supplementary).
  • Predictions are based on reference GRCh37/hg19 (ENSEMBL release 87) of the human genome (aside: for another assembly you can use lift-over).

    Our baseline predictor appears more accurate than competitors and was based on data from COSMIC and up to 30 different types of genomic data sources. The method was benchmarked on independent data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), in addition to other databases. It is able to make predictions in both the coding and non-coding regions of the human cancer genome, though it is much more accurate in coding regions. We furthermore introduced a confidence measure for the predicted class label. By restricting prediction to the highest confidence instances, the resultant classifier can perform at approximately 90% test accuracy (in coding regions), though it is only able to achieve this level of accuracy at a minority of nucleotide positions (about 17% of nucleotide positions). These high confidence predicted potential disease-driver variants are typically clustered by location in the cancer genome and the method highlights exons in 191 autosomal genes such that mutational change could act as a disease-driver.
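
    The trade-off described above, higher accuracy at the cost of covering fewer positions, can be sketched as follows. The sketch assumes that the predictor returns a score in [0, 1] with 0.5 as the decision boundary, and it takes the distance from 0.5 as the confidence; both are assumptions made for illustration, not a description of how CScape defines its confidence measure. The scores, labels and cutoffs are invented.

```python
def accuracy_and_coverage(scores, labels, min_confidence):
    """Evaluate only predictions whose confidence |p - 0.5| reaches the cutoff.

    Returns (accuracy on the retained predictions, fraction retained).
    Accuracy is None if no prediction is retained.
    """
    kept = [(p, y) for p, y in zip(scores, labels)
            if abs(p - 0.5) >= min_confidence]
    if not kept:
        return None, 0.0
    correct = sum(1 for p, y in kept if (p >= 0.5) == (y == 1))
    return correct / len(kept), len(kept) / len(scores)

# Invented scores and true labels (1 = driver, 0 = neutral).
scores = [0.97, 0.91, 0.62, 0.55, 0.48, 0.41, 0.08, 0.03]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for cutoff in (0.0, 0.1, 0.4):
    acc, cov = accuracy_and_coverage(scores, labels, cutoff)
    print(f"cutoff={cutoff:.1f}  accuracy={acc:.2f}  coverage={cov:.2f}")
```

    Raising the cutoff discards the borderline scores, so the retained predictions are more often correct but cover a smaller share of the positions.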

    We have an ongoing research programme to develop more successful predictors within computational cancer genomics, targeting smaller core sets of driver mutations within coding regions, relating mutations to neoantigens, and predicting significant pairwise combinations of mutations in the cancer genome, for example. Interested researchers are welcome to get in contact.

    Finally, we have developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:

  • Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics (2017) 33 (12): 1751-1757. https://doi.org/10.1093/bioinformatics/btx028.

    The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of the gene product, leading to a disease trait.

2. Learning with indefinite kernels (SVM classification)


    Reference:

  • Yiming Ying, Colin Campbell and Mark Girolami. Analysis of SVM with Indefinite Kernels. Advances in Neural Information Processing Systems (NIPS) 22, 2009, p. 2214-2222.
  • Download the MATLAB code

3. Variational Bayesian approach to LPD cluster analysis


    References:

  • Y. Ying, Peng Li and C. Campbell, A marginalized variational Bayesian approach to the analysis of array data, BMC Proceedings 2008, 2 (suppl 4):S7.

  • S. Rogers, M. Girolami, C. Campbell, and R. Breitling, The Latent Process Decomposition of cDNA Microarray Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2 (2005) 143-156.
  • Download the MATLAB code