Why do we sequence gene panels, exomes and complete genomes? To find mutations in genes, of course. This is the currency that allows bioinformaticists and clinicians to determine which metabolic pathways are dysregulated in patients with gene-based disease.
What has surprised me most about NGS, since the development of this wonderful technology for finding variation, has been the under-investment in, and significant ignorance (deliberate or incidental) about, what constitutes a disease-related mutation. There are a number of sources of false positive and false negative mutations, which will invariably affect which abnormal pathways are inferred from mutation data. This has significant implications for personalizing diagnosis and therapy.
So much emphasis has been placed on coding mutations that alter amino acids and create nonsense codons, in part because of the false sense of security conferred by an apparent understanding of changes in protein coding. All of the tools available for inferring pathogenicity from missense changes are to some extent inaccurate, and they generally produce results inconsistent with one another. Even when they are consistent, it is not a guarantee of accuracy*. Leading researchers, recognizing this, have typically shunned incorporation of these results in landmark NGS papers.
Nonsense codons and intronic dinucleotides at exon boundaries recognized in splicing are thought to be the most rigorous evidence for pathogenicity. Many nonsense codons induce exon skipping, and these events frequently preserve the reading frame, which should dampen confidence in the pathogenicity of such mutations and makes them a potential source of false positives. The dinucleotides at splice sites are known to account for only a small fraction of known splicing mutations. We and others have demonstrated that splicing mutations can occur widely throughout genes (in exons and introns), either inactivating splice site and exon recognition, reducing splicing efficiency, or creating novel cryptic splice isoforms. These account for a substantial fraction of unrecognized, highly deleterious mutations in NGS data. Our company, Cytognomix, has released peer-reviewed software that predicts such mutations in genomic sequences (shannonpipeline.cytognomix.com), and companion software (veridical.org) that uses RNASeq data to validate these predictions.
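To make the point about frame-preserving exon skipping concrete: skipping an internal coding exon keeps the downstream reading frame intact only when the exon's length is a multiple of three. The short Python sketch below checks this. The function name, the coordinate convention (1-based, inclusive) and the example exons are hypothetical and chosen only for illustration; they are not part of the Shannon pipeline or Veridical.

```python
def skipping_preserves_frame(exon_start: int, exon_end: int) -> bool:
    """Return True if skipping the exon leaves the downstream reading frame intact.

    Assumes 1-based, inclusive coordinates and an internal exon that lies
    entirely within the coding sequence (assumptions made for this sketch).
    """
    exon_length = exon_end - exon_start + 1
    return exon_length % 3 == 0


# Hypothetical exons, given as (start, end) coordinates.
exons = {"exon_A": (101, 250), "exon_B": (301, 400)}

for name, (start, end) in exons.items():
    status = "in-frame" if skipping_preserves_frame(start, end) else "frameshift"
    print(f"{name}: {end - start + 1} nt, skipping is {status}")
```

A nonsense variant in an exon like the hypothetical exon_A (150 nt) could be removed from the transcript by skipping without shifting the frame, which is the scenario that should lower confidence that the variant is truly inactivating.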
Our ongoing analysis of complete gene sequences in breast cancer has revealed breathtaking levels of sequence variation in introns, promoters and downstream regions that dwarf those seen in exons alone. Variants in promoter regions have been shown to explain expression levels in normal individuals. It seems likely that these gene regions will harbor disease-causing variants in many patients. Using information theory-based methods, we are prioritizing the most likely regulatory mutation candidates. Our aim is to produce a full catalogue of likely disease-causing variants in each patient, to improve our understanding of dysregulated pathways in each individual.
The blog entry has also been posted on the NGSLeaders Discussion Board on Data Analysis and Informatics at www.ngsleaders.org
*“The fact that an opinion has been widely held is no evidence whatever that it is not utterly absurd.”
— Bertrand Russell