Aug. 13, 2018. Differential accessibility to homologous chromosomal loci confirmed by international consortium

A large international consortium based at Harvard University has demonstrated parental homolog-specific differences in chromatin accessibility on human chromosome 19:

Nir et al. bioRxiv doi: 10.1101/374058 (Walking along chromosomes with super-resolution imaging, contact maps, and integrative modeling)

This work reproduces results previously published by CytoGnomix scientists using our patented scFISH™ probes:

Khan et al. Molecular Cytogenetics 2014 7:70  (Localized, non-random differences in chromatin accessibility between homologous metaphase chromosomes)

Khan et al.  Molecular Cytogenetics 2015 8:65 (Reversing chromatin accessibility differences that distinguish homologous mitotic metaphase chromosomes)


June 13, 2018. Article in Fast Forward

The 2018 Impact Report from the Southern Ontario Smart Computing for Innovation Platform (SOSCIP), which supports the development of a supercomputer version of the Automated Dicentric Chromosome Identifier and Dose Estimator (ADCI) system on the IBM Blue Gene/Q, contains an article about our project:

June 7, 2018. Presentations at upcoming international conferences

Population scale biodosimetry with the Automated Dicentric Chromosome Identifier and Dose Estimator (ADCI) software system. (Platform) Rogan PK, Ali S, Li Y, Shirley B, Wilkins R, Flegal F, Cooke R, Peerlaproulx T, Waller E, Knoll JHM. EPR Biodose 2018, June 11-15, Munich, Germany.

Optimization of image selection in Automated Dicentric Chromosome Analysis. Li Y, Shirley B, Wilkins R, Flegal F, Knoll JHM, Rogan PK. EPR Biodose 2018, June 11-15, Munich, Germany.

Predicting exposure to ionizing radiation by biochemically-inspired genomic machine learning. Rogan PK, Zhao JZL, and Mucaki EJ. EPR Biodose 2018, June 11-15, Munich, Germany.

Comprehensive prediction of responses to chemotherapies by biochemically-inspired machine learning. (Best Poster session) Rogan PK, Zhao JZL, Mucaki EJ. European Society of Human Genetics 2018, June 16-19, Milan, Italy.

May 29, 2018. Change to URL

As of today, we have transitioned the website to:

The site contains our published articles, lectures and presentations about human genetics and molecular biology. All of the legacy content at this site (1980-2007) will be preserved on the new site.

Please update your browser bookmarks to reflect this change.

March 29, 2018. Mutation Forecaster: Key to discovery of mutations in novel ALS gene

CytoGnomix’s Mutation Forecaster: Key to discovery of mutations in novel ALS gene. Nicolas et al. Genome-wide Analyses Identify KIF5A as a Novel ALS Gene. Neuron 97:1268-1283.e6, 2018

From our subscriber, Dr. John Landers (U. Mass. School of Medicine):

“We used the application ASSEDA (Automated Splice Site and Exon Definition Analyses) to predict any mutant mRNA splice isoforms resulting from these variants (Mucaki et al. 2013). This algorithm was chosen as it is known to have high performance in splice prediction (Caminsky et al., 2014). ASSEDA predicted a complete skipping of exon 27 for all variants, yielding a transcript with a frameshift at coding amino acid 998, the deletion of the normal C-terminal 34 amino acids of the cargo-binding domain, and the extension of an aberrant 39 amino acids to the C terminus (Table 3; Figures 4B and 4C). The presence of transcripts with skipped exon 27 was demonstrated by performing…”


April 9, 2018. Upcoming release of Automated Dicentric Chromosome Identifier and Dose Estimator (ADCI)

We have just completed porting Windows ADCI from a 32-bit MinGW C++ build to a 64-bit version compiled with Microsoft’s C++ compiler. The summer 2018 release of the software will contain this new version (v2.0). ADCI now has access to 8 GB of runtime memory, which should allow twice as many samples to be batch processed in a single run. We estimate that at least 700 samples of 500 images each can be analyzed in unattended operation.
Our primary motivation for building the 64-bit version was to write new code that exploits the onboard Nvidia graphics processor (GPU) in the gaming computers that we recommend. Benchmarking of the Gradient Vector Flow code (which defines chromosome objects) indicates a speedup of ~20% when the main CPU is linked to the GPU. The Integrated Intensity Laplacian module (which measures chromosome cross-sectional width) will also benefit from the GPU link. GPU acceleration will be available later in the year.

Request a quote.

March 13, 2018. Oral presentation on chemotherapy response in Best Poster session at ESHG 2018

On behalf of the Scientific Programme Committee of the European Conference of Human Genetics 2018 taking place in Milan, Italy from June 16 to June 19, 2018, we are pleased to inform you that the abstract entitled:

‘Comprehensive prediction of responses to chemotherapies by biochemically-inspired machine learning’

(Control No. 2018-A-2095-ESHG)

was among the best scored papers accepted for a poster presentation. Best Poster session takes place on Sunday, June 17, 2018 13:00 hrs, and consists of a 3 minute presentation followed by discussion at your electronic poster.

February 19, 2018. Article on genomic signature of radiation exposure

A manuscript describing accurate genomic signatures of radiation exposure will be published shortly by F1000Research.

Jonathan ZL Zhao, Eliseos J Mucaki, Peter K Rogan. Predicting Exposure to Ionizing Radiation by Biochemically-Inspired Genomic Machine Learning, F1000Research, in press.


Background: Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. These signatures may not be sufficiently robust for large scale testing, as their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning that are strictly validated using k-fold and traditional approaches.

Methods: Gene Expression Omnibus (GEO) datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation, and the expression of genes implicated in the literature as responsive to radiation exposure (n=998) was then ranked by Minimum Redundancy Maximum Relevance (mRMR). Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM), and validated using k-fold or traditional validation on independent datasets.

Results: The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (DDB2, PRKDC, TPP2, PTPRE, and GADD45A) when validated over 209 samples and traditional validation accuracies of up to 92% (DDB2, CD8A, TALDO1, PCNA, EIF4G2, LCN2, CDKN1A, PRKCH, ENO1, and PPM1D) when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans). We compiled a list of the most frequently appearing genes in the top 20 human and mouse signatures. More frequently appearing genes among an ensemble of signatures may indicate greater impact of these genes on the performance within individual signatures. Several genes in the signatures we derived are present in previously proposed signatures.

Conclusions: Gene signatures for ionizing radiation exposure derived by machine learning have low error rates in externally validated, independent datasets, and exhibit high specificity and granularity for dose estimation.
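
The feature selection step named in the Methods above can be illustrated with a minimal sketch of forward sequential selection scored by k-fold validation. This is not the published pipeline: a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library, the mRMR ranking step is omitted, and the function names and the six-sample toy dataset are hypothetical.

```python
from statistics import mean

def kfold_indices(n, k):
    """Split range(n) into k interleaved folds (deterministic, no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]

def centroid_predict(train_X, train_y, x):
    """Nearest-centroid classifier: a simple stand-in for the SVM (assumption)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_accuracy(X, y, features, k=3):
    """k-fold accuracy using only the selected feature (gene) columns."""
    sub = [[row[f] for f in features] for row in X]
    correct = 0
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train_X = [sub[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct += sum(centroid_predict(train_X, train_y, sub[i]) == y[i] for i in fold)
    return correct / len(X)

def forward_selection(X, y, k=3):
    """Greedily add the feature that most improves k-fold accuracy."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        # Ties are broken in favour of the lowest feature index.
        score, neg_f = max((cv_accuracy(X, y, selected + [f], k), -f) for f in remaining)
        if score <= best:
            break
        feat = -neg_f
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best

# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X = [[1.0, 3.0], [5.0, 3.1], [1.2, 2.9], [4.8, 3.0], [0.9, 3.2], [5.1, 2.8]]
y = [0, 1, 0, 1, 0, 1]
print(forward_selection(X, y))  # ([0], 1.0)
```

Backward selection would start from the full gene set and remove one gene at a time under the same scoring function.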

February 7, 2018. Accepted presentations at EPR Biodose (Munich, June, 2018)

Ali S, Li Y, Shirley B, Wilkins R, Flegal F, Rogan PK, Knoll JHM. Population scale biodosimetry with the Automated Dicentric Chromosome Identifier and Dose Estimator (ADCI) software system. [Platform]

Rogan PK, Zhao JZL, and Mucaki EJ. Predicting exposure to ionizing radiation by biochemically-inspired genomic machine learning. [Poster]

Li Y, Shirley B, Wilkins R, Flegal F, Knoll JHM, Rogan PK. Optimization of image selection in Automated Dicentric Chromosome Analysis. [Poster]

May 4, 2015. Comment on PMID 23348723. Prediction of mutant mRNA splice isoforms by information theory-based exon definition.

Peter Rogan 2015 May 04 6:14 p.m.

The Logic and Formulation of Exon Definition for Splice and Splicing Regulatory Sites with Negative Information Content. PK Rogan, EJ Mucaki

Update on: Mucaki EJ, 2013 and the Automated Splice Site and Exon Definition Analysis server (ASSEDA).

In Mucaki EJ, 2013, we described a method of predicting the overall strength of an exon by calculating its total information content (Ri,total) from the sum of the Ri values of its donor and acceptor splice sites, adjusted for their gap surprisal (the self-information of the distance between the two sites). Differences between Ri,total values (ΔRi,total) are predictive of the relative abundance of these exons in distinct processed mRNAs.
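
As a rough illustration of this calculation, the sketch below treats the gap surprisal as the self-information, -log2(p), of the distance between the two sites and subtracts it from the summed site strengths as a penalty. The function names, the subtraction convention, and the example values are assumptions made for illustration; they are not taken from the ASSEDA source code.

```python
import math

def gap_surprisal(p_distance: float) -> float:
    """Self-information (bits) of an acceptor-donor distance with probability p_distance."""
    if not 0.0 < p_distance <= 1.0:
        raise ValueError("p_distance must be in (0, 1]")
    return -math.log2(p_distance)

def ri_total(ri_acceptor: float, ri_donor: float, p_distance: float) -> float:
    """Total exon information: summed site strengths minus the gap surprisal (assumed sign)."""
    return ri_acceptor + ri_donor - gap_surprisal(p_distance)

# Hypothetical exon: 10-bit acceptor, 8-bit donor, distance observed with probability 0.25.
print(ri_total(10.0, 8.0, 0.25))  # 16.0
```

A rarer distance (smaller probability) gives a larger surprisal and therefore a weaker predicted exon.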

Splice sites altered by mutations that prevent stable interaction with spliceosomes are said to be abolished. Information theory predicts abolition of binding when site strength falls below the minimum binding affinity, Ri,minimum, which is empirically derived. This value is slightly above zero bits, the theoretical minimum for binding at equilibrium (ΔG = 0; Schneider TD, 1997). Sites with Ri < 0 are not bound, because forming stable interactions would be endergonic (ΔG > 0). This raises the question of whether, when predicting the change in exon strength (ΔRi,total) due to a mutation that inactivates binding, mutant sites with varying degrees of negative information content are energetically distinguishable from one another.

The computation of Ri,total contains the sum of the Ri values of component binding sites, irrespective of their initial or final strengths. Thus, a mutated site with Ri << 0 would result in a greater ΔRi,total than a site with Ri ~ 0. To assess whether the degree of unfavorable binding should be applied to the exon definition calculation, or whether values below 0 bits should be treated like a binding site at equilibrium (Ri ~ 0), we reevaluated experimentally validated natural and regulatory splicing mutations in our paper with both approaches. Ri,total was calculated for 10 variants from Supplementary Table 2, both including and excluding the negative information (i.e., Ri < 0 vs. Ri = 0) of inactivated splice sites. Mutation #2 of Supplementary Table 2 [ADA:g.43249658G>A] abolishes a natural donor site, reducing it from 8.8 to -9.9 bits. When the full decrease in strength is applied (ΔRi,total: -18.7 bits), the natural exon strength decreases from 21.0 to 2.3 bits. When the negative information content is set to zero bits, the change is significantly smaller (21.0 -> 12.2 bits; ΔRi,total = -8.8 bits). When a weak natural splice site is abolished, the difference, expressed as ΔRi,total, can be quite small (Mutation #9; -14.8 vs -3.1 bits). In the case of Mutation #38, the reduction in ΔRi,total leads to a partially discordant prediction in which the abolished natural exon is weaker than the experimentally confirmed activated cryptic exon. Results for this mutation were concordant with the published version when the negative bit value of the mutated natural site was included in the calculation.
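
The two approaches compared for Mutation #2 can be reproduced with a few lines of arithmetic. The sketch below uses only the values quoted above; the function name and the `clamp_negative` flag are illustrative and do not correspond to an ASSEDA option.

```python
def delta_ri_total(ri_before: float, ri_after: float, clamp_negative: bool) -> float:
    """Change in exon strength (bits) caused by a change in one splice site."""
    if clamp_negative:
        # Treat an abolished site (Ri < 0) like a site at equilibrium (Ri = 0).
        ri_after = max(ri_after, 0.0)
    return ri_after - ri_before

# Mutation #2 (ADA donor site): 8.8 -> -9.9 bits; natural exon Ri,total = 21.0 bits.
for clamp in (False, True):
    delta = delta_ri_total(8.8, -9.9, clamp)
    print(f"clamp={clamp}: dRi,total = {delta:.1f} bits, exon = {21.0 + delta:.1f} bits")
# clamp=False: dRi,total = -18.7 bits, exon = 2.3 bits
# clamp=True: dRi,total = -8.8 bits, exon = 12.2 bits
```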

The impact of mutations in splicing regulatory (SR) factor binding sites can also be predicted with ASSEDA, where the Ri of the SR binding site is added to Ri,total, along with a secondary gap surprisal value for the particular SR protein. These sites can also be abolished. But when an SR protein binding site is no longer active, should the SR gap surprisal still be applied?

We tested mutations from Mucaki EJ, 2013 (Supplementary Table 4), which abolish the splicing enhancer SF2/ASF with and without the SR protein gap surprisal when Ri of the SR site is < 0 bits. The removal of the gap surprisal term for Mutation #2 of Supplementary Table 4 leads to a discordant prediction, where the ΔRi is less than the SR gap surprisal at that distance and therefore the ΔRi,total is positive. As experimental evidence shows an increase in skipping, it is a discordant prediction. Therefore, the gap surprisal is still applied in the computation of both initial and final Ri,total values when the SR protein of interest is abolished as the site is naturally present and therefore expected for binding. Conversely, when we apply the gap surprisal to the initial Ri,total for a splicing factor that is being created, we are essentially applying a penalty for a site that does not normally exist. Therefore, we no longer apply the SR gap surprisal value to the initial Ri,total in these cases.

The revised Ri,total values of SR binding site mutations differ slightly from those reported in Mucaki EJ, 2013 (Supplementary Table 4). This is because the gap surprisal distributions were recomputed for SF2/ASF, SC35 and SRp40, using updated versions of these models based on CLIP-Seq data (Blin K, 2015; Khorshid M, 2011). This resulted in small changes to the distributions for SF2/ASF and SC35; however, changes for SRp40 were significant, and its distribution now more closely resembles the other gap surprisal functions. The updated graphs of distance vs. gap surprisal are available at: While this should not significantly affect ΔRi,total values, it may affect the initial and final Ri,total values.

Oct. 1, 2017. Comment on PubMed PMID 28949076: Rules and tools to predict the splicing effects of exonic and intronic mutations. In: PubMed Commons [Internet]. Bethesda (MD): National Library of Medicine; 2017 Sep 26

Peter Rogan 2017 Oct 01 8:57 p.m.

We would like to alert readers to the fact that information theory-based splicing mutation analysis has been used to analyze a wide range of variants (in/dels and SNVs) that affect splicing in introns and exons in peer reviewed studies. These tools have also been used to analyze mutations that alter branchpoint recognition and mutations within introns. The Automated Splice Site and Exon Definition Analysis server, ASSEDA (Mucaki EJ, 2013), analyzes mutations at branchpoints, within intronic sequences, at cryptic splice sites, and at splicing regulatory protein binding sites (“enhancer/silencer” sequences). We have also published the Shannon pipeline (Shirley BC, 2013), which carries out analysis of mutations affecting splicing (and transcription factor binding sites; Lu R, 2017) on a genome scale. Veridical is software that validates splicing mutations found with the Shannon pipeline (or any other program) with RNASeq data from the same individual (Viner C, 2014; Dorman SN, 2014).

Our previous review article extensively describes the use of these tools for splicing mutation analysis by many other research groups, besides ourselves (Caminsky N, 2014).

Dec. 7, 2017. Rogan PK, Mucaki EJ. Comment on PMID 29185120: Characterization of a novel germline BRCA1 splice variant, c.5332+4delA. In: PubMed Commons [Internet]. Bethesda (MD): National Library of Medicine; 2017 Nov 28 [cited 2017 Dec 7].

Peter Rogan 2017 Dec 07 5:24 p.m.

We have analyzed this mutation with the Automated Splice Site and Exon Definition Analysis server (ASSEDA). The 1 nt deletion in the splice donor of exon 20 reduces the strength of this site from 11.5 -> 4.1 bits (100/2^7.4 = 0.6% of the original binding affinity).
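
The percentage follows from the relationship used throughout these comments, in which each bit of information lost corresponds to a two-fold reduction in binding affinity. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def residual_affinity_percent(ri_before: float, ri_after: float) -> float:
    """Binding affinity remaining after a mutation, as a percentage of the original."""
    return 100.0 / 2 ** (ri_before - ri_after)

# BRCA1 exon 20 donor site: 11.5 -> 4.1 bits, a loss of 7.4 bits.
print(f"{residual_affinity_percent(11.5, 4.1):.1f}%")  # 0.6%
```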

The information theory-based approach used in ASSEDA predicts isoform abundance and computes the fold changes in binding affinity from mutations (Mucaki EJ, 2013), which corresponds to the degree of exon skipping in this case. The reduction in splice site strength is much greater than the estimates given by the ad hoc methods used in the paper. LOH was not complete; some of the observed expression may have been derived from the contaminating normal allele. In fact, had the loss of function in splice site recognition only been 25-40% according to the paper, it could have been classified as a variant of unknown significance, or possibly as benign (as we suggested in Mucaki EJ, 2011).

Dec 12, 2017. Comment on PubMed PMID 23169495: Analysis of the effects of rare variants on splicing identifies alterations in GABAA receptor genes in autism spectrum disorder individuals.

Rogan PK, Mucaki EJ. Comment on PMID 23169495: Analysis of the effects of rare variants on splicing identifies alterations in GABAA receptor genes in autism spectrum disorder individuals. In: PubMed Commons [Internet]. Bethesda (MD): National Library of Medicine; 2012 Nov 21 [cited 2017 Dec 12].

Peter Rogan 2017 Dec 12 09:53 a.m.

Regarding GABRQ:c.306G>C: Whereas none of the splicing analysis programs tested predicted the outcome observed with the mini-gene construct in Figure 2A, information theory-based exon definition analysis using ASSEDA (Mucaki EJ, 2013) was completely concordant. A novel band 116nt longer than the product expected from the wild type exon is observed. The mutation reduces the strength of the natural donor splice site of exon 3 from 9.5 -> 4.5 bits (32 fold). The pre-existing intronic cryptic site 116 nt downstream (8.6 bits) is 17 fold stronger than the mutated splice site. ASSEDA indicates that the total exon information (Ri,total) of the wildtype exon is reduced (19.8 -> 14.8 bits) and the corresponding strength of the gap-surprisal adjusted cryptic exon significantly exceeds this (17.7 bits). The wildtype exon is predicted to be ~5-6 fold more abundant than the cryptic exon BEFORE mutation, and the cryptic exon is predicted to be ~8 fold more abundant AFTER mutation.
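
The site-level fold differences quoted above can be checked by raising 2 to the difference in bits. The sketch below covers only the two splice site comparisons; the function name is illustrative.

```python
def fold_difference(ri_stronger: float, ri_weaker: float) -> float:
    """Fold difference in binding strength implied by a difference in bits."""
    return 2 ** (ri_stronger - ri_weaker)

# Natural donor before vs. after mutation: 9.5 vs 4.5 bits.
print(round(fold_difference(9.5, 4.5)))  # 32
# Cryptic donor (8.6 bits) vs. mutated natural donor (4.5 bits).
print(round(fold_difference(8.6, 4.5)))  # 17
```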

Jan. 13 and 21, 2018. Comments on PubMed PMID 29280214: Thorough in silico and in vitro cDNA analysis of 21 putative BRCA1 and BRCA2 splice variants and a complex tandem duplication in BRCA2, allowing the identification of activated cryptic splice donor sites in BRCA2 exon 11.

We have posted a comment in PubMed Commons about Baert et al. “Thorough in silico and in vitro cDNA analysis of 21 putative BRCA1 and BRCA2 splice variants and a complex tandem duplication in BRCA2, allowing the identification of activated cryptic splice donor sites in BRCA2 exon 11.” (2017) (doi: 10.1002/humu.23390). The updated comments can be found at: They have been highlighted twice by PubMed Commons as a “Top Comment”.

NB: We have exchanged views with Dr. Claes (senior author), who has inquired about our NGS pipeline for splicing mutation analysis, MutationForecaster (

Peter Rogan 2018 Jan 12 2:39 p.m. (edited)

Twenty one BRCA1 and BRCA2 mRNA splice site variants were analyzed by semi-quantitative RT-PCR, with commercial software that scores putative splice sites by ad hoc methods, and with bioinformatic models based on Adaboost and Random Forest, which are general machine learning approaches. The authors cited our review on interpretation of splicing mutations (Caminsky N, 2014); however, the analytic approach described in that paper was not evaluated. As an update to our previous BRCA mutation study (Mucaki EJ, 2011), we carried out information theory-based splicing analysis of all potential splicing mutations listed in Supplemental Table S3. The splicing consequences of all variants were accurately predicted by information analysis. We also report results of exon definition-based mRNA splicing mutation analysis (Mucaki EJ, 2013), which infers relative abundance of wild type and mutated splice isoforms from the total splicing information content of each prospective exon. Due to length limitations in the PubMed Commons commenting system, detailed results for each variant are described in:

Also, during our analysis, some inconsistencies in mutation designation or interpretation were noted in the paper: (1) The complex BRCA2 duplication described in this article (c.425+415_4780dup[insGATCGCAGTGA]) is sometimes referred to as “c.426-415_4780dup[insGATCGCAGTGA]” (e.g. the title of Figure 5, and Suppl. Table S3), which are not congruent mutations. The true mutation is likely the former, as the Figure 5 legend describes an mRNA splice form that includes 293nt of intron 4. If the duplication were c.426-415_4780dup[insGATCGCAGTGA], the intron inclusion would only be 205nt long. (2) We report an additional inconsistency with regard to Figure 5: The legend of Figure 5E describes a splice form where a truncated exon 11 junctions with the aforementioned 11nt insertion. However, the diagram and the electropherogram in Figure 5E show exon 11 (ending at c.2398) sharing a junction with the beginning of exon 5. The latter is most likely the correct isoform, as an acceptor is not predicted at the junction between c.4780 and the 11nt insertion.

  • Kathleen B M Claes 2018 Jan 17 10:44 a.m.

    Dear dr Rogan, thank you very much for your constructive comments. It is very interesting to learn that your exon definition-based mRNA splicing analyses are in agreement with our cDNA analyses for all variants we studied (an overview is provided in Suppl Table S1 of our paper – not S3). I read the detailed comments on the URL you referred to. How easy can this approach be implemented in an NGS data analysis pipeline? Can you define cut-offs in this program to indicate when cDNA analysis is warranted?

    I also would like to thank you for alerting us about the typing error for the Multi-exon duplication in BRCA2 – the correct nomenclature for this duplication is indeed c.426+415_4780dup{insGATCGCAGTGA}. We corrected this in the final proofs.

    • Peter Rogan 2018 Jan 21 1:20 p.m. (edited)

      The results reported in Table S1 of the different bioinformatic methods were difficult for us to assess. For example, why were there no bioinformatic analyses for c.426+415_4780dup(insGATCGCAGTGA)? Our analysis includes this mutation. Model cutoffs for these bioinformatic methods are defined arbitrarily because they are based on underlying datasets with unpublished or unknown content; furthermore, the binding site models are not easily reproduced, in part because they are not actually based on binding site affinities (Rogan PK, 2013).

      The details of the methods and source data we use to derive our information weight matrices, and the matrices themselves, are available (Rogan PK, 2003). The information contents of splice recognition sites or exons are expressed in units of bits, which have been formally proven to be related to binding site affinity through the second law of thermodynamics (Schneider TD, 1997; Rogan PK, 1998). In fact, relative entropy, which is used by MaxEntScan, violates the triangle inequality, which is a fundamental requirement of the second law (Schneider TD, 1999). These articles demonstrate that the cutoff for true binding sites is very close to the theoretical minimum of zero bits (ΔG = 0). We have also demonstrated that this thermodynamic threshold holds for other types of binding sites (Lu R, 2017).

      Our pipeline for NGS data analysis has been validated extensively (Shirley BC, 2013; Viner C, 2014; Dorman SN, 2014; Caminsky NG, 2016; Mucaki EJ, 2016; Yang XR, 2017; Dos Santos ES, 2017). The URL of the MutationForecaster pipeline is given in the document linked to our previous PubMed Commons post.

December 12, 2017. New preprint predicting response to platin drugs

Mucaki et al. Predicting Response to Platin Chemotherapy Agents with Biochemically-inspired Machine Learning. bioRxiv.


Selection of effective genes that accurately predict chemotherapy response could improve cancer outcomes. We derive optimized gene signatures for response to common platinum-based drugs, cisplatin, carboplatin, and oxaliplatin, and respectively validate each with bladder, ovarian, and colon cancer patient data. Initially, using breast cancer cell line gene expression and growth inhibition (GI50) data, we performed backwards feature selection with cross-validation to derive predictive gene sets in a supervised support vector machine (SVM) learning approach. These signatures were also verified in bladder cancer cell lines. Aside from published associations between drugs and genes, we also expanded these gene signatures using a systems biology approach. Signatures at different GI50 thresholds distinguishing sensitivity from resistance to each drug contrast the contributions of different genes at extreme vs. median thresholds. An ensemble machine learning technique combining different GI50 thresholds was used to create threshold independent gene signatures. The most accurate models for each platinum drug in cell lines consisted of cisplatin: BARD1, BCL2, BCL2L1, CDKN2C, FAAP24, FEN1, MAP3K1, MAPK13, MAPK3, NFKB1, NFKB2, SLC22A5, SLC31A2, TLR4, TWIST1; carboplatin: AKT1, EIF3K, ERCC1, GNGT1, GSR, MTHFR, NEDD4L, NLRP1, NRAS, RAF1, SGK1, TIGD1, TP53, VEGFB, VEGFC; and oxaliplatin: BRAF, FCGR2A, IGF1, MSH2, NAGK, NFE2L2, NQO1, PANK3, SLC47A1, SLCO1B1, UGT1A1. Prediction of recurrence after 18 months in cisplatin-treated bladder urothelial carcinoma patients from The Cancer Genome Atlas (TCGA) was 71% accurate (59% in disease-free patients). In carboplatin-treated ovarian cancer patients, prediction of recurrence was 60.2% accurate (61% for disease-free patients) after 4 years, while the oxaliplatin signature predicted disease-free colorectal cancer patients with 72% accuracy (54.5% for recurrence) after 1 year.
The best performing cisplatin model best predicted outcome for non-smoking TCGA bladder cancer patients (100% accuracy for recurrent, 57% for disease-free; N=19), whereas the median GI50 model (GI50 = 5.12) predicted outcome in smokers (79% accuracy for recurrent, 62% for disease-free; N=35). Cisplatin and carboplatin signatures comprised overlapping gene sets and GI50 values, which contrasted with models for oxaliplatin response.

December 1, 2017. New bespoke variant interpretation service

MutationForecaster® is now available as a custom analysis service that we perform on your data. We’ve listened to you: let us take on the task of performing information theory-based analysis for you. We now offer a Bespoke service that provides fully documented reports based on analysis of variants that you submit to us. Please click on the Learn More link below for more information. Our tools for non-coding variant interpretation utilize an information theory-based approach only available through CytoGnomix. No other service provides the patented, molecular diagnostic information that MutationForecaster® generates.

November 20, 2017. New article on inherited breast and ovarian cancer accepted for publication

Collaboration with a French consortium to study non-coding variants in BRCA1 and BRCA2 in patients with a family history of breast and ovarian cancer:

Santana dos Santos, E., Caputo, S.M., Castera, L., Gendrot, M., Briaux, A., Breault, M., Krieger, S., Rogan, P.K., Mucaki, E.J., Bieche, I., Houdayer, C., Vaur, D., Stoppa-Lyonnet, D., Brown, M., Lallemand, F., Rouleau, E. Assessment of functional impact of germline BRCA1/2 variants located in noncoding regions in families with breast and/or ovarian cancer predisposition, Breast Cancer Research and Treatment, in press.