Tiange Cui, PhD


Wuhan Institute of Technology, B.E. (Bioengineering)

Rutgers University, M.S. (Biology)

George Mason University, Ph.D. (Bioinformatics)

      

Extensive WGS, WES, RNA-Seq, and microarray data analysis

PreDist: A distance-based model as
potential diagnostic and prediction
classifiers for human diseases


PI: Dr. Ancha Baranova

Sep. 2013 - May 2018
  • Typically, gene expression biomarkers are discovered in the course of high-throughput experiments such as RNA-Seq or microarray profiling. However, gene signatures often fail to reproduce: when the algorithm is run on a different training set, the resulting diagnostic signature may turn out to be completely different. Even commercial biomarker panels designed for the same purpose often show very little overlap.

  • Many research papers have shown that the traditional way of picking informative biomarkers is problematic. Dr. Domany even showed that the ranking of differentially expressed genes matters little: the gene set ranked 1-70 has the same predictive power as the sets ranked 211-280 or even 701-770. Analytic pipelines that extract so-called gene signatures suffer from the "curse of dimensionality": the number of genes expressed exceeds the number of patients who can be enrolled in the study and used to train the discriminator algorithm.

  • A good analogy is the ranking of runners in three consecutive competitions. Only a few extraordinary professional runners will always finish in the top places, because they are consistently much faster than the amateurs. The majority are ordinary runners whose performance depends on their personal condition on a given day, so it is common for the top 50 runners to vary significantly across the three competitions. Likewise, at least 2,300 samples are needed to achieve over 50% overlap using Van't Veer's 70-gene classifier. To pick the 50 best runners with confidence, the same runners would have to race thousands of times, which would never happen in real life.

  • One way to address this problem is to place samples in a multi-dimensional space. Each gene in the microarray or RNA-Seq panel represents one dimension, so each sample becomes a data point in this space. When samples from a sequencing experiment are added to the same space, they occupy different positions, and their distances from one another reflect their overall gene expression profiles. Instead of picking the top-ranked genes, this distance-based model does not discriminate among genes but uses every gene in the panel.

  • PreDist uses the absolute Pearson correlation distance to convert overall gene expression values into a distance metric.

  • An example of how PreDist can separate normal vs cancer samples. mRNA-Seq breast cancer samples are from TCGA.

  • An example of how PreDist can separate samples under different treatments (senescence states). Affymetrix data for human foreskin BJ fibroblast samples are from GEO.

  • An example of how PreDist can find potential outliers. Peripheral blood mononuclear cells (PBMCs) taken after kidney transplantation and after the start of immunosuppressive drugs were used. Two samples showed extremely high expression of miR-612, which has been shown to suppress the invasive-metastatic cascade in hepatocellular carcinoma and to suppress the stemness of liver cancer via Wnt/β-catenin signaling; miR-612 may also play an important part in kidney function.

  • An example of how PreDist can group samples better. The high-density lipoprotein (HDL) proteome was analyzed by mass spectrometry in human subjects who had undergone coronary computed tomography angiography to quantify non-calcified atherosclerosis burden (NCB). With the help of PreDist, the discovery cohort can be further grouped by protein expression profile rather than by NCB score alone.
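The distance step described above can be illustrated with a minimal sketch. It assumes the "absolute Pearson correlation distance" is defined as d = 1 - |r|, where r is the Pearson correlation between two samples computed across all genes; this definition, the function name abs_pearson_distance, and the toy expression values are illustrative assumptions, not the PreDist implementation itself.

```python
import numpy as np


def abs_pearson_distance(expr):
    """Pairwise distance 1 - |r| between samples.

    expr: array of shape (n_samples, n_genes); each row is one sample,
    each column is one gene (one dimension of the expression space).
    Assumption: the distance is 1 - |Pearson r|; a sample with constant
    expression across genes would yield NaN and should be filtered first.
    """
    expr = np.asarray(expr, dtype=float)
    r = np.corrcoef(expr)            # rows are treated as variables (samples)
    dist = 1.0 - np.abs(r)
    np.fill_diagonal(dist, 0.0)      # remove floating-point noise on the diagonal
    return np.clip(dist, 0.0, 1.0)


# Hypothetical toy data: 4 samples measured on 4 genes.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0],   # sample 0
    [2.0, 4.0, 6.0, 8.0],   # sample 1: same shape as sample 0, scaled
    [4.0, 3.0, 2.0, 1.0],   # sample 2: mirror image of sample 0
    [1.0, 3.0, 2.0, 5.0],   # sample 3: partly similar to sample 0
])
dist = abs_pearson_distance(profiles)
print(np.round(dist, 3))
```

Note that under this definition a perfectly anti-correlated pair (samples 0 and 2) also has distance 0, because only the magnitude of the correlation is used. The resulting matrix could then be passed to any clustering or ordination method that accepts precomputed distances.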

Viral Genomics

Genomic analysis of a large set of
currently- and historically- important
human adenovirus pathogens.


PI: Dr. Donald Seto

Sep. 2015 - Oct. 2017
  • Several members of the human adenovirus (HAdV) family are important human pathogens. They are agents of acute respiratory disease (ARD). While ARD has a low mortality rate, especially in the military training population, the morbidity rate is high. As pathogens, HAdVs may cause asymptomatic, mild, or severe symptomatic disease upon infection of the respiratory, ocular, gastrointestinal, and genitourinary systems. High-resolution genomic data have enhanced the understanding of HAdV epidemiology, with recombination recognized as a major pathway in the molecular evolution and genesis of emergent HAdV pathogens.

  • Gene annotation for three major genes in HAdV, the Penton (P), Hexon (H), and Fiber (F), was performed with in-house scripts and the Genome Annotation Transfer Utility (GATU; https://virology.uvic.ca/virology-ca-tools/gatu/).

  • Whole-genome phylogenetic trees were generated for HAdV-A through C and HAdV-D complete genomic sequences. Trees were constructed by maximum-parsimony analysis, following alignment with the ClustalW tool in the Molecular Evolutionary Genetics Analysis version 6 software package (MEGA6; www.megasoftware.net/), with a bootstrap test of 1000 replicates, default parameters, and the Tree-Bisection-Reconnection (TBR) method. Each clade is highlighted according to HAdV species using the color code shown. Novel sequences are shown in blue and labeled with GenBank accession numbers (acc. nos.). Bootstrap values are displayed on the branches, and nodes for the HAdV species used as references are shown in red.

  • Nucleotide sequence identity and diversity analysis for HAdV species (A–D) prototype sequences. “Box-and-whiskers” plots show percent nucleotide identity of prototype HAdV complete genome sequences. The “box” represents the median and interquartile range (IQR), and the “whiskers” show the minimum and maximum values. The spacing between the boxes indicates the degree of spread. Nucleotide diversity (π) plots show the average number of nucleotide differences per site along each gene for HAdV-A through D, calculated for penton base, hexon, and fiber prototype sequences. The plots were constructed using DnaSP v5 (http://www.ub.edu/dnasp/), with a 100-nucleotide window and a 25-nucleotide step size.

  • Recombination and structural modeling analysis of a putative novel penton base gene in HAdV-D72. SimPlot (http://sray.med.som.jhmi.edu/SCRoftware/simplot/) analysis demonstrates the genetic distances to HAdV reference sequences across the penton base gene. Default parameter settings for the SimPlot software were used for analyzing the hexon sequences: window size (200 nucleotides [nt]), step size (20 nt), replicates used (n = 100), gap stripping (on), distance model (Kimura), and tree model (neighbor-joining). Bootscan analysis demonstrates phylogenetic relationships to the reference strains. Homology modeling of the penton base shows the HAdV-D72 amino-acid sequence (blue) superimposed on that of HAdV-D37 (cyan). Superimposition of the structural models showed significant structural variation in the location of the RGD motif (indicated in red). The homology model was built in Swiss ExPASy (http://swissmodel.expasy.org) using the crystal structure of HAdV-C2 as the template (PDB code 1X9TA).

  • Maximum-parsimony phylogenetic analysis of the fiber genes identified two novel fiber genes (KF268207, HAdV-D71, and KF268355, D72, red dots) with sequences that diverged significantly from known HAdV-D type sequences.

  • Our study showed that finding recombination among HAdV-Ds was much more common than finding novel capsid genes, supporting the primacy of homologous recombination in the molecular evolution of HAdV-D. Our data further support evolution of HAdV-B by homologous recombination, but less assuredly of HAdV-C. Our results, taken together with recent descriptions of putatively new HAdV types causing serious diseases, highlight HAdV evolution as a persistent threat to public health.
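The sliding-window nucleotide diversity (π) calculation mentioned above can be sketched as follows. This is a simplified illustration, not the DnaSP algorithm: it assumes a gap-free alignment of equal-length sequences and defines π in each window as the mean number of pairwise differences per site. The function names and the toy sequences are hypothetical; the defaults mirror the 100-nucleotide window and 25-nucleotide step quoted in the text.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site for equal-length aligned sequences."""
    pairs = list(combinations(seqs, 2))
    length = len(seqs[0])
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)


def sliding_pi(seqs, window=100, step=25):
    """Return (window midpoint, pi) tuples along the alignment.

    Assumption: the alignment contains no gaps or ambiguous bases;
    real tools handle such sites explicitly.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        profile.append((start + window // 2, nucleotide_diversity(chunk)))
    return profile


# Hypothetical toy alignment; a small window is used so the result is easy to check.
toy_alignment = ["AAAAAAAA", "AAAATAAA", "AAAATTAA"]
profile = sliding_pi(toy_alignment, window=4, step=2)
for midpoint, pi in profile:
    print(midpoint, round(pi, 3))
```

Plotting π against the window midpoint gives the kind of per-gene diversity curve described for the penton base, hexon, and fiber sequences.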

Metagenomics

Bacterial and Archaeal abundances with
Acetate production, Biogas and CH4/CO2
composition from various coal samples.


PI: Dr. Patrick Gillevet

Sep. 2014 - Mar. 2015
  • Previous analysis has shown that coal samples collected by the company have good potential to generate acetate, biogas, and CH4/CO2. Over an 82-day period, a cumulative CH4/CO2 value of 1.8 and a maximum acetic acid concentration of 2749 ppm were detected from the sample. Further analysis revealed that the most abundant bacterial phyla are Firmicutes, Thermotogae, Bacteroidetes, and Proteobacteria, and the most abundant archaeal genera are Methanosarcina and Methanobacterium. To further investigate the microbial communities from the west side of Australia and from Victoria, Australia, NGS-based 16S rRNA sequencing was used to identify bacteria and archaea, and their functional abundances, in newly collected coal samples.

  • Krona and QIIME were used to generate the organism and functional abundances. MetaPhlAn was used to generate the heatmap of the 25 most abundant species.
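As a rough illustration of how a "top 25 most abundant" list can be derived before plotting a heatmap, the sketch below converts per-sample read counts to relative abundances and ranks taxa by their mean across samples. It does not call QIIME, Krona, or MetaPhlAn; the function name top_taxa, the sample names, and the counts are hypothetical, and ranking by mean relative abundance is an assumption about the selection criterion.

```python
def top_taxa(counts, n=25):
    """Rank taxa by mean relative abundance across samples.

    counts: {sample_name: {taxon_name: read_count}}
    Returns a list of (taxon, mean relative abundance), highest first.
    """
    taxa = sorted({taxon for sample in counts.values() for taxon in sample})
    means = {}
    for taxon in taxa:
        fractions = []
        for sample in counts.values():
            total = sum(sample.values())
            fractions.append(sample.get(taxon, 0) / total if total else 0.0)
        means[taxon] = sum(fractions) / len(fractions)
    # Sort by abundance (descending), breaking ties alphabetically.
    return sorted(means.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical read counts for two coal samples.
counts = {
    "coal_A": {"Firmicutes": 60, "Bacteroidetes": 30, "Proteobacteria": 10},
    "coal_B": {"Firmicutes": 20, "Thermotogae": 70, "Proteobacteria": 10},
}
ranked = top_taxa(counts, n=2)
print(ranked)
```

The rows of a heatmap would then be restricted to the taxa returned by this ranking, with one column per sample.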







Software Development

Development of a prototype visualization
approach for next-generation sequencing
technologies using Genome Navigator framework


PI: Dr. Andrey Grigoriev

Jan. 2011 - May 2013
  • The advent of next-generation sequencing (NGS) technologies has greatly accelerated the development of genome analysis. Genome Navigator (GN) is a Java-based genome viewer for visualizing and analyzing massive NGS datasets. GN supports data streams from both local and remote sources, with user interaction and customization. The highlighted colors enable users to examine variations, insertions, and deletions at the desired level of genomic resolution.

  • A flow chart shows how a user calls GN. Data from different sources are sent to the web server and stored. The user accesses the web interface and sends a query with data specifications to the web server. A CGI script processes the data stream and prepares data for the applet. Finally, the web server generates an HTML page for the Genome Navigator display.

  • An example of the user input interface.

  • Semantic zooming in different modes. Highlight colors indicate nucleotide discrepancies from the reference sequence.

  • Stacked view.

  • Condensed view.

  • Dual-mode PLOT display: line graph and histogram.

  • Whole-chromosome coverage histogram.


  • Single-nucleotide polymorphism (SNP) identification.


  • Sequencing discrepancy between different sequencing methods. A. Illumina. B. SOLiD.

  • Comparative genomics. A. Neurospora crassa strain FGSC-2223. B. Neurospora crassa strain 4825.

  • Structural variant identification. Structural rearrangements such as deletions and insertions can be identified by browsing GN.

  • NOTE: A Java(TM) plug-in is required in your internet browser to view the following live examples. If you have Java installed but still cannot see the display, please open the Java control panel and clear the Java cache first, then try again. Step-by-step instructions can be found HERE.

    UPDATE: Oracle is migrating from Java Applets to plugin-free Java technologies. After the JDK 9 release, Java Applets are no longer supported by the major browsers, so the following live visualizations will fail.

    Live Example 1: RNA secondary structure display.
    Click the link below to display deep sequencing reads and stem-loop sequence of dme-mir-1012.


    Screenshot:



  • Live Example 2: Illumina poly(A)+ RNA-Seq data of Drosophila melanogaster mixed embryos from modENCODE project.
    Click the link below to display the mapping of sequencing data by using two different alignment tools: BWA and Bowtie.
    Transcript FBtr0070291 is aligned with the reference sequence with all intronic regions removed.
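The whole-chromosome coverage histogram listed among the views above can be approximated with a short sketch. It is written in Python for illustration and is not taken from the Java-based Genome Navigator code; the function name binned_coverage, the 0-based half-open read coordinates, and the toy reads are all assumptions.

```python
def binned_coverage(reads, chrom_length, bin_size):
    """Mean read depth per fixed-width bin along one chromosome.

    reads: iterable of (start, end) alignments, 0-based, end exclusive.
    Returns a list with one mean-depth value per bin.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (chrom_length + 1)
    for start, end in reads:
        start, end = max(0, start), min(chrom_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    n_bins = (chrom_length + bin_size - 1) // bin_size
    bins = [0.0] * n_bins
    depth = 0
    for pos in range(chrom_length):
        depth += diff[pos]               # running sum gives per-base depth
        bins[pos // bin_size] += depth

    # Convert summed depth to mean depth; the last bin may be shorter.
    for i in range(n_bins):
        width = min(bin_size, chrom_length - i * bin_size)
        bins[i] /= width
    return bins


# Hypothetical toy example: three reads on a 10-bp "chromosome", 5-bp bins.
coverage = binned_coverage([(0, 4), (2, 7), (6, 10)], chrom_length=10, bin_size=5)
print(coverage)
```

Choosing a larger bin size at low zoom and a smaller one at high zoom is one simple way to keep such a histogram readable across the resolution levels a genome viewer offers.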

Developmental Genetics

A missing piece of Drosophila shell game:
The role of Dpp/Gbb heterodimers in
Drosophila patterning


PI: Dr. Nir Yakoby

Sep. 2009 - Dec. 2010
  • Drosophila oogenesis comprises 14 stages that eventually produce a fly egg. The eggshell is a highly structured three-dimensional chamber that is derived from the follicular epithelium in the developing egg.

  • One of the essential signaling pathways for Drosophila morphogenesis is the Bone Morphogenetic Protein (BMP) pathway, which belongs to the highly conserved TGF-beta superfamily. Three BMP ligands, Decapentaplegic (DPP), Glass bottom boat (GBB), and Screw (SCW), serve as morphogens that diffuse through the developing field and establish a morphogen gradient across the tissue. Upon ligand binding, the Type-I and Type-II receptors form a receptor complex and activate the phosphorylation of cytosolic signaling molecules. Moreover, DPP and GBB can form a heterodimer that generates synergistic signaling, but the function of this heterodimer in Drosophila oogenesis remains largely unknown.

  • The patterning of follicle cells is determined and regulated by numerous signaling pathways, including the Epidermal Growth Factor Receptor (EGFR) and Bone Morphogenetic Protein (BMP) pathways. Two main morphogens are involved in this process: Gurken (GRK) and Decapentaplegic (DPP). In situ hybridization results have shown that DPP is secreted anteriorly from cells along the nurse cell (NC) - follicle cell (FC) border and generates an anterior-posterior morphogen gradient, which is eventually captured by the uniformly expressed TKV receptor. GBB is expressed uniformly along the anterior follicle cells.

  • The UAS/GAL4 overexpression system was applied to the fly lines with two drivers, CY2 and 55B, which drive gbbRNAi to corresponding extents. Wild-type D. melanogaster (OreR) was used as the control group.

  • BMP signaling was monitored by the phosphorylated form of the intracellular signaling molecule MAD (P-MAD). In addition, BROAD (Br), a zinc-finger transcription factor and an established marker of the cells that will later form the roof of the Drosophila dorsal appendages (DA), was co-stained by immunofluorescence. Preliminary results show that BMP signaling is more refined anteriorly and has a longer duration.

  • In another Drosophila species, D. virilis, similar dynamics of BMP signaling are observed. Unlike D. melanogaster, D. virilis has four dorsal appendages. One of my lab colleagues, Matthew Niepielko, has proposed that the expression of tkv is diverse across Drosophila species. Other BMP signaling components, such as Short gastrulation (Sog), Twisted gastrulation (Tsg), and the proteinase Tolloid (Tld), also need to be tested to better understand the mechanism of ligand migration.

  • Preliminary microscopy images show that eggshell morphology is changed in the gbbRNAi fly lines. Shorter dorsal appendages with a larger pedal and a smaller operculum are seen in most samples. (Further quantification methods need to be developed and applied to reach a more accurate conclusion.)