-----------------------------------------------------------------------
BIOINFORMATICS COLLOQUIUM
College of Science
George Mason University
-----------------------------------------------------------------------

Determining Differentially Expressed Genes: Is Using Statistics From Each Probe of an Affymetrix Array a Good Approach?

Alan Berger
Naval Surface Warfare Center - Dahlgren

Abstract:

In many situations one wishes to determine which genes are differentially expressed across two (or more) experimental conditions, such as between a disease state and normal, between tumor subtypes, or between cultured cells stimulated in specific ways and corresponding controls. Microarray technology enables the simultaneous measurement of the expression levels of thousands of genes, but effective use of the resulting data requires identifying as many as possible of the truly differentially expressed genes while limiting false positive declarations of differential expression to a tolerable number. For example, if for each gene one did a t-test with a p-value threshold of 0.05, and the chip being used reported on 20,000 genes, then even if no gene were truly differentially expressed one would expect, on average, 1,000 false positive declarations of differential expression. It is thus essential to address the determination of differential expression carefully in order to obtain information that will be of practical use in elucidating the biology of the system under study.

There has been a considerable amount of work addressing this problem, involving a wide variety of statistical techniques and the incorporation of additional criteria for declaring a gene differentially expressed. Examples of such criteria are requiring that the average (geometric mean) of the ratio of the expression levels between a pair of conditions under study be sufficiently different from 1, and requiring that the average expression level be sufficiently above background.
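The false-positive count in the t-test example above follows from simple arithmetic, and it can be checked with a small simulation. The sketch below is illustrative only: it assumes every gene is null (not differentially expressed), that the tests are independent, and that p-values under the null are uniform on [0, 1]. The function names are hypothetical.

```python
import random


def expected_false_positives(n_genes: int, alpha: float) -> float:
    """Expected number of null genes with p < alpha (each has probability alpha)."""
    return n_genes * alpha


def simulate_false_positives(n_genes: int, alpha: float, seed: int = 0) -> int:
    """Draw one uniform p-value per null gene and count how many fall below alpha."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_genes) if rng.random() < alpha)


# Values taken from the example in the abstract: 20,000 genes, threshold 0.05.
expected = expected_false_positives(20_000, 0.05)
simulated = simulate_false_positives(20_000, 0.05)
print(f"expected: {expected:.0f}, one simulated chip: {simulated}")
```

A single simulated chip will not give exactly the expected count; the count fluctuates around it from run to run.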
Very often only a few samples (here, gene expression measurements from microarray chips) are available, while thousands of "calls" are required (a decision, for each gene on the chip, whether or not there is differential expression). In this setting it is common to attempt to control the false discovery rate (FDR), which is the number of genes incorrectly declared to be differentially expressed divided by the total number of genes declared to be differentially expressed. Depending on the uses to be made of the "list" of genes declared to be differentially expressed, an FDR of 0.1 or 0.2 might well be acceptable, provided the list captures a "sufficient number" of the truly differentially expressed genes.

In this seminar we will explore the reasons for the striking success demonstrated by the Logit-t method of Lemon et al. when applied to several data sets, including the Affymetrix HG_U95Av2 Latin Square data set (where a small number of genes have had controlled, variable amounts of RNA "spiked in", so the truly differentially expressed genes are known). The Logit-t method is a member of the family of methods that, for Affymetrix GeneChip data, make direct use of the probe level data for each gene (each gene is represented by a "probe set" consisting of on the order of 11 distinct probes on the chip), rather than using only a single summary expression level for each gene (i.e., probe set) on the chip. Implementing the Logit-t method, and several variants, on the Affymetrix HG_U133A Latin Square data set yields similarly striking levels of effectiveness.
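As a small illustration of the FDR definition given above, the sketch below computes the FDR of a list of declared genes against a known truth set, which is the situation in spike-in data such as the Latin Square sets. The gene identifiers and function names are hypothetical, and returning 0 for an empty list is a convention assumed here. In an ordinary experiment the truth set is unknown and the FDR has to be estimated instead.

```python
def false_discovery_rate(declared, truly_de) -> float:
    """Incorrectly declared genes divided by all declared genes."""
    declared, truly_de = set(declared), set(truly_de)
    if not declared:
        return 0.0  # assumed convention: an empty list has no false discoveries
    false_positives = declared - truly_de
    return len(false_positives) / len(declared)


def fraction_captured(declared, truly_de) -> float:
    """Share of the truly differentially expressed genes that made the list."""
    declared, truly_de = set(declared), set(truly_de)
    if not truly_de:
        return 0.0  # assumed convention when nothing is truly differential
    return len(declared & truly_de) / len(truly_de)


# Hypothetical identifiers: five genes declared, four of them truly spiked in.
declared_genes = ["g1", "g2", "g3", "g4", "g5"]
spiked_in_genes = ["g1", "g2", "g3", "g4", "g9"]

fdr = false_discovery_rate(declared_genes, spiked_in_genes)
captured = fraction_captured(declared_genes, spiked_in_genes)
print(f"FDR = {fdr:.2f}, fraction of true genes captured = {captured:.2f}")
```

Reporting both numbers together reflects the trade-off described in the abstract: a tolerable FDR is only useful if enough of the truly differentially expressed genes are on the list.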
The results indicate that taking the median of the t-scores from the individual probes in each probe set is what primarily accounts for the effectiveness of the Logit-t method, rather than the method's particular transformation of the data. This is probably because the median is robust both to outliers and to individual probes that cross-hybridize with RNA from other genes.
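The median-of-probe-t-scores step can be sketched in a few lines. This is not the Logit-t implementation of Lemon et al.: the data transformation and normalization are omitted, the choice of the Welch form of the t-statistic is an assumption made here, and all intensity values and names are hypothetical. The sketch only shows why a median over probes is less affected by one aberrant probe than a mean would be.

```python
from math import sqrt
from statistics import mean, median, variance


def welch_t(a, b) -> float:
    """Welch two-sample t-statistic for one probe (needs >= 2 values per group)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def median_probe_t(probe_set) -> float:
    """Median over probes of the per-probe t-statistics.

    probe_set is a list with one (condition_A_values, condition_B_values)
    pair per probe in the probe set.
    """
    return median(welch_t(a, b) for a, b in probe_set)


# Hypothetical log-scale intensities: 5 probes, 3 chips per condition.
# The last probe behaves oppositely, as a cross-hybridizing probe might.
probe_set = [
    ([8.1, 8.3, 8.2], [7.1, 7.2, 7.0]),
    ([9.0, 9.2, 9.1], [8.0, 8.1, 8.2]),
    ([7.5, 7.7, 7.6], [6.7, 6.6, 6.8]),
    ([10.0, 10.2, 10.1], [9.0, 9.3, 9.1]),
    ([6.0, 6.1, 5.9], [9.0, 9.2, 9.1]),
]

per_probe_t = [welch_t(a, b) for a, b in probe_set]
median_t = median_probe_t(probe_set)
mean_t = mean(per_probe_t)
print(f"median t = {median_t:.2f}, mean t = {mean_t:.2f}")
```

In this toy probe set, four probes agree and one disagrees strongly. The mean of the t-scores is pulled toward zero by the single aberrant probe, while the median stays close to the values of the agreeing probes.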