Name: Pollard

Firstname: Katherine S.

Title: Parametric and nonparametric methods to identify significantly differentially expressed genes.

Institution: University of California, Berkeley

Street: 140 Earl Warren Hall #7360

City: Berkeley, CA

Zip-Code: 94720-7360

Country: USA

Phone: 510-642-3241

Fax: 510-643-5163

Email: kpollard@stat.berkeley.edu

Authors: Katherine S. Pollard and Mark J. van der Laan

Title: Parametric and nonparametric methods to identify significantly differentially expressed genes.

Abstract: New technologies are allowing researchers to monitor the expression of thousands of genes simultaneously. A typical gene expression experiment results in an observed data matrix $X$ whose columns $X_1,\ldots,X_n$ are $n$ copies of a $p$-dimensional vector of gene expression measurements. Analysis of gene expression data typically begins with identification of a subset of genes that are differentially expressed either across all samples in one population or between known sub-populations of samples. Since the number of genes $p$ is usually more than 10,000, some adjustment for multiple comparisons is necessary. Current approaches to this problem are mostly based upon the marginal test statistics (e.g., t-statistics or modified versions of these) for every gene. The significance of these statistics is then assessed relative to a null distribution. Choosing the null distribution as close as possible to the observed distribution increases the power of the test. Most methods tend to overlook this fact. The Bonferroni adjustment and permutation methods, for example, ignore or break the correlation structure of the data. We propose the use of null distributions that preserve the correlation between genes. A parametric solution is to use the quantiles of a multivariate $N(0,\hat{\rho})$ distribution, where $\hat{\rho}$ denotes the estimated correlation matrix of the genes. A nonparametric approach is to use quantiles estimated from bootstrap samples of the mean-centered empirical distribution. We also extend this methodology to the setting in which the gene expression data are accompanied by an outcome variable, such as survival, so that the goal is to select genes with a significant association with the outcome. Again, the use of null distributions that preserve the correlation structure of the data improves the power of the test.
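
The two null distributions described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a one-sample setting with a $p \times n$ matrix (genes in rows), a per-gene t-statistic, and a single common cutoff taken as a quantile of the maximum absolute statistic; the abstract does not specify how the quantiles are combined across genes, so that choice is an assumption. The function names (t_statistics, bootstrap_max_null, parametric_max_null, common_cutoff) are hypothetical.

```python
import numpy as np


def t_statistics(X):
    """One-sample t-statistic for each gene; X has shape (p, n), genes in rows."""
    n = X.shape[1]
    return np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)


def bootstrap_max_null(X, n_boot=1000, seed=0):
    """Nonparametric null: bootstrap the mean-centered data.

    Whole columns (samples) are resampled, so the correlation between
    genes is carried into every bootstrap data set.
    Assumption: no gene is constant within a resample (n not tiny).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)  # enforce the null: mean zero per gene
    max_abs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample column indices with replacement
        max_abs[b] = np.abs(t_statistics(Xc[:, idx])).max()
    return max_abs


def parametric_max_null(X, n_draws=1000, seed=0):
    """Parametric null: draws from N(0, rho_hat), rho_hat = sample correlation.

    With Xs the row-standardized data and W standard normal,
    Xs @ W / sqrt(n - 1) has covariance Xs @ Xs.T / (n - 1) = rho_hat,
    so the p x p matrix never has to be formed or factorized.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Xs = Xc / X.std(axis=1, ddof=1, keepdims=True)
    W = rng.standard_normal((n, n_draws))
    Z = Xs @ W / np.sqrt(n - 1)  # shape (p, n_draws); each column ~ N(0, rho_hat)
    return np.abs(Z).max(axis=0)


def common_cutoff(max_abs, alpha=0.05):
    """(1 - alpha) quantile of the max statistic, used as one cutoff for all genes."""
    return np.quantile(max_abs, 1.0 - alpha)


if __name__ == "__main__":
    # Simulated data for illustration only: 200 genes, 20 samples, shared factor
    # to induce correlation between genes.
    rng = np.random.default_rng(1)
    factor = rng.standard_normal((1, 20))
    X = 0.7 * factor + rng.standard_normal((200, 20))
    t_obs = t_statistics(X)
    for name, null in [("bootstrap", bootstrap_max_null(X)),
                       ("parametric", parametric_max_null(X))]:
        c = common_cutoff(null)
        print(name, "cutoff:", round(float(c), 2),
              "genes selected:", int((np.abs(t_obs) > c).sum()))
```

Genes whose observed absolute statistic exceeds the cutoff would be called differentially expressed under this sketch. Because both nulls keep the dependence between genes, the resulting cutoff is typically no larger than one derived from a Bonferroni-style bound that treats the genes as unrelated.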