Program

The conference will be held from July 9 to July 11. On July 8, several pre-conference courses are offered.

Accepted Talks:
Across and Down in Large SNP Studies: the MAX test of Freidlin and Zheng vs SAS PROC CASECONTROL
Dana Aeschliman; Marie-Pierre Dubé Statistical genetics research group, Montreal Heart Institute 
SAS PROC CASECONTROL offers the user three statistical tests for assessing the association between a SNP and a binary phenotype: the allele, genotype and trend tests. Three important models of genotype-phenotype association are the recessive, additive and dominant genetic models. In a large SNP study, one is faced with both the "across" and "down" aspects of the multiple testing problem. The MAX test of Freidlin et al. (2002; see also Zheng and Gastwirth, 2006) builds on the ideas of Armitage (1955), Sasieni (1997), and Slager and Schaid (2001) and offers a way of testing for recessive, additive, and dominant models while producing one P-value per SNP. In this report, we compare the power of the MAX test to each of the three tests in SAS PROC CASECONTROL and show that the MAX test compares very favorably. We developed a program in R to simulate genetic data sets of varying complexity, and we provide two SAS macros that use only BASE SAS. The first encodes the MAX test. The second acts as a wrapper for the first and encodes a step-down resampling algorithm, Westfall and Young's (1993) Algorithm 2.8, resulting in p-values that are corrected for the correlation between test statistics. We comment on the notion of subset pivotality as applied to this situation and discuss the treatment of missing values.
References:
Zheng, G. and Gastwirth, J. (2006) On estimation of the variance in Cochran-Armitage trend tests for genetic association using case-control studies. Statistics in Medicine; 25(18): 3150-3159.
Freidlin, B. et al. (2002) Trend Tests for Case-Control Studies of Genetic Markers: Power, Sample Size and Robustness. Human Heredity; 53, 3.
Slager, S.L. and Schaid, D.J. (2001) Case-Control Studies of Genetic Markers: Power and Sample Size Approximations for Armitage's Test for Trend. Human Heredity; 52, 3.
Sasieni, P.D. (1997) From genotypes to genes: Doubling the sample size. Biometrics; 53: 1253-1261.
Armitage, P. (1955) Tests for linear trends in proportions and frequencies. Biometrics; 11: 375-386.
Westfall, P.H. and Young, S.S. (1993) Resampling-Based Multiple Testing. John Wiley and Sons, Inc.
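The authors' implementation uses R and BASE SAS macros (not reproduced here). As a rough Python sketch, the MAX statistic is simply the largest absolute Cochran-Armitage trend statistic over the recessive, additive and dominant score sets; the 2x3 genotype counts and the score choices below are illustrative assumptions.

```python
import math

def trend_z(cases, controls, scores):
    """Cochran-Armitage trend statistic for a 2x3 case-control genotype table."""
    n = [c + d for c, d in zip(cases, controls)]   # genotype totals
    N, R = sum(n), sum(cases)                      # grand total, number of cases
    sx = sum(x * m for x, m in zip(scores, n))
    num = N * sum(x * r for x, r in zip(scores, cases)) - R * sx
    var = R * (N - R) * (N * sum(x * x * m for x, m in zip(scores, n)) - sx ** 2)
    return math.sqrt(N) * num / math.sqrt(var)

def max_test(cases, controls):
    """MAX statistic: largest |Z| over recessive, additive and dominant scores."""
    score_sets = {"recessive": (0, 0, 1), "additive": (0, 1, 2), "dominant": (0, 1, 1)}
    return max(abs(trend_z(cases, controls, s)) for s in score_sets.values())
```

A reference distribution for the MAX statistic (and the Westfall-Young correction across SNPs) would then be obtained by resampling, as in the abstract.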

Multiple Testing Procedures for Hierarchically Related Hypotheses
Przemyslaw Biecek Institute of Mathematics and Computer Science, Wroclaw University of Technology 
In some genomic studies, the hypotheses under consideration are hierarchically related. For example, in Gene Set Functional Enrichment Analysis (GSFEA), we are confronted with the problem of testing thousands of hypotheses corresponding to different biological terms. Since the biological terms are hierarchically related, the corresponding hypotheses are also related. If biological term f(i) is more specific than biological term f(j), then the rejection of hypothesis H_0(i) associated with the term f(i) implies the rejection of hypothesis H_0(j) associated with the term f(j) (the relationship between biological attributes is defined by the Gene Ontology Biological Process (GOBP) hierarchical taxonomy [1]).
In this case, in addition to correcting for the number of hypotheses, we want to guarantee that the testing outcomes are coherent with the relation among biological functions. Popular multiple testing procedures (e.g. step-up, step-down or single-step) do not guarantee this coherence. Moreover, methods designed for testing under a hierarchical relation (see [2]) do not provide control of the FDR and cannot easily be applied in the context of GSFEA.
We propose a novel approach which incorporates knowledge about the relation among hypotheses. We consider the problem of testing a set of null hypotheses with a given hierarchical relation among them. The relation, represented by a directed acyclic graph (DAG), determines all possible outcomes of testing. It also leads to the two natural testing procedures (the follow-up and the follow-down) presented in this paper. For these procedures, we derive formulas for significance levels that provide strong control of the three most popular error rates (FWER, PFER and FDR). We also present a simulation study for the proposed testing procedures, discuss their strengths and weaknesses, and point out some applications.
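The coherence requirement itself can be stated in a few lines of Python (a minimal sketch, not the authors' follow-up/follow-down procedures): once a hypothesis attached to a specific term is rejected, every hypothesis attached to a more general ancestor term must be rejected too. The dictionary encoding of the DAG is an assumption for illustration.

```python
def propagate_rejections(parents, rejected):
    """Enforce coherence on a term DAG: rejecting a specific-term hypothesis
    forces rejection of every more general (ancestor) hypothesis.
    `parents` maps each term to the more general terms it implies."""
    out = set(rejected)
    stack = list(rejected)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):
            if parent not in out:
                out.add(parent)
                stack.append(parent)
    return out
```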
REFERENCES
[1] Harris, M. A., et al. (2004)
"The Gene Ontology (GO) database and informatics resource."
Nucleic Acids Res. 32(Database issue): D258–D261. doi: 10.1093/nar/gkh036
[2] Finner, H., Strassburger, K. (2002)
"The partitioning principle: A powerful tool in multiple decision theory."
The Annals of Statistics, Vol. 30, No. 4, 1194–1213

Multi-treatment optimal response-adaptive designs for continuous responses
Atanu Biswas; Saumen Mandal Indian Statistical Institute, Kolkata 
Optimal response-adaptive designs in the phase III clinical trial setting are attracting more and more interest. Most of the available designs, however, are not derived from any optimality consideration. An optimal design for binary responses is given by Rosenberger et al. (2001), and an optimal design for continuous responses is provided by Biswas and Mandal (2004). Recently, Zhang and Rosenberger (2006) provided another design for normal responses. The present paper addresses some shortcomings of the earlier works and then extends the approach to more than two treatments. The proposed methods are illustrated using some real data.

A Procedure for Multiple Comparisons of Diagnostic Systems
Ana Cristina Braga; Lino A. Costa, Pedro N. Oliveira University of Minho 
In this work, a method for the comparison of two diagnostic systems based on ROC curves is presented. ROC curve analysis is often used as a statistical tool for the evaluation of diagnostic systems. For a given test, the compromise between the False Positive Rate (FPR) and the True Positive Rate (TPR) can be presented graphically through a ROC curve. In general, however, the comparison of ROC curves is not straightforward, in particular when they cross each other. A similar difficulty is observed in the multiobjective optimization field, where sets of solutions defining fronts must be compared in a multidimensional space. Thus, the proposed methodology is based on a procedure used to compare the performance of distinct multiobjective optimization algorithms. Traditional methods based on the area under the ROC curve are not sensitive to the existence of crossing points between the curves. The new approach can deal with this situation and also allows the comparison of partial portions of ROC curves according to particular values of sensitivity and specificity of practical interest. For illustration purposes, real data from a Portuguese hospital were considered.
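For readers unfamiliar with the ingredients, a minimal Python sketch (illustrative only; not the authors' multiobjective comparison procedure) of the empirical ROC points of a score-based test and the trapezoidal area under the curve, the traditional summary that is insensitive to crossings:

```python
def roc_points(scores_pos, scores_neg):
    """Empirical (FPR, TPR) points for a score-based diagnostic test."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        pts.append((fpr, tpr))
    return pts

def auc(pts):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

Two crossing curves can share the same AUC, which is precisely the situation the proposed front-comparison methodology is designed to handle.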

A general principle for shortening closed test procedures with applications
Werner Brannath; Frank Bretz Medical University of Vienna 
The closure principle is a general, simple and powerful method for constructing multiple test procedures that control the familywise error rate in the strong sense. In spite of its generality and simplicity, the closure principle has the disadvantage that the number of individual tests required increases exponentially with the number of null hypotheses of primary interest. Hence, multiple test procedures based on the closure principle can require large computational effort and may become infeasible for a large number of hypotheses and/or for computationally intensive hypothesis tests, such as permutation or bootstrap tests.
Shortcut procedures have been proposed in the past which substantially reduce the number of operations. In this presentation we provide a general principle for shortening closed tests. This principle provides a unified approach that covers many known shortcut procedures from the literature. As one application among others, we derive a shortcut procedure for flexible two-stage closed tests, for which no shortcuts have been available so far.
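A brute-force sketch of the closure principle in Python makes the exponential cost visible: all 2^m - 1 intersection hypotheses are tested locally, and H_i is rejected only if every intersection containing it is rejected. Bonferroni local tests are an illustrative choice here (any valid alpha-level local tests may be substituted).

```python
from itertools import combinations

def closed_test(pvals, alpha=0.05):
    """Closure principle with Bonferroni local tests: H_i is rejected iff
    every intersection hypothesis containing i is rejected locally.
    Requires 2^m - 1 local tests -- the cost that shortcut procedures avoid."""
    m = len(pvals)
    rejected_local = {}
    for k in range(1, m + 1):
        for idx in combinations(range(m), k):
            # Bonferroni local test of the intersection hypothesis
            rejected_local[idx] = min(pvals[i] for i in idx) <= alpha / len(idx)
    return [all(rejected_local[idx] for idx in rejected_local if i in idx)
            for i in range(m)]
```

With Bonferroni local tests this closed test reduces to Holm's step-down procedure, a classical example of the kind of shortcut the talk generalizes.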

Powerful shortcuts for gatekeeping procedures
Frank Bretz; Gerhard Hommel, Willi Maurer Novartis Pharma AG 
We present a general testing principle for a class of multiple testing problems based on weighted hypotheses. Under moderate conditions, this principle leads to powerful consonant multiple testing procedures. Furthermore, shortcut versions can be derived which substantially simplify the implementation and interpretation of the related test procedures. It is shown that many well-known multiple test procedures turn out to be special cases of this general principle. Important examples include gatekeeping procedures, which are often applied in clinical trials when primary and secondary objectives are investigated, and multiple test procedures based on hypotheses which are completely ordered by importance. We illustrate the methodology with two real clinical studies.

Adjusting p-values of a stepwise generalized linear model
Chiara Brombin; Finos L., Salmaso L. University of Padova 
Stepwise variable selection methods are frequently used to determine the predictors of an outcome in a generalized linear model (GLM). Despite their widespread use, it is well known that the tests on the explained deviance of the selected model are biased. This arises from the fact that the ordinary test statistics upon which these methods are based were intended for testing pre-specified hypotheses, whereas the tested model is selected through a data-steered procedure. In this work we define and discuss a simple nonparametric procedure which corrects the p-value of the model selected by any stepwise selection method for GLMs. We also prove that this procedure falls in the class of weighted nonparametric combining functions defined by Pesarin [1] and extended in Finos and Salmaso [2]. The unbiasedness and consistency of the method are also proved, and a simulation study shows the validity of the procedure. Theoretical differences with previous works in the same field (Grachanovsky and Pinsker [3]; Harshman and Lundy [4]) are also discussed. Free code for R and Matlab is available, and an application to a real dataset is presented.
[1] Pesarin, F. (2001). Multivariate Permutation Tests: With Applications in Biostatistics. John Wiley & Sons, Chichester-New York.
[2] L. Finos, L. Salmaso (2006). Weighted methods controlling the multiplicity when the number of variables is much higher than the number of observations. Journal of Nonparametric Statistics 18, 2, 245-261.
[3] E. Grachanovsky, I. Pinsker (1995). Conditional p-values for the F-statistic in a forward selection procedure. Computational Statistics & Data Analysis 20, 239-263.
[4] R. A. Harshman, M. E. Lundy (2006). A randomization method of obtaining valid p-values for model changes selected "post hoc". http://publish.uwo.ca/~harshman/imps2006.pdf
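The flavour of such a correction can be sketched generically in Python (an illustration of permutation-based adjustment for a data-steered selection, not the authors' specific procedure or their R/Matlab code): rerun the entire stepwise selection on response-permuted data and compare the selected model's statistic with the observed one.

```python
import random

def permutation_adjusted_p(x_rows, y, select_and_stat, n_perm=200, seed=0):
    """Permutation correction for a data-steered selection: `select_and_stat`
    runs the whole selection and returns the selected model's test statistic.
    The corrected p-value compares the observed statistic with its values
    under random permutations of the response."""
    rng = random.Random(seed)
    observed = select_and_stat(x_rows, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if select_and_stat(x_rows, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0
```

Because the selection step is repeated inside every permutation, the selection-induced optimism of the ordinary test is built into the reference distribution.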

Multiple Testing Procedures with Incomplete Data for Rank-based Tests of Ordered Alternatives
Paul Cabilio; Jianan Peng Acadia University 
Page (1963) and Jonckheere (1954) introduced tests for ordered alternatives in blocked experiments. Specifically, in the model with n blocks and t treatments, the aim is to test the hypothesis of no treatment effect against a specified ordered treatment effect with at least one inequality strict. Page proposed a statistic which can be expressed as the sum of Spearman correlations between each block and the criterion ranking chosen to be (1, 2, ..., t), while Jonckheere proposed a statistic based on Kendall's tau correlation. These tests were extended in Alvo and Cabilio (1995) to the situation where only k(i) treatment responses are observed in block i. For such incomplete blocks, the resulting extended Page statistic L* differs from the one in the complete case in that the complete rank of a response in each block is replaced by a weight times a score, which is either the incomplete rank of the response or the average rank (k(i)+1)/2, depending on whether or not the treatment is ranked in that block. If the null hypothesis is rejected, it is of interest to construct test procedures that identify which inequalities in the alternative are strict, and in doing so maintain the experimentwise error rate at a preassigned level. Our approach is to modify one or more procedures that have been developed for detecting ordered means in the context of ANOVA (Nashimoto and Wright, 2005). The form of the extended Page statistic makes it possible to apply a general step-down testing procedure for multiple comparisons, such as that proposed in Marcus, Peritz, and Gabriel (1976) for normal-based tests. Specifically, we define a partition of the integers 1 to t into h sets of consecutive integers. For each set of integers in the partition we define an extended Page test statistic to test the sub-alternative hypothesis of ordered effects of the treatments indexed by those integers.
The intersection of such hypotheses over the partition can then be tested by the sum of such statistics. The procedure is to test all such hypotheses over all possible partitions. This approach may also be used for the extended Jonckheere statistic.
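As a small concrete illustration (complete blocks only; the weight-times-score construction for incomplete blocks is not reproduced here), Page's statistic for within-block rank data can be computed as:

```python
def page_L(blocks):
    """Page's L statistic. `blocks` is a list of within-block rank vectors
    (ranks 1..t); L = sum over treatments j of j * R_j, where R_j is the
    rank sum of treatment j across blocks and j follows the criterion
    ranking (1, 2, ..., t)."""
    t = len(blocks[0])
    rank_sums = [sum(b[j] for b in blocks) for j in range(t)]
    return sum((j + 1) * rank_sums[j] for j in range(t))
```

Large values of L support the specified ordered alternative; blocks perfectly agreeing with the criterion ranking maximize it.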

A leave-p-out based estimation of the proportion of null hypotheses in multiple testing problems
Alain Celisse UMR 518 AgroParisTech / INRA MIA 
A large part of the literature has been devoted to multiple testing problems since the introduction of the False Discovery Rate (FDR) by Benjamini and Hochberg (1995). In this seminal paper, the authors provide a procedure that controls the FDR at a pre-specified level. However, the method can be improved in terms of power by introducing an estimate of the unknown proportion of true null hypotheses, pi0. We propose an estimator of this proportion that relies on both density estimation by means of irregular histograms and exact leave-p-out cross-validation.
We first estimate the density of the p-values from a collection of irregular histograms, among which we select the best estimator in terms of minimization of the quadratic risk. The estimate of pi0 is deduced as the height of the largest column of the selected histogram. An estimator of the risk is obtained by leave-p-out cross-validation. We present a closed formula for this risk estimator and an automatic choice of the parameter p in the leave-p-out, obtained by minimizing the mean square error (MSE) of the leave-p-out risk estimator.
Besides, recent papers have pointed out that the use of two-sided statistics in one-sided tests yields p-values for false positives that lie close to 1. Whereas most existing estimators do not take this phenomenon into account, leading to systematic overestimation, our estimator of the proportion remains accurate in such situations.
Finally, we compare our procedure with existing ones in simulations, showing as well how problematic false positives near 1 may be. The proposed estimator seems more accurate, for instance in terms of variability, and better FDR estimates are obtained.
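For orientation, here is the simplest fixed-bin version of a histogram-based pi0 estimate in Python (a Storey-type estimator with a single cut-off `lam`; the authors' procedure instead selects an irregular histogram by exact leave-p-out cross-validation, which is not reproduced here):

```python
def pi0_histogram(pvals, lam=0.5):
    """Storey-type histogram estimate of pi0: under the null, p-values are
    uniform on [0, 1], so the histogram height above `lam` estimates the
    proportion of true nulls."""
    m = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))
```

The sensitivity of such fixed-lambda estimators to false positives with p-values near 1 is exactly the overestimation issue the abstract addresses.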

Multiple Testing in the Change-Point Problem with Application to Safety Signal Detection
Jie Chen Merck Research Laboratories 
Detection of a change point usually requires testing multiple null hypotheses. In this talk we focus on inference about a change in the ratio of two time-ordered Poisson stochastic processes, developing multiple testing procedures that control certain error rates. Possible extensions of the procedures to multiple change-points are explored. The procedures are illustrated with a real data example from drug safety signal detection and a simulation study.

On the Probability of Correct Selection for Large k Populations, with Application to Microarray Data
Xinping Cui; Jason Wilson University of California, Riverside 
One frontier of modern statistical research is the “multiple comparison problem” (MCP) arising from data sets with a large number k (> 1000) of populations, e.g. microarray and neuroimaging data. In this talk we demonstrate an alternative to hypothesis testing, an extension of the Probability of Correct Selection (PCS) concept. The idea is to select the top t out of k populations and estimate the probability that the selection is correct, according to specified selection criteria. We propose “d-best” and “G-best” selection criteria that are suitable for large-k problems and illustrate the application of the proposed method on two microarray data sets. The results show that our method is powerful for the purpose of selecting the “top t best” out of k populations.

A semiparametric approach for mixture models: Application to local FDR estimation
JeanJacques Daudin; A. BarHen, L. Pierre, S. Robin AgroParisTech / INRA 
In the context of multiple testing, the estimation of the false discovery rate (FDR) or the local FDR can be stated in a mixture model context. We propose a procedure to estimate a two-component mixture model in which one component is known. The unknown part is estimated with a weighted kernel function whose weights are defined in an adaptive way. We prove the convergence and uniqueness of our estimation procedure, and we use it to estimate the posterior population probabilities and the local FDR.
Key words: FDR, Mixture model, Multiple testing procedure, Semiparametric density estimation.
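The target quantity can be stated in a few lines of Python (a generic definition of the local FDR for a two-component mixture, with the known null density f0 and the mixture density f supplied as functions; the authors' adaptive weighted-kernel estimation of the unknown component is not shown):

```python
def local_fdr(z, f0, f, pi0):
    """Local FDR at statistic value z: the posterior probability that z
    comes from the null component of the mixture f = pi0*f0 + (1-pi0)*f1."""
    return min(1.0, pi0 * f0(z) / f(z))
```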

Asymptotic improvements of the Benjamini-Hochberg method for FDR control based on an asymptotically optimal rejection curve
Thorsten Dickhaus; Helmut Finner, Markus Roters German Diabetes Center, Leibniz Institute at the Heinrich-Heine-University Düsseldorf 
Due to current applications with a large number $n$ of hypotheses, asymptotic control ($n \to \infty$) of the false discovery rate (FDR) has become a major topic in the field of multiple comparisons. In general, the original linear step-up (LSU) procedure proposed in Benjamini & Hochberg (1995) does not exhaust the pre-specified FDR level, which gives hope for improvements with respect to power.
Based on some heuristic considerations, we present a new rejection curve and implement this curve in several stepwise multiple test procedures for asymptotic FDR control. It will be shown that the new tests asymptotically exhaust the full FDR level under some extreme parameter configurations. This optimality leads to an asymptotic gain in power in comparison with the LSU procedure.
For the finite case, we discuss adjustments both of the curve and of the procedures in order to provide strict FDR control.
References:
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57, 289-300.
Benjamini, Y. & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 4, 1165-1188.
Finner, H., Dickhaus, T. & Roters, M. (2007). On the false discovery rate and an asymptotically optimal rejection curve. Submitted for publication.
Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Stat. 30, 1, 239-257.
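To make the comparison concrete, the following Python sketch contrasts the LSU critical values i*alpha/n with those induced by the asymptotically optimal rejection curve f_alpha(t) = t/(t(1 - alpha) + alpha) studied in Finner, Dickhaus and Roters (2007), namely alpha_i = i*alpha/(n - i(1 - alpha)). Note the largest AORC value equals 1, which is one reason finite-sample adjustments are needed for strict FDR control.

```python
def lsu_critical_values(n, alpha=0.05):
    """Critical values of the Benjamini-Hochberg linear step-up procedure."""
    return [i * alpha / n for i in range(1, n + 1)]

def aorc_critical_values(n, alpha=0.05):
    """Critical values induced by the asymptotically optimal rejection curve
    f_alpha(t) = t / (t*(1 - alpha) + alpha):  alpha_i = i*alpha / (n - i*(1 - alpha))."""
    return [i * alpha / (n - i * (1 - alpha)) for i in range(1, n + 1)]
```

Each AORC critical value dominates its LSU counterpart, which is the source of the asymptotic power gain described above.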

Comparison of Methods for Estimating Relative Potencies in Multiple Bioassay Problems
Gemechis Dilba Institute of Biostatistics, Leibniz University of Hannover, Germany 
Relative potency estimation in both multiple parallel-line and slope-ratio assays involves the construction of simultaneous confidence intervals for ratios of linear combinations of general linear model parameters. The key problem here is that of determining the multiplicity-adjusted percentage point of a multivariate t-distribution whose correlation matrix R depends on the unknown ratio parameters. Several methods have been proposed in the literature for dealing with R. Among others, conservative methods based on probability inequalities (e.g., Boole's and Sidak's inequalities) and a method based on an estimate of R are used. In this talk, we explore and compare the various methods (including the delta approach) in a more comprehensive manner with respect to their simultaneous coverage probabilities via Monte Carlo simulations. The methods will also be evaluated in terms of confidence interval width through application to data on a multiple parallel-line assay.

Adaptive modelbased designs in clinical drug development
Vlad Dragalin Wyeth Research 
The objective of a clinical trial may be to target the maximum tolerated dose or the minimum effective dose, to find the therapeutic range, to determine the optimal safe dose to be recommended for confirmation, or to confirm efficacy over control in a Phase III clinical trial. This clinical goal is usually determined by the clinicians from the pharmaceutical industry, practicing physicians, key opinion leaders in the field, and the regulatory agency. Once agreement has been reached on the objective, it is the statistician's responsibility to provide the appropriate design and statistical inferential structure required to achieve that goal. There are plenty of designs available on the statistician's shelf; the greatest challenge is their implementation. We exemplify this in three case studies.

Some insights into FDR and k-FWER in terms of average power and overall rejection rate
Meng Du Department of Statistics, University of Toronto 
This paper provides some insights into the false discovery rate (FDR) and the k-familywise error rate (k-FWER) by comparing, in terms of average power, an FDR-controlling procedure of Benjamini and Hochberg (1995) and a k-FWER-controlling procedure of Lehmann and Romano (2005). A further look at the overall rejection rate, the probability of obtaining at least one discovery, explains the behavior patterns of the average powers of these two procedures, which control different types of error rates.
Keywords: average power, false discovery rate, k-familywise error rate, large-scale multiple testing, overall rejection rate.
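For reference, the k-FWER step-down procedure of Lehmann and Romano (2005) can be sketched in Python as follows (the critical values k*alpha/m for the k smallest ranks and k*alpha/(m + k - i) beyond follow their paper; the example data are illustrative):

```python
def lehmann_romano_kfwer(pvals, k=2, alpha=0.05):
    """Step-down k-FWER procedure of Lehmann & Romano (2005): with 1-based
    ranks i of the ordered p-values, the critical values are k*alpha/m for
    i <= k and k*alpha/(m + k - i) for i > k; reject down the ordered list
    until the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = []
    for rank, idx in enumerate(order, start=1):
        crit = k * alpha / m if rank <= k else k * alpha / (m + k - rank)
        if pvals[idx] <= crit:
            rejected.append(idx)
        else:
            break
    return rejected
```

Setting k = 1 recovers Holm's familywise-error procedure, which clarifies how k-FWER control relaxes FWER control.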

Sequentially rejective test procedures for partially ordered sets of hypotheses
David Edwards; Jesper Madsen Novo Nordisk 
A popular method to control multiplicity in confirmatory clinical trials is to use a hierarchical (sequentially rejective) test procedure based on an a priori ordering of the hypotheses. The talk describes a simple generalization of this approach in which the hypotheses are partially ordered. It is convenient to display the partial ordering as a directed acyclic graph (DAG). To obtain strong FWE control, certain intersection hypotheses must be inserted into the DAG; the resulting DAG is called partially closed. The purpose of the approach is to enable the construction of inference strategies for confirmatory clinical trials that more closely reflect the trial objectives.
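The fully ordered special case that this talk generalizes can be stated in a few lines of Python (standard fixed-sequence testing; the partially closed DAG construction with inserted intersection hypotheses is not reproduced here):

```python
def fixed_sequence_test(pvals_in_order, alpha=0.05):
    """A priori ordered (fixed-sequence) testing: each hypothesis is tested
    at the full level alpha, and testing stops at the first non-rejection.
    Returns the indices of the rejected hypotheses."""
    rejected = []
    for i, p in enumerate(pvals_in_order):
        if p <= alpha:
            rejected.append(i)
        else:
            break
    return rejected
```

Strong FWER control holds because later hypotheses are only reached after all earlier ones have been rejected.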

Homogeneity of stages in adaptive designs
Andreas Faldum IMBEI, Universitätsklinikum Mainz 
Adaptive designs offer great flexibility in clinical trials while guaranteeing full control of the type I error. Despite increasing interest, such designs are only hesitantly implemented in pharmaceutical trials. One possible reason is the concern of the regulatory authorities. In a reflection paper on methodological issues in confirmatory clinical trials with flexible design and analysis plan [EMEA 06], the European Medicines Agency (EMEA) requests methods to ensure comparable results in the interim and final analyses. The authors point out that it might be difficult to interpret the conclusions of a trial if it is suspected that the observed discrepancies between stages are a consequence of dissemination of the interim results. The EMEA states that the simple rejection of the global null hypothesis across all stages is not sufficient to establish a convincing treatment effect. In order to avoid jeopardizing the success of a trial through differing results across stages, the probability of such discrepancies should be taken into account when planning a trial.
In this talk we concentrate on two-stage adaptive designs. Boundaries for discrepant effect estimates across stages are given, depending on the p-value of the first stage and the adaptive design selected. By choosing an appropriate adaptive design, a rejection of the null hypothesis despite a relevantly reduced effect estimate in the second stage can be prevented. On the other hand, rejection of the null hypothesis with treatment effect estimates increasing relevantly over stages cannot reasonably be avoided. However, the probability of rejecting the null hypothesis with homogeneous effect estimates in both stages can be predetermined. The results can help to find an adaptive design which prevents a relevant decrease of the effect estimate in case of a significant trial success and reduces the probability of a random relevant increase in the effect estimate. The underlying analyses can be used as a basis for discussion with the regulatory authorities. The considerations proposed here will be clarified by examples.
EMEA (2006). Reflection Paper on Methodological Issues in Confirmatory Clinical Trials with Flexible Design and Analysis Plan. CHMP/EWP/2459/02, end of consultation Sept 2006, http://www.emea.eu.int/pdfs/human/ewp/245902en.pdf.
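For context, two-stage adaptive designs typically combine the stage-wise p-values with a pre-specified combination function; the following Python sketch shows the inverse-normal combination (standard machinery in this setting, e.g. Lehmacher and Wassmer, 1999; it is background, not the boundary construction proposed in this talk). The bisection-based normal quantile is a stdlib-only workaround.

```python
import math

def inverse_normal_combination(p1, p2, w1=0.5 ** 0.5, w2=0.5 ** 0.5):
    """Inverse-normal combination of two stage-wise p-values with
    pre-specified weights (w1**2 + w2**2 = 1): returns the combined p-value
    1 - Phi(w1*z1 + w2*z2), where z_i = Phi^{-1}(1 - p_i)."""
    def phi(x):  # standard normal CDF
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))
    def z(p):  # Phi^{-1}(1 - p) by bisection
        lo, hi = -10.0, 10.0
        for _ in range(200):
            mid = (lo + hi) / 2
            if phi(mid) < 1 - p:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    return 1 - phi(w1 * z(p1) + w2 * z(p2))
```

Because the combination function and weights are fixed in advance, the type I error is controlled regardless of design adaptations after stage one.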

FDR control: Assumptions, a unifying proof, least favorable configurations and FDR bounds
Helmut Finner; Thorsten Dickhaus, Markus Roters German Diabetes Center, Leibniz Institute at the Heinrich-Heine-University Düsseldorf 
We consider multiple test procedures in terms of p-values based on a fixed rejection curve or a critical value function and study their FDR behavior.
First, we introduce a series of assumptions concerning the underlying distributions and the structure of possible multiple test procedures.
Then we give a short and unifying proof of FDR control for procedures (step-up, step-down, step-up-down) based on Simes' critical values, for independent p-values and for a special class of dependent p-values considered in Benjamini and Yekutieli (2001), Sarkar (2002) and Finner, Dickhaus and Roters (2007).
Moreover, we derive upper bounds for the FDR of non-step-up procedures which can be calculated with respect to Dirac-uniform configurations.
Finally, it will be shown that Dirac-uniform configurations are asymptotically least favorable for certain step-up-down procedures when the number of hypotheses tends to infinity.
References
Benjamini, Y. and Yekutieli, D. (2001).
The control of the false discovery rate in multiple testing under dependency.
The Annals of Statistics 29, 1165-1188.
Finner, H., Dickhaus, T. and Roters, M. (2007).
Dependency and false discovery rate: Asymptotics.
The Annals of Statistics, to appear.
Finner, H., Dickhaus, T. and Roters, M. (2007).
On the false discovery rate and an asymptotically optimal rejection curve.
Submitted for publication.
Sarkar, S. K. (2002)
Some results on false discovery rate in stepwise multiple testing procedures.
The Annals of Statistics 30, 239-257.

Nonnegative matrix factorization and sequential testing
Paul Fogel; S. Stanley Young, NISS (possibly speaker) Consultant, Paris 
The “omic” sciences (transcriptomics, proteomics, metabolomics) all have data sets with n much smaller than p, leading to serious multiple testing problems. On the other hand, the coordination of biological action implies that there will be important correlation structures in these data sets, and any statistical analysis should take advantage of these correlations. We use nonnegative matrix factorization to organize the predictors into sets. We allocate alpha over the sets and then test sequentially within each set. Since the within-set testing is sequential, no further multiple testing adjustment is needed. We use simulations to demonstrate the increased power of our methods and demonstrate them on a real data set using a SAS JMP script.
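The testing scheme can be sketched in Python as follows (equal alpha allocation across sets is an illustrative assumption; the NMF step that forms the sets, and the ordering of predictors within a set, are taken as given):

```python
def sequential_within_sets(sets_pvals, alpha=0.05):
    """Alpha allocation over predictor sets (here: split equally), followed
    by sequential testing within each set: testing a set stops at the first
    non-rejection, so no further multiplicity adjustment is needed inside
    a set. Returns, per set, the indices of the rejected predictors."""
    per_set = alpha / len(sets_pvals)
    rejections = []
    for pvals in sets_pvals:
        hits = []
        for i, p in enumerate(pvals):
            if p <= per_set:
                hits.append(i)
            else:
                break
        rejections.append(hits)
    return rejections
```

The power gain comes from spending alpha on a handful of correlated sets rather than on thousands of individual predictors.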

Exploring changes in treatment effects across design stages in adaptive trials
Tim Friede; Robin Henderson University of Warwick, Warwick Medical School 
The recently published draft of a CHMP reflection paper on flexible designs highlights a controversial issue regarding the interpretation of adaptive trials when the treatment effect estimates differ across design stages (CHMP, 2006). Section 4.2.1 states: “… the applicant must preplan methods to ensure that results from different stages of the trial can be justifiably combined. In this respect, studies with adaptive designs need at least the same careful investigation of heterogeneity and justification to combine the results of different stages as is usually required for the combination of individual trials in a meta-analysis.” This suggests that a test for heterogeneity should be pre-planned and that, in the event of a significant result, the policy should be to discard observations subsequent to the interim analysis that induced changes in the treatment. In this presentation we investigate the error rates of this procedure. Furthermore, we present an alternative testing strategy based on change-point methods for detecting calendar-time effects (Friede and Henderson, 2003; Friede et al., 2006). In a simulation study we demonstrate that our procedure performs favourably compared to the procedure suggested by the guideline. Tools that help to explore changes in treatment effects will be discussed.
References
Committee for Medicinal Products for Human Use (2006) Reflection paper on methodological issues in confirmatory clinical trials with flexible design and analysis plan. London, 23 March 2006, Doc. Ref. CHMP/EWP/2459/02.
Friede T, Henderson R (2003) Intervention effects in observational studies with an application in total hip replacements. Statistics in Medicine 22: 3725-3737.
Friede T, Henderson R, Kao CF (2006) A note on testing for intervention effects on binary responses. Methods of Information in Medicine 45: 435-440.

On estimates of R-values in selection problems
Andreas Futschik University of Vienna 
In the context of selection, quantities analogous to p-values (called R-values) have been introduced by J. Hsu (1984). They may be interpreted as a measure of evidence for rejecting (i.e. not selecting) a population. As in multiple hypothesis testing, where p-values are corrected for multiplicity, these R-values can be quite conservative in high-dimensional settings unless the parameters are close to the least favorable configuration. We propose estimates of R-values that are less conservative and investigate their behavior. They also lead to selection rules for high-dimensional problems.

False discovery proportion control under dependence
Yongchao Ge Mount Sinai School of Medicine, New York 
In data sets involving multiple testing, we are interested in statistical inference on a) the total number $m_1$ of false null hypotheses, and b) the random variable false discovery proportion (FDP): the ratio of the total number of false positives to the total number of positives. The expectation of the FDP is the false discovery rate defined by Benjamini and Hochberg (1995). We describe a general algorithm to construct an upper prediction band for the FDP and a lower confidence bound for $m_1$ simultaneously. This algorithm has three features: i) resampling, to incorporate the dependence among the test statistics and improve power; ii) an appropriate normalization of the ordered test statistics or of the numbers of false positives; and iii) carefully chosen rejection regions. Two interesting choices of normalization are standard normalization and quantile normalization. The former generalizes the max-Z procedure (Ge et al., 2005; Meinshausen and Rice, 2006) from independent to dependent data, while the latter improves on the work of Meinshausen (2006). The properties of these two choices of normalization, together with other normalizations, are compared using simulated data and microarray data.

Resampling-Based Empirical Bayes Multiple Testing Procedure for Controlling the False Discovery Rate with Applications to Genomics
Houston Gilbert; Sandrine Dudoit, Mark J. van der Laan University of California, Berkeley 
We propose resampling-based empirical Bayes multiple testing procedures (MTPs) for controlling a broad class of Type I error rates, defined as tail probabilities and expected values for arbitrary functions of the numbers of false positives and true positives [3, 4]. Such error rates include, in particular, the popular false discovery rate (FDR), defined as the expected proportion of Type I errors among the rejected hypotheses. The approach involves specifying the following: (i) a joint null distribution (or estimator thereof) for vectors of null test statistics; (ii) a distribution for random guessed sets of true null hypotheses. A working model for generating pairs of random variables from distributions (i) and (ii) is a common marginal nonparametric mixture distribution for the test statistics. By randomly sampling null test statistics and guessed sets of true null hypotheses, one obtains a distribution for a guessed function of the numbers of false positives and true positives, for any given vector of cutoffs for the test statistics. Cutoffs can then be chosen to control tail probabilities and expected values of this distribution at a user-supplied level.
We wish to stress the generality of the proposed resampling-based empirical Bayes approach: (i) it controls tail probability and expected value error rates for a broad class of functions of the numbers of false positives and true positives; (ii) unlike most MTPs controlling the proportion of false positives, it is based on a joint null distribution of the test statistics and provides Type I error control in testing problems involving general data generating distributions with arbitrary dependence structures among variables; (iii) it can be applied to any distribution pair for the null test statistics and guessed sets of true null hypotheses, i.e., the common marginal nonparametric mixture model is only one among many reasonable working models, and it does not assume independence of the test statistics.
Simulation study results indicate that resampling-based empirical Bayes MTPs compare favorably in terms of both Type I error control and power to competing FDR-controlling procedures, such as those of Benjamini and Hochberg (1995) [1] and Storey (2002) [5]. The proposed MTPs are also applied to DNA microarray-based genetic mapping and gene expression studies in Saccharomyces cerevisiae [2].
1.) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 1995.
2.) R.B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci., 2005.
3.) S. Dudoit and M.J. van der Laan. Multiple Testing Procedures and Applications to Genomics. Springer, 2007. (In preparation).
4.) S. Dudoit, H.N. Gilbert and M.J. van der Laan. Resamplingbased empirical Bayes multiple testing procedure for controlling the false discovery rate. Technical report, Division of Biostatistics, University of California, Berkeley, 2007. (In preparation).
5.) J.D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 2002.

Comparing treatment combinations with the corresponding monotherapies in clinical trials
Ekkehard Glimm; Norbert Benda Novartis Pharma AG 
The intention of many clinical trials is to show superiority of a treatment over two others, e.g., of a combination therapy over the corresponding monotherapies. In such a trial two drugs are administered simultaneously. A beneficial effect might arise from a synergistic effect of the monotherapies. Even in the presence of an antagonistic effect, however, simple superiority of the combination drug might be sufficient, e.g., as a way to overcome dose limitations of the monotherapies.
The standard confirmatory statistical test consists of two tests at level $\alpha$, with rejection if both of them are significant. This approach was called the min test by Laska and Meisner (1989), who showed that it is uniformly most powerful in a certain class of monotone tests. However, while it exhausts the $\alpha$-level as the difference between the monotherapy effects approaches infinity, it is very conservative in the practically more relevant situation of similar monotherapy effects. Sarkar et al. (1995) have shown that it is possible to construct tests that are uniformly more powerful than this approach if the notion of monotonicity is abandoned.
In this talk, we will present alternatives to the tests suggested by Sarkar et al., some of which are also uniformly more powerful than the min test, and others which simply have a different power profile (e.g.\ are advantageous for small or large effect differences).
Simulations and asymptotic considerations will be used to investigate where and how much power is gained depending on the constellation of the therapeutic effects. Finally, the concept of monotonicity and its practical implications will be discussed.
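As background, the min test itself is simple: reject only if both pairwise comparisons are individually significant at level $\alpha$. A z-statistic sketch (the function name, inputs, and normal approximation are illustrative assumptions, not the authors' proposal):

```python
import math

def min_test(mean_comb, mean_a, mean_b, se_a, se_b, alpha=0.05):
    """Min test: declare the combination superior to both monotherapies
    only if each one-sided z-test is significant at level alpha.
    se_a, se_b are standard errors of the two estimated differences."""
    z_a = (mean_comb - mean_a) / se_a
    z_b = (mean_comb - mean_b) / se_b
    z_min = min(z_a, z_b)
    p = 0.5 * math.erfc(z_min / math.sqrt(2))  # upper-tail normal p-value
    return p, p <= alpha
```

Note that the p-value is driven entirely by the weaker of the two comparisons, which is exactly why the test is conservative when the monotherapy effects are similar.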

Exact calculations of expected power for the Benjamini-Hochberg procedure
Deborah Glueck; Anis KarimpourFard, Lawrence Hunter, Jan Mandel and Keith E. Muller University of Colorado at Denver and Health Sciences Center 
We give exact analytic expressions for the expected power of the Benjamini and Hochberg procedure. We derive bounds for multidimensional rejection regions. We make assumptions about the number of hypotheses being tested, which null hypotheses are true, which are false, and the distributions of the test statistics under each null and alternative. This enables us to find the joint cumulative distribution function of the order statistics of the p-values, both under the null and under the alternative. We thus have order statistics that arise from two sets of real-valued independent, but not necessarily identically distributed, random variables. We show that the probability of each rejection region can be expressed as the probability that arbitrary subsets of order statistics fall in disjoint, ordered intervals, and that, of the smallest statistics, a certain number come from one set. Finally, we express the joint probability distribution of the number of rejections and the number of false rejections by summing the appropriate probabilities over the rejection regions. The expected power is a simple function of this probability distribution. We give an example power analysis for a multiple comparisons problem in mammography.
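For reference, the Benjamini-Hochberg step-up procedure whose expected power is analyzed here can be sketched as follows (a standard textbook implementation, not the authors' exact-power calculations):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure: find the largest k with p_(k) <= k * alpha / m
    and reject the k hypotheses with the smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return {order[j] for j in range(k)}  # indices of rejected hypotheses
```

The expected power of this rule is the expected fraction of false nulls among the rejected set, which is what the exact expressions in the abstract evaluate.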

Familywise error on the directed acyclic graph of Gene Ontology
Jelle Goeman; Ulrich Mansmann Leiden University Medical Center 
Methods that test for differential expression of gene groups, such as those provided by the Gene Ontology database, are becoming increasingly popular in the analysis of gene expression data. However, so far such methods could not make use of the graph structure of Gene Ontology when adjusting for multiple testing.
We propose a multiple testing method, called the focus level procedure, that preserves the graph structure of Gene Ontology (GO) when testing for association of the expression profiles of GO terms with a response variable. The procedure is constructed as a combination of a closed testing procedure with Holm's method. It allows a user to choose a "focus level" in the GO graph, which reflects the level of specificity of terms in which the user is most interested. This choice also determines the level in the GO graph at which the procedure has the most power. The procedure strongly controls the familywise error rate without any additional assumptions on the joint distribution of the test statistics used. We also present an algorithm to calculate multiplicity-adjusted p-values. Because the focus level procedure preserves the structure of the GO graph, it does not generally preserve the ordering of the raw p-values in the adjusted p-values.

Two-stage designs for proteomic and gene expression studies applying methods differing in costs
Alexandra Goll; Peter Bauer, Section of Medical Statistics, Medical University of Vienna, Austria 
In gene expression and proteomic studies we generally deal with large numbers of hypotheses, where noticeable effects exist only for a small fraction of the hypotheses. Due to limited resources, the number of observations per hypothesis in a conventional single-stage design is low, which limits the power. It has been shown that two-stage pilot and integrated designs are a good option to improve the power. In these sequential designs, the first stage is used to screen for the promising hypotheses, which are further investigated in the second stage. Here we investigate more thoroughly this type of two-stage design, where the costs per measurement and the effect sizes differ between the first and second stage. To compare different designs we assume that the total costs of the experiment are fixed. Both integrated and pilot designs are based on procedures controlling either the familywise type I error rate (FWE) or the false discovery rate (FDR). Two scenarios are considered. In the first scenario the experimenter may from the beginning have the choice between two methods that differ in costs and effect sizes (a low-cost standard method or a high-cost improved method). In the second scenario different costs per measurement may arise if the same method is applied at both stages but specific experimental devices have to be produced, at higher costs per measurement, for the selected markers at the second stage. For the first scenario we show that, depending on the cost and effect size ratios between the methods, it is preferable to apply either the low-cost or the high-cost method at both stages. For the second scenario we show for which cost ratios between stages it is worthwhile to use (optimal) two-stage designs as compared to the single-stage design. Finally, we also examine how design misspecifications in the planning phase would change the power of two-stage designs as compared to the single-stage design.

Adaptive Designs with Correlated Test Statistics
Heiko Götte; Andreas Faldum, Gerhard Hommel Institute of Medical Biostatistics, Epidemiology and Informatics, Johannes Guten 
In clinical trials the collected observations are often correlated, for example with clustered data or repeated measurements. When applying adaptive designs in these situations, the test statistics of different stages are often also correlated, so that classical adaptive designs for uncorrelated test statistics (for example Bauer/Köhne, 1994) do not seem appropriate. Hommel et al. (2005) proposed the Modified Simes test for two-stage adaptive designs with correlated test statistics to handle this issue. For bivariate normally distributed test statistics the significance level can be preserved. Analogously to Shih/Quan (1999), we give an explicit formula for the probability of a type I error for the Bauer-Köhne design in the situation of bivariate normally distributed test statistics. We show that the significance level is inflated for positively correlated test statistics. The decision boundary for the second stage can be modified in a way that controls the type I error. The concept is expandable to other adaptive designs; the Modified Simes test is a special case. In order to use these designs, the correlation between the test statistics has to be determined. For a repeated-measurement setting we show how the correlation can be estimated within the framework of linear mixed models. The power of the Modified Simes test is compared with the power of the Bauer-Köhne design for this situation.
References:
Bauer, P., Köhne, K. (1994). Evaluation of Experiments with Adaptive Interim Analyses. Biometrics, 50:1029-1041.
Hommel G., Lindig V., Faldum A. (2005). Two-stage adaptive designs with correlated test statistics. Journal of Biopharmaceutical Statistics, 15:613-623.
Shih W.J., Quan H. (1999). Planning and analysis of repeated measures at key time points in clinical trials sponsored by pharmaceutical companies. Statistics in Medicine, 18:961-973.
This talk contains parts of the thesis of Heiko Götte.

A Bayesian screening method for determining if adverse events reported in a clinical trial are likely to be related to treatment
A Lawrence Gould Merck Research Laboratories 
Many different adverse events usually are reported in large-scale clinical trials. Most of the events will not have been identified a priori. Current analysis practice often applies Fisher's exact test to the usually relatively small event counts, with a conclusion of “safety” if the finding does not reach statistical significance. This practice has serious disadvantages: lack of significance does not mean lack of risk, the various tests are not adjusted for multiplicity, and the data determine which hypotheses are tested. This presentation describes a new approach that does not test hypotheses, is self-adjusting for multiplicity, and has well-defined diagnostic properties. The approach is a screening approach that uses Bayesian model selection techniques to determine for each adverse event the likelihood that the occurrence is treatment-related. The approach directly incorporates clinical judgment by having the criteria for treatment relation determined by the investigator(s). The method is developed for outcomes that arise from binomial distributions (relatively small trials) and for outcomes that arise from Poisson distributions (relatively large trials). The calculations are illustrated with trial outcomes.

Simultaneous confidence regions corresponding to Holm's step-down multiple testing procedure
Olivier Guilbaud AstraZeneca R&D, Sweden 
The problem of finding simultaneous confidence regions corresponding to multiple testing procedures (MTPs) is of considerable practical importance. Such confidence regions provide more information than the mere rejections/acceptances of null hypotheses that can be made by MTPs. I will show how one can construct simultaneous confidence regions for a finite number of quantities of interest that correspond to Holm's (1979) step-down multiple testing procedure. Holm's MTP is an important and widely used generalization of the Bonferroni MTP. Like the Bonferroni and Holm MTPs, the proposed confidence regions are quite flexible and generally valid. They are based on marginal confidence regions for the quantities of interest, and the only essential assumption for their validity is that the marginal confidence regions are valid. The estimated quantities, as well as the marginal confidence regions, can be of any kinds/dimensions. The proposed simultaneous confidence regions are of particular interest when one aims at confidence statements that will "show" that quantities belong to target regions of interest.
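For orientation, Holm's step-down procedure referred to above can be expressed through adjusted p-values (a standard implementation of Holm (1979), not Guilbaud's confidence construction):

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values: multiply the i-th smallest
    p-value by (m - i + 1), enforce monotonicity with running maxima,
    and cap at 1. Reject H_i at FWER level alpha iff adjusted[i] <= alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(running_max, 1.0)
    return adjusted
```

The confidence regions in the talk invert this stepwise rejection rule, so that the set of parameter values not rejected at each step yields simultaneous coverage.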

Simultaneous Inference for Ratios
David Hare; John Spurrier, University of Louisiana at Monroe 
Consider a general linear model with $p$-dimensional parameter vector $\beta$ and i.i.d. normal errors. Let $K_1, \ldots, K_k$, and $L$ be linearly independent vectors of constants such that $L^T\beta \neq 0$. We describe exact simultaneous tests of hypotheses that the ratios $K_i^T\beta / L^T\beta$ equal specified constants, using one-sided and two-sided alternatives, and describe exact simultaneous confidence intervals for these ratios. In the case where the confidence set is a single bounded contiguous set, we describe what we claim are the best possible conservative simultaneous confidence intervals for these ratios, best in the sense that they form the minimum $k$-dimensional hypercube enclosing the exact simultaneous confidence set. We show that in the case of $k = 2$, this "box" is defined by the minimum and maximum values of the two ratios over the simultaneous confidence set, and that these values are obtained from one of two sources: either from the solutions to each of four systems of equations, or at points along the boundary of the simultaneous confidence set where the correlation between two t variables is zero. We then verify that these intervals are narrower than those previously presented in the literature.

Screening for Partial Conjunction Hypotheses
Ruth Heller; Yoav Benjamini, Tel Aviv University 
We consider the problem of testing the partial conjunction null, which asks whether fewer than $u$ out of $n$ null hypotheses are false. It offers an in-between approach between testing the global null that all $n$ null hypotheses are true and the full conjunction null that not all of the $n$ hypotheses are false. We address the problem of testing many partial conjunction hypotheses simultaneously, a problem that arises when combining maps of p-values. We suggest powerful test statistics that are valid under dependence between the test statistics as well as under independence. We suggest controlling the false discovery rate (FDR) on the p-values for testing the partial conjunction hypotheses, and we prove that the BH FDR controlling procedure remains valid under various dependency structures. We apply the method to examples from microarray analysis and functional magnetic resonance imaging (fMRI), two application areas where the need for partial conjunction analysis has been identified.

A unifying approach to noninferiority, equivalence and superiority tests
Chihiro Hirotsu Meisei University 
Two multiple decision approaches are proposed for unifying the noninferiority, equivalence and superiority tests in a comparative clinical trial of a new drug against an active control. One is a confidence set method with confidence coefficient 0.95 that improves the consumer's and producer's risks of the usual naïve confidence interval approach. It requires the region both to include 0 and to clear the noninferiority margin, so that a trial with a somewhat large number of subjects aiming to prove noninferiority of a drug which is actually inferior should be unsuccessful.
The other is a closed testing procedure that combines the one- and two-sided tests by applying the partitioning principle and justifies the switching procedure unifying the noninferiority, equivalence and superiority tests. In particular, regarding noninferiority, the proposed method simultaneously justifies the old Japanese Statistical Guideline (one-sided 0.05 test) and the International Guideline (two-sided 0.05 test). The method is particularly attractive in that it grades the strength of the evidence of relative efficacy of the test drug against the control at five levels according to the achievement of the clinical trial.
Key words: Bioequivalence, closed testing procedure, confidence set, noninferiority, partitioning principle, superiority.

Neglect of Multiplicity in Hypothesis Testing of Correlation Matrices
Burt Holland Temple University 
Many social science journals publish articles with correlation matrices accompanied by tests of significance that ignore multiplicity. A highly cited article in Psychological Methods recommended the use of an MCP when testing correlations but promoted procedures that are inapplicable to correlations. We discuss viable options for handling this problem.

Multiple comparisons for ratios to the grand mean
Ludwig A. Hothorn; G. Dilba Leibniz Uni Hannover 
Multiple comparisons for differences to the grand mean are a well-known approach, commonly used in quality control; see the recent textbook on ANOM (analysis of means) by Nelson et al. (2005). Alternatively, we discuss multiple comparisons for ratios to the grand mean: multiple tests and simultaneous confidence intervals. The simultaneous confidence intervals represent a generalization of Fieller intervals, plugging the estimated correlations into the multivariate t distribution with arbitrary correlation matrix. A related R program will be provided using the mvtnorm package by Hothorn et al. (2001).
The advantage of dimensionless confidence intervals will be demonstrated by examples for comparing several mutants or different varieties for multiple endpoints.
References
Hothorn T et al. (2001) On multivariate t and Gauss probabilities. R News 1(2): 27-29.
Nelson PR et al. (2005) The Analysis of Means. SIAM.

To model or not to model
Jason Hsu; Violeta Calian, Dongmei Li The Ohio State University 
Resampling techniques are often used to estimate null distributions of test statistics in multiple testing. In the comparison of gene expression levels and in multiple endpoint problems, resampling is often used to take into account correlations among the observations. We describe how each of the resampling techniques (permutation of raw data, post-pivoting of resampled test statistics, and resampling of pre-pivoted observations) has its own requirement on knowledge of the joint distribution of the test statistics for validity. Modeling is useful for validating a resampling multiple testing technique. To the extent that pre-pivot resampling is valid, for small samples it has some advantage in smoothness and stability of the estimated null distributions.

Simultaneous confidence intervals by iteratively adjusted alpha for relative effects in the one-way layout
Thomas Jaki; Martin J. Wolfsegger Lancaster University 
A bootstrap-based method to construct $1-\alpha$ simultaneous confidence intervals for relative effects in the one-way layout is presented. This procedure takes the stochastic correlation between the test statistics into account and results in narrower simultaneous confidence intervals than the application of the Bonferroni correction. Instead of using the bootstrap distribution of a maximum statistic, the coverages of the confidence intervals for the individual comparisons are adjusted iteratively until the overall confidence level is reached. Empirical coverage and power estimates of the introduced procedure for many-to-one comparisons are presented and compared with asymptotic procedures based on the multivariate normal distribution.

Distribution Theory with Two Correlated Chi-Square Variables
Anwar H. Joarder, King Fahd University of Petroleum & Minerals 
Ratios of two independent chi-square variables are widely used in statistical tests of hypotheses. This paper introduces a new bivariate chi-square distribution in which the variables are not necessarily independent. Moments of the product and ratio of two correlated chi-square variables are outlined. Distributions of the sum and product of two correlated chi-square variables are also derived.
AMS Mathematics Subject Classification: 60E05, 60E10, 62E15
Key Words and Phrases: Chi-square distribution, Wishart distribution, product moments, bivariate distribution, correlation

On Multiple Treatment Effects in Adaptive Clinical Trials for Longitudinal Count data
Vandna Jowaheer; Brajendra C. Sutradhar University of Mauritius 
In longitudinal adaptive clinical trials it is an important research problem to compare more than two treatments, with the aim of treating the maximum number of patients with the best possible treatment. Recently, in the context of longitudinal adaptive clinical trials for count responses, Sutradhar and Jowaheer (2006) [SJ (2006)] introduced a simple longitudinal play-the-winner (SLPW) design for the treatment selection for an incoming patient and discussed a weighted generalized quasi-likelihood (WGQL) approach for consistent and efficient estimation of the regression effects, including the treatment effects. Their study, however, was confined to the comparison of two treatments. In this paper, we generalize their SLPW design from the two-treatment case to the multiple-treatment case. For the estimation of the treatment effects we provide a conditional WGQL (CWGQL) as well as an unconditional WGQL approach. Both approaches provide consistent and efficient estimates of the treatment effects, the CWGQL being simpler but slightly unstable compared to the unconditional WGQL approach, where we use the limiting weights for the treatment selection. A normality-based asymptotic test for testing the equality of the treatment effects is also outlined.

Sequential genome-wide association studies for pharmacovigilance
Patrick Kelly University of Reading, UK 
Pharmacovigilance, the monitoring of adverse events, is an integral part of the clinical evaluation of a new drug. Until recently, attempts to relate the incidence of adverse events to putative causes have been restricted to the evaluation of simple demographic and environmental factors. The advent of large-scale genotyping, however, provides an opportunity to look for associations between adverse events and genetic markers, such as single nucleotide polymorphisms (SNPs). It is envisaged that a very large number of SNPs, possibly over 500,000, will be used in pharmacovigilance in an attempt to identify any genetic difference between patients who have experienced an adverse event and those who have not.
This paper presents a sequential genome-wide association test for analysing pharmacovigilance data as adverse events arise, allowing evidence-based decision-making at the earliest opportunity. This gives us the capability of quickly establishing whether there is a group of patients at high risk of an adverse event based upon their DNA. The method uses permutations and simulations in order to obtain valid hypothesis tests which are adjusted for both linkage disequilibrium and multiple testing. Permutations are used to calculate p-values because the asymptotic properties of the test statistic are unlikely to hold due to linkage disequilibrium. Simulations are used to find the nominal significance level required to satisfy some overall type I error rate. The simulations provide a simple and easy approach to obtaining a correction for the multiple testing without having to determine how the repeated tests are correlated.
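The permutation idea is generic: recompute the test statistic under random relabelings of case/control status and count how often it is at least as extreme as the observed value. A minimal two-sample sketch, using a difference in means as a stand-in for the genotype test statistics in the abstract:

```python
import random

def permutation_pvalue(cases, controls, n_perm=10000, seed=1):
    """Permutation p-value for an absolute difference in means,
    with the add-one correction so the p-value is never exactly 0."""
    rng = random.Random(seed)
    pooled = list(cases) + list(controls)
    n = len(cases)

    def stat(xs, ys):
        return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

    observed = stat(cases, controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of case/control status
        if stat(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the null distribution is built from the data themselves, the correlation induced by linkage disequilibrium is automatically reflected in the permuted statistics.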

Effects of dependence in high-dimensional multiple testing problems
Kyung In Kim; Mark A. van de Wiel Eindhoven University of Technology 
We consider the effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems. Recent simulation studies considered only simple correlation structures among variables, hardly inspired by real data features. Our aim is to describe dependence as a network and to systematically study the effects of several network features, such as sparsity and correlation strength. We discuss a new method for efficient guided simulation of dependent data which satisfy the imposed network constraints. We use constrained random correlation matrices and perform extensive simulations under nested conditional independence structures. We check the robustness against dependence of several popular FDR procedures, such as Benjamini-Hochberg FDR, Storey's q-value, SAM, and other resampling-based FDR procedures. False non-discovery rates and estimates of the number of null hypotheses are computed with these methods and compared. Our simulation studies show that popular methods such as SAM and the q-value seem to overestimate the nominal FDR significance level under dependence. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be most robust and remains conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable.

A unified approach to proof of concept and dose estimation for categorical responses
Bernhard Klingenberg Williams College 
This talk suggests unifying dose-response modeling and target dose estimation into a single framework for the benefit of a more comprehensive and powerful analysis. Bretz, Pinheiro and Branson (Biometrics, 2006) recently implemented a similar idea for independent normal data by using optimal contrasts as a selection criterion among various candidate dose-response models. We suggest a framework in which, from a comprehensive set of candidate models, those are chosen that best pick up the dose-response signal. To decide which models, if any, significantly pick up the signal, we construct the permutation distribution of the maximum penalized deviance over the candidate set. This allows us to find critical values and multiplicity-adjusted p-values, controlling the error rate of declaring spurious signals as significant. A thorough evaluation and comparison of our approach to popular multiple contrast tests reveals that its power is as good as or better in detecting a dose-response signal under a variety of situations, with many additional benefits: it incorporates model uncertainty in proof-of-concept decisions and target dose estimation, yields confidence intervals for target dose estimates, allows for adjustments due to covariates, and extends to more complicated data structures. We illustrate our method with the analysis of a Phase II clinical trial.

On the use of conventional tests in flexible, multiple test designs
Franz Koenig; Peter Bauer, Werner Brannath Medical University of Vienna 
Flexible designs based on the closure principle offer a large amount of flexibility in clinical trials while controlling the type I error rate. This allows the combination of trials from different clinical phases of a drug development process. Flexible designs have been criticized because they may lead to different weights for the patients from the different stages when reassessing sample sizes. Analyzing the data in a conventional way avoids such unequal weighting but may inflate the multiple type I error rate. In cases where the conditional type I error rate of the new design (and conventional analysis) is below the conditional type I error rate of the initial design, the conventional analysis may be performed without inflating the type I error rate. This method will be used to explore switching between conventional designs for typical examples.

Gatekeeping testing without tears
David Li; Devan Mehrotra, Merck Research Labs 
In a clinical trial there are typically one or two primary endpoints and a few secondary endpoints. When at least one primary endpoint achieves statistical significance, there is considerable interest in using the results for the secondary endpoints to enhance characterization of the treatment effect. Because multiple endpoints are involved, regulators may require that the trialwise type I error rate be controlled at a preset level. This requirement can be achieved by using "gatekeeping" methods. However, existing methods suffer from logical oddities such as allowing results for secondary endpoint(s) to impact the likelihood of success for the primary endpoint(s). We propose a novel and easy-to-implement gatekeeping procedure that is devoid of such deficiencies. Simulation results and real data examples are used to illustrate efficiency gains of our method relative to existing methods.

Exact simultaneous confidence bands for multiple linear regression over an ellipsoidal region
Shan Lin; Wei Liu University of Southampton, S3RI 
A simultaneous confidence band provides useful information on the whereabouts of the true regression function. The construction of simultaneous confidence bands has a history going back to Working and Hotelling (1929) and is a hard problem when the predictor space is restricted to some region and there is more than one regression covariate. This talk gives the construction of exact one-sided and two-sided simultaneous confidence bands for a multiple linear regression model over an ellipsoidal region centered at the point of means of the predictor variables in the experiment, based on three methods: the method of Bohrer (1973), the algebraic method, and the tubular neighborhood method. It is also of interest to show that these three methods give the same result.

Testing Procedures on Comparisons of Several Treatments with one Control in a Microarray Setting
Dan Lin; Ziv. Shkedy, Tomasz Burzykowski, Hinrich W.H. Göhlmann, An De Bondt, Tim Perera, Center for Statistics,Hasselt University 
We discuss a particular situation in a microarray experiment, in which two-dimensional multiple testing arises from comparing several treatments with a control on the one hand and testing tens of thousands of genes simultaneously on the other hand. Dunnett's single-step procedure (Dunnett 1955) for testing effective treatments can be used to address the one-dimensional question of primary interest. Dunnett's procedure was implemented within resampling-based algorithms such as Significance Analysis of Microarrays (SAM, Tusher et al. 2001) and the Benjamini and Hochberg False Discovery Rate (FDR, Benjamini and Hochberg 1995). To combine the two-dimensional testing problem into one testing procedure, we propose an approach that performs m*K (number of genes * number of comparisons between the treatments and the control) tests simultaneously. We compared the performance of SAM and the classical BH-FDR. The method was applied to a microarray experiment with 4 treatment groups (3 microarrays in each group) and 16998 genes. Additionally, a simulation study was conducted to investigate the power of the proposed methods and to investigate how to choose the fudge factor in SAM to leverage the genes with small variances.
Keywords: Dunnett's single-step procedure; microarray; multiple testing; Benjamini and Hochberg false discovery rate (BH-FDR); SAM.
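The combined m*K testing idea can be illustrated by computing one p-value per gene-by-comparison pair and treating all m*K of them as a single family under the Benjamini-Hochberg step-up rule. A minimal sketch (the group sizes, effect sizes and plain two-sample t-tests below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy setting in the spirit of the abstract: m genes, a control group
# and K treatment groups with n arrays each (all sizes illustrative).
m, K, n = 200, 3, 5
control = rng.normal(size=(m, n))
treatments = rng.normal(size=(K, m, n))
treatments[0, :20] += 4.0  # a few truly affected genes in treatment 1

# One two-sample t-test per (gene, treatment-vs-control) comparison,
# giving m*K p-values that are treated as a single family.
pvals = np.empty((m, K))
for k in range(K):
    _, pvals[:, k] = stats.ttest_ind(treatments[k], control, axis=1)

# Benjamini-Hochberg linear step-up over all m*K p-values at level q.
q = 0.05
p = np.sort(pvals.ravel())
crit = q * np.arange(1, p.size + 1) / p.size
below = np.nonzero(p <= crit)[0]
n_rej = 0 if below.size == 0 else below[-1] + 1
print(n_rej)  # number of rejected gene-by-comparison hypotheses
```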

A New Hypothesis to Test Minimal Fold Changes of Gene Expression Levels
Jen-pei Liu; Chen-Tuo Liao, Jia-Yan Dai, Division of Biometry, Department of Agronomy, National Taiwan University 
Current approaches to identifying differentially expressed genes are based either on fold changes or on the traditional hypotheses of equality. However, fold changes do not take into consideration the variation in the estimation of the average expression. In addition, the use of fold changes is not in the framework of hypothesis testing, and hence the probability associated with errors in decision-making for the identification of differentially expressed genes cannot be quantified and evaluated. On the other hand, the traditional hypothesis of equality fails to take into consideration the magnitudes of the biologically meaningful fold changes that truly differentiate the expression levels of genes between groups. Because of the large number of genes tested and the small number of samples available in microarray experiments, the false positive rate for differentially expressed genes is quite high and requires further adjustments such as the Bonferroni method, the false discovery rate, or the use of an arbitrary cutoff for the p-values. None of these adjustments has a biological justification. Hence, we propose to formulate the identification of differentially expressed genes as an interval hypothesis that considers both the minimal biologically meaningful fold change and statistical significance simultaneously. Based on the interval hypothesis, a two one-sided tests procedure is proposed, together with a method for sample size determination. A simulation study is conducted to empirically compare the type I error rate and power among the two-sample t-test, the two-sample t-test with Bonferroni adjustment, the fold-change rule, the combination of the two-sample t-test and the fold-change rule, and the proposed two one-sided tests procedure, under various combinations of fold changes, variability and sample sizes. 
Simulation results show that the proposed two one-sided tests procedure based on the interval hypothesis not only controls the type I error rate at the nominal level but also provides sufficient power to detect differentially expressed genes. Numerical data from public domains illustrate the proposed methods.
Keywords: Interval hypothesis, Type I error, Power, Fold change
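The two one-sided tests idea for a minimal fold change can be sketched as follows (the margin delta, the level alpha, the pooled-variance t and all data are illustrative assumptions, not the authors' exact procedure):

```python
import numpy as np
from scipy import stats

def tost_fold_change(x, y, delta, alpha=0.05):
    """Two one-sided tests for the interval hypothesis
    H0: |mu_x - mu_y| <= delta vs H1: |mu_x - mu_y| > delta
    on the log-expression scale (a sketch of the idea; the
    pooled-variance t and the margin delta are assumptions)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
    tcrit = stats.t.ppf(1 - alpha, nx + ny - 2)
    # Declare differential expression only if one of the one-sided
    # tests shows the difference beyond the meaningful margin.
    return bool((diff - delta) / se > tcrit or (diff + delta) / se < -tcrit)

x = np.array([2.1, 1.9, 2.0, 2.2, 1.8, 2.0])    # treatment, log2 scale
y = np.array([0.1, -0.1, 0.0, 0.2, -0.2, 0.0])  # control, log2 scale
print(tost_fold_change(x, y, delta=1.0))  # → True (fold change clearly > 2)
```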

Minimum area confidence set optimality for confidence bands in simple linear regression
Wei Liu; A. J. Hayter S3RI and School of Maths 
The average width of a simultaneous confidence band has been used by several authors (e.g. Naiman, 1983, 1984, Piegorsch, 1985a) as a criterion for the comparison of different confidence bands. In this paper, the area of the confidence set corresponding to a confidence band is used as a new criterion. For simple linear regression, comparisons have been carried out under this new criterion between hyperbolic bands, two-segment bands, and three-segment bands, which include constant-width bands as special cases. It is found that if one requires a confidence band over the whole range of the covariate, then the best confidence band is given by the Working & Hotelling hyperbolic band. Furthermore, if one needs a confidence band over a finite interval of the covariate, then a restricted hyperbolic band can again be recommended, although a three-segment band may be very slightly superior in certain cases.
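For simple linear regression, the Working & Hotelling hyperbolic band recommended above has a simple closed form; a minimal sketch (the data and function names are illustrative):

```python
import numpy as np
from scipy import stats

def wh_band(x, y, x0, alpha=0.05):
    """Working-Hotelling hyperbolic simultaneous confidence band for the
    regression line of a simple linear regression, evaluated at x0."""
    n = len(x)
    xbar = np.mean(x)
    sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - np.mean(y))) / sxx
    b0 = np.mean(y) - b1 * xbar
    s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
    # Simultaneous coverage over all x uses sqrt(2 F_{2, n-2; alpha}).
    w = np.sqrt(2 * stats.f.ppf(1 - alpha, 2, n - 2))
    half = w * np.sqrt(s2 * (1.0 / n + (x0 - xbar) ** 2 / sxx))
    fit = b0 + b1 * x0
    return fit - half, fit + half

x = np.arange(10.0)
y = 1.0 + 2.0 * x + np.random.default_rng(2).normal(0.0, 0.5, 10)
lo, hi = wh_band(x, y, np.linspace(0.0, 9.0, 5))
print(np.all(lo < hi))  # → True; the band is narrowest at the mean of x
```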

A Bayesian Spatial Mixture Model for FMRI Analysis
Brent Logan; Maya P. Geliazkova, Daniel B. Rowe, Prakash W. Laud Medical College of Wisconsin 
One common objective of fMRI studies is to identify voxels, or points in the brain, that are activated by a neurocognitive task. This is an important multiple comparisons problem, since typically inference (often using z or t tests) is performed on each of thousands or hundreds of thousands of voxels. The false discovery rate has been studied for use in this problem by several authors. Finite mixture models have also been proposed to address the multiplicity issue, where voxels are classified according to whether or not they are activated by the cognitive task. Links between the false discovery rate and mixture models have been shown in the literature. One limitation of these procedures is that activation is typically expected to occur in clusters of neighboring voxels rather than in isolated single voxels; methods that do not account for this may have lower sensitivity to activation. We propose a Bayesian spatial mixture model to address these issues. Each voxel has an unknown or latent activation status, denoted by a binary activation variable. The spatial model for the binary activation indicators is induced by a latent Gaussian spatial process (a conditional autoregressive, or CAR, model), thresholded to produce the binary activation, analogous to a spatial probit model. An efficient Gibbs sampling algorithm is developed to implement the model, yielding posterior probabilities of activation for each voxel, conditional on the observed data. We apply this method to a real fMRI study, and compare its performance in simulation with other methods proposed for fMRI analysis.

Multiplicity-corrected, nonparametric tolerance regions for cardiac ECG features
Gheorghe Luta; S. Stanley Young, Alex Dmitrienko National Institute of Statistical Sciences, USA 
Electrocardiograms are used to evaluate possible effects on the heart induced by drug candidates. These waveforms are quite complex, and many numerical features are extracted from them for statistical evaluation. In addition, various covariates (heart rate, gender, age, etc.) also need to be taken into account, and the multiple questions under consideration need to be addressed jointly. Our idea is to combine two statistical methodologies: nonparametric tolerance regions and resampling-based multiple testing correction. We will review electrocardiograms and their standard numerical characteristics, and place this work into the framework of drug evaluation clinical trials. Using real data, we will show how nonparametric tolerance regions can be used with resampling multiplicity adjustments. The product of this strategy is tolerance regions that adapt to the shape of the observed distributions and control the familywise error rate over the clinical trial.

Adaptive Design in Dose Ranging Studies Based on Both Efficacy and Safety Responses
Olga Marchenko, i3 Statprobe, Inc.; R. Keener, University of Michigan, Ann Arbor 
Traditionally, most designs for Phase I studies gather safety information, aiming to determine the maximum tolerated dose (MTD). Phase II designs then evaluate the efficacy of doses in the (assumed) acceptable toxicity range. It is highly desirable for many reasons to base the dose selection on efficacy and safety responses simultaneously. Recently, several different designs for dose selection have been proposed that are based on both efficacy and safety (e.g., Thall and Cook (2004), Fedorov and Dragalin (2006), Zhang et al. (2006)). While the majority of designs provide an appropriate, safe and efficacious dose or doses with some precision, few of them gain sufficient information on all doses in the range studied. In this talk, I will show how the flexible, adaptive, model-based design proposed by V. Fedorov and V. Dragalin can be implemented and changed as appropriate, by studying simulations similar to three case studies with different desirable responses from several therapeutic areas.

Estimation in Adaptive Group Sequential Design
Cyrus Mehta; Werner Brannath, Martin Posch Cytel Inc. 
This paper proposes two methods for computing confidence intervals with exact or conservative coverage following a group sequential test in which an adaptive design change is made one or more times over the course of the trial. The key idea, due to Müller and Schäfer (2001), is that by preserving the null conditional rejection probability of the remainder of the trial at the time of each adaptive change, the overall type I error, taken unconditionally over all possible design modifications, is also preserved. This idea is further extended by considering the dual tests of repeated confidence intervals (Jennison and Turnbull, 1989) and of stagewise adjusted confidence intervals (Tsiatis, Rosner and Mehta, 1984). The method extends to the computation of median unbiased point estimates.

Estimating the interesting part of a dose-effect curve: When is a Bayesian adaptive design useful?
Frank Miller AstraZeneca, Södertälje, Sweden 
We consider the design of dose-finding trials in phase IIB of drug development. We propose that "estimating the interesting part of the dose-effect curve" is an important objective of such trials. This objective will be made more concrete and formulated in statistical terms in the talk. Having defined the objective, we can apply optimal design theory to derive efficient designs. Because of our objective, we use a customized optimality criterion rather than a common criterion such as D-optimality. We specify both an optimal fixed design (without adaptation) and a two-stage Bayesian adaptive design. The efficiencies of these two designs are compared for several situations. We describe typical situations where one can gain efficiency from using an adaptive design, but also situations where a fixed design might be preferable. Briefly, we discuss modifications of the considered adaptive design and their potential advantages.

The multiple confidence procedure and its applications
Tetsuhisa Miwa National Institute for AgroEnvironmental Sciences 
In 1973 Takeuchi proposed a multiple confidence procedure for multiple decision problems in his book “Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis” (in Japanese). This procedure is based on the partition of the parameter space. Therefore it is closely related to the recent development of the partitioning principles. In our talk we first review the basic concepts of Takeuchi's multiple confidence procedure. Then we discuss some applications and show the usefulness of the procedure.

Estimating the proportion of true null hypotheses with the method of moments
Jose Maria Muino; P. Krajewski Instytut Genetyki Roslin PAN 
In order to construct the critical region for the test statistic in a multiple hypothesis testing situation, it is necessary to obtain some information about the distribution of the test statistic under the null hypothesis and under the alternative, and to use this information in an optimal way to assess which tests can be declared significant. We propose to obtain this information, in the form of the moments of these distributions and the proportion of true null hypotheses ($\pi_0$), with the method of moments. As a particular case, we study the properties of the estimator of $\pi_0$ when the test statistic is the mean value, and we construct a new asymptotically unbiased (as the number of tests goes to infinity) estimator. Some numerical simulations are performed to compare the proposed method with others.

CONFIDENCE SETS FOLLOWING A MODIFIED GROUP SEQUENTIAL TEST
Hans-Helge Müller; Nina Timmesfeld, Institute of Medical Biometry and Epidemiology, Philipps-University of Marburg 
Consider the statistical monitoring of a clinical trial comparing two treatments, where the confirmatory analysis is based on a carefully planned group sequential design. Let us look at the Brownian motion model with the drift parameter reflecting the treatment difference. Now suppose that during the course of the trial a change of the group sequential design is advisable, but that the effect size parameter measuring treatment differences can be retained unchanged.
In order to control the type I error rate, it is necessary and sufficient to redesign the trial on the basis of the Conditional Rejection Probability (CRP) principle proposed by Müller and Schäfer (2004). In addition to decision making on a hypothesis testing paradigm, estimation of the effect size parameter with a confidence set is an important issue at the end of the trial.
Following a group sequential trial, the simple fixed sample confidence intervals are inadequate. Methods for the construction of confidence intervals reflecting early stopping for both significance and futility have been proposed, e.g. the confidence intervals of Tsiatis et al. (1984).
Starting from a valid concept for the estimation of confidence sets in group sequential testing, this contribution shows how to address the construction of confidence sets following a modified design using the flexible CRP approach. The application in clinical trials is illustrated for a survival study using the method of Tsiatis et al. The method of transformation is discussed with regard to the choice of group sequential confidence sets.
References:
Müller HH, Schäfer H. A general statistical principle for changing a design any time during the course of a trial. Statistics in Medicine 2004; 23: 2497-2508.
Tsiatis AA, Rosner GL, Mehta CR. Exact confidence intervals following a group sequential test. Biometrics 1984; 40: 797-803.

On the conservatism of the multivariate Tukey-Kramer procedure
Takahiro Nishiyama; Takashi Seo, Tokyo University of Science 
We consider conservative simultaneous confidence intervals for pairwise comparisons among mean vectors of multivariate normal distributions. The multivariate Tukey-Kramer procedure, the multivariate version of the Tukey-Kramer procedure, is presented. An affirmative proof of the multivariate version of the generalized Tukey conjecture on the conservativeness of the simultaneous confidence intervals for pairwise comparisons of four mean vectors is also presented.
Further, the upper bound for the conservativeness of the multivariate Tukey-Kramer procedure is given in the case of four mean vectors. Finally, numerical results from Monte Carlo simulations are presented.

Scheffé-type multiple comparison procedure in order restricted randomized designs
Omer Ozturk; Steve MacEachern The Ohio State University 
Ozturk and MacEachern (2004) introduced a new design, the order restricted randomized design (ORRD), for the contrast parameters in a linear model. This design uses a restricted randomization scheme that relies on subjective judgment ranking of the experimental units based on their inherent heterogeneity (or homogeneity). The process of judgment ranking creates a positive correlation structure among within-set units, and the restricted randomization on these ranked units translates this positive correlation into a negative one when estimating a contrast. Hence, the design serves as a variance reduction technique for treatment contrasts.
In this talk, we first develop a test for the generalized linear hypothesis based on an ORRD and discuss how this test can be used to test the treatment effects. We then develop a Scheffé-type multiple comparison procedure for all possible contrasts of the treatment effects. We show that the coefficients of the contrasts depend on the design matrix and the underlying covariance structure of the judgment-ranked observations. A simulation study shows that the multiple comparison procedure is robust against a wide range of underlying distributions.

Stepwise confidence intervals for monotone dose-response studies
Jianan Peng; Chu-In Charles Lee, Karolyn Davis, Acadia University 
In dose-response studies, one of the most important issues is the identification of the minimum effective dose (MED), where the MED is defined as the lowest dose such that the mean response is better than the mean response of a zero-dose control by a clinically significant difference. Usually the dose-response curves are monotonic. Various authors have proposed step-down test procedures based on contrasts among the sample means to find the MED. In this paper, we improve Marcus and Peritz's method (1976, Journal of the Royal Statistical Society, Series B, 38, 157-165) and combine it with Hsu and Berger's DR method (1999, Journal of the American Statistical Association, 94, 468-482) to construct lower confidence bounds for the difference between the mean response of each nonzero dose level and that of the control under the monotonicity assumption, in order to identify the MED. The proposed method is illustrated by numerical examples, and simulation studies on power comparisons are presented.
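The general flavor of identifying the MED by stepping down with one-sided lower confidence bounds can be sketched as follows (a simplified rule with illustrative data, not the exact Marcus-Peritz / Hsu-Berger construction):

```python
import numpy as np
from scipy import stats

def find_med(doses, control, delta, alpha=0.05):
    """Step down from the highest dose, declaring a dose effective while
    the one-sided lower confidence bound on (dose mean - control mean)
    exceeds the clinically significant margin delta; the MED is the
    lowest dose in that run.  A simplified sketch, not the exact
    Marcus-Peritz / Hsu-Berger construction."""
    med = None
    for i in range(len(doses) - 1, -1, -1):   # highest dose first
        x, y = doses[i], control
        nx, ny = len(x), len(y)
        diff = np.mean(x) - np.mean(y)
        sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        lower = diff - stats.t.ppf(1 - alpha, nx + ny - 2) * np.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
        if lower > delta:
            med = i      # dose i still shows a clinically relevant effect
        else:
            break        # stop at the first dose that fails
    return med

control = np.array([0.0, 0.1, -0.1, 0.05, -0.05])
doses = [control + mu for mu in (0.1, 1.0, 2.0, 3.0)]  # true shifts
print(find_med(doses, control, delta=0.5))  # → 1 (index of the MED)
```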

Detecting differential expression in microarray data: Outperforming the Optimal Discovery Procedure
Alexander Ploner; Elena Perelman, Stefano Calza, Yudi Pawitan Karolinska Institutet 
The identification of differentially expressed genes among the tens of thousands of sequences measured by modern microarrays presents an obvious and serious multiplicity problem. The central role of gene expression data in molecular biology has stimulated much research in addressing this issue over the last decade; an important result of that research is the Optimal Discovery Procedure (ODP) proposed by John Storey, which generalizes the likelihood ratio test statistic of the Neyman-Pearson lemma for multiple parallel hypotheses, and which can be shown to be optimal in the sense that for any fixed number of false positive results, ODP will identify the maximum number of true positives [1].
However, the optimality result derived in [1] assumes exact knowledge of a large number of nuisance parameters that have to be estimated for any realistic application. In our talk, we will demonstrate that the practical implementation of ODP described in [2] is less powerful than a variant of the local false discovery rate we have proposed recently, which uses the distribution of the same nuisance parameters to weight conventional t-statistics [3]. We also show how a combination of the ODP test statistic with our weighting scheme can even further improve the power to detect differentially expressed genes at controlled levels of false discovery.
[1] Storey JD: The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing. UW Biostatistics Working Paper Series 2005, Working Paper 259.
[2] Storey JD, Dai JY, Leek JT: The Optimal Discovery Procedure for LargeScale Significance Testing, with Applications to Comparative Microarray Experiments. UW Biostatistics Working Paper Series 2005, Working Paper 260.
[3] Ploner A, Calza S, Gusnanto A, Pawitan Y: Multidimensional local false discovery rate for microarray studies. Bioinformatics 2006, 22(5):556–565.

Repeated significance tests controlling the False Discovery Rate
Martin Posch; Sonja Zehetmayer, Peter Bauer Medical University of Vienna 
When testing a single hypothesis repeatedly at several interim analyses, adjusted significance levels have to be applied at each interim look to control the overall type I error rate. There is a rich literature on such group sequential trials investigating the choice and computation of adjusted critical values. Surprisingly, if a large number of hypotheses are tested controlling the False Discovery Rate (a frequently used error criterion for large-scale multiple testing problems), we can show that under quite general conditions no adjustment of the critical value for multiple interim looks is necessary. This holds asymptotically (for a large number of hypotheses) under all scenarios but the global null hypothesis where all null hypotheses are true. Similar results are given for a procedure controlling the per-comparison error rate.

Involving biological information for weighing statistical error under multiple testing
Anat Reiner-Benaim Stanford University 
Given a multiple testing problem, each hypothesis may be associated with some prior information, which is related to the structure of the data and its scientific basis. This information may be unique to each hypothesis; therefore, when estimating the overall statistical error, treating the hypotheses as having the same null distributions may lead to biased results. Using the prior information to weight the null hypotheses can improve the error estimate and may offer a less conservative controlling procedure.
The emphasis of the talk will be on the use of biological data as prior information. For instance, the machinery of genetic regulation is subject to probabilistic factors. Regulation happens when a transcription factor binds to a site on the gene. Since the match level between the two is not perfect and can vary within a wide range, it can be incorporated into the error estimation as hypothesis weights.
The effect of the weights on the error estimate will be presented, given the method of computing the weights, the pattern of the weight structure and the type of error controlled. Two approaches to controlling the False Discovery Rate (FDR) with weights are compared: empirical Bayes per-hypothesis FDR estimation, and weighting the p-values to control the overall FDR.

Two new adaptive multiple testing procedures.
Etienne Roquain; Gilles Blanchard, MIG INRA Jouy-en-Josas 
The proportion $\pi_0$ of true null hypotheses is a quantity that often appears explicitly in FDR control bounds. Recent research effort has focused on finding ways to estimate this quantity and incorporate it in a meaningful way in a multiple testing procedure, leading to so-called "adaptive" procedures.
We present here two new adaptive step-up multiple testing procedures:
- The first procedure is a one-stage step-up procedure. We prove that it has a correct (and strong) FDR control provided the test statistics are independent. If the set of rejections is not too large (typically less than 50%), this procedure is less conservative than the so-called "two-stage linear step-up procedure" of Benjamini, Krieger and Yekutieli (2006). Moreover, preliminary simulations show that this new procedure seems to retain correct FDR control when the test statistics are positively correlated.
- The second procedure is a two-stage step-up procedure. We prove that it has a correct (and strong) FDR control in the "distribution-free" context. Because the techniques used in the distribution-free context are inevitably less precise, this new adaptive procedure is more conservative than those built under independence. However, it will be relevant if we expect a "large" proportion of rejected hypotheses (typically more than 50%).
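The two-stage benchmark named above, the Benjamini-Krieger-Yekutieli (2006) two-stage linear step-up procedure, can be sketched as follows (the simulated p-values and the level are illustrative):

```python
import numpy as np

def bh(pvals, q):
    """Benjamini-Hochberg linear step-up at level q; returns the number
    of rejections."""
    p = np.sort(np.asarray(pvals))
    m = p.size
    ok = np.nonzero(p <= q * np.arange(1, m + 1) / m)[0]
    return 0 if ok.size == 0 else int(ok[-1] + 1)

def bky_two_stage(pvals, q=0.05):
    """Two-stage linear step-up of Benjamini, Krieger and Yekutieli (2006):
    stage 1 estimates the number of true nulls from a first BH pass at the
    deflated level q/(1+q); stage 2 reruns BH at the adaptive level."""
    m = len(pvals)
    q1 = q / (1 + q)
    r1 = bh(pvals, q1)                  # stage 1
    if r1 == 0 or r1 == m:
        return r1
    m0_hat = m - r1                     # estimated number of true nulls
    return bh(pvals, q1 * m / m0_hat)   # stage 2, adaptive level

rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(size=800),           # true nulls
                        rng.beta(0.5, 15.0, size=200)])  # signals near zero
print(bh(pvals, 0.05), bky_two_stage(pvals, 0.05))
```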

Procedures Controlling Generalized False Discovery Rate
Sanat Sarkar; Wenge Guo Temple University 
Procedures controlling error rates that measure at least k false rejections, instead of at least one, can potentially increase the ability of a procedure to detect false null hypotheses in situations where one is willing to tolerate a few false rejections and seeks to control the occurrence of k or more of them. The k-FWER, which is the probability of at least k false rejections and generalizes the usual familywise error rate (FWER), is such an error rate; it was recently introduced in the literature, and procedures controlling it have been proposed. An alternative and less conservative notion of error rate, the k-FDR, which is the expected proportion of k or more false rejections among all rejections and generalizes the usual notion of false discovery rate (FDR), will be introduced in this talk. Procedures with k-FDR control dominating the Benjamini-Hochberg step-up FDR procedure and its step-down analog under independence or positive dependence, and the Benjamini-Yekutieli step-up FDR procedure under any form of dependence, will be presented.

FLEXIBLE TWO-STAGE TESTING IN GENOME-WIDE ASSOCIATION STUDIES
André Scherag; Helmut Schäfer, Hans-Helge Müller, Institute of Medical Biometry and Epidemiology, Philipps-University of Marburg 
Genome-wide association studies have been suggested as a means to unravel the genetic etiology of complex human diseases [1]. Typically, these studies employ a multistage plan to increase cost-efficiency. A large panel of markers is examined in a subsample of subjects, and the most promising markers are then also genotyped in the remaining subjects.
Until now, all proposed designs require adherence to formal statistical rules, which may not always meet the practical necessities of ongoing genetic research. In practice, investigators may, for example, wish to base the genetic marker selection on criteria other than formal statistical thresholds.
In this talk we describe an algorithm that enables various design modifications at any time during the course of the study. Using the Conditional Rejection Probability approach [2], the familywise type I error rate is strongly controlled. The algorithm can deal with an extremely large number of hypothesis tests while requiring very limited computational resources. The algorithm is evaluated by simulations. Furthermore, we present a real data application.
References
[1] Freimer NB, Sabatti C. Human genetics: variants in common diseases. Nature. 2007 Feb 22;445(7130):828-830.
[2] Müller HH, Schäfer H. A general statistical principle for changing a design any time during the course of a trial. Stat Med. 2004 Aug 30;23(16):2497-2508.

A test procedure for random degeneration of paired rank lists
Michael G. Schimek; Peter Hall, Eva Budinska Medizinische Universität Graz 
Let us assume two assessors (e.g. laboratories), at least one of which ranks N distinct objects according to the extent to which a particular attribute is present. The ranking is from 1 to N, without ties. In particular, we are interested in the following situations: (i) The second assessor assigns each object to one or the other of two categories (a 0-1 decision assuming a certain proportion of ones). (ii) The second assessor also ranks the objects from 1 to N. An indicator variable takes I_j = 1 if the ranking given by the second assessor to the object ranked j by the first is not more distant than m, say, from j, and zero otherwise. For both situations, our goal is to determine how far into the two rankings one can go before the differences between them degenerate into noise. This allows us to identify a sequence of objects that is characterized by a high degree of assignment conformity.
For the estimation of the point of degeneration into noise, we assume independent Bernoulli random variables. Under the condition of a general decrease of p_j for increasing j, a formal inference model is developed based on moderate deviation arguments implicit in the work of Donoho et al. (1995, JRSS, Ser. B, 57, 301-369). This idealized model is translated into an algorithm that allows one to adjust for irregular rankings (i.e. to handle quite different rankings of some objects) typically occurring in real data. A regularization parameter needs to be specified to account for the closeness of the assessors' rankings and the degree of randomness in the assignments. Our approach can be generalized to the case of more than two assessors.
The class of problems we try to solve has various bioinformatics applications, for instance in the meta-analysis of gene expression studies and in the identification of microRNA targets in protein-coding genes.
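The indicator sequence I_j of situation (ii) can be computed directly; a minimal sketch (the example rankings are illustrative):

```python
import numpy as np

def conformity_indicators(rank1, rank2, m):
    """Situation (ii): for the object that assessor 1 ranked j, set
    I_j = 1 when assessor 2's rank of that object differs from j by
    at most m, and 0 otherwise."""
    # order1[j] = the object holding rank j+1 for assessor 1
    order1 = np.argsort(rank1)
    return np.array([1 if abs(int(rank2[obj]) - (j + 1)) <= m else 0
                     for j, obj in enumerate(order1)])

# Two assessors ranking 8 objects from 1 to 8, without ties.
rank1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
rank2 = np.array([2, 1, 3, 5, 8, 4, 6, 7])
print(conformity_indicators(rank1, rank2, m=1).tolist())  # → [1, 1, 1, 1, 0, 0, 1, 1]
```

The estimation problem of the abstract is then to locate the index beyond which these indicators behave like noise.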

Comparing multiple tests for separating populations
Juliet Shaffer University of California at Berkeley 
Most studies comparing multiple test procedures for finding differences among populations concentrate on the numbers of true and false differences that are declared significant, the former as a measure of power, the latter (or a combination of both, in various forms) as a measure of error. For researchers, the configuration of results, e.g. the extent to which they divide populations into nonoverlapping classes, may be as important as or more important than the actual numbers. Results that lead to separations of populations into groups are, when accurate, especially useful. The talk will discuss some new measures of such separability and compare different multiple testing methods on these measures.

An Exact Test for Umbrella Ordered Alternatives of Location Parameters: the Exponential Distribution Case
Parminder Singh Guru Nanak Dev University, Amritsar 
A new procedure is proposed for testing the null hypothesis $H_0: \theta_1 = \cdots = \theta_k$ against the umbrella ordered alternative $H_1: \theta_1 \le \cdots \le \theta_h \ge \cdots \ge \theta_k$ with at least one strict inequality, where $\theta_i$ is the location parameter of the $i$-th two-parameter exponential distribution, $i = 1, \ldots, k$. Exact critical constants are computed using a recursive integration algorithm. Tables containing these critical constants are provided to facilitate the implementation of the proposed test procedure. Simultaneous confidence intervals for certain contrasts of the location parameters are derived by inverting the proposed test statistic. In comparison to existing tests, it is shown by a simulation study that the new test statistic is more powerful in detecting umbrella-type alternatives when the samples are drawn from exponential distributions. As an extension, the use of the critical constants for comparing Pareto distribution parameters is discussed.

Multiple hypothesis testing to establish whether treatment is “better” than control
Aldo Solari; Luigi Salmaso, Fortunato Pesarin, Department of Chemical Process Engineering, University of Padova, Italy 
Experiments are often carried out to establish whether treatment is “better” than control with respect to a multivariate response variable, sometimes referred to as multiple endpoints. However, in order to develop suitable tests, we have to specify the notion of “better”. To formulate the problem, let X and Y denote the k-variate responses associated with control and treatment, respectively. We may be interested in testing H0: “X and Y are equal in distribution” against H1: “X is stochastically smaller than Y and not H0”, where the definition of 'stochastically smaller' is given in [1]. If a test rejects H0, it does not necessarily follow that there is evidence to support H1, unless the latter is the complement of the null hypothesis [2]. Hence we must suppose that “X is stochastically smaller than Y” is known a priori, i.e. either H0 or H1 is true. Under this assumption, we prove that testing H0 against H1 is equivalent to the union-intersection (UI, [3]) testing formulation based on marginal distributions. However, this is not the only possible formulation for the treatment to be preferred to the control. It may be appropriate to show that the former is not inferior, i.e. not much worse, on any of the endpoints and is superior on at least one endpoint, resulting in a combination of intersection-union (IU, [4]) and UI testing problems [5].
For both formulations of “better”, we propose a multiple testing procedure based on combining dependent permutation tests [6], and an application is presented.
[1] Marshall, A. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York.
[2] Silvapulle, M.J. and Sen, P.K. (2005). Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Wiley, New Jersey.
[3] Roy, S.N. (1953). On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics, 24:220-238.
[4] Berger, R.L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24:295-300.
[5] Röhmel, J., Gerlinger, C., Benda, N. and Läuter, J. (2006). On Testing Simultaneously Non-Inferiority in Two Multiple Primary Endpoints and Superiority in at Least One of Them. Biometrical Journal, 48:916-933.
[6] Pesarin, F. (2001). Multivariate Permutation Tests with Applications in Biostatistics. Wiley, Chichester.

Flexible group-sequential designs for clinical trials with treatment selection
Nigel Stallard; Tim Friede Warwick Medical School, University of Warwick, UK 
Most statistical methodology for phase III clinical trials focuses on the comparison of a single experimental treatment with a control treatment. Recently, however, there has been increasing interest in methods for trials that combine the definitive analysis associated with phase III clinical trials with the treatment selection element of a phase II clinical trial.
A group-sequential design for clinical trials that involve treatment selection was proposed by Stallard and Todd (Statistics in Medicine, 22, 689–703, 2003). In this design, the best of a number of experimental treatments is selected on the basis of data observed at the first of a series of interim analyses. This experimental treatment then continues together with the control treatment to be assessed in one or more further analyses. The method was extended by Kelly, Stallard and Todd (Journal of Biopharmaceutical Statistics, 15, 641–658, 2005) to allow more than one experimental treatment to continue beyond the first interim analysis. This design controls the type I error rate under the global null hypothesis, but may not control error rates under individual null hypotheses if the treatments selected are not the best performing.
In some cases, for example when additional safety data are available, the restriction that the best performing treatments continue may be unreasonable. This talk will describe an extension of the approach of Stallard and Todd that controls the type I error rates under individual null hypotheses whilst allowing the experimental treatments that continue at each stage to be chosen in any way.

Compatible simultaneous lower confidence bounds for the Holm procedure and other closed Bonferroni based tests
Klaus Strassburger; Frank Bretz German Diabetes Center, Leibniz-Institute at the Heinrich-Heine-University Düsse 
In this contribution we present simultaneous confidence intervals that are compatible with a certain class of one-sided closed test procedures using weighted Bonferroni tests for each intersection hypothesis. The class of multiple test procedures covered in this talk includes gatekeeping procedures based on Bonferroni adjustments, fixed sequence procedures, the simple weighted or unweighted Bonferroni procedure by Holm, and the fallback procedure. These procedures belong to a class of shortcut procedures, which are easy to implement. It will be shown that the corresponding confidence bounds have a straightforward representation. For the step-down procedure of Holm we illustrate the construction of compatible confidence bounds with a numerical example. The resulting bounds will be compared with those of the classical single-step procedure. Assets and drawbacks will be discussed.

Multiple treatment comparison based on a nonlinear binary dynamic model
Brajendra Sutradhar; Vandna Jowaheer Memorial University of Newfoundland, Canada 
When an individual patient receives one of multiple treatments and provides repeated binary responses over a short period of time, efficient comparison of the treatment effects requires taking the longitudinal correlations of the binary responses into account. In this talk, we use a nonlinear binary dynamic model that allows the full range of correlations, and estimate the regression effects, including the treatment effects, by the GQL (generalized quasi-likelihood) approach, which provides consistent as well as efficient estimates. We then demonstrate how to test the treatment effects based on the asymptotic distributions of their estimators.

A Weighted Hochberg Procedure
Ajit Tamhane; Lingyun Liu Northwestern University 
It is often of interest to differentially weight the hypotheses in terms of their importance. Let $H_1,\ldots,H_n$ be $n \geq 2$
null hypotheses with prespecified positive weights $w_1,\ldots,w_n$ which add up to 1, and with p-values $p_1,\ldots,p_n$, respectively. It is desired to test them, taking
into account their weights, while controlling the type I familywise error rate (FWER) at a designated level $\alpha$. The well-known weighted Bonferroni (WBF) test rejects any $H_i$ with
$p_i \leq w_i\alpha$. Weighted Holm (WHM) and weighted Simes (WSM) procedures for this problem were proposed by Holm (1979), Hochberg and Liberman (1994) and Benjamini and Hochberg (1997); however, a weighted Hochberg (WHC) procedure is lacking. Benjamini and Hochberg proposed the following step-down WHM procedure: Let $p_{(1)} \leq \cdots \leq p_{(n)}$ be the ordered p-values, and let $H_{(1)}, \ldots, H_{(n)}$ and $w_{(1)}, \ldots, w_{(n)}$ be the corresponding hypotheses and weights, respectively. Then reject $H_{(i)}$ iff $p_{(j)} \leq [w_{(j)}/\sum_{k=j}^n w_{(k)}]\alpha$ for $j=1, \ldots, i$; otherwise accept all remaining hypotheses. They also proposed the following WSM test: Reject $H_0= \bigcap_{i=1}^n H_i$ iff
\[ p_{(i)} \leq \frac{\sum_{k=1}^i w_{(k)}}{\sum_{k=1}^n w_{(k)}}\alpha \] for some $i=1, \ldots, n$. We consider the following WHC procedure that uses the same critical constants as WHM given above, but operates in the step-up manner: Accept $H_{(i)}$ iff $p_{(j)} > [w_{(j)}/\sum_{k=j}^n w_{(k)}]\alpha$ for $j=n, \ldots, i$; otherwise reject all remaining hypotheses. We show that this procedure is not closed in general in the sense of Marcus, Peritz and Gabriel (1976) under the WSM test for subset intersection hypotheses, except when the weights are equal. In the course of this demonstration we fill the gap in the incomplete closure proof given by Hochberg (1988) for the equal weights case. Also, a direct proof based on finding a lower bound on the probability of accepting all true hypotheses (see, e.g., Liu 1996) fails for unequal weights. However, simulation studies indicate that WHC does control the FWER in the limited number of cases that we have studied. We propose a conservative version of WHC using the critical matrix approach of Liu (1996) and compare its conservatism with WHC in the simulation study.
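As a concrete illustration, the step-down WHM and step-up WHC procedures defined above, with critical constants $[w_{(j)}/\sum_{k=j}^n w_{(k)}]\alpha$, can be sketched as follows (an illustrative Python sketch, not the authors' code; function names are ours):

```python
def _critical_values(pvals, weights, alpha):
    """Sort the p-values and compute the weighted Holm critical values
    c_(j) = [w_(j) / sum_{k=j..n} w_(k)] * alpha."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    ws = [weights[i] for i in order]
    tails = [sum(ws[j:]) for j in range(len(ws))]  # sum_{k=j..n} w_(k)
    crit = [w / t * alpha for w, t in zip(ws, tails)]
    return order, [pvals[i] for i in order], crit

def weighted_holm(pvals, weights, alpha=0.05):
    """Step-down WHM: reject H_(i) iff p_(j) <= c_(j) for all j <= i;
    stop at the first failure."""
    order, ps, crit = _critical_values(pvals, weights, alpha)
    k = 0
    while k < len(ps) and ps[k] <= crit[k]:
        k += 1
    return sorted(order[:k])  # indices of rejected hypotheses

def weighted_hochberg(pvals, weights, alpha=0.05):
    """Step-up WHC: same critical values, but reject H_(1..k) where k is
    the LARGEST j with p_(j) <= c_(j)."""
    order, ps, crit = _critical_values(pvals, weights, alpha)
    k = max((j + 1 for j in range(len(ps)) if ps[j] <= crit[j]), default=0)
    return sorted(order[:k])
```

With equal weights these reduce to the ordinary Holm and Hochberg procedures; the step-up version rejects at least as much as the step-down version, which is what makes its FWER control non-trivial.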

Unbiased estimation after modification of a group sequential design
Nina Timmesfeld; Helmut Schäfer, Hans-Helge Müller Institute of Medical Biometry and Epidemiology, Philipps-University Marburg 
It is well known that classical group-sequential designs perform well in terms of expected sample size for various effect sizes, while the type I and type II error rates are controlled. For ethical and economical reasons such a design is chosen in many clinical trials. Even when the planning of the study was done carefully, it may happen that a design change becomes reasonable. The design can be changed with control of the type I error rate by the method of Müller and Schäfer (2004) at any time during the course of the trial.
At the end of a study, additional inference is required, such as confidence bounds and estimates for the effect size. In the case of group sequential designs, an unbiased estimator can be obtained by the method of Liu and Hall (1999).
In this talk, we will present a method to modify this estimator so that unbiasedness is preserved after design modifications, in particular after modification of the sample size.
Müller HH, Schäfer H. A general statistical principle for changing a design any time during the course of a trial. Statistics in Medicine 2004; 23:2497–2508.
Liu A, Hall W. Unbiased estimation following a group sequential test. Biometrika 1999; 86:71–78.

Sample size calculation for microarray data analysis using normal mixture model
Masaru Ushijima Japanese Foundation for Cancer Research 
Sample size calculation is an important procedure when designing a microarray study, especially for medical research. This paper concerns sample size calculation in the identification of differentially expressed genes between two patient groups. We use a mixture model, involving differentially expressed and nondifferentially expressed genes.
To calculate the sample size, the parameters to be given are as follows: (1) the number of differentially expressed genes, (2) the distribution of the true differences, (3) the Type I error rate (e.g. FDR, FWER), (4) the statistical power (e.g. sensitivity). We propose a sample size calculation method using the FDR, the familywise power proposed by Tsai et al. (Bioinformatics, 2005, 21:1502–8), and a normal mixture model. Sample sizes for the two-sample t-test are computed for several settings and simulation studies are performed.

A new method to identify significant endpoints in a closed test setting
Carlos Vallarino; Joe Romano, Michael Wolf, Dick Bittman Takeda Pharmaceuticals NA 
We present a new multiple testing procedure that has a maximin property under the normality assumption. The new method alters the rejection region of the simple sum test to make it consonant, i.e. to guarantee that rejection of the intersection hypothesis, in a closed test setting, implies the significance of at least one endpoint. Consonance is a desirable property which increases the ability to reject false individual null hypotheses. Designed to perform well when testing related endpoints, the new procedure is applied to PROactive, a cardiovascular (CV) outcome trial of patients with type 2 diabetes and a history of CV disease. Had the PROactive trial considered its two main endpoints as co-primary, the new method shows how efficacy for one key endpoint could have been established.

Controversy? What controversy? An attempt to structure the debate on adaptive designs
Marc Vandemeulebroecke Novartis Pharma AG 
From their beginnings, concepts for consecutive analyses of accumulating data have evoked lively debate. Classical sequential analysis has been provocatively criticized, and group sequential approaches have been controversially discussed. More recently, the merits and pitfalls of adaptive designs have been passionately debated.
Starting from striking examples, we will in this talk try to dissect the debate. We identify what we consider the main discussion points, sketch their scope, and ponder their relative importance. We propose to standardize the terminology and render it more precise. We hope that this can contribute to the creation of a frame of reference for the current controversy on adaptive designs.

FDR control for discrete test statistics
Anja Victor; Scheuer C, Cologne J, Hommel G Institute of medical biometry, epidemiology and informatics, University Mainz, G 
In genetic association studies considering, e.g., Single Nucleotide Polymorphisms (SNPs), one deals with categorical data, and dependencies between SNPs may occur (because of linkage disequilibrium, LD). Additionally, genetic association studies exhibit many different study situations, ranging from genome-wide scans to the examination of just a few selected candidate loci. The proportion of true null hypotheses will vary greatly between these situations, which influences FDR control.
We will focus on multiple testing procedures that take the categorical structure of the SNP data into account. The most popular FWER-controlling procedure for discrete data is Tarone’s procedure (Tarone 1990). However, Tarone’s procedure is not monotone in the α-level; therefore, Hommel & Krummenauer published an improvement (Hommel & Krummenauer 1998). Recently, Gilbert (Gilbert 2005) transferred Tarone’s procedure to FDR control via the explorative Simes procedure (Simes 1986, Benjamini & Hochberg 1995). However, in Gilbert’s procedure the boundary for the p-values finally attained by the Simes procedure can be higher than the boundary used for the previous selection of hypotheses for the “Tarone subset”, so that no rejection may occur for small p-values outside the “Tarone subset” but for larger ones inside it.
We discuss ideas on how the Hommel & Krummenauer procedure can be extended to FDR control and how Gilbert’s procedure can be improved. Additionally, we examine the advantages of using test procedures adapted to discrete test statistics in genetic association studies. To this end, we compare Gilbert’s FDR-controlling procedure with the Hommel & Krummenauer procedure, and additionally with classical FWER-controlling procedures and the classical FDR-controlling procedure. Results suggest that an increase in power from exploiting the discrete nature can only be achieved when the number of subjects is small. The superiority of FDR control is more prominent if a larger proportion of null hypotheses is false.
References
Benjamini Y., and Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289–300.
Gilbert PB. (2005) A modified false discovery rate multiplecomparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Applied Statistics 44, 143–158.
Hommel, G. and Krummenauer, F. (1998) Improvements and modifications of Tarone’s multiple test procedure for discrete data. Biometrics 54,673–681.
Simes, RJ. (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751–754.
Tarone, RE. (1990) A modified Bonferroni method for discrete data. Biometrics 46: 515–522
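For reference, Tarone's (1990) modified Bonferroni procedure discussed above can be sketched as follows (an illustrative implementation, not code from the talk; it assumes the minimum attainable p-value of each discrete test is supplied by the caller):

```python
def tarone_bonferroni(pvals, min_attainable, alpha=0.05):
    """Tarone (1990): with discrete test statistics, each test i has a
    minimum attainable p-value p*_i. Find the smallest K >= 1 such that
    at most K hypotheses satisfy p*_i <= alpha / K; only those
    hypotheses can possibly be rejected, and each is tested at level
    alpha / K instead of the plain Bonferroni level alpha / n."""
    n = len(pvals)
    for K in range(1, n + 1):
        testable = [i for i in range(n) if min_attainable[i] <= alpha / K]
        if len(testable) <= K:
            break  # smallest admissible K found
    return [i for i in testable if pvals[i] <= alpha / K]
```

The gain over plain Bonferroni comes from discarding hypotheses whose discrete null distribution can never produce a small enough p-value; the non-monotonicity in the α-level criticized by Hommel & Krummenauer (1998) arises because the set of testable hypotheses changes with α.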

An Application of the Closed Testing Principle to Enhance One-Sided Confidence Regions for a Multivariate Location Parameter
Michael Vock University of Bern, Institute of Mathematical Statistics 
If a one-sided test for a multivariate location parameter is inverted, the resulting confidence region may have an unpleasant shape. In particular, if the null and alternative hypotheses are both composite and complementary, the confidence region usually does not resemble the alternative parameter region in shape, but rather a reflected version of the null parameter region.
We illustrate this effect and show one possibility of obtaining confidence regions for the location parameter that are smaller and have a more suitable shape for the type of problems investigated. This method is based on the closed testing principle applied to a family of nested hypotheses.

Proportion of true null hypotheses in non-high-dimensional multiple testing problems: procedures and comparison
Mario Walther; Claudia Hemmelmann; Rüdiger Vollandt Institute of medical statistics, computer science and documentation, Friedrisch 
When testing multiple hypotheses simultaneously, a quantity of interest is the proportion of true null hypotheses. Knowledge about this proportion can improve the power of different multiple test procedures which control the generalized familywise error rate, the false discovery rate or the false discovery proportion. For instance, in stepwise procedures the critical values with which the p-values are compared can be increased if an upper bound on the proportion of true null hypotheses is known.
Many authors have been concerned with establishing methods for estimating the proportion of true null hypotheses. Most of the proposed procedures are based on several thousand p-values, which are often assumed to be independent. These procedures work very well; however, problems arise when the dimension of the multiple testing problem is only in the few hundreds and the data are correlated, as is the case, for example, for EEG, proteomic or fMRI data.
Within this framework we pose the question of what constitutes a "good" estimate of the proportion of true null hypotheses. We therefore introduce several criteria to evaluate the efficiency of the estimators. One criterion will be the probability that a given estimation method overestimates the proportion of true null hypotheses. Another will be whether the confidence interval for the proportion of true null hypotheses is contained in a range of prespecified accuracy.
In this talk, we will explain methods for estimating the proportion of true null hypotheses which are also suitable for non-high-dimensional multiple testing problems with correlated p-values. Furthermore, we will evaluate and compare the quality of the estimators with respect to the introduced criteria in a simulation study.
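The abstract does not specify which estimators are considered; as one standard point of reference, a tail-based estimator in the style of Storey (2002) can be sketched as follows (purely illustrative, and known to behave poorly in exactly the small, correlated settings discussed above):

```python
def tail_pi0_estimate(pvals, lam=0.5):
    """Tail-based estimate of the proportion pi0 of true null
    hypotheses (in the style of Storey, 2002): null p-values are
    uniform on [0, 1], so roughly pi0 * (1 - lam) * m of the m
    p-values should exceed the tuning threshold lam."""
    m = len(pvals)
    return min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * m))
```

An upper confidence bound on this quantity is what stepwise procedures can plug in to enlarge their critical values, which is the use case described in the first paragraph of the abstract.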

Sample size reestimation and hypothesis tests for trials with multiple treatment arms
Jixian Wang; Franz Koenig Novartis Pharma AG 
Sample size reestimation (SSRE) provides a useful tool to change a design during the conduct of a study when an interim look reveals that the original sample size is inadequate. For trials comparing an active treatment with a control, a common way to control the type I error is to construct an asymptotically normally distributed weighted test statistic combining the information before and after the interim look.
We consider sample size reestimation methods for comparing multiple active treatments with a control, where we allow the change of sample size for one arm to depend on the interim information across all arms. We propose several ways to construct weighted statistics combining the information before and after SSRE, as well as related test procedures to control the overall type I error. When the change of sample size is proportional across all treatment arms, it is possible to construct statistics so that the Dunnett test can be used as if there were no SSRE. For arbitrary SSREs, we propose other procedures, including a closed test based on weighted statistics with marginally standard normal distributions and a test using a multivariate generalization of weighted test statistics in combination with the closure principle. A practical example is used to illustrate the proposed approaches. The properties of the procedures are evaluated by simulations.
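As background, the simplest two-stage form of such a weighted combination statistic is the standard inverse-normal combination (a sketch under that assumption, not the authors' multivariate generalization):

```python
import math
from statistics import NormalDist

def weighted_inverse_normal(z1, z2, w1):
    """Combine the stage-wise z-statistics z1 (before) and z2 (after
    the interim look) with prespecified weights w1 and
    w2 = sqrt(1 - w1^2). Because the weights are fixed in advance, the
    combined statistic is standard normal under the null even if the
    stage-2 sample size was chosen using the interim data."""
    w2 = math.sqrt(1.0 - w1 ** 2)
    return w1 * z1 + w2 * z2

# One-sided p-value for the combined statistic (equal information weights):
z = weighted_inverse_normal(1.5, 1.2, math.sqrt(0.5))
p_combined = 1.0 - NormalDist().cdf(z)
```

The key design choice is that the weights depend only on the planned, not the reestimated, stage sizes; that is what preserves the null distribution under arbitrary data-driven SSRE.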

Resampling-Based Control of the False Discovery Rate under Dependence
Michael Wolf; Joseph Romano, Azeem Shaikh University of Zurich 
This paper considers the problem of testing s null hypotheses simultaneously while controlling the false discovery rate (FDR). The FDR is defined to be the expected value of the fraction of rejections that are false rejections (with the fraction understood to be 0 in the case of no rejections). Benjamini and Hochberg (1995) provide a method for controlling the FDR based on p-values for each of the null hypotheses under the assumption that the p-values are independent. Subsequent research has since shown that this procedure is valid under weaker assumptions on the joint distribution of the p-values. Related procedures that are valid under no assumptions on the joint distribution of the p-values have also been developed. None of these procedures, however, incorporates information about the dependence structure of the test statistics. This paper develops methods for control of the FDR under weak assumptions that incorporate such information and, by doing so, are better able to detect false null hypotheses. We illustrate this property via a simulation study and an empirical application to the evaluation of hedge funds.
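The Benjamini-Hochberg (1995) step-up procedure referred to above can be sketched as follows (an illustrative implementation):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg (1995) step-up procedure: reject the k
    hypotheses with the smallest p-values, where k is the largest
    rank i such that p_(i) <= (i / s) * q. Controls the FDR at level
    q for independent p-values."""
    s = len(pvals)
    order = sorted(range(s), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / s * q:
            k = rank
    return sorted(order[:k])  # indices of rejected hypotheses
```

The resampling-based methods of the paper replace the fixed thresholds (i/s)·q with data-dependent critical values estimated from the joint distribution of the test statistics.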

On Identification of Inferior Treatments Using the Newman-Keuls Type Procedure
Samuel Wu; Weizhen Wang; David Annis University of Florida 
We are concerned with selecting a subset of treatments such that the probability of including ALL best treatments exceeds a prespecified level. In this paper, we provide a stochastic ordering of the Studentized range statistics under a balanced one-way ANOVA model. Based on this result we show that, when restricted to multiple comparisons with the best, the Newman-Keuls type procedure strongly controls the experimentwise error rate for a sequence of null hypotheses regarding the number of largest treatment means.

Knowledge-based approach to handling multiple testing in functional genomics studies
Adam Zagdanski; Przemyslaw Biecek, Rafal Kustra University of Toronto, Canada and Wroclaw University of Technology, Poland 
We propose a novel method for the multiple testing problem inherent in functional genomics studies. One novelty of the method is that it directly incorporates prior knowledge about gene annotations to adjust the p-values. We describe a general methodology for performing knowledge-based multiple testing adjustment and focus on an application of this approach in Gene Set Functional Enrichment Analysis (GSFEA). We apply and evaluate our method using a database of known Protein-Protein Interactions to perform large-scale gene function prediction. In this study the Gene Ontology Biological Process (GO-BP) taxonomy is employed as the knowledge-base standard for describing gene functions. An extensive simulation study is carried out to investigate the behaviour of the proposed adjustment procedure under different scenarios. Empirical analysis, based on both real and simulated data, reveals that our approach yields an improvement in a number of performance criteria, including the empirical False Discovery Rate (FDR). We derive theoretical connections between our method and the stratified False Discovery Rate approach proposed by [1], and also describe similarities to the weighted p-value FDR control introduced recently by [2]. Finally, we show how our method can be adapted to other multiple hypothesis problems where some form of prior information about the relationships among the tests is available.
REFERENCES
[1] L. Sun, R.V. Craiu, A.D. Paterson, S.B. Bull (2006) “Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies”. Genet Epidemiol. 30(6):519–30.
[2] Ch.R. Genovese, K. Roeder and L. Wasserman (2006) “False discovery control with p-value weighting”. Biometrika 93(3):509–524.

Multistage designs controlling the False Discovery or the Family Wise Error Rate
Sonja Zehetmayer; Peter Bauer, Martin Posch Section of Medical Statistics, Medical University of Vienna, Austria 
When a large number of hypotheses are investigated, conventional single-stage designs may lack power due to low sample sizes for the individual hypotheses. We propose multistage designs in which, at each interim analysis, 'promising' hypotheses are screened and investigated in further stages. Given a fixed overall number of observations, this allows more observations to be spent on promising hypotheses than with single-stage designs, where the observations are distributed equally among all considered hypotheses. We propose multistage procedures controlling either the Family Wise Error Rate (FWE) or the False Discovery Rate (FDR) and derive optimal stopping boundaries and sample size allocations (across stages) to maximize the power of the procedure.
Optimized two-stage designs lead to a considerable increase in power compared to the classical single-stage design. We show that going from two to three stages leads to a further distinct increase in power. Adding a fourth stage leads to a further improvement, which is, however, less pronounced. Surprisingly, we found only small differences in power between optimized integrated designs, where the data of all stages are used in the final test statistics, and optimized pilot designs, where only the data from the final stage are used for testing. However, the integrated design controlling the FDR appeared to be more robust against misspecifications in the planning phase. Additionally, we found that with an increasing number of stages the drop in power when controlling the FWE instead of the more liberal FDR becomes negligible.
Our investigations show that the crucial point is not the choice of the error rate or the type of design (integrated or pilot), but the sequential nature of the trial, in which non-promising hypotheses are dropped in early phases of the experiment so that test decisions among the selected hypotheses can be based on considerably larger sample sizes than in the classical single-stage design.

Adaptive seamless designs for subpopulation selection based on time to event endpoints
Emmanuel Zuber; Werner Brannath, Michael Branson, Frank Bretz, Paul Gallo, Martin Posch, Amy Rac Novartis Pharma AG, Basel, Switzerland 
A targeted therapy might primarily benefit a subpopulation of patients. Thus, the ability to select a sensitive patient population may be crucial for the development of such a therapy. Traditionally, one would need to start with a hypothesis generating phase II study to identify a subpopulation. The specific sensitivity of that subpopulation would have to be confirmed independently in a second phase II study, before a phase III study could be run in the selected target population. A formal claim of efficacy would be based on the phase III data only.
A more efficient approach is presented using an adaptive phase II/III seamless design to combine the selection of either the full population or the subpopulation with the proof of efficacy in a single two-stage study.
From a separate concomitant exploratory study, a subpopulation is to be identified independently before the end of stage 1 of the combined phase II/III study. At the end of stage 1, Bayesian tools are used to confirm the hypothesis of a more sensitive subpopulation. One may then decide at this step to adapt the conduct of the trial by limiting further recruitment into stage 2 to that subpopulation and by choosing the hypothesis testing strategy. Thus, the independent confirmation of the subpopulation is more reliable, being made on the same clinical endpoint and in the same setting as the final phase III demonstration of efficacy. The latter is efficiently based on the combined data from stages 1 and 2, in the selected population, with an adapted testing strategy.
The use of the adaptive design methodology with a time to event endpoint relies on the asymptotic independent-increments property of the log-rank test statistics. The overall type I error rate is controlled thanks to the concomitant use of adaptive design methodology and of the closed testing principle for testing in the different populations. The use of Bayesian decision tools, such as predictive powers and a posterior distribution of the treatment effect, does not affect the overall type I error rate. It allows one to account, in a statistical manner, for the uncertainty of interim data and for external information in the adaptation decision making.
Simulations are necessary for the design of such a complex study, to determine the sample size and to assess its operating characteristics as a function of the Bayesian decision rules and of the unknown prevalence of the subpopulation. Properties of treatment effect estimates and the preservation of trial integrity after adaptation are also studied by simulation and compared to more conventional group sequential designs.

