Anat Reiner (Tel Aviv University, Israel)

Using the False Discovery Rate Criteria for Simultaneous Hypothesis Testing in Epidemiological Research

Several well-established statistical methods are widely used to analyze epidemiological data. The choice of method depends mostly on the data structure and the data collection method, and often ignores the possible inadequacy of the procedure under certain circumstances. An analysis involving multiple comparisons is an example of a situation in which careful consideration is needed before applying an analytical technique and interpreting its results. This is because of the inflated type I error that arises from performing multiple statistical tests simultaneously, and the loss of power that may result from attempting to control that inflation. In search of relevant cases for implementing the FDR criterion for multiple comparisons, we attempted to identify the statistical procedures typically applied to the problems addressed by epidemiological researchers, through a survey of randomly sampled articles from the 1993 to 1995 volumes of the American Journal of Epidemiology and the American Journal of Public Health. The survey showed that one of the most widely used analytical tools is the multiple logistic regression model, which aims to predict the probability of developing a certain medical condition and also produces estimates of the odds ratios for the subgroups of interest, and therefore involves multiple hypothesis testing. It was therefore concluded that focusing the discussion and analysis of the multiple testing problem on cases where logistic regression is applied would provide good coverage of the problem as it is faced in epidemiological research.
Ottenbacher (1998), who analyzes the size of the type I error in a sample of published epidemiological articles, emphasizes the need to apply procedures that deal with multiple comparisons, and suggests reducing the significance level by using a more conservative criterion that takes multiplicity into account. He mentions the Bonferroni method as an example, with the drawback that it results in a drastic loss of power. He mentions the Benjamini and Hochberg (1995) method as having a similar drawback, and suggests the alternative of using a less conservative criterion than the FWER (Family-wise Error Rate). In fact, Ottenbacher fails to recognize that the Benjamini and Hochberg method adopts exactly this idea: using a less conservative criterion that still provides sufficient information about the type I error. Moreover, the criterion it controls, the FDR (False Discovery Rate), which is the expected proportion of false rejections among all rejections, is structurally defined and theoretically supported. On these grounds it became worthwhile to study the performance of methods that control the FDR in different scenarios that represent the various data structures encountered in epidemiological research. For this purpose, simulated databases were created, each containing multiple explanatory variables and one dichotomous dependent variable. Each database was defined by a unique combination of characteristics, such as sample size, number of hypotheses, proportion of false hypotheses, extent of significance and type of dependency between the test statistics. Overall, 48 different data configurations were created. the odds-ratios from 1 were calculated. The hypotheses were tested using each of 10 different methods for setting a corrected significance level, given a desired type I error. Five of the methods controlled the FWER and five controlled the FDR. Data were repeatedly simulated and modeled 2500 times for each configuration.
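To make the contrast between FWER and FDR control concrete, the following is a minimal sketch of the Bonferroni correction and the Benjamini and Hochberg (1995) step-up procedure. The function names and the p-values are hypothetical and chosen only for illustration; they are not taken from the study, and the sketch covers only two of the ten methods compared.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the FWER)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure of Benjamini and Hochberg (1995).

    Find the largest k such that p_(k) <= (k / m) * alpha, then reject
    the hypotheses with the k smallest p-values (controls the FDR under
    independence or positive dependence).
    """
    m = len(pvalues)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, e.g. from tests of logistic regression coefficients.
pvals = [0.001, 0.008, 0.012, 0.030, 0.041, 0.20, 0.45, 0.74]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

With these illustrative p-values, Bonferroni rejects only the smallest one, while the step-up procedure rejects the three smallest, which is the kind of power difference the study sets out to quantify.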
Averages and standard deviations were calculated for the FDR and the test power. These measurements were used to investigate the performance of each method thoroughly. The methods that control the FDR were compared with the methods that control the FWER, and also with each other. The results show a consistent advantage in test power for the methods that control the FDR. The power gain from using the FDR criterion is greatest when a high proportion of the hypotheses are false, accompanied by a high total number of hypotheses and low significance in the case of independence or positive dependence, and by high significance in situations of relatively low power, as in the case of general dependency. In the case of independence or positive dependence, the original Benjamini and Hochberg method and their later adaptive method always achieve the best results in terms of absolute power and of sensitivity of power to conditions that yield low power by definition. In the case of general dependence, the Benjamini and Liu method always achieves the best results in terms of absolute power.
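The way average FDR and power can be estimated by simulation may be sketched as follows. This is a deliberately simplified illustration and not the study's design: it assumes independent one-sided z-tests instead of logistic regression models, and the number of hypotheses, proportion of false hypotheses, effect size, number of replications and seed are all hypothetical choices.

```python
import math
import random


def bonferroni_reject(pvalues, alpha):
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def bh_reject(pvalues, alpha):
    # Benjamini-Hochberg step-up: largest k with p_(k) <= (k / m) * alpha.
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def simulate(method, m=20, prop_false=0.5, effect=3.0, alpha=0.05,
             n_rep=500, seed=0):
    """Estimate the FDR and the average power of `method` by simulation.

    Assumption: independent test statistics, N(effect, 1) for false null
    hypotheses and N(0, 1) for true ones, tested one-sided.
    """
    rng = random.Random(seed)
    m1 = int(round(m * prop_false))          # number of false hypotheses
    is_false = [i < m1 for i in range(m)]
    fdp_sum = 0.0
    power_sum = 0.0
    for _ in range(n_rep):
        z = [rng.gauss(effect if f else 0.0, 1.0) for f in is_false]
        # One-sided p-value: P(Z >= z) for a standard normal Z.
        pvals = [0.5 * math.erfc(zi / math.sqrt(2.0)) for zi in z]
        rej = method(pvals, alpha)
        n_rej = sum(rej)
        true_rej = sum(1 for r, f in zip(rej, is_false) if r and f)
        false_rej = n_rej - true_rej
        # The false discovery proportion is taken as 0 when nothing is rejected.
        fdp_sum += false_rej / n_rej if n_rej else 0.0
        power_sum += true_rej / m1
    return fdp_sum / n_rep, power_sum / n_rep


# The same seed gives both methods identical simulated data.
fdr_bonf, power_bonf = simulate(bonferroni_reject)
fdr_bh, power_bh = simulate(bh_reject)
print("Bonferroni: FDR=%.3f power=%.3f" % (fdr_bonf, power_bonf))
print("BH:         FDR=%.3f power=%.3f" % (fdr_bh, power_bh))
```

Because the step-up procedure rejects every hypothesis that Bonferroni rejects, its estimated power on the same simulated data can never be lower; how large the gain is depends on the configuration, which is what the comparison across data characteristics examines.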