Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates
Author contributions: A.E. and T.E.N. designed research; A.E. and T.E.N. performed research; A.E., T.E.N., and H.K. analyzed data; and A.E., T.E.N., and H.K. wrote the paper.
Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of a number of fMRI studies and may have a large impact on the interpretation of weakly significant neuroimaging results.
The most widely used task functional magnetic resonance imaging (fMRI) analyses use parametric statistical methods that depend on a variety of assumptions. In this work, we use real resting-state data and a total of 3 million random task group analyses to compute empirical familywise error rates for the fMRI software packages SPM, FSL, and AFNI, as well as a nonparametric permutation method. For a nominal familywise error rate of 5%, the parametric statistical methods are shown to be conservative for voxelwise inference and invalid for clusterwise inference. Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape. By comparison, the nonparametric permutation test is found to produce nominal results for voxelwise as well as clusterwise inference. These findings speak to the need to validate the statistical methods used in the field of neuroimaging.
Since its beginning more than 20 years ago, functional magnetic resonance imaging (fMRI) (1, 2) has become a popular tool for understanding the human brain, with some 40,000 published papers according to PubMed. Despite the popularity of fMRI as a tool for studying brain function, the statistical methods used have rarely been validated using real data. Validations have instead mainly been performed using simulated data (3), but it is obviously very hard to simulate the complex spatiotemporal noise that arises from a living human subject in an MR scanner.
Through the introduction of international data-sharing initiatives in the neuroimaging field (4–10), it has become possible to evaluate the statistical methods using real data. Scarpazza et al. (11), for example, used freely available anatomical images from 396 healthy controls (4) to investigate the validity of parametric statistical methods for voxel-based morphometry (VBM) (12). Silver et al. (13) instead used image and genotype data from 181 subjects in the Alzheimer’s Disease Neuroimaging Initiative (8, 9) to evaluate statistical methods common in imaging genetics. Another example of the use of open data is our previous work (14), where a total of 1,484 resting-state fMRI datasets from the 1,000 Functional Connectomes Project (4) were used as null data for task-based, single-subject fMRI analyses with the SPM software. That work found a high degree of false positives, up to 70% compared with the expected 5%, likely due to a simplistic temporal autocorrelation model in SPM. It was, however, not clear whether these problems would propagate to group studies. Another unanswered question was the statistical validity of other fMRI software packages. We address these limitations in the current work with an evaluation of group inference in the three most common fMRI software packages [SPM (15, 16), FSL (17), and AFNI (18)]. Specifically, we evaluate the packages in their entirety, submitting the null data to the recommended suite of preprocessing steps integrated into each package.
The main idea of this study is the same as in our previous one (14). We analyze resting-state fMRI data with a putative task design, generating results that should control the familywise error (FWE), the chance of one or more false positives, and empirically measure the FWE as the proportion of analyses that give rise to any significant results. Here, we consider both two-sample and one-sample designs. Because two groups of subjects are randomly drawn from a large group of healthy controls, the null hypothesis of no group difference in brain activation should be true. Moreover, because the resting-state fMRI data should contain no consistent shifts in blood oxygen level-dependent (BOLD) activity, for a single group of subjects the null hypothesis of mean zero activation should also be true. We evaluate FWE control for both voxelwise inference, where significance is individually assessed at each voxel, and clusterwise inference (19–21), where significance is assessed on clusters formed with an arbitrary threshold.
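The logic of measuring an empirical FWE can be sketched in a few lines. The toy simulation below (plain Python, with independent Gaussian voxel statistics standing in for the real resting-state analyses; all function names and parameter values are ours, not part of the study's pipeline) repeats many null "group analyses," applies a corrected voxelwise threshold, and records the fraction of analyses containing at least one significant voxel:

```python
import random

random.seed(0)

def empirical_fwe(n_analyses, n_voxels, z_crit):
    """Fraction of null analyses with at least one supra-threshold voxel."""
    hits = 0
    for _ in range(n_analyses):
        # One "group analysis": independent N(0,1) voxel statistics under the null.
        if any(abs(random.gauss(0.0, 1.0)) > z_crit for _ in range(n_voxels)):
            hits += 1
    return hits / n_analyses

# Bonferroni-corrected two-sided threshold for 100 independent voxels at
# alpha = 0.05: per-voxel alpha = 0.0005, i.e., |z| > ~3.481.
fwe = empirical_fwe(n_analyses=1000, n_voxels=100, z_crit=3.481)
print(fwe)  # hovers near the nominal 0.05 when the assumptions hold
```

When an inference method's assumptions hold, the measured proportion should match the nominal 5%; the study applies exactly this check to the packages' voxelwise and clusterwise procedures, where the "analyses" are real group analyses on resting-state data rather than simulated statistics.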
In brief, we find that all three packages have conservative voxelwise inference and invalid clusterwise inference, for both one- and two-sample t tests. Alarmingly, the parametric methods can give a very high degree of false positives (up to 70%, compared with the nominal 5%) for clusterwise inference. By comparison, the nonparametric permutation test (22–25) is found to produce nominal results for both voxelwise and clusterwise inference for two-sample t tests, and nearly nominal results for one-sample t tests. We explore why the methods fail to appropriately control the false-positive risk.
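The relabeling idea behind the permutation test can be illustrated with a minimal sketch (our own toy code, not that of any package evaluated here): under the null hypothesis of no group difference, group labels are exchangeable, so shuffling them and recomputing the statistic yields an empirical null distribution against which the observed statistic is compared. In a whole-brain analysis, the maximum voxel statistic or largest cluster size across the image would be recorded on each permutation to control the FWE; here a single test suffices to show the principle.

```python
import random

random.seed(1)

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / (sp2 * (1.0 / nx + 1.0 / ny)) ** 0.5

def permutation_p(x, y, n_perm=2000):
    """Two-sided permutation p-value: shuffle group labels, recompute t."""
    observed = abs(two_sample_t(x, y))
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(two_sample_t(pooled[:len(x)], pooled[len(x):])) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Two null "groups" drawn from the same distribution: the p-value should
# only rarely fall below 0.05.
a = [random.gauss(0.0, 1.0) for _ in range(20)]
b = [random.gauss(0.0, 1.0) for _ in range(20)]
p = permutation_p(a, b)
print(p)
```

Because the null distribution is built from the data themselves, this approach makes no parametric assumption about the spatial autocorrelation function, which is one reason it remains valid where the Gaussian random-field methods fail.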
[Figure caption: One thousand group analyses were performed for each parameter combination.]
We thank Robert Cox, Stephen Smith, Mark Woolrich, Karl Friston, and Guillaume Flandin, who gave us valuable feedback on this work. This study would not be possible without the recent data-sharing initiatives in the neuroimaging field. We therefore thank the Neuroimaging Informatics Tools and Resources Clearinghouse and all of the researchers who have contributed with resting-state data to the 1,000 Functional Connectomes Project. Data were also provided by the Human Connectome Project, WU-Minn Consortium (principal investigators: David Van Essen and Kamil Ugurbil; Grant 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University. We also thank Russ Poldrack and his colleagues for starting the OpenfMRI Project (supported by National Science Foundation Grant OCI-1131441) and all of the researchers who have shared their task-based data. The Nvidia Corporation, which donated the Tesla K40 graphics card used to run all the permutation tests, is also acknowledged. This research was supported by the Neuroeconomic Research Initiative at Linköping University, by Swedish Research Council Grant 2013-5229 (“Statistical Analysis of fMRI Data”), the Information Technology for European Advancement 3 Project BENEFIT (better effectiveness and efficiency by measuring and modelling of interventional therapy), and the Wellcome Trust.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
See Commentary on page 7699.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1602413113/-/DCSupplemental.
- 1. Ogawa S, et al. Intrinsic signal changes accompanying sensory stimulation: Functional brain mapping with magnetic resonance imaging. Proc Natl Acad Sci USA. 1992;89(13):5951–5955.
- 2. Logothetis NK. What we can do and what we cannot do with fMRI. Nature. 2008;453(7197):869–878.
- 3. Welvaert M, Rosseel Y. A review of fMRI simulation studies. PLoS One. 2014;9(7):e101953.
- 4. Biswal BB, et al. Toward discovery science of human brain function. Proc Natl Acad Sci USA. 2010;107(10):4734–4739.
- 5. Van Essen DC, et al.; WU-Minn HCP Consortium. The WU-Minn Human Connectome Project: An overview. Neuroimage. 2013;80:62–79.
- 6. Poldrack RA, Gorgolewski KJ. Making big data open: Data sharing in neuroimaging. Nat Neurosci. 2014;17(11):1510–1517.
- 7. Poldrack RA, et al. Toward open sharing of task-based fMRI data: The OpenfMRI project. Front Neuroinform. 2013;7(12):12.
- 8. Mueller SG, et al. The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin N Am. 2005;15(4):869–877, xi–xii.
- 9. Jack CR, Jr, et al. The Alzheimer’s Disease Neuroimaging Initiative (ADNI): MRI methods. J Magn Reson Imaging. 2008;27(4):685–691.
- 10. Poline JB, et al. Data sharing in neuroimaging research. Front Neuroinform. 2012;6(9):9.
- 11. Scarpazza C, Sartori G, De Simone MS, Mechelli A. When the single matters more than the group: Very high false positive rates in single case voxel based morphometry. Neuroimage. 2013;70:175–188.
- 12. Ashburner J, Friston KJ. Voxel-based morphometry—the methods. Neuroimage. 2000;11(6 Pt 1):805–821.
- 13. Silver M, Montana G, Nichols TE; Alzheimer’s Disease Neuroimaging Initiative. False positives in neuroimaging genetics using voxel-based morphometry data. Neuroimage. 2011;54(2):992–1000.
- 14. Eklund A, Andersson M, Josephson C, Johannesson M, Knutsson H. Does parametric fMRI analysis with SPM yield valid results? An empirical study of 1484 rest datasets. Neuroimage. 2012;61(3):565–578.
- 15. Friston K, Ashburner J, Kiebel S, Nichols T, Penny W. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Elsevier/Academic; London: 2007.
- 16. Ashburner J. SPM: A history. Neuroimage. 2012;62(2):791–800.
- 17. Jenkinson M, Beckmann CF, Behrens TE, Woolrich MW, Smith SM. FSL. Neuroimage. 2012;62(2):782–790.
- 18. Cox RW. AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Comput Biomed Res. 1996;29(3):162–173.
- 19. Friston KJ, Worsley KJ, Frackowiak RSJ, Mazziotta JC, Evans AC. Assessing the significance of focal activations using their spatial extent. Hum Brain Mapp. 1994;1(3):210–220.
- 20. Forman SD, et al. Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): Use of a cluster-size threshold. Magn Reson Med. 1995;33(5):636–647.
- 21. Woo CW, Krishnan A, Wager TD. Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations. Neuroimage. 2014;91:412–419.
- 22. Nichols TE, Holmes AP. Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum Brain Mapp. 2002;15(1):1–25.
- 23. Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE. Permutation inference for the general linear model. Neuroimage. 2014;92:381–397.
- 24. Brammer MJ, et al. Generic brain activation mapping in functional magnetic resonance imaging: A nonparametric approach. Magn Reson Imaging. 1997;15(7):763–770.
- 25. Bullmore ET, et al. Global, voxel, and cluster tests, by theory and permutation, for a difference between two groups of structural MR images of the brain. IEEE Trans Med Imaging. 1999;18(1):32–42.
- 26. Carp J. The secret lives of experiments: Methods reporting in the fMRI literature. Neuroimage. 2012;63(1):289–300.
- 27. Eklund A, Dufort P, Villani M, LaConte S. BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs. Front Neuroinform. 2014;8:24.
- 28. Lieberman MD, Cunningham WA. Type I and type II error concerns in fMRI research: Re-balancing the scale. Soc Cogn Affect Neurosci. 2009;4(4):423–428.
- 29. Woolrich MW, Behrens TE, Beckmann CF, Jenkinson M, Smith SM. Multilevel linear modelling for FMRI group analysis using Bayesian inference. Neuroimage. 2004;21(4):1732–1747.
- 30. Hayasaka S, Nichols TE. Validating cluster size inference: Random field and permutation methods. Neuroimage. 2003;20(4):2343–2356.
- 31. Kriegeskorte N, et al. Artifactual time-course correlations in echo-planar fMRI with implications for studies of brain function. Int J Imaging Syst Technol. 2008;18(5-6):345–349.
- 32. Kiebel SJ, Poline JB, Friston KJ, Holmes AP, Worsley KJ. Robust smoothness estimation in statistical parametric maps using standardized residuals from the general linear model. Neuroimage. 1999;10(6):756–766.
- 33. Hayasaka S, Phan KL, Liberzon I, Worsley KJ, Nichols TE. Nonstationary cluster-size inference with random field and permutation methods. Neuroimage. 2004;22(2):676–687.
- 34. Salimi-Khorshidi G, Smith SM, Nichols TE. Adjusting the effect of nonstationarity in cluster-based and TFCE inference. Neuroimage. 2011;54(3):2006–2019.
- 35. Tom SM, Fox CR, Trepel C, Poldrack RA. The neural basis of loss aversion in decision-making under risk. Science. 2007;315(5811):515–518.
- 36. Duncan KJ, Pattamadilok C, Knierim I, Devlin JT. Consistency and variability in functional localisers. Neuroimage. 2009;46(4):1018–1026.
- 37. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
- 38. Pavlicová M, Cressie NA, Santner TJ. Testing for activation in data from FMRI experiments. J Data Sci. 2006;4(3):275–289.
- 39. Scarpazza C, Tognin S, Frisciata S, Sartori G, Mechelli A. False positive rates in voxel-based morphometry studies of the human brain: Should we be worried? Neurosci Biobehav Rev. 2015;52:49–55.
- 40. Meyer-Lindenberg A, et al. False positives in imaging genetics. Neuroimage. 2008;40(2):655–661.
- 41. Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: A comparative review. Stat Methods Med Res. 2003;12(5):419–446.
- 42. Fair DA, et al. A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. Neuroimage. 2007;35(1):396–405.
- 43. Eklund A, Dufort P, Forsberg D, LaConte SM. Medical image processing on the GPU—past, present and future. Med Image Anal. 2013;17(8):1073–1094.
- 44. Gorgolewski KJ, et al. NeuroVault.org: A repository for sharing unthresholded statistical maps, parcellations, and atlases of the human brain. Neuroimage. 2016;124(Pt B):1242–1244.