Article

A scalable bootstrap for massive data

Publisher

WILEY
DOI: 10.1111/rssb.12050

Keywords

Bootstrap; Computational efficiency; Estimator quality assessment; Massive data; Resampling

Funding

  1. US Army Research Laboratory
  2. US Army Research Office [W911NF-11-1-0391]
  3. National Science Foundation [1122732]
  4. National Science Foundation, Directorate for Computer and Information Science and Engineering
  5. National Science Foundation, Office of Advanced Cyberinfrastructure (OAC) [1122732]

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets, which are increasingly prevalent, the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification of tuning parameters (such as the number of subsampled data points), and they often require knowledge of the estimator's convergence rate, in contrast with the bootstrap. As an alternative, we introduce the 'bag of little bootstraps' (BLB), a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the m out of n bootstrap and subsampling. In addition, we present results from a large-scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.
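
The abstract does not spell out the steps of the procedure, so the following is only a minimal sketch of how the BLB is commonly described: draw a few small subsets without replacement, resample each subset up to the full sample size n, assess the estimator within each subset, and average those assessments. The function name `blb_standard_error`, the parameters `gamma`, `s`, `r` and `seed`, their default values, and the choice of the sample mean as the estimator are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random
from collections import Counter


def weighted_mean(values, counts):
    """Mean of `values` where values[i] is treated as appearing counts[i] times."""
    total = sum(counts)
    return sum(v * c for v, c in zip(values, counts)) / total


def blb_standard_error(data, gamma=0.7, s=10, r=50, seed=0):
    """Sketch of a BLB estimate of the standard error of the sample mean.

    gamma, s, r and seed are hypothetical tuning parameters chosen for
    illustration; `data` must be a list with at least two elements.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))  # subset size, much smaller than n
    per_subset = []
    for _ in range(s):
        # Step 1: draw a small subset of b points without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(r):
            # Step 2: resample n points from the subset with replacement,
            # stored compactly as counts over the b subset points.
            draws = Counter(rng.choices(range(b), k=n))
            counts = [draws.get(i, 0) for i in range(b)]
            estimates.append(weighted_mean(subset, counts))
        # Step 3: assess estimator quality within this subset
        # (here: the standard deviation of the resampled estimates).
        centre = sum(estimates) / r
        variance = sum((e - centre) ** 2 for e in estimates) / (r - 1)
        per_subset.append(math.sqrt(variance))
    # Step 4: average the per-subset assessments.
    return sum(per_subset) / s
```

Each resample touches only the b points of its subset, represented as counts, which is what would let the subsets be processed independently on separate workers. For clarity this pure-Python sketch still draws n indices per resample; a practical version would presumably sample the counts directly from a multinomial distribution.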