Very Fast Sampling Algorithms for Big Data
New sampling, bootstrap, and permutation test algorithms which are orders of magnitude faster than built-in SAS Procs, Stata, and MATLAB.
New bootstrap, permutation test, and sampling algorithms are presented (along with SAS code) that run orders of magnitude faster than built-in SAS Procs. Subsequent tests against Stata show an even greater speed premium (many tens of thousands of times faster), and tests against MATLAB show a speed premium of at least an order of magnitude, depending on data, strata, and sample size. Implemented in SAS, the algorithms have great utility for "big data" sampling applications.
"Bootstraps, Permutation Tests, and Sampling Orders of Magnitude Faster Using SAS®," forthcoming, Computational Statistics: WIRE Reviews, Vol. 5(4),
John Douglas ("J.D.") Opdyke, DataMineit, LLC, May, 2013
While permutation tests and bootstraps have very wide-ranging application, both share a common potential drawback: as data-intensive resampling methods, both can be runtime prohibitive when applied to large or even medium-sized data samples drawn from large datasets. The data explosion over the past few decades has made this a common occurrence, and it highlights the increasing need for faster, more efficient, and more scalable permutation test and bootstrap algorithms.
Seven bootstrap and six permutation test algorithms coded in SAS are compared. The fastest algorithms ("OPDY" for the bootstrap, "OPDN" for permutation tests) are new, use no modules beyond Base SAS, and run orders of magnitude faster than the relevant "built-in" SAS procedures. OPDY is over 200x faster than Proc SurveySelect. OPDN is over 240x faster than Proc SurveySelect, over 350x faster than NPAR1WAY (which crashes on datasets less than a tenth the size OPDN can handle), and over 720x faster than Proc Multtest.
OPDY is also much faster than hashing, which crashes on datasets smaller, sometimes by orders of magnitude, than those OPDY can handle. OPDY is easily generalizable to multivariate regression models, and OPDN, which uses an extremely efficient draw-by-draw random-sampling-without-replacement algorithm, can use virtually any permutation statistic, so both have a very wide range of application. The time complexity of both OPDY and OPDN is sublinear, making them not only the fastest, but also the only truly scalable bootstrap and permutation test algorithms, respectively, in SAS.
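To make the idea of draw-by-draw sampling without replacement concrete, here is a minimal Python sketch. It is not OPDN: the abstract does not give that algorithm's details, and the paper's code is in SAS. The sketch instead assumes the classic one-pass selection-sampling scheme, in which each record is taken with probability (draws still needed) / (records still remaining), and plugs it into a simple two-sample permutation test on the difference in means. The function names `draw_by_draw_sample` and `permutation_p_value`, and the choice of test statistic, are hypothetical and chosen only for illustration.

```python
import random


def draw_by_draw_sample(n_total, n_sample, rng):
    """Pick n_sample of n_total record indices without replacement in one pass.

    Assumed stand-in technique (classic selection sampling), not OPDN itself.
    Record i is taken with probability needed / remaining, which yields exactly
    n_sample indices, in ascending order, with every subset equally likely.
    """
    if not 0 <= n_sample <= n_total:
        raise ValueError("n_sample must be between 0 and n_total")
    selected = []
    needed = n_sample
    for i in range(n_total):
        if needed == 0:
            break  # sample is complete; no need to scan the rest
        remaining = n_total - i
        if rng.random() * remaining < needed:
            selected.append(i)
            needed -= 1
    return selected


def permutation_p_value(group_a, group_b, n_perm, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    extreme = 0
    for _ in range(n_perm):
        # Relabel: a random n_a-subset of the pooled data plays the role of group A.
        idx = draw_by_draw_sample(len(pooled), n_a, rng)
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)
```

Because only the sum of the sampled group is needed for this statistic, each permutation sample requires a single pass and no copy or shuffle of the pooled data. Any other permutation statistic could be computed from the selected indices in the same way.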
Keywords: Bootstrap, Permutation, SAS, Big Data, Scalable, Hashing, With Replacement, Without Replacement, Sampling