Bootstrapping

Dyalog
APL
Statistics
Author

Isaac Flath

Published

July 20, 2022

Sampling/Bootstrapping

This post shows how to calculate statistics in the way I believe should be the default for data scientists: bootstrapping. If you are not familiar with this approach and it sounds intriguing, check out this page to find a great book that gives a fantastic start on bootstrapping and practical statistics.

Note

The quality and simplicity of the APL code in this post was improved thanks to the kind feedback provided by rak1507. It’s awesome to have experienced community members like rak1507 that are willing to read through material written by people newer to array programming and offer feedback in a supportive way.

Why Bootstrapping

Because this opinion seems to put me in the minority of data scientists, I am including a short piece on why I advocate for bootstrapping.

In classical statistics, very clever algebraic formulas are used to approximate a sampling distribution, and that approximation can be used to calculate a p value, a confidence interval, or other statistics. These formulas rely on assumptions about the data and do not work if those baked-in assumptions are not true. In other words, they are shortcuts to calculating an answer that work only in specific situations.

Today we do not need to approximate a sampling distribution using algebra. We can do something much more elementary, more powerful, and more flexible. Thanks to modern computers, we can just sample our data repeatedly to create an actual sampling distribution and calculate statistics directly from it. You get the same answer. So why do I advocate for a bootstrapping-first approach?

  • It is simpler and more intuitive. This means it is far easier to craft custom tests and statistics based on whatever you want and reason about what things are.
  • Bootstrapping assumes nothing other than that you have data. Classical statistical formulas are shortcuts enabled by baked-in assumptions about the data. This means the same bootstrapping approach works in basically all situations, whereas classical statistical formulas only apply in the particular situations they were designed for.

For this reason I believe bootstrapping should be the default, and you can switch to the computational shortcuts in situations where it makes sense (i.e., you are very confident you understand the assumptions, confident they hold for your problem, and the amount of data makes bootstrapping non-trivial).

Note

Much of this next bit is heavily inspired by Overview of Statistics: Unlocking the Power of Data by Lock, Lock, Lock, Lock, and Lock, published by Wiley (2012). I have summarized the key points that I think are relevant to what I want to communicate. For example, the quotes I am using are quotes I originally saw in their article.

Many top statisticians knew bootstrapping was a more elementary and more flexible approach long before it was computationally feasible. For example, in 1936 Sir R.A. Fisher (who created the foundations of statistical inference) spoke about this approach:

Actually, the statistician does not carry out this very simple and very tedious process, but his conclusions have no justification beyond the fact that they agree with those which could have been arrived at by this elementary method.

While these methods were tedious in 1936, they are trivial thanks to modern computers. We no longer have to use clever algebraic tricks to approximate a sampling distribution: we can just create one, as George Cobb pointed out in the journal Technology Innovations in Statistical Education.

What we teach is largely the technical machinery of numerical approximations based on the normal distribution and its many subsidiary cogs. This machinery was once necessary, because the conceptually simpler alternative based on permutations was computationally beyond our reach. Before computers statisticians had no choice. These days we have no excuse.

Sampling Distribution

A sampling distribution is a distribution of samples. Let’s talk about what that means.

First, a sample is a subset of our data. [1, 0, 2, 1] is a sample of [0, 1, 2, 3]. A few things about samples:

  • We will sample by taking a random selection of values from our data.
  • We are sampling with replacement, meaning we can pick the same data point multiple times.
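
As a tiny preview of what this looks like in APL (each primitive used here is explained step by step below; the result is random, so yours will differ):

(0 1 2 3)[?4⍴4] ⍝ ?4⍴4 rolls four 4-sided dice, picking random index positions with replacement; one possible result is 1 0 2 1, matching the sample above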

Next, a distribution is mostly just a bunch of data points. In our case we are going to have a bunch of samples.

So we need a way to create a sample, and then we need to do that a bunch of times. Let’s get started.

]box on -style=max
┌→────────────────┐
│Was ON -style=max│
└─────────────────┘

Create 1 Sample

Let’s start by creating a sample:

  • Create some data for us to sample
  • Get 5 random numbers between 1 and 10 (no duplicate numbers)
⎕←V ← 5?10
┌→─────────┐
│4 3 10 1 5│
└~─────────┘

Next we need to get a random sample of indexes from our data V. We can do that in 3 steps:

  1. Get the total number of elements in our data array with ≢V (tally the vector).
  2. Create an array of the size of the sample we want, filled with ≢V, using 10⍴≢V. This creates an array of length 10; APL automatically broadcasts the scalar ≢V to every element.
  3. ? rolls a die for each element, giving a random integer between 1 and the value of that element. This gives us a random index location for each sample point we want.
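
To make those steps concrete, here is each intermediate result on its own (step 3 is random, so your values will differ):

≢V      ⍝ step 1: the tally of V, which is 5 here
10⍴≢V   ⍝ step 2: ten copies of that tally: 5 5 5 5 5 5 5 5 5 5
?10⍴≢V  ⍝ step 3: ten rolls of a 5-sided die, e.g. 3 2 2 2 2 5 1 2 4 2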

Put that all together and we have the following code:

  • Get random sample of indexes
  • Get 10 random numbers between 1 and ≢V (duplicates allowed)
⎕←S←?10⍴≢V
┌→──────────────────┐
│3 2 2 2 2 5 1 2 4 2│
└~──────────────────┘

Since that created random index locations, we can look those indexes up in our original vector V to get our random sample.

V[S]
┌→───────────────────┐
│10 3 3 3 3 5 4 3 1 3│
└~───────────────────┘

If we put that together we get a nice compact way of drawing a sample.

V[?10 ⍴ ≢V]
┌→───────────────────┐
│3 10 3 3 1 4 3 4 3 3│
└~───────────────────┘
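
If we use this a lot, it also packages naturally into a dfn. A minimal sketch (the name sample is my own, not a built-in):

sample ← {⍵[?⍺⍴≢⍵]} ⍝ ⍺: how many points to draw ⋄ ⍵: the data to sample from
10 sample V         ⍝ draw a sample of 10, e.g. 3 10 3 3 1 4 3 4 3 3 (random)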

Create sampling distribution

We drew a sample, but really what we want to do is draw a whole bunch of samples. All we have to do is create a matrix of indices instead of a vector and the exact same approach works.

This is the same as above, except instead of 10 ⍴ to create an array of shape 10, we use 5 10 ⍴ to create an array of shape 5 by 10.

For convenience I store the shapes in a variable for later use.

V[?(n←5) (ss←10) ⍴ ≢V]
┌→──────────────────────────┐
↓ 4 4  3  4  1  5  4 3  3 10│
│ 1 3 10  5  3  4 10 1  4 10│
│ 5 1 10 10  5  3  3 4  1  4│
│10 5 10  1 10  3  1 4 10  1│
│10 5  5  5  1 10  3 3  1 10│
└~──────────────────────────┘

Calculate statistics

Now that we know how to calculate a sampling distribution we can calculate some statistics. I will start with confidence intervals, then move into p values.

Confidence Intervals

Let’s draw a bigger sample and calculate a confidence interval, using 10,000 random numbers between 1 and 100.

data ← ?10000/100 ⍝ 10000/100 makes 10,000 copies of 100; ? rolls each, giving random integers between 1 and 100

10↑data ⍝ look at first 10 values
┌→───────────────────────────┐
│50 3 15 43 93 60 96 29 71 58│
└~───────────────────────────┘

Next we can calculate a sampling distribution and look at a few of the samples. We use the code from before, but with 1,000 draws.

sampling_distribution←data[? (n←1000) (ss←10) ⍴ ≢ data]
5↑sampling_distribution
┌→─────────────────────────────┐
↓20 56 92 100 34 89 28 92 10 21│
│34 95 89  69 35 81 25 25 80 87│
│77 68 32  77 57 20 10 20 21 95│
│37 73 19  79 11 88 13  1 90 68│
│70 42 10  74 62 34 82 17  3 19│
└~─────────────────────────────┘

We want to do a confidence interval on the mean so we need to calculate the mean of each of these samples.

  • +/ Row-wise sum (trailing axis)
  • ss÷⍨ divides each element by ss (ss defined when creating sampling distribution)
sample_means ← ss÷⍨+/ sampling_distribution
8↑sample_means
┌→────────────────────────────────────┐
│54.2 62 47.7 47.9 41.3 63.5 50.7 44.3│
└~────────────────────────────────────┘

Now we calculate a 90% confidence interval on our sample mean. That means we are 90% confident our mean will land in the given interval. This is easy to do because we already have a good sampling distribution of means: we just cut off the top and bottom 5% of values (the lowest 50 and highest 50 of our 1,000 sample means), and 90% of the values land in the remaining range.

  • ⍋ grades ascending: it returns the indices that would sort the array. We drop the first 50 indices and take the next 900, leaving the indices of the middle 90% of sample means.
order90 ← 900↑50↓⍋sample_means

Get the min and max of the middle 90% of sample means, which is our 90% confidence interval. Because the indices in order90 are in ascending order of value, the first index points at the minimum and the last at the maximum.

sample_means[⊣/order90]
sample_means[⊢/order90]
36.2
65.5

We can say we are 90 percent confident that a mean based on a sample of size 10 will land in that range, because we drew many such samples and 90 percent of their means did land there.
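
If you do this often, the whole procedure fits in one function. Here is a minimal sketch (the name CI and its argument convention are my own, not from a library; results are random and will vary run to run):

CI ← {                      ⍝ ⍺: n ss alpha ⋄ ⍵: data vector
    (n ss a)←⍺
    means←ss÷⍨+/⍵[?n ss⍴≢⍵] ⍝ bootstrap sample means
    k←⌊n×a÷2                ⍝ how many means to trim from each tail
    idx←(n-2×k)↑k↓⍋means    ⍝ indices of the middle 1-alpha of the means
    means[(⊣/idx),⊢/idx]    ⍝ (low high) bounds of the interval
}
1000 10 0.1 CI data         ⍝ e.g. 36.2 65.5 (your values will differ)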

P values

Let’s say we have 2 sets of data and we want to know whether some statistic differs between them. We have 10,000 data points from our original process, and we ran an experiment and got 100 data points with our new process. We want to calculate a p value to see if the data supports our hypothesis that the new process had a statistically significant impact.

Note

A statistically significant impact does not necessarily mean a practically significant one. This test answers the basic question (are these 2 means different?), but often that isn’t a very helpful question. Often we want to ask “are these 2 means different by at least X?”. After reviewing the simple examples, think through how you might design that test via bootstrapping!

P value on equal means

baseline ← 1-⍨2×⍨?10000/0 ⍝ 10,000 uniform (0,1) floats, rescaled to (¯1,1): mean 0
experiment ← 0.5-⍨?100/0  ⍝ 100 uniform (0,1) floats, shifted to (¯0.5,0.5): mean 0

These should have roughly the same means, so we should get a large p value, showing the difference is not statistically significant.

Let’s run the test and see what we get. First let’s get our statistic from our experiment (mean).

⎕←experiment_mean ← (+/experiment) ÷ ≢experiment
0.02131968282

Now let’s create our sampling distribution on our baseline.

sampling_distribution←baseline[? (n←1000) (ss←10) ⍴ ≢ baseline]

Calculate the means of each.

sampling_means ← ss ÷⍨ +/sampling_distribution

We then calculate a p value by seeing what percentage of sample means our experiment mean is more extreme than. We can check this on both ends of the distribution; normally we would take the smaller one.

n ÷⍨ +/ experiment_mean>sampling_means
n ÷⍨ +/ experiment_mean<sampling_means
0.545
0.455

P value on different means

baseline ← ?10000/0       ⍝ 10,000 uniform (0,1) floats: mean 0.5
experiment ← 0.2-⍨?100/0  ⍝ 100 uniform (0,1) floats, shifted down by 0.2: mean 0.3

These should have different means, so we should get a small p value, showing the difference is statistically significant.

Let’s run the test and see what we get. First let’s get our statistic from our experiment (mean).

⎕←experiment_mean ← (+/experiment) ÷ ≢experiment
0.302065664

Now let’s create our sampling distribution on our baseline.

sampling_distribution← baseline[? (n←1000) (ss←10) ⍴ ≢ baseline]
sampling_means ← ss ÷⍨ +/sampling_distribution

We then calculate a p value by seeing what percentage of sample means our experiment mean is more extreme than. We can check this on both ends of the distribution, but we take the smaller one. We can see our p value is quite small: the test successfully detected that we likely have a different mean.

n ÷⍨ +/ sampling_means > experiment_mean
n ÷⍨ +/ sampling_means < experiment_mean
0.993
0.007
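
As with the confidence interval, the p value procedure packages up nicely for reuse. A minimal sketch (the name PVal is my own, and the n of 1000 and ss of 10 are hard-coded to mirror the values used above):

PVal ← {                    ⍝ ⍺: experiment data ⋄ ⍵: baseline data
    em←(+/⍺)÷≢⍺             ⍝ experiment mean
    sm←10÷⍨+/⍵[?1000 10⍴≢⍵] ⍝ 1000 bootstrap means of the baseline
    1000÷⍨(+/em>sm)⌊+/em<sm ⍝ the smaller of the two tail proportions
}
experiment PVal baseline    ⍝ a small value suggests a statistically significant difference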