Final code:

import numpy as np import matplotlib.pyplot as plt mu, sigma = 0, 0.1 pop_normal = np.random.normal(mu, sigma, 10000000) print 'Population Mean:', np.mean(pop_normal) bins = 15 plt.figure(1) plt.suptitle('Distribution of population', fontsize=16) plt.hist(pop_normal, bins = bins) plt.xlabel('Values') plt.ylabel('Frequency') plt.savefig('Popula.jpg') print 'Random sample:', pop_normal[np.random.randint(pop_normal.size)] for i in range(10): print i+1, '. random sample:', pop_normal[np.random.randint(pop_normal.size)] sample_means = np.arange(60.0) for i in range(10): samples = pop_normal[np.random.randint(pop_normal.size, size = 3)] np.put(sample_means, i, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 10)] np.put(sample_means, i + 10, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 20)] np.put(sample_means, i + 20, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 30)] np.put(sample_means, i + 30, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 50)] np.put(sample_means, i + 40, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 100)] np.put(sample_means, i + 50, np.mean(samples)) fig2, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(nrows=2, ncols=3) plt.suptitle('Sampling 10 times with various sample sizes', fontsize=16) ax1.hist(sample_means[0:10], bins = bins) ax1.set_title('Sample size n=3') ax1.set_xlabel('Value of Means') ax1.set_ylabel('Frequency') ax2.hist(sample_means[10:20], bins = bins) ax2.set_title('Sample size n=10') ax2.set_xlabel('Value of Means') ax2.set_ylabel('Frequency') ax3.hist(sample_means[20:30], bins = bins) ax3.set_title('Sample size n=20') ax3.set_xlabel('Value of Means') ax3.set_ylabel('Frequency') ax4.hist(sample_means[30:40], bins = bins) ax4.set_title('Sample size n=30') ax4.set_xlabel('Value of Means') ax4.set_ylabel('Frequency') ax5.hist(sample_means[40:50], bins = bins) ax5.set_title('Sample size n=50') ax5.set_xlabel('Value of Means') ax5.set_ylabel('Frequency') ax6.hist(sample_means[50:60], bins = bins) ax6.set_title('Sample size n=100') ax6.set_xlabel('Value of Means') ax6.set_ylabel('Frequency') plt.tight_layout() fig2.subplots_adjust(top = 0.88) plt.savefig('Dist1.jpg') fig3, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(nrows=2, ncols=3) plt.suptitle('Sample size n=35 sampled by various times', fontsize=16) sample_size = 35 sample_means_fixed_sample_size = [] for i in range(100): samples = pop_normal[np.random.randint(pop_normal.size, size = 35)] sample_means_fixed_sample_size.append(np.mean(samples)) ax1.hist(sample_means_fixed_sample_size[:3], bins = bins) ax1.set_title('Sampling 3 times') ax1.set_xlabel('Value of Means') ax1.set_ylabel('Frequency') ax2.hist(sample_means_fixed_sample_size[:10], bins = bins) ax2.set_title('Sampling 10 times') ax2.set_xlabel('Value of Means') ax2.set_ylabel('Frequency') ax3.hist(sample_means_fixed_sample_size[:20], bins = bins) ax3.set_title('Sampling 20 times') ax3.set_xlabel('Value of Means') ax3.set_ylabel('Frequency') ax4.hist(sample_means_fixed_sample_size[:30], bins = bins) ax4.set_title('Sampling 30 times') ax4.set_xlabel('Value of Means') ax4.set_ylabel('Frequency') ax5.hist(sample_means_fixed_sample_size[:50], bins = bins) ax5.set_title('Sampling 50 times') ax5.set_xlabel('Value of Means') ax5.set_ylabel('Frequency') ax6.hist(sample_means_fixed_sample_size, bins = bins) ax6.set_title('Sampling 100 times') ax6.set_xlabel('Value of Means') ax6.set_ylabel('Frequency') plt.tight_layout() fig3.subplots_adjust(top = 0.88) plt.savefig('Dist2.jpg') Population Mean: 5.0419863565e-06 Random sample: 0.179611197522 1 . random sample: -0.0593561539804 2 . random sample: -0.0410881258393 3 . random sample: 0.080267599471 4 . random sample: -0.0102813068294 5 . random sample: -0.00665999430229 6 . random sample: 0.102028677547 7 . random sample: 0.00381388640002 8 . random sample: -0.121671547511 9 . random sample: -0.00176703777575 10 . random sample: 0.00259089874219

Central Limit Theorem is one of the most important concepts of statistics. It lets you find the Mean of the population you are examining. You take a bunch of random samples, calculate their Mean, take another bunch of random samples, calculate their Means and you repeat a couple of times. Then you have a set of Means. If you plot them you get a normal distribution histogram with a Mean which equals the Mean of the population.

I tried to find an official definition or a mathematical definition but interestingly the sites a have checked all phrase the definition slightly different. However if you strip the definition to the bones they all say the same: if you population has a finite level of variance, you can keep sampling with sufficiently large sample sizes, record their Means and as the sample Means accumulate the distribution of your sample Means will be a normal distribution with its Mean being the same as the population’s.

Anyway here is the definition of Wikipedia:

It is really handy as finding the Mean is the first step in analysing your data and most of the times you simply cannot have access to the whole population. Would you like to know the average foot size of the residents of Hungary? It is a relatively small country with about 10 million inhabitants, yet measuring the individual foot sizes is impossible. You either need hundreds or thousands of people to conduct your survey which will cost you a fortune or hire only a few of them but then the survey will take forever. And it is just a small country and you are measuring one feature (foot size).

So in general, you cannot survey the whole population, you need to find another way to get your analysis started (by finding the Mean).

In the below exercise I am going to test the Central Limit Theorem. Is it true what they say? Does it work?

Yes it does.

I import the modules, create a normally distributed population of 10,000,000 between -1 and 1, plot it, and print out its Mean.

import numpy as np import matplotlib.pyplot as plt mu, sigma = 0, 0.1 pop_normal = np.random.normal(mu, sigma, 100000) bins = 15 plt.figure(1) plt.suptitle('Distribution of population', fontsize=16) plt.hist(pop_normal, bins = bins) plt.xlabel('Values') plt.ylabel('Frequency') plt.savefig('Population.jpg') print 'Population Mean:', np.mean(pop_normal) Population Mean: 5.0419863565e-06

The histogram shows that the distribution is indeed normal, and the population Mean is technically 0 as you would expect. Of course since all values are randomly generated, every run of the code would generate slightly different outcomes.

It is important to point out: this is what I don’t know! I have plotted the distribution and printed out the Mean for illustration purposes only. The whole point of using CLT is to find the Mean without measuring the whole population. So consider as you have never seen the above plot and Mean.

Now, let’s assume that we don’t need CLT to find the Mean! Start with the simplest method: pick one sample, maybe it is all you need. Maybe the value of a random sample always be the Mean of the population.

I print out the value of a single sample along with the population Mean (which, I repeat: I shouldn’t know)

print 'Population Mean:', np.mean(pop_normal) print 'Random sample:', pop_normal[np.random.randint(pop_random.size)] Population Mean: 5.0419863565e-06 Random sample: 0.179611197522

Nope, this did not work, I got 0.179 while the population Mean is 0.000005. However by random chance it is possible to pick a sample which is in fact equals to the population Mean so I repeat this method ten times.

print 'Population Mean:', pop_normal[np.random.randint(pop_normal.size)] for i in range(10): print i+1, '. random sample Mean:', np.mean(pop_normal) Population Mean: 5.0419863565e-06 1 . random sample: -0.0593561539804 2 . random sample: -0.0410881258393 3 . random sample: 0.080267599471 4 . random sample: -0.0102813068294 5 . random sample: -0.00665999430229 6 . random sample: 0.102028677547 7 . random sample: 0.00381388640002 8 . random sample: -0.121671547511 9 . random sample: -0.00176703777575 10 . random sample: 0.00259089874219

I can see a few values which are only a few percent off the Mean but it is easy to see that randomly selecting samples is not the way to find the population Mean. According to the definition of CLT I need to keep sampling a sufficiently large sample and their Mean will show me the Mean of the population.

What is sufficiently large? The examples and tutorials I visited all mention sample size of 30 as a minimum but for skewed or not normally distributed populations the sample size should be even bigger.

I use the sample sizes 3, 10, 20, 30, 50, 100 and sample the population 10 times with each sample size, then I plot the distribution of their Means.

sample_means = np.arange(60.0) for i in range(10): samples = pop_normal[np.random.randint(pop_normal.size, size = 3)] np.put(sample_means, i, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 10)] np.put(sample_means, i + 10, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 20)] np.put(sample_means, i + 20, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 30)] np.put(sample_means, i + 30, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 50)] np.put(sample_means, i + 40, np.mean(samples)) samples = pop_normal[np.random.randint(pop_normal.size, size = 100)] np.put(sample_means, i + 50, np.mean(samples)) fig2, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(nrows=2, ncols=3) plt.suptitle('Sampling 10 times with various sample sizes', fontsize=16) ax1.hist(sample_means[0:10], bins = bins) ax1.set_title('Sample size n=3') ax1.set_xlabel('Value of Means') ax1.set_ylabel('Frequency') ax2.hist(sample_means[10:20], bins = bins) ax2.set_title('Sample size n=10') ax2.set_xlabel('Value of Means') ax2.set_ylabel('Frequency') ax3.hist(sample_means[20:30], bins = bins) ax3.set_title('Sample size n=20') ax3.set_xlabel('Value of Means') ax3.set_ylabel('Frequency') ax4.hist(sample_means[30:40], bins = bins) ax4.set_title('Sample size n=30') ax4.set_xlabel('Value of Means') ax4.set_ylabel('Frequency') ax5.hist(sample_means[40:50], bins = bins) ax5.set_title('Sample size n=50') ax5.set_xlabel('Value of Means') ax5.set_ylabel('Frequency') ax6.hist(sample_means[50:60], bins = bins) ax6.set_title('Sample size n=100') ax6.set_xlabel('Value of Means') ax6.set_ylabel('Frequency') plt.tight_layout() fig2.subplots_adjust(top = 0.88) plt.savefig('Distribution_ten_times.jpg')

It doesn’t really show much does it? I take random samples 10 times, it gives me 10 sample Means but they do not show the distribution I expected. I got 2 or 3 Means near the population Mean sometimes, but generally they are just all over the place.

Let’s try another approach. This time the sample size will be fixed at 35 and I run the sampling 100 times. I print out the results of the first 3 runs then the first 10 runs e. t. c. simulating sampling done by 3, 10, 20, 30, 50 ,100 times and plot the distribution of the resulting Means.

fig3, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(nrows=2, ncols=3) plt.suptitle('Sample size n=35 sampled by various times', fontsize=16) sample_size = 35 sample_means_fixed_sample_size = [] for i in range(100): samples = pop_normal[np.random.randint(pop_normal.size, size = 35)] sample_means_fixed_sample_size.append(np.mean(samples)) ax1.hist(sample_means_fixed_sample_size[:3], bins = bins) ax1.set_title('Sampling 3 times') ax1.set_xlabel('Value of Means') ax1.set_ylabel('Frequency') ax2.hist(sample_means_fixed_sample_size[:10], bins = bins) ax2.set_title('Sampling 10 times') ax2.set_xlabel('Value of Means') ax2.set_ylabel('Frequency') ax3.hist(sample_means_fixed_sample_size[:20], bins = bins) ax3.set_title('Sampling 20 times') ax3.set_xlabel('Value of Means') ax3.set_ylabel('Frequency') ax4.hist(sample_means_fixed_sample_size[:30], bins = bins) ax4.set_title('Sampling 30 times') ax4.set_xlabel('Value of Means') ax4.set_ylabel('Frequency') ax5.hist(sample_means_fixed_sample_size[:50], bins = bins) ax5.set_title('Sampling 50 times') ax5.set_xlabel('Value of Means') ax5.set_ylabel('Frequency') ax6.hist(sample_means_fixed_sample_size, bins = bins) ax6.set_title('Sampling 100 times') ax6.set_xlabel('Value of Means') ax6.set_ylabel('Frequency') plt.tight_layout() fig3.subplots_adjust(top = 0.88) plt.savefig('Distribution_35_samples.jpg')

It looks much better!

What am I looking at here? I can see the Means of the first 3 sampling runs in the first subplot, they are actually quite close to the Population Mean. In the next subplot I add 7 more sample Means to the histogram, in the next 10 more to make it show 20 sample Means, and so on. I can see that while most new Means show up randomly left and right of 0 (the population Mean) there is a tendency that the middle of the subplots are more populated than their left or right sides. Then in the last subplot, which shows all 100 sample means I can finally see the normal distribution I expected, with most values near 0, the population Mean.