
Inferential Statistics, summary

Context

After we are done with EDA, which tells us “what” we can see in the existing data, we move on to “what else” we can infer from it.

The topics were mostly around this idea: since we can never “have” the entire population data (it’s a costly affair in both time and money), we extrapolate our findings from samples and assume they will apply to the population.

The parameters that we pick are the mean and standard deviation. We calculate the mean and standard deviation of the sample (or of as many samples as we can) and use them to estimate the mean and standard deviation of the population.
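As a minimal sketch of that idea (the sample values below are made up), Python’s standard statistics module already computes the sample standard deviation with the n−1 correction that makes it a reasonable estimator for the population:

```python
import statistics

sample = [34, 31, 36, 33, 35, 30, 32, 34]  # hypothetical sample data

sample_mean = statistics.mean(sample)   # point estimate of the population mean
sample_sd = statistics.stdev(sample)    # n-1 in the denominator (Bessel's correction)

print(sample_mean, sample_sd)
```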

Probability

What are the chances of something happening?

The data that we have has multiple different scenarios inherent in it. The scenario we have at hand is the event we are interested in. We call that scenario a Random Variable. Think of it as a variable which can take many values.

Now, for each data point, this function can produce a different value, and all those different values together represent the Random Variable. Once we have calculated the function’s value for all the data points, we can create a frequency plot, where the x-axis shows the values of the random variable and the y-axis shows the frequency (how many times each value appeared as an output of the function).

From this frequency plot, we can create a probability plot where the x-axis is still the random variable and the y-axis is the probability of the values assumed by the random variable.
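Here is a rough sketch of that frequency-to-probability step, with made-up discrete outcomes:

```python
from collections import Counter

# Hypothetical outcomes of a discrete random variable, e.g. number of
# heads seen in 3 coin flips, recorded over repeated trials.
outcomes = [2, 1, 3, 0, 2, 1, 2, 1, 1, 2, 0, 3, 2, 1, 2]

freq = Counter(outcomes)                                # x: value, y: frequency
prob = {x: c / len(outcomes) for x, c in freq.items()}  # frequency -> probability

for x in sorted(prob):
    print(x, freq[x], round(prob[x], 3))
```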

Obviously the data points we had for creating these plots are not all the population data; they are samples. We can create many such plots for many different samples. We should collect them together and create an averaged-out plot so that any biases are smoothed out.

Because of the “Central Limit Theorem”, it becomes possible to estimate the population parameters (mean and standard deviation) from these sample plots. Once we have them, we can say, with a 95% confidence level, that the population value lies within a certain margin of error of our sample estimate.
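Here is an illustrative simulation of that idea: a deliberately skewed “population” (which, in reality, we would not have access to), many sample means drawn from it, and a 95% interval built from their spread. All the numbers are invented for the demo:

```python
import random
import statistics

random.seed(0)

# A deliberately skewed "population" that we normally would not have.
population = [random.expovariate(1 / 35) for _ in range(100_000)]

# Means of many samples: by the Central Limit Theorem their
# distribution is roughly normal around the population mean.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(1_000)]

estimate = statistics.mean(sample_means)
standard_error = statistics.stdev(sample_means)  # spread of the sample means
margin_of_error = 1.96 * standard_error          # 95% confidence level

print(f"{estimate:.2f} +/- {margin_of_error:.2f}")
```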

Probability distributions

The term distribution is nothing but a description of how the values are spread out. Generally, when someone is talking about a “probability distribution”, imagine a plot with some random variable on the x-axis and the probability of that random variable on the y-axis.

For discrete data, think of a bar chart with many bars.

The tricky part, which the data scientists and mathematicians have solved and handed to us on a silver platter, is this. In a dataset, we can have different scenarios and different data for those scenarios. Think of the scenario as a function: the parameter to the function is your random variable, and the output of the function is the probability of that random variable’s value occurring. Plot it, and it will assume a characteristic shape. The data scientists have created a bunch of these functions for given situations, and they call those functions <something> distributions: for example, the binomial distribution, the geometric distribution, the exponential distribution. We should use different functions for different scenarios to calculate the probability.

We can calculate the exact probability for a discrete random variable with these functions.
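For instance, here is a quick sketch using scipy’s binomial distribution (assuming scipy is available) to get an exact probability for a discrete random variable:

```python
from scipy.stats import binom

# Exact probability of a discrete random variable taking one value:
# P(X = 7) where X = number of heads in 10 fair coin flips.
p_exactly_7 = binom.pmf(k=7, n=10, p=0.5)
print(p_exactly_7)  # ~0.117
```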

For continuous data, we can have similar functions where we input the random variable and out comes a probability density. There is another way too, which we commonly use: instead of evaluating a function, we use a mapping table (z-table, t-table, etc.), where input/output pairs are pre-recorded for a z-value (the random variable converted to a z-score) and the corresponding probability. The difference between discrete and continuous data is that we can give the exact probability of a discrete random variable taking a value, but we cannot give the exact probability of a continuous random variable taking a value. So we give a range, with a confidence level and a margin of error.

A difference between the probability plots for discrete and continuous data is that the discrete plot is probability vs. random variable, while the continuous plot is probability density vs. random variable. This means that for discrete data, I can get the exact probability for a value of the random variable; for continuous data, the z-table gives a cumulative probability (the area from negative infinity up to that value of the random variable).
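A small sketch of the continuous case, using scipy’s norm.cdf in place of a printed z-table; the value, mean, and standard deviation below are made up:

```python
from scipy.stats import norm

# For continuous data P(X = x) is zero, so we work with cumulative areas.
# norm.cdf plays the role of the z-table: area from -infinity up to z.
x, mu, sigma = 32, 35, 2          # made-up value, mean, standard deviation
z = (x - mu) / sigma              # convert to a z-value

p_below = norm.cdf(z)                        # P(X <= 32)
p_within_1sd = norm.cdf(1) - norm.cdf(-1)    # ~0.68, area within +/- 1 sd

print(p_below, p_within_1sd)
```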

Hypothesis vs Inference

Inference is about how to assume something about a population from a sample (or many samples). It’s possible because of the pillar called the Central Limit Theorem.

Hypothesis testing is about comparing two things. A population-level claim is compared with a sample, or two samples are compared with each other. These two samples can be paired (e.g. the same patients were examined in the morning and then in the evening, having been given some medicine in the afternoon) or unpaired (some patients in hospital 1 and some patients in hospital 2).
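As an illustration (with invented measurements), scipy offers ttest_rel for the paired case and ttest_ind for the unpaired case:

```python
from scipy import stats

# Paired: the same patients measured in the morning and in the evening.
morning = [120, 115, 130, 125, 118]
evening = [112, 110, 124, 119, 113]
t_paired, p_paired = stats.ttest_rel(morning, evening)

# Unpaired: two independent groups of patients.
hospital_1 = [72, 75, 78, 70, 74]
hospital_2 = [68, 71, 69, 73, 66]
t_unpaired, p_unpaired = stats.ttest_ind(hospital_1, hospital_2)

print(p_paired, p_unpaired)
```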

The way these comparisons are done is by assuming a probability of error. Let’s say the population mean is 35 and the sample’s mean is about 32. Maybe these two are close enough, maybe not; that depends on the scenario. But the way we talk about it is by assuming a probability of error (the significance level).
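Here is a sketch of turning that comparison into a test statistic; the population standard deviation and sample size are assumptions for the example:

```python
import math

# One-sample z-test sketch: claimed population mean 35, sample mean 32.
# The population sd and sample size are made up for the example.
mu, x_bar, sigma, n = 35, 32, 6, 40

z = (x_bar - mu) / (sigma / math.sqrt(n))  # distance in standard errors
print(z)  # ~ -3.16
```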

The term “significance level” may have come out of a lot of discussion among statisticians, mathematicians, etc. What made me frown was the word ‘significance’: what are we actually talking about?

The way I understand it now is this. There are scenarios where a slight increase/decrease from the mean has a big impact, and scenarios where it has a very small impact. If a slight increase/decrease from the mean has a big impact, that slight increase/decrease is “significant”. In that case, the significance level will be high. This also means that with a slight increase/decrease, the chances of the null hypothesis being rejected are also high.

Now you can connect this with the mental image you have of a normal distribution with red-colored rejection areas on the left and right (for a 2-tailed test). If the significance level (alpha) is high, the rejection areas are large, so even a slight increase/decrease from the mean will soon land in the rejection area.

On the other hand, if the significance level (alpha) is low, a slight increase/decrease from the mean doesn’t matter much. Even if the sample mean is somewhat far from the population mean, it’s OK, since the rejection area (the red area under the curve) is far out in the tails and small.

In both the critical-value way and the p-value way, we have an acceptance area, a rejection area, and a significance level. It’s just that in the critical-value way we calculate in this order: significance level => z-value => upper-critical and lower-critical values, and then we check whether the sample mean falls in the acceptance area or not.

In the p-value way, we calculate the z-value of the sample mean and then its probability, and we compare that probability with the significance level.
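Here are both decision routes side by side, continuing with the z-statistic from the sketch above (alpha and z are illustrative):

```python
from scipy.stats import norm

alpha = 0.05     # significance level
z = -3.16        # z-statistic from the sketch above (illustrative)

# Critical-value way: alpha -> critical z -> compare the statistic.
z_critical = norm.ppf(1 - alpha / 2)       # ~1.96 for a 2-tailed test
reject_by_critical = abs(z) > z_critical

# p-value way: statistic -> probability -> compare with alpha.
p_value = 2 * norm.cdf(-abs(z))            # 2-tailed p-value
reject_by_p = p_value < alpha

print(reject_by_critical, reject_by_p)     # both give the same decision
```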