Stats for data science (Part-1)

People who are interested in data science often have the perception that “data science is all about dealing with data”, which is partially true. But the main question is “how would you deal with the data?” To build a machine learning model, we first need to understand the underlying patterns in the data, which helps us decide which ML algorithm to use for the problem at hand. To understand those patterns we need some basic knowledge of statistics. Today we will discuss a few basic but important areas of statistics that are helpful when starting a career in data science.

Data

Data is a collection of facts and records which provides an understanding of “what has happened”, “when it has happened” and “what is going on”.

With the help of data science methodologies, we use the available and relevant data to estimate “what could happen in the future” and “when it is likely to happen”. To produce these results, a data scientist needs a good knowledge of statistics, which helps in understanding what the data represents. For instance, suppose you were asked to talk about global warming at a seminar. What is the first thing you would do? You would find out what global warming is, its history and its impact on the planet, then pick out the useful information, consolidate it and prepare the speech. Here, global warming is the data you have, and from that data you need to draw out information. If the data is subjective, we can browse the internet or read some books to gather the information. But what if the data is categorical or numerical? Categorical data is presented as labels or string values, for example dog breeds or cat breeds. Numerical data is presented as numbers, for example the temperatures of the last two years. For such data, we need statistics to draw out the information.

Statistics has two categories:

1. Descriptive statistics
2. Inferential statistics

Descriptive statistics: When we have data, we try to summarise it with a few figures instead of presenting all of it.

Here, we try to find the central tendency: by calculating the mean, median, or mode, we represent the whole data with a single value. Consider this example: you are given the task of representing the heights of the students in your class, and all the students’ heights are given to you. How are you going to represent them?

[5’10, 5’7, 6, 5’6, 5’5, 5’8, 5’9, 5’7, 5’4, 5’5, 5’6, 5’7, 5’7, 5’3, 4’9, 6’1, 5’5, 5’7, ……, 5’8]

Arithmetic mean = (5’10 + 5’7 + 6 + 5’6 + 5’5 + 5’8 + 5’9 + 5’7 + 5’4 + 5’5 + 5’6 + 5’7 + 5’7 + 5’3 + 4’9 + 6’1 + 5’5 + 5’7 + …… + 5’8) / total number of students.

One way is by calculating the average of the heights and saying, “the average height of our class is 5’7”. This is what descriptive statistics is.

Note: The mean, median and mode are all measures of central tendency. In everyday use, “average” usually refers to the arithmetic mean.
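As a rough sketch of the three measures, the snippet below uses Python's standard `statistics` module on the first ten heights from the list above, converted to inches by hand. The variable names and the choice of only ten values are assumptions made for illustration, since the full list is not given. `statistics.multimode` needs Python 3.8 or newer.

```python
import statistics

# First ten heights from the list above, converted to inches by hand
# (for example 5'10 -> 70). The rest of the list is elided in the text,
# so this is only an illustrative subset.
heights_in = [70, 67, 72, 66, 65, 68, 69, 67, 64, 65]

mean_height = statistics.mean(heights_in)      # sum of values / count
median_height = statistics.median(heights_in)  # middle value after sorting
modes = statistics.multimode(heights_in)       # most frequent value(s)

print(f"mean    = {mean_height:.1f} in")
print(f"median  = {median_height} in")
print(f"mode(s) = {modes}")
```

For this subset the mean is 67.3 inches, which is close to the 5’7 quoted for the class.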

Inferential statistics: When we have data, we take a sample from it and try to infer useful information that holds for the whole data. In simple terms, we take sample data from the whole data and relate the inferences from the sample to the whole data. (In statistics, the whole data is called the population.)

Before we discuss inferential statistics further, let me introduce two terms: “sample” and “population”. Consider the same example of heights, but now we need to represent the heights of the people in a particular state. The population of a state is huge, so how could we represent this data?

It is very hard to calculate the arithmetic mean of the whole population, so we consider a sample that represents the population. We calculate the AM (Arithmetic mean) of the heights in the sample and conclude by saying, the estimated average height of people in this particular state is 5’8.

The population mean is represented by μ.
The sample mean is represented by x̄.
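To make the idea concrete, here is a minimal simulation, assuming a made-up population of 100,000 heights in inches. The distribution parameters (mean 67, standard deviation 3), the sample size of 500 and all variable names are hypothetical choices, not real data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in inches drawn from a normal
# distribution with mean 67 and standard deviation 3 (made-up parameters).
population = [random.gauss(67, 3) for _ in range(100_000)]
mu = statistics.fmean(population)   # population mean (mu)

# A random sample of 500 people stands in for the whole population.
sample = random.sample(population, 500)
x_bar = statistics.fmean(sample)    # sample mean (x-bar)

print(f"population mean = {mu:.2f}")
print(f"sample mean     = {x_bar:.2f}")
```

The two printed values should be close to each other, which is why a reasonably sized random sample is often enough to estimate the population mean.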

The main reason we work with a sample is that studying the whole population is usually impractical. For example, if doctors have discovered a new drug that could cure cancer, there is no way they could test it on everyone who has cancer. They need to take a small sample, test the drug and analyse the results, then bring some more volunteers into the sample, test on them and analyse the results again, document the results and only then take a final call on the drug. If doctors instead tested the drug on the whole population of cancer patients, it would be a disaster: the government would need to spend a lot of money on the programme, and it would be hard to convince everyone and to document every person’s health condition. In the same way, when we have huge data it is recommended to take a sample, analyse it, infer some useful information from it and, if required, build a machine learning model and check the results. Depending on the results we can decide how to work with the population. This saves time and cost and helps to discover new ways of executing models.

Now we have an idea of what central tendency is, and of how and why we calculate the sample mean from a population. But there is a problem here: if we represent the whole data using just the mean, we lose a lot of information that could be useful in later stages of the study. For instance, take the same example of calculating the AM of the heights of students in a class. Say a student from a different class is given the same task and finds that the average height in his class is 5’7, where most of the students are between 4’5 and 5’6 tall but some are between 5’8 and 6’3. Your class average is also 5’7. How could the same figure be a fair description of both classes? Here we are missing a lot of information. To solve this problem we calculate the “variance”.
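A small sketch of this problem, using made-up heights in inches for two hypothetical classes: both have the same mean, but the values are spread very differently. The numbers are invented purely for illustration.

```python
import statistics

# Made-up heights in inches for two hypothetical classes.
class_a = [66, 67, 67, 68, 67]   # everyone close to the average
class_b = [55, 60, 66, 74, 80]   # same average, much wider spread

mean_a = statistics.mean(class_a)
mean_b = statistics.mean(class_b)
range_a = max(class_a) - min(class_a)
range_b = max(class_b) - min(class_b)

print(mean_a, mean_b)      # both means are 67
print(range_a, range_b)    # ranges are 2 and 25
```

The mean alone cannot tell these two classes apart, so we need a second figure that describes the spread.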

Variance
It is a measure to understand how far each data point is spread out from the mean.
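
In their standard textbook form, the population variance and the sample variance are usually written as follows. The index i and the population size N are notation assumed here for clarity.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{x})^2
```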

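Here, σ² is the population variance, s² is the sample variance, X is a data point, μ is the population mean, x̄ is the sample mean, N is the number of data points in the population and n is the number of data points in the sample. We square each difference so that negative and positive differences do not cancel each other out.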

Consider a small example. We have 6 numbers, [2, 4, 12, 9, 8, 7]. Let’s see how variance helps us here.
A.M = (2+4+12+9+8+7)/6 = 42/6 = 7, so the AM is 7.
Sum of squared differences = (2–7)²+(4–7)²+(12–7)²+(9–7)²+(8–7)²+(7–7)² = 25+9+25+4+1+0 = 64
Variance = 64/6 ≈ 10.67 if the 6 numbers are the whole population, or 64/5 = 12.8 if they are a sample.
This tells us how widely the data points are spread from the mean.
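The same calculation can be sketched in Python. The sum of squared differences for these six numbers is 64, so dividing by n = 6 gives the population variance and dividing by n - 1 = 5 gives the sample variance. The variable names are arbitrary, and the standard library functions `statistics.pvariance` and `statistics.variance` are used only as a cross-check.

```python
import statistics

data = [2, 4, 12, 9, 8, 7]

mean = statistics.mean(data)                          # 42 / 6 = 7
squared_deviations = [(x - mean) ** 2 for x in data]  # [25, 9, 25, 4, 1, 0]
total = sum(squared_deviations)                       # 64

population_variance = total / len(data)      # divide by n     -> 64 / 6
sample_variance = total / (len(data) - 1)    # divide by n - 1 -> 64 / 5

# The standard library gives the same results.
assert abs(population_variance - statistics.pvariance(data)) < 1e-9
assert abs(sample_variance - statistics.variance(data)) < 1e-9

print(round(population_variance, 2), round(sample_variance, 2))  # 10.67 12.8
```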

In simple terms, the variance tells us how far the data points are spread out around the mean. Just as we have a population mean and a sample mean, we have a population variance and a sample variance.

Now, if we take a closer look at the sample variance formula, we may have a doubt: “Why are we subtracting 1 from the total number of data points?” Before answering this, let me ask another question: “How close is the sample mean to the population mean?” This is a valid question. The sample variance measures each point’s distance from the sample mean, not from the population mean, and the points in a sample are always at least as close (in total squared distance) to their own mean as to any other value. So if the sample mean is far from the population mean, dividing by n underestimates the population variance. It may be confusing when we first think about it, so let me give an example.
Consider a set of numbers [2,10,5,9,12,7,8,4,19,23,5,15…], and say the population mean is 14 and the sample mean is 8. Why is there a difference between the two? The population mean is calculated by summing all the data points and dividing by the total number of data points. The sample mean is calculated from a random sample, which can come from anywhere in the dataset, so it can be far from the population mean, and this becomes a problem when calculating the variance. To compensate, we divide by n − 1 instead of n, which makes the estimate slightly larger. This adjustment is known as Bessel’s correction.
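The effect of dividing by n − 1 can be checked with a small simulation. This is only a sketch under stated assumptions: the population (the integers 1 to 100), the sample size of 5, the number of trials and all variable names are hypothetical choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..100.
population = list(range(1, 101))
mu = sum(population) / len(population)
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

n = 5            # size of each sample
trials = 20_000  # number of samples drawn
sum_div_n = 0.0
sum_div_n_minus_1 = 0.0

for _ in range(trials):
    sample = random.choices(population, k=n)    # draw with replacement
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)  # squared deviations from x_bar
    sum_div_n += ss / n
    sum_div_n_minus_1 += ss / (n - 1)

avg_div_n = sum_div_n / trials
avg_div_n_minus_1 = sum_div_n_minus_1 / trials

print(f"true population variance       = {true_variance:.1f}")
print(f"average when dividing by n     = {avg_div_n:.1f}")
print(f"average when dividing by n - 1 = {avg_div_n_minus_1:.1f}")
```

Averaged over many samples, the estimate that divides by n comes out clearly below the true population variance, while the one that divides by n − 1 lands much closer to it.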

There is another way to measure how widely the data points are spread out from the mean, and it is the most popular one: the “standard deviation”.

Standard Deviation

The formula is the same as that of the variance, except that we take the square root of it. But why?
If we calculate the variance of the heights of students in a class, we end up with a value in square metres (m²), which is a little strange, because height is measured in metres or feet, not in their squares. Taking the square root brings the measure back to the same unit as the data. This is one of the reasons why we often prefer standard deviation over variance.
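As a quick sketch, the standard deviation of the six numbers [2, 4, 12, 9, 8, 7] used in the variance example is just the square root of their variance. The variable names are arbitrary; `statistics.pstdev` and `statistics.stdev` are standard library functions used here as a cross-check.

```python
import math
import statistics

data = [2, 4, 12, 9, 8, 7]   # the six numbers from the variance example

population_sd = math.sqrt(statistics.pvariance(data))  # sqrt(64 / 6)
sample_sd = math.sqrt(statistics.variance(data))       # sqrt(64 / 5)

# statistics.pstdev and statistics.stdev compute the same values directly.
assert math.isclose(population_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))

print(round(population_sd, 2), round(sample_sd, 2))  # 3.27 3.58
```

Both results are in the same unit as the original numbers, unlike the variance.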

Random Variable

A random variable is a function that maps each outcome of a random action to a number. This statement might be a little confusing, so consider this example. You are watching a cricket match and want to guess what will happen on the next ball. So you decide to flip a coin: if it is “heads” you guess a boundary, and if it is “tails” you guess either a wicket or no run. Here there is a random action: the next ball could be a six, a four, a single or even a wicket, and you don’t know which. The best way to work with it is to map a number to each possible outcome and represent the action by that number.
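A minimal sketch of this idea for the coin flip: the function `X` below plays the role of the random variable. The choice of 1 for heads and 0 for tails, the function name and the number of flips are arbitrary assumptions for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# The random variable X: a function from each outcome to a number.
# The numbers 1 and 0 are an arbitrary choice for this illustration.
def X(outcome: str) -> int:
    return {"heads": 1, "tails": 0}[outcome]

flips = [random.choice(["heads", "tails"]) for _ in range(10)]  # random action
values = [X(flip) for flip in flips]                            # mapped numbers

print(flips)
print(values)
print("number of heads:", sum(values))
```

Once the outcomes are numbers, we can add them up, average them and apply the measures discussed earlier.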

To be cont…(Part-2)

Conclusion

Data science is a vast field. Nobody can be an expert in all of it, but there is always a chance to get better. To understand the patterns in the data, we need to be familiar with some crucial statistical approaches. This article has discussed the very basic statistics needed to understand the concepts in data science. In Part 2 we will discuss some more crucial statistical approaches, which are building blocks for becoming a fine machine learning engineer or a fine data scientist.