Data is a collection of facts and records which provides an understanding of “what has happened”, “when it has happened” and “what is going on”.
With the help of data science methodologies, we use the available, relevant data to estimate “what could happen in the future” and “when it is likely to happen”. To produce these results, a data scientist needs a good knowledge of statistics, which helps in understanding what the data represents. For instance, suppose you were asked to talk about global warming at a seminar. What is the first thing you would do? You would find out what global warming is, its history, and what impact it has on the planet, and then you would pull out the useful information, consolidate it, and prepare the speech. So here, global warming is the data you have, and from that data you need to extract information. If the data is subjective, we can browse the internet or read some books to gather the information. But what if the data is categorical or numerical? Categorical data is presented using string values, for example, dog breeds or cat breeds. Numerical data is presented using numbers, for example, the last two years’ temperatures. Here we need statistics to extract information from such data.
Statistics has two categories:
1. Descriptive statistics, 2. Inferential statistics.
Descriptive statistics: When we have data, we try to represent that data with some figures instead of giving all of it.
Here, we try to find the central tendency: by calculating the mean, median, or mode, we try to represent the population with a single typical value. Consider this example: you are given the task of representing the heights of the students in your class, and all the students’ heights are given to you. How are you going to represent them?
[5’10, 5’7, 6, 5’6, 5’5, 5’8, 5’9, 5’7, 5’4, 5’5, 5’6, 5’7, 5’7, 5’3, 4’9, 6’1, 5’5, 5’7, ……, 5’8]
Arithmetic mean = (5’10 + 5’7 + 6 + 5’6 + 5’5 + 5’8 + 5’9 + 5’7 + 5’4 + 5’5 + 5’6 + 5’7 + 5’7 + 5’3 + 4’9 + 6’1 + 5’5 + 5’7 + …… + 5’8) / total number of students
One way is by calculating the average of the heights and saying, “the average height of our class is 5’7”. This is what descriptive statistics is.
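Note: A measure of central tendency is often loosely called an “average”. In everyday use, “average” usually refers to the “arithmetic mean”, but the median and the mode are measures of central tendency too.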
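To make the three measures concrete, here is a minimal Python sketch. It assumes only the heights that are visible in the list above, converted by hand to inches (for example, 5’10 becomes 70). The entries hidden behind the “……” are left out, so the numbers describe this subset, not the full class. The names `heights_in` and `to_feet_inches` are made up for this illustration.

```python
import statistics

# Assumption: only the visible heights from the list above, converted to inches.
# The elided entries ("……") and the final value are not included.
heights_in = [70, 67, 72, 66, 65, 68, 69, 67, 64, 65,
              66, 67, 67, 63, 57, 73, 65, 67]

def to_feet_inches(inches):
    """Format a height in inches as feet'inches, rounded to the nearest inch."""
    feet, remainder = divmod(round(inches), 12)
    return f"{feet}'{remainder}"

# Three common measures of central tendency
mean_height = statistics.mean(heights_in)      # sum of values / number of values
median_height = statistics.median(heights_in)  # middle value of the sorted data
mode_height = statistics.mode(heights_in)      # most frequently occurring value

print("Mean:  ", to_feet_inches(mean_height))
print("Median:", to_feet_inches(median_height))
print("Mode:  ", to_feet_inches(mode_height))
```

Working in inches avoids adding feet-and-inches strings directly. The result is converted back to the familiar format only for display.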
Inferential statistics: When we have data, we take a sample from it and try to infer useful information that represents the whole data. In simple terms, we take sample data from the whole data and relate the inferences drawn from the sample to the whole data. (In statistics, the whole data is called the population.)
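The idea of relating a sample to the whole data can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the “population” is simulated with made-up numbers (100,000 heights drawn from a normal distribution with a mean of 67 inches and a standard deviation of 3 inches), and the sample size of 500 is arbitrary. It is not real data.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated heights in inches (made-up, not real data)
population = [rng.gauss(67, 3) for _ in range(100_000)]

# Simple random sample drawn without replacement from the population
sample = rng.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f} inches")
print(f"Sample mean:     {sample_mean:.2f} inches")
print(f"Difference:      {abs(population_mean - sample_mean):.2f} inches")
```

In practice we usually cannot compute the population mean at all, which is exactly why we rely on the sample. Here the population is simulated only so that the two values can be compared side by side.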
Before we discuss inferential statistics, let me introduce two terms: “Sample” and “Population”. Consider the same example of heights, but now we need to represent the heights of people in a particular state. The population of a state would be huge, so how could we represent this data?