Statistics for Machine Learning Data Analysis

I covered the mean, median, and variance last time. Wake me up when you're teaching hypothesis testing.

Statistics is very important in data analysis for machine learning. It is our first point of contact with the data, and it lets us ask questions ranging from basic to advanced.

Example questions:

In this article, I'm mainly focusing on the initial step of any machine learning project. It goes by many names, such as data analysis or exploratory data analysis. The same process has different names because practitioners come from different backgrounds: some from pure statistics, some from physics, and some from electronics. The beauty of machine learning is that many of its core ideas are taken from different fields. For instance, the ROC metric was developed during the Second World War, and we now use it to measure model performance.

If you know the data type of a feature, analyzing that feature becomes easier. For example, once you know that a variable is categorical, you have a ready-made toolkit for analyzing it. In simple terms, the data type tells us which kind of analysis to perform.

For example, if a feature is categorical, we use a bar plot to see the frequency of each value. It shows which values the feature takes and how they are distributed.
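As a minimal sketch of the counting behind a bar plot, the snippet below tallies how often each category appears. The `weather` values are made up for illustration and are not taken from any dataset in this article.

```python
from collections import Counter

# Hypothetical values of a categorical "weather" feature (made up for illustration).
weather = ["sunny", "rainy", "sunny", "cloudy", "sunny", "rainy"]

# Count how often each category occurs; these counts are the bar heights.
frequencies = Counter(weather)

# Print a crude text bar plot, most frequent category first.
for category, count in frequencies.most_common():
    print(f"{category:<8}{'#' * count} ({count})")
```

A plotting library would draw the same counts as bars; the counting step is the part that matters for the analysis.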

As another example, if the feature is continuous, a bar plot is not useful: a continuous feature can take infinitely many values, and plotting each one as its own bar gives no insight. Instead of a bar plot, we use a histogram or a probability density function, which tells us more about the feature.

I hope you are now convinced that knowing the data type of a feature matters. Let's now look at the different data types in data analysis.
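To make the histogram idea concrete, here is a small sketch that groups values into fixed-width bins and counts how many fall into each. The heights, the starting point, and the bin width are all assumptions chosen for illustration.

```python
# Hypothetical heights in inches (made up for illustration).
heights = [61.2, 64.8, 66.1, 67.5, 68.0, 69.3, 70.4, 72.9]

start = 60.0      # assumed left edge of the first bin
bin_width = 4.0   # assumed bin width

# Assign each value to the bin [left, left + bin_width) that contains it.
counts = {}
for h in heights:
    k = int((h - start) // bin_width)
    left = start + k * bin_width
    counts[left] = counts.get(left, 0) + 1

# Print a crude text histogram, one row per bin.
for left in sorted(counts):
    print(f"[{left:.0f}, {left + bin_width:.0f}): {'#' * counts[left]}")
```

Every value would get its own bar in a bar plot, while the bins summarize the shape of the distribution.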

Continuous:

This feature can take infinitely many possible values. Examples of continuous features are a person's height in inches and the weight of an item.

Categorical or discrete:

This feature takes only a few values. Each data point contains exactly one of these values. For example, the Weather feature can take one of the values below for each data point.

Categorical data comes in two types: nominal and ordinal.

Nominal means that there is no order among the values of the feature; they all have equal standing. Put simply, no weather event is greater than any other.

Ordinal means that the feature values have a meaningful order.
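One way to see the difference in practice is the rough sketch below. The `sizes` and `colors` features and their values are hypothetical; the point is only that sorting makes sense for an ordinal feature and not for a nominal one.

```python
# Hypothetical ordinal feature: the order small < medium < large is meaningful.
size_rank = {"small": 0, "medium": 1, "large": 2}
sizes = ["large", "small", "medium", "small"]

# Sorting by rank is a valid operation for ordinal data.
sorted_sizes = sorted(sizes, key=size_rank.get)
print(sorted_sizes)

# Hypothetical nominal feature: there is no order, so we only compare for
# equality or count distinct values.
colors = ["red", "blue", "red"]
distinct_colors = set(colors)
print(len(distinct_colors))
```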

Binary values:

A few features can take only two possible values, such as true or false, 0 or 1, and yes or no.

For example:

Sample Questions:

If I give you a bunch of values and ask for a single value that describes that feature, what will you do?

Can you give me one number that describes the salaries across different cities in the USA?

Here are the possible single-number descriptions:

Minimum:

You can give the minimum value, but it does not tell us much about the data. It describes only one instance of the feature, not the population.

You can say that, among all the cities, Miami has the minimum salary, but that says nothing about the whole population. It is only about Miami.

Maximum:

You can give the maximum value, but it does not tell us much about the data either. It also describes only one instance of the feature, not the population.

You can say that, among all the cities, San Francisco has the maximum salary, but that says nothing about the whole population. It is only about San Francisco, not about all the cities.

Maximum - Minimum (Range):

This value tells us about the spread of the data.

If this value is low, there is not much variability in the data.

If this value is high, the data may be widely spread, although a single extreme value can also produce a large range.
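A quick sketch of the range on two made-up salary lists (in thousands of dollars). Both lists are hypothetical and only meant to contrast a small range with a large one.

```python
# Hypothetical salaries in thousands of dollars (made up for illustration).
tight = [92, 95, 96, 98, 99]
spread_out = [40, 70, 96, 120, 154]

def value_range(values):
    """Return the maximum minus the minimum of the values."""
    return max(values) - min(values)

print(value_range(tight))       # 7
print(value_range(spread_out))  # 114
```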

Average value (Mean):

This value tells us that, on average, the salary in the United States is $96k, and it applies to the whole population.

To define the mean mathematically:

The mean is the sum of all values divided by the number of values.

You can think of the mean as a balance point: if you place a fulcrum at the mean, the points balance, as shown in the picture.
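The balance-point idea can be checked numerically: the deviations of the values from their mean always add up to zero. The sketch below uses made-up numbers and Python's standard `statistics` module.

```python
from statistics import mean

# Hypothetical values (made up for illustration).
values = [10, 20, 30, 40, 100]

# The mean: sum of all values divided by the number of values.
m = mean(values)
print(m)  # 40

# Deviations to the left of the mean are negative, to the right positive.
deviations = [v - m for v in values]

# The two sides cancel out, which is why the mean acts as a balance point.
total_deviation = sum(deviations)
print(total_deviation)  # 0
```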

Median value:

We can also give the median of all the salaries. The median is simply the middle value after sorting the values.

Examples:

The first step is to sort the values in ascending order. The sorted order is 10, 20, 30, and the middle value is our median, so here the median is 20.

Again, the first step is to sort the numbers; after sorting, the order is 10, 20, 30, 40. The previous example had an odd number of data points, so picking the middle number worked. Here we have an even number of data points, so we take the two middle numbers and find their mean; that is the median for this dataset.

Sorted order: 10, 20, 30, 40

Median = mean(20, 30) = (20 + 30) / 2 = 50 / 2 = 25
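The two cases above (odd and even number of data points) can be sketched as a small function. The unsorted input order in the calls is an assumption; only the sorted values 10, 20, 30 and 10, 20, 30, 40 come from the examples.

```python
def median(values):
    """Return the middle value of the sorted data.

    For an even number of values, return the mean of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([30, 10, 20]))      # 20
print(median([40, 10, 30, 20]))  # 25.0
```

Python's built-in `statistics.median` does the same job; the hand-written version is only here to show the steps.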

Important note:

The median is more robust than the mean because the mean is sensitive to outliers and the median is not. Let's explore this with an example.

For example, we have 6 observations with the following values:

100, 123, 134, 15000, 121, 135.

The mean is

(100 + 123 + 134 + 15000 + 121 + 135) / 6 = 15613 / 6 ≈ 2602.17

The mean is about 2602, but if you look at the data points, 15000 is likely an outlier, and that extreme value is included in the calculation of the mean. Because of this, our mean is very high.

Now, calculate the median for our dataset.

After sorting, the values are 100, 121, 123, 134, 135, 15000, and the median is the mean of the two middle values: (123 + 134) / 2 = 128.5. The median is not pulled toward the extreme value, so it stays close to most of the data points.

Of all the measures above, only the mean and the median take every data point into account. But note that the mean is sensitive to outliers and the median is not.

Variability is a measure of whether the data values are tightly clustered or spread out. Much of data analysis revolves around variability.
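Here is a short check of this example using Python's standard `statistics` module. The last two lines go one step beyond the text: they drop the suspected outlier to show how far each measure moves.

```python
from statistics import mean, median

# The six observations from the example above.
observations = [100, 123, 134, 15000, 121, 135]

obs_mean = mean(observations)
obs_median = median(observations)
print(round(obs_mean, 2))  # 2602.17
print(obs_median)          # 128.5

# Drop the suspected outlier and recompute both measures.
without_outlier = [x for x in observations if x != 15000]
print(mean(without_outlier))    # 122.6
print(median(without_outlier))  # 123
```

The mean moves by more than 2000 when a single value is removed, while the median moves by only a few units.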

Examples:

In the picture above, we are given the task of classifying records as good or bad using two features, feature1 and feature2. We can clearly separate the data points with a line. If you look closely, even though we have two features, there is no variability in feature1: its value is the same for all records, so even if we remove that feature we can still classify the data points.

After removing feature1, we can still classify the data perfectly.

By removing feature1 and rotating our data, we can still classify perfectly. All we did here was remove feature1, which has no variability.
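The same idea can be sketched in code. The records, the threshold of 5.0, and the helper names below are all hypothetical; the original picture is not reproduced here, so this only mimics the situation it describes.

```python
# Hypothetical records: (feature1, feature2, label). feature1 never changes.
records = [
    (5.0, 1.0, "good"),
    (5.0, 2.0, "good"),
    (5.0, 8.0, "bad"),
    (5.0, 9.0, "bad"),
]

def constant_feature_indices(rows, n_features):
    """Return indices of features whose value is identical in every row."""
    return [j for j in range(n_features) if len({row[j] for row in rows}) == 1]

to_drop = constant_feature_indices(records, n_features=2)
print(to_drop)  # [0], i.e. feature1

# Keep only the features that vary, together with the label.
reduced = [
    ([value for j, value in enumerate(row[:2]) if j not in to_drop], row[2])
    for row in records
]

def classify(features, threshold=5.0):
    """Assumed rule for this toy data: small feature2 means 'good'."""
    return "good" if features[0] < threshold else "bad"

accuracy = sum(classify(f) == label for f, label in reduced) / len(reduced)
print(accuracy)  # 1.0
```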

Variance:

The variance is the sum of squared deviations from the mean divided by n, where n is the number of data values.

Variance is calculated from how much each value deviates from the mean. The formula is self-explanatory; one can easily understand variance by looking at it.

Variance formula: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$, where $\mu$ is the mean of the values $x_1, \dots, x_n$.

The variance here is 0 because all numbers are the same and there is no variability in the data.
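A minimal sketch of the definition above (sum of squared deviations from the mean, divided by n). Both input lists are made up for illustration; the first one mimics the all-values-equal case.

```python
from statistics import pvariance

def variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

# All values are the same, so there is no variability.
print(variance([4, 4, 4, 4]))  # 0.0

# Hypothetical values with some spread.
print(variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0

# The standard library computes the same quantity.
print(pvariance([2, 4, 4, 4, 5, 5, 7, 9]))
```

Note that `statistics.variance` (without the `p`) divides by n - 1 instead of n, so it gives a slightly different number.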

Standard deviation:

It is the square root of the variance. “Standard” here means “standardized”: the standard deviation is in the same units as the mean, unlike the variance, which is in squared units.
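As a quick sketch, the standard deviation is just one more step after the variance. The values are made up for illustration.

```python
import math
from statistics import pstdev

# Hypothetical values (made up for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance: mean of squared deviations from the mean.
mu = sum(values) / len(values)
var = sum((x - mu) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance, in the same units as the data.
std = math.sqrt(var)
print(var)  # 4.0
print(std)  # 2.0

# The standard library's population standard deviation for comparison.
print(pstdev(values))
```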

Median absolute deviation:

We saw earlier that the mean is sensitive to outliers. The variance depends on the mean, so both the variance and the standard deviation are affected by outliers.

With the variance, we are essentially calculating how far the data points are dispersed from the mean. If there are outliers in the dataset, both the mean and the variance are affected by them. To avoid this, we have the median absolute deviation.

Note: We have already seen that the median works well even when there are outliers in the data.

Thanks for reading. I hope you enjoyed it. The next article will be on the interquartile range (IQR) and box plots.
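The article does not spell out the formula, so the sketch below assumes the usual definition: take the median of the data, compute each value's absolute distance from it, and take the median of those distances. It reuses the six observations from the outlier example.

```python
from statistics import median, pstdev

# The six observations from the outlier example.
observations = [100, 123, 134, 15000, 121, 135]

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (assumed definition)."""
    med = median(values)
    return median(abs(x - med) for x in values)

mad = median_absolute_deviation(observations)
print(mad)  # 7.0

# The standard deviation of the same data is dominated by the outlier.
print(pstdev(observations))
```

The median absolute deviation stays in the range of the typical gaps between values, while the standard deviation is in the thousands because of the single value 15000.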
