Statistical analysis of data helps us make sense of the information as a whole. This has applications in a lot of fields like biostatistics and business analytics.
Instead of going through individual data points, just one look at their collective mean value or variance can reveal trends and features that we might have missed by observing all the data in raw format. It also makes the comparison between two large data sets way easier and more meaningful.
Keeping these needs in mind, Python has provided us with the statistics module.
In this tutorial, you will learn about different ways of calculating averages and measuring the spread of a given set of data. Unless stated otherwise, all the functions in this module support int, float, Decimal, and Fraction based data sets as input.
Calculating the Mean
You can use the mean(data) function to calculate the mean of some given data. It is calculated by dividing the sum of all data points by the number of data points. If the data is empty, a StatisticsError will be raised. Here are a few examples:
    import statistics
    from fractions import Fraction as F
    from decimal import Decimal as D

    statistics.mean([11, 2, 13, 14, 44])
    # returns 16.8

    statistics.mean([F(8, 10), F(11, 20), F(2, 5), F(28, 5)])
    # returns Fraction(147, 80)

    statistics.mean([D("1.5"), D("5.75"), D("10.625"), D("2.375")])
    # returns Decimal('5.0625')
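The empty-data case mentioned above is easy to trip over when the input comes from a filter or a file. Here is a minimal sketch of one way to handle it. The helper name safe_mean and its default parameter are hypothetical and not part of the statistics module.

```python
import statistics


def safe_mean(values, default=None):
    """Return the mean of values, or default when values is empty.

    safe_mean is a hypothetical helper written for this example.
    """
    try:
        return statistics.mean(values)
    except statistics.StatisticsError:
        # mean() raises StatisticsError when there are no data points
        return default


print(safe_mean([11, 2, 13, 14, 44]))  # 16.8
print(safe_mean([]))                   # None
```

Whether to return a default or let the exception propagate depends on the calling code. Silently returning None can hide a real problem with the input.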
You learned about a lot of functions to generate random numbers in our last tutorial. Let's use them now to generate our data and see if the final mean is equal to what we expect it to be.
    import random
    import statistics

    data_points = [ random.randint(1, 100) for x in range(1,1001) ]
    statistics.mean(data_points)
    # returns 50.618

    data_points = [ random.triangular(1, 100, 80) for x in range(1,1001) ]
    statistics.mean(data_points)
    # returns 59.93292281437689
With the randint() function, the mean is expected to be close to the midpoint of the two extremes, and with the triangular distribution, it is supposed to be close to (low + high + mode) / 3. Therefore, the mean in the first and second case should be about 50.5 and 60.33 respectively, which is close to what we actually got.
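To make the comparison explicit, the following sketch computes the theoretical means next to the sample means. The seed value and the sample size are arbitrary choices made for this example so that repeated runs give the same numbers. They are not part of the original code.

```python
import random
import statistics

random.seed(42)  # arbitrary seed, assumed here only for repeatable output

low, high, mode = 1, 100, 80
n = 20_000  # larger than in the text so the sample means settle down

uniform_ints = [random.randint(low, high) for _ in range(n)]
triangular = [random.triangular(low, high, mode) for _ in range(n)]

# Theoretical means of the two distributions
expected_uniform = (low + high) / 2            # 50.5
expected_triangular = (low + high + mode) / 3  # about 60.33

print(statistics.mean(uniform_ints), expected_uniform)
print(statistics.mean(triangular), expected_triangular)
```

With more data points, the sample means should drift closer to the theoretical values, although any single run can still be off by a small amount.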
Calculating the Mode
The mean is a good indicator of the average, but a few extreme values can pull it far from the actual central location. In some cases it is more desirable to determine the most frequent data point in a data set. The mode() function will return the most common data point from discrete numerical as well as non-numerical data. This is the only statistical function that can be used with non-numeric data.
    import random
    import statistics

    data_points = [ random.randint(1, 100) for x in range(1,1001) ]
    statistics.mode(data_points)
    # returns 94

    data_points = [ random.randint(1, 100) for x in range(1,1001) ]
    statistics.mode(data_points)
    # returns 49

    data_points = [ random.randint(1, 100) for x in range(1,1001) ]
    statistics.mode(data_points)
    # returns 32

    statistics.mode(["cat", "dog", "dog", "cat", "monkey", "monkey", "dog"])
    # returns 'dog'
The mode of randomly generated integers in a given range can be any of those numbers as the frequency of occurrence of each number is unpredictable. The three examples in the above code snippet prove that point. The last example shows us how we can calculate the mode of non-numeric data.
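To see why a particular value comes out as the mode, it can help to look at the full tally. This sketch uses collections.Counter from the standard library alongside mode(). The animals list is the same one used above.

```python
import statistics
from collections import Counter

animals = ["cat", "dog", "dog", "cat", "monkey", "monkey", "dog"]

# Counter tallies how often each value occurs
counts = Counter(animals)

print(counts.most_common())       # [('dog', 3), ('cat', 2), ('monkey', 2)]
print(statistics.mode(animals))   # dog
```

Be careful with ties. On Python 3.8 and later, mode() returns the first of the tied values it encounters, while earlier versions raise a StatisticsError when there is no single most common value.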
Calculating the Median
Relying on the mode to calculate a central value can be a bit misleading. As we just saw in the previous section, it will always be the most popular data point, irrespective of all other values in the data set. Another way of determining a central location is by using the median() function, which returns the median value of the given numeric data. If the number of data points is odd, it returns the middle point. If the number of data points is even, it returns the average of the two middle points.
The problem with the median() function is that the final value may not be an actual data point when the number of data points is even. In such cases, you can use either median_low() or median_high() to calculate the median. With an even number of data points, these functions will return the smaller and larger value of the two middle points respectively.
    import random
    import statistics

    data_points = [ random.randint(1, 100) for x in range(1,50) ]
    statistics.median(data_points)
    # returns 53

    data_points = [ random.randint(1, 100) for x in range(1,51) ]
    statistics.median(data_points)
    # returns 51.0

    data_points = [ random.randint(1, 100) for x in range(1,51) ]
    statistics.median(data_points)
    # returns 49.0

    data_points = [ random.randint(1, 100) for x in range(1,51) ]
    statistics.median_low(data_points)
    # returns 50

    statistics.median_high(data_points)
    # returns 52

    statistics.median(data_points)
    # returns 51.0
In the last case, the low and high median were 50 and 52. This means that there was no data point with value 51 in our data set, but the median() function still calculated the median to be 51.0.
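Because the random examples give different results on every run, here is a small sketch with a fixed list that you can check by hand. The scores list is made up for this example. It also shows how little a single extreme value affects the median compared to the mean.

```python
import statistics

scores = [52, 47, 50, 61, 38, 55]  # made-up data: even count, unsorted

# Sorted order is 38, 47, 50, 52, 55, 61, so the middle points are 50 and 52
print(statistics.median(scores))       # 51.0
print(statistics.median_low(scores))   # 50
print(statistics.median_high(scores))  # 52

# Add one extreme value and compare how the mean and the median react
with_outlier = scores + [1000]
print(statistics.mean(scores), statistics.mean(with_outlier))
print(statistics.median(scores), statistics.median(with_outlier))
```

The mean jumps from 50.5 to more than 186 after adding the outlier, while the median only moves from 51.0 to 52.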
Measuring the Spread of Data
Determining how much the data points deviate from the typical or average value of the data set is just as important as calculating the central or average value itself. The statistics module has four different functions to help us calculate this spread of data.
You can use the pvariance(data, mu=None) function to calculate the population variance of a given data set.
The second argument in this case is optional. The value of mu, when provided, should be equal to the mean of the given data. The mean is calculated automatically if the value is missing. This function is helpful when you want to calculate the variance of an entire population. If your data is only a sample of the population, you can use the variance(data, xbar=None) function to calculate the sample variance. Here, xbar is the mean of the given sample and is calculated automatically if not provided.
To calculate the population standard deviation and sample standard deviation, you can use the pstdev(data, mu=None) and stdev(data, xbar=None) functions respectively.
    import statistics
    from fractions import Fraction as F

    data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

    statistics.pvariance(data)     # returns 6.666666666666667
    statistics.pstdev(data)        # returns 2.581988897471611
    statistics.variance(data)      # returns 7.5
    statistics.stdev(data)         # returns 2.7386127875258306

    more_data = [3, 4, 5, 5, 5, 5, 5, 6, 6]
    statistics.pvariance(more_data)   # returns 0.7654320987654322
    statistics.pstdev(more_data)      # returns 0.8748897637790901

    some_fractions = [F(5, 6), F(2, 3), F(11, 12)]
    statistics.variance(some_fractions)
    # returns Fraction(7, 432)
As evident from the above example, smaller variance implies that more data points are closer in value to the mean. You can also calculate the standard deviation of decimals and fractions.
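If you are curious how the four functions relate to each other, the sketch below recomputes them by hand for the first data set, using the textbook definitions: divide the sum of squared deviations by n for the population variance and by n - 1 for the sample variance. The variable names are made up for this example, and the module's own implementation may differ internally.

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

mu = statistics.mean(data)  # 5
squared_deviations = sum((x - mu) ** 2 for x in data)  # 60

population_variance = squared_deviations / len(data)    # divide by n
sample_variance = squared_deviations / (len(data) - 1)  # divide by n - 1

# The hand-computed values agree with the module's functions
print(math.isclose(population_variance, statistics.pvariance(data, mu)))
print(math.isclose(sample_variance, statistics.variance(data, mu)))

# Standard deviation is the square root of the matching variance
print(math.isclose(math.sqrt(population_variance), statistics.pstdev(data)))
print(math.isclose(math.sqrt(sample_variance), statistics.stdev(data)))
```

All four lines should print True. Passing the precomputed mean as the second argument, as done here, simply saves the function from calculating it again.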
Final Thoughts
In this last tutorial of the series, we learned about different functions available in the statistics module. You might have observed that the data given to the functions was sorted in most cases, but it doesn't have to be. I have used sorted lists in this tutorial because they make it easier to understand how the value returned by different functions is related to the input data.