Center and Spread Measurement – Data Mining Fundamentals Part 22

Center and spread measurement is the next topic in our discussion of data exploration and visualization. We discuss measures of center, such as the median and mean, and look at measures of spread, such as range and variance.

The other measures we care about are measures of center.
Median versus mean is an age-old debate
on the internet
about whether the median or the mean
is the better way to measure the center of a data set.
And as is often the case with age-old
debates on the internet, the answer is both.
Means are easy to calculate, but very sensitive to outliers.
Means can also give you a real sense of the skew
if you have skewed data, because a long tail pulls the mean toward it.
On the other hand, the median
is the number such that 50% of values are below it
and 50% of values are above it.
The median is the 50th percentile value.
There’s also something called a trimmed mean, which I want
to talk about a great deal.
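
To make the three measures of center concrete, here is a minimal sketch using Python's standard `statistics` module. The sample data and the `trimmed_mean` helper are made up for illustration, and cutting 10% from each end is just one common choice, not something the talk specifies.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of the sorted values.

    Hypothetical helper for illustration; the 10% default is an assumption.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of points to drop from each end
    trimmed = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(trimmed)

# Made-up sample: nine typical values and one extreme outlier.
data = [21, 22, 23, 24, 25, 26, 27, 28, 29, 500]

mean_value = statistics.mean(data)              # pulled upward by the outlier
median_value = statistics.median(data)          # stays in the middle of the bulk
trimmed_value = trimmed_mean(data, proportion=0.1)  # drops 21 and 500 first

print("mean:", mean_value)
print("median:", median_value)
print("10% trimmed mean:", trimmed_value)
```

In this sample the single outlier drags the mean far above the bulk of the values, while the median and the trimmed mean both stay close to where most of the data sits.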
Medians tell you exactly where your center is.
If you really want to know
the exact middle of your data, such that 50% of people
are below it and 50% are above it, the median is great.
It’s basically immune to outliers.
But it’s harder to calculate in some ways, since you have to sort the data,
and it doesn’t tell you anything about the skew of your data.
If you do have a really long tail,
the mean will let you know about that in particular.
It’s the difference between the median and the mean
that we often care about, because that’s what tells us
how our data is skewed.
We want both numbers.
One is not necessarily better than the other.
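
The point about the gap between mean and median can be sketched in a few lines. This is only an illustration with made-up samples; `mean_median_gap` is a hypothetical helper, and the sign of the gap is a rough indicator of skew direction rather than a formal skewness statistic.

```python
import statistics

def mean_median_gap(values):
    """Mean minus median: positive suggests a long right tail, negative a long left tail."""
    return statistics.mean(values) - statistics.median(values)

# Made-up samples for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20, 40]  # a few large values form a long tail
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # evenly spread around the middle

gap_skewed = mean_median_gap(right_skewed)
gap_symmetric = mean_median_gap(symmetric)

print("right-skewed gap:", gap_skewed)
print("symmetric gap:", gap_symmetric)
```

For the symmetric sample the two measures agree, so the gap is zero. For the right-skewed sample the mean sits well above the median, which is exactly the signal described above.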
The last summary statistics that we tend to care about
are measures of spread: range and variance.
Variance and standard deviation
are the most common measures of the spread of a set of points.
They tell us very directly how different the points are from one another.
Range is the difference between maximum and minimum,
which is definitely something we might care about.
But range, variance, and standard deviation
are all very sensitive to outliers, so there
are other measures that we use.
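
As a rough sketch of how sensitive these three measures are, the snippet below compares a made-up sample with and without one outlier. It uses the population versions (`pvariance`, `pstdev`) from Python's `statistics` module; that choice is an assumption, since the talk does not say whether population or sample variance is meant. `value_range` is a hypothetical helper.

```python
import statistics

def value_range(values):
    """Range: difference between the maximum and the minimum."""
    return max(values) - min(values)

# Made-up samples: the second is the first plus a single outlier.
clean = [10, 12, 11, 13, 12, 11, 12]
with_outlier = clean + [100]

for label, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(
        label,
        "| range:", value_range(sample),
        "| variance:", round(statistics.pvariance(sample), 3),
        "| std dev:", round(statistics.pstdev(sample), 3),
    )
```

One extra point is enough to inflate all three numbers dramatically, which is why the more robust measures are worth having.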
So we use interquartile range, which
is the difference between the 75th percentile value
and the 25th percentile value in a set of data.
And we’ll sometimes use the median absolute deviation,
which is the median of the absolute deviations from the median.
And sometimes, we’ll use the average absolute deviation too,
which is the mean of the absolute deviations from the mean.
So all of these show up as we’re trying to calculate
summary statistics.
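
Here is a minimal sketch of the three alternative spread measures, again on made-up data. The helper names are hypothetical. The quartiles come from `statistics.quantiles` (Python 3.8+) with `method="inclusive"`, which is one of several quartile conventions and is assumed here only for illustration.

```python
import statistics

def interquartile_range(values):
    """75th percentile minus 25th percentile (inclusive quartile convention assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])

def average_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    center = statistics.mean(values)
    return statistics.mean([abs(x - center) for x in values])

# Made-up samples: the second is the first plus a single outlier.
clean = [10, 12, 11, 13, 12, 11, 12]
with_outlier = clean + [100]

for label, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(
        label,
        "| IQR:", interquartile_range(sample),
        "| median abs dev:", median_absolute_deviation(sample),
        "| average abs dev:", round(average_absolute_deviation(sample), 3),
    )
```

On this sample the interquartile range and the median absolute deviation barely move when the outlier is added. The average absolute deviation is built on the mean, so it still shifts noticeably.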

Part 23:
Histograms & Box Plots

Part 21:
Data Visualization & Exploration

Complete Series:
Data Mining Fundamentals

About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
