# Center and Spread Measurement – Data Mining Fundamentals Part 22

January 6, 2017 10:00 pm

Center and spread measurement is the next topic in our discussion of data exploration and visualization. We discuss measures of center, such as the median and mean, and look at measures of spread, such as range and variance.

Other measures we care about are measures of center. Median versus mean is an age-old debate on the internet about whether the median or the mean is the better way to measure the center of a dataset. And as is often the case with age-old debates on the internet, the answer is both.

Means are easy to calculate, but very sensitive to outliers. Means can also give you a real sense of the skew if you have skewed data, because a long tail pulls the mean toward it.

On the other hand, the median is the number such that 50% of values are below it and 50% of values are above it. The median is the 50th percentile value. There’s also something called a trimmed mean, which I want to talk about a great deal.
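As a rough illustration of these definitions, here is a minimal Python sketch. The helper names `median_by_sorting` and `trimmed_mean` are hypothetical, the data is made up, and the trimmed mean shown uses one common definition (drop a fixed proportion of points from each end, then average), which is an assumption since the talk does not define it.

```python
def median_by_sorting(values):
    """Return the middle value of the sorted data (the 50th percentile)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def trimmed_mean(values, proportion=0.1):
    """Hypothetical helper: drop `proportion` of the points from each end,
    then take the ordinary mean of what is left."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("trimmed mean of empty data")
    k = int(len(ordered) * proportion)  # points to drop per side
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Made-up data with one extreme value.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 100]
plain_mean = sum(data) / len(data)
middle = median_by_sorting(data)
trimmed = trimmed_mean(data, proportion=0.1)
print(plain_mean, middle, trimmed)
```

With this data the trimmed mean lands close to the median, because the single extreme value is dropped before averaging.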

So medians tell you exactly where your center is. If you really want to know the exact middle of your data, such that 50% of values are below it and 50% are above it, the median is great. It’s basically immune to outliers. But it’s harder to calculate in some ways, since you have to order the values first, and it doesn’t tell you anything about the skew of your data.
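To make the outlier point concrete, here is a small sketch using Python's standard `statistics` module. The numbers are invented for illustration only.

```python
from statistics import mean, median

# Made-up values; the last list adds a single extreme point.
values = [40, 42, 45, 48, 50]
values_with_outlier = values + [1000]

mean_before = mean(values)
median_before = median(values)
mean_after = mean(values_with_outlier)
median_after = median(values_with_outlier)

# The mean jumps a long way; the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

One extreme value drags the mean from 45 to just over 204, while the median only shifts from 45 to 46.5.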

If you do have a really long tail, the mean will let you know about that in particular. It’s the difference between the median and the mean that we often care about, because that’s what tells us how our data is skewed. We want both numbers. One is not necessarily better than the other.
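A quick way to see this is to compute the gap between the two numbers. The sketch below uses a hypothetical helper, `mean_median_gap`, on made-up data. It is a rough heuristic for the direction of the tail, not a formal skewness statistic.

```python
from statistics import mean, median


def mean_median_gap(values):
    """Return mean minus median.

    Positive suggests a long right tail, negative a long left tail,
    and a value near zero suggests roughly symmetric data.
    """
    return mean(values) - median(values)


# Made-up datasets for illustration.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
symmetric = [1, 2, 3, 4, 5]
left_tailed = [-20, -4, -3, -3, -3, -2, -2, -1]

for name, data in [("right", right_tailed),
                   ("symmetric", symmetric),
                   ("left", left_tailed)]:
    print(name, mean_median_gap(data))
```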

The last summary statistics that we tend to care about are measures of spread, such as range and variance. Variance and standard deviation are the most common measures of the spread of a set of points. They tell us how far the points typically sit from the mean, so they measure the spread of our data very directly.
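For reference, here is a minimal sketch of how variance and standard deviation can be computed, both by hand and with Python's `statistics` module. It assumes the population versions (dividing by the number of points); the sample versions, `statistics.variance` and `statistics.stdev`, divide by one less than that. The data is made up.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

# Variance: the average squared distance from the mean.
mu = mean(data)
variance_by_hand = sum((x - mu) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance,
# which puts the spread back in the data's own units.
std_by_hand = sqrt(variance_by_hand)

print(variance_by_hand, pvariance(data))
print(std_by_hand, pstdev(data))
```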

Range is the difference between the maximum and the minimum, which is definitely something we might care about. But range, variance, and standard deviation are all very sensitive to outliers, so there are other measures that we use. One is the interquartile range, which is the difference between the 75th percentile value and the 25th percentile value in a set of data.
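The contrast between range and interquartile range can be sketched as follows. The helper names are hypothetical and the data is made up. Note that there are several conventions for computing quartiles; this sketch picks `method="inclusive"` in `statistics.quantiles` as an assumption, and other conventions can give slightly different cut points.

```python
from statistics import quantiles


def value_range(values):
    """Difference between the maximum and the minimum."""
    return max(values) - min(values)


def interquartile_range(values):
    """75th percentile minus 25th percentile.

    Uses the "inclusive" quartile convention, which is only one of
    several in common use.
    """
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(value_range(clean), value_range(with_outlier))
print(interquartile_range(clean), interquartile_range(with_outlier))
```

Swapping the largest value for an extreme one blows up the range, while the interquartile range is unchanged because the middle half of the data did not move.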

And we’ll sometimes use the median absolute deviation, which is the median of the absolute deviations of each point from the median. And sometimes we’ll use the average absolute deviation too, which is the mean of the absolute deviations of each point from the mean. All of these show up as we’re trying to calculate summary statistics.
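Both absolute-deviation measures are short to write down. The sketch below uses hypothetical helper names and made-up data, and it assumes the usual centers: the median for the median absolute deviation and the mean for the average absolute deviation.

```python
from statistics import mean, median


def median_absolute_deviation(values):
    """Median of the absolute distances from the median."""
    center = median(values)
    return median(abs(x - center) for x in values)


def average_absolute_deviation(values):
    """Mean of the absolute distances from the mean."""
    center = mean(values)
    return mean(abs(x - center) for x in values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mad = median_absolute_deviation(data)
aad = average_absolute_deviation(data)
print(mad, aad)
```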

**Part 23**:

Histograms & Box Plots

**Part 21**:

Data Visualization & Exploration

**Complete Series**:

Data Mining Fundamentals
