# Data Visualization & Exploration – Data Mining Fundamentals Part 21

January 6, 2017 9:00 pm

Data exploration is visualization and calculation to better understand characteristics of data. We will tell you the key motivations of data exploration as well as the techniques used in data exploration.

So I’m going to go through and talk a bit about the kinds of summary statistics we like to use: frequency, counts, mean, and standard deviation.

So summary statistics are numbers that summarize properties of the data, exactly what they sound like. Most can be calculated pretty quickly in a single pass through the data, which is very nice.

Most of them can be calculated in just about any language you care to use, whether you’re doing it in SQL, or R, or Python, or anything else. Summary statistics are pretty easy to calculate.
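To make the single-pass idea concrete, here is a minimal Python sketch that computes count, mean, and standard deviation in one loop over the data. The update rule is Welford's method, which is one common choice and not something named in the talk; the function name `summarize` and the sample heights are made up for illustration.

```python
import math


def summarize(values):
    """Return (count, mean, sample standard deviation) from one pass over values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Sample standard deviation (divides by n - 1); defined as 0.0 for fewer than 2 values.
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


# Hypothetical sample: heights in meters.
heights = [1.60, 1.72, 1.85, 1.68, 1.75]
count, mean, std = summarize(heights)
print(count, round(mean, 2), round(std, 4))
```

The loop never looks at a value twice, so the same approach works on data streamed from a file or a database cursor.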

So, for categorical data, our two most common summary statistics are frequency and mode. The frequency of an attribute value is the percentage of the time that value occurs in the data set.

So for example, if the attribute is gender, then the value female will occur a bit less than 50% of the time. The value male will occur a bit less than 50% of the time. And something else will occur some small percentage of the time. So we can think of those numbers as being percentages.
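As a rough illustration of frequency as a percentage, the sketch below counts how often each value occurs and divides by the total. The helper name `frequencies` and the sample records are assumptions made up for this example, not data from the talk.

```python
from collections import Counter


def frequencies(values):
    """Map each distinct value to the percentage of records that have it."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}


# Hypothetical sample of a categorical attribute.
gender = ["female"] * 49 + ["male"] * 48 + ["other"] * 3
freq = frequencies(gender)
print(freq)  # {'female': 49.0, 'male': 48.0, 'other': 3.0}
```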

On the other hand, the mode of an attribute is its most frequent value. So in this case, we might have something like marital status: single, married, divorced.

Depending on our data set, we may want to know what the most common value is. Do we have mostly single people, mostly married people, or mostly divorced people in our data set? That will change the way we look at the data.
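Finding the mode is just picking the value with the highest count. Here is a small Python sketch; the function name `mode_of` and the marital-status records are invented for illustration, and ties are broken by whichever value was seen first.

```python
from collections import Counter


def mode_of(values):
    """Return the most frequent value (first one encountered wins a tie)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical sample: 5 married, 3 single, 2 divorced.
marital_status = [
    "single", "married", "married", "divorced", "married",
    "single", "married", "single", "married", "divorced",
]
most_common_status = mode_of(marital_status)
print(most_common_status)  # married
```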

Frequency and mode are typically used with categorical data, though sometimes they’re useful with continuous data too. More often, when we’ve got continuous attributes, we think in terms of percentiles. For the most part, that is more useful than direct frequency or the concept of mode.

So percentiles are pretty simply defined. I have a formal definition here, but the easier way to understand it is with an example.

To find your percentile, you count the number of people who have a smaller value than you, and you work out what percentage of the total group that number is. That percentage is your percentile.

So if you are the fourth tallest person in a group of 20, that means 16 of the 20 people, or 80%, are shorter than you. And it means that you are at the 80th percentile. And so if your height is 1.85 meters, then 1.85 meters is the 80th percentile height in this group that we care about.
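The height example can be checked with a few lines of Python. This sketch follows the informal definition from the talk (percentage of the group with a strictly smaller value); statistics libraries often use slightly different conventions. The function name `percentile_rank` and the list of 20 heights, given in centimeters, are made up so that 185 cm is the fourth tallest.

```python
def percentile_rank(values, x):
    """Percentage of values strictly smaller than x."""
    smaller = sum(1 for v in values if v < x)
    return 100.0 * smaller / len(values)


# Hypothetical group of 20 heights in centimeters; 185 is the fourth tallest.
heights_cm = [
    152, 155, 158, 160, 162, 164, 166, 168, 170, 172,
    174, 176, 178, 180, 182, 183, 185, 187, 190, 193,
]
rank = percentile_rank(heights_cm, 185)
print(rank)  # 80.0
```

Sixteen of the twenty values are below 185, so the result is the 80th percentile, matching the example above.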

**Part 22**:

Center and Spread Measurement

**Part 20**:

Summary Statistics

**Complete Series**:

Data Mining Fundamentals
