Data Visualization & Exploration – Data Mining Fundamentals Part 21

Data exploration is the use of visualization and calculation to better understand the characteristics of data. We cover the key motivations for data exploration as well as the techniques used in it.

So I’m going to go through and talk
a bit about the kinds of summary statistics we like to use:
frequency, counts, mean, and standard deviation.
So summary statistics are numbers
that summarize properties of the data, exactly what they
sound like.
Most can be calculated pretty quickly in a single pass
through the data, which is very nice.
Most of them can be calculated in just
about any language you care to use, whether that’s
SQL, or R, or Python,
or anything else.
Summary statistics are pretty easy to calculate.
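
As a rough illustration of the single-pass idea, here is a minimal Python sketch that computes a count, mean, and sample standard deviation while reading each value only once. It uses Welford's running update, which is one common way to do this and not necessarily what the speaker has in mind; the function name `summarize` and the sample values are made up for the example.

```python
import math

def summarize(values):
    """Return (count, mean, sample standard deviation) in one pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Sample standard deviation; defined as 0.0 here when fewer than 2 values.
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std

heights = [1.60, 1.70, 1.80, 1.90]  # hypothetical heights in meters
count, mean, std = summarize(heights)
print(count, round(mean, 3), round(std, 3))
```

Because the loop keeps only three running numbers, the same approach works on data that is streamed or too large to hold in memory.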
So for categorical data, our most common summary statistics
are frequency and mode.
So the frequency of an attribute value is the percentage
of the time that value occurs in the data set.
So for example, if the attribute is gender,
then the value female will occur a bit less than 50%
of the time.
The value male will occur a bit less than 50% of the time.
And something else will occur some small percentage
of the time.
So we can think of those numbers as being percentages.
On the other hand, the mode of an attribute
is the most frequent attribute value.
So for example, we might have an attribute like marital status:
single, married, divorced.
Depending on our data set, we may
want to know what the most common value is.
Do we have mostly single people, mostly married people,
or mostly divorced people in our data set?
That will change the way we look at the data.
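
Frequency and mode as described above can be sketched in a few lines of Python. This is only an illustration: the helper names `frequencies` and `mode` and the sample list are assumptions for the example, not something from the lecture.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the percentage of rows it accounts for."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}

def mode(values):
    """Return the most frequent value (one of them, if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical marital status column
marital_status = ["single", "married", "married", "divorced", "married", "single"]

print(frequencies(marital_status))
print(mode(marital_status))
```

Note that when two values are tied for most frequent, this sketch simply returns one of them; whether that is acceptable depends on the data set.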
Frequency and mode are typically used with categorical data.
Sometimes they’re useful with continuous data too,
though more often when we’ve got continuous attributes,
we think in terms of percentiles.
Percentiles are more useful there than direct frequency or the concept
of mode, for the most part.
So percentiles are pretty simply defined.
I have a formal definition here.
But the easier way to understand it is with an example.
To find your percentile, you count the number of people who
have a smaller value than you.
Then you work out what percentage of the total group
that number is.
And you are thus at that percentile.
So if you are the fourth tallest person
in a group of 20, that means 16 of the 20 people, or 80%,
are shorter than you.
And it means that you are at the 80th percentile.
And so if your height is 1.85 meters,
then 1.85 meters is the 80th percentile height in this group
that we care about.
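
The counting definition above can be written as a short Python sketch. The function name `percentile_rank` and the list of heights are made up for illustration; the list is arranged so that 1.85 meters is the fourth tallest of 20. Statistical libraries often use interpolation-based definitions of percentiles, so their results can differ slightly from this simple count.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly smaller than x."""
    smaller = sum(1 for v in values if v < x)
    return 100.0 * smaller / len(values)

# Hypothetical heights in meters for a group of 20 people
heights = [
    1.52, 1.55, 1.58, 1.60, 1.62, 1.64, 1.66, 1.68, 1.70, 1.72,
    1.74, 1.76, 1.78, 1.80, 1.82, 1.84, 1.85, 1.88, 1.91, 1.95,
]

# 16 of the 20 heights are below 1.85
print(percentile_rank(heights, 1.85))
```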

Part 22:
Center and Spread Measurement

Part 20:
Summary Statistics 

Complete Series:
Data Mining Fundamentals


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

