Data exploration is visualization and calculation to better understand characteristics of data. We will tell you the key motivations of data exploration as well as the techniques used in data exploration.
So I’m going to go through and talk
a bit about the kinds of summary statistics we like to use now.
Frequency, counts, mean, and standard deviation.
So summary statistics are numbers
that summarize properties of the data, exactly what they sound like.
Most can be calculated pretty quickly in a single pass
through the data, in one pass, which is very nice.
Most of them can be calculated in just
about any language you care to do them in, whether you’re
doing it in SQL, or R, or Python,
or anything else that you care to do it.
Summary statistics are pretty easy to calculate.
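
To make the single-pass idea concrete, here is a minimal sketch in Python. It uses Welford's running update, which is one common way to get the count, mean, and standard deviation in one pass; the talk does not name a specific method, so treat that choice as an assumption. The function name `single_pass_summary` and the height values are made up for illustration.

```python
import math

def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) from one pass over values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count        # update the running mean
        m2 += delta * (x - mean)     # update the squared-deviation total
    # Sample standard deviation (divides by count - 1); 0.0 if fewer than 2 values.
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std

heights = [1.60, 1.70, 1.80, 1.90]  # hypothetical heights in meters
count, mean, std = single_pass_summary(heights)
print(count, mean, std)
```

The loop touches each value exactly once and keeps only three running numbers, which is why this kind of statistic is cheap to compute even on a large data set.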
So for categorical data, our two most common summary statistics are frequency and mode.
So the frequency of an attribute is the percentage
measuring how often the value occurs in the data set.
So for example, if the attribute is gender,
then the value female will occur a bit less than 50% of the time.
The value male will occur a bit less than 50% of the time.
And something else will occur some small percentage of the time.
So we can think of those numbers as being percentages.
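
As a rough sketch of the frequency calculation just described, the Python below counts each value and divides by the total. The helper name `frequencies` and the sample data are hypothetical, chosen only to match the shape of the gender example.

```python
from collections import Counter

def frequencies(values):
    """Return a dict mapping each distinct value to its percentage of the data."""
    counts = Counter(values)   # one pass to count each value
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical data set of 100 records, not taken from the talk.
genders = ["female"] * 49 + ["male"] * 48 + ["other"] * 3
freq = frequencies(genders)
print(freq)
```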
On the other hand, the mode of an attribute
is the most frequent attribute value.
So in this case, we might say aha.
In this case, we might have something like marital status,
single, married, divorced.
Depending on our data set, we may
want to know what the most common value is.
Do we have mostly single people, mostly married people,
or mostly divorced people in our data set?
That will change the way we look at the data.
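
The mode can be sketched in much the same way: count each value and take the one with the highest count. The function name `mode` and the marital status values below are made up for illustration. Python's standard library also offers `statistics.mode`, which does the same job.

```python
from collections import Counter

def mode(values):
    """Return the most frequent value in values."""
    value, _count = Counter(values).most_common(1)[0]
    return value

# Hypothetical marital status column.
marital_status = ["single", "married", "married", "divorced", "married", "single"]
most_common_status = mode(marital_status)
print(most_common_status)
```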
Frequency and mode are typically used with categorical data.
Though sometimes when you have continuous data,
Though more often when we’ve got continuous attributes,
we think more in terms of percentiles.
So this is more useful than direct frequency or the concept
of mode, for the most part.
So percentiles are pretty simply defined.
I have a formal definition here.
But the easier way to understand it is by looking at it there.
So percentile is you count the number of people who
have a smaller value than you.
And you take that as a percentage of the total group.
And you are thus at that percentile.
So if you are the fourth tallest person
in a group of 20, that means 80% of people are shorter than you.
And it means that you are at the 80th percentile.
And so if the height is 1.85 meters,
then 1.85 meters is the 80th percentile height in this group
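
The counting definition above can be sketched in a few lines of Python. The function name `percentile_rank` and the 20 heights are hypothetical, built so that 16 people are shorter than 1.85 meters and 3 are taller. Note that this follows the simple "percentage of people with a smaller value" definition from the talk; library routines such as `numpy.percentile` use interpolation rules and may give slightly different numbers.

```python
def percentile_rank(values, x):
    """Return the percentage of values that are strictly smaller than x."""
    smaller = sum(1 for v in values if v < x)
    return 100.0 * smaller / len(values)

# Hypothetical group of 20 heights in meters:
# 16 people shorter than 1.85, the person at 1.85, and 3 taller people.
heights = [1.50 + 0.02 * i for i in range(16)] + [1.85, 1.88, 1.90, 1.95]
rank = percentile_rank(heights, 1.85)
print(rank)
```

With this data, the fourth tallest person of 20 has 16 people below them, which is the 80% from the example.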