Data Sampling – Data Mining Fundamentals Part 12

Data sampling is a data preprocessing technique and is the main technique employed for data selection. It is often used for both the preliminary investigation of data and final data analysis.

Another very common method of pre-processing is sampling.
So those of you, like Ron, who are from a statistics
background, will understand sampling quite well.
So, sampling is the main technique
that we use for data selection.
It’s used almost always for preliminary investigation
of the data, but it’s often used even for the final data
analysis, even in data science.
Statisticians have been sampling for as long
as the discipline has existed,
because obtaining the entire set of data
of interest is either too expensive, too time
consuming or, in a lot of cases,
theoretically impossible.
For some kinds of data, there is simply no way
to obtain the entire set,
it’s just not possible.
So you have to sample carefully.
Data miners sample often, because processing
our entire set of data is too expensive or time consuming.
If you’re talking about a group
like LinkedIn, or Facebook, or Google,
you’re talking about hundreds of terabytes
into petabytes worth of data that they
have stored on their servers.
You cannot process that kind of data
in anything remotely resembling a human lifespan,
even with modern technology.
We can process a lot of data, but there’s still
a fundamental limit on what we can process,
and on top of that, there’s a fundamental limit
on what we as humans can look at
all at the same time.
So, when you’re sampling, there is one thing more than anything
else that you have to keep in mind, which is representation.
So, the key principle when you’re sampling
is, that the sample will work almost as well as using
the entire data set, if and only if, the sample
is representative.
And representative is sort of one of those fun words
that means something different for every data set, right?
So, sometimes representative is as
easy as unweighted random sampling.
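A minimal sketch of that simplest case, unweighted (simple) random sampling, in Python; the population here is synthetic and every number is made up purely for illustration:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 100,000 records with one numeric attribute.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Unweighted random sample: every record is equally likely to be chosen.
sample = random.sample(population, k=1_000)

# If the sample is representative, its mean should track the population mean.
pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean: {pop_mean:.2f}, sample mean: {sample_mean:.2f}")
```

With a representative sample, summary statistics computed on the 1,000 sampled records come out close to those of the full 100,000, which is exactly the "works almost as well as the entire data set" principle above.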
Other times it’s harder.
This is particularly true
if you’re doing something like anomaly detection:
we need to make sure that whatever sample
we take has an appropriate proportion of anomalies
versus normal data.
In other contexts, it gets even more complicated.
Sometimes you want to make sure we balance out our different
classes in a classification context, or that certain kinds
of attribute values, not even target values
but plain attribute values,
are all represented in a certain way.
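The transcript doesn’t name a specific technique for this, but the standard one is stratified sampling: split the data into groups (strata), then draw the same fraction from each, so rare classes keep roughly their original share. A sketch with made-up labels and proportions:

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed for reproducibility

# Hypothetical labeled data: roughly 1% "anomaly", 99% "normal".
data = [("anomaly" if random.random() < 0.01 else "normal", i)
        for i in range(50_000)]

def stratified_sample(records, key, fraction):
    """Sample the same fraction from each stratum, so rare classes
    keep roughly their original proportion in the sample."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(random.sample(group, k))
    return sample

sample = stratified_sample(data, key=lambda rec: rec[0], fraction=0.02)
anomaly_share = sum(1 for label, _ in sample if label == "anomaly") / len(sample)
print(f"anomaly share in sample: {anomaly_share:.3f}")
```

A plain random 2% sample could easily draw too few or too many anomalies; sampling each stratum separately pins the anomaly share close to its population value. To deliberately oversample a rare class instead, you would just use a larger fraction for that stratum.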
And Balachander notes that sampling will typically
exclude outliers and may have noise
and that’s absolutely true.
Sampling, if done improperly, won’t exactly
add noise in our context,
but it can certainly introduce noise.
And outliers are probably not going
to appear, because you don’t sample enough of the data
to make them appear, and that’s true.
That’s actually one of the advantages of sampling:
it will exclude outliers most of the time.
So if we aren’t in an anomaly detection
context, then we don’t want outliers
muddying the waters, so to speak;
we’ll want to exclude them, and sampling can help us do that.
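A quick illustration of that point, on synthetic data: with only five extreme outliers in the data set, a modest simple random sample will usually contain none of them.

```python
import random

random.seed(1)  # fixed seed for reproducibility

# Hypothetical data: 10,000 ordinary points plus 5 extreme outliers.
data = [random.gauss(0, 1) for _ in range(10_000)] + [1_000.0] * 5

# A 1% simple random sample: each of the 5 outliers has only about a
# 1% chance of being drawn, so most samples contain no outliers at all.
sample = random.sample(data, k=100)
outliers_in_sample = sum(1 for x in sample if x > 100)
print(f"outliers in sample: {outliers_in_sample}")
```

Each outlier survives into the sample with probability about 100/10,005, so the chance that any of the five shows up is only around 5 percent; this is the mechanism by which sampling quietly drops outliers.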

Part 13:
Data Sampling Types

Part 11:
Data Aggregation

Complete Series:

More Data Science Material:
[Video] Business Data Analysis with Excel
[Blog] Exploratory Data Analysis in R Using ggplot2 and dplyr


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

