Data Sampling Types – Data Mining Fundamentals Part 13

In this short tutorial, we go over the different data sampling types commonly used when employing this technique. Sampling types include random sampling, stratified sampling, and sampling with and without replacement. We will also dive into the issue of sample size, and how it can affect your sampling.

There are several different types
of sampling that are important
and that will come up over the course
of the boot camp.
So there’s simple random sampling,
where there’s an equal probability of selecting
any particular item.
There’s stratified sampling, where
we split the data into several partitions
and draw out random samples from each partition.
If we’re doing stratified sampling
with equal sized partitions, then that’s
equivalent to simple random sampling.
But in many cases we don't use
equal-sized partitions; we have
different-sized partitions to draw from,
or we draw different numbers of points
out of the different partitions,
and that is what makes it fundamentally
different from simple random sampling.
So those are our two fundamental ways
of grouping the data for sampling.
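As a minimal sketch of these two grouping strategies (not from the video; the toy data, the class labels, and the per-stratum sample counts are made up for illustration), using Python's standard library:

```python
import random

# Toy dataset: records labeled by class; class sizes are intentionally unequal.
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

# Simple random sampling: every item has an equal probability of selection,
# so the rare class "b" may barely show up in the sample.
simple = random.sample(data, 10)

# Stratified sampling: partition the data by label, then draw from each
# partition. Here we draw 5 from each stratum, which oversamples the rare
# class "b" relative to its share of the full data set.
strata = {}
for label, value in data:
    strata.setdefault(label, []).append((label, value))
stratified = [item for part in strata.values() for item in random.sample(part, 5)]

print(len(simple), len(stratified))
```

Drawing unequal numbers of points from unequal partitions, as above, is exactly what makes stratified sampling differ from the simple random case.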
And then when we're actually sampling,
there are two kinds of sampling that come up.
The first is sampling without replacement,
which is what most people think of
when they think of sampling.
So sampling without replacement works like this: suppose we have a bag,
and it's got five red balls, four blue balls,
and three green balls in it.
We reach into the bag, pull a ball out,
and say, aha, I drew a red ball.
Then we take that red ball, and we put it on the table.
And then if we want another item,
we reach back in and pull out a different ball.
So now the second time we draw, instead
of there being five reds and four blues and three greens,
there’s four reds, four blues, and three greens.
So that’s the sampling without replacement–
we do not replace what we’re sampling back into the bag.
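The ball-bag example can be sketched directly with Python's standard library, where `random.sample` draws without replacement:

```python
import random

# The bag from the example: 5 red, 4 blue, 3 green balls.
bag = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3

# Sampling without replacement: random.sample never picks the same ball
# twice, so the bag effectively shrinks with each draw, just like putting
# each drawn ball on the table.
draws = random.sample(bag, 4)
print(draws)
```

Because nothing goes back into the bag, no color can appear in the draws more times than it appears in the bag.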
On the other hand, sampling with replacement
has important uses; in fact,
a very common type of modeling uses sampling
with replacement as a fundamental part of how it works.
So in sampling with replacement, instead of taking the red ball
out and then putting it on the table and drawing again,
we reach into the bag, pull out a ball and say, aha, it’s red,
note down on a piece of paper say that it’s red,
then put the red ball back, shake it up,
and draw another ball out again.
Record its color, put it back in the bag.
So without replacement and with replacement
are exactly what they sound like,
but they end up having very different mathematical properties,
and because of that
they are used in different contexts.
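The with-replacement version of the same bag is just as short; `random.choices` draws with replacement, which matches recording each ball's color and putting it back:

```python
import random

# The same bag: 5 red, 4 blue, 3 green balls.
bag = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3

# Sampling with replacement: every draw sees the full, unchanged bag,
# so the same ball can be drawn many times, and we can even draw more
# times than there are balls in the bag (20 draws from 12 balls here).
draws = random.choices(bag, k=20)
print(draws.count("red"), draws.count("blue"), draws.count("green"))
```

That the sample can be larger than the bag is one concrete way the mathematics of the two schemes diverge.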
All right, the last aspect we need to think about
around sampling is what size of sample we want to take.
And I really like this picture, because I
think it illustrates the problems
with sample size very well.
So when we sample, we do lose information,
just like with aggregation.
So you have to be careful not to make your sample too small.
So if we look over here, we have this data set,
and it’s just position data.
This is, I think, some sort of lithography picture.
So we’ve got these black structures,
and then we’ve got this sine wave
in the background and then a little bit of just random noise
scattered all over the place.
So if we subsample this by a quarter,
down to 2,000 points, the big thick structures
are still represented.
But the sine wave has almost entirely disappeared.
We've lost that background pattern.
And if we go down even farther, subsampling
by another quarter down to 500 points,
we've lost even the information about the structures.
You can look at this and kind of see the structures,
but only because you already know what
they should look like.
If I showed you just this graph first,
you wouldn't pick out the structures.
You wouldn't be able to, because there's just not enough information left.
So we want to reduce our sample size;
we want a sample small enough that we can process it,
analyze it, and explore it efficiently.
But we have to be really careful not to take too small a sample.
And unfortunately, there really isn't a good rule
of thumb for this.
You just need to play with it:
take lots of different samples of different sizes
to figure out when your information starts to disappear.
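That play-with-it process can be sketched as a small experiment (the point counts below are hypothetical stand-ins for the lithography picture, where a small fraction of points carry the fine structure):

```python
import random

# Hypothetical data set: 8,000 points, of which only 400 belong to a
# fine structure like the sine wave; the rest are background.
random.seed(0)
points = ["structure"] * 400 + ["background"] * 7600
random.shuffle(points)

# Subsample at several sizes and count how many structure points survive.
# As the sample shrinks, the rare structure is represented by fewer and
# fewer points, until it can no longer be recognized.
for size in (8000, 2000, 500):
    sample = random.sample(points, size)
    hits = sample.count("structure")
    print(f"sample of {size}: {hits} structure points")
```

Running this kind of sweep over sample sizes, against your own data, is how you find the point where the information starts to disappear.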

Part 14:
Dimensionality Reduction

Part 12:
Sampling for Data Selection


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

