# Data Sampling Types – Data Mining Fundamentals Part 13

January 6, 2017 1:00 pm

In this short tutorial, we go over the different data sampling types commonly used when employing this technique. Sampling types include random sampling, stratified sampling, and sampling with and without replacement. We will also dive into the issue of sample size, and how it can affect your sampling.

There are several different types of sampling that are important, and they will come up over the course of the boot camp.

First, there's simple random sampling, where there is an equal probability of selecting any particular item. Then there's stratified sampling, where we split the data into several partitions and draw random samples from each partition. If we do stratified sampling with equal-sized partitions and draw the same number of points from each, that's equivalent to simple random sampling. But in a lot of cases the partitions have different sizes, or we draw different numbers of points from the different partitions, and that is what makes it fundamentally different from simple random sampling. Those are our two fundamental ways of actually grouping the data.
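The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration on a made-up dataset; the group labels, group sizes, and the proportional-allocation rule are assumptions chosen for the example, not part of the lecture:

```python
import random
from collections import defaultdict

random.seed(0)

# Toy dataset: (group, value) pairs with deliberately unequal group sizes.
data = ([("A", i) for i in range(70)]
        + [("B", i) for i in range(20)]
        + [("C", i) for i in range(10)])

# Simple random sampling: every item has the same probability of selection.
simple_sample = random.sample(data, 10)

# Stratified sampling: partition the data by group, then draw from each partition.
partitions = defaultdict(list)
for group, value in data:
    partitions[group].append((group, value))

# Proportional allocation (one common choice): each partition contributes
# a number of points proportional to its size, so every group is represented.
stratified_sample = []
for group, items in partitions.items():
    k = max(1, round(len(items) / len(data) * 10))
    stratified_sample.extend(random.sample(items, k))

print("simple:    ", sorted({g for g, _ in simple_sample}))
print("stratified:", sorted({g for g, _ in stratified_sample}))
```

With simple random sampling, a small group like "C" can easily be missed entirely in a 10-point sample; the stratified draw guarantees every partition appears.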

And then when we're actually sampling, two kinds of sampling come up. The first is sampling without replacement, which is what most people think of when they think of sampling. Imagine we have a bag containing five red balls, four blue balls, and three green balls. We reach into the bag, pull out a ball, and say, aha, I drew a red ball. We take that red ball and put it on the table, and if we want another item, we reach back in and pull out a different ball. So the second time we draw, instead of five reds, four blues, and three greens, the bag holds four reds, four blues, and three greens. That's sampling without replacement: we do not put what we've sampled back into the bag.

On the other hand, sampling with replacement has important uses too; it is actually a fundamental ingredient of a very common type of modeling. In sampling with replacement, instead of taking the red ball out and putting it on the table, we pull out a ball, say, aha, it's red, note the color down on a piece of paper, put the red ball back, shake up the bag, and draw again. Each time, we record the color and return the ball to the bag.

So "without replacement" and "with replacement" are exactly what they sound like, but they end up having very different mathematical results, and because of that they are used in different contexts.
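The two draws map directly onto Python's standard-library `random` module. A quick sketch using the bag of balls from the example (the draw count of six is an arbitrary choice for illustration):

```python
import random

random.seed(1)

# The bag from the example: five red, four blue, three green balls.
bag = ["red"] * 5 + ["blue"] * 4 + ["green"] * 3

# Sampling WITHOUT replacement: each drawn ball leaves the bag,
# so no individual ball can be picked twice.
without_repl = random.sample(bag, 6)

# Sampling WITH replacement: each draw is recorded and the ball goes back,
# so the same ball can show up again and again.
with_repl = random.choices(bag, k=6)

print("without replacement:", without_repl)
print("with replacement:   ", with_repl)
```

One consequence of the different mathematics: without replacement, a sample of six can never contain more than five reds, while with replacement it can, since the red balls keep going back into the bag.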

All right, the last aspect we need to think about around sampling is what size of sample to take. I really like this picture because I think it illustrates the problems with sample size very well. When we sample, we do lose information, just like with aggregation, so you have to be careful not to make your sample too small.

Look at this data set: it's just position data, from what I think is some sort of lithography picture. We've got these big black structures, a sine wave in the background, and a little bit of random noise scattered all over the place. If we subsample it by a quarter, down to 2,000 points, the big thick structures are still represented, but the sine wave has almost entirely disappeared; we've lost that background image. And if we go down even farther, by another quarter to 500 points, we lose even the structures. You can look at the plot and sort of pick them out, but only because you already know what they're supposed to look like. If I showed you just that graph first, you wouldn't find the structures; there simply isn't enough information there.

So we do want to reduce our sample size: we want a sample small enough that we can process, analyze, and explore it efficiently. But we have to be really careful not to take too small a sample. Unfortunately, there isn't a good rule of thumb for this. You need to play with it: take lots of different samples of different sizes and figure out when your information starts to disappear.
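That "try many sizes and watch the structure vanish" experiment can be mimicked numerically. The sketch below is an assumption-laden stand-in for the lecture's image: synthetic sine-plus-noise points, a closeness threshold of 0.15, and the sample sizes 8,000 / 2,000 / 500 are all invented to echo the example, not taken from the actual data:

```python
import math
import random

random.seed(42)

# Synthetic 2-D point cloud loosely mimicking the example:
# about half the points trace a sine wave, the rest are background noise.
points = []
for _ in range(8000):
    x = random.uniform(0, 4 * math.pi)
    if random.random() < 0.5:
        points.append((x, math.sin(x) + random.gauss(0, 0.05)))   # structure
    else:
        points.append((x, random.uniform(-2.0, 2.0)))             # noise

# For progressively smaller simple random samples, count how many points
# still lie near the sine curve -- a crude proxy for how visible it is.
counts = {}
for n in (8000, 2000, 500):
    sample = random.sample(points, n)
    counts[n] = sum(abs(y - math.sin(x)) < 0.15 for x, y in sample)
    print(f"sample size {n:5d}: {counts[n]} points near the sine wave")
```

The *fraction* of structure points stays roughly constant, but the absolute number tracing the wave shrinks with the sample, which is why the pattern stops being recognizable by eye at small sizes.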

**Part 14**:

Dimensionality Reduction

**Part 12**:

Sampling for Data Selection

**Complete Series**:

https://tutorials.datasciencedojo.com/video-series/data-mining-fundamentals/

**More Data Science Material**:

[Video] Data Manipulation with dplyr

[Blog] Importance of Data Normalization Prior to Analytics

[Blog] Building Data Visualization Tools

**Tags:** Data Mining