# Dimensionality Reduction – Data Mining Fundamentals Part 14

January 6, 2017 2:00 pm

Dimensionality reduction serves a specific purpose in data preprocessing. As dimensionality increases, data becomes increasingly sparse in the space it occupies, and dimensionality reduction helps you avoid this.

The next thing we're going to talk about is what's called the curse of dimensionality. This is sort of a data quality issue, but it's something we have to be careful about when we're doing data preprocessing.

The curse of dimensionality is that as your number of dimensions increases, that is, as the number of columns, the number of attributes you have in your data set, increases, the data inherently becomes increasingly sparse in that space. In a lot of contexts, for a lot of different algorithms, definitions of density, and of distances between points (similarity and dissimilarity), are really important to things like clustering methods and outlier detection, so anomaly detection. All of this becomes less meaningful: if you add enough dimensions, every point looks like an outlier.

So a great illustration of this: we randomly generate 500 points in an n-dimensional space, and we compute the difference between the maximum distance between any pair of points and the minimum distance between any pair of points, normalized and with a log taken to make it look pretty. We can see that in two dimensions, with 500 randomly generated points, the maximum distance is about three and a quarter times larger than the minimum distance. Actually, it's 10 to the three and a quarter times larger, because there's a log base 10 here.

As we increase the number of dimensions, though, that spacing falls off really sharply. By the time we get out to 30, 40, 50 dimensions, our points are so sparse that the minimum distance between points and the maximum distance are almost the same thing. The point at 50 dimensions represents a factor of something like 10 to the 0.25: the fourth root of 10 is the difference between the maximum distance and the minimum distance. That is a very small number. It's really hard to define outliers when you have such high-dimensional data, because the space is so sparse that every point is an outlier in some ways.
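The experiment described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the lecture's actual code: it assumes the points are drawn uniformly from the unit hypercube and that the "normalized, log taken" quantity is log10((max − min) / min) over all pairwise Euclidean distances.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_distance_spread(n_dims, n_points=500):
    """log10((max - min) / min) over all pairwise distances of random points."""
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    # Squared pairwise distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    dists = np.sqrt(np.maximum(d2, 0.0))
    # Keep each pair once; drop the zero self-distances on the diagonal.
    pair_dists = dists[np.triu_indices(n_points, k=1)]
    return np.log10((pair_dists.max() - pair_dists.min()) / pair_dists.min())

for d in (2, 10, 30, 50):
    print(f"{d:2d} dimensions: {log_distance_spread(d):.2f}")
```

Running this reproduces the qualitative behavior described in the lecture: a spread around 3 in two dimensions, falling sharply toward a small fraction of 1 by 50 dimensions.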

So the solution to this data quality problem is something called dimensionality reduction. We can do dimensionality reduction via aggregation or other sorts of column combination, but there are also a number of mathematical techniques. Two of the big popular ones are Principal Component Analysis, or PCA, and Singular Value Decomposition, also called SVD. Those are mathematical techniques that run automatically and reduce the dimensionality of your data. PCA usually goes from n dimensions, as many dimensions as you have, all the way down to two dimensions.

Natalie, they are kind of the same thing, but they aren't exactly the same thing. I'm not going to go into great detail, because we don't spend a lot of time on dimensionality reduction over the course of the boot camp. But my understanding is that they are distinct techniques: they have the same goal, but it is achieved via different mathematical methods.
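One way to see how the two techniques relate is to implement PCA using SVD. The sketch below is illustrative, not from the lecture: the toy data, its sizes, and the noise level are all assumptions. It centers the data, takes the SVD, and projects onto the top two right singular vectors, going from n dimensions down to two as described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 samples with 10 attributes that really only vary
# along 2 underlying directions, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA via SVD: center each column, decompose, keep the top 2 components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # n dimensions -> 2 dimensions

# Squared singular values are proportional to variance along each component.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(X_2d.shape, f"variance explained by 2 components: {explained:.3f}")
```

Because this toy data genuinely lives near a 2-dimensional subspace, the two retained components capture almost all of the variance, which is the situation where this kind of reduction works well.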

**Part 15**:

Feature Subset Selection

**Part 13**:

Types of Sampling

**Complete Series**:

Data Mining Fundamentals

**More Data Science Material**:

[Video] Event Log Mining with R

[Blog] Custom R Models in Azure Machine Learning
