Dimensionality reduction has a specific purpose in data preprocessing: as dimensionality increases, data becomes increasingly sparse in the space it occupies. Dimensionality reduction helps you avoid this.
The next kind of thing we’re going to talk about
is what’s called the curse of dimensionality.
So this is sort of a data quality
issue, but it’s something that we
have to be careful about when we’re doing data processing.
So the curse of dimensionality is that as your number
of dimensions increases– so as the number of columns,
number of attributes you have in your data set increases–
the data inherently becomes increasingly
sparse in that space. In a lot of contexts,
for a lot of different algorithms,
definitions of density and of distances
between points, of similarity and dissimilarity,
are really important to things like clustering methods
and outlier detection, so anomaly detection.
And all of this becomes less meaningful.
If you add enough dimensions, every point
starts to look like an outlier.
So a great illustration of this is
that if we randomly generate 500 points in an n-dimensional space
and we compute the difference between the maximum distance
between any pair of points and the minimum distance
between any pair of points,
normalized by the minimum distance and with a log base 10 taken,
we can see that in two dimensions with 500 randomly
generated points, the maximum distance is about
10 to the three and a quarter times
larger than the minimum distance.
As we increase the number of dimensions,
though, that spacing falls off really sharply.
And by the time we get out here to 30, 40, 50 dimensions,
our points are so sparse that the minimum distance
between points and the maximum distance
are almost the same thing.
The point at 50 dimensions represents a factor of something
like 10 to the one quarter: the fourth root of 10 is the ratio
between the maximum distance and the minimum distance.
That’s a very small number.
It’s really hard to define outliers
when you have such high-dimensional
data, because every point is an outlier in some way,
because the space is so sparse.
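The illustration described above can be reproduced in a few lines of Python. The 500 points, the dimension counts, and the (max − min) / min log ratio follow the lecture; the choice of uniform sampling in the unit hypercube and the seed are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_relative_contrast(n_points=500, dims=2):
    """Return log10((max - min) / min) over all pairwise distances,
    the quantity described in the lecture's illustration."""
    pts = rng.random((n_points, dims))       # uniform points in the unit hypercube
    sq = (pts ** 2).sum(axis=1)
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    iu = np.triu_indices(n_points, k=1)      # each pair once, diagonal excluded
    d = np.sqrt(np.clip(d2[iu], 0.0, None))
    return np.log10((d.max() - d.min()) / d.min())

for dims in (2, 5, 10, 30, 50):
    print(f"{dims:>2} dimensions: {log_relative_contrast(dims=dims):.2f}")
```

As the dimension grows, the printed value falls sharply toward zero: the nearest and farthest pairs of points become nearly indistinguishable.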
So the solution to this data quality problem
is something called dimensionality reduction.
So we can do dimensionality reduction via aggregation
or other sorts of column combination.
But there are also a number of mathematical techniques.
Two of the big popular ones are Principal Component Analysis,
or PCA, and Singular Value Decomposition, also called SVD.
And those are mathematical techniques
that run automatically and reduce the dimensionality of your data.
PCA usually goes from n dimensions,
so as many dimensions as your data has, all the way down
to some much smaller number of dimensions that you choose.
Natalie, they are kind of the same thing,
but they aren’t exactly the same thing.
I’m not going to go into great detail,
because we don’t spend a lot of time
on dimensionality reduction over the course of the boot camp.
But my understanding is that they are distinct techniques:
they have the same goal,
but it’s achieved via different mathematical methods.
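As a minimal sketch of how the two relate, PCA can be computed via the SVD of a mean-centered data matrix. The data set here is made up for illustration (50 attributes driven by 3 underlying factors), and reducing to 3 components is just an example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: 200 rows, 50 attributes that are really
# driven by only 3 underlying factors, plus a little noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD: center the columns, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                                  # target number of dimensions
X_reduced = Xc @ Vt[:k].T              # project onto the top-k principal components
explained = (S[:k] ** 2).sum() / (S ** 2).sum()

print(X_reduced.shape)                 # (200, 3) instead of (200, 50)
print(f"fraction of variance kept by {k} components: {explained:.4f}")
```

This is also why the two are easy to conflate: libraries such as scikit-learn implement PCA on top of an SVD under the hood, but SVD itself is the more general matrix factorization.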