Feature Selection – Data Mining Fundamentals Part 15

Feature selection is another way of performing dimensionality reduction. We discuss the many techniques for feature subset selection, including the brute-force approach, embedded approach, and filter approach. Feature subset selection will reduce redundant and irrelevant features in your data.

All right, so another way to reduce the dimensionality
of our data, other than just PCA, comes from the fact
that a lot of the time we have redundant or irrelevant features.
This goes back to Theresa's question
about dimensions being independent.
If we have redundant or irrelevant features,
they increase our dimensionality artificially.
They contain little to no information,
but they still increase our dimensionality.
So we want to be very careful about trying to detect these.
A redundant feature example, for instance:
the purchase price of a product and the amount of sales
tax paid on that product are,
within a given state, completely connected.
You can calculate one from the other;
they're perfectly correlated.
As a result, you want to get rid of one of them,
because it increases your dimensionality
without adding any new information.
Same thing with irrelevant features.
A student's ID number, the vast majority of the time,
is irrelevant to the task of predicting that student's GPA.
And these types of redundant and irrelevant features
don't just harm us via increased dimensionality.
Redundant features effectively weight the same information
multiple times.
If we have the same information contained in two
separate columns that the model thinks are both important,
we have double-weighted that information.
Similarly, irrelevant features can confuse our model.
The model will try to do some fitting based on those features,
and that just diffuses the effectiveness of the model.
So one of our big steps of data pre-processing
is figuring out which attributes
are redundant or irrelevant and aggressively cutting them out
of our data set.
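
As a minimal sketch of what detecting a redundant pair might look like, the snippet below builds a small, made-up sales table in pandas and drops one column of any pair whose correlation is nearly perfect. The column names, values, and the 0.98 threshold are all illustrative assumptions, not something specified in the lecture.

```python
import pandas as pd

# Made-up sales data: sales_tax is derived directly from purchase_price,
# so the two columns carry the same information.
df = pd.DataFrame({
    "purchase_price": [10.00, 25.00, 40.00, 55.00],
    "sales_tax":      [0.80, 2.00, 3.20, 4.40],   # exactly 8% of purchase_price
    "units_sold":     [3, 1, 2, 5],
})

# Pairwise correlations: a coefficient near 1.0 (or -1.0) flags a redundant pair.
corr = df.corr().abs()

# Drop one column of any pair whose absolute correlation exceeds a threshold.
threshold = 0.98
to_drop = set()
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > threshold:
            to_drop.add(cols[j])

reduced = df.drop(columns=sorted(to_drop))
print(reduced.columns.tolist())   # 'sales_tax' is dropped as redundant
```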
And there’s a number of different techniques
you can use to do this kind of subset selection.
You can brute-force it and just try all the different feature
subsets.
Some algorithms, including some of the most popular ones
in use, naturally do feature selection as they train,
which is the embedded approach, and that's always good.
Sometimes you take a filter approach,
where you use your exploration and what
you know about the data set in order to filter out
the bad features.
And sometimes you can get some data science inception
going, where you use a data mining algorithm on top of your data
mining algorithm in order to find
the best subset of attributes, which is the wrapper approach.
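
Here is a hedged sketch of what the embedded and wrapper ideas can look like in practice, using scikit-learn on synthetic data. The choice of a random forest, recursive feature elimination, and the specific parameters are assumptions made for illustration, not something prescribed in the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, only a handful of which are actually informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Embedded flavor: tree-based models expose feature importances as a by-product.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("Importances:", forest.feature_importances_.round(3))

# Wrapper ("inception") flavor: repeatedly refit the model, eliminating
# the weakest features until only the requested number remain.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=4).fit(X, y)
print("Kept feature indices:",
      [i for i, keep in enumerate(selector.support_) if keep])
```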
But that’s feature subset selection.
It doesn’t share a lot.
I’m going to move on a little quickly.
Please ask questions as they are as they arise to you.
But we’re running a little bit behind, which is great.
I love the discussions we’ve had and it’s important.
The front half of this presentation
is more critical than the back half.
But I am going to start increasing
the pace a little bit, just as a heads up.
So please ask your questions as they come up.
So another common technique, and this goes along with aggregation
to a certain extent, is feature creation.
We have the curse of dimensionality on the one hand,
but other times we don't have enough features.
We don't have enough information,
and there is more information that we could have.
So we can extract things, say
by combining two columns to get new information.
For instance, in sales we could determine the tag
price from the total amount paid
by filtering out the sales tax, and that might be important.
Other times we do aggregation and things
like that as part of feature construction.
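
As a small illustrative sketch of that kind of feature extraction, the pandas snippet below backs a tag price out of a recorded total, assuming a flat 8% sales tax rate. The column names and the rate are made-up assumptions for the example.

```python
import pandas as pd

# Made-up transactions where only the total amount paid was recorded.
sales = pd.DataFrame({"total_paid": [10.80, 27.00, 43.20]})

# Feature extraction: back out the pre-tax tag price, assuming a flat 8% rate.
TAX_RATE = 0.08
sales["tag_price"] = sales["total_paid"] / (1 + TAX_RATE)
sales["sales_tax"] = sales["total_paid"] - sales["tag_price"]

print(sales.round(2))
```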
And last, and really mostly least
because we don’t do this that much, is mapping data
to a new space.
So those of you from a scientific background
are probably familiar with the Fourier transform, which
takes data that is in the time domain
and converts it to be in the frequency domain, which
allows you to pick out different pieces of information.
We don’t do this kind of transformation
that much in data science because it
tends to require transforming the entire data object.
But it is something to be aware of,
to have in your back in the back of your head.
Because there are some times that you really do
want to do some sort of massive transformation like this.
Particularly in an anomaly detection time series context,
you might want to do things like take a Fourier
transform of your data.
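
To make the Fourier-transform idea concrete, here is a minimal NumPy sketch that maps a noisy synthetic time series from the time domain to the frequency domain and reads off the dominant frequency. The sampling rate, the 5 Hz signal, and the noise level are all assumed purely for illustration.

```python
import numpy as np

# Synthetic time series sampled at 100 Hz: a 5 Hz sine wave plus noise.
fs = 100
t = np.arange(0, 2, 1 / fs)
signal = (np.sin(2 * np.pi * 5 * t)
          + 0.3 * np.random.default_rng(0).standard_normal(t.size))

# Map from the time domain to the frequency domain.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# In the new space the dominant periodic component stands out immediately.
dominant = freqs[np.argmax(np.abs(spectrum))]
print(f"Dominant frequency: {dominant:.1f} Hz")   # ~5 Hz
```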

Part 16:
Data Transformation

Part 14:
Dimensionality Reduction

Complete Series:
Data Mining Fundamentals

More Data Science Material:
[Video] R Programming for Excel Users
[Blog] Natural Language Processing with R Programming Books

