Data Transformation – Data Mining Fundamentals Part 16

We discuss data transformation in data preprocessing, in particular attribute transformation. Attribute transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.

In addition to doing something complicated like a Fourier transform, you can apply much more straightforward transformations to your data. Very common transformations are taking the exponential of a data value, taking the logarithm of a data value, or taking the absolute value of a data value. All of these types of transformations allow us to bring out different dependencies in our data and to correlate our data attributes better with whatever our target is.
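As a rough sketch of what these simple transformations look like in practice (using pandas, which this series doesn't prescribe; the income column and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the "income" column and its values are hypothetical.
df = pd.DataFrame({"income": [32000, 54000, 87000, 150000]})

df["log_income"] = np.log(df["income"])            # logarithm compresses large values
df["abs_change"] = (df["income"] - 60000).abs()    # absolute value of a derived quantity
df["exp_scaled"] = np.exp(df["income"] / 100000)   # exponential (rescaled first to avoid overflow)
```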
The other two transformations here deserve special attention because they show up a lot. Standardization and normalization are probably the most common kinds of transformations applied to attributes in data science.
Standardization is where we take each numeric value, subtract the mean, and divide by the standard deviation of our dataset. This forces our data to have a mean of 0 and a standard deviation of 1, which is why it's called standardization. The reason we do this is that it's a way of scaling our data down.
If you have, for instance, age and annual income, the majority of modeling algorithms will overweight your annual incomes simply because those values are so much larger than the ages. But if we standardize both of those attributes, then age and annual income will be weighted in exactly the same way.
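As a minimal sketch, assuming a pandas DataFrame with hypothetical age and annual_income columns, standardization (z-scoring) is just the arithmetic described above:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61],
    "annual_income": [28000, 54000, 72000, 95000, 61000],
})

# Subtract each column's mean and divide by its standard deviation,
# so every attribute ends up with mean 0 and standard deviation 1.
standardized = (df - df.mean()) / df.std()
```

Libraries such as scikit-learn wrap the same arithmetic in a reusable transformer (StandardScaler), which is convenient when the same scaling must later be applied to new data.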
A somewhat less extreme way to do the same thing is normalization, where we subtract the minimum from every data value and then divide by the range (the maximum minus the minimum). That maps the entire attribute onto the range from 0 to 1. It distorts the separation between the values to a certain extent, but it does scale things very nicely so that, again taking the age versus annual income example, both attributes end up on the same 0 to 1 scale and will be weighted the same way by our algorithms.
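A corresponding sketch of min-max normalization, again on hypothetical age and annual_income columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61],
    "annual_income": [28000, 54000, 72000, 95000, 61000],
})

# Subtract each column's minimum and divide by its range (max - min),
# which maps every attribute onto the interval [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())
```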

Part 17:
Similarity and Dissimilarity

Part 15:
Feature Subset Selection

Complete Series:
Data Mining Fundamentals

More Data Science Material:
[Video] Introduction to Azure ML: Data Exploration
[Blog] A look into Major League Baseball: Does the shift work?

About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
