# Data Transformation – Data Mining Fundamentals Part 16

January 6, 2017 4:00 pm

We discuss the transformation of data in data preprocessing, such as attribute transformation. Attribute transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.

In addition to doing something very complicated, like a Fourier transform, you can apply much simpler, more straightforward transformations to your data. Very common transformations are taking the exponential of a data value, taking the logarithm of a data value, or taking the absolute value of a data value. All of these allow us to bring out different dependencies in our data, to try to correlate our data attributes better with whatever our target is.
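As a rough illustration of these simple attribute transformations, here is a minimal Python sketch. The helper name `transform` and the sample values are made up for this example and do not come from the video; the point is only that each old value maps to exactly one new value.

```python
import math

def transform(values, fn):
    """Apply fn to every value of an attribute and return the new values."""
    return [fn(v) for v in values]

# Hypothetical attribute that grows by a factor of 10 at each step.
incomes = [1_000.0, 10_000.0, 100_000.0, 1_000_000.0]

# The logarithm turns multiplicative steps into evenly spaced values.
log_incomes = transform(incomes, math.log10)

# The absolute value drops the sign and keeps the magnitude.
abs_values = transform([-3.0, 1.5, -0.5], abs)

# The exponential stretches larger values far more than smaller ones.
exp_values = transform([0.0, 1.0, 2.0], math.exp)
```

A transformation like the logarithm can make a relationship with the target look closer to a straight line, which is one way of "bringing out" a dependency.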

The other two things here I’m going to take special time to talk about, because they show up a lot. Standardization and normalization are probably the most common kinds of transformations that are applied to attributes in data science.

Standardization is where we take our numeric data and, for each numeric value, subtract the mean and divide by the standard deviation of our dataset. What this does is force our data to have a mean of 0 and a standard deviation of 1. That’s why it’s called standardization.
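Standardization as described above can be sketched in a few lines of Python. This is a minimal sketch under one assumption: the video does not say whether the population or the sample standard deviation is meant, so the population version (`statistics.pstdev`) is used here. The function name `standardize` and the sample ages are illustrative only.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation.

    Assumption: the population standard deviation is used; the source
    does not specify population versus sample.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize an attribute with no variation")
    return [(v - mean) / std for v in values]

# Hypothetical ages, for illustration only.
ages = [23, 35, 41, 58, 62]
z_ages = standardize(ages)

# Up to floating-point error, the result has mean 0 and standard deviation 1.
z_mean = statistics.fmean(z_ages)
z_std = statistics.pstdev(z_ages)
```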

The reason we do this is that it’s a way of putting our data on a common scale. If you have, for instance, age and annual income, the majority of algorithms will overweight your annual incomes, simply because the income values are numerically much larger than the ages. But if we standardize both of those, then age and annual income are going to be weighted in the same way.
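To make the age versus annual income point concrete, here is a small sketch that uses Euclidean distance as a stand-in for "an algorithm". That choice, the three sample people, and the helper names are assumptions for illustration, not something the video specifies. The sketch measures how much of the squared distance between two people comes from income, before and after standardizing.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def income_share(person_a, person_b):
    """Fraction of the squared Euclidean distance that comes from income."""
    d_age = person_a[0] - person_b[0]
    d_income = person_a[1] - person_b[1]
    return d_income ** 2 / (d_age ** 2 + d_income ** 2)

# Hypothetical people: (age, annual income).
ages = [25, 60, 26]
incomes = [50_000, 51_000, 90_000]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

# Persons 0 and 1 differ by 35 years but only 1,000 in income.
raw_share = income_share(raw[0], raw[1])        # income dominates the distance
scaled_share = income_share(scaled[0], scaled[1])  # age difference now counts
```

On the raw values, nearly all of the distance between the first two people comes from the small income gap, even though their ages are very different. After standardization, the large age gap is what drives the distance.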

A somewhat less extreme way to do the same thing is normalization, where we simply subtract the minimum from every data value and then divide by the range (the maximum minus the minimum). That maps the entire data onto the range from 0 to 1. It changes the separation between the values to a certain extent. But it does scale the data very nicely so that, again taking the age versus annual income example, age and annual income end up on the same 0 to 1 scale. They’ll be weighted the same way by our algorithms.
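The normalization step can be sketched as follows. Once the minimum has been subtracted, the largest remaining value is the maximum minus the minimum, so that is the divisor used here. The function name `min_max_normalize` and the sample values are assumptions made for this example.

```python
def min_max_normalize(values):
    """Map values onto [0, 1]: subtract the minimum, divide by (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize an attribute with no variation")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ages and annual incomes, for illustration only.
ages = [23, 35, 41, 58, 62]
incomes = [28_000, 52_000, 61_000, 120_000, 95_000]

norm_ages = min_max_normalize(ages)
norm_incomes = min_max_normalize(incomes)

# Both attributes now span exactly the same 0-to-1 scale.
age_span = (min(norm_ages), max(norm_ages))
income_span = (min(norm_incomes), max(norm_incomes))
```

One design note: because the result depends only on the minimum and maximum, a single extreme value can squeeze all the other values into a narrow part of the 0 to 1 range.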

**Part 17**:

Similarity and Dissimilarity

**Part 15**:

Feature Subset Selection

**Complete Series**:

Data Mining Fundamentals

