Data Cleaning – Data Mining Fundamentals Part 10

We introduce data preprocessing, sometimes called data cleaning, and the different strategies used to tackle it. There are many strategies for data preprocessing, and because data science is such a heterogeneous field, not all of these strategies are strictly independent.

So now we get to the much foreshadowed data
preprocessing section.
So data preprocessing is sometimes called data cleaning,
but data preprocessing should involve more steps
than just cleaning the data, just removing the problems
with the data.
So data cleaning is kind of a subset of preprocessing.
But most of what we do during data preprocessing
is, in fact, data cleaning.
So, again, lots of different terms
to refer to basically the same thing.
So there’s a lot of different types of preprocessing.
And I’m going to talk about a lot of different strategies,
aggregation, sampling, all the ones on the screen here.
I’m going to talk about all these different strategies.
But we don’t want to use all of these different strategies
on every data set.
There’s a lot of different strategies we can use,
but for any given data set, we’re
only going to use a couple of them usually.
We don’t want to overwhelm.
We’re not going to need every technique and every tool
in our toolbox every time.
Another note before we keep going, not all of these
are strictly independent.
These terms and categories
are all things you see thrown around
and used around the industry.
But, again, because data science is such a heterogeneous field,
not all of these things are strictly independent.
So if you see some overlap in what
I’m talking about between different attributes,
that’s why.

Part 11:
Data Aggregation

Part 9:
Missing Values and Duplicated Data



About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

