We introduce data preprocessing, sometimes called data cleaning, and the different strategies used to carry it out. There are many strategies for data preprocessing, and because data science is such a heterogeneous field, not all of these strategies are strictly independent.
So now we get to the much-foreshadowed data preprocessing.
So data preprocessing is sometimes called data cleaning,
but data preprocessing should involve more steps
than just cleaning the data, just removing the problems.
So data cleaning is kind of a subset of preprocessing.
But most of what we do during data preprocessing
is, in fact, data cleaning.
So, again, lots of different terms
to refer to basically the same thing.
So there’s a lot of different types of preprocessing.
And I’m going to talk about a lot of different strategies,
aggregation, sampling, all the ones on the screen here.
I’m going to talk about all these different strategies.
But we don’t want to use all of these different strategies.
There’s a lot of different strategies we can use,
but for any given data set, we’re
only going to use a couple of them usually.
We don’t want to overwhelm.
We’re not going to need every technique and every tool
in our toolbox every time.
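To make that concrete, here is a minimal sketch of picking just a few tools for one data set: a cleaning step (dropping missing values and exact duplicates), followed by aggregation and sampling. The records, the function names, and the choice of plain Python are all assumptions made for illustration; they are not from the lecture, and a real project would likely use a data-frame library instead.

```python
import random
from collections import defaultdict

# Hypothetical toy records: (city, temperature). None marks a missing value.
records = [
    ("Oslo", 3.0),
    ("Oslo", 3.0),   # exact duplicate of the row above
    ("Oslo", 5.0),
    ("Lima", None),  # missing temperature
    ("Lima", 20.0),
    ("Lima", 22.0),
]

def clean(rows):
    """Data cleaning: drop rows with missing values, then exact duplicates."""
    seen = set()
    kept = []
    for row in rows:
        if any(value is None for value in row):
            continue  # skip rows with a missing value
        if row in seen:
            continue  # skip rows we have already kept
        seen.add(row)
        kept.append(row)
    return kept

def aggregate_mean(rows):
    """Aggregation: collapse the rows into one mean temperature per city."""
    groups = defaultdict(list)
    for city, temp in rows:
        groups[city].append(temp)
    return {city: sum(temps) / len(temps) for city, temps in groups.items()}

def sample(rows, k, seed=0):
    """Sampling: draw k rows without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

cleaned = clean(records)
means = aggregate_mean(cleaned)
subset = sample(cleaned, 2)
```

Here only three strategies are used; the rest of the toolbox stays untouched because this particular data set does not call for it.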
Another note before we keep going, not all of these
are strictly independent.
These terms and categories
are all things you see thrown around
and used around the industry.
But because, again, data science is such a heterogeneous field,
not all of these things are strictly independent.
So if you see some overlap in what
I’m talking about between different attributes,