Data quality is the most overlooked step in data mining. Understanding your data quality problems is very important to creating robust models that will actually work in production.
Now that we've got that basic definition,
and we understand
what attributes are in data objects and the different types,
we can move on to talking about data quality.
Now data quality is, particularly for new data scientists,
one of the most commonly overlooked
or shortened steps.
Pieces of it get ignored, get skipped because it just
doesn’t seem that necessary.
But understanding your data quality problems
and understanding where they could come from
is very, very important to creating
robust models that will actually work in production.
You have to know what to expect in order
to handle it appropriately.
So there are three fundamental questions around data quality.
We have to ask these of every dataset we get.
One, what problems do we have to worry about?
Two, how do we detect those problems? And three, what can we
do about them? Those are the three fundamental questions
you should ask yourself every time
upon approaching a new dataset.
And some of your earliest explorations
should really be focused on answering these questions.
So I am going to give you some examples of how we answer
each of these three questions and some
of the categories of things coming up.
So there are three very common kinds
of data quality problems:
noise and outliers, missing values, and duplicate data.
These show up in production all the time.
So let’s go through and think about these in this context.
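
As a rough illustration of the "how do we detect those problems" question, here is a minimal Python sketch that checks a small dataset for the three kinds of problems named above. It is not from the lecture: the function name `quality_report`, the field names, and the sample rows are all made up for illustration. Using 1.5 times the interquartile range as the outlier cutoff is also an assumption, just one common rule of thumb.

```python
from statistics import quantiles


def quality_report(rows, numeric_field):
    """Count missing values, exact duplicate rows, and IQR outliers in one field.

    Hypothetical helper for illustration; `rows` is a list of dicts.
    """
    # Missing values: any field whose value is None or an empty string.
    missing = sum(1 for row in rows for v in row.values() if v is None or v == "")

    # Duplicate data: rows that are identical to a row already seen.
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Outliers: values more than 1.5 * IQR outside the quartiles
    # (an assumed rule of thumb, not a universal definition).
    values = [
        row[numeric_field]
        for row in rows
        if isinstance(row.get(numeric_field), (int, float))
    ]
    q1, _, q3 = quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [v for v in values if v < q1 - spread or v > q3 + spread]

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


# Made-up sample data: one missing age, one repeated row, one implausible age.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 29},
    {"id": 3, "age": None},
    {"id": 4, "age": 31},
    {"id": 4, "age": 31},
    {"id": 5, "age": 30},
    {"id": 6, "age": 33},
    {"id": 7, "age": 420},
]
print(quality_report(rows, "age"))
```

On this sample the report flags one missing value, one duplicate row, and the age of 420 as an outlier. What to do about each of them is the third question, and the answer depends on the dataset.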