Missing values can occur because information is not collected or because attributes are not applicable to all cases. This section covers several ways to handle missing values, as well as strategies for dealing with duplicate data, which can be a major issue when merging data from heterogeneous sources.
Another issue that shows up very frequently is missing values.
Sometimes values are missing because the information was simply
not collected. In census or survey data in particular, people
will often decline to give their age and weight,
or will decline to give their annual income.
So you just have missing values.
Other times, the attributes that you're collecting
may not be applicable to all cases.
If you're asking people about the annual income
of each member of their household on a survey,
the children in the household
don't have an annual income.
It doesn't make sense, so you just
code that as a missing value.
And we'll talk a lot more about handling
missing values when we get to data pre-processing.
But the fundamental ways we can handle them are these:
throw out all the data objects
that have any missing values;
estimate the missing values
using means, medians, or something else;
with some algorithms, but not all,
ignore the missing values on a row-by-row basis;
or throw the attribute out entirely.
That last one is something we might want to do
if we have an attribute that is, say, 80% missing;
at that point we probably just want to throw the column out.
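The four strategies above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the records, field names, and 80% threshold are all made up, and `None` stands in for a missing value.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 34,   "income": 48000},
    {"age": None, "income": 62000},
    {"age": 52,   "income": None},
    {"age": 41,   "income": 75000},
    {"age": None, "income": None},
]

# 1. Throw out every data object that has any missing value.
complete = [r for r in records if None not in r.values()]

# 2. Estimate missing values, here using the mean of the observed ones.
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in records]

# 3. Ignore missing values on a per-attribute basis: the mean above
#    was already computed over only the ages that were observed.

# 4. Throw an attribute out entirely when it is mostly missing,
#    e.g. more than 80% of its values absent (threshold is arbitrary).
def mostly_missing(attr, threshold=0.8):
    missing = sum(1 for r in records if r[attr] is None)
    return missing / len(records) > threshold
```

In a real pipeline you would typically use a library like pandas (`dropna`, `fillna`, `drop`) rather than hand-rolling these loops, but the logic is the same.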
One other thing you can do with some algorithms
is replace missing values adaptively.
This happens a lot with categorical attributes:
you count the probability of each attribute value
appearing over your whole dataset, and then
fill in the missing values such
that those probabilities don't change.
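One way to fill missing categorical values while preserving the observed proportions is to draw each replacement from the observed value distribution. A minimal sketch, with a made-up attribute and a seeded generator for reproducibility:

```python
import random
from collections import Counter

# Hypothetical categorical attribute; None marks a missing value.
colors = ["red", "blue", None, "red", "red", None, "blue", "red"]

# Count how often each observed value appears.
observed = [c for c in colors if c is not None]
counts = Counter(observed)            # e.g. {"red": 4, "blue": 2}
values = list(counts)
weights = [counts[v] for v in values]

# Fill each missing entry by sampling from the observed distribution,
# so the overall value proportions are unchanged in expectation.
rng = random.Random(0)  # seeded so the sketch is reproducible
filled = [c if c is not None else rng.choices(values, weights)[0]
          for c in colors]
```

Sampling preserves the proportions in expectation; a deterministic variant would distribute the replacements exactly in proportion to the observed counts.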
We'll talk a little more about that when
we get to pre-processing; for now I just want to get
the basics of how you handle missing values out there.
And the third category then, alongside missing values and noise
and outliers, is duplicate data.
This is particularly a problem
when we're merging data from heterogeneous sources.
So say we have some data from Google Analytics coming
from our website, and we have some other data
about actual usage, click counts, dwell time,
and things like that,
from another system, or maybe we have a Java applet,
as much as those things still exist
on the internet, that collects some data of its own.
If we want to merge that data, we will sometimes
have duplicate data objects.
We’ll have the same person with multiple email addresses.
We’ll have the same person represented
with two different IDs, because they’re coming
from two different systems.
Generally speaking, though, duplicate data
is pretty easy to handle, assuming
that you can detect it properly:
you just get rid of the duplicates.
[LAUGHS] Merge it together.
But if you've got heterogeneous data
from multiple sources, then you do
have to be really careful about detecting and filtering out your duplicates.
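A minimal sketch of that merge-and-deduplicate step: two hypothetical source systems report the same person under different IDs, so we match records on a shared key (a lowercased email here) and fold duplicates together instead of keeping both copies. All the names and fields below are invented for illustration.

```python
# Records from two hypothetical systems; the same person ("Ana")
# appears under two different IDs.
web_records = [
    {"id": "W-101", "email": "Ana@example.com",  "clicks": 12},
    {"id": "W-102", "email": "bob@example.com",  "clicks": 7},
]
app_records = [
    {"id": "A-9",   "email": "ana@example.com",  "clicks": 3},
    {"id": "A-10",  "email": "cara@example.com", "clicks": 5},
]

merged = {}
for record in web_records + app_records:
    # Normalize the matching key so "Ana@" and "ana@" collide.
    key = record["email"].lower()
    if key in merged:
        # Duplicate detected: merge the measurements together.
        merged[key]["clicks"] += record["clicks"]
    else:
        merged[key] = dict(record)

people = list(merged.values())  # one record per person
```

Real deduplication is usually harder than this, since records rarely share a clean key; detecting that two differently-keyed records are the same entity is the hard part, and the merge itself is the easy part.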
More Data Science Material:
[Video] Combining Datasets in dplyr
[Blog] 30 Data Sets to Uplift your Skills in Data Science