Data Noise – Data Mining Fundamentals Part 8

Data noise can overlap valid data and outliers. Noise can appear because of human inconsistency in labeling. We will provide you with several examples of data noise and how data noise can be measured and recorded.

So, those of you who have a scientific or signal processing
background are probably familiar with the term noise.
Noise in a data science context is
when we have an invalid signal of some sort that
overlaps valid data.
This obscures our actual attribute values.
And, fundamentally, what it means
is that some of our data objects have invalid values
in some of the attributes.
They don’t have real values;
they have inaccurate values there.
So, examples of this in real life:
we have the distortion of a person’s voice
over the phone, and snow on old television screens,
particularly the old CRT television screens.
Noise can appear because of human inconsistency
in labeling.
You see this a lot, for instance, in sports
that require human judging.
There’s a lot of inconsistency in how
people get labeled there.
And, just in general, if you’re trying to, say, rank web sites,
human inconsistency in labeling
can be a real problem.
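
One way to put a number on labeling inconsistency is to have two people label the same items and compare their answers. The sketch below is only an illustration: the two judges and their labels are hypothetical, and Cohen's kappa is a standard agreement statistic brought in here, not something the talk itself uses.

```python
from collections import Counter

def agreement_rate(labels_a, labels_b):
    """Fraction of items on which the two judges gave the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level.

    Undefined (division by zero) if chance agreement is exactly 1.0,
    which happens when both judges use one identical label throughout.
    """
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both judges pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two judges rating the same eight web sites.
judge_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

print("agreement:", agreement_rate(judge_a, judge_b))
print("kappa:", cohens_kappa(judge_a, judge_b))
```

A low agreement rate or a kappa near zero suggests the label attribute itself is noisy, whatever the other attributes look like.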
So, as a practical example of what noise
can do when there’s a lot of it,
here is a pretty straightforward signal.
We’ve got two sine waves here with different frequencies
but the same amplitude, a blue one and a green one.
When we generate them,
they look very clean, very pretty.
We can even distinguish the two different sine waves.
If we add those two waves together and then throw noise
at them, just basic white noise
like you might get from any kind of random process,
we end up with something that looks like this.
So, the noise has completely obscured our actual signal.
So, noise is, again, fundamentally, invalid data
points that are obscuring our signals.
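
The sine wave demonstration can be approximated in a few lines. This is a minimal sketch, not the code behind the slides: the frequencies (3 and 7 cycles per second), the sample count, and the noise level are all assumed values, and the signal-to-noise ratio is added here as one common way to measure how much noise is present.

```python
import math
import random

def make_noisy_signal(n_samples=1000, freq_a=3.0, freq_b=7.0,
                      amplitude=1.0, noise_std=5.0, seed=0):
    """Return (clean, noisy): two summed sine waves, then the same with white noise."""
    rng = random.Random(seed)
    clean, noisy = [], []
    for i in range(n_samples):
        t = i / n_samples  # time within a one-second window
        value = (amplitude * math.sin(2 * math.pi * freq_a * t)
                 + amplitude * math.sin(2 * math.pi * freq_b * t))
        clean.append(value)
        # Gaussian white noise: an independent random draw for every sample.
        noisy.append(value + rng.gauss(0.0, noise_std))
    return clean, noisy

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels; negative means noise power exceeds signal power."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)

clean, noisy = make_noisy_signal()
print(f"SNR: {snr_db(clean, noisy):.1f} dB")
```

With a noise standard deviation several times the wave amplitude, the ratio comes out negative, which matches the picture: the noise carries more power than the signal it is hiding.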
There’s always some noise in any system.
It’s just the nature of the universe, sadly.
But understanding where your noise is at its worst
and how you can deal with it is very important.
Even recognizing that it’s there is the first step:
recognizing which of your attributes
are more noisy and which of them are less noisy.
The complementary problem to noise
is the problem of outliers.
So, outliers often look like noise at first.
They’re data objects that have characteristics
that are considerably different from most of the other objects
in the data set.
So, if we look at the visual here,
we’ve got a two-dimensional plot
of our data, and each dot, each pixel point,
represents a data object that’s been plotted on the graph.
So, we’ve got four clusters– very nicely defined clusters–
and then we’ve got these three other points just hanging out
in the middle of nowhere, far away from all
of the other data.
So, the big distinction between outliers and noise
is that outliers are actually valid values.
The data was collected properly– it’s clean,
but it’s outside of the normal range.
The data object, for some reason,
doesn’t look like a normal object.
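
To make the picture concrete, here is a rough sketch that rebuilds a similar layout (four tight clusters plus three isolated points) and flags the isolated ones. Everything in it is an assumption for illustration: the cluster centers, the spread, the distance threshold, and the k-th nearest neighbor rule itself, which is just one simple way to find points that sit far from everything else and is not a method the talk prescribes.

```python
import math
import random

def kth_neighbor_distance(points, k):
    """For each point, the Euclidean distance to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def flag_outliers(points, k=3, threshold=2.0):
    """Indices of points whose k-th neighbor distance exceeds the (assumed) threshold."""
    scores = kth_neighbor_distance(points, k)
    return [i for i, score in enumerate(scores) if score > threshold]

# Hypothetical data: four clusters of 25 points each, then three isolated points.
rng = random.Random(0)
centers = [(0, 0), (0, 10), (10, 0), (10, 10)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(25)]
points.extend([(5.0, 5.0), (5.0, 15.0), (15.0, 5.0)])

outlier_idx = flag_outliers(points)
print("flagged indices:", outlier_idx)
```

Because outliers are valid values, the sketch only flags them. Whether to keep, investigate, or set them aside is a separate decision that depends on the analysis.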
All right, so that’s outliers and noise.
Those are the first category of data quality
problems that you will encounter a lot.

Part 9:
Missing Values and Duplicated Data

Part 7:
Data Quality

Complete Series:
Data Mining Fundamentals



About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

