Data Aggregation – Data Mining Fundamentals Part 11
January 6, 2017 11:00 am
Data aggregation is our first data cleaning strategy. Aggregation is combining two or more attributes (or objects) into a single attribute (or object).
So the first strategy, and this one is first
because we see it a lot, is aggregation.
So we’ll combine two or more attributes or objects
into a single attribute or object.
So this can be where we are trying
to reduce the scale of our data, reduce the number
of attributes or objects.
So we could, for instance, combine two attributes:
combine a high-temperature attribute
and a low-temperature attribute in order to get a single temperature attribute.
We’ve now combined two columns into one column.
Basically every algorithm has some time dependence
on the number of attributes it runs on,
and certainly in terms of visualization and exploration,
there’s only so many attributes that you
can look at at the same time or hold in your head
at the same time.
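As a rough sketch of the two-columns-into-one idea, the snippet below collapses a high and a low temperature into a single value. The transcript does not say how the two are combined, so taking the midpoint is an assumption here, and the records and field names (`high_temp`, `low_temp`, `temp`) are made up for illustration.

```python
# Hypothetical daily records; the field names are assumptions for illustration.
records = [
    {"day": "Mon", "high_temp": 30.0, "low_temp": 18.0},
    {"day": "Tue", "high_temp": 27.0, "low_temp": 15.0},
]

def combine_temperatures(record):
    """Replace the high/low pair with one midpoint temperature attribute."""
    # Keep every field except the two being aggregated.
    combined = {k: v for k, v in record.items()
                if k not in ("high_temp", "low_temp")}
    # Assumed aggregation rule: the midpoint of high and low.
    combined["temp"] = (record["high_temp"] + record["low_temp"]) / 2
    return combined

aggregated = [combine_temperatures(r) for r in records]
print(aggregated)
```

Each record now carries one temperature column instead of two. A different rule (for example the high minus the low) would slot into the same function.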
On the other hand, we might want to combine
a bunch of different objects.
If we have users who have many different sessions,
or who navigate to many different pages,
we’ll have dwell times that are different for every page
and every session, and we might want
to average all those dwell times in order
to get one data object that is the average user
behavior for each user, rather than the 10
or 15 different sessions for that user.
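The dwell-time case can be sketched as a simple group-and-average. The rows, user names, and the `average_dwell_by_user` helper below are hypothetical; the point is only that many per-page rows collapse into one object per user.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (user_id, page, dwell_seconds) rows, one per page view.
sessions = [
    ("alice", "/home", 12.0),
    ("alice", "/pricing", 48.0),
    ("alice", "/docs", 30.0),
    ("bob", "/home", 5.0),
    ("bob", "/docs", 15.0),
]

def average_dwell_by_user(rows):
    """Collapse many page-level rows into one average dwell time per user."""
    grouped = defaultdict(list)
    for user_id, _page, dwell in rows:
        grouped[user_id].append(dwell)
    return {user_id: mean(times) for user_id, times in grouped.items()}

per_user = average_dwell_by_user(sessions)
print(per_user)
```

Five data objects go in and two come out, one per user, which is the change of scale being described.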
So the reason why we do this is exactly that.
If we want to average user times,
for instance, we’re changing our scale.
We want to aggregate cities into regions, states, or countries.
We want to aggregate dwell times across sessions
or across pages.
And one of the big advantages of aggregation,
particularly averaging, is that aggregated data
tends to have less variability.
It’s a way of reducing the effect of noise.
Well, more precisely, it’s a way of reducing the effect of random noise.
If you’ve got human labeling errors,
then you’ve still got human labeling errors after aggregating.
If you’ve got sampling procedure errors,
you still have sampling procedure errors.
But if you’ve got random errors, say
random noise, then aggregating the data will very much
tend to reduce that.
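A small simulation can make the distinction concrete. For independent random noise, the standard deviation of an average of n values shrinks like 1/sqrt(n), while a systematic error passes straight through the average. Everything in the sketch below is an assumption chosen for illustration: the true value, the size of the bias (standing in for something like a labeling or sampling error), and the group size.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 10.0  # hypothetical quantity being measured
BIAS = 0.5         # hypothetical systematic error, same in every measurement
GROUP_SIZE = 25    # raw measurements averaged into one aggregated value
N_GROUPS = 2000

def measure():
    """One raw measurement: truth + systematic bias + random noise (sd = 1)."""
    return TRUE_VALUE + BIAS + random.gauss(0.0, 1.0)

raw = [measure() for _ in range(N_GROUPS * GROUP_SIZE)]

# Aggregate: replace each block of GROUP_SIZE raw values with its average.
averaged = [mean(raw[i:i + GROUP_SIZE])
            for i in range(0, len(raw), GROUP_SIZE)]

raw_spread = pstdev(raw)
averaged_spread = pstdev(averaged)
print(f"spread of raw values:      {raw_spread:.3f}")
print(f"spread of averaged values: {averaged_spread:.3f}")
print(f"mean of averaged values:   {mean(averaged):.3f} (still offset by the bias)")
```

The spread of the averaged values should come out much smaller than the spread of the raw values, but their mean stays near the true value plus the bias. Aggregation shrinks the random part only.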
So as an example of that, and I really like this next page,
these two are graphs of precipitation in Australia.
So these are histograms.
So the height of each block is the number
of locations where precipitation was measured which
had, in this case, the standard deviation given by the X value
when we measured it on a monthly basis.
So we’re measuring the average monthly precipitation
and measuring the standard deviation
of that monthly precipitation at 500 different land
locations in Australia.
When we do that on a monthly basis,
we get this very wide spread of standard deviations.
Some places are very consistent in their rainfall.
There’s these two peaks, and then you
have this long tail of places that are just
all over the place in terms of the variability.
On the other hand, if we take those exact same land locations
and instead find the average yearly precipitation,
and the standard deviation of that,
we get this very nice single-peaked, or mostly
single-peaked, very short-tailed histogram.
We’ve significantly reduced our variability.
We’ve reduced the random noise in our dataset
by increasing the scale, that is, by aggregating our data
over a longer time period.
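The monthly-versus-yearly comparison can be imitated with synthetic data. The sketch below does not use the Australian measurements. It invents 50 locations with made-up rainfall levels, and it assumes the yearly value is the mean of that year's 12 monthly values. For each location it computes the standard deviation at both time scales, which is the quantity the two histograms plot.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the sketch is repeatable

N_LOCATIONS = 50  # synthetic locations, not the real measurement sites
N_YEARS = 12
MONTHS = 12

def simulate_location():
    """Synthetic monthly precipitation for one location: a list of years."""
    base = random.uniform(2.0, 10.0)     # typical monthly level (made up)
    noise_sd = random.uniform(0.5, 4.0)  # how erratic this location is
    return [[max(0.0, random.gauss(base, noise_sd)) for _ in range(MONTHS)]
            for _ in range(N_YEARS)]

monthly_sds, yearly_sds = [], []
for _ in range(N_LOCATIONS):
    years = simulate_location()
    monthly_values = [m for year in years for m in year]
    yearly_values = [mean(year) for year in years]  # 12 months -> 1 value
    monthly_sds.append(pstdev(monthly_values))
    yearly_sds.append(pstdev(yearly_values))

print(f"monthly scale: mean sd = {mean(monthly_sds):.2f}, "
      f"max sd = {max(monthly_sds):.2f}")
print(f"yearly scale:  mean sd = {mean(yearly_sds):.2f}, "
      f"max sd = {max(yearly_sds):.2f}")
```

A histogram of `monthly_sds` next to one of `yearly_sds` would show the same qualitative picture: the yearly values cluster in a narrower, lower range because each one averages over twelve noisy months.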
So that’s one of the big reasons that we use aggregation.