Histograms & Box Plots – Data Mining Fundamentals Part 23

Histograms and box plots are among the most popular visualization techniques. In this tutorial, we discuss the unique benefits of both and provide examples of when you can use each for data exploration and visualization.

So I’m going to take a quick shot
through a couple of different visualization techniques
right now, different types of graphs.
We’re going to go in much greater detail
into this during the boot camp.
Pretty much almost all of the first day of the boot camp
is different visualizations.
How we use them, why we use them, all of that
sort of thing.
One of the most common and popular types of visualization
is a histogram.
So histograms show the distribution
of values of a single variable.
We divide the values into bins, and then count the number
of objects in each bin.
And the height of a bar on our graph
indicates the number of objects in a given bin.
So one of the important pieces of a histogram
is that the shape of the histogram
is going to depend on the number of bins you use.
You usually have to experiment with different numbers of bins
to extract the most interesting information.
So here we see two graphs of the petal width
of some data set of flowers, with different bin widths.
It’s actually from that iris data
set we were touching on briefly earlier.
So we can see here more clearly in the second than in the first
that we have two very clear spikes.
Maybe a third little spike here, and then sort of a long messy
tail over on this side.
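
The binning step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code behind the graphs in the video. The `histogram` function is a hypothetical helper, and the petal widths are made-up values rather than the real iris measurements.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    if width == 0:
        # All values are identical, so they all share the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Made-up petal widths in centimeters (illustrative only, not the iris data).
petal_widths = [0.1, 0.2, 0.2, 0.2, 0.3, 0.4, 1.0, 1.3,
                1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]

# The same data summarized with two different numbers of bins.
for num_bins in (3, 6):
    print(num_bins, "bins:", histogram(petal_widths, num_bins))
```

Printing the counts for a few different values of `num_bins` is a quick way to see how the apparent shape of the distribution changes before you commit to a plot.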
You can also construct two-dimensional histograms
that show the joint distribution of two
different attributes.
So here we’re counting the number of objects
that fall into each combination of a petal width bin
and a petal length bin.
And the count for each combination
gives us the height of that bar.
Two dimensional histograms are really
nice for exploring correlations between different attributes.
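
A two-dimensional histogram can be sketched the same way: each object is assigned to a cell made of one bin per attribute, and the cells are counted. This is a rough standard-library sketch. The names `bin_index` and `histogram_2d` are hypothetical helpers, and the sample measurements are made up for illustration.

```python
from collections import Counter


def bin_index(value, lo, hi, num_bins):
    """Return the index of the equal-width bin that value falls into."""
    width = (hi - lo) / num_bins
    if width == 0:
        return 0
    # Clamp so the maximum value lands in the last bin.
    return min(int((value - lo) / width), num_bins - 1)


def histogram_2d(xs, ys, num_bins):
    """Count objects per (x bin, y bin) cell for two paired attributes."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, x_lo, x_hi, num_bins),
                bin_index(y, y_lo, y_hi, num_bins))
        counts[cell] += 1
    return counts


# Made-up paired measurements in centimeters (illustrative only).
petal_widths = [0.2, 0.2, 0.3, 1.3, 1.4, 1.5, 2.0, 2.3]
petal_lengths = [1.4, 1.5, 1.3, 4.0, 4.5, 4.7, 5.5, 6.0]

for cell, count in sorted(histogram_2d(petal_widths, petal_lengths, 3).items()):
    print("cell", cell, "count", count)
```

If most of the counts sit in cells where both bin indices rise together, that suggests the two attributes are positively correlated.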
Another very common visualization technique
is the box plot.
The box plot displays the distribution of data.
We’ve got a little box here where the edges of the box
are the 75th and 25th percentiles.
The median, or the 50th percentile,
is shown as a middle bar.
Then we show the 10th and 90th percentiles below and above the box.
And if there are any outliers, which are points
a certain distance past the 90th and 10th percentiles,
we’ll mark them explicitly.
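
The five percentiles that make up this kind of box plot can be computed directly. The sketch below uses linear interpolation between neighboring sorted values, which is one common convention and an assumption here, since the video does not say which percentile method it uses. The helpers `percentile` and `box_plot_summary` are hypothetical names, and the outlier cut-off is left out because the distance is not specified.

```python
def percentile(sorted_values, p):
    """Linearly interpolated p-th percentile (0 <= p <= 100) of sorted data."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def box_plot_summary(values):
    """Return the 10th, 25th, 50th, 75th and 90th percentiles of values."""
    data = sorted(values)
    return {p: percentile(data, p) for p in (10, 25, 50, 75, 90)}


# Made-up sepal lengths in centimeters (illustrative only).
sepal_lengths = [4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.7, 7.0, 7.7]

for p, value in box_plot_summary(sepal_lengths).items():
    print(f"{p}th percentile: {value}")
```

The 25th and 75th percentiles give the edges of the box, the 50th gives the middle bar, and the 10th and 90th give the marks beyond the box.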
So for instance, here’s an example
of that iris data again, sepal length and sepal
width, petal length and petal width shown in various box
plots.
So we’ve got centimeters on the left side, the values
on the left side, and then each attribute
has its own distribution.
And we can see that the sepals are pretty well,
you know, clustered together.
Petal length is all over the place
and petal width is a little less all over the place.
So box plots are very easy, very good for visualizing
that kind of distribution.

Part 24:
Scatter Plots & Contour Plots

Part 22:
Center and Spread Measurement

Complete Series:
Data Mining Fundamentals

More Data Science Material:
[Video] Introduction to Machine Learning with R and caret
[Blog] Which Machine Learning Tools Should I Learn?

About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
