# Histograms & Box Plots – Data Mining Fundamentals Part 23

January 6, 2017 11:00 pm

Histograms and box plots are two of the most popular visualization techniques. In this tutorial, we discuss the unique benefits of each and give examples of when to use them for data exploration and visualization.

So I’m going to take a quick shot through a couple of different visualization techniques right now, different types of graphs. We’re going to go into much greater detail on this during the boot camp. Pretty much all of the first day of the boot camp is different visualizations: how we use them, why we use them, all of that sort of thing.

One of the most common and popular types of visualization is the histogram. Histograms show the distribution of values of a single variable. We divide the values into bins and then count the number of objects in each bin. The height of a bar on our graph indicates the number of objects in a given bin.
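
The binning-and-counting step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the tooling used in the talk: the function name `histogram_counts` and the sample values are made up for the example, and it assumes equal-width bins spanning the minimum to the maximum value.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: put everything in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical measurements, made up for illustration (not real iris data).
sample = [0.1, 0.2, 0.2, 0.3, 1.0, 1.3, 1.4, 1.5, 1.8, 2.5]
counts = histogram_counts(sample, 3)
for i, c in enumerate(counts):
    # One '#' per object in the bin, so the row length is the bar height.
    print(f"bin {i}: {'#' * c}")
```

Each row's length plays the role of the bar height. In practice, libraries such as NumPy (`numpy.histogram`) and Matplotlib (`matplotlib.pyplot.hist`) do this counting for you.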

One important point about a histogram is that its shape depends on the number of bins you use. You usually have to experiment with different numbers of bins to extract the most interesting information.
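
To see how much the bin choice matters, here is a small sketch that draws the same data with two different bin widths. Choosing a bin width over a fixed range is equivalent to choosing the number of bins. The data values and the helper names `bin_counts` and `text_histogram` are assumptions made up for this example, and the values are kept as whole millimetres so that bin boundaries are exact.

```python
from collections import Counter

# Hypothetical petal widths in millimetres, made up for illustration.
widths_mm = [1, 2, 2, 2, 3, 4, 10, 13, 13, 14, 15, 15, 18, 20, 23, 25]


def bin_counts(values, bin_width):
    """Map bin index k (values from k*bin_width up to (k+1)*bin_width - 1) to a count."""
    return Counter(v // bin_width for v in values)


def text_histogram(values, bin_width):
    """Render one row of '#' marks per bin, including empty bins."""
    counts = bin_counts(values, bin_width)
    rows = []
    for k in range(min(counts), max(counts) + 1):
        # A Counter returns 0 for bins that received no values.
        rows.append(f"{k * bin_width:>3} mm | {'#' * counts[k]}")
    return "\n".join(rows)


for bw in (15, 5):
    print(f"bin width = {bw} mm")
    print(text_histogram(widths_mm, bw))
```

With the wide bins the data looks like one lump split in two. With the narrow bins an empty bin appears between the small and large values, which is the kind of structure that only shows up after some experimentation.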

So here we see two histograms of petal width, drawn with different bin widths. The data is actually from that iris data set we touched on briefly earlier. We can see more clearly in the second than in the first that we have two very clear spikes, maybe a third little spike here, and then a sort of long, messy tail over on this side.

You can also construct two-dimensional histograms that show the joint distribution of two different attributes. Here we’re counting the number of objects that fall into each combination of a petal width bin and a petal length bin, and that count gives the height of the bar. Two-dimensional histograms are really nice for exploring correlations between different attributes.
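
The joint counting behind a two-dimensional histogram can be sketched as follows. The pairs below are hypothetical (width, length) measurements in millimetres, made up for illustration, and `joint_counts` is a name chosen for this example.

```python
from collections import Counter


def joint_counts(pairs, width_bin, length_bin):
    """Count objects per (width bin index, length bin index) cell."""
    return Counter((w // width_bin, l // length_bin) for w, l in pairs)


# Hypothetical (petal width, petal length) pairs in mm, not real iris data.
pairs = [
    (2, 14), (2, 13), (3, 15), (4, 17),
    (13, 40), (14, 45), (15, 47), (18, 51),
    (23, 59), (25, 60),
]

cells = joint_counts(pairs, width_bin=10, length_bin=20)
for (wi, li), n in sorted(cells.items()):
    print(f"width bin {wi}, length bin {li}: {n} objects")
```

In this made-up data the occupied cells run from small width and small length to large width and large length, which is the pattern you would expect when two attributes are positively correlated. NumPy offers the same counting through `numpy.histogram2d`.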

Another very common visualization technique is the box plot. The box plot displays the distribution of data. We’ve got a little box here where the edges of the box are the 75th and 25th percentiles. The median, or the 50th percentile, is shown as a bar in the middle. Then we show the 10th and 90th percentiles as the lines below and above the box. And if there are any outliers, which are points a certain distance past the 10th and 90th percentiles, we mark them explicitly.
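
The numbers a box plot of this kind draws can be computed with Python's standard library. This is a sketch under stated assumptions: the data is made up, the helper names are hypothetical, percentiles use linear interpolation (`method="inclusive"`), and the outlier distance `margin` is a user-chosen value, because the talk does not say how far past the 10th and 90th percentiles a point must be.

```python
import statistics


def box_plot_summary(values):
    """Return the 10th, 25th, 50th, 75th and 90th percentiles of values."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in (10, 25, 50, 75, 90)}


def find_outliers(values, summary, margin):
    """Values more than `margin` below the 10th or above the 90th percentile.

    `margin` is an assumed distance; the source does not specify one.
    """
    low, high = summary[10] - margin, summary[90] + margin
    return [v for v in values if v < low or v > high]


# Hypothetical data, made up for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
summary = box_plot_summary(data)
outliers = find_outliers(data, summary, margin=5)
print(summary)   # box edges (25, 75), median (50), whisker ends (10, 90)
print(outliers)  # points that would be marked individually
```

Note that many tools use a different convention by default: Matplotlib's `boxplot`, for example, places whiskers at 1.5 times the interquartile range beyond the box unless told otherwise.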

So for instance, here’s an example with that iris data again: sepal length, sepal width, petal length, and petal width, each shown as its own box plot. We’ve got the values in centimeters on the left side, and each attribute has its own distribution. We can see that the sepal measurements are pretty well clustered together. Petal length is all over the place, and petal width is a little less all over the place. So box plots are very easy to read and very good for visualizing that kind of distribution.

**Part 24**: Scatter Plots & Contour Plots

**Part 22**: Center and Spread Measurement

**Complete Series**: Data Mining Fundamentals

