# Euclidean Distance & Cosine Similarity – Data Mining Fundamentals Part 18

January 6, 2017 6:00 pm

Euclidean distance and cosine similarity are the next aspect of similarity and dissimilarity we will discuss. We will show you how to calculate the Euclidean distance and construct a distance matrix.

This series is part of our pre-bootcamp course work for our data science bootcamp.

So when we’ve got real values (and this is sort of a primer for the boot camp, a reminder for those of you who’ve been out of math classes for a while), when we’ve got continuous data, purely continuous data, we will often use Euclidean distance as the distance, as a way of measuring similarity. Actually, really, it is a way of measuring dissimilarity, because it gets higher the more unlike the objects are.

So this formula might be a little intimidating to some people, but I promise you that you are familiar with Euclidean distance; you just maybe don’t know the term. Euclidean distance is what you’d hear called the distance formula in your high school algebra classes. Most people have seen it in two dimensions, and sometimes three, but one of the very nice things about Euclidean distance is that it generalizes very naturally to as many dimensions as you want.

So in order to calculate the Euclidean distance between two data objects, we take the difference in each attribute value, square it, sum those squared differences, and take the square root.
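Written out as a formula (a reconstruction from the description above, since the slide itself is not shown), for two objects $p$ and $q$ with $n$ attributes:

```latex
\operatorname{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
```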

So for instance, we have four points plotted here: p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), and p4 = (5, 1). And we can construct a distance matrix describing how dissimilar all of our points are. Points p1 and p4 are the most dissimilar: they are the farthest apart. Points p2 and p3 are the most similar: they are the closest together. p3 is also fairly similar to p4, whereas p2 is somewhat less similar to p4.
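A minimal sketch of that calculation in Python, using the four example points from the lesson (the helper name `euclidean` is my own):

```python
from math import sqrt

def euclidean(p, q):
    """Square each attribute difference, sum them, and take the square root."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The four example points: p1, p2, p3, p4.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]

# Distance matrix: entry [i][j] is the distance between point i and point j.
matrix = [[round(euclidean(p, q), 3) for q in points] for p in points]
for row in matrix:
    print(row)
# [0.0, 2.828, 3.162, 5.099]
# [2.828, 0.0, 1.414, 3.162]
# [3.162, 1.414, 0.0, 2.0]
# [5.099, 3.162, 2.0, 0.0]
```

The largest entry is between p1 and p4 (5.099) and the smallest off-diagonal entry is between p2 and p3 (1.414), matching the reading of the matrix above.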

So another distance metric that we see, particularly in the context of documents, is called cosine similarity. We have documents, and we have turned them into term vectors. Cosine similarity is a measure of similarity, not of dissimilarity, and we can find how similar two documents are by thinking of each of them as vectors and taking their dot product. For those of you who never had it, or don’t remember your college vector calculus classes: for the dot product, you take each attribute, attribute by attribute, and you multiply them together across your two different objects.

So 3 times 1, 2 times 0, 0 times 0. Maybe this attribute is “play,” this one is “coach,” and this one is “tournament.” We take our term counts, multiply them together document to document, and sum that all up. Then we divide by the product of the magnitudes, where the magnitude of a vector is what you get when you square each attribute, add them all up, and take the square root.

So in this case we have a dot product of 5, and D1 and D2 have magnitudes of 6.481 and 2.449. We multiply those two together, divide 5 by the result, and end up with a cosine similarity of 0.315.
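Here is a short sketch of the same computation in Python. Only the first few vector components are read out in the transcript, so the full term-count vectors below are an assumption, filled in to reproduce the numbers quoted (dot product 5, magnitudes 6.481 and 2.449):

```python
from math import sqrt

def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(d1, d2))
    mag1 = sqrt(sum(a * a for a in d1))
    mag2 = sqrt(sum(b * b for b in d2))
    return dot / (mag1 * mag2)

# Assumed term-count vectors consistent with the example:
# dot product = 3*1 + 2*1 = 5, magnitudes = 6.481 and 2.449.
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

print(round(cosine_similarity(d1, d2), 3))  # 0.315
```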

Cosine similarity is a really nice metric for documents because it gives us this very clean 0 to 1 measurement (term counts are non-negative, so the cosine can’t go below 0) that suffers less from the curse of dimensionality than something like Euclidean distance does. Document vectors tend to get very, very long, because there are a lot of different words in a given language and a given document might have lots of different words in it, so cosine similarity is a way to avoid some of the curse of dimensionality. And we’ll talk about this more when we talk about encoding documents more directly in the boot camp.

**Part 19**:

Evaluating Correlation

**Part 17**:

Similarity and Dissimilarity

**Complete Series**:

Data Mining Fundamentals

**More Data Science Material**:

[Video] Time Series Forecasting in Minutes

[Blog] The 18 Best Data Science Podcasts on SoundCloud, Apple Podcast, and Spotify
