Euclidean distance and cosine similarity are the next aspects of similarity and dissimilarity we will discuss. We will show you how to calculate the Euclidean distance and construct a distance matrix.
This series is part of our pre-bootcamp course work for our data science bootcamp.
So when we’ve got real values–
and this is sort of a primer for the boot
camp, a reminder for those of you
who’ve been out of math classes for a while–
when we’ve got continuous data, purely continuous data,
we will often use Euclidean distance
as a way of measuring similarity, or really,
as a way of measuring dissimilarity,
because it’s higher the more unlike the objects are.
So this formula might be a little
intimidating to some people.
But I promise you that you are familiar with Euclidean distance.
You just maybe don’t know the term.
Euclidean distance is what you’d hear
called a distance formula, just the distance formula,
in your high school algebra classes.
And most people have seen it in two dimensions, and sometimes in three.
But one of the very nice things about the Euclidean distance
is that it generalizes very naturally to as many dimensions as you need.
So in order to calculate the Euclidean distance between two
data objects, we take the difference in each attribute
value, square it, sum those squares, and take the square root.
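The steps just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `euclidean_distance` and the sample points are assumptions chosen for the example.

```python
import math


def euclidean_distance(a, b):
    """Distance between two data objects given as equal-length sequences of numbers.

    Hypothetical helper for illustration: difference per attribute,
    square it, sum the squares, take the square root.
    """
    if len(a) != len(b):
        raise ValueError("both objects need the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two dimensions: the familiar 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# The same function works unchanged in four dimensions.
print(euclidean_distance((1, 2, 3, 4), (1, 2, 3, 4)))  # 0.0
```

Because the function only loops over attribute pairs, nothing changes when the objects have more attributes, which is the sense in which the formula generalizes to more dimensions.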
So for instance, we have four points here,
p1 at (0, 2), p2 at (2, 0), p3 at (3, 1), and p4 at (5, 1),
all plotted on the same plane.
And we can construct a distance matrix
describing how dissimilar all of our points are.
So p1 and p4 are the most dissimilar.
They’re the farthest apart, whereas p2 and p3 are the most similar.
They’re the closest together.
p3 is also fairly similar to p4,
whereas p2 is somewhat less similar to p4.
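To make the distance matrix concrete, here is a small Python sketch that builds it for the four points above. The labels p1 through p4 (assigned in the order the points are listed) and the helper names are assumptions made for readability, not part of the original material.

```python
import math

# The four example points, labeled in the order they are listed (labels assumed).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def euclidean_distance(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(pts):
    """Return a nested dict: matrix[row][col] is the distance between two points."""
    names = list(pts)
    return {r: {c: euclidean_distance(pts[r], pts[c]) for c in names} for r in names}


matrix = distance_matrix(points)

# Print the matrix as a small table, rounded to three decimals.
print("    " + "".join(f"{name:>7}" for name in matrix))
for row_name, row in matrix.items():
    print(f"{row_name:<4}" + "".join(f"{d:7.3f}" for d in row.values()))
```

In the printed table the largest entry is the p1 to p4 distance (about 5.099) and the smallest off-diagonal entry is p2 to p3 (about 1.414). The diagonal is all zeros and the matrix is symmetric, since the distance from a point to itself is zero and distance does not depend on direction.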
So another measure that we see, particularly
in the context of documents, is called cosine similarity.
Once we have turned documents into term vectors,
cosine similarity is a measure of similarity.
We can find how similar the two documents are
by thinking of each of them as vectors and taking their dot product,
which, for those of you who never
had it or don’t remember working with vectors in college, means
you take each attribute, attribute by attribute,
and you multiply them together across your two vectors.
So 3 times 1, 2 times 0, 0 times 0.
Maybe this is play and this is coach and this is tournament.
And so we’ll do our count, and then we’ll
multiply them all together document to document, and add up the results.
And then we end up dividing by the product of the magnitudes.
So the magnitude of each vector is
just: you square each attribute, add them all up,
and take the square root.
So in this case we have a dot product of 5.
We have a D1 and a D2 of 6.481 and 2.449.
Those are our magnitudes.
So we multiply these two together and divide 5 by that.
And we end up with a cosine similarity of 0.315.
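The calculation can be sketched in Python as follows. The transcript does not list the full term vectors, so `d1` and `d2` below are hypothetical term-count vectors, chosen so that they start with the products read out above (3 times 1, 2 times 0, 0 times 0) and give a dot product of 5. The function name is also an assumption.

```python
import math


def cosine_similarity(a, b):
    """Dot product of two term vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))  # square each attribute, add, square root
    mag_b = math.sqrt(sum(y * y for y in b))
    if mag_a == 0 or mag_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (mag_a * mag_b)


# Hypothetical term counts for two documents (assumed for illustration).
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

print(sum(x * y for x, y in zip(d1, d2)))   # 5
print(round(cosine_similarity(d1, d2), 3))  # 0.315
```

Under these assumed vectors the magnitudes come out to roughly 6.481 (the square root of 42) and 2.449 (the square root of 6), so the similarity is 5 divided by about 15.875, or roughly 0.315.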
Cosine similarity is a really nice metric for documents
because, with term counts that are never negative,
it gives us this very clean 0 to 1 measurement that
suffers less from the curse of dimensionality
than something like Euclidean distance does.
Document vectors tend to get very, very
long, because there are a lot of different words in a given
language, and given documents might have lots
of different words in them, so cosine similarity
is a way to avoid some of the curse of dimensionality.
And we’ll talk about this more when
we talk about encoding documents more directly in the boot camp.