Correlation, and how to evaluate it visually, is the next step in our discussion of similarity and dissimilarity. Correlation measures the linear relationship between objects, and to evaluate it visually you build a scatter plot.
Another very common one, that I’m sure Ron in particular
is very familiar with, as a statistician, is correlation.
So correlation measures, essentially,
the linear relationship between the objects.
It tells us if object p and q move together,
is kind of the way to think about it.
So what we do with this is we standardize each
of the objects' attributes.
And then we take their dot product, scaled by the number of attributes.
And it gives us a value between negative 1 and 1,
so it's not exactly a standard similarity measurement.
But we can square it, and then it falls between 0 and 1
and becomes a standard similarity measurement.
That squared value is called the coefficient of determination.
R is the correlation.
R squared is the coefficient of determination.
The two tend to get used in data science very interchangeably.
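The standardize-then-dot-product recipe can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `standardize` and `correlation` and the sample vectors `p` and `q` are made up for the example, and it assumes the population standard deviation, so the dot product is divided by the number of attributes to land between negative 1 and 1.

```python
import math

def standardize(values):
    """Return z-scores, using the population standard deviation (assumption)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return [(v - mean) / std for v in values]

def correlation(p, q):
    """Pearson correlation: dot product of the z-scores, divided by n."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    zp, zq = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(zp, zq)) / len(p)

# Hypothetical objects p and q, each described by five attributes.
p = [1.0, 2.0, 3.0, 4.0, 5.0]
q = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(p, q)   # correlation, between -1 and 1
r_squared = r ** 2      # coefficient of determination, between 0 and 1
print(round(r, 4), round(r_squared, 4))
```

Squaring throws away the sign, so a strong negative relationship and a strong positive one end up with the same similarity score.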
So here, for those of you who haven’t
had that much statistics or who don’t remember,
is a visual example of our correlations.
So when correlation is negative 1,
which is the lowest possible value,
we have a perfectly linear relationship.
As one object goes up, the other comes down,
whatever up and down happen to mean in this context.
And with a correlation of 1, we have
the objects going up together or coming down together.
And as we get to correlations that are closer to 0,
we can see that this data clearly
has very little linear relationship.
Whereas if we get closer to 1 and negative 1,
we see a sharper and sharper linear relationship.
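To put numbers on the cases just described, here is a small sketch with hand-picked data. The helper name `pearson_r` and all of the values are invented for illustration; they are not the data behind the plots in the talk.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from sums of deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]

# Hypothetical y-values chosen to hit each case.
cases = {
    "perfect negative": [10, 8, 6, 4, 2],   # y falls as x rises
    "perfect positive": [3, 5, 7, 9, 11],   # y rises with x
    "noisy positive":   [1, 3, 2, 5, 4],    # upward trend with scatter
    "no linear trend":  [2, 1, 3, 1, 2],    # deviations cancel out
}

results = {name: pearson_r(x, y) for name, y in cases.items()}
for name, r in results.items():
    print(f"{name}: r = {r:.2f}")
```

Plotting each pair as a scatter plot, for example with matplotlib's `scatter`, would give the kind of picture described above: a tight line at either extreme and a shapeless cloud near 0.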
Correlation is one of the metrics
that we use to evaluate regression models.
So we’ll talk about it more in that context.
But I just wanted to make sure we introduced it
so people had heard the word if you
haven't had much of a statistics background.
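As a preview of the regression context, the sketch below fits a least-squares line to two hypothetical attributes and checks that the R squared of that fit matches the squared correlation. It assumes simple linear regression with one predictor and an intercept; the data and the names `fit_line` and `r_squared_of_fit` are made up for the example.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared_of_fit(x, y):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def pearson_r(x, y):
    """Pearson correlation, for comparison with the regression fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

fit_r2 = r_squared_of_fit(x, y)
corr_sq = pearson_r(x, y) ** 2
print(round(fit_r2, 4), round(corr_sq, 4))
```

The two numbers agree only in this one-predictor setup; with several predictors, R squared is still computed from the residuals but no longer equals the square of any single pairwise correlation.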