# Clustering Introduction

We will look at the fundamental concept of clustering, different types of clustering methods, and their weaknesses. Clustering is an unsupervised learning technique that groups data points into partitions based on similarity. The ultimate goal is to find groups of similar objects.

#### Transcript

Hello everyone, my name is Arham. In this video, we will look at the fundamental concept
of clustering and types of clustering methods.
Clustering is grouping data points into partitions based on similarity.
If two things are similar in some ways, they often share other characteristics.
Almost everything we perceive is in the form of clusters. When we look up at the night sky,
we see clusters of stars, and we name them after the shapes they resemble.
Similarly, a cluster is a set of similar data points, or a set of points
that are more similar to each other
than to points in other clusters.
Clustering is classified as an unsupervised learning technique, and the key difference
from supervised machine learning techniques
is that clustering does not have a response class.
After grouping observations, a human needs to visually look at the clusters
and optionally associate meaning to each cluster.
The ultimate prediction is the set of clusters themselves, and this technique
works only with data that is in numeric form.
This means that any categorical variable needs to be converted to numeric variables by binarization,
which is popularly known as one-hot encoding.
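The binarization step just described can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the video, and `one_hot_encode` is a hypothetical helper name:

```python
def one_hot_encode(values):
    """Binarize a categorical column: one 0/1 column per distinct category."""
    categories = sorted(set(values))  # fixed column order, e.g. blue, green, red
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "red", "blue"]
print(one_hot_encode(colors))
# → [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In practice a library encoder (for example from pandas or scikit-learn) would be used, but the idea is exactly this: one binary column per category.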
There are many methods for forming clusters by calculating similarity.
I will now introduce you to four different types of clustering methods.
The first one is centroid-based clustering.
Each cluster is represented by a centroid, and data points are assigned
to clusters based on their distance to the centroids.
One of the most widely used centroid-based algorithms is K-Means.
K here stands for the number of clusters and needs to be defined by the user.
This method starts by randomly placing centroids and iterates
until the sum of distances from each point to its nearest centroid is minimized.
It minimizes the aggregate intra-cluster distance, and because of the random initialization, different runs can result in different clusters.
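The iterate-until-converged loop just described can be sketched from scratch in plain Python. This is a minimal illustration of the K-Means idea, not a production implementation (for real work you would reach for a library such as scikit-learn):

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-Means sketch: random initial centroids, then alternate
    an assignment step and a centroid-update step until nothing moves."""
    random.seed(seed)
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: no centroid moved
            break
        centroids = new
    return centroids, clusters

# Two obvious groups: one near (1, 1.5), one near (8.5, 8.5).
pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = kmeans(pts, k=2)
```

On this toy data the centroids settle at (1.25, 1.5) and (8.5, 8.5), the means of the two groups.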
The second one is connectivity-based clustering.
Clusters are defined by grouping nearest neighbors based on the distance between data points.
The idea is that nearby data points are more related than points farther away.
The key aspect is that one cluster can contain other clusters.
Because of this nested structure, the clusters represent a hierarchy.
This method works in two ways. It either starts from the smallest clusters and, at each step,
combines the two most similar clusters into a bigger cluster in a bottom-up (agglomerative) manner,
or it starts from one big cluster and at each step divides a cluster in two in a top-down (divisive) manner.
Clusters are represented by a dendrogram here, which explicitly shows the hierarchy of clusters.
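The bottom-up variant can be sketched as repeatedly merging the two closest clusters. This is a minimal single-linkage illustration, not code from the video; the recorded merge order is exactly the information a dendrogram would draw:

```python
import math

def agglomerative(points, target_k):
    """Bottom-up (agglomerative) clustering sketch with single linkage:
    repeatedly merge the two closest clusters until target_k remain."""
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []                        # the hierarchy: order in which clusters merged
    while len(clusters) > target_k:
        # Find the pair of clusters whose closest points are nearest.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(math.dist(p, q)
                               for p in clusters[ab[0]] for q in clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]  # combine into a bigger cluster
        del clusters[j]
    return clusters, merges

clusters, merges = agglomerative([(0, 0), (0, 1), (5, 5), (6, 5)], target_k=2)
```

On this toy data the two tight pairs merge first, leaving the clusters {(0, 0), (0, 1)} and {(5, 5), (6, 5)}.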
The third one is distribution-based clustering.
In this method, each cluster is assumed to follow a distribution, typically a normal (Gaussian) distribution.
The idea is that data points are grouped based on the probability of belonging to the same distribution.
It is similar to centroid-based clustering, except that distribution-based clustering uses
probabilities to compute the clusters rather than only the mean.
The user still needs to define the number of clusters.
This method goes through an iterative process of optimizing the clusters, and a popular example is the
expectation-maximization (EM) algorithm, which fits normal distributions to the data points.
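The EM iteration just mentioned can be sketched for the simplest case: a two-component, one-dimensional Gaussian mixture. This is a stripped-down illustration under simplifying assumptions (crude initialization, fixed iteration count, no convergence test), not the full algorithm:

```python
import math

def em_1d(data, iterations=50):
    """Sketch of expectation-maximization for a 2-component 1-D Gaussian mixture."""
    # Crude initialization: put the two means at the data's extremes.
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: each component's responsibility (probability) for each point.
        resp = []
        for x in data:
            dens = [weight[k] * math.exp(-((x - mu[k]) ** 2) / (2 * sigma[k] ** 2))
                    / (sigma[k] * math.sqrt(2 * math.pi)) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)  # guard against collapse to zero
    return mu, sigma, weight

data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
mu, sigma, weight = em_1d(data)
```

On this toy data the two means converge to roughly 1.0 and 9.0, the centers of the two groups; unlike K-Means, each point gets a probability of membership rather than a hard assignment.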
The fourth one is density-based clustering.
Clusters here are defined by areas of concentrated density.
This method begins by searching for areas of dense data points
and assigns those areas to the same clusters.
It is based on connecting points that lie within a certain distance of each other:
a cluster contains all linked data points within the distance threshold,
while sparse areas are treated as noise or as borders between clusters.
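The connect-points-within-a-distance idea can be sketched as a simplified DBSCAN-style procedure. This is illustrative only; `eps` (the distance threshold) and `min_pts` (the density requirement) are the usual parameter names for this family of algorithms:

```python
import math

def density_cluster(points, eps, min_pts):
    """Simplified DBSCAN-style sketch: grow clusters from dense points;
    points in sparse areas are labeled as noise (-1)."""
    labels = {p: None for p in points}
    cluster = -1
    for p in points:
        if labels[p] is not None:
            continue
        neighbours = [q for q in points if math.dist(p, q) <= eps]
        if len(neighbours) < min_pts:
            labels[p] = -1            # provisionally noise (may become a border point)
            continue
        cluster += 1                  # p is dense: start a new cluster from it
        labels[p] = cluster
        queue = neighbours
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster   # border point: reachable from a dense point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neigh = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neigh) >= min_pts:
                queue.extend(q_neigh)  # q is dense too: keep expanding the cluster
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
labels = density_cluster(pts, eps=1.5, min_pts=3)
```

Here the three nearby points link into one cluster, while the isolated point (10, 10) has no dense neighbourhood and is labeled noise. Note that, unlike K-Means, the number of clusters is not supplied by the user; it emerges from the density parameters.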
I will now go through some clustering weaknesses.
In most clustering methods,
we need to supply the number of clusters ourselves. We can use an approximation technique to estimate
the number of clusters, called the elbow method.
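One simple way to read the elbow off a curve of within-cluster sums of squares (inertia) is to look for the k after which the improvement flattens out. This heuristic sketch is just one way to automate that reading; the inertia values below are made-up numbers for illustration:

```python
def elbow_k(inertias):
    """Pick the elbow from inertia values for k = 1, 2, 3, ...:
    the k where the drop in inertia decelerates the most."""
    # How much inertia improves when going from k to k+1.
    drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
    # How sharply that improvement shrinks from one step to the next.
    decel = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    return decel.index(max(decel)) + 2  # decel[0] corresponds to k = 2

# Hypothetical inertias for k = 1..5: big gains up to k = 3, then almost none.
k = elbow_k([1000, 600, 150, 140, 135])
```

With these made-up values the curve bends at k = 3: going from 2 to 3 clusters removes most of the remaining inertia, while further clusters barely help.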
Lastly, remember that many clustering algorithms, K-Means in particular, are sensitive to outliers.
When you search for something on Google
or go on to Amazon to buy something, you are presented with links or products that are relevant
to your search by means of clustering.
All of the methods we looked at today boil down to the basic idea that we want to find groups of
similar objects. If you have any other topics you’d like us to cover leave a comment down below.
Give us a like if you found this useful, and if you want to see more,
check out other videos at tutorials.datasciencedojo.com. Thanks for watching!

Outline:

– What is clustering?
– Types of clustering methods:
1. Centroid-based clustering
2. Connectivity-based clustering
3. Distribution-based clustering
4. Density-based clustering
– Clustering weaknesses

Previous video:
Introduction to Precision, Recall and F1

Next video:
Natural Language Processing 101

More Data Science Material:
[Video] Introduction to Data Mining
[Video] Introduction to Web Scraping
[Blog] What Machine Learning Tools Should I Learn?
