We will look at the fundamental concept of clustering, the different types of clustering methods, and their weaknesses. Clustering is an unsupervised learning technique that consists of grouping data points and creating partitions based on similarity. The ultimate goal is to find groups of similar objects.

Hello everyone, my name is Arham. In this video, we will look at the fundamental concept

of clustering and types of clustering methods.

Clustering is grouping data points and creating partitions based on similarity.

If two things are similar in some ways,

they often share other characteristics.

Almost everything we perceive is in the form of clusters. When we look up at the night sky,

we see clusters of stars and we name them after shapes they resemble.

Similarly, a cluster is a set of similar data points or a set of points

that are more similar to each other

than to points in other clusters.

It is classified as an unsupervised learning technique. And the key difference

from other machine learning techniques

is that clustering does not have a response class.

After grouping observations, a human needs to visually look at the clusters

and optionally associate meaning to each cluster.

The ultimate prediction is the set of clusters themselves, and this technique

works only with data that is in numeric form.

This means that any categorical variable needs to be converted to a numeric variable by binarization.

This is popularly known as one hot encoding.
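To make the binarization step concrete, here is a minimal sketch of one-hot encoding in plain Python. The `one_hot_encode` helper and the `colors` column are hypothetical names made up for illustration; in practice a library routine would usually do this job.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into binary indicator rows."""
    # Sort the distinct categories so the column order is reproducible.
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0] * len(categories)   # one column per category
        row[column[value]] = 1        # mark the column matching this value
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical column, used only for illustration.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)   # ['blue', 'green', 'red']
for row in encoded:
    print(row)      # [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]
```

Each row now contains only numbers, so a distance between two rows can be computed.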

There are many methods to predict clusters by calculating similarity.

And I will now introduce you to four different types of clustering methods.

The first one is centroid-based clustering.

Each cluster is represented by a centroid, and clusters are derived

based on the distance of each data point to the centroids of the clusters.

One of the most widely used centroid-based algorithms is K-Means.

K here stands for the number of clusters, and K needs to be defined by the user.

This method starts by randomly placing centroids and iterates

until the centroids reach the smallest sum of distances between each point and its center.

It minimizes the aggregate intra-cluster distances, and every cycle can result in different clusters.
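The loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not a production implementation: the `kmeans` function, its `seed` parameter, and the sample `points` are assumptions made for the example, and the starting centroids are drawn at random from the data points.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Simplified K-Means: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # random starting centroids
    labels = None

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:        # nothing moved, so we have converged
            break
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Changing `seed` changes the starting centroids. On data this well separated the final grouping stays the same, but on harder data different starts can end in different clusters.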

The second one is connectivity-based clustering.

The clusters are defined by grouping nearest neighbors, based on the distance between the data points.

The idea is that nearby data points are more related than other points farther away.

The key aspect is that one cluster contains other clusters.

Because of this structure, the clusters represent a hierarchy.

This method works in two ways. It either starts from the smallest clusters, and at each step

the two clusters that are most similar are combined into a bigger cluster in a bottom-up manner,

or it starts from the biggest cluster, and each step divides a cluster into two in a top-down manner.

Clusters are represented by a dendrogram here, which explicitly shows the hierarchy of clusters.
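The bottom-up variant can be sketched as follows. This is a minimal illustration that assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible linkage choices. The `agglomerative` function and the sample points are hypothetical.

```python
import math


def agglomerative(points, target_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]    # start with one cluster per point
    merges = []                         # merge history, as a dendrogram would record it

    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)

        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical 1-D points written as 1-tuples.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (25.0,)]
clusters, merges = agglomerative(points, target_clusters=3)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The `merges` list records which clusters were joined and at what distance, which is the information a dendrogram draws.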

The third one is distribution-based clustering.

In this method, each cluster belongs to a normal distribution.

The idea is that data points are divided based on the probability of belonging to the same normal distribution.

It is similar to centroid-based clustering, except that distribution-based clustering uses

probability to compute the clusters rather than using the mean.

The user needs to define the number of clusters.

This method goes through an iterative process of optimizing the clusters, and a popular example is

the expectation-maximization algorithm, which uses a normal distribution for clustering the data points.
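As a rough sketch of that iterative process, here is expectation maximization for a mixture of one-dimensional normal distributions in plain Python. The function names, the starting means, and the sample data are assumptions chosen for illustration; real data usually needs more careful initialization and numerical safeguards.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, initial_means, iterations=50):
    """Fit a 1-D Gaussian mixture with expectation maximization."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: probability that each point belongs to each distribution.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: re-estimate each distribution from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)   # guard against a collapsing variance
            weights[j] = nj / len(data)
    return means, variances, weights, resp


# Hypothetical data drawn around two values, with rough starting means.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = em_gmm_1d(data, initial_means=[0.0, 6.0])

# Hard labels: pick the most probable distribution for each point.
labels = [max(range(len(means)), key=lambda j: r[j]) for r in resp]
print(means)
print(labels)
```

Unlike K-Means, every point gets a probability for every cluster in `resp`; the hard labels are only a final simplification.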

The fourth one is density-based clustering.

Clusters here are defined by areas of concentrated density.

This method begins by searching for areas of dense data points

and assigns those areas to the same clusters.

It’s based on connecting points within a certain distance.

A cluster contains all linked data points within a distance threshold,

and the sparse areas are considered noise or borders between clusters.
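The video does not name a specific density-based algorithm. A well-known one is DBSCAN, and the sketch below is a simplified version of it in plain Python. The parameter names `eps` (the distance threshold) and `min_pts` (how many neighbors make an area dense), and the sample points, are assumptions made for this example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: grow clusters outward from dense points."""
    labels = [None] * len(points)       # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:         # sparse area: mark as noise for now
            labels[i] = NOISE
            continue

        labels[i] = cluster             # dense point: start a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:      # noise reachable from a dense point
                labels[j] = cluster     # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is dense too, so keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels


# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is not supplied here; it falls out of the density settings, and the isolated point is labeled as noise.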

I will now go through some clustering weaknesses.

In most clustering methods

we need to supply the number of clusters. We can use an approximation method called the elbow method

to estimate the number of clusters.
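One common way to apply the elbow method is to run K-Means for several values of K, record the within-cluster sum of squared distances for each, and look for the K where the improvement levels off. The sketch below does this for one-dimensional data. The helper name, the sample data, and the deterministic starting centroids (used instead of random ones so the output is reproducible) are assumptions for illustration.

```python
def kmeans_1d_inertia(data, k, iterations=100):
    """Run a simple 1-D K-Means and return the within-cluster sum of squares."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumption for reproducibility: spread the starting centroids
    # evenly across the sorted data instead of placing them at random.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            groups[nearest].append(x)
        updated = [sum(g) / len(g) if g else centroids[c]
                   for c, g in enumerate(groups)]
        if updated == centroids:        # centroids stopped moving
            break
        centroids = updated

    # Sum of squared distances from each point to its nearest centroid.
    return sum(min((x - c) ** 2 for c in centroids) for x in data)


# Hypothetical data with three visible groups.
data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]
inertias = {k: kmeans_1d_inertia(data, k) for k in range(1, 5)}

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

On this data the value drops sharply up to K = 3 and barely changes afterwards, so the "elbow" suggests three clusters. On real data the bend is often less obvious and still needs human judgment.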

Lastly, remember that many clustering algorithms, such as K-Means, are sensitive to outliers.

When you search for something on Google

or go on to Amazon to buy something, you are presented with links or products that are relevant

to your search by means of clustering.

All of the methods we looked at today boil down to the basic idea that we want to find groups of

similar objects. If you have any other topics you’d like us to cover, leave a comment down below.

Give us a like if you found this useful, and if you want to see more,

check out other videos at tutorials.datasciencedojo.com. Thanks for watching!

**Outline**:

– What is clustering?

– Types of clustering methods:

1. Centroid-based clustering

2. Connectivity-based clustering

3. Distribution-based clustering

4. Density-based clustering

– Clustering weaknesses

**Previous video:**

Introduction to Precision, Recall and F1

**Next video:**

Natural Language Processing 101

**More Data Science Material:**

[Video] Introduction to Data Mining

[Video] Introduction to Web Scraping

[Blog] What Machine Learning Tools Should I Learn?
