Introduction

Clustering is basically a type of unsupervised learning method. It is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them.
in this notebook we will introduce some algorithms for clustering and note some of their characteristics.

No description has been provided for this image

No description has been provided for this image

Clustering

There are different approaches to clustering. We could differentiate these sub-categories as follows.

Porbablistic methods
Non-Probablistic methods
- Partition-based clustering
- Hierarchical clustering
- Density-based clustering

Partition-Based Clustering

Partition-based clustering techniques try to create partitions of data based on a distance measurement applied to data points. The most common algorithm of this approach is k-means clustering.
K-means clustering tries to minimize distances within a cluster and maximize the distance between different clusters. K-means algorithm is not capable of determining the number of clusters. We need to define it when creating the KMeans object which may be a challenging task.
K-means is an iterative process. It is built on expectation-maximization algorithm. After number of clusters are determined, it works by executing the following steps:

Randomly select centroids (center of cluster).
Calculate distance of all data points to all centroids.
Assign each data point to the closest centroid. data points assigned to a centroid form a cluster.
Find the new centroids of each cluster by taking the mean of all data points in the cluster.

Repeat 2-4 until convergence. Convergence could happen when the centroids stop changing or when the total cost function reaches a minimum threshold.

Pros And cons

K-Means has the advantage that it’s pretty fast, as all we’re really doing is computing the distances between points and group centers; very few computations! It thus has a linear complexity O(n). Although it may sound fast, but as the number of data points grow larger this time becomes very noticable. because ve run O(n) computation at each iteration. but the number of iterations may vary.
On the other hand, K-Means has a couple of disadvantages. Firstly, you have to select how many groups/classes there are. This isn’t always trivial and ideally with a clustering algorithm we’d want it to figure those out for us because the point of it is to gain some insight from the data. K-means also starts with a random choice of cluster centers and therefore it may yield different clustering results on different runs of the algorithm. Thus, the results may not be repeatable and lack consistency. Other cluster methods are more consistent. It is also sensitive to outliers.

Hierarchical clustering

In this approach we can start with a single cluster and dividing it to smaller clusters, or assume that each point is a cluster and combining smaller clusters to reach a single cluster. These mehtods are Divisive clustering and Agglomerative clustering respectively.

Hierarchical clustering is useful and gives better results if the underlying data has some sort of hierarchy.
In agglomerative algorithm, it is not wise to combine all data points into one cluster. The termination constraint could be reducing number of clusters to a certain number or define a distance threshold and when two clusters have a bigger distance than the threshold we do not combine those clusters.

Some common applications of hierarchical clustering:

Genetic or other biological data can be used to create a dendrogram to represent mutation or evolution levels. Phylogenetic trees are used to show evolutionary relationships based on similarities and differences.
Hierarchical clustering is also used for grouping text documents.
Another common use case of hierarchical clustering is social network analysis.
Hierarchical clustering is also used for outlier detection.

Pros and Cons
You do not have to specify the number of clusters beforehand. and it always generates the same clusters. K-means clustering may result in different clusters depending on the how the centroids (center of cluster) are initiated. But It is a slower algorithm compared to k-means. Hierarchical clustering takes long time to run especially for large data sets.

Density-Based clustering

The noted methods are efficient when our clusters shape are relatively simple and without outliers. They do not have a mechanism to distinguish outliers. in some tasks e.g. anomaly detection we require to find outliers. Density-based methods will find dense regions and outliers. DBSCAN (density-based spatial clustering of applications with noise) works as follows:

We define $eps$ and $minPts$ for the algorithm. $eps$ is maximum distance between two neighbor points. $minPts$ are minimum number of points to define a cluster. There are some key concept for this algorithm.

Core Point: a point is core point if there are at least $minPts$ points (including the point itself) in its surrounding area with radius $eps$.
Border Point: a point which is reachable from a core point and have less than $minPts$ point in its surrounding area.
Outlier Point: a point which is not a core point and not reachable from any core point.

After determining the $eps$ and $minPts$ the algorithm goes as follows.

start from a random point. label it visited. check if it is a core point or not. if it is a core point we have a cluster which for example we call it A. then all the neighbors of the point goes into the queue and we porform 2 for them. if the poinst isn't a core point. mark it as a noise. a noise doesn't belong to any cluster.
pop a point from the queue. label it visited. assign it to current cluster (in this example A). add all of its unvisited neighbors to the queue. After the queue finished. select a unvisited point and perform 1.

Pros And Cons

This method does not require to specify number of clusters beforehand and also performs well with arbitrary shapes clusters. as we described DBSCAN is robust to outliers and able to detect the outliers. but determining the best $eps$ and $minPts$ isn't always a trivial task. and because these two parameters are constants the algorithm cannot perform well on the clustering tasks which the clusters with much different densities.

Expectation Maximization (EM) clustering using Gaussan Mixture Models (GMM)

This methods is a probablistic method. One of K-means problem is using mean for deciding the center of the cluster. because of this assumption we are only able to find clusters which their shape is simple circles. GMM is more flexible than K-Means. it assumes that each cluster is a gaussian model in that space. then it tries to find the best mean and standard deviation for each cluster. It does it in an iterative manner. It performs two steps iteratively. Expectation step and maximization step. in Expectation step it tries to findout how likely each point belongs to each cluster. and in maximization step it updates the mean and standard deviation according to the new data points.

The math work behind the above explanation goes as below. The E step: $$ p(x) = \sum_{k=1}^{K} \pi_k N(x| \mu_k, {\sum}_k) $$ The above formula assumes that each data point is a weighted sum of gaussian (Normal) distributions. the number of $k$ is predefined and it defines the number of clusters. $$ \gamma(z_{nk}) = \frac{\pi_k N(x_n|\mu_k, {\sum}_k)}{\sum_{j=1}^{K}\pi_j N(x_n|\mu_j, {\sum}_j)} $$ In above formula $z_{nk}$ is a latent variable which implies if the $x_n$ belongs to ${cluster}_k$. the above formula shows the responsibility of the $kth$ cluster for the $nth$ data point. The M step:

After finding the posterior of all datapoints we need to estimate the mean and standard deviation for each of out distributions. $$ N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$

$$ \pi_k^{new} = \frac{N_k}{N}$$ where N is total number of data points. $$ \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})x_n $$ $$ {\sum}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T $$

All of thses steps are taken to maximize the marginal log likelihood defined as below. $$ln p(X|\mu,\sum ,\pi) = \sum_{n=1}^N ln \{ \sum_{k=1}^K \pi_k N(x_n|\mu_k , {\sum}_k) \} $$

References

In [ ]:

Clustering in Machine Learning