Sepehr Amini Afshar
Author
In this notebook, we talk about clustering, its variants, and how those variants behave in machine learning workflows.
Pros And cons
K-Means has the advantage that it’s pretty fast, as all we’re really doing is computing the distances between points and group centers; very few computations! It thus has a linear complexity O(n). Although it may sound fast, but as the number of data points grow larger this time becomes very noticable. because ve run O(n) computation at each iteration. but the number of iterations may vary.
On the other hand, K-Means has a couple of disadvantages. Firstly, you have to select how many groups/classes there are. This isn’t always trivial and ideally with a clustering algorithm we’d want it to figure those out for us because the point of it is to gain some insight from the data. K-means also starts with a random choice of cluster centers and therefore it may yield different clustering results on different runs of the algorithm. Thus, the results may not be repeatable and lack consistency. Other cluster methods are more consistent. It is also sensitive to outliers.
Hierarchical clustering is useful and gives better results if the underlying data has some sort of hierarchy.
In agglomerative algorithm, it is not wise to combine all data points into one cluster. The termination constraint could be reducing number of clusters to a certain number or define a distance threshold and when two clusters have a bigger distance than the threshold we do not combine those clusters.
Some common applications of hierarchical clustering:
Pros and Cons
You do not have to specify the number of clusters beforehand.
and it always generates the same clusters. K-means clustering may result in different clusters depending on the how the centroids (center of cluster) are initiated.
But It is a slower algorithm compared to k-means. Hierarchical clustering takes long time to run especially for large data sets.
We define $eps$ and $minPts$ for the algorithm. $eps$ is maximum distance between two neighbor points. $minPts$ are minimum number of points to define a cluster. There are some key concept for this algorithm.
Core Point: a point is core point if there are at least $minPts$ points (including the point itself) in its surrounding area with radius $eps$.
Border Point: a point which is reachable from a core point and have less than $minPts$ point in its surrounding area.
Outlier Point: a point which is not a core point and not reachable from any core point.
After determining the $eps$ and $minPts$ the algorithm goes as follows.
start from a random point. label it visited. check if it is a core point or not. if it is a core point we have a cluster which for example we call it A. then all the neighbors of the point goes into the queue and we porform 2 for them. if the poinst isn't a core point. mark it as a noise. a noise doesn't belong to any cluster.
pop a point from the queue. label it visited. assign it to current cluster (in this example A). add all of its unvisited neighbors to the queue. After the queue finished. select a unvisited point and perform 1.
Pros And Cons
This method does not require to specify number of clusters beforehand and also performs well with arbitrary shapes clusters. as we described DBSCAN is robust to outliers and able to detect the outliers. but determining the best $eps$ and $minPts$ isn't always a trivial task. and because these two parameters are constants the algorithm cannot perform well on the clustering tasks which the clusters with much different densities.
The math work behind the above explanation goes as below. The E step: $$ p(x) = \sum_{k=1}^{K} \pi_k N(x| \mu_k, {\sum}_k) $$ The above formula assumes that each data point is a weighted sum of gaussian (Normal) distributions. the number of $k$ is predefined and it defines the number of clusters. $$ \gamma(z_{nk}) = \frac{\pi_k N(x_n|\mu_k, {\sum}_k)}{\sum_{j=1}^{K}\pi_j N(x_n|\mu_j, {\sum}_j)} $$ In above formula $z_{nk}$ is a latent variable which implies if the $x_n$ belongs to ${cluster}_k$. the above formula shows the responsibility of the $kth$ cluster for the $nth$ data point. The M step:
After finding the posterior of all datapoints we need to estimate the mean and standard deviation for each of out distributions. $$ N_k = \sum_{n=1}^{N} \gamma(z_{nk}) $$
$$ \pi_k^{new} = \frac{N_k}{N}$$ where N is total number of data points. $$ \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})x_n $$ $$ {\sum}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T $$
All of thses steps are taken to maximize the marginal log likelihood defined as below. $$ln p(X|\mu,\sum ,\pi) = \sum_{n=1}^N ln \{ \sum_{k=1}^K \pi_k N(x_n|\mu_k , {\sum}_k) \} $$