What is a clustering algorithm?
A clustering algorithm is an algorithm that is used to identify and group similar objects in data.
Clustering
The aim of clustering, or cluster analysis, is to find homogeneous groups of objects within a data set, such that the objects within a group are more similar to each other than to the objects in other groups.
How do clustering algorithms work?
Clustering algorithms use a number of techniques to group similar data points based on their characteristics and properties.
Grouping in hierarchical clusters
The grouping can take the form of hierarchical clusters or flat clusters.
Collect data for analysis and group it into clusters
Clustering algorithms work on collected data: values or other information that can serve as a basis for analysis or decision-making.
A cluster is a subset of the data set that can be treated as a unit. In principle, clustering algorithms can be applied to any type of data for which a meaningful measure of similarity can be defined.
Functionality of a clustering algorithm in pseudocode
Pseudocode is a good way of illustrating how a clustering algorithm works.
Pseudocode for the K-Means clustering algorithm
1. initialize k cluster centers randomly in the data space.
2. repeat until convergence:
3. assign each data point to the nearest cluster center.
4. calculate the new cluster centers as the centroid of all data points in each cluster.
5. check whether the cluster centers have changed. If not, exit the loop.
6. return the clusters.
This is the general pseudocode for the K-Means clustering algorithm.
In step 1, we select k random cluster centers. Step 2 defines the loop condition.
Step 3 assigns each data point to its nearest cluster center by calculating the distance between the data point and every cluster center.
In step 4, we calculate the new cluster centers as the centroid (center of gravity) of all data points in each cluster.
In step 5, we check whether the cluster centers have changed. If not, the loop ends.
The process is repeated until the cluster centers no longer change, which means that the algorithm has converged. Finally, the algorithm returns the clusters.
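The six steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` safety cap and the `seed` parameter are assumptions made for this example, and the initial centers are drawn from the data points themselves, which is only one of several possible ways to initialize them.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial cluster centers.
    centers = rng.sample(points, k)

    for _ in range(max_iter):  # Step 2: repeat until convergence (or max_iter)
        # Step 3: assign each data point to its nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)

        # Step 4: recompute each center as the centroid of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # an empty cluster keeps its old center

        # Step 5: leave the loop as soon as the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers

    # Step 6: return the clusters (together with their centers).
    return clusters, centers


# Hypothetical sample data: two visibly separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centers = kmeans(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

Because the result depends on the random starting centers, practical implementations often run the algorithm several times with different seeds and keep the best run.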
Selection of the clustering algorithm
The choice of clustering algorithm depends on several factors, including the type of data being analyzed, the size of the data set and the user’s requirements.
Clustering algorithms
There are many different algorithms that we can use for clustering data. Here are some of the best known and most commonly used clustering algorithms:
- K-Means clustering
- Hierarchical clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- Mean Shift Clustering
- Agglomerative clustering
- Fuzzy clustering
- Spectral Clustering
- Affinity Propagation
- Gaussian Mixture Models
It is important to note that each algorithm has different strengths and weaknesses and can work differently depending on the data set and use case.
It is therefore often necessary to try out and compare several algorithms in order to achieve the best result.
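One possible way to compare the results of several algorithms on the same data is a quality score computed from the cluster assignments. The sketch below uses the silhouette coefficient, which is not mentioned in the text above and is chosen here only as an assumption for illustration. The function name `silhouette_score` and the sample labelings are hypothetical, and the function expects at least two clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; values near 1 indicate compact, well-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a point alone in its cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data and two candidate labelings, e.g. from two different algorithms.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels_a = [0, 0, 0, 1, 1, 1]  # follows the two visible groups
labels_b = [0, 1, 0, 1, 0, 1]  # mixes the groups
score_a = silhouette_score(points, labels_a)
score_b = silhouette_score(points, labels_b)
print(score_a, score_b)
```

The labeling with the higher score separates the data more cleanly. A single score should still be read together with domain knowledge about the data.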
Targeted use of data points in cluster analysis
In cluster analysis, “data points” or “points” refer to the individual observations or elements contained in a data set that are to be analyzed. Each data point is described by a set of variables or characteristics and holds one value for each of these variables.
“Means” in cluster analysis usually refers to the centroids of the clusters. A centroid is the mean of all data points in a cluster and marks the center of that cluster.
“Distance” refers to the measure used to quantify the similarity or difference between data points. Many clustering algorithms use the Euclidean distance or a similar metric.
“Number” refers to the number of clusters to be formed by the algorithm. This can either be determined manually by the user or automatically by the algorithm based on certain criteria.
“Value” refers to the numerical values that are assigned to each variable in a data set and that represent the data points.
“Groups” is another term for the clusters themselves, while “points” refers to the data points assigned to each cluster.
“Variable” refers to a characteristic or property that is measured in a data set. A variable can be discrete or continuous.
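To make the term “distance” concrete, the following sketch computes two common metrics for a pair of data points. The Euclidean distance is named in the text above. The Manhattan distance is added here only as one example of a "similar metric", and the function names and sample points are assumptions for this illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points with the same number of variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences per variable."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hypothetical data points, each described by two variables.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

The smaller the distance, the more similar the two data points are considered to be under the chosen metric.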

Pseudocode of a clustering algorithm
Pseudocode for an algorithm that implements a cluster analysis
1. read in the data set
2. define the number of clusters
3. select initial centroids for each cluster
4. repeat for each point in the data set
a. Calculate the distance from the point to each centroid
b. Assign the point to the cluster with the nearest centroid
5. calculate the new centroids for each cluster by the mean value of all points in this cluster
6. repeat steps 4 and 5 until the centroids no longer change or a maximum number of iterations is reached
7. output the final clusters and their points
In this algorithm, the points are the individual data points in the data set. The distance measures how far each point lies from each centroid. The number refers to the number of clusters that are to be formed in the data set.
The values are the numerical or categorical properties of the data points, the groups are the resulting clusters, and the variables are the attributes or characteristics of the data points on which the cluster analysis is based.
Applications of cluster analysis:
- Customer segmentation: Companies can use cluster analysis to segment their customers into groups and thus better target their marketing strategies.
- Image recognition: In image recognition, we can use cluster analysis to group similar images.
- Recommendation systems: Companies can use cluster analysis to give customers recommendations on products or services that match their behavior patterns.
- Medical research: In conventional medical research, cluster analysis can be used to group patients on the basis of common characteristics. A treatment plan based on proven medical procedures can then be derived and tailored to each group. This can make treatment more effective and efficient, because the therapy reflects the specific needs and characteristics of the patient group.

Clustering Use Case eCommerce
Practical example of a cluster analysis in eCommerce marketing
An illustrative example of customer segmentation in marketing with clustering algorithms could look like this:
An e-commerce company collects data on the purchasing behavior of its customers, such as items purchased, amount spent and search behavior on the website. This data is then used to cluster customers into different segments.
Clustering use case in eCommerce
The following steps are required to implement this clustering use case:
- Data collection: First, the e-commerce company collects customer data that includes purchase history, search behavior on the website, response data to marketing campaigns, demographic information and customer reviews.
- Data preparation: This data is cleaned and normalized in order to be comparable and usable for the algorithm.
- Selection of the clustering algorithm: A suitable clustering algorithm, such as K-Means, is selected. The algorithm divides the customer data into segments based on similarities in the data points.
- Determining the number of clusters: The number of clusters is determined, for example with the elbow method, to find a suitable number of segments.
- Clustering: The algorithm assigns each customer to a cluster based on their characteristics. For example, customers who frequently buy sale items could be grouped into a “price-sensitive customer” cluster.
- Cluster analysis: Each cluster is analyzed to identify common characteristics and behavioral patterns. This helps in the development of targeted marketing strategies.
- Marketing application: The company uses this information to create personalized marketing campaigns tailored to the specific needs and preferences of each customer segment.
This segmentation enables the company to communicate more effectively and increase customer loyalty by providing relevant offers and content.
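The data preparation step listed above can be illustrated with a small normalization sketch. It rescales every variable to mean 0 and standard deviation 1 (z-scores) so that a variable with large numbers, such as order value, does not dominate the distance calculation. The function name `standardize` and the customer figures are hypothetical, and z-scoring is only one of several possible normalization methods.

```python
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column (standard deviation 0) is mapped to 0.0.
        tuple((value - m) / s if s else 0.0 for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical customers: (orders per year, average order value)
customers = [(2, 250.0), (4, 40.0), (12, 35.0), (30, 20.0)]
scaled = standardize(customers)
for original, row in zip(customers, scaled):
    print(original, row)
```

After this step both variables are on the same scale, so the clustering algorithm weighs them equally when measuring similarity.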
Example clusters:
- Price-sensitive customers: Customers who mainly buy special offers and low-priced products.
- Brand-loyal customers: Customers who repeatedly buy certain brands or product categories.
- Occasional shoppers: Customers who shop irregularly and spontaneously.
The data sets for each segment contain specific characteristics such as average spend, preferred product categories and purchase frequency. This segmentation helps the company to develop targeted marketing strategies that are tailored to the needs and preferences of each customer segment.
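Once every customer has a segment label, the characteristics of each segment can be summarized with a few lines of code. The sketch below assumes the clustering step has already produced the labels. The segment names follow the example above, while the figures, field names and the function `profile_segments` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical clustering output: (segment label, average spend, purchases per year)
customers = [
    ("price-sensitive", 18.0, 14),
    ("price-sensitive", 22.0, 10),
    ("brand-loyal", 95.0, 8),
    ("brand-loyal", 105.0, 6),
    ("occasional", 60.0, 1),
    ("occasional", 40.0, 3),
]


def profile_segments(rows):
    """Summarize size, average spend and purchase frequency per segment."""
    grouped = defaultdict(list)
    for segment, spend, frequency in rows:
        grouped[segment].append((spend, frequency))
    return {
        segment: {
            "customers": len(values),
            "avg_spend": fmean(v[0] for v in values),
            "avg_frequency": fmean(v[1] for v in values),
        }
        for segment, values in grouped.items()
    }


profiles = profile_segments(customers)
for segment, summary in profiles.items():
    print(segment, summary)
```

A profile like this is what a marketing team would typically read when deciding which offer to send to which segment.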
Example of hierarchical methods
A good example of hierarchical methods in cluster analysis is agglomerative clustering.
Here, every point initially forms its own cluster. In the course of the algorithm, the most similar clusters are merged step by step until all points are finally combined in a single cluster.
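The merging process can be sketched as follows. The text does not specify how the similarity of two clusters is measured, so this example assumes single linkage (the distance between the closest pair of points). The function names and sample points are hypothetical, and the simple pairwise search is meant for small data sets only.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Distance between two clusters: the distance of their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # i < j, so deleting j first keeps index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return history


# Hypothetical sample data: two pairs of nearby points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
history = agglomerative(points)
for step, (left, right) in enumerate(history, start=1):
    print(step, left, "+", right)
```

The merge history corresponds to the hierarchy: cutting it off before the last merge yields two flat clusters, cutting it off earlier yields more.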

