In data mining, classification is a task where statistical models are trained to assign new observations to a “class” or “category” out of a pool of candidate classes; the models are able to differentiate new data by observing how previous example observations were classified. For example, deciding whether or not a pattern of activity on a computer network is malicious, based on past experience, is a classification task.
In contrast, clustering is a task where observations in a dataset are grouped together into clusters based on their statistical properties, where observations in the same cluster are thought to be similar or somewhat related. Recognizing groups of machines on a network that behave in similar ways is an example of clustering.
The training of a classification model is a form of supervised learning where the model is “taught” to learn from correct classifications over existing data. In technical terms, the training of a classification model often involves the minimisation of an error metric which indicates the average amount of mistakes made when producing class assignments for a training dataset.
Common classification models include: decision trees and random forests, Naive Bayes classifiers, neural networks with logistic/softmax-activated output layers, support vector machines.
The training of a clustering model, on the other hand, represents a form of unsupervised learning; clustering algorithms are typically provided with a “distance measure” which describes how the similarities between observations should be measured. The clustering algorithm will then set off to find the best way to group the observations such that the overall distances between the observations are minimised.
Some common clustering algorithms include: k-means clustering, hierarchical clustering, Gaussian mixture models.