- Clustering vs. classification
- Unsupervised learning
- Applications
- Measures of (dis-)similarity
2021-11-15
Classification can be viewed as the sorting of objects into predefined classes.
The main task of clustering - as opposed to classification - is to identify smaller groups (clusters) within a larger set.
Scholars usually classify (N.B. not cluster!) learning algorithms into two categories (or classes, but not clusters) - supervised and unsupervised.
Two classes, because there is a predefined notion of their difference - whether one uses the information from a supervisor or not.
Supervisor - predefined authority.
Can we trust it?
In certain use cases one is allowed to question the supervisor.
Who classified the learning algorithms into two categories in the first place? Why not three or four?
N.B. Whatever algorithm classification you use, remember that this is an ALGORITHM classification, not a PROBLEM classification. Sometimes YOU define the problem (e.g. a cat with moustaches).
The main task of clustering is to identify smaller groups (clusters) within a larger set so that:
Let’s visualize that…
Different clusters within initial set.
There are multiple approaches to define similarity between objects:
… or measure similarity based on how two objects are related to one another, e.g.
… or some distance measure in selected feature space
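A few such measures can be written down compactly. Since this material's own examples are in R, the following is only an illustrative Python sketch (the function names are mine):

```python
from math import sqrt

def euclidean(a, b):
    # straight-line distance in the selected feature space
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # similarity based on the angle between two feature vectors,
    # i.e. on how the objects relate rather than how far apart they are
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)
```

Which of these is "the" similarity depends on the problem, as discussed below.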
Intra-cluster similarity (distance)
Inter-cluster similarity (distance)
Similarity, proximity and cluster is in the eye of the beholder.
Sometimes it is not simple to unambiguously define the number of clusters, or whether an element belongs to cluster A or cluster B.
Let us observe a few examples and ask ourselves where we should put the border between the clusters.
…the elements in this dataset came from 8 (eight!) different distributions…
This is not just a machine problem.
And things get even more complicated when we try to transfer human knowledge (or perception) to machines.
Clustering is hard and may provide multiple solutions. From these solutions we cannot a priori select the good ones.
Elements may belong to class A or B, and sometimes to both. Whether the clusters overlap is not always easy to conclude from the available data.
We can come to different solutions, based on certain assumptions (sometimes made implicitly) not related to the data.
The correct approach to clustering algorithm selection is to first consider which properties of a cluster we may assume.
The accuracy of the clustering algorithm depends on how accurate these assumptions are.
Some of the well-known and often-used clustering algorithms fall into one of the following categories:
If the “center” used by the algorithm is the mean value (centroid), the algorithm is known as k-means (k-centroids).
If the “center” is the median value (medoid), the algorithm is known as k-medoids.
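The centroid-based idea (assign each point to the nearest center, then recompute the centroids) can be sketched as follows. This is an illustrative Python version of Lloyd's iteration, not the R code used in the examples below; names and parameters are mine:

```python
from math import dist
from statistics import mean

def kmeans(points, centers, iters=10):
    # Lloyd's iteration: assign each point to its nearest center,
    # then move each center to the mean (centroid) of its cluster
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        centers = [tuple(map(mean, zip(*c))) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

The result depends on the initial centers, which is one of the implicit assumptions mentioned above.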
Iris dataset - 150 items, 3 species, 50 instances of each - a 4D feature space (petal and sepal length and width) - can we identify the number of species and differentiate between them?
Centers:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.850000    3.073684     5.742105    2.071053
## 2     5.006000    3.428000     1.462000    0.246000
## 3     5.901613    2.748387     4.393548    1.433871
To which cluster is each element assigned:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
Number of elements in each cluster:
## [1] 38 50 62
A confusion matrix is usually utilized to measure classification accuracy.
It is an intuitive way to show how correctly we assign each element to a certain cluster (or class):
table(iris$Species, kc$cluster)
##             
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14
Let’s visualize the results.
Note that the problem is 4D in nature since there are 4 features.
Therefore, we will visualize this in two 2D panels.
From the figures, one might conclude that the results are inaccurate, as some points are obviously closer to the center of some other cluster. In fact, we simply get that impression because we observe the projection of a 4D problem onto 2D, so the distances between data points and cluster centers are distorted.
Perhaps the following animation would convince you that the coloring is right.
The most notable difference between various algorithms is the similarity/proximity measure they use.
Let’s demonstrate it on our Iris dataset once again.
For illustration, only a few dozen elements will be observed.
We do not know the true labels of the elements.
By performing clustering we again get only abstract labels, same as earlier:
##   1   2   3   4   5   6   7   8   9  10  51  52  53  54  55  56  57  58  59  60
##   1   1   1   1   1   1   1   1   1   1   2   2   2   3   2   3   2   3   2   3
## 101 102 103 104 105 106 107 108 109 110
##   4   4   4   4   4   4   3   4   4   4
On the same dataset, the top-down approach provided slightly different results.
This is just an example.
Sometimes we cannot assume the shape of the cluster, or we even know that any assumption on the cluster shape is void. Under certain circumstances, the cluster center (if calculated as a center of gravity) might not even be an element of the cluster.
In order to cope with this type of a problem we have to utilize other cluster properties.
Human perception clusters elements not only according to proximity, but also according to density. This is what motivates density-based clustering.
Imagine you have objects like the ones shown in the bottom figure…
… you will simply identify them as letters or numbers, right?
Even if we add noise to our problem - we can still identify the elements.
However, none of the previous algorithms would perform well in this case…
The density-based spatial clustering of applications with noise (DBSCAN) algorithm provides decent results for this type of problem.
The DBSCAN algorithm is more than 20 years old, and in 2014 it received a “test of time” award.
A point is a core point if at least a minimum number of points lie within distance (eps) of it.
A point belongs to a cluster if it lies within distance (eps) of a core point of that cluster.
DBSCAN algorithm:
Works even when excessive noise is present, and is resilient to outliers
Requires only two parameters to tune
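The core idea can be sketched in a few lines. This is a simplified, illustrative Python sketch of the DBSCAN procedure (brute-force O(n²) neighbor search, my own naming), not a production implementation; its two tuning parameters are the neighborhood radius `eps` and the minimum neighborhood size `min_pts`:

```python
from math import dist

def dbscan(points, eps, min_pts):
    # label: None = unvisited, -1 = noise, >= 0 = cluster id
    labels = [None] * len(points)

    def neighbors(i):
        # all points (including i itself) within distance eps of point i
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        labels[i] = cid                 # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbors(j)
            if len(nb) >= min_pts:      # j is also a core point: expand the cluster
                queue.extend(nb)
        cid += 1
    return labels
```

Points that end up labeled -1 are reported as noise rather than forced into a cluster.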
Clustering is an exploratory technique and can also be viewed as a dimensionality reduction technique.
We use just a few values to describe the whole dataset - we have reduced the amount of information necessary to describe it.
Now, if we have more information on the relationship between data, we could describe data not just as “blobs” (clusters).
Self-organising maps (SOM) and (growing) neural gas (GNG) provide such an opportunity. If they are used as clustering algorithms, they will use the distance information (neighborhood) to describe the data.
Due to this additional information, the clusters may not have a predefined shape, but their (predefined) relationships will be contained in the neighborhood structure.
This structure is referred to as a lattice. SOM has a rigid lattice structure, while (G)NG has a looser one.
Kalinić et al. “Sensitivity of Self-Organizing Map surface current patterns to the use of radial vs. Cartesian input vectors measured by high-frequency radars”, Computers & Geosciences, 2015
Vilibić et al. “Self-Organizing Maps-based ocean currents forecasting system”, Scientific Reports, 2016
Matić et al. “Adriatic-Ionian air temperature and precipitation patterns derived from self-organizing maps: relation to hemispheric indices”, Climate Research, 2019
Roweis and Saul “Nonlinear Dimensionality Reduction by Locally Linear Embedding”, Science, 2000
Biological and artificial
ANNs:
Aims at:
There are three main parts of a neuron model:
Each synapse \(j\) is represented as a weight \(w_j\) that multiplies the input signal \(x_j\) before summation.
After summation the activation function \(\psi\) transforms the signal to output (usually denoted as \(y\)).
In case of classification the activation function typically limits the output to range \([0,1]\).
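The three parts (synaptic weights, summation, activation) combine into a single function. A minimal illustrative Python sketch, with a sigmoid as \(\psi\) so the output stays in \((0,1)\); the naming is mine:

```python
import math

def neuron(xs, ws, bias=0.0):
    # weighted sum of the inputs (synapses + summation) ...
    v = sum(w * x for w, x in zip(ws, xs)) + bias
    # ... followed by a sigmoid activation, limiting the output to (0, 1)
    return 1.0 / (1.0 + math.exp(-v))
```

With zero net input the sigmoid returns exactly 0.5, i.e. the neuron is "undecided".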
Typical activation (transfer) function used in neural networks:
The sigmoid activation function is an S-shaped function of the form:
\[ S(v)=\frac {1}{1+e^{-v}}, \]
which is a special case of a logistic function:
\[ f(v)=\frac {L}{1+{\mathrm e}^{{-k(v-\Theta)}}} \]
where \(k\) defines the slope, \(\Theta\) the midpoint, and \(L\) the maximum value the function can achieve.
The logistic function relates to the hyperbolic tangent \(\tanh()\):
\[ 2\,f(v)=1+\tanh ({\frac {v}{2}} ) \]
Therefore, the same activation function can be found under different names, but beware: these names do not necessarily reflect how the activation function is implemented.
\[ \psi(v) = \frac {1}{1+e^{-av}}\]
The rectified linear unit (ReLU), earlier known simply as the “ramp” function, is usually used in DNNs.
One smooth approximation to the rectifier is the “softplus” analytic function:
\[ f(v)=\ln(1+e^{v}) \]
Please notice the similarity to the sigmoid function, which can be viewed as a smooth approximation of the Heaviside step function.
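The activation functions above can be written down directly; an illustrative Python sketch (the identity \(2f(v) = 1 + \tanh(v/2)\) from above can be checked numerically):

```python
import math

def logistic(v, L=1.0, k=1.0, theta=0.0):
    # general logistic function; with L = k = 1, theta = 0 it is the sigmoid
    return L / (1.0 + math.exp(-k * (v - theta)))

def relu(v):
    # rectified linear unit, the "ramp" function
    return max(0.0, v)

def softplus(v):
    # smooth approximation of the rectifier
    return math.log(1.0 + math.exp(v))
```

As the text notes, the same S-shaped curve appears here under two names: the sigmoid is just a rescaled, shifted \(\tanh\).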
A (model of a) neural network is built from multiple neurons and their connections. This can be presented in a form of a directed graph where direction denotes the signal flow.
Two basic elements of such a graph are:
Architecture or topology denotes how neurons (nodes) are mutually connected.
Basic topologies are:
Naturally, recurrent network can also contain one or more layers
Recently, we distinguish between a shallow (classical) neural network and a deep neural network (DNN).
As a guideline, we can consider a neural network that requires less feature engineering to be a deep one.
Perhaps the most distinguishable DNN architectures that do not fall into a category of recurrent networks are information filtering networks and generative networks.
ADALINE - ADAptive LINear Element, is a model of the form:
\[ y_{i}=wx_{i}+\Theta \]
Notice that activation function is linear:
\[ \psi(x) = x \]
Now, assume we have \(N\) ordered pairs \((x_i, d_i)\) where \(d_i\) denotes the desired output for a given input \(x_i\).
Let us also assume that the relationship between \(x_i\) and \(d_i\) is approximately linear, i.e. the points (approximately) lie on a line.
If the line intersects with the origin (for simplicity) we can write:
\[d_i \approx w \cdot x_i \]
The error made by this approximation can be expressed as:
\[ \varepsilon = d_i - w \cdot x_i \]
Now, ADALINE learning can be formulated as a search for the weight \(w\) for which the (absolute) error is minimal.
The absolute error is not differentiable; therefore one usually uses the quadratic error \(\varepsilon^2\), or the mean of the squared errors across all \(N\) pairs:
\[ J = \frac{1}{N} \sum_{i=1}^N \varepsilon_i^2 = \frac{1}{N} \sum_{i=1}^N (d_i - w \cdot x_i)^2\]
This optimization problem has an analytical solution. The analytical solution can be found by looking where a partial derivative of \(J\) equals \(0\):
\[ \frac {\partial J} {\partial w} = 0\]
\[ \frac {\partial} {\partial w} \left( \frac{1}{N} \sum_{i=1}^N (d_i - w \cdot x_i)^2 \right) = 0\]
\[ -2 \, \frac {1} {N} \sum_{i=1}^N (d_i - w \cdot x_i) \cdot x_i = 0\]
The constant \(-2/N\) plays no significant role in our case, and we can make the following simplification:
\[ \sum_{i=1}^N d_i \cdot x_i - w \sum_{i=1}^N x_i^2 = 0 \]
\[ w = \frac {\sum_{i=1}^N d_i \cdot x_i} {\sum_{i=1}^N x_i^2} \]
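This closed-form solution is straightforward to compute; an illustrative Python sketch (the function name and data are mine):

```python
def adaline_analytic(xs, ds):
    # w = sum(d_i * x_i) / sum(x_i^2):
    # the least-squares slope of a line through the origin
    return sum(d * x for x, d in zip(xs, ds)) / sum(x * x for x in xs)
```

For data that lie exactly on a line through the origin, the formula recovers the slope exactly.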
Sometimes, the analytical solution is not the best option:
Whatever the case might be, we can utilize an iterative procedure to find the minimum error - gradient descent.
Gradient descent uses the gradient information to find a local minimum of a given function. In order to do that, it makes incremental steps in the direction opposite to the (estimated) gradient at that location.
In our case, we can calculate the gradient as follows:
\[ \nabla J = \frac {\partial J} {\partial w} = \frac {\partial} {\partial w} \left( \frac{1}{N} \sum_{i=1}^N \varepsilon_i^2 \right)= \frac{2}{N} \sum_{i=1}^N \varepsilon_i \frac {\partial } {\partial w} \varepsilon_i\]
Since the error is given by \(\varepsilon_i = d_i - w \cdot x_i\), after differentiating with respect to \(w\) we may express the gradient as:
\[ \nabla J = -\frac{2}{N} \sum_{i=1}^N \varepsilon_i x_i\]
Was this useful?
However, sometimes we wish learning (weight adaptation) to happen in real time, i.e. that the neuron adapts as soon as it gets a single datum, not the whole dataset.
In order to achieve this, we use a gradient estimate. This estimate is no longer the average value of the product \(\varepsilon_i \cdot x_i\); instead, it is made with each new data sample, where we pick “a bit” of \(\varepsilon_i \cdot x_i\). We refer to this “bit” as the learning step, denote it by \(\eta\), and write:
\[ \nabla J \approx -\eta \cdot \varepsilon_i \cdot x_i\]
This idea started the research in the field of adaptive and learning machines, where we distinguish on-line learning (as in the previous example), where we learn on each sample, and batch learning, where we adapt based on several samples at once.
Earlier, this problem was always reduced to the question: why learn (sample-by-sample) if I can conclude (analytically) once I have all the data?
The problem of not having “all the data” was a purely technical one.
With each new sample, the analytical problem has to be solved from scratch, and this becomes more and more demanding (there is more and more data).
Therefore, we formulate the learning problem as a weight adaptation problem, and we know that this adaptation should go in the direction opposite to a gradient:
\[ w(k+1) = w(k) - \nabla J (k),\]
where \(k\) denotes a (discrete) time step.
\[ w(k+1) = w(k) + \eta \cdot \varepsilon(k) \cdot x(k),\]
where \(k=i\) when the time step corresponds to the index of a new data sample.
LMS algorithm can be easily implemented, either in the form of batch learning:
\[ w(k+1) = w(k) + \eta \sum_{i=1}^N \varepsilon_i x_i, \]
or on-line learning:
\[ w(k+1) = w(k) + \eta \cdot \varepsilon(k) \cdot x(k).\]
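Both variants can be sketched in a few lines of illustrative Python. Since \(\varepsilon = d - w \cdot x\), the step \(\eta \, \varepsilon \, x\) is added to the weight, i.e. we move opposite to the gradient; the data and parameter values below are arbitrary:

```python
def lms_online(xs, ds, eta=0.1, epochs=50, w=0.0):
    # on-line LMS: adapt the weight after every single sample
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            err = d - w * x
            w += eta * err * x
    return w

def lms_batch(xs, ds, eta=0.05, epochs=200, w=0.0):
    # batch LMS: average the update over all N samples, then adapt once
    n = len(xs)
    for _ in range(epochs):
        w += eta * sum((d - w * x) * x for x, d in zip(xs, ds)) / n
    return w
```

For well-chosen \(\eta\), both converge to the same analytical slope.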
Let’s assume that we have a function given in the form of a look-up table. We know the desired values \(d_i\), while the parameters (\(x_i\)) are measured (therefore they may contain noise). If the data are the pairs (0.1,0), (0.9,1), (2.1,2), (2.9,3) and (4.1,4), one can visualize them on the 2D plane, alongside the line that represents the analytical solution.
… the weights might not converge to the true value
… we might get stable oscillations around the optimal value.
Learning step (\(\eta\)) is an important parameter of LMS algorithm:
Notice that the stability of the algorithm is not related to the gradient approximation, but to the adaptation method (gradient descent).
Various gradient descent methods exist, each with its own strengths and weaknesses.
To reduce the probability of ending up in a local minimum (instead of the global one), one might modify gradient descent by introducing an inertia term (a physical analogy):
\[ w(k+1) = w(k) + \eta \cdot \varepsilon(k) \cdot x(k) + \gamma \cdot \Delta w(k),\]
where \(\gamma\) is the inertia coefficient, and \(\Delta w\) is carried over from the previous step:
\[ \Delta w(k) = w(k) - w(k-1). \]
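A sketch of the inertia idea, in batch form for simplicity (illustrative Python; the \(\eta\) and \(\gamma\) values are arbitrary):

```python
def lms_momentum(xs, ds, eta=0.1, gamma=0.5, epochs=200, w=0.0):
    # batch gradient step plus an inertia term gamma * delta_w,
    # where delta_w = w(k) - w(k-1)
    n = len(xs)
    prev_w = w
    for _ in range(epochs):
        step = eta * sum((d - w * x) * x for x, d in zip(xs, ds)) / n
        new_w = w + step + gamma * (w - prev_w)
        prev_w, w = w, new_w
    return w
```

On this convex one-weight problem the inertia term only changes the transient; it matters most when the error surface has local minima or narrow valleys.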
ADALINE example showed that:
Actually, all a single perceptron does is fit a curve to a given set of points. A regression analysis (i.e. a matrix calculation) could do that, and better (no optimization/convergence problems).
We used it to optimize weights as if we were solving a regression problem. The other typical use case is a classification problem.
In this problem one tries to find a straight line (or, more generally, a curve) that divides a plane into two parts. Initial criticism of NNs aimed in this direction, as it was trivial to show that a single neuron cannot implement a function as simple as XOR.
True, one neuron cannot solve this apparently simple problem, but two of them already can.
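One hand-picked two-neuron solution: a hidden neuron computes AND of the inputs, and the output neuron computes OR inhibited by it. Illustrative Python with Heaviside step activations; these weights and thresholds are one possible choice, not unique:

```python
def step(v, theta=0.0):
    # Heaviside step activation with threshold theta
    return 1 if v >= theta else 0

def xor_two_neurons(x1, x2):
    # hidden neuron: fires only when both inputs are on (AND)
    h = step(x1 + x2, 1.5)
    # output neuron: OR of the inputs, inhibited by the AND neuron
    return step(x1 + x2 - 2 * h, 0.5)
```

The hidden neuron carves out the one corner of the input square that a single straight line cannot separate.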
As we will see, the true strength is not in the complexity of the neurons, but in their number - although one may increase the complexity by tampering with the network layers and neuron connectivity, or with their activation function.
To understand why a single neuron cannot solve the XOR problem, we will introduce the notion of linear separability.
In the classification problem, we distinguish between linearly separable and linearly inseparable classes, depending on whether we can separate the classes with one straight line.
The complexity of a problem which a neural network can solve increases with the complexity of its topology.
As not many problems are linearly separable, we usually use more neurons to solve certain problems.
As the number of neurons increases, we reach the point where it is no longer trivial to decide the responsibility of each neuron for the final solution.
Notice that the \(\Delta w\) of the final layer should be split into the \(\Delta w\)’s of the previous layer, and so on.
The algorithm that solves this problem is referred to as the backpropagation algorithm.
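The delta-splitting idea can be sketched on the smallest possible case: a chain of one hidden and one output neuron with scalar weights. Illustrative Python (my own naming); the gradients of \(J = \frac{1}{2}(d-y)^2\) can be verified numerically against a central difference:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_h, w_o):
    # one hidden neuron and one output neuron, both sigmoid
    h = sigmoid(w_h * x)
    y = sigmoid(w_o * h)
    return h, y

def backprop_grads(x, d, w_h, w_o):
    # gradients of J = 0.5 * (d - y)^2 with respect to both weights;
    # the output delta is "split" back to the hidden layer through w_o
    h, y = forward(x, w_h, w_o)
    err = d - y
    delta_o = err * y * (1 - y)            # output-layer delta
    delta_h = delta_o * w_o * h * (1 - h)  # hidden-layer delta
    return -delta_h * x, -delta_o * h      # dJ/dw_h, dJ/dw_o
```

In a full network the same chain rule runs layer by layer, which is exactly what backpropagation automates.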
##             
##               1  2  3
##   setosa     25  0  0
##   versicolor  0 24  1
##   virginica   0  3 22