PORTAL USER GUIDE
Hierarchical Clustering (Distance Matrix)
The Hierarchical Clustering Distance Matrix is a matrix (two-dimensional array) containing the pairwise distances between a set of points. This matrix has a size of N × N, where N is the number of points, nodes or vertices.
The output is a graph showing how similar the different areas are when a range of variables is taken into account. The smaller the distance between two areas in the matrix, the more similar they are.
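As a rough illustration of what an N × N distance matrix contains, the following minimal Python sketch builds one for four points using Euclidean distance. The area names and values are hypothetical and are not taken from the Greater Hobart dataset; the portal tool itself requires no coding.

```python
import math

# Hypothetical data: four areas, each described by two variables.
points = {
    "Area A": (0.0, 0.0),
    "Area B": (3.0, 4.0),
    "Area C": (6.0, 8.0),
    "Area D": (0.0, 1.0),
}

names = list(points)

# Build the full N x N matrix of pairwise Euclidean distances.
# Entry [i][j] is the distance between area i and area j.
dist_matrix = [
    [math.dist(points[a], points[b]) for b in names]
    for a in names
]

# Print one row per area, rounded for readability.
for name, row in zip(names, dist_matrix):
    print(name, [round(d, 2) for d in row])
```

The matrix is symmetric and has zeros on its diagonal, because the distance from an area to itself is zero and the distance from A to B equals the distance from B to A.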
To illustrate the use of the Hierarchical Clustering (Distance Matrix) tool, we will use a dataset on Income, Inequality and Financial Stress across the Greater Hobart area. To do this:
- Select Greater Hobart as your area.
- Select SA2 OECD Indicators: Income, Inequality and Financial Stress 2011 as your dataset, selecting all variables.
Once you have done this, open the Hierarchical Clustering (Distance Matrix) tool (Tools → Charts → Hierarchical Clustering (Distance Matrix)) and enter the parameters listed below.
The parameters that need to be entered are:
- Dataset Input: Select a dataset that contains the variables of interest. Select SA2 OECD Indicators: Income, Inequality and Financial Stress 2011.
- Variables: A set of independent variables. Select the following variables:
- Median Disposable Income (Synthetic Data)
- Gini Coefficient (Synthetic Data)
- Poverty Rate (Synthetic Data)
- % with no access to emergency money (Synthetic Data)
- % Can’t afford a night out (Synthetic Data)
- Distance Metric: Distance measure to be used. Select euclidean.
- euclidean: “ordinary” straight-line distance between two points in Euclidean space.
- maximum: greatest distance along any coordinate dimension, also known as chessboard distance.
- manhattan: the distance between two points measured along axes at right angles.
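- canberra: a weighted form of the Manhattan distance, in which the absolute difference along each coordinate is divided by the sum of the absolute values of the two coordinates.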
- binary: treats each variable as either present (non-zero) or absent (zero) and measures the proportion of variables on which two points differ, among those variables where at least one of the two is present.
- minkowski: a metric in a normed vector space that generalises both the Euclidean distance and the Manhattan distance.
- Cluster Metric: The agglomeration method (linkage rule) to be used. It is important to note that in every method used, the analysis is processed as a complete-link case. Select complete.
- ward: calculates the increase in the error sum of squares (ESS) after fusing two clusters.
- single: the distance between the two closest points, one from each cluster.
- complete: the distance between the two furthest points, one from each cluster.
- average: the average of the cluster’s distances is taken whilst compensating for the number of points in that cluster
- mcquitty: the average of the cluster’s distances is taken, not considering the number of points in that cluster.
- median: the inter-cluster median point.
- centroid: the inter-cluster mid-point.
- Observation Labels: A variable whose values are to be used as labels for each case. Select SA2 Name.
- Chart Title: A title for your Hierarchical Clustering Dendrogram. Type Income, Inequality and Financial Stress in Greater Hobart.
- Greyscale: Specify whether you would like your graph to be grey-scale (checked) or colour (unchecked). Untick this box.
Note: Please see the documentation of Cluster Analysis (Hierarchical) for further details.
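To make the Distance Metric and Cluster Metric options above more concrete, the following Python sketch writes out several of them using their standard textbook definitions. It is an illustration only and is not the portal's own implementation: the function names, the choice to skip zero-over-zero terms in canberra, and the default value of p in minkowski are all assumptions made for this example. Only the single, complete and average linkage rules are shown.

```python
def euclidean(x, y):
    # Straight-line distance.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def maximum(x, y):
    # Greatest difference along any single coordinate.
    return max(abs(a - b) for a, b in zip(x, y))

def manhattan(x, y):
    # Sum of absolute differences along each coordinate.
    return sum(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Each absolute difference is divided by the sum of absolute values.
    # Assumption: terms where both coordinates are zero are skipped.
    return sum(
        abs(a - b) / (abs(a) + abs(b))
        for a, b in zip(x, y)
        if a != 0 or b != 0
    )

def minkowski(x, y, p=2):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cluster_distance(cluster_1, cluster_2, rule, metric=euclidean):
    # All distances between a point in one cluster and a point in the other.
    pair_dists = [metric(x, y) for x in cluster_1 for y in cluster_2]
    if rule == "single":
        return min(pair_dists)    # two closest points
    if rule == "complete":
        return max(pair_dists)    # two furthest points
    if rule == "average":
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unsupported rule: {rule}")

# Hypothetical points and clusters, for illustration only.
x, y = (1.0, 2.0), (4.0, 6.0)
print(euclidean(x, y), maximum(x, y), manhattan(x, y))

cluster_1 = [(0.0, 0.0), (0.0, 1.0)]
cluster_2 = [(3.0, 4.0), (6.0, 8.0)]
for rule in ("single", "complete", "average"):
    print(rule, round(cluster_distance(cluster_1, cluster_2, rule), 3))
```

For the same pair of clusters, the single rule never gives a larger value than the average rule, and the average rule never gives a larger value than the complete rule, which is one reason the choice of Cluster Metric can change the shape of the resulting dendrogram.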
Once you have selected your parameters, click Run Tool.
Once you have run the tool, click the Display Output button which appears in the pop-up dialogue box. This should open up a chart tool looking like the one shown below.
The output shows the distance between each pair of Hobart SA2s. The smaller the number, the more similar the two areas are with respect to the variables chosen. For example, Kingston Beach is close to Claremont, with a value of 18.904, whereas Bellerive and Bridgewater have a high value of 660.21, indicating that they are considered much further apart.
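If the distance values are copied out of the chart for further analysis, the most and least similar pairs can be picked out programmatically. The Python sketch below shows one way to do this; the labels and distances are hypothetical placeholders, not values from the Greater Hobart output.

```python
from itertools import combinations

# Hypothetical labels and symmetric distance matrix, for illustration only.
labels = ["Area A", "Area B", "Area C"]
dist = [
    [0.0, 12.5, 80.0],
    [12.5, 0.0, 74.2],
    [80.0, 74.2, 0.0],
]

# Collect each unordered pair once (the matrix is symmetric).
pairs = {
    (labels[i], labels[j]): dist[i][j]
    for i, j in combinations(range(len(labels)), 2)
}

closest = min(pairs, key=pairs.get)    # most similar pair
furthest = max(pairs, key=pairs.get)   # least similar pair

print("Most similar:", closest, pairs[closest])
print("Least similar:", furthest, pairs[furthest])
```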