Euclidean Distance Measures
The Euclidean distance between any two observations j and k described by n variables is given by:
Δ_jk = [Σ (X_ij - X_ik)²]^(1/2) where i = 1 to n
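The distance formula above can be sketched directly; this is a minimal illustration in which the function name and the sample values are hypothetical:

```python
from math import sqrt

def euclidean_distance(x_j, x_k):
    # Sum the squared differences across the n characteristic
    # variables, then take the square root.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x_j, x_k)))

# Two observations j and k measured on n = 3 variables
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # -> 5.0
```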
Correlation measures
Inverted factor analysis, or factor Q-analysis, is used to group observations according to their shared variance across the characteristic variable set. Although a popular clustering technique, Q-analysis suffers from several important shortcomings:
- There is an important loss of information. Standardizing across a single variable is one thing; standardizing across the values of all variables for a single observation is quite another. Standardizing across attributes is likely to distort the clustering outcomes substantially if the variables themselves are not similarly measured before the analytical routine takes place.
- Individual observations are likely to load on more than one factor, blurring cluster boundaries and group membership. When the goal is to model structure, however, this may not be a problem.
- As cluster boundaries and group membership are unclear, identifying individual clusters becomes difficult and further statistical analysis is required.
Similarity measures
Similarity measures are often used to group observations when the characteristic variables are non-metric. There are two standard approaches to overcoming the absence of metric data:
- Multidimensional scaling
This technique is employed to metricize non-metric data, so that it can be used in standard clustering techniques.
- Attribute matching and similarity coefficients
- Fractional match coefficient - A metric coefficient computed from non-metric dummy variables.
Characteristic attribute:  1  2  3  4  5  6  7  8
Observation i:             1  0  0  1  1  0  1  0
Observation j:             0  1  0  1  0  1  1  1
S = M / N = 3 / 8 = 0.375
where M = the number of matching hits and misses (zeros and ones)
where N = the number of characteristic attributes (non-metric variables)
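The fractional match coefficient for the table above can be computed as follows; the function name is a hypothetical label for this sketch:

```python
def fractional_match(i_attrs, j_attrs):
    # M = matching hits and misses (positions where both are 1 or both are 0)
    matches = sum(a == b for a, b in zip(i_attrs, j_attrs))
    # N = total number of characteristic attributes
    return matches / len(i_attrs)

i = [1, 0, 0, 1, 1, 0, 1, 0]
j = [0, 1, 0, 1, 0, 1, 1, 1]
print(fractional_match(i, j))  # -> 0.375
```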
- Tanimoto coefficient - Because the variables included in a characteristic variable set are generally those that describe the presence of an attribute rather than its absence, far more attributes are left out of a variable set than are included. Accordingly, one should not overemphasize the absence of a particular attribute among those that are included. The Tanimoto coefficient adjusts for this natural bias in the make-up of characteristic variable sets. Using the table above (see fractional match coefficient), the Tanimoto coefficient yields the following metric result:
T = M_b / M_e = 2 / 7 = 0.286
where M_b = the number of attributes present for both i and j
where M_e = the number of attributes present for either i or j, or both i and j
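The Tanimoto coefficient counts only attribute presences, ignoring joint absences; a minimal sketch (function name hypothetical):

```python
def tanimoto(i_attrs, j_attrs):
    # M_b: attributes present (1) for both observations
    both = sum(a == 1 and b == 1 for a, b in zip(i_attrs, j_attrs))
    # M_e: attributes present for either observation, or both
    either = sum(a == 1 or b == 1 for a, b in zip(i_attrs, j_attrs))
    return both / either

i = [1, 0, 0, 1, 1, 0, 1, 0]
j = [0, 1, 0, 1, 0, 1, 1, 1]
print(round(tanimoto(i, j), 3))  # -> 0.286
```

Note that positions 3 (a 0-0 match) no longer contribute to the denominator, which is what distinguishes this measure from the fractional match coefficient.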
- Pattern similarity coefficient - Although interval data provides an immediate metric solution for quantifying the absence or presence of a given attribute, many metric and non-metric matching techniques fail to adjust for the occurrence of random matching. Cattell, Coulter, and Tsujioka developed a similarity coefficient that helps adjust for this uncertainty:
r_p = (E - Σ d_i²) / (E + Σ d_i²) where i = 1 to n
where n = the total number of dimensions
where d_i² = the squared Euclidean distance in standard units between any two observations j and k on dimension i, such that j ≠ k
where E = twice the median chi-square value with n degrees of freedom
The above coefficient reduces to the following expression for non-metric dummy variables:
r_p = (E - d) / (E + d)
where d = the number of non-matches on the n items
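The dummy-variable form of the pattern similarity coefficient can be sketched as below. The exact median of a chi-square distribution requires tables or a statistics library; here the Wilson-Hilferty approximation stands in for it, which is an assumption of this sketch, as are the function names:

```python
def chi2_median(n):
    # Wilson-Hilferty approximation to the median of a chi-square
    # distribution with n degrees of freedom (an approximation is
    # used here in place of exact tables).
    return n * (1 - 2 / (9 * n)) ** 3

def pattern_similarity_dummy(i_attrs, j_attrs):
    # r_p = (E - d) / (E + d), where d is the number of non-matches
    # and E is twice the median chi-square with n degrees of freedom.
    n = len(i_attrs)
    d = sum(a != b for a, b in zip(i_attrs, j_attrs))
    E = 2 * chi2_median(n)
    return (E - d) / (E + d)

i = [1, 0, 0, 1, 1, 0, 1, 0]
j = [0, 1, 0, 1, 0, 1, 1, 1]
print(round(pattern_similarity_dummy(i, j), 3))
```

With 5 non-matches out of 8 items, the coefficient falls near 0.5: well short of 1 (identical profiles) but adjusted upward relative to a raw mismatch count by the chi-square baseline for random matching.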
- Common features of attribute matching and similarity measures
- Advantages
- Flexibility - Similarity measures can be adapted to handle nominal, ordinal, and interval-scaled data.
- Metricization - Even when metric similarity measures are not employed, the results of attribute matching can be metricized through multidimensional scaling procedures.
- Robustness - Similarity measures are less sensitive to extreme outliers commonly associated with metric data.
- Disadvantages
- Homo- and heterogeneic mismatches
- If clusters are formed on the basis of the overall number of matches, two observations that match well on a particular subset of attributes may not appear together in the same group. Matching according to similarity measures makes no allowance for the relative importance of some shared characteristics over others.
- An observation may be assigned to a group because it shares many different characteristics with many different members of that group, rather than attributes shared by all members. This is most likely to occur when there are a large number of variable attributes.
- Noise - the likelihood that an observation will be assigned to a particular group by chance increases with the number of characteristic variable attributes.
- Mixed scaling - When multichotomous and dichotomous attributes are combined in the same similarity measure, the solution is biased toward the dichotomous attributes.
- Information loss - Interval or ratio metric data that is reconstituted as non-metric data results in an important loss of information.
Mixed scaling - The best way to avoid problems of mixed units is to design the original experiment in such a way that all variables of the characteristic variable set are measured in the same way. Short of this, non-conforming elements of the same variable set must be transformed.
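A common transformation for non-conforming variables is conversion to standard (z) units, so that each variable contributes on a comparable scale; a minimal sketch, with the variable names and sample values hypothetical:

```python
from statistics import mean, stdev

def standardize(values):
    # Convert a variable to standard units: zero mean, unit
    # (sample) standard deviation.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

incomes = [20_000, 35_000, 50_000, 65_000]  # measured in dollars
ages = [25, 32, 47, 56]                     # measured in years
print(standardize(incomes))
print(standardize(ages))
```

After standardization both variables are unitless, so neither dominates a distance-based clustering routine merely because of its original scale.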