Euclidean Distance Measures
The Euclidean distance between any two observations j and k described by n variables is given by:
Δ_jk = [Σ (X_ij - X_ik)²]^(1/2) where i = 1 to n
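The distance formula above can be sketched directly; this is a minimal illustration in which the function name and the sample values are hypothetical:

```python
from math import sqrt

def euclidean_distance(x_j, x_k):
    # Sum the squared differences across the n characteristic
    # variables, then take the square root.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x_j, x_k)))

# Two observations j and k measured on n = 3 variables
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # -> 5.0
```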
Correlation measures
Inverted factor analysis, or factor Q-analysis, is used to group observations according to their shared variance across the characteristic variable set. Although a popular clustering technique, Q-analysis suffers from several important shortcomings:
- There is an important loss of information. Standardizing across a single variable is one thing; standardizing across the values of all variables for a single observation is quite another. Standardizing across attributes is likely to distort the clustering outcomes substantially if the variables themselves are not similarly measured before the analytical routine takes place.
- Individual observations are likely to load on more than one factor, blurring cluster boundaries and group membership. When the goal is to model structure, however, this may not be a problem.
- As cluster boundaries and group membership are unclear, identifying individual clusters becomes difficult and further statistical analysis is required.
Similarity measures
Similarity measures are often used to group observations when the characteristic variables are non-metric. There are two standard approaches to overcoming the absence of metric data:
- Multidimensional scaling
This technique is employed to metricize non-metric data, so that it can be used in standard clustering techniques.
- Attribute matching and similarity coefficients
- Fractional match coefficient - A metric coefficient computed from non-metric dummy variables.
Characteristic attribute:  1  2  3  4  5  6  7  8
Observation i:             1  0  0  1  1  0  1  0
Observation j:             0  1  0  1  0  1  1  1
S = M / N = 3 / 8 = 0.375
where M = the number of matching hits and misses (zeros and ones)
where N = the number of characteristic attributes (non-metric variables)
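The fractional match coefficient for the table above can be computed as follows; the function name is a hypothetical label for this sketch:

```python
def fractional_match(i_attrs, j_attrs):
    # M = matching hits and misses (positions where both are 1 or both are 0)
    matches = sum(a == b for a, b in zip(i_attrs, j_attrs))
    # N = total number of characteristic attributes
    return matches / len(i_attrs)

i = [1, 0, 0, 1, 1, 0, 1, 0]
j = [0, 1, 0, 1, 0, 1, 1, 1]
print(fractional_match(i, j))  # -> 0.375
```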
- Tanimoto coefficient - Because the variables included in a characteristic variable set are generally those that describe the presence of an attribute rather than its absence, far more attributes are left out of a variable set than are included. Accordingly, one should not overemphasize the absence of a particular attribute among those that are included. The Tanimoto coefficient adjusts for this natural bias in the make-up of characteristic variable sets. Using the table above (see fractional match coefficient), the Tanimoto coefficient yields the following metric result:
T = M_b / M_e = 2 / 7 = 0.286
where M_b = the number of attributes present for both i and j
where M_e = the number of attributes present for either i or j, or both i and j
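The Tanimoto coefficient counts only attribute presences, ignoring joint absences; a minimal sketch (function name hypothetical):

```python
def tanimoto(i_attrs, j_attrs):
    # M_b: attributes present (1) for both observations
    both = sum(a == 1 and b == 1 for a, b in zip(i_attrs, j_attrs))
    # M_e: attributes present for either observation, or both
    either = sum(a == 1 or b == 1 for a, b in zip(i_attrs, j_attrs))
    return both / either

i = [1, 0, 0, 1, 1, 0, 1, 0]
j = [0, 1, 0, 1, 0, 1, 1, 1]
print(round(tanimoto(i, j), 3))  # -> 0.286
```

Note that positions 3 (a 0-0 match) no longer contribute to the denominator, which is what distinguishes this measure from the fractional match coefficient.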
- Pattern similarity coefficient - Although interval data provides an immediate metric solution for quantifying the absence or presence of a given attribute, many metric and non-metric matching techniques fail to adjust for the occurrence of random matching. Cattell, Coulter, and Tsujioka developed a similarity coefficient that helps adjust for this uncertainty:
r_p = (E - Σ d_i²) / (E + Σ d_i²) where i = 1 to n
where n = the total number of dimensions
where d_i² = the squared Euclidean distance in standard units between any two observations j and k on dimension i, such that j ≠ k
where E = twice the median chi-square value with n degrees of freedom
The above coefficient reduces to the following expression for non-metric dummy variables:
r_p = (E - d) / (E + d)
where d = the number of non-matches on the n items
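The dummy-variable form of the pattern similarity coefficient can be sketched as below. The exact median of a chi-square distribution requires tables or a statistics library; here the Wilson-Hilferty approximation stands in for it, which is an assumption of this sketch, as are the function names:

```python
def chi2_median(n):
    # Wilson-Hilferty approximation to the median of a chi-square
    # distribution with n degrees of freedom (an approximation is
    # used here in place of exact tables).
    return n * (1 - 2 / (9 * n)) ** 3

def pattern_similarity_dummy(i_attrs, j_attrs):
    # r_p = (E - d) / (E + d), where d is the number of non-matches
    # and E is twice the median chi-square with n degrees of freedom.
    n = len(i_attrs)
    d = sum(a != b for a, b in zip(i_attrs, j_attrs))
    E = 2 * chi2_median(n)
    return (E - d) / (E + d)

i = [1, 0, 0, 1, 1, 0, 1, 0]
j = [0, 1, 0, 1, 0, 1, 1, 1]
print(round(pattern_similarity_dummy(i, j), 3))
```

With 5 non-matches out of 8 items, the coefficient falls near 0.5: well short of 1 (identical profiles) but adjusted upward relative to a raw mismatch count by the chi-square baseline for random matching.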
- Common features of attribute matching and similarity measures
- Advantages
- Flexibility - Similarity measures can be adapted to handle nominal, ordinal, and interval-scaled data.
- Metricization - Even when metric similarity measures are not employed, the results of attribute matching can be metricized through multidimensional scaling procedures.
- Robustness - Similarity measures are less sensitive to extreme outliers commonly associated with metric data.
- Disadvantages
- Homo- and heterogeneic mismatches
- If clusters are formed on the basis of the overall number of matches, two observations that match well on a particular subset of attributes may not appear together in the same group. Matching according to similarity measures makes no allowance for the relative importance of some shared characteristics over others.
- An observation may be assigned to a group because it shares many different characteristics with many different members of that group, rather than attributes shared by all members. This is most likely to occur when there are a large number of variable attributes.
- Noise - the likelihood that an observation will be assigned to a particular group by chance increases with the number of characteristic variable attributes.
- Mixed scaling - When multichotomous and dichotomous attributes are combined in the same similarity measure, the solution is biased toward the dichotomous attributes.
- Information loss - Interval or ratio metric data that is reconstituted as non-metric data results in an important loss of information.
Mixed scaling - The best way to avoid problems of mixed units is to design the original experiment in such a way that all variables of the characteristic variable set are measured in the same way. Short of this, non-conforming elements of the same variable set must be transformed.
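A common transformation for non-conforming variables is conversion to standard (z) units, so that each variable contributes on a comparable scale; a minimal sketch, with the variable names and sample values hypothetical:

```python
from statistics import mean, stdev

def standardize(values):
    # Convert a variable to standard units: zero mean, unit
    # (sample) standard deviation.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

incomes = [20_000, 35_000, 50_000, 65_000]  # measured in dollars
ages = [25, 32, 47, 56]                     # measured in years
print(standardize(incomes))
print(standardize(ages))
```

After standardization both variables are unitless, so neither dominates a distance-based clustering routine merely because of its original scale.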