statistical modelling (cluster analysis) | project index | statistical modelling (diagnostics) |
Data Transformation
cluster analysis (key features | key terms)
- Proximity measures - What measure of similarity or dissimilarity should be utilized?
- Standardization - Should the data be standardized? (e.g., metric data measured in non-equivalent units)
- Correlated variables - How should the problem of interdependence be addressed?
- Weights - To what extent should variables of the characteristic set be weighted?
Experimental studies that have tested the ability of clustering techniques and proximity measures to generate known clusters have shown that the choice of proximity measure is often not important. These studies have also demonstrated a similar result for standardization procedures. To the extent that standardization and certain proximity measures reduce the influence of extreme outliers, the solutions of clustering techniques that are sensitive to outliers can be improved. In general, one should select proximity measures that are appropriate to the input data, but not be overly concerned about which proximity measure is employed when more than one measure is appropriate for the same data.
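As an illustration of how a proximity measure interacts with standardization, the sketch below computes Euclidean and city-block (Manhattan) distances for hypothetical observations measured in non-equivalent units, before and after z-score standardization. The data and variable names are invented for illustration:

```python
import math

# Hypothetical data: two observations measured on income (dollars)
# and age (years) -- variables with non-equivalent units.
a = [52000.0, 34.0]
b = [48000.0, 58.0]

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# On the raw data, income dominates both measures because of its
# larger unit of measure; age contributes almost nothing.
print(euclidean(a, b))
print(manhattan(a, b))

# Standardizing (z-scores over the whole sample) puts both variables
# on an equal footing before the proximity measure is computed.
sample = [a, b, [61000.0, 45.0], [45000.0, 29.0]]

def zscores(data):
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)]
            for row in data]

za, zb = zscores(sample)[:2]
print(euclidean(za, zb))  # age now carries comparable weight
```

Whichever measure is chosen, it is the standardization decision that changes the relative contribution of the two variables here, not the switch between Euclidean and city-block distance.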
Careful selection of variables is likely the best control for interdependency among the characteristic variables. In general, variables that are to be accorded equal importance should be independent of one another. The Mahalanobis D proximity measure is the single best proximity measure for removing partial correlations among correlated input variables. Principal component analysis can also be applied to the characteristic variable set before the clustering routine is started. The resulting factor scores are then employed as the input data set for the computation of proximity measures other than the Mahalanobis D statistic. As the input variables will no longer correspond precisely to those of the original variable set, this latter technique may create problems with the identification of individual clusters after the clustering routine has been completed. When the correlations among variables are few in number, the exaggerated importance of highly correlated variables can be removed through appropriate weighting.
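The effect of the Mahalanobis measure can be sketched numerically. The following is a minimal illustration on invented bivariate data in which the second variable closely tracks the first; the 2x2 case is used so the matrix inverse can be written out by hand:

```python
import math

# Invented bivariate data: the second variable closely tracks the
# first (correlation near 1).
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]]
n = len(data)
cols = list(zip(*data))
mu = [sum(c) / n for c in cols]

def cov(i, j):
    return sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in data) / (n - 1)

# Sample variance-covariance matrix and its inverse (2x2 case).
S = [[cov(0, 0), cov(0, 1)],
     [cov(1, 0), cov(1, 1)]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[ S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det,  S[0][0] / det]]

def mahalanobis(x, y):
    d = [x[0] - y[0], x[1] - y[1]]
    q = (d[0] * (Sinv[0][0] * d[0] + Sinv[0][1] * d[1]) +
         d[1] * (Sinv[1][0] * d[0] + Sinv[1][1] * d[1]))
    return math.sqrt(q)

def euclidean(x, y):
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

# A separation that runs along the axis of correlation is heavily
# discounted relative to the ordinary Euclidean distance.
print(euclidean(data[0], data[4]))
print(mahalanobis(data[0], data[4]))
```

Distance along the direction in which the two variables co-vary is discounted, which is exactly the sense in which the measure removes the shared variance of correlated inputs.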
Solution Retrieval
cluster analysis (key features)
- Cluster number - How many clusters should be obtained?
- Clustering routines - What clustering algorithm should be employed to perform the analysis?
- Omitted observations - How should one deal with mismatched observations and important outliers? Can observations be omitted?
- Cluster identification - Are the clusters meaningful? How are they to be identified?
Experimental studies that have tested the ability of different clustering techniques and proximity measures to generate known clusters have shown that the choice of clustering routine plays a crucial role in the determination of the final solution.
Although there is a large number of hierarchical methods available, research has demonstrated that iterative partitioning (nonhierarchical) methods can produce superior solutions. For these superior solutions to be achieved, however, one must already be well informed about the true nature of the clusters. This obvious paradox is overcome through two-stage clustering. Well suited for the first stage are the average linkage and Ward's minimum variance routines, both of which have been shown to be generally superior to other hierarchical techniques. From these techniques one can determine both a candidate number of clusters and a starting point for the iteration routine of the second stage. In addition, one can examine the order in which observations are clustered and the distance between individual observations. This information is useful in detecting distortive outliers and deciding on their eventual elimination.
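The two-stage logic can be sketched as follows. This is a toy illustration, not a production routine: an average-linkage agglomeration supplies the candidate cluster count and starting centroids, and a single nearest-centroid reassignment stands in for the iterative partitioning stage:

```python
import math

# Toy 2-D data with two plainly separated groups (invented values).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),   # group A
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]   # group B

def dist(p, q):
    return math.dist(p, q)

# Stage 1: agglomerative average linkage until two clusters remain.
clusters = [[p] for p in points]

def avg_link(c1, c2):
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

while len(clusters) > 2:
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

# The stage-1 centroids seed the second (iterative) stage.
def centroid(c):
    return tuple(sum(v) / len(c) for v in zip(*c))

seeds = [centroid(c) for c in clusters]

# Stage 2: k-means-style reassignment from the hierarchical seeds.
assign = [min(range(len(seeds)), key=lambda k: dist(p, seeds[k]))
          for p in points]
print(assign)
```

In practice the second stage would iterate reassignment and centroid updates to convergence; the point of the sketch is only that the hierarchical pass supplies both the cluster number and the starting partition that the iterative routine requires.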
Caution: Not all manufacturers provide both hierarchical and nonhierarchical (iterative) clustering techniques in the same software package.
Pivot and Defining Variables
cluster analysis (key terms)
If two-stage clustering is not possible, one can take advantage of a weakness common to most multivariate-analytical techniques - variable interdependence. A careful examination of the characteristic correlation matrix is likely to reveal variables that correlate highly with some variables but only weakly with others. These are called pivot variables. Variables that correlate highly with some variables but only weakly with other variables that in turn demonstrate high correlation with still other variables are called defining variables. This is because their shared and non-shared variance form an important basis for cluster formation. By identifying pivot and defining variables in the correlation matrix, one can get a good general idea of the number of clusters that are likely to result before an iterative routine is run. Thus, pivot and defining variable identification is an important step in determining the initial number of clusters for clustering techniques that require a fixed number.
Variable Selection
cluster analysis (key features)
- Selection - What is the best set of variables for generating a cluster-analytic solution?
- Multivariate normality - Is the characteristic variable set distributed normally?
Empirical findings have demonstrated that variable selection is crucial in the application of clustering procedures, as even one or two irrelevant variables can greatly distort the final solution. Thus, before cluster analysis is performed, one must carefully consider what one hopes to find. Unlike factor analysis, which is primarily a data reduction and summarization technique, classification analysis can be more appropriately described as an analytical test or summary procedure. One cannot expect one's washing machine to sort one's laundry; much of the sorting occurs before the laundering takes place. In other words, one should choose one's characteristic variable set according to what one expects to find.
As multivariate normality is a standard assumption of many cluster-analytical procedures, one should test the members of the attribute variable set for signs of skewness and kurtosis before including them in the characteristic variable set. Individual variables that behave non-normally are likely to distort the final solution.
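Such a screen can be run variable by variable before anything enters the characteristic set. The sketch below uses the conventional moment-based estimates of skewness and excess kurtosis on simulated data; the cutoffs one would apply in practice depend on sample size:

```python
import math
import random

def skewness(x):
    # Third standardized moment; near 0 for a symmetric distribution.
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return sum((v - m) ** 3 for v in x) / (n * s ** 3)

def excess_kurtosis(x):
    # Fourth standardized moment minus 3; near 0 for a normal variable.
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return sum((v - m) ** 4 for v in x) / (n * s ** 4) - 3.0

random.seed(0)
normal_ish = [random.gauss(0, 1) for _ in range(2000)]   # passes the screen
skewed = [random.expovariate(1.0) for _ in range(2000)]  # fails it

print(skewness(normal_ish), excess_kurtosis(normal_ish))  # both near 0
print(skewness(skewed))                                   # clearly positive
```

A variable that fails such a screen can often be transformed (e.g., by taking logarithms) before inclusion rather than discarded outright.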
In general one should avoid mixing variables with different metric properties. Moreover, one should be judicious about standardizing entire variable sets, as there can be a significant loss of information.
Statistical Significance and Validation
cluster analysis (key features) | statistical modelling (useful statistics)
- Chance - Is the cluster solution very different from that which one could derive by chance?
- Reliability - Is the cluster solution stable across different samples from the same population?
- Application - Are the clusters related to variables different from those used to derive them? Are the clusters applicable?
- Statistical significance - Are the clusters statistically different from one another?
Statistical Significance (C-Statistic) - Even after careful analysis of the data set and completion of the clustering routine, one cannot be sure that one has succeeded in identifying meaningful clusters. Since all clustering techniques result in a clustered solution, one must test whether one's solution is statistically meaningful. A somewhat preferred statistic to test for significance is given by the following formula:
C = log(max(|T| / |W|))
where |T| = determinant of the total variance-covariance matrix
and |W| = determinant of the pooled within-group variance-covariance matrix
As a number of iterative partitioning methods seek to maximize |T| with respect to |W|, this statistic is especially useful. Moreover, the statistic errs on the conservative side for those iterative procedures that do not seek to maximize this relationship. See Arnold (1979) and Friedman and Rubin (1967), as cited in Punj and Stewart (1983).
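A minimal numerical sketch of the statistic, on invented two-group, two-variable data, is given below. The 2x2 determinants are written out by hand; a large positive C reflects a total covariance structure much larger than the pooled within-group structure:

```python
import math

# Invented two-group, two-variable solution with well-separated groups.
g1 = [(1.0, 2.0), (1.5, 1.8), (0.8, 2.4), (1.2, 2.2)]
g2 = [(5.0, 6.0), (5.4, 5.6), (4.8, 6.3), (5.1, 5.9)]
allpts = g1 + g2
n, k = len(allpts), 2

def scatter(points):
    """2x2 cross-product (scatter) matrix about the group mean."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return [[sxx, sxy], [sxy, syy]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Total variance-covariance matrix T (divisor n - 1) and pooled
# within-group variance-covariance matrix W (divisor n - k).
T = [[v / (n - 1) for v in row] for row in scatter(allpts)]
Wscatter = [[a + b for a, b in zip(r1, r2)]
            for r1, r2 in zip(scatter(g1), scatter(g2))]
W = [[v / (n - k) for v in row] for row in Wscatter]

C = math.log(det2(T) / det2(W))
print(C)  # large and positive for this well-separated grouping
```

Interpreting the magnitude of C against what chance alone would produce still requires the reference distributions discussed in the sources cited above.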
Validation
cluster analysis (key features)
Demonstrating statistical significance, though an important first step, is insufficient to ensure the reliability of the solution. Cross-validation and external validation are two commonly accepted methods of ensuring reliability.
- Cross-validation
cluster analysis (key features)
- Probably the most popular method of cross-validation for clustering techniques is the split-sample approach. Unlike multi-discriminant analysis, for which a classification matrix can be constructed and a single hit ratio calculated, a similar approach for cluster analysis requires a hit ratio for each separate cluster or a single hit ratio for all clusters. Without important modification of the statistical tests associated with these hit ratios, reliable testing cannot be achieved.
- Another method of cross-validation using the split-sample approach is the multi-discriminant procedure itself. After a cluster solution has been derived for one sample, cluster membership is used to identify each observation of the nonmetric dependent variable, discriminant functions are derived, and these functions are applied to the second sample. From there, a standard multi-discriminant analytical procedure is conducted, and a simple t-statistic can be applied to test for significance.
Some have argued that discriminant coefficients may be poor measures of population values and must be cross-validated themselves. In the absence of a large number of observations this could be very costly.
- A third cross-validation technique that also utilizes the split-sample approach utilizes the same cluster-analytical procedure twice. The step-by-step procedure is described below:
- Cluster analysis is performed on one half of the data until a statistically significant solution is determined and centroids obtained.
- Objects in the holdout sample are then assigned to those centroids based on their relative nearness (Euclidean distance) to each.
- The same cluster analytical technique is then run on the holdout sample.
- The nearest-centroid assignments (step 2) and the results of the clustering routine (step 3) are then compared.
- The level of agreement is an indicator of the stability of the solution.
- If an appropriate level of significance is obtained, the data is recombined, and the same statistical procedure is then repeated to obtain the final solution.
This approach is thought to provide an objective measure of reliability. For further reference, see McIntyre and Blashfield (1980), as cited in Punj and Stewart (1983).
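The agreement computation at the heart of the procedure can be sketched in a few lines. The example below is deliberately simplified: the data are simulated with a known two-group structure, and a trivial thresholding rule stands in for the full cluster-analytic routine, so only the logic of steps 2-4 is illustrated:

```python
import random

# Simulated data with a known two-group structure (means 0 and 10).
random.seed(1)
sample_a = ([random.gauss(0, 1) for _ in range(50)] +
            [random.gauss(10, 1) for _ in range(50)])
holdout = ([random.gauss(0, 1) for _ in range(50)] +
           [random.gauss(10, 1) for _ in range(50)])

# Step 1: "cluster" sample A and take centroids. A thresholding rule
# stands in for a real clustering routine in this sketch.
low = [x for x in sample_a if x < 5]
high = [x for x in sample_a if x >= 5]
centroids = [sum(low) / len(low), sum(high) / len(high)]

# Step 2: assign each holdout object to its nearest centroid.
assign_nearest = [0 if abs(x - centroids[0]) < abs(x - centroids[1]) else 1
                  for x in holdout]

# Step 3: cluster the holdout sample independently (same stand-in rule).
assign_cluster = [0 if x < 5 else 1 for x in holdout]

# Step 4: the agreement rate between the two assignments indicates
# the stability of the solution.
agreement = (sum(a == b for a, b in zip(assign_nearest, assign_cluster))
             / len(holdout))
print(agreement)
```

With well-separated simulated groups the agreement is essentially perfect; on real data, a low agreement rate is the signal that the solution is unstable and should not be carried forward to the recombined sample.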
- External Validation
cluster analysis (key features)
Of course, the true test of any statistical procedure is its ability to describe accurately the population from which the original sample was drawn. Thus, the researcher should be able to show that groups derived from the procedure do not contradict other attributes of the population that could be used to describe the groups but were not included in the original set of characteristic variables. Truly useful statistical techniques are those that use little information to describe much larger sets of phenomena.