HKLNA Project - Analytical Routines

statistical modelling (cluster analysis)

statistical modelling (diagnostics)

English or languish - Probing the ramifications
of Hong Kong's language policy

CLUSTER ANALYSIS
Analytical Routines
cluster analysis (research design issues)

There are many clustering techniques available, and the researcher must decide among them.

Hierarchical Routines
cluster analysis (identification | two-stage flow diagram)

Minimum Variance Routine or Ward's Method
cluster analysis (key features | research design issues)

Listed below are two opposite analytical approaches using the same statistical technique.

Descending Iteration - Using a standard least-squares technique, the entire set of observations is divided into two primary clusters. Each cluster is then broken down into two further clusters. This procedure is followed until a hierarchical array of clusters is obtained. (Edwards and Cavalli-Sforza)

This procedure reveals pre-existing sectoral divisions that could be deep cutting, such as political, religious, ethnic, ideological, or racial differences, or utterly superficial, such as passing fads, popular misconceptions, or national stereotypes. These are widespread differences that are likely to influence individual behavior only when one is compelled to act outside of one's own group.
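The descending procedure can be illustrated with a minimal sketch. It assumes that "least squares" means choosing, at each step, the two-way split with the smallest total within-group sum of squares, and it searches all splits exhaustively, which is only practical for very small samples. The function names and the `min_size` stopping rule are hypothetical and are not taken from Edwards and Cavalli-Sforza.

```python
from itertools import combinations


def sum_sq(points):
    """Sum of squared deviations of the points from their own mean."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return sum((p[d] - mean[d]) ** 2 for p in points for d in range(dim))


def best_split(points):
    """Try every two-way split and keep the one with the least within-group sum of squares."""
    n = len(points)
    best = None
    # Observation 0 always stays in the left group, so each split is tried once.
    for size in range(0, n - 1):
        for rest in combinations(range(1, n), size):
            left_idx = {0, *rest}
            left = [points[i] for i in sorted(left_idx)]
            right = [points[i] for i in range(n) if i not in left_idx]
            cost = sum_sq(left) + sum_sq(right)
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best[1], best[2]


def divide(points, min_size=2):
    """Split recursively; a leaf is a list of points, a node is a (left, right) tuple."""
    if len(points) <= max(min_size, 1):  # hypothetical stopping rule
        return points
    left, right = best_split(points)
    return (divide(left, min_size), divide(right, min_size))


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
hierarchy = divide(data, min_size=3)
```

With the six observations above, the first split separates the three points near the origin from the three points near (10, 10).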

Ascending Iteration - This procedure works in the exact opposite direction. Initially there are as many clusters as there are observations. Through sequential iterations a hierarchy of ever larger clusters is obtained until the entire sample population forms one large cluster. (Ward)

This procedure places the individual at the center of attention in so far as clusters are built around individuals who most typically resemble one another. It is likely to result in a more political description of the population. It could be used to identify diverse popular sentiment, or the behavior of individuals within their own group.
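The ascending procedure can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance and the usual expression for the increase in within-cluster sum of squares when two clusters merge, n_a n_b / (n_a + n_b) times the squared distance between their centroids. The function names are hypothetical.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging clusters a and b."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


def ward_agglomerate(points):
    """Start from singleton clusters and record each merge until one cluster remains."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Merge the pair whose union adds the least within-cluster variance.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        history.append((clusters[i], clusters[j], ward_cost(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


data = [(0, 0), (0, 1), (5, 5), (5, 6)]
for left, right, cost in ward_agglomerate(data):
    print(left, "+", right, "cost", cost)
```

Reading the merge history from first to last entry gives the hierarchy of ever larger clusters; cutting it at a chosen merge cost yields a flat partition.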

Both of these procedures are highly quantitative in so far as they minimize within-group variance and require metric data that facilitates the calculation of means and variances. Both methods can be employed in either classification or structural modelling.

Unlike other hierarchical clustering techniques, this method uses an objective statistic, tr W, where W is the pooled within-cluster sum of squares and cross-products matrix. Ward's method is similar to the average linkage routines described below in so far as the minimized variance is a function of deviations from the sample mean. Ward's method assumes that the characteristic variable set follows a multivariate normal distribution.
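Written out, the statistic takes the following form. The notation is an assumption introduced here for illustration: K clusters C_k, observations x_i, and cluster means \bar{x}_k.

```latex
W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top},
\qquad
\operatorname{tr} W = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^{2}
```

The trace is therefore the total within-cluster sum of squared deviations, summed over all variables.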

Ward's Method is severely limited by its strong natural bias toward assigning similar numbers of observations to all clusters. Notwithstanding, because it is statistically rigorous, it makes an excellent starting point for the first stage of a two-stage clustering procedure.

Other Hierarchical Routines
cluster analysis (key features)

Single linkage - An observation is joined to a cluster if it has a certain level of similarity with at least one of the members of that cluster. Connections between clusters are based on links between single entities.

Complete linkage - An observation is joined to a cluster, if it has a certain level of similarity with all current members of the cluster.

Average linkage
cluster analysis (research design issues)
An observation is joined to a cluster, if it has a certain average level of similarity with all members of that cluster. These routines differ in the manner in which the average level of similarity is defined.

Simple average method - Membership is based on the average for the entire group. This method appears to be the most widely used among the average linkage routines.
Weighted average method - This method provides for an a priori weighting of the averages based on a desired number of observations in each cluster.
Centroid method - This method provides for an initial computation of the centroid of each cluster.
Median method - Like the weighted average method, weights are assigned based on the number of observations desired in each cluster.
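The linkage rules differ only in how the closeness of two groups is measured. The sketch below assumes Euclidean distance as the dissimilarity measure (a small distance standing for a high level of similarity) and covers single, complete, simple average, and centroid linkage. The function names are hypothetical, and the weighted average and median variants are left out.

```python
from itertools import product
from math import dist


def single_link(a, b):
    """Distance between the closest pair of members, one from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Distance between the most distant pair of members, one from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Mean distance over all pairs of members, one from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_link(a, b):
    """Distance between the two cluster centroids."""
    def centroid(c):
        return [sum(p[d] for p in c) / len(c) for d in range(len(c[0]))]
    return dist(centroid(a), centroid(b))


a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
print(single_link(a, b), complete_link(a, b), average_link(a, b), centroid_link(a, b))
```

For the same pair of clusters the single-linkage value is never larger than the average, and the average is never larger than the complete-linkage value, which is why single linkage tends to chain clusters together while complete linkage keeps them compact.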

Threshold Routine
cluster analysis (identification | key features)

A primary node (observation) is selected based on its proximity to the centroid of the entire sample population. Based upon a pre-determined threshold (cut-off) distance, additional observations are added to the cluster until the distance of the next furthest observation from the primary node exceeds the threshold distance, whereupon that observation is selected as the primary node for the formation of the next cluster. The primary node for the third cluster is obtained when the next furthest observation from the average of the centroids (primary nodes) of the first two clusters exceeds the threshold distance.

This procedure is likely to reflect better the underlying universal structural relationships of a society and would be most useful in analyzing a society that is, culturally speaking, relatively homogeneous.

Mathematically speaking this method of clustering does not require high level metric data.
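One possible reading of the threshold routine is sketched below. Several details are assumptions made for illustration: Euclidean distance is used even though the routine does not require metric data, each cluster collects every unassigned observation within the threshold of its primary node, and the next primary node is the nearest remaining observation lying beyond the threshold from the average of the primary nodes chosen so far. The function names are hypothetical.

```python
from math import dist


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def threshold_clusters(points, threshold):
    """Return clusters as lists of observation indices, in order of formation."""
    unassigned = set(range(len(points)))
    overall = mean_point(points)
    # First primary node: the observation closest to the overall centroid.
    node = min(sorted(unassigned), key=lambda i: dist(points[i], overall))
    nodes, clusters = [], []
    while True:
        nodes.append(node)
        members = sorted(i for i in unassigned
                         if dist(points[i], points[node]) <= threshold)
        clusters.append(members)
        unassigned -= set(members)
        if not unassigned:
            return clusters
        # Reference point: average of the primary nodes selected so far.
        reference = mean_point([points[i] for i in nodes])
        beyond = [i for i in sorted(unassigned)
                  if dist(points[i], reference) > threshold]
        candidates = beyond or sorted(unassigned)  # fallback is an assumption
        node = min(candidates, key=lambda i: dist(points[i], reference))


data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(threshold_clusters(data, threshold=2.0))
```

The number of clusters is not fixed in advance; it follows from the threshold distance chosen by the researcher.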

Iterative Partitioning (Nonhierarchical) Methods
cluster analysis (two-stage flow diagram | key features)
These methods begin with the partitioning of the observations into a specified number of clusters. The number of clusters can be determined on a random or nonrandom basis. Observations are then reassigned to clusters until some stopping criterion is met. In addition to the stopping criterion, non-hierarchical methods differ according to a number of different features including the

original number of partitions - The number of original partitions can be selected either randomly or nonrandomly.

reassignment procedure - The two types of reassignment procedures most commonly used are the K-means and hill-climbing procedures.

K-means Procedure - Observations are assigned to the cluster whose centroid is nearest to the observation. Reassignment continues until every observation belongs to the cluster with the nearest centroid, that is, until no observation changes cluster.

Hill-Climbing Procedure - Observations are moved from cluster to cluster until a selected statistical criterion is met. Reassignment continues until optimization occurs. The objective function is selected by the researcher.

decision rule for terminating a cluster

frequency with which centroids are updated during the assignment process - Centroids may be updated after each new reassignment or only after all observations have been reassigned.
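The two reassignment procedures can be compared in a short sketch. It rests on several assumptions: Euclidean distance, starting centroids supplied by the caller, centroids updated only after all observations have been reassigned in the K-means pass, and the within-cluster sum of squares as the hill-climbing objective. All function names are hypothetical.

```python
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def kmeans(points, seeds, max_iter=100):
    """Batch K-means: centroids are recomputed after every full reassignment pass."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:  # stopping rule: no observation changed cluster
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels, centroids


def within_ss(points, labels, k):
    """Objective: total squared distance of observations from their cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            m = centroid(members)
            total += sum(dist(p, m) ** 2 for p in members)
    return total


def hill_climb(points, labels, k):
    """Move one observation at a time whenever the move lowers the objective."""
    labels = list(labels)
    best = within_ss(points, labels, k)
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            keep = labels[i]
            for c in range(k):
                if c == keep:
                    continue
                labels[i] = c
                score = within_ss(points, labels, k)
                if score < best - 1e-12:
                    best, keep, improved = score, c, True
            labels[i] = keep
    return labels, best


data = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(data, seeds=[(0, 0), (10, 10)]))
print(hill_climb(data, labels=[0, 1, 0, 1], k=2))
```

Updating the centroids after each individual move instead of after each full pass would give the other update frequency mentioned above; the hill-climbing function effectively does this, since it re-evaluates the objective after every trial move.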

Other clustering techniques
cluster analysis (key features)

Factor Q-Analysis - see proximity measures
Space density search routines (a kind of clustering technique) - This method of clustering divides the n-dimensional variable space into hypercubes and uses observation density as the standard for determining the presence of clusters.
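The density idea can be illustrated with a minimal sketch. It assumes equal-sided hypercubes of width `cell_size` and treats a hypercube as part of a cluster when it holds at least `min_count` observations. Both parameters and the function name are hypothetical, and the sketch does not join neighbouring dense hypercubes into larger clusters.

```python
from collections import Counter
from math import floor


def dense_cells(points, cell_size, min_count):
    """Count observations per hypercube and keep the cubes that are dense enough."""
    counts = Counter(tuple(floor(x / cell_size) for x in p) for p in points)
    return {cell: n for cell, n in counts.items() if n >= min_count}


data = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.5, 5.1), (5.9, 5.7), (9.2, 0.1)]
print(dense_cells(data, cell_size=1.0, min_count=2))
```

Each key in the result is the integer grid coordinate of a hypercube, so the same function works for any number of variables.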
