Variable Selection and Sample Size
- Factor analysis is a diagnostic statistical technique used
for data reduction and summarization. As a statistical technique
it looks for common variation among a large number of variables,
and from this variation creates a smaller set of factors (new
variables). Shared variation among variables can arise when different
variables respond similarly to the same phenomenon, or when different
variables constitute different aspects of the same phenomenon.
In the first instance the common variation is brought about by
an external force to which all variables respond in a similar
manner. In the second instance no external source is needed to
bring about a common pattern of variance, because the variables
in question always behave similarly no matter what force is present.
Flotsam moves together atop undulating ocean waves not because
the individual pieces of flotsam are related, but because the
waves that cause the individual pieces to rise and fall are indifferent
to their presence. In contrast, the limbs of one's own body always
move together when the entire body is moved -- no matter the origin
of the force that causes the body to move. Thus, it is important
to know the approximate relationship, or lack of relationship,
among the variables that one enters into the analysis before
one can effectively interpret the reduced number of factors that
results.
The number of variables and the number of observations in any
statistical analysis are crucial. As the number of variables and
observations included in the HKLNA-Project are both expected
to be large, sample size will become an issue only with regard
to the size of the population being tested, not the statistical
technique itself.
Correlation Matrix - As factor
analysis is a statistical procedure that ignores cause-and-effect
relationships, it treats the variance of all variables (R-analysis)
or of all observations (Q-analysis) in the same way. In other words,
one can calculate the correlation matrix with respect to either
the variables or the observations simply by transposing the input
data matrix (a sketch follows the list below).
- R Analysis - Of the two procedures R and Q, R is the
more common.
- Q Analysis - This procedure groups the observations
according to shared variance around sample means and is indifferent
to the direction of variation. Thus, it is not a matter of differentiating
among individuals that are consistently high with regard to the
sample means of some variables and consistently low with regard
to others; rather, it is a matter of separating those that show
large deviation in either direction from those that show little
deviation. As it is often in the researcher's interest to understand
the direction of deviation as well as its magnitude when comparing
observations, other statistical procedures, such as cluster analysis,
are more commonly used.
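The interchangeability of the two analyses can be seen in how the
correlation matrix is computed. A minimal sketch, assuming a
hypothetical data matrix X whose rows are observations and whose
columns are variables:

```python
import numpy as np

# Hypothetical data matrix: 100 observations (rows) of 6 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# R-analysis: correlations among the variables (a 6 x 6 matrix).
R = np.corrcoef(X, rowvar=False)

# Q-analysis: correlations among the observations (a 100 x 100 matrix),
# computed from the same data simply turned on its side.
Q = np.corrcoef(X.T, rowvar=False)

print(R.shape, Q.shape)  # (6, 6) (100, 100)
```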
Factor Model - Common factor analysis
and principal component analysis are the two principal
techniques for obtaining factor solutions. The two methods differ
in the amount of information they employ when deriving a solution:
whereas principal component analysis uses all available variance
to calculate a factor solution, common factor analysis uses only
the variance that is shared among the variables. This difference
is highlighted by the structure of the correlation matrix -- namely,
the elements of its principal diagonal. Whereas the diagonal
elements under the principal component model consist only of ones,
and thus reflect the full variance of each variable, the diagonal
elements under the common factor model are the communalities
associated with each of the input variables. In short, the common
factor model ignores the unique variance of each variable in
calculating the final factor solution.
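This difference in the principal diagonal can be sketched directly.
A minimal illustration, assuming a hypothetical correlation matrix
and using squared multiple correlations as initial communality
estimates (one common choice among several):

```python
import numpy as np

# Hypothetical correlation matrix from 200 observations of 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
R = np.corrcoef(X, rowvar=False)

# Principal component model: factor R as-is, with ones (each variable's
# full variance) on the principal diagonal.
pc_eigvals, pc_eigvecs = np.linalg.eigh(R)

# Common factor model: replace the diagonal with communality estimates --
# here the squared multiple correlations (SMC) -- so that unique
# (specific and error) variance is excluded from the solution.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
cf_eigvals, cf_eigvecs = np.linalg.eigh(R_reduced)
```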
Which of the two models to use depends on two considerations:
the research objective and prior knowledge about the variance
structure.
- Common factor model - This model is employed when
the primary objective of the analysis is to identify latent
dimensions or constructs that provide new information about
how the input variables are related among themselves. In general
the researcher has little knowledge about the unique variance
associated with each of the input variables.
- Principal component analysis - This model is employed when
the objective of the analysis is to determine the minimum number
of factors needed to account for the maximum amount of information,
and the researcher knows that unique (specific and error) variance
is relatively small. This technique is most useful for eliminating
intercorrelation among an otherwise correlated set of independent
variables. As variable independence is a requisite assumption of
many predictive statistical techniques, principal component analysis
can be particularly valuable as a preliminary step to further analysis.
In effect, principal component analysis reconstitutes the intercorrelated
"independent" variable set into a set of truly uncorrelated
new variables (factors).
Another important use of principal component analysis is the
identification of surrogate
variables -- variables that load heavily on independent factors.
When the researcher has several variables from which to choose
to measure the same phenomenon, principal component analysis can
help identify good (uncorrelated) proxy variables for
further analysis.
See under general uses for
further clarification.
Method of Extraction - Once
the appropriate factor model has been determined, one must choose
between an orthogonal
or an oblique extraction
method (factor solution). When the goal of extraction is to obtain
independent factors for use in other statistical techniques that
require a high degree of independence among the explanatory variables,
orthogonal extraction is the appropriate choice. Thus, principal
component analysis and orthogonal extraction often go hand-in-hand.
As may be deduced from the names of these two extraction processes,
orthogonal extraction assumes independence among the extracted
factors, whereas oblique extraction assumes that the factors are
correlated. When the objective of the analysis is to identify
underlying factors or latent constructs (common factor analysis),
both orthogonal and oblique extraction methods can be employed.
Closely associated with the method of extraction is factor
rotation.
Factor Rotation (factor extraction criteria) -
In order to understand the importance of factor rotation, it is
useful to examine how the factors of an unrotated orthogonal
extraction are obtained.
- Unrotated, orthogonal, factor
extraction - In computing the unrotated factor matrix, whether
one employs principal component or common factor analysis, the
analyst seeks the best linear combination of the variables
-- in other words, that combination for which no other combination
can account for more of the total variance employed by the model.
As such, the first factor represents the best linear combination
of all variables; the second factor represents the best linear
combination based upon the variance that remains after the first
factor has been extracted; the third factor represents the best
linear combination based upon the variance that remains after the
first and second factors have been extracted. Subsequent factors
are derived similarly until all of the variance has been taken
into account. Since each subsequent factor is determined from
the residual variance of previous extractions, the independence
of all factors is ensured. Consequently, the order in which the
factors are extracted is crucial; each subsequent factor always
accounts for less of the model's total variance than any of the
factors preceding it.
Unrotated factor solutions achieve the task of data reduction
in so far as they provide the analyst with a series of factors
that account for an ever diminishing share of the total variance.
By selecting only those factors that account for the largest
amount of information, the analyst reduces the number of variables
with little loss of information. As the unrotated factor solution
may or may not provide a meaningful pattern of the model's total
variance, factor rotation (see next section) may or may not become
necessary.
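A minimal sketch of this ordering, assuming principal component
extraction from a hypothetical correlation matrix: the loadings of
each factor are its eigenvector scaled by the square root of its
eigenvalue, and sorting by descending eigenvalue guarantees that
each successive factor accounts for less variance than the one before.

```python
import numpy as np

# Hypothetical correlation matrix from 200 observations of 5 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
R = np.corrcoef(X, rowvar=False)

# Eigendecomposition of R; sort factors by descending eigenvalue so
# that each factor accounts for less variance than its predecessors.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Unrotated loadings: eigenvectors scaled by sqrt(eigenvalue).
loadings = eigvecs * np.sqrt(eigvals)

# Proportion of total variance accounted for by each factor.
print(eigvals / eigvals.sum())
```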
- Rotated, orthogonal, factor extraction
- Factor analysis can involve much more than the simple reduction
in the number of variables and their orthogonalization.
The purpose of rotation is to simplify the factor solution by
increasing the loading of individual variables on particular
factors and reducing the number of factors on which each variable
loads. Variables that load similarly on all factors, or factors
on which all variables load poorly, tell the researcher little
about the hidden forces that determine the behavior of some or
all of the variables, or about the hidden attributes that certain
or all of the variables share in common.
Unrotated factor solutions almost always look alike, in
so far as most of the variables load heavily on the first factor
and less heavily, or not at all, on each subsequent factor. Rotating
the factor solution redistributes the variance from the first
factor to subsequent factors in such a way that the researcher
is able to identify each of the factors more easily. A sample
comparison of orthogonally extracted, unrotated and rotated solutions
demonstrates how the model's variance is redistributed among the
factors when they are rotated (a sketch of one such rotation follows).
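A minimal sketch of one widely used orthogonal rotation, the varimax
criterion, applied to an unrotated loadings matrix such as the one
computed above (the function and its parameters are an illustrative
implementation, not a fixed API):

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a loadings matrix using the classic varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        grad = loadings.T @ (rotated ** 3
                             - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        if s.sum() < objective * (1 + tol):  # converged
            break
        objective = s.sum()
    return loadings @ rotation

# rotated = varimax(loadings)  # variance is redistributed across the
#                              # factors; their orthogonality is preserved
```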
- Oblique factor extraction with rotation
- In so far as factor independence is not of primary concern
to the researcher, oblique rotation
can provide a more accurate picture of the relationship between
the factor solution and the original variable set. Moreover,
it can provide the researcher with valuable information about
the degree of correlation that exists among the factors. In effect,
oblique rotation is both theoretically and empirically more realistic,
but impractical for subsequent analysis using statistical techniques
that require variable independence.
- Criteria for selecting the number
of factors to be rotated - As one of the goals of factor
analysis is to reduce a large number of variables to a smaller
number of more easily manipulated factors, researchers have developed
several guidelines for deciding the number of factors to interpret
after an initial factor run (two such guidelines are sketched below).
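A minimal sketch of two such guidelines, assuming the eigenvalues of
a correlation matrix: the Kaiser criterion (retain factors whose
eigenvalue exceeds one) and a cumulative-variance cutoff (the 0.80
target here is illustrative):

```python
import numpy as np

def n_factors(eigvals, variance_target=0.80):
    """Suggest how many factors to retain under two common guidelines."""
    eigvals = np.sort(eigvals)[::-1]

    # Kaiser criterion: retain factors with eigenvalue > 1, i.e. factors
    # that account for more variance than a single standardized variable.
    kaiser = int(np.sum(eigvals > 1.0))

    # Cumulative variance: retain the smallest number of factors that
    # together account for at least the target share of total variance.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    by_variance = int(np.searchsorted(cumulative, variance_target) + 1)

    return kaiser, by_variance
```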
Factor Interpretation - Once
the number of factors has been determined, the factors must be
interpreted. Interpreting factors is largely a question of which
variables load on which factors and in what amount. Several guidelines
have been suggested for determining when a factor loading is
significant. Variables that do not load heavily on a particular
factor should not be used to interpret that factor.
- For a sample size of 50 or more, a factor must account for
at least 10 percent of a variable's variation before that variable
can be used to interpret the factor. Thus, factor loadings of
about ±0.30 are considered weak, and any loading whose absolute
value is 0.50 or greater is considered strong.
- As factor loadings are correlations between variables and the
factors on which they load, rules similar to those applied
in judging correlation coefficients can be used. Thus, lower
factor loadings are permissible for larger sample sizes, as the
following table shows.
Minimum factor loadings required for significance, by sample size:

| Sample size | Minimum loading (1 percent level) | Percent of variance captured | Minimum loading (5 percent level) | Percent of variance captured |
|---|---|---|---|---|
| 100 | ±0.26 | 6.8 | ±0.19 | 3.6 |
| 200 | ±0.18 | 3.2 | ±0.14 | 2.0 |
| 300 | ±0.15 | 2.3 | ±0.11 | 1.2 |
Obviously very high restrictions on the level of significance
demand that a far larger number of factors be included in the
final solution.
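The loadings in the table behave like critical values of the Pearson
correlation coefficient. A minimal sketch that reproduces values of
this kind from the t distribution, assuming a two-tailed test (SciPy
supplies the t quantile; small rounding differences from the table
are to be expected):

```python
import numpy as np
from scipy import stats

def min_loading(n, alpha=0.05):
    """Critical correlation (minimum loading) for sample size n."""
    df = n - 2
    t = stats.t.ppf(1 - alpha / 2, df)  # two-tailed t quantile
    return t / np.sqrt(t ** 2 + df)     # convert t to r

for n in (100, 200, 300):
    r05, r01 = min_loading(n, 0.05), min_loading(n, 0.01)
    # The squared loading is the share of the variable's variance captured.
    print(n, round(r01, 2), round(r05, 2), round(r05 ** 2 * 100, 1))
```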
- Although the above rules appear sufficiently rigorous, they
do not take into account the number of factors in the factor
solution. As the number of factors increases, so too does the amount
of unique variance employed in the determination of higher-numbered
factors. Thus, with principal component analysis the above criteria
should be raised for higher-numbered factors.
- In addition to the number of factors, one should also take
into consideration the number of variables. As the number
of variables increases, the significance criterion can be lowered.
This is especially true for factors that are increasingly determined
by the unique variance of the model's variables. In the table
below one can observe the interaction of sample size, number
of variables, and factor number in the determination of criterion
values for a given level of desired significance.
Minimum factor loadings (5 percent significance level):

| Sample size | Number of variables | 5th factor | 10th factor |
|---|---|---|---|
| 50 | 20 | 0.292 | 0.393 |
| 50 | 50 | 0.267 | 0.274 |
| 100 | 20 | 0.216 | 0.261 |
| 100 | 50 | 0.202 | 0.214 |
Interpreting the factor matrix
- Some helpful steps for interpreting the factor
matrix include the following:
- Write out the names of the variables.
- Find the highest loading of each variable across the factors
and highlight it. This is best achieved by selecting one variable
and finding the factor on which it loads most highly before
moving on to the next variable (the sketch after this list
automates this step).
- After identifying the highest factor loading of each variable,
identify other loadings that are significant.
- Critically evaluate those variables that do not load significantly
on any factor. Variables that load significantly on no factor
and fail to demonstrate high communality can be eliminated and
a new factor solution obtained.
- Having identified all variables that load significantly, examine
each factor separately and assign it a name based upon those variables
that load most heavily on it. If no name can be found, the
factor should be labeled as undefined.
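A minimal sketch of steps two and three, assuming a loadings matrix
whose rows correspond to the variable names supplied and an
illustrative significance threshold of ±0.50:

```python
import numpy as np

def interpret(loadings, names, threshold=0.50):
    """Report each variable's highest factor loading and any other
    loadings that clear the significance threshold."""
    for i, name in enumerate(names):
        row = loadings[i]
        best = int(np.argmax(np.abs(row)))  # factor on which it loads highest
        others = [j + 1 for j in range(len(row))
                  if j != best and abs(row[j]) >= threshold]
        print(f"{name}: highest on factor {best + 1} ({row[best]:+.2f});"
              f" also significant on factors {others}")
```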
Factor Scores and Surrogate Variables
- Researchers wishing to perform further experiments using different
statistical tests can do so in either of two ways: one, select
a surrogate variable for each of the factors; or two, employ
the factor scores associated with each factor.
- Selection of Surrogate Variables - Surrogate variables
are typically chosen by selecting the variable that loads
highest on a particular factor. When more than one variable loads
particularly high on a given factor, the researcher should
select only the variable that, based upon a priori knowledge,
is likely to be the most reliable.
Surrogate variables are generally selected when the factor solution
is orthogonal and the level of independence of each surrogate
variable is likely to be high. Once again, the research objective
is crucial in determining proper surrogates (see the sketch at
the end of this section).
- Factor Scores
- Conceptually speaking, factor scores represent the degree to
which each observation scores on each of the factors. A high
score on a particular factor represents a strong relationship
between the observation and the factor; a low score indicates
a weak relationship.
Whether factor scores or surrogate variables are used depends
greatly on the ability of the researcher to provide the "new
variables" with appropriate names and meaningful interpretations.
Using factor scores requires the invention of new units of measurement
or, alternatively, standardization.
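A minimal sketch of both options, assuming a principal component
solution of hypothetical standardized data (for common factor
solutions, regression-based score estimators are typically used
instead):

```python
import numpy as np

# Hypothetical standardized data (z-scores), as in the earlier sketches.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
names = [f"var{i + 1}" for i in range(Z.shape[1])]  # illustrative names

# Principal component solution of the correlation matrix.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
loadings = eigvecs * np.sqrt(eigvals)

# Option one: surrogate variables -- for each factor, keep the single
# variable that loads highest on it.
surrogates = [names[int(np.argmax(np.abs(loadings[:, j])))]
              for j in range(loadings.shape[1])]

# Option two: factor scores -- one row per observation, one column per
# factor; dividing by sqrt(eigenvalue) standardizes each score.
scores = Z @ eigvecs / np.sqrt(eigvals)

print(surrogates)
print(scores.std(axis=0))  # approximately 1 for each factor
```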