Variable Selection and Sample Size
- Factor analysis is a diagnostic statistical technique used
for data reduction and summarization. As a statistical technique
it looks for common variation among a large number of variables,
and from this variation creates a smaller set of factors (new
variables). Shared variation among variables can arise when different
variables respond similarly to the same phenomenon, or when different
variables constitute different aspects of the same phenomenon.
In the first instance the common variation is brought about by
an external force to which all variables respond in a similar
manner. In the second instance no external source is needed to
bring about a common pattern of variance, because the variables
in question always behave similarly no matter what force is present.
Flotsam moves together atop undulating ocean waves not because
the individual pieces of flotsam are related, but because the
waves that cause the individual pieces to rise and fall are indifferent
to their presence. In contrast, the limbs of one's own body always
move together when the entire body is moved -- no matter the origin
of the force that causes the body to move. Thus, it is important
to know the approximate relationship, or lack of relationship,
among the variables that one enters into the analysis before
one can effectively interpret the reduced number of factors that
results.
The number of variables and the number of observations in any
statistical analysis are crucial. As the number of variables and
observations included in the HKLNA-Project are both expected
to be large, sample size will become an issue only with regard
to the size of the population being tested, not the statistical
technique itself.
Correlation Matrix - As factor
analysis is a statistical procedure that ignores cause-and-effect
relationships, it treats the variance of all variables (R-analysis)
or of all observations (Q-analysis) in the same way. In other words,
one can calculate the correlation matrix with respect to either
the variables or the observations simply by transposing the input
data matrix (a sketch follows the list below).
- R Analysis - Of the two procedures R and Q, R is the
more common.
- Q Analysis - This procedure groups the observations
according to shared variance around sample means and is indifferent
to the direction of variation. Thus, it is not a matter of differentiating
among individuals that are consistently high with regard to the
sample means of some variables and consistently low with regard
to others; rather, it is a matter of separating those that show
large deviation in either direction from those that show little
deviation. As it is often in the researcher's interest to understand
the direction of deviation as well as its magnitude when comparing
observations, other statistical procedures, such as cluster analysis,
are more commonly used.
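The interchangeability of the two analyses can be seen in how the
correlation matrix is computed. A minimal sketch, assuming a
hypothetical data matrix X whose rows are observations and whose
columns are variables:

```python
import numpy as np

# Hypothetical data matrix: 100 observations (rows) of 6 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# R-analysis: correlations among the variables (a 6 x 6 matrix).
R = np.corrcoef(X, rowvar=False)

# Q-analysis: correlations among the observations (a 100 x 100 matrix),
# computed from the same data simply turned on its side.
Q = np.corrcoef(X.T, rowvar=False)

print(R.shape, Q.shape)  # (6, 6) (100, 100)
```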
Factor Model - Common factor analysis
and principal component analysis are the two principal
techniques for obtaining factor solutions. The two methods differ
in the amount of information they employ when deriving a solution:
whereas principal component analysis uses all available variance
to calculate a factor solution, common factor analysis uses only
the variance that is shared among the variables. This difference
is highlighted by the structure of the correlation matrix -- namely,
the elements of its principal diagonal. Whereas the diagonal
elements under the principal component model consist only of ones,
and thus reflect the full variance of each variable, the diagonal
elements under the common factor model are the communalities
associated with each of the input variables. In short, the common
factor model ignores the unique variance of each variable in
calculating the final factor solution.
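This difference in the principal diagonal can be sketched directly.
A minimal illustration, assuming a hypothetical correlation matrix
and using squared multiple correlations as initial communality
estimates (one common choice among several):

```python
import numpy as np

# Hypothetical correlation matrix from 200 observations of 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
R = np.corrcoef(X, rowvar=False)

# Principal component model: factor R as-is, with ones (each variable's
# full variance) on the principal diagonal.
pc_eigvals, pc_eigvecs = np.linalg.eigh(R)

# Common factor model: replace the diagonal with communality estimates --
# here the squared multiple correlations (SMC) -- so that unique
# (specific and error) variance is excluded from the solution.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
cf_eigvals, cf_eigvecs = np.linalg.eigh(R_reduced)
```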
Which of the two models to use depends on two considerations:
the research objective and prior knowledge about the variance
structure.
- Common factor model - This model is employed when
the primary objective of the analysis is to identify latent
dimensions or constructs that provide new information about
how the input variables are related among themselves. In general
the researcher has little knowledge about the unique variance
associated with each of the input variables.
- Principal component analysis - This model is employed when
the objective of the analysis is to determine the minimum number
of factors needed to account for the maximum amount of information,
and the researcher knows that unique (specific and error) variance
is relatively small. This technique is most useful for eliminating
intercorrelation among an otherwise correlated set of independent
variables. As variable independence is a requisite assumption of
many predictive statistical techniques, principal component analysis
can be particularly valuable as a preliminary step to further analysis.
In effect, principal component analysis reconstitutes the intercorrelated
"independent" variable set into a set of truly uncorrelated
new variables (factors).
Another important use of principal component analysis is the
identification of surrogate
variables -- variables that load heavily on independent factors.
When the researcher has several variables from which to choose
to measure the same phenomenon, principal component analysis can
help identify good (uncorrelated) proxy variables for
further analysis.
See under general uses for
further clarification.
Method of Extraction - Once
the appropriate factor model has been determined, one must choose
between an orthogonal
or an oblique extraction
method (factor solution). When the goal of extraction is to obtain
independent factors for use in other statistical techniques that
require a high degree of independence among the explanatory variables,
orthogonal extraction is the appropriate choice. Thus, principal
component analysis and orthogonal extraction often go hand-in-hand.
As may be deduced from the names of these two extraction processes,
orthogonal extraction assumes independence among the extracted
factors, whereas oblique extraction assumes that the factors are
correlated. When the objective of the analysis is to identify
underlying factors or latent constructs (common factor analysis),
both orthogonal and oblique extraction methods can be employed.
Closely associated with the method of extraction is factor
rotation.
Factor Rotation (factor extraction criteria) -
In order to understand the importance of factor rotation, it is
useful to examine how the factors of an unrotated orthogonal
extraction are obtained.
- Unrotated, orthogonal, factor
extraction - In computing the unrotated factor matrix, whether
one employs principal component or common factor analysis, the
analyst seeks the best linear combination of the variables
-- in other words, that combination for which no other combination
can account for more of the total variance employed by the model.
As such, the first factor represents the best linear combination
of all variables; the second factor represents the best linear
combination based upon the variance that remains after the first
factor has been extracted; the third factor represents the best
linear combination based upon the variance that remains after the
first and second factors have been extracted. Subsequent factors
are derived similarly until all of the variance has been taken
into account. Since each subsequent factor is determined from
the residual variance of previous extractions, the independence
of all factors is ensured. Consequently, the order in which the
factors are extracted is crucial; each subsequent factor always
accounts for less of the model's total variance than any of the
factors preceding it.
Unrotated factor solutions achieve the task of data reduction
in so far as they provide the analyst with a series of factors
that account for an ever diminishing share of the total variance.
By selecting only those factors that account for the largest
amount of information, the analyst reduces the number of variables
with little loss of information. As the unrotated factor solution
may or may not provide a meaningful pattern of the model's total
variance, factor rotation (see next section) may or may not become
necessary.
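A minimal sketch of this ordering, assuming principal component
extraction from a hypothetical correlation matrix: the loadings of
each factor are its eigenvector scaled by the square root of its
eigenvalue, and sorting by descending eigenvalue guarantees that
each successive factor accounts for less variance than the one before.

```python
import numpy as np

# Hypothetical correlation matrix from 200 observations of 5 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
R = np.corrcoef(X, rowvar=False)

# Eigendecomposition of R; sort factors by descending eigenvalue so
# that each factor accounts for less variance than its predecessors.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Unrotated loadings: eigenvectors scaled by sqrt(eigenvalue).
loadings = eigvecs * np.sqrt(eigvals)

# Proportion of total variance accounted for by each factor.
print(eigvals / eigvals.sum())
```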
- Rotated, orthogonal, factor extraction
- Factor analysis can involve much more than the simple reduction
in the number of variables and their orthogonalization.
The purpose of rotation is to simplify the factor solution by
increasing the loading of individual variables on particular
factors and reducing the number of factors on which each variable
loads. Variables that load similarly on all factors, or factors
on which all variables load poorly, tell the researcher little
about the hidden forces that determine the behavior of some or
all of the variables, or about the hidden attributes that certain
or all of the variables share in common.
Unrotated factor solutions almost always look alike, in
so far as most of the variables load heavily on the first factor
and less heavily, or not at all, on each subsequent factor. Rotating
the factor solution redistributes the variance from the first
factor to subsequent factors in such a way that the researcher
is able to identify each of the factors more easily. A sample
comparison of orthogonally extracted, unrotated and rotated solutions
demonstrates how the model's variance is redistributed among the
factors when they are rotated (a sketch of one such rotation follows).
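A minimal sketch of one widely used orthogonal rotation, the varimax
criterion, applied to an unrotated loadings matrix such as the one
computed above (the function and its parameters are an illustrative
implementation, not a fixed API):

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a loadings matrix using the classic varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        grad = loadings.T @ (rotated ** 3
                             - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        if s.sum() < objective * (1 + tol):  # converged
            break
        objective = s.sum()
    return loadings @ rotation

# rotated = varimax(loadings)  # variance is redistributed across the
#                              # factors; their orthogonality is preserved
```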
- Oblique factor extraction with rotation
- In so far as factor independence is not of primary concern
to the researcher, oblique rotation
can provide a more accurate picture of the relationship between
the factor solution and the original variable set. Moreover,
it can provide the researcher with valuable information about
the degree of correlation that exists among the factors. In effect,
oblique rotation is both theoretically and empirically more realistic,
but impractical for subsequent analysis using statistical techniques
that require variable independence.
- Criteria for selecting the number
of factors to be rotated - As one of the goals of factor
analysis is to reduce a large number of variables to a smaller
number of more easily manipulated factors, researchers have developed
several guidelines for deciding the number of factors to interpret
after an initial factor run (two such guidelines are sketched below).
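A minimal sketch of two such guidelines, assuming the eigenvalues of
a correlation matrix: the Kaiser criterion (retain factors whose
eigenvalue exceeds one) and a cumulative-variance cutoff (the 0.80
target here is illustrative):

```python
import numpy as np

def n_factors(eigvals, variance_target=0.80):
    """Suggest how many factors to retain under two common guidelines."""
    eigvals = np.sort(eigvals)[::-1]

    # Kaiser criterion: retain factors with eigenvalue > 1, i.e. factors
    # that account for more variance than a single standardized variable.
    kaiser = int(np.sum(eigvals > 1.0))

    # Cumulative variance: retain the smallest number of factors that
    # together account for at least the target share of total variance.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    by_variance = int(np.searchsorted(cumulative, variance_target) + 1)

    return kaiser, by_variance
```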
Factor Interpretation - Once
the number of factors has been determined, the factors must be
interpreted. Interpreting factors is largely a question of which
variables load on which factors and in what amount. Several guidelines
have been suggested for determining when a factor loading is
significant. Variables that do not load heavily on a particular
factor should not be used to interpret that factor.
- For a sample size of 50 or more, a factor must account for
at least 10 percent of a variable's variation before that variable
can be used to interpret the factor. Thus, factor loadings of
about ±0.30 are considered weak, and any loading whose absolute
value is 0.50 or greater is considered strong.
- As factor loadings are correlations between variables and the
factors on which they load, rules similar to those applied
in judging correlation coefficients can be used. Thus, lower
factor loadings are permissible for larger sample sizes, as the
following table shows.
Minimum factor loadings required for significance, by sample size:

| Sample size | Minimum loading (1 percent level) | Percent of variance captured | Minimum loading (5 percent level) | Percent of variance captured |
|---|---|---|---|---|
| 100 | ±0.26 | 6.8 | ±0.19 | 3.6 |
| 200 | ±0.18 | 3.2 | ±0.14 | 2.0 |
| 300 | ±0.15 | 2.3 | ±0.11 | 1.2 |
Obviously very high restrictions on the level of significance
demand that a far larger number of factors be included in the
final solution.
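The loadings in the table behave like critical values of the Pearson
correlation coefficient. A minimal sketch that reproduces values of
this kind from the t distribution, assuming a two-tailed test (SciPy
supplies the t quantile; small rounding differences from the table
are to be expected):

```python
import numpy as np
from scipy import stats

def min_loading(n, alpha=0.05):
    """Critical correlation (minimum loading) for sample size n."""
    df = n - 2
    t = stats.t.ppf(1 - alpha / 2, df)  # two-tailed t quantile
    return t / np.sqrt(t ** 2 + df)     # convert t to r

for n in (100, 200, 300):
    r05, r01 = min_loading(n, 0.05), min_loading(n, 0.01)
    # The squared loading is the share of the variable's variance captured.
    print(n, round(r01, 2), round(r05, 2), round(r05 ** 2 * 100, 1))
```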
- Although the above rules appear sufficiently rigorous, they
do not take into account the number of factors in the factor
solution. As the number of factors increases, so too does the amount
of unique variance employed in the determination of higher-numbered
factors. Thus, with principal component analysis the above criteria
should be raised for higher-numbered factors.
- In addition to the number of factors, one should also take
into consideration the number of variables. As the number
of variables increases, the significance criterion can be lowered.
This is especially true for factors that are increasingly determined
by the unique variance of the model's variables. In the table
below one can observe the interaction of sample size, number
of variables, and factor number in the determination of criterion
values for a given level of desired significance.
Minimum factor loadings (5 percent significance level):

| Sample size | Number of variables | 5th factor | 10th factor |
|---|---|---|---|
| 50 | 20 | 0.292 | 0.393 |
| 50 | 50 | 0.267 | 0.274 |
| 100 | 20 | 0.216 | 0.261 |
| 100 | 50 | 0.202 | 0.214 |
Interpreting the factor matrix
- Some helpful steps for interpreting the factor
matrix include the following:
- Write out the names of the variables.
- Find the highest loading of each variable across the factors
and highlight it. This is best achieved by selecting one variable
and finding the factor on which it loads most highly before
moving on to the next variable (the sketch after this list
automates this step).
- After identifying the highest factor loading of each variable,
identify other loadings that are significant.
- Critically evaluate those variables that do not load significantly
on any factor. Variables that load significantly on no factor
and fail to demonstrate high communality can be eliminated and
a new factor solution obtained.
- Having identified all variables that load significantly, examine
each factor separately and assign it a name based upon those variables
that load most heavily on it. If no name can be found, the
factor should be labeled as undefined.
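A minimal sketch of steps two and three, assuming a loadings matrix
whose rows correspond to the variable names supplied and an
illustrative significance threshold of ±0.50:

```python
import numpy as np

def interpret(loadings, names, threshold=0.50):
    """Report each variable's highest factor loading and any other
    loadings that clear the significance threshold."""
    for i, name in enumerate(names):
        row = loadings[i]
        best = int(np.argmax(np.abs(row)))  # factor on which it loads highest
        others = [j + 1 for j in range(len(row))
                  if j != best and abs(row[j]) >= threshold]
        print(f"{name}: highest on factor {best + 1} ({row[best]:+.2f});"
              f" also significant on factors {others}")
```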
Factor Scores and Surrogate Variables
- Researchers wishing to perform further experiments using different
statistical tests can do so in either of two ways: one, select
a surrogate variable for each of the factors; or two, employ
the factor scores associated with each factor.
- Selection of Surrogate Variables - Surrogate variables
are typically chosen by selecting the variable that loads
highest on a particular factor. When more than one variable loads
particularly high on a given factor, the researcher should
select only the variable that, based upon a priori knowledge,
is likely to be the most reliable.
Surrogate variables are generally selected when the factor solution
is orthogonal and the level of independence of each surrogate
variable is likely to be high. Once again, the research objective
is crucial in determining proper surrogates (see the sketch at
the end of this section).
- Factor Scores
- Conceptually speaking, factor scores represent the degree to
which each observation scores on each of the factors. A high
score on a particular factor represents a strong relationship
between the observation and the factor; a low score indicates
a weak relationship.
Whether factor scores or surrogate variables are used depends
greatly on the ability of the researcher to provide the "new
variables" with appropriate names and meaningful interpretations.
Using factor scores requires the invention of new units of measurement
or, alternatively, standardization.
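A minimal sketch of both options, assuming a principal component
solution of hypothetical standardized data (for common factor
solutions, regression-based score estimators are typically used
instead):

```python
import numpy as np

# Hypothetical standardized data (z-scores), as in the earlier sketches.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
names = [f"var{i + 1}" for i in range(Z.shape[1])]  # illustrative names

# Principal component solution of the correlation matrix.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
loadings = eigvecs * np.sqrt(eigvals)

# Option one: surrogate variables -- for each factor, keep the single
# variable that loads highest on it.
surrogates = [names[int(np.argmax(np.abs(loadings[:, j])))]
              for j in range(loadings.shape[1])]

# Option two: factor scores -- one row per observation, one column per
# factor; dividing by sqrt(eigenvalue) standardizes each score.
scores = Z @ eigvecs / np.sqrt(eigvals)

print(surrogates)
print(scores.std(axis=0))  # approximately 1 for each factor
```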