Step 2 - Validation
- General points of interest
- Classification matrices
- Cutting score determination
- Chance models
- Classification matrices
- Measures of the discriminant function's statistical significance
are generally of limited value: a function may classify observations
poorly even when the distance between the group centroids is
statistically significant.
How well the discriminant function is able to classify individual
group members is a better test of the function's utility.
- Hit ratios
A hit ratio is the percentage of correctly classified observations.
Insofar as they tell how much of the total variation in the
dependent variable is accounted for by the discriminant function,
hit ratios are somewhat analogous to R-square values
in regression analysis. By the same analogy, the F-statistic of
regression analysis corresponds to the Chi-square statistic of
discriminant analysis.
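As a concrete illustration, the hit ratio can be computed directly from
the known and predicted memberships of a classified sample; the group
labels below are invented for the example:

```python
def hit_ratio(actual, predicted):
    """Percentage of correctly classified observations."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

# Invented known vs. predicted memberships for six observations
actual    = ["A", "A", "B", "B", "B", "A"]
predicted = ["A", "B", "B", "B", "A", "A"]

print(round(hit_ratio(actual, predicted), 1))  # 4 of 6 correct -> 66.7
```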
- Cutting scores (critical Z-values) - A cutting score
is the decision rule for determining an individual observation's
group membership. If the groups are of equal size, the cutting
score is the midpoint between the centroids of the two groups.
When the groups are of different sizes, the centroids
must be weighted.
Zc = (Na·Zb + Nb·Za) / (Na + Nb)
where
Zc = the critical Z
Na = number of observations in group A
Nb = number of observations in group B
Za = centroid of group A
Zb = centroid of group B
A normal distribution of the Z-scores around their centroids
is assumed.
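The weighted cutting score can be sketched as below. Note that in the
usual textbook formulation each centroid is weighted by the *opposite*
group's sample size, so the cut shifts toward the smaller group's
centroid; the group sizes and centroids here are invented for
illustration:

```python
def cutting_score(n_a, z_a, n_b, z_b):
    # Each centroid is weighted by the opposite group's size,
    # shifting the cut toward the smaller group's centroid.
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)

# Invented example: group A has 60 members (centroid -1.0),
# group B has 40 members (centroid 2.0).
z_c = cutting_score(60, -1.0, 40, 2.0)  # (60*2.0 + 40*(-1.0)) / 100 = 0.8

# Classify a new observation by comparing its discriminant score to Zc
score = 0.5
group = "A" if score < z_c else "B"     # here: "A"
```

With equal group sizes the formula reduces to the simple midpoint
between the two centroids.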
- Optimal cutting scores - Weighted cutting scores that
ignore the cost of misclassification are optimal only when
the cost of misclassification is the same for both groups.
- Procedure
After the sample has been split and the discriminant function
estimated, the function classifies the observations of the
hold-out sample, and the results are placed into a classification
matrix that compares their estimated membership with their known
membership.
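A minimal sketch of such a classification matrix, with invented
hold-out labels (rows are known groups, columns are predicted groups):

```python
from collections import Counter

# Invented hold-out results
actual    = ["A", "A", "A", "B", "B", "B"]
predicted = ["A", "A", "B", "B", "B", "A"]

# Count (known, predicted) pairs and print each row of the matrix
counts = Counter(zip(actual, predicted))
for a in ("A", "B"):
    print(a, [counts[(a, p)] for p in ("A", "B")])
# A [2, 1]
# B [1, 2]
```

The diagonal cells hold the correctly classified observations; their
sum divided by the total sample size is the hit ratio.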
- A t-test is employed to determine the level of significance
of the discriminant function's ability to classify the observations
correctly. For a two-group analysis with equal sample sizes the
following formula is employed:
t = (p - 0.5) / sqrt[0.5(1 - 0.5) / n]
where p = the proportion of correctly classified observations
and n = the size of the entire sample.
The formula can be modified to cover more than two groups
and groups of different sizes.
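A small sketch of this test, with an invented hit ratio and sample
size; the denominator is the standard error of the chance proportion
(0.5 for two equal-sized groups):

```python
import math

def hit_ratio_t(p, n):
    # p: proportion correctly classified; n: total sample size.
    # 0.5 is the chance proportion for two equal-sized groups.
    return (p - 0.5) / math.sqrt(0.5 * (1 - 0.5) / n)

# Invented example: 70% correctly classified out of n = 100
t = hit_ratio_t(0.70, 100)  # 0.20 / 0.05 = 4.0
```

A large t indicates that the observed hit ratio is unlikely to have
arisen by chance alone.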
- Chance models - a few rules of thumb
In general the discriminant function should predict the proper
group for each observation better than chance. How much better
will depend on the cost of generating the discriminant
function - that is, the cost of the analysis - and the actual
value derived from accurately predicting group membership.
Two common rules of thumb in this regard are the
- Maximum chance criterion, and
- Proportional chance criterion
Chance models produce accurate measures only when a split
(hold-out) sample is available.
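Both criteria can be illustrated with a small sketch for a two-group
case with invented group sizes: the maximum chance criterion is the
proportion of the larger group, while the proportional chance
criterion is p^2 + (1 - p)^2:

```python
# Invented two-group sizes
n_a, n_b = 75, 25
p_large = max(n_a, n_b) / (n_a + n_b)

# Maximum chance: always guess the larger group
maximum_chance = p_large                            # 0.75

# Proportional chance: classify in proportion to group sizes
proportional_chance = p_large**2 + (1 - p_large)**2  # 0.5625 + 0.0625 = 0.625

print(maximum_chance, proportional_chance)
```

The discriminant function's hit ratio should exceed whichever
criterion is chosen as the benchmark.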
- Interpretation of the discriminant function becomes useful
only if both statistical significance of the function and a
satisfactory level of classification accuracy are achieved.
Go to step 3