CORRESP provides
internal analysis of two-way or multi-way data of
various kinds. Simple correspondence analysis is
typically applied to represent row and column categories of a two-way
contingency table in a two-dimensional map. But
the same procedure can be applied to any matrix which can plausibly be regarded
as consisting of 'pseudo-frequencies'.
DATA:      2-way, 2-mode table of frequencies
TRANSFORM: Linear
MODEL:     Chi-square distance
It can also be applied descriptively to non-frequency data such as rankings, or
data representing the intensity of responses to stimuli, or any of a variety of
indices of proximity.
Correspondence Analysis is increasingly popular in
analysing contingency tables and exploring areas such as the sequencing of
artefacts found at different archaeological sites or levels of excavation
('seriation'), and of animals or plants and habitats ('gradient analysis') based
upon relationships
between frequencies of the objects.
Input
of multi-way indicator matrices or Burt matrices (obtained by multiplying an
indicator matrix by its transpose) is one form of multiple correspondence
analysis, as is Guttman scaling. Stacking of a series of two-way tables is
another.
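As a rough illustration of the indicator and Burt matrices mentioned above, the sketch below codes a handful of cases as 0/1 columns and forms the cross-product. The function names and the toy data are hypothetical, and the product is taken here as Z'Z (categories by categories), which is an assumption about the order of multiplication intended.

```python
# Illustrative sketch only: function names and data are hypothetical.

def indicator_matrix(cases):
    """Code each case (a tuple of category labels, one per variable) as 0/1 columns."""
    n_vars = len(cases[0])
    # One column per (variable, category) pair; categories sorted for reproducibility.
    columns = [(v, cat) for v in range(n_vars)
               for cat in sorted({case[v] for case in cases})]
    rows = [[1 if case[v] == cat else 0 for (v, cat) in columns] for case in cases]
    return columns, rows


def burt_matrix(z):
    """Cross-product Z'Z of an indicator matrix Z (assumed order of multiplication)."""
    n_cols = len(z[0])
    return [[sum(row[j] * row[k] for row in z) for k in range(n_cols)]
            for j in range(n_cols)]


cases = [("male", "yes"), ("female", "no"), ("female", "yes"), ("male", "yes")]
columns, z = indicator_matrix(cases)
burt = burt_matrix(z)

for (variable, category), row in zip(columns, burt):
    print(variable, category, row)
```

The diagonal of the result holds the category frequencies, and the off-diagonal blocks are the two-way tables between pairs of variables.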
The correspondence analysis model (CA) represents the row and column categories of the input matrix as points in the same dimensionality. It is closely related to MDPREF which also represents row and column variables in the same space, but instead fits row variables as vectors to the configuration derived from the column variables. Both CA and MDPREF use a similar singular value decomposition. An important difference is that CA considers only interactive factors by explicitly neglecting the magnitude effect after decomposition. Because the canonical ("optimal") scores reported are row and column conditional, it is advisable to avoid inter-set point distance interpretation, however tempting this may be when using correspondence analysis.
The input matrix is first normalized by dividing each row entry by the square root of the product of the corresponding row and column totals, their geometric mean. This removes differences in the marginal totals and expresses each cell as a proportion. It is a different pre-transformation from any available in MDPREF, so that the results, while similar in appearance, are not the same.
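The normalization just described can be sketched in a few lines. The function name and the 2 x 2 table are hypothetical and serve only to show the arithmetic; this is not CORRESP's own code.

```python
import math


def normalize(table):
    """Divide each cell by the square root of (row total * column total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # A zero row or column total would cause a division by zero here,
    # so empty rows and columns should be removed beforehand.
    return [[cell / math.sqrt(row_totals[i] * col_totals[j])
             for j, cell in enumerate(row)]
            for i, row in enumerate(table)]


table = [[20, 10], [10, 20]]   # hypothetical frequencies
a = normalize(table)
for row in a:
    print([round(value, 4) for value in row])
```

In this example every row and column total is 30, so each cell is simply divided by 30.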
The second step finds the basic structure of the resultant matrix A
by singular value decomposition, producing summary row and column vectors (U
and V) and a diagonal matrix of singular values d corresponding
to the columns of A, so that
A = U d V'
where V' denotes the transpose of V.
It is important to check that the eigenvalues remaining after ignoring the first one are in fact large enough to justify continuing with the analysis. Where appropriate, reference can be made to the chi-squared contributions of each dimension of "inertia" and to the overall chi-squared value for the analysis.
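A minimal sketch of this step, assuming NumPy is available and using a hypothetical 3 x 3 contingency table, is given below. It decomposes the normalized matrix A, sets aside the first (trivial) singular value, and converts the remaining squared singular values into chi-squared contributions. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical contingency table, used only for illustration.
table = np.array([[30.0, 10.0, 5.0],
                  [10.0, 25.0, 10.0],
                  [5.0, 10.0, 30.0]])
n = table.sum()
r = table.sum(axis=1)                      # row totals
c = table.sum(axis=0)                      # column totals
a = table / np.sqrt(np.outer(r, c))        # normalized matrix A

# Basic structure: A = U d V'
u, d, vt = np.linalg.svd(a, full_matrices=False)

# d[0] is the trivial singular value (1.0 up to rounding); the rest carry the inertia.
inertia = d[1:] ** 2
chi_sq_by_dim = n * inertia                # chi-squared contribution of each dimension
total_chi_sq = chi_sq_by_dim.sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)

# Cross-check against the usual Pearson chi-squared statistic.
expected = np.outer(r, c) / n
pearson = ((table - expected) ** 2 / expected).sum()

print("singular values:", np.round(d, 4))
print("chi-squared by dimension:", np.round(chi_sq_by_dim, 3))
print("total chi-squared:", round(float(total_chi_sq), 3), "on", dof, "d.f.")
```

The agreement between `total_chi_sq` and `pearson` follows from the fact that the sum of the squared singular values of A, less one, equals the Pearson statistic divided by the table total.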
If A is a correlation matrix, the V matrix is related to the principal components of A. The method implemented here is equivalent to HOMALS in SPSS, which uses an alternating least squares algorithm that is more suitable for large numbers of cases.
INPUT
CORRESP accepts as input data a set of frequencies forming a rectangular matrix. This can be a simple two-way contingency table of categorical data, or more generally, an indicator matrix of rows representing subjects and columns representing presence and absence of a series of binary attributes for each subject. An indicator matrix can be condensed by adding together identical rows; the condensed matrix will produce the same scores for equivalent data.
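The condensing of identical rows can be illustrated as follows. This is a sketch assuming NumPy, with hypothetical function names and data: rows with the same response pattern are added together, and the singular values of the normalized matrix are compared before and after.

```python
from collections import Counter

import numpy as np


def condense(rows):
    """Add together identical rows: each distinct pattern is multiplied by its count."""
    counts = Counter(tuple(row) for row in rows)
    return [[count * value for value in pattern] for pattern, count in counts.items()]


def singular_values(matrix):
    """Singular values of the matrix after division by sqrt(row total * column total)."""
    m = np.asarray(matrix, dtype=float)
    a = m / np.sqrt(np.outer(m.sum(axis=1), m.sum(axis=0)))
    return np.linalg.svd(a, compute_uv=False)


# Hypothetical indicator matrix: six subjects, two binary attributes coded as 2 + 2 columns.
subjects = [[1, 0, 1, 0],
            [1, 0, 1, 0],
            [0, 1, 1, 0],
            [0, 1, 0, 1],
            [0, 1, 0, 1],
            [1, 0, 0, 1]]
condensed = condense(subjects)

print(condensed)
print(np.round(singular_values(subjects), 6))
print(np.round(singular_values(condensed), 6))
```

The two sets of singular values coincide because a pattern occurring k times contributes the same amount to the column cross-products whether it is entered as k separate rows or as one row multiplied by k.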
When
using correspondence analysis descriptively for data other than strict
frequencies, there are five restrictions to be observed:
1. Inferential tests such as Chi-square are not valid for non-frequencies (nor
when expected frequencies are too small).
2. The data must be in the form of 'similarities', i.e. if they are ranks, they
should be ordered from highest to lowest preference (compare DATA TYPE(4) for MDPREF). If the data are distances, they should be
reflected by subtraction from a number larger than the largest distance.
3. When analysing symmetric square matrices, it is essential that the leading
diagonal contain large positive values.
4. All values in the matrix must be positive, or the results will be
invalid.
5. In the analysis of sparse matrices, consider the possibility that the data
may contain disjoint sets, which should be separated prior to analysis.
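Some of these restrictions can be checked mechanically before a run. The helpers below are a hypothetical sketch, not part of CORRESP: one reflects distances into similarities, one lists negative cells, and one looks for disjoint sets by following non-zero cells between rows and columns. Zero cells are treated as admissible here, which is an assumption made because sparse matrices are discussed in restriction 5.

```python
def reflect_distances(matrix, margin=1.0):
    """Subtract every distance from a constant larger than the largest distance."""
    ceiling = max(max(row) for row in matrix) + margin
    return [[ceiling - value for value in row] for row in matrix]


def negative_cells(matrix):
    """Return the (row, column) positions of any negative entries."""
    return [(i, j) for i, row in enumerate(matrix)
            for j, value in enumerate(row) if value < 0]


def disjoint_sets(matrix):
    """Group rows and columns that are linked, directly or indirectly, by non-zero cells."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    seen_rows, seen_cols, blocks = set(), set(), []
    for start in range(n_rows):
        if start in seen_rows:
            continue
        rows, cols, stack = {start}, set(), [("r", start)]
        seen_rows.add(start)
        while stack:
            kind, index = stack.pop()
            if kind == "r":
                for j in range(n_cols):
                    if matrix[index][j] != 0 and j not in seen_cols:
                        seen_cols.add(j)
                        cols.add(j)
                        stack.append(("c", j))
            else:
                for i in range(n_rows):
                    if matrix[i][index] != 0 and i not in seen_rows:
                        seen_rows.add(i)
                        rows.add(i)
                        stack.append(("r", i))
        blocks.append((sorted(rows), sorted(cols)))
    return blocks


distances = [[0, 3], [3, 0]]                              # hypothetical distances
sparse = [[1, 2, 0, 0], [3, 1, 0, 0], [0, 0, 4, 1]]       # hypothetical sparse table

similarities = reflect_distances(distances)
blocks = disjoint_sets(sparse)
print(similarities)
print(blocks)
```

If `disjoint_sets` returns more than one block, each block of rows and columns could be analysed as a separate matrix.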
OUTPUT
CORRESP by default outputs the normalized input matrix,
and the number of its eigenvalues which are greater than zero -
the rank of the matrix. Further details are available using the PRINT
and PLOT options described below.
Correspondence analysis customarily concentrates on the proportion of "inertia"
accounted for by each of the singular values. If the input
matrix is a contingency table, the chi-squared contributions accounted for by the
number of output dimensions requested for the analysis should be consulted before
looking at the further results. In addition, the relative contributions of the row
and column points to the dimensions of "inertia" are listed. These are in fact simply
the squared values of the corresponding row and column vectors of the basic
structure, ignoring the first, or "trivial", factor.
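The contributions described here, together with one common way of rescaling the vectors into scores, can be sketched as follows. The example assumes NumPy and a hypothetical table. The rescaling shown (division by the square root of the row or column mass, giving so-called standard coordinates) is an assumption, since the exact scaling used by CORRESP is not stated in this section.

```python
import numpy as np

# Hypothetical contingency table, used only for illustration.
table = np.array([[30.0, 10.0, 5.0],
                  [10.0, 25.0, 10.0],
                  [5.0, 10.0, 30.0]])
n = table.sum()
row_mass = table.sum(axis=1) / n
col_mass = table.sum(axis=0) / n
# Equivalent to dividing each cell by sqrt(row total * column total).
a = (table / n) / np.sqrt(np.outer(row_mass, col_mass))

u, d, vt = np.linalg.svd(a, full_matrices=False)
u, d, v = u[:, 1:], d[1:], vt[1:, :].T     # ignore the first, trivial, factor

# Relative contributions of points to each dimension: squared vector elements.
row_contrib = u ** 2
col_contrib = v ** 2

# Assumed rescaling: divide by the square root of the masses (standard coordinates).
row_scores = u / np.sqrt(row_mass)[:, None]
col_scores = v / np.sqrt(col_mass)[:, None]

print("row contributions:\n", np.round(row_contrib, 3))
print("row scores:\n", np.round(row_scores, 3))
print("column scores:\n", np.round(col_scores, 3))
```

Under this scaling the contributions on each dimension sum to one, and the scores on each dimension have a weighted mean of zero and a weighted variance of one, with the masses as weights.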
The
row and column vectors are finally rescaled to obtain the canonical variates or
"optimal" scores, which are normally used to summarise the analysis.
Dimensionality
The program
reports the number of non-zero eigenvalues or latent roots of the (column or
row) cross-products of the normalized data matrix. The number of positive roots
is an indication of the rank of the matrix. The magnitude of the roots gives an
indication of the amount of variation in the data accounted for by the
corresponding dimension. The largest root will always be first and the others
are arranged in decreasing order. An appropriate dimensionality may be chosen
by means of the PLOT ROOTS option. The first singular value reported
should always be 1.0, and this serves as a check on the accuracy of the
procedure.
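The quantities reported under this heading can be reproduced approximately as below. This is a sketch assuming NumPy; the 3 x 4 table and the tolerance used to decide which roots count as non-zero are both hypothetical.

```python
import numpy as np

# Hypothetical 3 x 4 table of frequencies.
table = np.array([[12.0, 4.0, 2.0, 1.0],
                  [3.0, 10.0, 5.0, 2.0],
                  [1.0, 4.0, 11.0, 6.0]])
a = table / np.sqrt(np.outer(table.sum(axis=1), table.sum(axis=0)))

# Latent roots of the column cross-products A'A, arranged largest first.
# The row cross-products AA' have the same non-zero roots.
roots = np.linalg.eigvalsh(a.T @ a)[::-1]

tolerance = 1e-10                          # assumed cut-off for "greater than zero"
rank = int((roots > tolerance).sum())

# The first root is the trivial one; the rest share out the variation.
nontrivial = roots[1:rank]
share = nontrivial / nontrivial.sum()
cumulative = np.cumsum(share)

print("roots:", np.round(roots, 4))
print("rank:", rank)
print("cumulative share of non-trivial roots:", np.round(cumulative, 3))
```

The roots are the squares of the singular values, so a first root of 1.0 corresponds to the check on the first singular value mentioned above. The cumulative shares are the figures a scree diagram would display.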
INPUT COMMANDS
Keyword                    Function
N OF COLUMNS [number]      Number of columns in the input matrix
N OF ROWS [number]         Number of rows in the input matrix
DIMENSIONS n               Number of dimensions to list and plot in detail.
LABELS [followed by a      Optionally used to identify categories.
series of labels           There should be as many labels as
(<= 65 chars), each on     categories, first for the columns,
a separate line]           and then for the rows.
READ MATRIX                Start reading input data
COMPUTE                    Start computation
FINISH                     Final statement in the run
NOTE
N OF COLUMNS, N OF ROWS and DIMENSIONS are obligatory.
PRINT options (to main output file)
Option            Form     Description
FIRST             r x c    The input matrix, rows by columns
CROSS-PRODUCTS             Cross-products of the rows and columns of the
                           normalized input matrix.
CORRELATIONS               The correlation matrices of rows and columns of the
                           normalized input matrix.
ROOTS                      The eigenvalues of the cross-products of the
                           normalized input matrix.
BASIC                      The basic structure of the cross-products of the
                           normalized input matrix in full.
FINAL                      The remaining output described above, in the
                           chosen dimensionality.
CHISQUARE                  The total chi-squared value, with degrees of
                           freedom, and the contributions of the factors of
                           "inertia".
By default, only the FINAL output and CHISQUARE values are printed.
PLOT options (to main output file)
Option            Description
COLUMNS           The n(n-1)/2 plots of the canonical ("optimal") column
                  scores in the chosen dimensionality.
ROWS              The n(n-1)/2 plots of the canonical ("optimal") row
                  scores in the chosen dimensionality.
JOINT             Both of the above
ROOTS             A scree diagram of the latent roots.
By default, only the first two dimensions of the joint space are plotted.
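The figure of n(n-1)/2 plots arises because every pair of dimensions is plotted once. A small sketch, with a hypothetical function name, lists the pairs for a given dimensionality:

```python
from itertools import combinations


def plot_pairs(n_dimensions):
    """All pairs of dimensions (numbered from 1) that would each need a plot."""
    return list(combinations(range(1, n_dimensions + 1), 2))


pairs = plot_pairs(4)
print(len(pairs), pairs)
```

For four dimensions this gives six pairs, which is 4 x 3 / 2.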
PUNCH options
No secondary output file is produced by CORRESP.
PROGRAM LIMITS
Maximum no. of rows = 1000
Maximum no. of columns = 100
Maximum dimensions = 8