Correspondence Analysis : CORRESP

CORRESP provides internal analysis of two-way or multi-way data of various kinds. Simple correspondence analysis is typically applied to represent the row and column categories of a two-way contingency table in a two-dimensional map. But the same procedure can be applied to any matrix which can plausibly be regarded as consisting of 'pseudo-frequencies'.

 

DATA: 2-way, 2-mode table of frequencies       
TRANSFORM: Linear                                
MODEL: Chi-square distance

It can also be applied descriptively to non-frequency data such as rankings, or data representing the intensity of responses to stimuli, or any of a variety of indices of proximity.

Correspondence Analysis is increasingly popular in analysing contingency tables and exploring areas such as the sequencing of artefacts found at different archaeological sites or levels of excavation ('seriation'), and of animals or plants and habitats ('gradient analysis') based upon relationships between frequencies of the objects.

Input of multi-way indicator matrices or Burt matrices (obtained by premultiplying an indicator matrix by its transpose) is one form of multiple correspondence analysis, as is Guttman scaling. Stacking a series of two-way tables is another.
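As an illustration of the relationship between an indicator matrix and a Burt matrix, the following Python/NumPy sketch builds both from a small invented data set. The variable names and the data are assumptions made for the example; none of this is CORRESP input syntax.

```python
import numpy as np

# Hypothetical data: 5 subjects coded on two categorical variables.
# Variable A has 2 categories (0, 1); variable B has 3 categories (0, 1, 2).
a = np.array([0, 1, 0, 1, 1])
b = np.array([2, 0, 1, 2, 0])

# Indicator matrix Z: one column per category, with a 1 where the
# subject falls in that category (5 subjects x 5 category columns).
Z = np.hstack([np.eye(2, dtype=int)[a], np.eye(3, dtype=int)[b]])

# Burt matrix: the transpose of Z multiplied by Z (categories x categories).
burt = Z.T @ Z

print(Z)
print(burt)
```

In this sketch the diagonal blocks of the Burt matrix hold the category counts of each variable, and the off-diagonal block is the two-way cross-tabulation of A by B.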

The correspondence analysis model (CA) represents the row and column categories of the input matrix as points in the same dimensionality. It is closely related to MDPREF which also represents row and column variables in the same space, but instead fits row variables as vectors to the configuration derived from the column variables. Both CA and MDPREF use a similar singular value decomposition. An important difference is that CA considers only interactive factors by explicitly neglecting the magnitude effect after decomposition. Because the canonical ("optimal") scores reported are row and column conditional, it is advisable to avoid inter-set point distance interpretation, however tempting this may be when using correspondence analysis.

The input matrix is first normalized by dividing each entry by the square root of the product of the corresponding row and column totals (their geometric mean). This removes differences in the marginal totals and expresses each cell as a proportion. It is a different pre-transformation from any available in MDPREF, so that the results, while similar in appearance, are not the same.
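The normalization step can be sketched in Python/NumPy as follows. The table values and the names N, r, c and A are invented for the illustration and are not taken from the program.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustrative values only).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])

r = N.sum(axis=1)  # row totals
c = N.sum(axis=0)  # column totals

# Divide each cell by the square root of (row total * column total).
A = N / np.sqrt(np.outer(r, c))

print(np.round(A, 4))
```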

The second step finds the basic structure of the resultant matrix A by singular value decomposition, producing summary row and column vectors (U and V) and a diagonal matrix of singular values d corresponding to the columns of A, so that A = U d V^T, where V^T is the transpose of V. The matrices U and V are in fact the eigenvectors of the matrices of row and column cross-products of A, and the d values are related to their (identical) eigenvalues (d = sqrt(D*(n-1)), where D is the diagonal of eigenvalues and n is the number of rows in A). The first singular value in d is always 1.0. It corresponds to the independence model of chi-squared expected values, and is ignored in subsequent analysis.
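A minimal sketch of the decomposition, using NumPy's general-purpose SVD routine rather than whatever routine CORRESP itself uses, and the same invented table as an assumption:

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustrative values only).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])
r = N.sum(axis=1)
c = N.sum(axis=0)
A = N / np.sqrt(np.outer(r, c))  # normalized input matrix

# Basic structure: A = U d V^T, singular values in decreasing order.
U, d, Vt = np.linalg.svd(A, full_matrices=False)

trivial = d[0]      # equals 1.0 up to rounding error
nontrivial = d[1:]  # the singular values actually interpreted

print(trivial, nontrivial)
```

In this sketch the first left singular vector is proportional to the square roots of the row totals, which is why the first factor carries only the marginal (independence) information.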

It is important to check that the eigenvalues remaining after ignoring the first one are in fact large enough to justify continuing with the analysis.  Where appropriate, reference can be made to the chi-squared contributions of each dimension of "inertia" and to the overall chi-squared value for the analysis.
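The link between the remaining singular values, "inertia" and chi-squared can be checked numerically. The sketch below assumes the standard identity that the total chi-squared equals the grand total multiplied by the sum of the squared non-trivial singular values; the table and all variable names are invented for the example.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustrative values only).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])
n = N.sum()
r = N.sum(axis=1)
c = N.sum(axis=0)
A = N / np.sqrt(np.outer(r, c))
d = np.linalg.svd(A, compute_uv=False)

inertia = d[1:] ** 2               # inertia per dimension, first factor ignored
proportion = inertia / inertia.sum()
chi2_by_dim = n * inertia          # chi-squared contribution of each dimension
chi2_total = chi2_by_dim.sum()

# Cross-check against the usual Pearson chi-squared statistic.
expected = np.outer(r, c) / n
pearson = ((N - expected) ** 2 / expected).sum()
dof = (N.shape[0] - 1) * (N.shape[1] - 1)

print(np.round(proportion, 3), round(chi2_total, 3), dof)
```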

If A is a correlation matrix, the V matrix is related to the principal components of A. The method implemented here is equivalent to HOMALS in SPSS, which uses an alternating least squares algorithm that is more suitable for large numbers of cases.

INPUT

CORRESP accepts as input data a set of frequencies forming a rectangular matrix. This can be a simple two-way contingency table of categorical data or, more generally, an indicator matrix in which rows represent subjects and columns represent the presence or absence of a series of binary attributes for each subject. Such a matrix can be condensed by adding together identical rows, and the condensed matrix will produce the same scores as the equivalent full data.
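The effect of condensing identical rows can be illustrated with a short Python/NumPy sketch. The presence/absence matrix and the helper function normalized_crossproducts are hypothetical and introduced only for this example. The sketch shows that the column cross-products of the normalized matrix, and hence the latent roots and column vectors, are unchanged by condensing.

```python
import numpy as np

def normalized_crossproducts(table):
    """Column cross-products of the table after dividing each cell by the
    square root of the product of its row and column totals."""
    table = np.asarray(table, dtype=float)
    r = table.sum(axis=1)
    c = table.sum(axis=0)
    A = table / np.sqrt(np.outer(r, c))
    return A.T @ A

# Hypothetical presence/absence data: 6 subjects x 3 attributes.
Z = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1],
              [1, 1, 0],
              [1, 0, 1]])

# Condense: add together identical rows (each pattern times its count).
patterns, counts = np.unique(Z, axis=0, return_counts=True)
condensed = patterns * counts[:, None]

full_cp = normalized_crossproducts(Z)
condensed_cp = normalized_crossproducts(condensed)

print(np.allclose(full_cp, condensed_cp))
```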

When using correspondence analysis descriptively for data other than strict frequencies, there are five restrictions to be observed:
1. Inferential tests such as Chi-square are not valid for non-frequencies (nor when expected frequencies are too small).
2. The data must be in the form of 'similarities', i.e. if they are ranks, they should be ordered from highest to lowest preference (compare DATA TYPE(4) for MDPREF). If the data are distances, they should be reflected by subtraction from a number larger than the largest distance.
3. When analysing symmetric square matrices, it is essential that the leading diagonal contain large positive values.
4. All values in the matrix must be non-negative, or the results will be invalid.
5. In the analysis of sparse matrices, consider the possibility that the data may contain disjoint sets, which should be separated prior to analysis.
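As a small illustration of restriction 2, distances can be reflected into 'similarities' by subtracting them from a constant larger than the largest distance. The distance matrix and the choice of constant (largest distance plus one) in this Python/NumPy sketch are assumptions made for the example; any larger constant would satisfy the restriction equally well.

```python
import numpy as np

# Hypothetical symmetric matrix of distances between three objects.
D = np.array([[0.0, 2.0, 5.0],
              [2.0, 0.0, 3.0],
              [5.0, 3.0, 0.0]])

# Reflect: subtract every distance from a number larger than the largest one.
ceiling = D.max() + 1.0
S = ceiling - D

print(S)
```

With a zero diagonal in the distance matrix, the reflected matrix also has the largest values on its leading diagonal, in line with restriction 3.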

OUTPUT

CORRESP by default outputs the normalized input matrix, and the number of its eigenvalues which are greater than zero - the rank of the matrix.  Further details are available using the PRINT and PLOT options described below.

Correspondence analysis customarily concentrates on the proportion of "inertia" accounted for by each of the singular values. If the input matrix is a contingency table, the chi-squared contributions accounted for by the number of output dimensions requested for the analysis should be consulted before looking at the further results. In addition, the relative contributions of the row and column points to the dimensions of "inertia" are listed. These are in fact simply the squared values of the corresponding row and column vectors of the basic structure, ignoring the first, or "trivial", factor.
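The relative contributions described above can be sketched in Python/NumPy as the squared elements of the singular vectors, leaving out the first factor. The table and the names row_contrib and col_contrib are invented for the illustration; the layout of the program's own listing may differ.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustrative values only).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])
r = N.sum(axis=1)
c = N.sum(axis=0)
A = N / np.sqrt(np.outer(r, c))
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Squared row and column vectors, dropping the first ("trivial") factor.
# Each column of these arrays refers to one retained dimension.
row_contrib = U[:, 1:] ** 2
col_contrib = Vt[1:, :].T ** 2

print(np.round(row_contrib, 3))
print(np.round(col_contrib, 3))
```

Because the singular vectors have unit length, the contributions within each dimension sum to 1 and can be read as shares of that dimension's inertia.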

The row and column vectors are finally rescaled to obtain the canonical variates or "optimal" scores, which are normally used to summarise the analysis.
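The exact rescaling used by CORRESP is not stated here. The Python/NumPy sketch below therefore assumes one common convention, in which each vector element is divided by the square root of the corresponding row or column mass (its share of the grand total), giving scores with weighted mean 0 and weighted variance 1. This is an illustrative assumption, not a description of the program's internals.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustrative values only).
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])
n = N.sum()
row_mass = N.sum(axis=1) / n
col_mass = N.sum(axis=0) / n
A = N / np.sqrt(np.outer(N.sum(axis=1), N.sum(axis=0)))
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Assumed rescaling: divide each vector element by the square root of
# its mass, after dropping the first ("trivial") factor.
row_scores = U[:, 1:] / np.sqrt(row_mass)[:, None]
col_scores = Vt[1:, :].T / np.sqrt(col_mass)[:, None]

print(np.round(row_scores, 3))
print(np.round(col_scores, 3))
```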

Dimensionality

The program reports the number of non-zero eigenvalues or latent roots of the (column or row) cross-products of the normalized data matrix. The number of positive roots is an indication of the rank of the matrix. The magnitude of the roots gives an indication of the amount of variation in the data accounted for by the corresponding dimension. The largest root will always be first and the others are arranged in decreasing order. An appropriate dimensionality may be chosen by means of the PLOT ROOTS option. The first singular value reported should always be 1.0, and this serves as a check on the accuracy of the procedure.

INPUT COMMANDS

Keyword                     Function
N OF COLUMNS  [number]      Number of columns in the input matrix
N OF ROWS     [number]      Number of rows in the input matrix
DIMENSIONS    n             Number of dimensions to list and plot in detail
LABELS                      Optionally used to identify categories.
                            Followed by a series of labels (<= 65 chars),
                            each on a separate line. There should be as
                            many labels as categories, first for the
                            columns, and then for the rows.

READ MATRIX                 Start reading input data
COMPUTE                     Start computation
FINISH                      Final statement in the run

NOTE
N OF COLUMNS, N OF ROWS and DIMENSIONS are obligatory.

PRINT options (to main output file)
Option            Form     Description
FIRST             r x c    The input matrix, rows by columns.

CROSS-PRODUCTS             Cross-products of the rows and columns of the
                           normalized input matrix.

CORRELATIONS               The correlation matrices of rows and columns
                           of the normalized input matrix.

ROOTS                      The eigenvalues of the cross-products of the
                           normalized input matrix.

BASIC                      The basic structure of the cross-products of
                           the normalized input matrix in full.

FINAL                      The remaining output described above, in the
                           chosen dimensionality.

CHISQUARE                  The total chi-squared value, with degrees of
                           freedom, and the contributions of the factors
                           of "inertia".

By default, the FINAL output and CHISQUARE values only are printed.

PLOT options (to main output file)
Option       Description
COLUMNS      The n(n-1)/2 plots of the canonical ("optimal") column
             scores in the chosen dimensionality.
ROWS         The n(n-1)/2 plots of the canonical ("optimal") row
             scores in the chosen dimensionality.
JOINT        Both of the above.
ROOTS        A scree diagram of the latent roots.

By default, only the first two dimensions of the joint space are plotted.

PUNCH options
No secondary output file is produced by CORRESP.

PROGRAM LIMITS
Maximum no. of rows = 1000
Maximum no. of columns = 100
Maximum dimensions = 8

See also

  • The NewMDSX commands in full