takes as input a rectangular matrix in which the rows represent cases (or subjects) and the columns, variables (or stimuli) and computes one of 26 measures of (dis)similarity between each pair of variables in the input matrix. For comparing cases only, it is also possible to calculate Gower's general similarity coefficient. The resulting output matrices may be input to the other NewMDSX routines as required.
The number of rows in the matrix is specified by the user in the N OF CASES statement or, alternatively, N OF SUBJECTS. The number of columns is given by either N OF VARIABLES or N OF STIMULI. The data are read by the program when it encounters a READ MATRIX command. Data may be input in free format, separated by spaces. If an INPUT FORMAT statement is used, it should be specified to read one row of the data matrix, as real numbers.
The basic form of input to WOMBATS is a rectangular matrix in which the rows represent cases (or subjects) and the columns, variables (or stimuli). If the data to be input are for some reason in a matrix where the rows represent variables and the columns cases, then the user should specify MATFORM(1) in the PARAMETERS statement. The chosen measures are calculated between the entities designated as variables, whatever value is taken by the parameter MATFORM. If the user wishes measures to be calculated between cases rather than between variables, this is accomplished by specifying ANALYSIS(1) in the PARAMETERS statement.
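The combined effect of MATFORM and ANALYSIS can be pictured with a short Python sketch. It is illustrative only and is not WOMBATS code: the function name `entities_to_compare` and its arguments are hypothetical, and any value other than 1 is simply treated here as the default setting.

```python
def entities_to_compare(matrix, matform=0, analysis=0):
    """Return the vectors between which a measure would be computed.

    matrix   : list of rows as read from the input
    matform  : 1 if the rows are variables and the columns cases (assumed flag)
    analysis : 1 to compare cases instead of variables (assumed flag)
    """
    # Bring the data into the basic layout: rows = cases, columns = variables.
    if matform == 1:
        cases = [list(col) for col in zip(*matrix)]
    else:
        cases = [list(row) for row in matrix]

    if analysis == 1:
        return cases                              # one vector per case
    return [list(col) for col in zip(*cases)]     # one vector per variable


data = [[1, 2, 3],
        [4, 5, 6]]                                # 2 cases x 3 variables
print(entities_to_compare(data))                  # three variable vectors
print(entities_to_compare(data, analysis=1))      # two case vectors
```

Whatever the orientation of the input, the measure is then computed between every pair of the returned vectors.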
Levels of Measurement
The user must specify, for each of the variables in the analysis, its assumed level of measurement. These range from RATIO, the highest, to DICHOTOMOUS, the lowest, or binary (presence/absence), level. This is done using the LEVELS command, which is peculiar to WOMBATS, followed by one or more of the keywords RATIO, INTERVAL, ORDINAL, NOMINAL or DICHOTOMOUS. The program defaults to ORDINAL, so there is no need actually to specify this keyword. In parentheses following each keyword are listed the variables which are to be assumed to be at that level of measurement. In these parentheses ALL and TO are also recognised. If an attempt is made to compute a measure which assumes a level of measurement higher than that at which the variables have been declared to lie, the program will report an error. The following are valid examples of LEVELS declarations:
LEVELS INTERVAL (1, 2, 5, 7) NOMINAL (3, 4, 6, 8)
In the last example, variables 5 and 6 are presumed by default to be at the ordinal level.
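The level check described in this section can be sketched in Python as follows. This is a minimal illustration, not the program's own logic: `LEVEL_ORDER` and `check_levels` are hypothetical names, and the ordering of the five levels is taken from the text.

```python
# Levels of measurement, lowest to highest, as listed in the text.
LEVEL_ORDER = ["DICHOTOMOUS", "NOMINAL", "ORDINAL", "INTERVAL", "RATIO"]


def check_levels(required, declared, variables):
    """Raise ValueError if any variable is declared below the required level.

    required  : level assumed by the chosen measure
    declared  : dict mapping variable number -> declared level
    variables : variable numbers taking part in the analysis
    """
    need = LEVEL_ORDER.index(required)
    for v in variables:
        # Undeclared variables fall back to ORDINAL, the stated default.
        level = declared.get(v, "ORDINAL")
        if LEVEL_ORDER.index(level) < need:
            raise ValueError(f"variable {v} is {level}, measure needs {required}")


declared = {1: "INTERVAL", 2: "INTERVAL", 3: "NOMINAL"}
check_levels("ORDINAL", declared, [1, 2, 5])   # passes: 5 defaults to ORDINAL
```

Requesting an interval-level measure for variable 3 in this example would raise the error, since NOMINAL lies below INTERVAL.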
Where variables are of mixed levels, Gower's general similarity coefficient may be used, for comparing cases only.
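As an indication of how a mixed-level coefficient of this kind works, the following Python sketch follows the usual textbook form of Gower's coefficient: an average of per-variable scores, with joint absences on dichotomous variables left out. It is not taken from WOMBATS. The function name is hypothetical, and treating ORDINAL variables like quantitative ones is an assumption made here for brevity.

```python
def gower_similarity(x, y, levels, ranges):
    """Gower's general similarity between two cases x and y.

    levels : one level keyword per variable
    ranges : range of each quantitative variable over all cases
             (ignored for NOMINAL and DICHOTOMOUS variables)
    """
    total, weight = 0.0, 0
    for xi, yi, level, rng in zip(x, y, levels, ranges):
        if level == "DICHOTOMOUS":
            if xi == 0 and yi == 0:
                continue              # joint absence carries no information
            score = 1.0 if xi == yi else 0.0
        elif level == "NOMINAL":
            score = 1.0 if xi == yi else 0.0
        else:
            # Quantitative score; ORDINAL is treated the same way (assumption).
            score = 1.0 - abs(xi - yi) / rng if rng else 1.0
        total += score
        weight += 1
    if weight == 0:
        return None                   # no comparable variables
    return total / weight


levels = ["DICHOTOMOUS", "DICHOTOMOUS", "NOMINAL", "RATIO"]
ranges = [None, None, None, 20.0]
print(gower_similarity([1, 0, 3, 10.0], [1, 0, 2, 15.0], levels, ranges))
```

In the example the second variable is skipped (both cases score 0), so the coefficient is the mean of the three remaining scores 1, 0 and 0.75.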
Missing Data
For each variable in which there are missing data, the user may specify one code which the program will read as a missing value. However, certain measures cannot be calculated between variables that contain missing data, and an attempt to do so will fail.
The MISSING statement specifies the numerical values to be understood as missing, followed in each case in parentheses by a list of the variables for which that value represents a missing datum. In these parentheses the forms ALL and TO are recognised. The following are valid examples of MISSING declarations:
MISSING -9.(1, 2, 7, 9), 99.(3, 4, 6, 8)
MISSING 0. (ALL)
MISSING .1(1 TO 7), -.1(8 TO 16)
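The way such declarations map codes onto variables can be sketched in Python. The helper names below are hypothetical, and the second function assumes that a case holding a missing code on either variable is simply left out of that pair's calculation, which the text does not confirm.

```python
def missing_codes(declarations, n_variables):
    """Expand MISSING-style declarations into one code per variable.

    declarations : list of (code, variables) pairs, where variables is
                   "ALL" or an iterable of 1-based variable numbers
    """
    codes = {}
    for code, variables in declarations:
        if variables == "ALL":
            variables = range(1, n_variables + 1)
        for v in variables:
            codes[v] = code
    return codes


def complete_pairs(x, y, code_x=None, code_y=None):
    """Keep only the cases where neither variable holds its missing code."""
    return [(a, b) for a, b in zip(x, y) if a != code_x and b != code_y]


# Counterpart of: MISSING .1(1 TO 7), -.1(8 TO 16)
codes = missing_codes([(0.1, range(1, 8)), (-0.1, range(8, 17))], 16)
print(codes[7], codes[8])
print(complete_pairs([1, -9, 3], [4, 5, 99], code_x=-9, code_y=99))
```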
OUTPUT OPTIONS
Measures are listed by default as a lower triangular matrix, without the diagonal. The keyword OUTPUT in the PARAMETERS statement offers alternative matrix formats, as follows:
OUTPUT(1) - lower triangle without diagonal
OUTPUT(2) - lower triangle with diagonal
OUTPUT(3) - full symmetric matrix with diagonal
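The three layouts differ only in how much of each row is listed, as the following Python sketch shows. The function name and the three-decimal number format are assumptions made for the illustration and do not reproduce the program's actual listing.

```python
def format_matrix(values, output=1):
    """Lay out a symmetric matrix in one of the three OUTPUT formats.

    values : full symmetric matrix as a list of lists
    output : 1 = lower triangle without diagonal,
             2 = lower triangle with diagonal, 3 = full matrix
    """
    lines = []
    for i, row in enumerate(values):
        if output == 1:
            cells = row[:i]
        elif output == 2:
            cells = row[:i + 1]
        else:
            cells = row
        if cells:                      # the first row is empty under OUTPUT(1)
            lines.append(" ".join(f"{v:.3f}" for v in cells))
    return lines


m = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.4],
     [0.2, 0.4, 1.0]]
for line in format_matrix(m, output=1):
    print(line)
```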
MEASURES
The aim of WOMBATS is to calculate for each pair of variables a measure of the (dis)similarity between them. This is specified by the MEASURE keyword, followed by the keyword referring to one of 26 available measures. Only one measure is computed in each TASK of the run. If more than one measure is required for the same set of data, a separate TASK NAME is necessary.
Sixteen measures of agreement between two DICHOTOMOUS variables (a 2 x 2 table with cell values a, b, c, d) are included in WOMBATS. Missing data are allowed in all of these measures, which are identified as follows:
MEASURE D1
Type Similarity measure
Range low = 0, high = 1
Name Jaccard's coefficient
Description Represents the probability of a pair of objects exhibiting both of a pair of attributes when only those objects exhibiting one or other are considered. It is possible that a division by zero may occur in the calculation of this measure.
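To make the cell notation concrete, the Python sketch below counts the four cells for two 0/1 variables and evaluates Jaccard's coefficient in its usual form a / (a + b + c). The cell convention (a = both present, d = both absent) is the customary one and is assumed here. The function names are hypothetical, and returning None for a zero divisor is a choice made for the illustration.

```python
def two_by_two(x, y):
    """Count the cells of the 2 x 2 table for two 0/1 variables.

    a = both present, b = first only, c = second only, d = both absent
    """
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def jaccard(a, b, c, d):
    """D1 in its usual form a / (a + b + c); None when the divisor is zero."""
    divisor = a + b + c
    return a / divisor if divisor else None


x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
a, b, c, d = two_by_two(x, y)
print(a, b, c, d, jaccard(a, b, c, d))   # 2 1 1 2 0.5
```

The zero divisor arises when no object exhibits either attribute, that is, when every case falls in cell d.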
MEASURE D2
Type Similarity measure
Range low = 0, high = 1
Name Russell and Rao's measure
Description Represents the probability of a pair of objects in a pre-selected set exhibiting both of a pair of attributes.
MEASURE D3
Type Similarity measure
Range low = 0, high = 1
Name Sokal's measure
Description Represents the probability of a matching of two attributes.
MEASURE D4
Type Similarity measure
Range low = 0, high = 1
Name Dice's measure
Description Gives the positive matches 'a' twice as much importance as anything else. Excludes the negative matches 'd' entirely. It is thus possible that a division by zero may occur in the calculation of this measure.
MEASURE D5
Type Similarity measure
Range low = 0, high = 1
Name no name
Description Includes 'd' in both numerator and denominator. The matches (a and d) are given twice as much weight as the mismatches.
MEASURE D6
Type Similarity measure
Range low = 0, high = 1
Name no name
Description Excludes 'd' entirely. The mismatches (b and c) are accorded twice as much weight as the matches. It is possible that a division by zero may occur in the calculation of this measure.
MEASURE D7
Type Similarity measure
Range low = 0, high = 1
Name Rogers and Tanimoto's measure
Description Includes 'd' in numerator and denominator. The mismatches (b and c) are accorded twice as much weight as the matches.
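The verbal descriptions of D2 to D7 correspond to the following textbook formulas, sketched here in Python. These are the forms usually given in the literature for the named coefficients and for the weightings described. They are assumed, not confirmed, to be the expressions WOMBATS evaluates, and the function names are hypothetical.

```python
def ratio(numerator, divisor):
    """Return numerator / divisor, or None when the divisor is zero."""
    return numerator / divisor if divisor else None


def matching_measures(a, b, c, d):
    """Usual textbook forms of D2 to D7 for a 2 x 2 table."""
    n = a + b + c + d
    return {
        "D2": ratio(a, n),                              # Russell and Rao
        "D3": ratio(a + d, n),                          # Sokal (simple matching)
        "D4": ratio(2 * a, 2 * a + b + c),              # Dice
        "D5": ratio(2 * (a + d), 2 * (a + d) + b + c),  # matches weighted twice
        "D6": ratio(a, a + 2 * (b + c)),                # mismatches weighted twice
        "D7": ratio(a + d, a + d + 2 * (b + c)),        # Rogers and Tanimoto
    }


print(matching_measures(2, 1, 1, 2))
```

Only D4 and D6 leave 'd' out of the divisor, which is why they alone among these six can be undefined when the table still contains cases.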
MEASURE D8
Type Similarity measure
Range low = 0, high = a + b + c + d - 1
Name Kulczynski's measure
Description Excludes 'd' entirely. This measure is the simple ratio of the positive matches (a) to the mismatches (cf. D9). A division by zero may occur in the calculation of this measure, in which case the statistic is undefined. Otherwise the maximum value is as stated.
MEASURE D9
Type Similarity measure (Sokal & Sneath)
Range low = 0, high = a + b + c + d - 1
Name no name
Description This measure is the simple ratio of all matches (positive and negative) to the mismatches (cf. D8). The statistic may be undefined, due to a zero divisor. The maximum finite value is as stated.
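Because D8 and D9 are plain ratios and not proportions, their upper limit depends on the number of entries. A short Python sketch of the usual forms a / (b + c) and (a + d) / (b + c) shows both the maximum and the undefined case. The function names are hypothetical and None stands in for an undefined statistic.

```python
def kulczynski_ratio(a, b, c, d):
    """D8 in its usual form a / (b + c); None when there are no mismatches."""
    return a / (b + c) if (b + c) else None


def sokal_sneath_ratio(a, b, c, d):
    """D9 in its usual form (a + d) / (b + c); None when there are no mismatches."""
    return (a + d) / (b + c) if (b + c) else None


# With N = 10 entries and a single mismatch, both ratios can attain N - 1 = 9.
print(sokal_sneath_ratio(6, 1, 0, 3))   # 9.0
print(kulczynski_ratio(9, 1, 0, 0))     # 9.0
print(kulczynski_ratio(5, 0, 0, 5))     # None: zero divisor
```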
MEASURE D10
Type Similarity measure
Range low = 0, high = 1
Name Kulczynski's measure
Description Excludes 'd' entirely. This measure is a weighted average of the matches to one or other of the mismatches. This statistic may be undefined.
MEASURE D11
Type Similarity measure
Range low = 0, high = 1
Name no name
Description Includes 'd' in numerator and denominator. This is the analogue of D10 with mismatches included.
MEASURE D12
Type Similarity measure
Range low = 0, high = 1
Name Ochiai's measure
Description Excludes 'd' from numerator. It uses the geometric mean of the marginals as a denominator. This statistic may have a zero divisor.
MEASURE D13
Type Similarity measure
Range low = 0, high = 1
Name no name
Description Includes 'd' in numerator and denominator. It uses the geometric mean of the marginals as a denominator and will return a value of 0 iff either a or d is empty.
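D10 to D13 are all built from the marginal totals of the table. The Python sketch below uses the forms commonly given in the literature for these four coefficients. It is assumed, not confirmed, that WOMBATS uses the same expressions, and the function name is hypothetical.

```python
from math import sqrt


def marginal_measures(a, b, c, d):
    """Usual textbook forms of D10 to D13; None marks a zero divisor."""
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    out = {"D10": None, "D11": None, "D12": None, "D13": None}
    if r1 and c1:
        out["D10"] = (a / r1 + a / c1) / 2          # Kulczynski
        out["D12"] = a / sqrt(r1 * c1)              # Ochiai
    if r1 and r2 and c1 and c2:
        out["D11"] = (a / r1 + a / c1 + d / r2 + d / c2) / 4
        out["D13"] = (a * d) / sqrt(r1 * r2 * c1 * c2)
    return out


print(marginal_measures(2, 1, 1, 2))
print(marginal_measures(0, 3, 2, 4)["D13"])   # 0.0 because cell a is empty
```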
MEASURE D14
Type Similarity measure
Range low = -1, high = +1
Name Hamann's coefficient
Description Simply the difference between the matches and the mismatches as a proportion of the total number of entries. A value of 0 indicates an equal number of matches to mismatches. Some thought should be given to the interpretation of any negative coefficients before scaling the results.
MEASURE D15
Type Similarity measure
Range low = -1, high = +1
Name Yule's Q
Description This is the original measure of dichotomous agreement, designed to be analogous to the product-moment correlation. A value of 0 indicates statistical independence. Some thought should be given to the interpretation of any negative coefficients before scaling the results. This statistic may be undefined.
MEASURE D16
Type Similarity measure
Range low = -1, high = +1
Name Pearson's Phi
Description A value of 0 indicates statistical independence. Some thought should be given to the interpretation of any negative coefficients before scaling the results. The statistic may be undefined if any one cell is empty.
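The three signed coefficients D14 to D16 can be sketched in Python from their standard definitions. The function names are hypothetical, and None is used here to represent an undefined statistic.

```python
from math import sqrt


def hamann(a, b, c, d):
    """D14: (matches - mismatches) / total number of entries."""
    n = a + b + c + d
    return ((a + d) - (b + c)) / n if n else None


def yule_q(a, b, c, d):
    """D15: (ad - bc) / (ad + bc); None when the divisor is zero."""
    divisor = a * d + b * c
    return (a * d - b * c) / divisor if divisor else None


def pearson_phi(a, b, c, d):
    """D16: (ad - bc) / sqrt of the product of the four marginal totals."""
    divisor = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / divisor if divisor else None


print(hamann(2, 1, 1, 2), yule_q(2, 1, 1, 2), pearson_phi(2, 1, 1, 2))
print(yule_q(4, 2, 2, 1))   # 0.0: ad equals bc, i.e. statistical independence
```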
Five measures are available for the measurement of agreement between NOMINAL variables. Four of these are based on the chi-square statistic. The other is the Index of Dissimilarity.
MEASURE CHISQUARE
Type Similarity measure
Range low = 0, high = N x (min(r,c) - 1)
Name Chi-square
Comment A value of 0 indicates statistical independence. The maximum value is dependent on the value of N.
MEASURE PHI
Type similarity measure
Range low = 0, high = (min(r,c)-1)
Name Phi
Comment The phi coefficient is chi-square normed to be independent of N. Reaches a maximum for 2 x 2 tables in which case it reduces to the product-moment correlation. It may, however, exceed 1 when the minimum of r and c is greater than 2.
MEASURE CRAMER
Type similarity measure
Range low = 0, high = 1
Name Cramer's V
Comment Cramer's coefficient is chi-square normed to be independent of N and of the numbers of rows (r) and columns (c). It can reach its maximum of 1 even for non-square tables.
MEASURE PEARSON
Type similarity measure
Range low = 0, high = 1
Name Pearson's Contingency coefficient C
Comment Pearson's coefficient is chi-square normed to be independent of N, originally developed as a measure for contingency tables. Cannot reach its maximum of 1 for non-square tables.
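The norming of chi-square described for CRAMER and PEARSON can be illustrated with the standard definitions, sketched here in Python. The function names are hypothetical, and the sketch assumes that the table has no empty row or column, so that every expected frequency is positive.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square for an r x c table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (N * (min(r, c) - 1)))."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))


def contingency_c(table):
    """Pearson's contingency coefficient C: sqrt(chi2 / (chi2 + N))."""
    chi2, n = chi_square(table)
    return sqrt(chi2 / (chi2 + n))


perfect = [[10, 0], [0, 10]]
print(chi_square(perfect), cramers_v(perfect), contingency_c(perfect))
```

For this perfectly associated 2 x 2 table chi-square equals N, Cramer's V equals 1, and C is about 0.707.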
MEASURE ID
Type dissimilarity
Range low = 0, high = 100
Name Index of dissimilarity
Comment The index of dissimilarity is simply the proportion of cases in the off-diagonal cells and may be thought of as the proportion of changes needed to change the distribution of one nominal variable into the other. The index does not require equal numbers of values in the variables.
There are three measures for ORDINAL variables in WOMBATS:
MEASURE GAMMA
Type similarity measure
Range low = -1, high = +1
Name Goodman and Kruskal's gamma
Comment Measures the weak monotonic agreement between the variables, taking the ratio of the difference between concordant and discordant pairs to their sum. It thus ignores the ties completely. For this reason it is possible that the value be undefined (i.e. there may be no untied pairs of cases). For a 2 x 2 table the index reduces to Yule's Q (D15). Some thought should be given to the interpretation of the negative values before the results are scaled.
MEASURE TAUB
Type similarity measure
Range low = -1, high = +1
Name Kendall's tau-b (τb)
Comment Measures strong monotonic agreement in the variables, relating the difference between concordant and discordant pairs and the geometric mean of the quantities arrived at by adding in the ties to the denominator. This should be used only for square tables.
MEASURE SOMERS' D
Type similarity measure
Range low = -1, high = +1
Name Somers' D
Comment Measures strong monotonic agreement between the variables, being the difference between concordant and discordant pairs divided by their sum plus the number of ties.
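The three ordinal measures share the same counts of concordant, discordant and tied pairs and differ only in the divisor. The Python sketch below uses the standard definitions. The function names are hypothetical, the Somers' D shown is the asymmetric form with the second variable as dependent, and None marks an undefined value.

```python
from math import sqrt


def pair_counts(x, y):
    """Count concordant, discordant and tied pairs of cases."""
    conc = disc = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                  # tied on both: used by none of the three
            elif dx == 0:
                tied_x += 1               # tied on x only
            elif dy == 0:
                tied_y += 1               # tied on y only
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    return conc, disc, tied_x, tied_y


def gamma(x, y):
    """Goodman and Kruskal's gamma: ties are ignored altogether."""
    c, d, _, _ = pair_counts(x, y)
    return (c - d) / (c + d) if (c + d) else None


def tau_b(x, y):
    """Kendall's tau-b: geometric mean of the two tie-augmented sums as divisor."""
    c, d, tx, ty = pair_counts(x, y)
    divisor = sqrt((c + d + tx) * (c + d + ty))
    return (c - d) / divisor if divisor else None


def somers_d(x, y):
    """Somers' D with y as the dependent variable."""
    c, d, _, ty = pair_counts(x, y)
    return (c - d) / (c + d + ty) if (c + d + ty) else None


x = [1, 2, 3, 3]
y = [1, 2, 2, 3]
print(gamma(x, y), tau_b(x, y), somers_d(x, y))
```

In the example there are no discordant pairs, so gamma is 1, while the two tied pairs pull tau-b and Somers' D below 1.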
There are four measures for INTERVAL and RATIO variables:
MEASURE COVARIANCE
Type similarity
Range low = 0, high = highest variance
Comment The interpretation given to the negative values should be carefully thought out before scaling.
MEASURE CORRELATION
Type similarity
Range low = -1, high = 1
Comment The negative values may need to be given some thought before the results of this calculation are scaled.
MEASURE DISTANCE
Type dissimilarity
Range low = 0, high = maximum variance in the variables
Comment This calculates the Euclidean distance between variables. If the ranges of the variables involved are markedly different, then some attempt at rescaling (i.e. normalisation) should be made so that differences in a highly valued variable do not swamp differences in one of humbler dimensions.
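The effect of rescaling on the Euclidean distance can be seen in a small Python sketch. The data are invented for the illustration, the function names are hypothetical, and standardising to zero mean and unit standard deviation is only one possible normalisation. It assumes no variable is constant.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardise(x):
    """Rescale a vector to mean 0 and (population) standard deviation 1."""
    mean = sum(x) / len(x)
    sd = sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / sd for v in x]


income = [20000.0, 30000.0, 40000.0]   # large range
rating = [1.0, 2.0, 3.0]               # small range

# Raw distance is dominated by the variable with the larger values.
print(euclidean(income, rating))
# After rescaling the two variables have the same profile and nearly coincide.
print(euclidean(standardise(income), standardise(rating)))
```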
MEASURE MAHALANOBIS DISTANCE
Type dissimilarity
Range low = 0, high = maximum variance in the normalised variables
Comment This is advisable instead of the Euclidean distance for data liable to exhibit high multicollinearity. It effectively rescales the components so that those with high variability receive less weight than those with low variability. Where variables are uncorrelated, it is equivalent to the Euclidean distance for the normalised data.
For MIXED level variables, and for comparing up to 200 cases only, it is possible to calculate Gower's general similarity coefficient:
MEASURE GOWER
Type similarity
PROGRAM LIMITS
Maximum no. of subjects = 10000 (with the exception of Gower's general similarity coefficient, for up to 200 subjects only)
Maximum no. of stimuli = 200
Maximum no. of nominal values per stimulus = 30