Work Out Measures Before Scaling (WOMBATS):

takes as input a rectangular matrix in which the rows represent cases (or subjects) and the columns, variables (or stimuli) and computes one of 26 measures of (dis)similarity between each pair of variables in the input matrix. For comparing cases only, it is also possible to calculate Gower's general similarity coefficient. The resulting output matrices may be input to the other NewMDSX routines as required.

The number of rows in the matrix is specified by the user in the N OF CASES statement or, alternatively, N OF SUBJECTS. The number of columns is given by either N OF VARIABLES or N OF STIMULI. The data are read by the program when it encounters a READ MATRIX command. Data may be input in free format, separated by spaces. If an INPUT FORMAT statement is used, it should be specified to read one row of the data matrix, as real numbers.

The basic form of input to WOMBATS is a rectangular matrix in which the rows represent cases (or subjects) and the columns, variables (or stimuli). If the data to be input are for some reason in a matrix where the rows represent variables and the columns cases, then the user should specify MATFORM(1) in the PARAMETERS statement. The chosen measures are calculated between the entities designated as variables, whatever value is taken by the parameter MATFORM. If the user wishes measures to be calculated between cases rather than between variables, this is accomplished by specifying ANALYSIS(1) in the PARAMETERS statement.

Levels of Measurement

The user must specify, for each of the variables in the analysis, its assumed level of measurement. These range from RATIO, the highest, to DICHOTOMOUS, the lowest, or binary (presence/absence), level. This is done using the LEVELS command, which is peculiar to WOMBATS, followed by one or more of the keywords RATIO, INTERVAL, ORDINAL, NOMINAL or DICHOTOMOUS. In parentheses following each keyword are listed the variables which are to be assumed to be at that level of measurement. In these parentheses ALL and TO are also recognized. The program defaults to ordinal, so there is no need actually to specify this keyword. If an attempt is made to compute a measure which assumes a level of measurement higher than that at which the variables have been declared to lie, the program will report an error. The following are valid examples of LEVELS declarations:

LEVELS INTERVAL (1, 2, 5, 7) NOMINAL (3, 4, 6, 8)
LEVELS RATIO (ALL)
LEVELS NOMINAL (1 TO 4), INTERVAL (7 TO 11)

In the last example, variables 5 and 6 are presumed by default to be at the ordinal level.
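The way such a declaration resolves to one level per variable can be sketched in Python. This is only an illustration of the rules described above (explicit lists, ALL, TO, and the ordinal default); the function names `expand` and `resolve_levels` are hypothetical and are not part of NewMDSX.

```python
def expand(spec, n_vars):
    """Expand a list body such as 'ALL', '1 TO 4' or '1, 2, 5' into variable numbers."""
    spec = spec.strip().upper()
    if spec == "ALL":
        return list(range(1, n_vars + 1))
    if " TO " in spec:
        first, last = spec.split(" TO ")
        return list(range(int(first), int(last) + 1))
    return [int(tok) for tok in spec.replace(",", " ").split()]


def resolve_levels(declarations, n_vars):
    """Return {variable number: level}; undeclared variables default to ORDINAL."""
    levels = {v: "ORDINAL" for v in range(1, n_vars + 1)}
    for keyword, spec in declarations:
        for v in expand(spec, n_vars):
            levels[v] = keyword
    return levels


# Mirrors: LEVELS NOMINAL (1 TO 4), INTERVAL (7 TO 11)
levels = resolve_levels([("NOMINAL", "1 TO 4"), ("INTERVAL", "7 TO 11")], n_vars=11)
print(levels[5], levels[6])  # the two undeclared variables
```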

Where variables are of mixed levels, Gower's general similarity coefficient may be used, for comparing cases only.

Missing Data

For each variable in which there are missing data, the user may specify one code which the program will read as a missing value. However, an attempt to calculate certain measures will fail if the variables involved contain missing data.

The MISSING statement specifies the numerical values to be understood as missing, followed in each case in parentheses by a list of the variables for which that value represents a missing datum. In these parentheses the forms ALL and TO are recognised. The following are valid examples of MISSING declarations:

MISSING -9.(1, 2, 7, 9), 99.(3, 4, 6, 8)
MISSING 0. (ALL)
MISSING .1(1 TO 7), -.1(8 TO 16)
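One common way of handling such codes when comparing two variables is to drop every case in which either variable holds its missing code. The Python sketch below illustrates that idea only; whether WOMBATS itself uses this pairwise exclusion is an assumption, and `complete_pairs` is a hypothetical helper name.

```python
def complete_pairs(x, y, missing_x=None, missing_y=None):
    """Keep only the cases where neither variable holds its missing-value code."""
    return [(xi, yi) for xi, yi in zip(x, y)
            if xi != missing_x and yi != missing_y]


# Variable x uses -9. as its missing code, variable y uses 99.
x = [1.0, -9.0, 3.0, 4.0]
y = [2.0, 5.0, 99.0, 8.0]
pairs = complete_pairs(x, y, missing_x=-9.0, missing_y=99.0)
print(pairs)
```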

OUTPUT OPTIONS

Measures are listed by default as a lower triangular matrix, without the diagonal. The keyword OUTPUT in the PARAMETERS statement offers alternative matrix formats, as follows:

OUTPUT(1) - lower triangle without diagonal
OUTPUT(2) - lower triangle with diagonal
OUTPUT(3) - full symmetric matrix with diagonal 
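The three layouts can be pictured with a short Python sketch that selects the cells each option would list from a symmetric matrix. It illustrates the layouts only; `format_matrix` is a hypothetical name and the code makes no claim about the program's actual print format.

```python
def format_matrix(m, output=1):
    """Return the rows listed under OUTPUT(1), OUTPUT(2) or OUTPUT(3)."""
    n = len(m)
    rows = []
    for i in range(n):
        if output == 1:
            cols = range(i)        # lower triangle without diagonal
        elif output == 2:
            cols = range(i + 1)    # lower triangle with diagonal
        else:
            cols = range(n)        # full symmetric matrix
        rows.append([m[i][j] for j in cols])
    return rows


m = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
for row in format_matrix(m, output=2):
    print(row)
```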

MEASURES

The aim of WOMBATS is to calculate for each pair of variables a measure of the (dis)similarity between them. This is specified by the MEASURE keyword, followed by the keyword referring to one of 26 available measures. Only one measure is computed in each TASK of the run. If more than one measure is required for the same set of data, a separate TASK NAME is necessary.

Sixteen measures of agreement between two DICHOTOMOUS variables (a 2 x 2 table with cell values a, b, c, d) are included in WOMBATS. Missing data are allowed in all these measures, which are identified as follows:

MEASURE D1

Type Similarity measure
Range low = 0, high = 1
Name Jaccard's coefficient

Description Represents the probability of a pair of objects exhibiting both of a pair of attributes when only those objects exhibiting one or other are considered. It is possible that a division by zero may occur in the calculation of this measure.
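As an illustration, the 2 x 2 table and Jaccard's coefficient can be sketched in Python. The cell labelling used here (a = both present, b = first only, c = second only, d = both absent) and the formula a / (a + b + c) are the usual textbook conventions and are assumed rather than taken from the program; the function names are hypothetical.

```python
def two_by_two(x, y):
    """Count the cells of the 2 x 2 table for two 0/1 variables.

    a = both present, b = x only, c = y only, d = both absent (assumed labelling).
    """
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            a += 1
        elif xi:
            b += 1
        elif yi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard(a, b, c, d):
    """a / (a + b + c); None when the divisor is zero (statistic undefined)."""
    denom = a + b + c
    return a / denom if denom else None


x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
cells = two_by_two(x, y)
print(cells, jaccard(*cells))
```

Returning None rather than raising an error is one way of flagging the division by zero mentioned above.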

MEASURE D2

Type Similarity measure
Range low = 0, high = 1
Name Russell and Rao's measure

Description Represents the probability of a pair of objects in a pre-selected set exhibiting both of a pair of attributes.

MEASURE D3

Type Similarity measure
Range low = 0, high = 1
Name Sokal's measure

Description Represents the probability of a matching of two attributes.

MEASURE D4

Type Similarity measure
Range low = 0, high = 1
Name Dice's measure

Description Gives the positive matches 'a' twice as much importance as anything else. Excludes the negative matches 'd' entirely. It is thus possible that a division by zero may occur in the calculation of this measure.

MEASURE D5

Type Similarity measure
Range low = 0, high = 1
Name no name

Description Includes 'd' in both numerator and denominator. The matches (a and d) are given twice as much weight as the mismatches.

MEASURE D6

Type Similarity measure
Range low = 0, high = 1
Name no name

Description Excludes 'd' entirely. The mismatches (b and c) are accorded twice as much weight as the matches. It is possible that a division by zero may occur in the calculation of this measure.

MEASURE D7

Type Similarity measure
Range low = 0, high = 1
Name Rogers and Tanimoto's measure

Description Includes 'd' in numerator and denominator. The mismatches (b and c) are accorded twice as much weight as the matches.
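Several of the measures so far differ only in whether 'd' is counted and in the relative weight given to matches and mismatches. The Python sketch below expresses that family with one hypothetical function, `weighted_match`; the formulas are the usual textbook forms and are assumed here, since the exact expressions used by the program are not given in this text.

```python
def weighted_match(a, b, c, d, match_w=1, mismatch_w=1, include_d=True):
    """Weighted matches divided by weighted matches plus weighted mismatches.

    Returns None when the divisor is zero (statistic undefined).
    """
    matches = a + d if include_d else a
    denom = match_w * matches + mismatch_w * (b + c)
    return match_w * matches / denom if denom else None


a, b, c, d = 2, 1, 1, 1
d5 = weighted_match(a, b, c, d, match_w=2)                        # 2(a+d) / (2(a+d) + b + c)
d6 = weighted_match(a, b, c, d, mismatch_w=2, include_d=False)    # a / (a + 2(b+c))
d7 = weighted_match(a, b, c, d, mismatch_w=2)                     # (a+d) / (a + d + 2(b+c))
print(d5, d6, d7)
```

Under the same assumptions, the default arguments give the simple matching form (a + d) / (a + b + c + d), and `include_d=False` with unit weights gives a / (a + b + c).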

MEASURE D8

Type Similarity measure
Range low = 0, high = a + b + c + d - 1
Name Kulczynski's measure

Description Excludes 'd' entirely. This measure is the simple ratio of the positive matches (a) to the mismatches (cf. D9). It is possible that a division by zero may occur in the calculation of this measure, in which case the statistic is undefined. The maximum value otherwise is as stated.

MEASURE D9

Type Similarity measure (Sokal & Sneath)
Range low = 0, high = a + b + c + d - 1
Name no name

Description This measure is the simple ratio of all matches (positive and negative) to the mismatches (cf. D8). The statistic may be undefined, due to a zero divisor. The maximum finite value is as stated.
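The two ratio measures and their stated maximum can be checked with a small Python sketch. The forms a / (b + c) and (a + d) / (b + c) follow the descriptions above; `ratio_to_mismatches` is a hypothetical name, and returning None for a zero divisor is an illustrative choice.

```python
def ratio_to_mismatches(numerator, b, c):
    """Ratio of a count of matches to the mismatches b + c; None if b + c is zero."""
    mismatches = b + c
    return numerator / mismatches if mismatches else None


# With a single mismatch and no negative matches, D8 reaches N - 1.
a, b, c, d = 4, 1, 0, 0
n = a + b + c + d
d8 = ratio_to_mismatches(a, b, c)

# With a single mismatch, D9 reaches N - 1 whatever the split between a and d.
a2, b2, c2, d2 = 2, 1, 0, 2
n2 = a2 + b2 + c2 + d2
d9 = ratio_to_mismatches(a2 + d2, b2, c2)
print(d8, n - 1, d9, n2 - 1)
```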

MEASURE D10

Type Similarity measure
Range low = 0, high = 1
Name Kulczynski's measure

Description Excludes 'd' entirely. This measure is a weighted average of the matches to one or other of the mismatches. This statistic may be undefined.

MEASURE D11

Type Similarity measure
Range low = 0, high = 1
Name no name

Description Includes 'd' in numerator and denominator. This is the analogue of D10 with mismatches included.

MEASURE D12

Type Similarity measure
Range low = 0, high = 1
Name Ochiai's measure

Description Excludes 'd' from numerator. It uses the geometric mean of the marginals as a denominator. This statistic may have a zero divisor.

MEASURE D13

Type Similarity measure
Range low = 0, high = 1
Name no name

Description Includes 'd' in numerator and denominator. It uses the geometric mean of the marginals as a denominator and will return a value of 0 if and only if either a or d is zero.

MEASURE D14

Type Similarity measure
Range low = -1, high = +1
Name Hamann's coefficient

Description Simply the difference between the matches and the mismatches as a proportion of the total number of entries. A value of 0 indicates an equal number of matches to mismatches. Some thought should be given to the interpretation of any negative coefficients before scaling the results.

MEASURE D15

Type Similarity measure
Range low = -1, high = +1
Name Yule's Q

Description This is the original measure of dichotomous agreement, designed to be analogous to the product-moment correlation. A value of 0 indicates statistical independence. Some thought should be given to the interpretation of any negative coefficients before scaling the results. This statistic may be undefined.

MEASURE D16

Type Similarity measure
Range low = -1, high = +1
Name Pearson's Phi

Description A value of 0 indicates statistical independence. Some thought should be given to the interpretation of any negative coefficients before scaling the results. The statistic may be undefined if any one cell is empty.
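The three signed measures can be sketched in Python using their usual textbook formulas, which are assumed here rather than taken from the program: Hamann as ((a + d) - (b + c)) / N, Yule's Q as (ad - bc) / (ad + bc), and Phi as (ad - bc) divided by the square root of the product of the four marginal totals. Each function returns None when its divisor is zero.

```python
import math


def hamann(a, b, c, d):
    n = a + b + c + d
    return ((a + d) - (b + c)) / n if n else None


def yule_q(a, b, c, d):
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else None


def pearson_phi(a, b, c, d):
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else None


a, b, c, d = 2, 1, 1, 1
print(hamann(a, b, c, d), yule_q(a, b, c, d), pearson_phi(a, b, c, d))
```

A table with more mismatches than matches, such as a = 1, b = 3, c = 3, d = 1, gives negative values on all three measures, which is the case that calls for care before scaling.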



Five measures are available for the measurement of agreement between NOMINAL variables. Four of these are based on the chi-square statistic. The other is the Index of Dissimilarity.

MEASURE CHISQUARE

Type Similarity measure
Range low = 0, high = N x (min(r,c)-1)
Name Chi-square

Comment A value of 0 indicates statistical independence. The maximum value is dependent on the value of N.

MEASURE PHI

Type similarity measure
Range low = 0, high = (min(r,c)-1)
Name Phi

Comment The phi coefficient is chi-square normed to be independent of N. It reaches a maximum of 1 for 2 x 2 tables, in which case it reduces to the product-moment correlation. It may, however, exceed 1 when the minimum of r and c is greater than 2.

MEASURE CRAMER

Type similarity measure
Range low = 0, high = 1
Name Cramer's V

Comment Cramer's coefficient is chi-square normed to be independent of N and of the numbers of rows (r) and columns (c). It can reach its maximum of 1 even for non-square tables.

MEASURE PEARSON

Type similarity measure
Range low = 0, high = 1
Name Pearson's Contingency coefficient C

Comment Pearson's coefficient is chi-square normed to be independent of N, originally developed as a measure for contingency tables. Cannot reach its maximum of 1 for non-square tables.
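The relationship between the four chi-square based measures can be sketched in Python from an r x c table of counts. The normings used are the standard textbook ones and are assumed here: phi as the square root of chi-square / N, Cramer's V as the square root of chi-square / (N x (min(r,c) - 1)), and Pearson's C as the square root of chi-square / (chi-square + N). Whether the program reports phi or its square is not stated in this text. The sketch also assumes that no row or column total is zero.

```python
import math


def chi_square(table):
    """Pearson chi-square for a table of counts (list of rows); returns (chi2, N)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def nominal_measures(table):
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return {
        "CHISQUARE": chi2,
        "PHI": math.sqrt(chi2 / n),
        "CRAMER": math.sqrt(chi2 / (n * (k - 1))),
        "PEARSON": math.sqrt(chi2 / (chi2 + n)),
    }


# A perfectly associated 2 x 2 table with N = 20.
result = nominal_measures([[10, 0], [0, 10]])
print(result)
```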

MEASURE ID

Type dissimilarity
Range low = 0, high = 100
Name Index of dissimilarity

Comment The index of dissimilarity is simply the proportion of cases in the off-diagonal cells and may be thought of as the proportion of changes needed to change the distribution of one nominal variable into the other. The index does not require equal numbers of values in the variables.



There are three measures for ORDINAL variables in WOMBATS:

MEASURE GAMMA

Type similarity measure
Range low = -1, high = +1
Name Goodman and Kruskal's gamma

Comment Measures the weak monotonic agreement between the variables, taking the ratio of the difference between concordant and discordant pairs to their sum. It thus ignores the ties completely. For this reason it is possible that the value be undefined (i.e. there may be no untied pairs). For a 2 x 2 table the index reduces to Yule's Q (D15). Some thought should be given to the interpretation of the negative values before the results are scaled.

MEASURE TAUB

Type similarity measure
Range low = -1, high = +1
Name Kendall's tau-b (τb)

Comment Measures strong monotonic agreement in the variables, relating the difference between concordant and discordant pairs to the geometric mean of the quantities arrived at by adding in the ties to the denominator. This should be used only for square tables.

MEASURE SOMERS' D

Type similarity measure
Range low = -1, high = +1
Name Somers' D 

Comment  Measures strong monotonic agreement between the variables, being the difference between concordant and discordant pairs divided by their sum plus the number of ties.   
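All three ordinal measures are built from counts of concordant, discordant and tied pairs of cases. The Python sketch below uses the usual textbook definitions, which are assumed here: gamma as (C - D) / (C + D), tau-b as (C - D) divided by the geometric mean of (C + D + Tx) and (C + D + Ty), and Somers' D as (C - D) / (C + D + Ty) with y treated as the dependent variable. Tx and Ty are the pairs tied on x only and on y only. Which ties the program counts for Somers' D is not stated in this text, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pair_counts(x, y):
    """Return (concordant, discordant, tied on x only, tied on y only)."""
    conc = disc = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: not counted
        elif dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return conc, disc, tied_x, tied_y


def gamma(x, y):
    conc, disc, _, _ = pair_counts(x, y)
    return (conc - disc) / (conc + disc) if conc + disc else None


def tau_b(x, y):
    conc, disc, tx, ty = pair_counts(x, y)
    denom = math.sqrt((conc + disc + tx) * (conc + disc + ty))
    return (conc - disc) / denom if denom else None


def somers_d(x, y):
    conc, disc, _, ty = pair_counts(x, y)
    denom = conc + disc + ty
    return (conc - disc) / denom if denom else None


x = [1, 2, 3, 4]
y = [1, 3, 2, 2]
print(pair_counts(x, y), gamma(x, y), tau_b(x, y), somers_d(x, y))
```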


The INTERVAL/RATIO level measures currently available in WOMBATS are covariance, the product-moment correlation, Euclidean distance and Mahalanobis' D:

MEASURE COVARIANCE

Type similarity
Range low = -(highest variance), high = highest variance
Comment The interpretation given to the negative values should be carefully thought out before scaling.

MEASURE CORRELATION

Type similarity
Range low = -1, high = 1
Comment The negative values may need to be given some thought before the results of this calculation are scaled.

MEASURE DISTANCE

Type dissimilarity
Range low = 0, high = maximum variance in the variables
Comment This calculates the Euclidean distance between variables. If the ranges of the variables involved are markedly different, then some attempt at rescaling (i.e. normalisation) should be made so that differences in a highly valued variable do not swamp out differences in one of humbler dimensions.

MEASURE MAHALANOBIS DISTANCE

Type dissimilarity
Range low = 0, high = maximum variance in the normalised variables
Comment  This is advisable instead of the Euclidean distance for data liable to exhibit high multicollinearity. It effectively rescales the components so that those with high variability receive less weight than those with low variability. Where variables are uncorrelated, it is equivalent to the Euclidean distance for the normalised data. 
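The first three interval-level measures can be sketched in plain Python for a pair of variables observed over the same cases. The covariance divisor (N - 1 here) is an assumption, since the text does not say which the program uses; the correlation does not depend on that choice. The Mahalanobis distance is left out of the sketch because it needs the inverse of a covariance matrix.

```python
import math


def covariance(x, y):
    """Sample covariance with divisor N - 1 (assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Product-moment correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def euclidean_distance(x, y):
    """Square root of the summed squared differences over cases."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(covariance(x, y), correlation(x, y), euclidean_distance(x, y))
```

Here y is exactly twice x, so the two variables are perfectly correlated and yet lie a considerable Euclidean distance apart, which illustrates why rescaling is advised when ranges differ.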

For MIXED level variables, and for comparing up to 200 cases only, it is possible to calculate Gower's general similarity coefficient:

MEASURE GOWER

Type similarity
Range low = 0, high = 1.0
Comment  This is applicable as a general measure of similarity between cases, where the variables are of mixed levels. This measure also requires the statement GOWER WEIGHTS, followed on the next and subsequent lines of input by a series of (positive) weight values to be applied to the variables in calculating similarity values. The value 1.0 for all variables treats all variables as of equal weight. 
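The role of the weights can be illustrated with a simplified Python sketch of a Gower-type coefficient between two cases. It assumes a per-variable score of 1 - |difference| / range for quantitative variables and 1 or 0 for a match or mismatch on other variables, combined as a weighted average. The program's treatment of dichotomous and ordinal variables and of missing values is not described in this text and is not reproduced here; `gower` and its arguments are hypothetical names.

```python
def gower(case_i, case_j, kinds, ranges, weights):
    """Weighted average of per-variable similarity scores between two cases.

    kinds[k] is 'quantitative' or 'nominal'; ranges[k] is the (non-zero) range
    of variable k over all cases and is used only for quantitative variables.
    """
    numerator = denominator = 0.0
    for xi, xj, kind, rng, w in zip(case_i, case_j, kinds, ranges, weights):
        if kind == "quantitative":
            score = 1.0 - abs(xi - xj) / rng
        else:
            score = 1.0 if xi == xj else 0.0
        numerator += w * score
        denominator += w
    return numerator / denominator


kinds = ["quantitative", "nominal", "quantitative"]
ranges = [4.0, None, 20.0]
case_1 = [1.0, "red", 10.0]
case_2 = [3.0, "red", 20.0]

equal = gower(case_1, case_2, kinds, ranges, weights=[1.0, 1.0, 1.0])
nominal_heavy = gower(case_1, case_2, kinds, ranges, weights=[1.0, 2.0, 1.0])
print(equal, nominal_heavy)
```

Doubling the weight of the nominal variable, on which the two cases agree, raises the similarity from 2/3 to 0.75.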

PROGRAM LIMITS

Maximum no. of subjects = 10000 (with the exception of Gower's general similarity coefficient, for up to 200 subjects only)
Maximum no. of stimuli = 200
Maximum no. of nominal values per stimulus = 30

See also

  • The NewMDSX commands in full