Skip to content Skip to navigation

PalaeoMath: Part 8 - Minding your Rs and Qs II

Article from: Newsletter No. 62
Written by:
PDF: No article PDF

8. Minding your Rs and Qs II

Written by Norm MacLeod - The Natural History Museum, London, UK (email: n.macleod@nhm.ac.uk). This article first appeared in the Nº 62 edition of Palaeontology Newsletter.

Part 2 - Correspondence Analysis

Last time we took a look at how we might want to approach the quantitative analysis of measured ‘objects’ (e.g., specimens, localities, samples in a section or core) as opposed to variables, and drew a useful distinction between R-mode and Q-mode analyses. No doubt you noticed over the past several essays that, despite our geometric approach to the analysis of any data matrix, we’re really interested in both quantities. You may have even asked yourself, “isn’t there anything that does both?”. There is, and we’ll spend this essay discussing one of the most popular ways to do so. Along the way we’ll compare and contrast this new method to some of our old friends and (hopefully) gain a bit deeper insight into what multivariate analysis is all about.

Correspondence analysis (CA) grew from the ground prepared by Pearson’s (1901) early work on what came to be known as principal components analysis (PCA) and Spearman’s (1904) work on the original factor analysis (FA) model. Both these approaches were concerned with the decomposition of similarity matrices into components or factors that represented a more complex, underlying structure. Their success implied that it might be possible to do the same thing with any table of data. By ‘any table of data’ I don’t mean a data table of any size or shape. That situation can be handled by PCA and FA as you already know them. Rather, this statement refers more to different types of data.

Since CA derives much of its power from being able to handle different types of data, a brief digression is called for here. We touched on this last time in our discussion of Q-mode similarity coefficients. There, we said different similarity coefficients were needed because a variety of data can be collected from objects. Now it’s time to come to grips with this topic in a more systematic manner.

Quantitative data come in four basic types: nominal, ordinal, interval, and ratio. The difference between these is the manner in which they incorporate the concept of scale.

  • Nominal data are simply number names for different, mutually exclusive groups of objects (e.g., 1 = dog, 2 = cat, 3 = horse). These data contain no scale information and are most frequently represented as contingency tables.
  • Ordinal data represent observations that can be ranked with reference to some external scale (e.g., 0 = small, 1 = medium, 2 = large). These data embody information about the rank order of the categories, but nothing else.
  • Interval data have a true scale, in the sense that the magnitudes of the numbers express important information. The classic example of an interval-scale variable is temperature expressed as °F or °C. In both those scales the difference between 2.4° and 1.1° is the same as the difference between 3.7° and 2.4°, but the zero point of both scales is set arbitrarily.
  • Ratio data have equivalent steps in magnitude and true zero points. Lengths are typical examples of ratio-scale variables.

The regression and component-factor analysis methods we’ve discussed previously have, for the most part, assumed interval or ratio-scale data. Of course, our trilobite data matrix is composed entirely of ratio-scale variables. Correspondence analysis was originally developed to provide an eigenanalysis-based means for analyzing nominal and ordinal scale variables, though it can, just as easily, analyze interval or ratio-scale variables and/or data matrices that mix different variable types. Not only that, it scales these data such that relations between objects and variables can be graphed on the same ordination plot and used to refine the interpretation of those data. All in all it’s a neat trick, and there’s little wonder you are seeing more correspondence analyses appearing in the palaeontological literature.

We’ll begin our presentation with a classical CA of ordinal-scale data.

Table 1. Trilobite frequency data (X matrix)
GenusParalic ShaleShoal Lmstn.Upper Lmstn.Mid. Lmstn.Phant. Lmstn.Org. Siltstn.Black ShaleRow Total
Acaste8531045136
Balizoma6651023133
Calymene8771322140
Ceraurus1011101011447
Cheirurus1091141319268
Cybantyx9319810343
Cybeloides541769335
Dalmanites641757232
Deiphon9731245141
Ormathops95110810245
Phacopidina532634225
Phacops9731256143
Placoparia662857236
Pricyclopyge310389832
Ptychoparia109214913259
Rhenops611655327
Sphaerexochus722845230
Toxochasmops7541033133
Trimerus222322417
Zacanthoides44151014543
Column Total139914317711614950765

 

Here we are looking at a hypothetical distribution of trilobite genera among different environmental facies representing a peritidal-bathyal transect. This is a typical contingency table. The numbers are occurrence frequencies of the different genera among the environments. Our problem is to infer the character of faunal similarity relations among environments (R-mode analysis of the matrix columns) and the character of environmental-preference similarities among the genera (Q-mode analyses of the matrix rows), simultaneously. Since there are seven variables and twenty objects, a premium will also be placed on dimensionality reduction so we can summarize the greatest amount of information in the lowest number of composite variables. In the end we want a single plot, or set of plots, that will tell us everything we need to know about this system of variables and objects.

Our first problem is the fact that the sums of the matrix rows and the columns are numbers of characteristically different magnitudes. This is the typical situation. The methods we’ve discussed previously (PCA, FA, PCoord) finesse this issue because they focus on analysing only one type of similarity relation, either that among variables (PCA, R-mode FA) or among objects (Q-mode FA, PCoord). Correspondence analysis considers the matrix from both points of view. Accordingly, this discrepancy must be corrected. Otherwise, the scale of the resulting composite variables will not be comparable.

In order to render the scales comparable we divide each element of the matrix by the grand sum of the matrix, which is the sum of the sums of rows (or the sum of the sums of columns).

(8.1)

\[b_{ij} = \frac{x_{ij}}{\displaystyle\sum_{i=1}^{n}\displaystyle\sum_{j=1}^{p}x_{ij}}\]

Dividing a frequency observation by the total number of times the observation has occurred provides an estimate of its proportion of the total occurrence pattern. This proportion is also an estimate of the probability of finding the observation at that locality, horizon, genus, etc. Accordingly, the matrix resulting from this scaling operation is an expression of the joint probabilities genera will be found in specific environments and specific environments will contain genera.

In Table 2, the row totals (\(B_{i\bullet}\)) represent the marginal probabilities of each genus occurring in any environment. Similarly the column totals (\(B_{\bullet j} \)) represent the marginal probabilities that specific environments will contain any trilobites. Naturally, both groups of marginal probabilities sum to 1.000.

Now comes the first complex part. If the trilobite faunas of two environments are similar we would expect to find similar patterns of variation in the proportion of trilobite genera in each column. If the environments differ in terms of their trilobite fauna, the joint probabilities should be different. This holds for the rows too. Thus, we should be able to come up with an index to express the similarities between rows and columns.

Table 2. Trilobite frequency data (B matrix of joint and marginal probabilities)
GenusParalic ShaleShoal Lmstn.Upper Lmstn.Mid. Lmstn.Phant. Lmstn.Org. Siltstn.Black Shale\(B_{i\bullet}\)
Acaste0.0100.0070.0040.0130.0050.0070.0010.047
Balizoma0.0080.0080.0070.0130.0030.0040.0010.043
Calymene0.0100.0090.0090.0170.0030.0030.0010.052
Ceraurus0.0130.0010.0010.0130.0130.0140.0050.061
Cheirurus0.0130.0120.0010.0180.0170.0250.0030.089
Cybantyx0.0120.0040.0010.0120.0100.0130.0040.056
Cybeloides0.0070.0050.0010.0090.0080.0120.0040.046
Dalmanites0.0080.0050.0010.0090.0070.0090.0030.042
Deiphon0.0120.0090.0040.0160.0050.0070.0010.054
Ormathops0.0120.0070.0010.0130.0100.0130.0030.059
Phacopidina0.0070.0040.0030.0080.0040.0050.0030.033
Phacops0.0120.0090.0040.0160.0070.0080.0010.056
Placoparia0.0080.0080.0030.0100.0070.0090.0030.047
Pricyclopyge0.0040.0010.0000.0040.0100.0120.0100.042
Ptychoparia0.0130.0120.0030.0180.0120.0170.0030.077
Rhenops0.0080.0010.0010.0080.0070.0070.0040.035
Sphaerexochus0.0090.0030.0030.0100.0050.0070.0030.039
Toxochasmops0.0090.0070.0050.0130.0040.0040.0010.043
Trimerus0.0030.0030.0030.0040.0030.0030.0050.022
Zacanthoides0.0050.0050.0010.0070.0130.0180.0070.056
\(B_{\bullet j} \)0.1820.1190.0560.2310.1520.1950.0651.000

 

Fortunately, there is such an index. The derivation is a tad complicated and I won’t go into it in detail. Good—though somewhat mathematical—discussions of this index can be found in the references listed at the end of this column. Effectively what the mathematics does is scale the transformed data (B) by the reciprocal of the square root of the marginal row and column probabilities (\(B_{r}^{-1/2}\) and \(B_{c}^{-1/2}\), respectively). In matrix notation the equation is as follows.

(8.2)

\[H = B_{r}^{-1/2} B B_{c}^{-1/2}\]

In order to make this calculation, the Br -1/2 and Bc -1/2 matrices need to be arranged so that the marginal probabilities are located in the matrix diagonals with all off-diagonal values set to 0.0 (see the PalaeoMath 101: R&Qs II worksheet for details of the calculations). This calculation estimates the conditional probabilities specific genera will be found in specific environments and vice versa. Another way of thinking about this calculation is that it’s scaling, or weighting, the column values by the reciprocal of the row sums (and so forming a weighted average) and scaling the row values by the reciprocal of the column sums. This, in turn, leads to the other name for CA, ‘reciprocal averaging’ or ‘reciprocal averaged PCA’. As a by-product, this weighted averaging also helps equalize the scales within the rows and between columns, but it does not enforce equal scaling (as does correlation-based PCA or R-mode FA).

The \(H\) matrix of conditional probabilities for our trilobite frequency analysis is listed below.

Table 3. Trilobite frequency data (H matrix of conditional probabilities)
GenusParalic ShaleShoal Lmstn.Upper Lmstn.Mid. Lmstn.Phant. Lmstn.Org. Siltstn.Black ShaleRow Total
Acaste0.1130.0870.0760.1250.0620.0680.0240.556
Balizoma0.0890.1090.1330.1310.0320.0430.0250.561
Calymene0.1070.1160.1690.1540.0290.0260.0220.624
Ceraurus0.1240.0150.0220.1100.1350.1310.0830.620
Cheirurus0.1030.1140.0180.1280.1460.1890.0340.733
Cybantyx0.1160.0480.0230.1030.1130.1250.0650.594
Cybeloides0.0720.0710.0260.0890.0940.1250.0720.548
Dalmanites0.0900.0740.0270.0930.0820.1010.0500.517
Deiphon0.1190.1150.0710.1410.0580.0640.0220.590
Ormathops0.1140.0780.0230.1120.1110.1220.0420.602
Phacopidina0.0850.0630.0610.0900.0560.0660.0570.477
Phacops0.1160.1120.0700.1380.0710.0750.0220.603
Placoparia0.0850.1050.0510.1000.0770.0960.0470.561
Pricyclopyge0.0450.0190.0000.0400.1310.1300.2000.565
Ptychoparia0.1100.1230.0400.1370.1090.1390.0370.694
Rhenops0.0980.0200.0290.0870.0890.0790.0820.484
Sphaerexochus0.1080.0380.0560.1100.0680.0750.0520.506
Toxochasmops0.1030.0910.1060.1310.0480.0430.0250.548
Trimerus0.0410.0510.0740.0550.0450.0400.1370.443
Zacanthoides0.0520.0640.0230.0570.1420.1750.1080.621
Column Total1.8911.5141.0982.1301.7001.9101.20311.446

 

Now most of the hard work is done. As before, genera (rows) and/or environments (columns) that are similar to one another should exhibit similar patterns of values. Drawing on analogy with previous multivariate methods, we now need to summarize this between-rows and betweencolumns similarity with a numerical index. We can do this in a couple of different ways. The method most similar to PCA, FA, and PCoord is to calculate the covariance between columns of the \(H\) matrix. Because of the probability calculations we have performed, this quantity (d) represents the product of two \(X^2\) values, and is sometimes referred to as a \(X^2\) ‘distance’. Once you have obtained the \(H\) matrix the easiest way to calculate the matrix of \(X^2\) distances (\(D\)) is as follows.

(8.3)

\[D = {H}' H\]

In this equation \({H}' \) is the transpose of the \(H\) matrix and the matrix multiplication is equivalent to calculating the sum of squares and cross products between all pairs of columns. Since we had previously designated the columns of \(X\) to contain our variables, \(D\) is analogous to an R-mode distance matrix. The \(D\) matrix for our trilobite frequency data set is listed below.

Table 4. R-Mode \(X^2\)  distance matrix (D )
 Paralic ShaleShoal Lmstn.Upper Lmstn.Mid. Lmstn.Phant. Lmstn.Org. Siltstn.Black Shale
Paralic Shale0.1910.1480.1070.2140.1590.1790.097
Shoal Lmstn.0.1480.1380.0980.1760.1200.1400.069
Upper Lmstn.0.1070.0980.0940.1310.0680.0750.048
Mid. Lmstn.0.2140.1760.1310.2450.1730.1960.104
Phant. Lmstn.0.1590.1200.0680.1730.1700.1930.115
Organic. Siltstn.0.1790.1400.0750.1960.1930.2220.126
Black Shale0.0970.0690.0480.1040.1150.1260.111

 

Note this matrix is symmetric about its diagonal or trace. An eigenanalysis of the \(D\) matrix yields the following eigenvalues, which express the character of between-facies biotic similarity relations

Table 5. Eigenvalues of \(D\)  matrix (\(W\)  matrix)
 Eigenvalue% VariationCum. % Variation
11.00085.41285.412
20.11710.02295.435
30.0363.09098.524
40.0141.15999.683
50.0030.27099.953
60.0000.02699.979
70.0000.021100.000

 

In CA the first eigenvalue is usually 1.000. Obviously the first three eigenvalues represent the overwhelming majority of the variation. Arguably the first two do, but you’ll see why I’m going to interpret three in a moment. Eigenvector 7 is going to be 0.0 (save for rounding error) because we’ve scaled the data and so removed a component of variation. Eigenvector 6 exists (0.0003) but is too small to show up in a report to three decimal places.

These eigenvalues correspond to the eigenvectors of the \(D\) matrix, which are what we will use to produce an ordination plot of the similarity relations between environments. To do so these values are first scaled by the square roots of the eigenvalues in a manner identical to the one we used in PCoord Analysis (see previous PalaeoMath 101 column: Minding your Rs and Qs). Then, in order to make it possible to plot both the R-mode and Q-mode loadings in the same coordinate space, these scaled values are scaled again by the square roots of the Q-mode conditional probabilities (= column sums) of the  \(B\) matrix, as follows.

(8.4)

\[A_{r-scaled} = B_{c}^{1/2}A_r\]

The results of these calculations are shown below.

Table 6. Eigenvectors and scaled, R-mode correspondence axis loadings (Ar) of the D matrix
EnvironmentsR-Mode Eigenvectors
 123
Paralic Shale0.4260.1660.102
Shoal Lmstn.0.3450.3100.127
Upper Lmstn.0.2370.491-0.479
Mid. Lmstn.0.4810.3280.062
Phantom Lmstn.0.389-0.3810.139
Organic Siltstn.0.441-0.4290.324
Black Shale0.256-0.447-0.784
 Singular Value-Scaled Eigenvectors
 123
Paralic Shale0.4260.0570.019
Shoal Lmstn.0.3450.1060.024
Upper Lmstn.0.2370.168-0.091
Mid. Lmstn.0.4810.1120.012
Phantom Lmstn.0.389-0.1300.026
Organic Siltstn.0.441-0.1470.062
Black Shale0.256-0.153-0.149
 Cond. Probability (Q)-Scaled Eigenvectors
 123
Paralic Shale0.1820.0240.008
Shoal Lmstn.0.1190.0370.008
Upper Lmstn.0.0560.040-0.022
Mid. Lmstn.0.2310.0540.006
Phantom Lmstn.0.152-0.0510.010
Organic Siltstn.0.195-0.0650.027
Black Shale0.065-0.039-0.038

 

The resulting plots of the similarity structure among environments for the first three correspondence axes (Fig. 1) present the data at the bottom of Table 6 graphically.

Figure 1. Scaled R-mode loadings on the first three trilobite frequency data correspondence axes.

These ordination plots are deceptively simple. As always, care needs to be taken with their interpretation. The clearest substructure among environments is the separation of shallower facies (Paralic Shale, Shoal Limestone, Middle Limestone, Upper Limestone) from the deeperwater facies (Phantom Limestone, Organic Siltstone, Black Shale). However, that separation occurs along the second correspondence axis, which represents only 10 percent of the overall variation (Table 5). The environment (facies) points are spread out along the first CA axis with no obvious clustering. What aspect of the original data matrix \(X\) is this pattern expressing?

When in doubt, return to the basis matrix (in this case the R-mode \(X^2\) ‘distance’ matrix, Table 4) and the original data (Table 1). Everything there is to see in the data should be visible there, especially if your eye has been clued in by seeing the ordination of R-mode loadings. The Black Shale and Upper Limestone facies plot low along CA-Axis 1. Inspection of Table 4 shows the ‘biotic distance’ between these facies is the lowest in the table. They should be close to one another along the predominant eigenvector. The pattern at one end of CA-Axis 1 should then be compared with that at the other end: the Organic Siltstone and Middle Limestone facies. Both are located at a relatively large distance from the Black Shale and Upper Limestone facies, and this is borne out by their distances from these facies in Table 4. This gives us confidence that the plot is a faithful picture of relations within the R-mode \(X^2\) ‘distance’ matrix and encourages us to look further, to the original data. Take a look at Table 1 now and see whether you can find the pattern responsible for the distribution of facies along CA-Axis 1.

Did you find it? It’s a bit subtle but it’s there. If you didn’t see it don’t give up (and don’t be lazy). Go back and look for it. No pain no gain and all that. Here’s a hint: Think about the bottom line of any matrix-based data analysis. Of course, I’m stalling a bit here so the answer is buried in the text. If you’re going to undertake multivariate analyses you need to develop a skill at looking for and finding the geometric patterns you see in the ordination plots within tables of numbers. But if you took that ‘bottom line’ bit to heart you now know that the distribution of points along CA-Axis 1 represents a measure of the relative frequency of trilobite occurrence in each of the facies. In retrospect, this is perfectly understandable. Those ‘bottom line’ numbers summarize a dominant pattern among facies created by considering the data across all trilobite genera.

Many CA practitioners will tell you the best practice is to interpret only the correspondence axes that have eigenvalues less than 1.000 and term the first axis a ‘trivial’ or ‘nuisance’ factor (see Manley, 1994). After all, we already extracted that information during our scaling operations (it’s the Bc matrix). This sort of CA-Axis 1 result doesn’t show up in every CA analysis, but it’s not uncommon. Some describe it as a trivial or ‘nuisance’ factor since it really doesn’t tell us much about the relation between genera and facies we didn’t already know without having dragged all the complex mathematical machinery of a full-blown correspondence analysis out of the closet. Still, ‘trivial’ and ‘nuisance’ are relative terms. It all depends on what you’re looking for.

We now have interpretations of correspondence axes 1 and 2. What about CA-Axis 3? As with CA-Axis 1, we see the Black Shale and Upper Limestone facies form a group on the low end of that axis while the other four facies cluster together at the high end. Is the interpretation the same? Not quite. The ordination of facies along CA-Axis 1 is not clustered to the same degree and represents a pattern strictly controlled by the trilobite relative occurrence frequency. Not so with CA-Axis 3. Inspection of Tables 1 and 4 doesn’t reveal an obvious pattern and, owing to the small amount of variance expressed on this axis, we really wouldn’t expect them to. But there must be a reason for it.

Here is where the power of CA reveals itself. What we need is an explanation of the ordination pattern of facies along CA-Axis 3 in terms of the pattern of trilobite occurrences. In effect, we need to relate patterns of between-columns variation to patterns of between-rows variation. That’s what CA does, and does more effectively than any of the methods we’ve discussed thus far. So how do we perform the Q-mode analysis in the context of CA? Best to go through the same set of calculations we did before, but this time focus on comparisons between rows rather than columns.

A couple of convenient mathematical theorems make this Q-mode analysis a snap. The first and most important of these is the Ekhart-Young Theorem which we’ve already met informally in the guise of Gower’s (1966) proof that a PCoord analysis of a Q-mode squared Euclidean distance matrix is an exact mirror, or ‘dual’, of the covariance-based R-Mode PCA for the same data (see previous PalaeoMath 101 column: Minding your Rs and Qs). Ekhart and Young theorized that any real matrix \(X\) is equivalent to the product of three matrices, \(V\), \(W\), and \(U\) such that …

(8.5)

\[X = VW{U}'\]

Matrix \(V\) is the set of Q-mode eigenvectors. Matrix \({U}'\) is the transpose of the R-mode eigenvectors. Matrix \(W\) is the diagonalized matrix of ‘singular values’ that are equivalent to the square roots of the eigenvalues. Note here that all those times I’ve been asking you to multiply or divide matrix values by the square root of the eigenvalues in the previous PCoord analysis and above, I’ve really been asking you to make use of the data matrix’s singular values. Goulub and Reinsch (1971) devised a method for finding these matrices directly and their method (with improvements) is now called singular value decomposition or SVD. The important result of the Ekhart and Young Theorem we need to make use of now is that, for real matrices such as the one we’ve transformed our trilobite frequency data into, the R-mode and Q-mode eigenvalues are the same (because the singular values of these matrices are the same). That means we don’t have to recalculate the Q-mode eigenanalysis. We’ve already determined the eigenvalues of both analyses (see Table 5).

Since we already have the \(W\) matrix we can easily use that to calculate the Q-mode CA loadings we need using the following equations.

(8.6)

\[A_q = HA_rW^{-1}\]

(8.7)

\[A_{q-scaled} = B_{r}^{1/2}A_q\]

This is equivalent to scaling each of the Q-mode eigenvector loadings by the corresponding singular value and the scaling by the R-mode conditional probabilities (= square roots of the row sums of the \(B\)-matrix). Results of this calculation for the first three correspondence axes (= eigenvectors) are shown below.

Table 7. Q-mode correspondence axis loadings (Aq-matrix)

GenusQ-Mode Correspondence Axis Loadings
 123
Acaste0.0470.0130.001
Balizoma0.0430.024-0.007
Calymene0.0520.035-0.012
Ceraurus0.061-0.0180.002
Cheirurus0.089-0.0140.023
Cybantyx0.056-0.0110.004
Cybeloides0.046-0.0100.001
Dalmanites0.042-0.0030.003
Deiphon0.0540.0180.003
Ormathops0.059-0.0050.010
Phacopidina0.0330.003-0.004
Phacops0.0560.0150.004
Placoparia0.0470.0030.002
Pricyclopyge0.042-0.035-0.018
Ptychoparia0.0770.0010.013
Rhenops0.035-0.007-0.004
Sphaerexochus0.0390.002-0.002
Toxochasmops0.0430.019-0.004
Trimerus0.022-0.003-0.016
Zacanthoides0.056-0.028-0.001

 

Plotting axes 2 and 3 in the same coordinate system as the R-mode loadings gives us the following diagram.

Figure 2. Q-mode loadings on the correspondence axes 2 and 3.

Comparison of figures 1 and 2 now reveals why the Black Shale and Upper Limestone facies are being pulled down along the third axis. Pricyclopyge exhibits a unique deep-water distribution and its relatively high abundance in the Black Shale facies is differentially affecting the placement of that facies. Similarly, the Upper limestone facies is being pulled down by the shallow-water occurrence pattern of Calymene and, to a lesser extent, Balizoma. Both these patterns are further reinforced by the unusual distribution of Trimerus. Note that the combined plot of the scaled R-mode and Q-mode loadings is also conveniently centred on the centroids of both datasets.

The ability of CA to handle simultaneous R-mode and Q-mode analyses in a manner that really improves the interpretation of data matrices and ordination plots is a big plus in the technique’s favour. So is its similarity to PCA and FA, both of which have proven their usefulness in many different data analysis contexts. But there is another advantage possessed by CA that needs brief discussion: flexibility.

Recall that PCA is used to conduct R-mode analyses of data matrices composed of real numbers with no missing values. R-mode factor analysis can be used in the same manner, but is built on a different mathematical model than PCA. Principal Coordinates–Q-mode factor analysis can handle a variety of different data types, but only if you select an appropriate similarity/ dissimilarity index. Correspondence analysis combines the best parts of all these methods and, as a bonus, can be used to analyse almost any type of data. In the example above we used it to analyse occurrence frequencies. Because of the scaling calculations inherent in CA, this would be possible even if a large number of cells in the X matrix were occupied by zeros (= no observations or missing data). But while CA was originally developed to handle the case of contingency-table data, it is not restricted to analyzing nominal or ordinal data. It can handle interval and ratio data just as easily. To demonstrate this, here’s a CA ordination of the original trilobite body and glabellar length data we’ve used throughout this series.

Figure 3. R-mode (hollow circles) and Q-mode (solid discs) correspondence axis scores for axes 1 and 2 (upper) and 2 and 3 (lower) of the trilobite body and glabellar size data

Comparing these plots to Figure 7 of the PCA column (Newsletter vol. 59) shows that, with minor changes in relative point spacing (brought about as a result of the scaling calculations), the ordinations are very similar, with all outlying points in comparable relative positions. In that previous column recall we had to jump back and forth between the table of eigenvectors (shown just above Table 5 of the PCA column) and the plots in order to develop a sense of the geometric meaning of the PCA subspaces. We still need the eigenvectors to interpret the CA subspaces. But it’s much easier and intuitive to develop these interpretations when a representation of the eigenvector ‘positions’ can be graphed on the same diagram.

The method I’ve illustrated above scales the results in terms of the B matrix. One can easily preserve the scale of the original \(X\) matrix by conducting all weighting operations using the original \(X\) matrix along with the row and column sums from that matrix. The singular values, eigenvalues, and eigenvectors will be the same, but the magnitudes of the \(H\) matrix, and \(D\) matrix, along with the various row matrices, column matrices, and eigenvector loading (= score) matrices, will be larger. The relative placement of points within the ordination diagrams should remain unchanged.

I appreciate this has been the most complex mathematical presentation of the series (thus far) and both thank and congratulate you if you’ve stuck with me to this point. As you can probably imagine there’s much more to the mathematics of CA than I’ve presented here. What I’ve tried to do is give you enough maths to understand what CA is about, how to interpret CA results, how CA relates to the methods we’ve discussed previously. The PalaeoMath 101 spreadsheet for this column contains a complete worked example of the trilobite frequency analysis. Look to that to clarify details of the calculations.

Correspondence analysis is a newer data analysis method and is being included in more highend multivariate statistical analysis packages these days (e.g., SPSS, SAS, Multitab, Statistica, GenStat). Not everyone has access to such packages, however, and it is expensive to acquire personal copies. Among the less expensive commercial software packages, XLStat (http://www.xlstat.com/en/home/) has a good implementation, as does CANOCO (http://www.pri.wur.nl/uk/products/canoco/). With a bit of practice and access to public-domain Excel routines like PopTools (http://www.cse.csiro.au/poptools/), though, you can easily make up an Excel spreadsheet that will perform simple CA analyses.

Sometimes you will see reference to a variant of CA termed ‘detrended correspondence analysis’ or DCA. This is a standard CA to which a post-processing step has been added in order to remove the possibility of generating ordinations that look like a parabola or a wave (also called the ‘arch effect’, horseshoe effect’, ‘Guttman effect’). Such results are usually obtained when analysing datasets in which there is strong gradient. Those who perform Q-mode analyses of any sort on a routine basis run into such results sooner or later. They are especially common in ecological analyses because one often runs into gradient-like structures in ecological data. Despite looking rather strange when you first encounter one, there’s nothing wrong with an analysis that produces such a result. All it really means is that an important aspect of your data’s variation is organized or patterned in a non-linear manner.

‘Detrending’ such plots sounds like a grand thing to do. Usually it’s not. The most popular detrending algorithm arbitrarily subdivides the range of the data along one CA axis (usually CAAxis 1) into a set of n bins and centres the data falling into each bin about 0.0. This removes the trend and ‘linearizes’ the data. But other than responding to an aesthetic desire to remove nonlinearities from the CA scores, there seems little justification either for applying this method of post hoc data ‘correction’ or preferring it to other conceivable methods (e.g., applying a curvilinear regression or spline function and representing the data along the CA-Axis 1 as residuals from the regression or spline axis). Indeed, such ad hoc manipulations can destroy aspects of the data’s pattern that are important for its correct interpretation (see Watenberg et al. 1987 for a critical review). The safest course of action is always to try to stay as close to the raw data as possible.

My advice is not to correct any analytic result for what amounts to aesthetic reasons. If you find a horseshoe in your results it’s telling you something about your data (presence of a gradient or some other source of non-linear signal). Use that information to understand how the pattern relates to your hypothesis test. Good discussions of, and further references to, the horseshoe effect can be found in Greenacre (1989), Reyment (1991) and Reyment and Joreskog (1993). Early and ecological practitioners of CA—and those influenced by them—often advocate detrending (e.g., Pielou 1984, Hammer and Harper 2006). More recent commentators, especially those concerned with applications in the geological sciences, have been decidedly more sceptical as to the technique’s value. Certainly the application of such ‘corrections’ to data in which there is no non-linear trend (see examples above) is entirely unnecessary and indefensible.

The methods on which modern approaches to CA are built are among the most powerful in the entire field of linear algebra. We’ll encounter them again when we discuss how patterns of variation in one set of variables can be related to those on other sets of variables. Looking back though, CA is perhaps best understood as a generalization of PCA. Anything you can do with PCA you can do with CA. The main difference between the methods is that PCA allows you to choose whether to retain the original scale of the data (unstandardized, covariance-based PCA) whereas CA, of course, requires the data be normalized to ‘correct for’ scaling differences. The ability of CA to handle more different types of data is a product of its attention to scaling issues, and a clear advantage in terms of the number of different types of data analysis situations it can cope with. It is possible to scale PCA and FA results to portray both eigenvector loadings and the scores of objects on the same PCA/FA axis, using methods developed in the context of CA. I can present those variations on PCA/FA analyses if there’s any interest.

Finally, a word about where CA leads us. There is a school of thought in multivariate data analysis that focuses on dimensionality reduction of complex numerical data and the production of graphs that express ‘relations’ between ‘entities’ in a low-dimensional space. This is the field of multidimensional scaling. Most often the purpose of these methods is simply to produce a picture of the data with little or no interest in the parameters of the data that control the graphical representation. The methods of PCA, PCoord, and CA are members of this family of data analysis techniques, and can be collectively referred to as classical approaches to the multidimensional scaling problem. There are other approaches (e.g., non-metric multidimensional scaling) that are even more generalized than CA. What I’ve tried to do here is show that the classical scaling methods can not only be used to produce the picture of your data. They can also help you understand it. That’s the ultimate plus in my book.

References

DAVIS, J. C. 2002. Statistics and data analysis in geology (third edition). John Wiley and Sons, New York. 638 pp.

GOLUB, G. H. and REINSCH, C. 1971. Singular value decomposition and least squares solutions, 134–151. In WILKINSON, J. H. and REINSCH, C., eds). Linear algebra: computer methods for mathematical computation, v. 2. Springer-Verlag, Berlin.

GOWER, J. C. 1966. Some distance properties of latent roots and vectors used in multivariate analysis. Biometrika, 53, 588–589.

GREENACRE, M. J. 1984. Theory and applications of correspondence analysis. Academic Press, London. 364 pp.

HAMMER, Ø. and HARPER, D. 2006 (actually 2005). Paleontological data analysis. Blackwell Publishing, Oxford, UK. 351 pp.

JACKSON, J. E. 1991. A user’s guide to principal components. John Wiley & Sons, New York. 592 pp.

MANLEY, B. F. J. 1994. Multivariate statistical methods: a primer. Chapman & Hall, Bury St. Edmonds, Suffolk. 215 pp.

PEARSON, K. 1901. On lines and planes of closest fit to a system of points in space. Philosophical Magazine, 2, 557–572.

PIELOU, E. C. 1984. The interpretation of ecological data. John Wiley & Sons, New York. 263 pp. REYMENT, R. A. 1991. Multidimensional paleobiology. Pergamon Press, Oxford. 539 pp.

REYMENT, R. A. and JÖRESKOG, K. G. 1993. Applied factor analysis in the natural sciences. Cambridge University Press, Cambridge. 371 pp. 

SPEARMAN, C. 1904. ‘General intelligence’, objectively determined and measured. American Journal of Psychology, 15, 201–293.

WARTENBERG, D., FERSON, S., and ROHLF, F. J. 1987. Putting things in order: a critique of detrended correspondence analysis. American Naturalist, 129, 434–448.

Author Information

Norm MacLeod - The Natural History Museum, London, UK (email: n.macleod@nhm.ac.uk)

PalAss Go! URL: http://go.palass.org/641 | Twitter: Share on Twitter | Facebook: Share on Facebook | Google+: Share on Google+