18. Form & Shape Models
Written by Norm MacLeod - The Natural History Museum, London, UK (email: email@example.com). This article first appeared in the Nº 72 edition of Palaeontology Newletter.
Recall that PCA is really a form of multivariate linear regression through a space defined by the original measurements (= variables) taken or observations made on the sample. The number of regression lines produced by PCA is equal to the number of variables or number of specimens present in the dataset, whichever is smaller. These regression lines are aligned with the major dimensions of covariation among the variables with the constraint that they are (normally) oriented at right angles to each other. As such, PCA lines can be used to construct an alternative data display space within which similarity relations among the objects comprising the sample can be visualized. These visualizations can then be used to test hypotheses about the nature of the observed variation. In effect, this means that the PCA space is a simple rotation and shearing of the space formed by the original variable axes (Fig. 1). If the covariance matrix is used as the basis for the PCA the original scaling relations among the variables is preserved (Fig. 1B). If the correlation matrix is used the original variables scaling relations are standardized so that each variable contributes an equal amount of variance to the result. This means the ordination space of a correlation-based PCA has, in addition to being rotated and sheared with respect to the original variable space, has also been compressed or expanded in certain dimensions (Fig. 1C).
Comparing the scatterplots of the PCA scores in figures 1B and 1C to the raw data in Figure 1A, it’s easy to see the regression-like nature of PCA. The equations of the PC axes relate the original variables to the new PC axes and are used to project the original data points into the PC ordination space. If you understand the PCA procedure you already know these same equations can be used to project any combination of values for the variables analyzed into the PCA space. But what isn’t as widely appreciated as it should be is that these same equations can also be used to solve the inverse problem of projecting coordinates from the PC ordination back into the space of the original variables.
At this point you’re probably saying, ‘OK, so you can use the PC axis equations to go both ways. I understand why I want to get my data into the PC ordination space. But I don’t quite see why I’d ever want to return to the space of the original variables. After all, the PC ordination space is a better space in which to represent and study relations between the objects in the sample, right?’ The answer to this question is, for the most part, yes; but there are some aspects of the variation problem that are more naturally and compellingly assessed in the space of the original variables. The most important of these aspects is the interpretation of the PC axes and the PC ordination space itself.
In order to illustrate the problem let’s take a close look at the PCA solution for the simple, trilobite glabellar dataset illustrated in Figure 1: two variables, both log10-transformed. For this discussion we’ll focus on the covariance-based result (Fig. 1B) as there’s no obvious reason why we would not wish to take differences in the scaling of the variables into consideration. By log-transforming the variables we’ve already put them into a form in which differences between the variables’ scales have been minimized in a way that still allows us to recover the original scalings any time we wish. The equations of these axes are, as follows.
In these expressions x1 refers to log10 glabellar length and x2 refers to log10 glabellar width.
The loading coefficients shown in equations 18.1 and 18.2 form the matrix we use to calculate the scores of the original variables in the new covariance-optimized PC space. A quick inspection of the ordination we achieved for these data (Fig. 1B) indicates that a variety of interesting sub-groupings appear to exist, at least for the individuals included in our trilobite dataset. Along the PC-1 axis (which represents over 95 percent of the form variance1 in our sample) three subgroups seem to be present. Acaste, Balizoma, Ceraurus, and Sphaerexochus appear to form a unified group at the low end of the axis, Trimerus appears to be an outlier at the high end, and the remaining genera form a complex group in between. Along the PC-2 axis, Dalmanites, Ptychoparia and Rhenops for a subgroup at the low end, Cheirurus, Deiphon, Phacops, and Toxochasmops form a group at the high end, and the remaining genera form another complex group in between. Taken together, it appears as though glabellar variation in our sample is organized into five broad categories or classes, as shown in Figure 2.
Whether these individuals are truly representative of their genera, and whether these groups would remain if more individuals were included in the sample, is doubtful. But that’s not the point I’m after with this example. Let’s simply accept these provisional geometric subdivisions for the sake of argument.
If all we want to do is get a quick and dirty answer to the question of whether glabellar form is distributed continuously and discontinuously in our sample we could conceivably stop here. The answer is clearly the latter. But that answer more-or less begs the further question ‘What does glabellar variation in the sample look like?’ I'd like to think any competent data analyst would be as interested in providing an answer to this further question as they are in answering the original variation-mode question. But when we try to interpret even this simple PCA space we run into problems.
In terms of the standard approach to PCA analysis, the only information we have about the character of variation in this PC space are equations 18.1 and 18.2. Even though there are only two numbers to keep track of per axis it’s surprisingly difficult to construct a comprehensive and accurate picture of what the glabella of these groupings looks like—much less confirming that the result is a reasonable summary of reality—just by staring at them. What we can say is that variation along PC-1 is strongly size controlled with a subtle component of relative elongation as one moves up the PC-1 axis. Along PC-2 the glabellar groups change from being relatively long and narrow (the Dalmanites, Ptychoparia and Rhenops group) to short and wide (the Cheirurus, Deiphon, Phacops, and Toxochasmops group). But note that the amount of form variation expressed by PC-2 is so small relative to that expressed by PC-1 its uncertain whether we would expect this pattern to be noticeable to the taxonomist’s eye just from the information provided by these numbers. What’s missing in the number-comparison approach, of course, is any good way of getting at the inherent geometry of the system. This missing bit isn’t just annoying. It severely constrains our ability to interpret the results of even this simple PCA analysis in a way that’s biologically meaningful, either to ourselves in the context of our investigations or to others in the context of communicating the hard-won results of our analysis.
At this point most morphometricians would launch into a discussion of geometric morphometrics and wax eloquent about the advantages of working with landmark coordinates. I’ve already done that over the last few columns and I hope you’ve come to appreciate the power of using the sorts of graphic representations of form and shape variation we’ve generated up to this point. But the fact is, none of the superposition tools or shape coordinate plots we’ve seen up to now help us much with the problem of interpreting the ordination space that results from a PCA analysis regardless of whether that analysis is performed on linear distances as I’ve done in the example above or on Procrustes superimposed shape coordinates. If anything, the problem gets worse for shape-coordinate datasets because the number of variables needed to represent distances between landmarks in shape coordinates is up to four times larger than the number of variables needed to quantify the same distances in a multivariate morphometric dataset. To keep things simple I’m going to stick with the glabellar distance data to develop the mathematical concepts we need to translate equations 18.1 and 18.2 into pictures we can inspect and compare, just as we’d inspect and compare pictures of organisms. Then I’ll apply these same concepts to a landmark dataset to show how this technique improves our ability to take advantage of the more geometry-rich information recorded by landmarks in a Procrustes PCA analysis.
The basic tool we need is a way of solving the inverse projection problem: taking coordinate values in the PCA space and projecting them back into the space defined by the original variables. It’s actually easier than you might suspect. Expressed in matrix notation the equation we use to calculate the PC scores (S) is:
X = SU (18.3)
where X is the original data matrix of distances or landmark coordinates (in our example the 20 objects by 2 variables [= glabellar measurements] matrix of raw data each of which has been log10-transformed) and U is the 2 x 2 matrix of eigenvector coefficients (see equations 18.1 and 18.2). In order to perform the back-transform all we need do is pre-multiply the matrix of PC scores (S) by the inverse of the eigenvector matrix (U-1).2
X = SU-1 (18.4)
Of course, in our trilobite glabella example once the back-transformation calculation has been made the original data will be expressed as log10 values of the original measurements. The original scale of the distances can be recovered by sequentially raising 10 to the power of corresponding value in the X matrix. If we perform this operation correctly for the matrix of PC scores we should end up with the values of our original data. I know, that’s not too interesting. However, the magic comes when we realize that we can use the same matrix arithmetic operation to calculate the hypothetical ‘raw data’ values for any coordinate position in the PCA space.
In other words, all the possible coordinate positions in the PC space correspond to hypothetical or theoretical objects in the sense that there is a complete, one-to-one mapping between the original variable space and the PC space. Our set of observed data points is simply a subset of an infinite mathematical universe of all geometrically possible objects occurring in the space that we happen to have found and measured. The U matrix is the door that allows us to travel from the original variable space into the PC space. Similarly, the U-1 matrix is the door that allows us to travel in the opposite direction. Most data analysts know how to use the U matrix door. But they’ve either forgotten, or were never taught, about the other door. Consequently, if there are any interesting coordinate locations in the PCA space we don’t have to simply stare at them, scratch our heads, and try to figure out what they might represent by looking at observed points that may—or may not—plot in the vicinity of those we’re interested in. We can take any point in the PCA space and create a geometric picture of the hypothetical object that exists at that location.
Let’s take an obvious example the accuracy of which can be checked independently. Each of the groups shown in Figure 2 has an average PC-1 and PC-2 score that can be plotted as a specific coordinate position in the PC space. We can calculate this set of group-averaged PC scores, back-transform these coordinate locations into the space of the original length and width variables, and then compare those estimated values to the group means calculated directly from the raw data. The Palaeo-Math 101-2 spreadsheet details all these calculations. There is, with the exception of rounding error, perfect agreement between the average values calculated from the raw data and those estimated from the back-transformed group-average coordinates in the PC space. But once we have these values we can also create a direct graphic representation of the form of the glabellas for each group by drawing an ellipse with the specified mean length and width dimensions. Figure 3 shows the reconstructed gross glabellar form based on these group mean glabellar lengths and widths.
Figure 3. Reconstructed group-mean trilobite glabellar forms based on length and width measurements. Symbol colour codes as in Figure 2. See text for group membership.
Now we can see images of the hypothetical forms lying at the group centroid locations. As a result, the differences between groups have been made clear. The glabellas of the groups arrayed along the PC-1 axis (red, green-blue-yellow, magenta) are distinguished primarily by size. This is, of course, signalled by the fact that both the PC-1 eigenvector coefficients are positive. But the value of the reconstructions is that now both analysts and readers are provided with a direct visual impression as to the magnitude of the size differences. Similarly, the glabellas of the groups arrayed along the PC-2 axis (yellow, red-blue-magenta, green) are distinguished primarily by shape. The glabellar width is much shorter than its length for the yellow group, subequal to the length for the red-blue-magenta group, and much longer than the length for the green group. This agrees with our gross interpretation of equations 18.1 and 18.2. But for those not used to interpreting such data geometrically getting a sense of the form/shape lying behind the numbers is very difficult. By using this simple back-transformation method a direct and perfectly accurate visual representation of the geometric meaning of these equations can be created. These simple mathematical models of the underlying geometry can now be used to guide interpretation and facilitate communication in a manner much more accessible to most palaeontologists than visual inspection of the matrix equations themselves.
Since we’ve developed the method and proved it works, let’s use it to explore this simple PC space. One common challenge in interpreting PC ordinations is getting an accurate and complete understanding of exactly what the PC axes represent. Note that while the models we constructed for the group means are approximately aligned with the PC axes, they are not precisely aligned with them. There is also a question about which axes we’re talking about. Since the space occupied by the glabellar data in the PC space is far away from the origin of the coordinate system it makes little sense to model forms/shapes along the system axes sensu stricto. Rather, what we really want to know is what shape variation in the direction of the PC axes, but within the region of the theoretical form space occupied by our data, looks like. This effectively focuses our exploration on the region containing the mean form in a manner wholly consistent with geometric morphometric conventions.
Table 1 shows the coordinate values and associated form models along glabellar form PC-1 and PC-2 centred on the mean form (coordinates: 1.284, 0.088).
Table 1. Form models for the glabellar principal component axes. Coordinates (= PC scores) used to construct the model are given below each graphic.
As with the group mean models, there is no information in Table 1 that is not present in Figure 2. But it in terms of accessing the information present in that figure to make valid biological interpretations of the PCA result it is difficult to think of more useful technique than the calculation, plotting, inspection, and comparison of heuristic form/shape models. In this example note the particular clarity with which the dual nature of PC-1 has been shown. The standard (but for the most part erroneous) interpretation of the first principal component of a set of distance data is that it represents size. The form models calculated for this axis do differ in size, with small glabellas projecting low on the axis and large glabellas projecting high. But in addition to this there is a distinct pattern of size-independent glabellar shape variation that is also being expressed along PC-1. For this dataset small glabellas exhibit a sight but noticeable tendency for the glabellar width to be greater than the glabellar length whereas large glabella exhibit the opposite relative length-width relation. Although the difference in the rates of change in glabellar length and width along PC-1 are clear in equation 18.1, the shape-state of the space occupied by the sample cannot be inferred from the information in equations 18.1 and 18.2 alone, which is all most analysts are taught to use in making an interpretation of PCA axes. By translating selected locations within the PC coordinate space back into their equivalents within the original variable space, and then using those reconstructed values to devise a graphic representation of the distribution of (hypothetical) shapes in the space, a much more complete and meaningful interpretation of the set of abstract PC axis can be made quickly, easily, and in a manner that invites further exploration. As shown in Figure 4, any location along any trajectory through the PC space can be represented by a theoretical model of form (or shape) and used to interpret the PCA result. Moreover, this general approach can be applied to any eigenanlysis-based data analysis technique (e.g., factor analysis, principal coordinates analysis, canonical variates analysis, partial least-squares analysis, etc.).
Figure 4. Distribution of trilobite glabellar forms within the space of the two principal component (PC) axes superimposed over a set of heuristic form models illustrating the underlying geometry being expressed by the PC ordination space.
I’ve purposefully introduced the concept of heuristic shape modelling using a simple dataset of inter-semilandmark distances to show that such an approach can be applied to any dataset susceptible to PCA analysis. This is contrary to the published assertions of many adherents to the geometric morphometric paradigm who often imply that only landmark data can be modelled in ways that inform biological interpretations. As a matter of fact there is an extensive and somewhat neglected literature on the graphic representation of multivariate data analysis results (see Everitt 1978, Tufte 1983, Cleveland 1985, Myatt & Johnson 2009). Curiously though, I’ve yet to come across the straight-forward and computationally compact back-transformation method for modelling multivariate results I’ve described above.
Naturally, model-based approaches are relatively easy to devise for structures whose shape is regular—at least in gross aspect—and lends itself to characterization by simple geometric forms or form descriptors. With a little creativity though, even datasets composed of variables that have no geometric relation to one another can often benefit from the model-based approach. This point having been made, the data types that have come to be associated with geometric morphometrics are, perhaps uniquely, well-suited to this modelling approach. As a last example I’ll apply the back-transformation method to a trilobite cranidial landmark dataset (Fig. 5) to show how this procedure can by applied in the context of a Procrustes PCA analysis.
Figure 5. Landmarks used to quantify shape variation in the trilobite cranidium. Scale bar = 7.87 mm. 1: anterior glabellar mid-line terminus. 2,10: intersection of the lateral anterior glabellar margins with the pre-glabellar field. 3,9: eye centroids. 4,9: latero-posterior librigenal margins. 5,7: posterior lateral galbellar termini. 6: posterior glabellar mid-line terminus.
Only 18 of the 20 trilobite images could be used for the cranidial analysis as two genera lack the eyes necessary for location of landmarks 3 and 9. Procrustes superimposition of these 10 landmarks across the specimens representing these 18 genera, along with the sample mean shape, is shown in Figure 6.
Figure 6. Procrustes superposition of 10 cranidial landmarks (see Fig. 5) for 18 specimens. Black symbols mark position of mean shape landmark coordinates.
A Procrustes PCA analysis of these shape coordinate data yields 17 eigenvectors with non-zero lengths. This result is consistent with expectations of the removal of translation, scale, and rotation information from the raw landmark coordinate values. Of this shape-vector set the first three vectors represent more than 75 percent of the observed shape variation. The distribution of the 18 trilobite specimens within the ordination space formed by these three shape axes is shown in Figure 7.
Figure 7. Comparison of shape similarity-dissimilarity patterns among landmark data collected from the sample of 18 trilobite genera in the ordination space formed by the first three Procrustes principal components. Together these components express 77.66% of the observed shape variation. Note that the arrangement of plot axes facilitates visualization of the distribution in a three-dimensional space.
Unlike the previous glabellar form analysis, there are no obvious subsidiary groupings of taxa within the cranidial data used to construct this PCA ordination space. Therefore, it makes no sense to calculate shape models for arbitrary groupings of genera. But there is always a need to gain a detailed geometric understanding of the character of the shape space itself. Visual inspection of the table of shape coordinate loadings on these principal component is an option. This information is provided in Table 2.
Table 2. Variable (= shape coordinate) loadings for the three Procrustes PC axes shown in Fig. 7.
Taking the first cranidial principal component as an example of how such an inspection-based interpretation would be undertaken, note that the maximum positive and negative loading coefficients on the PC-1 axis are associated with variables y6 and y8 respectively with variables y1 and y4 also exhibiting notably high and low values. This suggests that, as one moves along PC-1 from left to right, the glabella of the cranidia migrates to a more anterior position relative to the lateral cranidial margins which, relative to the glabella, migrate to more posterior positions. While this interpretation is clear and relatively easy to determine for an experienced analyst, it still only provides an understanding of how these two regions of the cranidium are changing position relative to one another. It would be considerably more difficult to arrive at—much less describe in words—the full set of relative changes in the location of each landmark in the x and y directions as the position along the Procrustes PC-1 axis is changed. Compare this rather daunting task to the level of geometric insight into the geometry of PC-1 provided via calculation of heuristic models for a set of regularly-spaced positions along that axis (Table 3).
Table 3. Heuristic trilobite landmark shape models for the first principal component of the trilobite cranidial landmark data. Values of the PC-1 coordinate used to construct the model are shown below each model graphic.
These models confirm the previous shape transformation interpretations gained through visual inspection of the principal component loading values (Table 2), and also extend these interpretations in a manner that is both natural and intuitive. Using this model set, and without having to inspect any table of numbers or search for high and low values, it can be readily appreciated that PC-1 incorporates a moderately strong component of cranidial narrowing in addition to lengthening, and that this narrowing is confined to the middle region of the cranidial structure (the eye landmarks 3 & 9).
Visualization of further, and even more subtle, contrasts between these models can be seen if one overlays them in a single system such that each landmark position forms a displacement track as the position along PC-1 changes (Fig. 8).
Figure 8. Overlay (or strobe) plot of the heuristic PC-1 shape models shown in Table 2. Landmark position symbol colours denote location of the model along the PC-1 axis (as in Table 3). See text for discussion.
Pat Lohmann, who first developed the overlay modelling display technique for the interpretation of PC axes, referred to them informally as ‘strobe plots’. Colour coding the landmarks associated with each strobe plot based on axis position allows the polarity of each landmark’s displacement to be assessed. Also, joining landmarks located on or close to the outline together with straight lines provides a sense of shape change in the overall structure.
Use of these strobe plots allows complete freedom for the analyst to focus on changes in a particular landmark in isolation from all others, on changes landmarks defining or located in a particular region, or on changes in the entire landmark ensemble; whatever is needed to understand those aspects of shape variation present in the sample relevant to the particular systematic or biological question(s) at hand. Finally, for completeness, shape model sequences and strobe plots for the trilobite cranidial PC-2 and PC-3 (Fig. 7) are provided in Table 4 and Figure 8. The geometric interpretation of these axes is left as an exercise for the reader.
Table 4. Heuristic trilobite landmark shape models for axes PC-2 and PC-3. Modelled coordinates shown below each model.
Figure 9. Overlay (or strobe) plots of the heuristic shape models shown in Table 4.
Everitt, B. 1978. Graphical techniques for multivariate data. North Holland, Amsterdam. 117 pp.
Myatt, G. J. and W. P. Johnson. 2009. Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications. Wiley-Blackwell. 291 pp.
Cleveland, W. S. 1985. The elements of graphing data. Wadsworth, Monterey, California. 321 pp.
Tufte, E. R. 1983. The visual display of quantitative information. Graphics press, Cheshire, Connecticut. 197 pp.
Endnotes1 Form is represented in a morphometric dataset when size information is embedded in the data. Raw interlandmark distances and non-superposed landmark coordinate datasets express form. Shape is the information remaining in a dataset once variation due to differences in object position, rotation, and size have been standardized, usually through Procrustes superposition.
2 See the Palaeo-Math 101-2 spreadsheet for complete details of these calculations.