3. Tree Building
In this article I will explore the ways in which cladograms/trees are constructed from the data matrix by the parsimony programs. I should point out that when cladistics first started (pre-computer) Hennig argumentation was used: that is, groups were constructed solely upon the evaluation of what were judged a priori to be apomorphic character states (details of the argumentation can be seen at http://research.amnh.org/~siddall/methods/day2.html)
From cladogram/tree construction I will detail how we use the PAUP* program in order to read the data matrix, set various starting conditions and any assumptions that we may wish invoke to build the tree, as well as actually building the tree. We will then deal with the tree output but leave the meanings of all the optimisation and tree statistics until the following article when you have refreshed your batteries from this battery. This is not intended as a manual, but it may speed up the inevitable learning curve that we all experience confronted with a new program. I will deal with the Macintosh version of PAUP*. For those with a PC version then I do have a help sheet on request.
The scale of the problem
There is no theoretical reason why we have to use computers. It is just the scale of the task that makes it a practical necessity. The problem is that with increasing numbers of taxa to be evaluated then the number of possible trees that must be considered in order to find the most parsimonious is staggering (Figure 1). Therefore we need computers. Even then it turns out that the problem is currently beyond the technology.
The tree building algorithms used in PAUP* fall into two main methods: Exact and Heuristic. All of them construct unrooted trees/cladograms which are then rooted according to the criteria that you set (see later).
These are guaranteed to find the shortest tree(s) – remember there may be more than one. The Exhaustive Search routine is the simplest (Figure 2). In this routine three taxa are taken (usually the first three in the data matrix) and a fourth is added to all possible positions to give three possible paths; then to each of these a fifth taxon is added in all five possible positions on each path, followed by a sixth etc. until all taxa have been added and all possible networks have been generated. Then the lengths (numbers of steps) of all the cladograms are evaluated and the shortest chosen. Given the astonishing number of possibilities (Figure 1) the Exhaustive search is usually only practicable for about ten taxa.
A compromise on this method is provided by Branch and Bound (Figure 3). This is a technique that does not require every possibility to be examined individually. In this method a cladogram is constructed for all the taxa using one of the Heuristic techniques that we will come onto later. The length of this cladogram is stored in the memory and it is called the ‘upper bound’. Then, the Branch and Bound method proceeds as in the Exhaustive Search routine. As each of the paths is followed, the lengths of the partial cladograms (those in which not all taxa have been yet added) is calculated. As soon as the length of a cladogram in any path is longer than the upper bound, that entire path is abandoned – because the addition of more taxa can only increase the length of the cladogram further. If one of the paths is completed and a shorter cladogram is found, this becomes the new ‘upper bound’. By this method the number of cladograms that have to be evaluated is drastically reduced. Even so, Branch and Bound usually begins to creak at about 15 – 20 taxa (it very much depends on how clean the data is).
Ideally one would like to use one of the exact methods. However, because we usually have more than twenty taxa or because we have messy data (lots of homoplasy, question marks etc.) we are forced to use heuristic (trial and error) methods. Do not despair! Heuristic methods do have a number of short cuts and slick moves that usually enable us to find the shortest cladograms. But let’s start with the pitfalls.
The heuristic method, pure and simple, starts with the first three taxa in the data matrix (Figure 4a). It then adds the fourth taxon, calculates the lengths, and discards all but the shortest (Figure 4b). This means that even if the addition of the next taxon to one of the cladograms now discarded should turn out to yield trees shorter than those retained it will never be considered – it has already gone down the drain because of the inferiority of its antecedent cladograms. The situation is worse because, if there is a tie in the lengths of the cladogram one will arbitrarily be discarded (Figure 4c). So, it is possible that by discarding paths of cladogram building so early we are forced onto a path that may lead to a less than optimal cladogram length. We are, effectively, being forced onto an island where we may never find the shortest tree. At the risk of stretching the metaphor – we don’t always get the coconut! This is called the ‘Island of trees problem’.
We might explain this another way using the analogy of a hill climber wishing to reach the ‘highest peak’, which in our case is the shortest cladogram. The cartoon in Figure 5 is meant to illustrate the dilemmas. In this cartoon each of the mountains represents a path of cladogram construction. If we happened to start on any mountain but “Mount C” we would never reach the optimal solution. But, of course, having started to climb a particular mountain (started to go down a path of tree building) we cannot go back. There are two things that we could do. First we could change the trees/cladograms that we start with: in other words we could change the order in which we add the taxa. This may allow us to start on another mountain. Second, we could try to jump from one mountain to another in the hope that we land on “Mount C” and eventually reach our nirvana.
The heuristic algorithm in PAUP* has alternatives for specifying the order in which taxa are to be added into cladogram building (starting on different mountains), and it has alternatives for branch-swapping (jumping from one mountain to another). The order in which taxa are to be added to the heuristic search is called STEPWISE ADDITION and is an attempt to start on different mountains. PAUP* provides four options that are explained in Figure 6. The most frequently used are ‘Random addition’ and ‘Closest’. For branch swapping (jumping from mountain to mountain) there are three options but only one is really effective – tree bisection and reconnection, or ‘TBR’ (Figure 7); this is the default in PAUP* so I suggest that you leave well alone. By using a combination of stepwise addition and branch swapping the heuristic algorithm more than likely will find the shortest tree(s). The trick is to repeat the heuristic searches several times until both the topology and the length of the trees stabilise. The combination of random addition sequence and TBR is usually sufficient.
Having dealt with the theory we can now look at the computer screens and see what it actually looks like. I will follow through a sequence of screen dumps as you may see them when carrying out an analysis. In all of what follows there are many more options. For simplicity, I will deal only with those usually used.
Figure 8 is the opening screen that you get after reading the matrix. It tells you that the matrix has been read successfully. If there are errors it specifies the problem. If you use NDE or MacClade to construct the data matrix you should have no problems because you would not be able to save those files successfully. If you write your own matrix from scratch then the normal sources of errors are mismatches between the numbers of taxa and/or characters specified and those actually recorded.
In the next stage is the OPTIONS menu (Figure 9) where we would set the Maxtrees. This sets a value of the number of trees to be held at each iterative stage of the tree building process. The usual setting is “automatically increase by 100”. The next job under the options menu is to set the outgroup and how that outgroup is to be represented. The computer actually builds an unrooted network but then roots the tree(s) using your specification of outgroup. This action will polarise the characters (see previous article). The screens should be self explanatory.
The DATA menu (Figure 10) is very useful. Options here allow you to include or exclude characters, include or exclude taxa, and to set the type of character to behave under alternative parsimony theories (we will deal with these in another article). Normally you may wish to set a particular multistate character to be ordered or unordered, but there are other options such as Dollo and irreversible (Camin-Sokal parsimony). You may also change the weight of a particular character, but I’m not encouraging you to do this!! What this means is that you are saying that character A is x times more important than character B. What actually happens is that the computer rewrites the data matrix including that character several times, and consequently ‘loads the dice’. So if you set character 2 to have a weight of four then the matrix would include character 2 four times. How you justify this with morphological data a priori I do not know!! There are, however, perfectly good reasons for doing so with molecular data.
The ANALYSIS menu (Figure 11) is where you set the conditions of tree building just before you hit the button. Here, there are a number of tree-building options. We are only interested in Parsimony (the others are appropriate for molecular data). Under Parsimony there are three options: Exhaustive, Branch & Bound, and Heuristic. The screens for Exhaustive and Branch & Bound are straightforward and do not need any explanation. Most people just hit the ‘search’ button. I have included the Branch & Bound screen (bottom of Figure 11) because, like screens under the other parsimony options, you can ask the computer to save all trees that are X steps and less. This can be useful when we compute Bremer Support for the trees (article 5).
For HEURISTIC searches then we have options of stepwise addition and branch swapping to take into consideration (Figure 12). The common thing to do here is to hit the ‘random’ button under “Stepwise addition” and put in a number of replications.
Under any of these parsimony algorithm screens we hit the ‘SEARCH’ button.
Once the trees have been built under one of the tree search options you then have to examine the tree(s). This is found under the Tree menu (Figure 13). It is here that you could save your tree into a file for use later.
Usually you would want to describe the trees using the Tree Description menu (Figure 14), asking for certain pertinent pieces of information. Here there are a number of options indicated on the menu. Normally, you would ask for the list of changes and the list of apomorphies.
At last you will be able to see your tree(s). As a slight diversion: when you carry out an Exhaustive Search (and only when you do this) a graph will automatically scroll past (Figure 15). This gives you a tree length distribution of all trees examined. It can be very useful.
When you hit the ‘Describe’ button a lot of information will scroll past (Figure 16). This will take some digesting. Much of it we will deal with in the next article. For now, let’s look at the basics. The first thing to scroll past is a paragraph giving you details of tree length, consistency indices etc. The second is the tree itself, with terminal taxa numbered and the internal nodes numbered (you asked for this). When there are multiple trees each will have the same output format, so the screen could scroll past for some time (there is a buffer limit and therefore if you ask for the output for many trees it could be that the operation will be cut short – if this happens then just ask for the output for some of the trees, say trees 1 – 20).
If you have asked for the changelist and the apomorphy list you will get a table under each heading. The same information is conveyed in both but in a different format. Under the change list each character is taken in turn and certain facts are displayed. The headings are character number, consistency index of that character on that particular tree (next article), the number of steps that character has (if a binary character it will be 1 step, if a multistate such as a 0, 1, 2 character it will be 2 – an earlier screen in Figure 10 bottom has shown you this), and the actual change that the character has made on that particular tree. So, character 1 (binary character) has changed from the ‘0’ state to the ‘1’ state along the internode numbered 13 to 14. So the ‘1’ state would be interpreted as a synapomorphy supporting the group Taxon 1 + Taxon 2. Character 2 (also a binary character) has actually changed twice on the tree, in each case from the ‘0’ state to the ‘1’ state, once along the branch from node 13 leading to taxon 3 and again from node 14 to taxon 1. This would be interpreted as a parallel change (a homoplasy). Character 3 (a multistate of three states) has also changed twice on the tree but the state changes have been different: from the ‘0’ state to the ‘1’ state from node 11 to node 12 and then from the ‘1’ state to the ‘2’ state from node 13 to node 14.
You will also notice that there are different kinds of arrows between the state changes. There are two variations; single- or double-headed, single or double lines. The double-headed arrows mean that the change has happened between the outgroup (in this case Taxon 7 + Taxon 8) and the ingroup. The double line means that the change is unambiguous on that tree (it cannot be interpreted any other way). The single-lined arrow means that the change is ambiguous on this particular tree. As explanation look at character 2. The ‘1’ state is present in taxon 1 and taxon 3. On this particular tree there are two explanations for this distribution: either it has been gained twice – once in Taxon 1 and again in Taxon 3; or it has been gained at node 13 and lost along the branch leading to Taxon 2. If you cast you mind back to article 1, this is the difference between DELTRAN and ACCTRAN. So this particular optimisation is not the only one possible.
The apomorphy list is the same information but it takes the tree node by node; so this gives you an instant read out of what characters are supporting what nodes (the usual piece of information that we need to know).
We will deal with the ‘ci’ value next article.
I would like to express my thanks to Sinaeur Associates, Inc., for permission to use ‘screen dumps’ from the PAUP* program as part of the illustrated material in this article.