## 2. Loops, logical statements and writing functions

The first article in this series focused on introducing the very basic elements of the R language: types of data, vector calculations, statistics and plotting your data. I also mentioned that one of the most important benefits of statistical platforms such as R is that they allow for researchers to have direct control over their analyses by either adapting pre-existing software or writing bespoke code to execute a particular task in exactly the way they want it to be done.

This article is a direct continuation of the previous one and introduces more basic techniques that are common to all programming languages, including loops and logical statements. Finally, with an understanding of these methods, I will conclude by showing you how to turn your existing code into functions so that it can be easily applied to multiple datasets and shared with other researchers.

### Loops

Loops are one of the more useful tools available to programmers, although they can be troublesome as I will discuss briefly later. Simply put, a loop contains a series of commands (contained within curly braces) that you wish to be performed a certain number of times. There are two kinds of loops that are commonly used: for and while loops. The first, for, is used when you have a *vector* or *list* of values you wish to loop through and has the following general format, sometimes called *pseudocode* by programmers:

```
for(counter in vector){
command(s)
}
```

For example if you had a *scalar* called **x** that has a value of 1 and you wanted to add 2 to **x** 10 times you could use the following code:

```
x <- 1
for(i in 1:10){
x <- x + 2
}
```

If you were now to call for the resulting value by typing **x** R will print the value:

`[1] 21`

You may have noticed that in the first line of the loop after the command **for** is the letter **i**. This is referred to as the counter and represents each iteration of the loop changing each time to the next value in the list that you have defined – in this case the values from one to ten. To see this in action you can print the value of i at each iteration using the following code:

```
for(i in 1:10){
x <- x + 1
print(i)
}
```

The value of the counter can also be included within each loop and used in calculations. For example if we were to create the *scalar* **y** that is also set to 1 and we wanted to add the value of the counter (**i**) each time – *i.e.* adding one in the first loop, two in the second and so on up to 10 – we could use the following code:

```
y <- 1
for(i in 1:10){
y <- y + i
}
```

If you then call for y, by typing the letter **y** at the prompt ‘>’, R will return the following:

`[1] 56`
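The counter does not have to run through consecutive integers; it can step through the values of any vector. As a minimal sketch, the measurements below are invented for illustration (the names **lengths.mm**, **len** and **total** are my own assumptions and do not appear elsewhere in this series):

```r
# Hypothetical measurements (in mm), invented for this example
lengths.mm <- c(12.5, 30.1, 8.4)

total <- 0
for(len in lengths.mm){
  # the counter 'len' takes each value of the vector in turn
  total <- total + len
}
total
```

Here the loop runs three times, once for each value stored in **lengths.mm**, and **total** ends up holding their sum.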

The second main type of loop is called a while loop. Here the loop continues while a certain condition remains true and ceases when the condition is no longer met. Using an example similar to the previous one, say we had a value, z, and we wished to keep adding a random number to it until it reached another value, say 100; we can use a while loop for this. In order to add a random value to z we can use the function runif (for more information type help(runif)), which takes three values: the number of random values you want, and the minimum and maximum between which these values are drawn.

```
z <- 1
while(z < 100){
z <- z + runif(1,0,10)
}
```

If you were now to call for z, you’ll find a value of at least 100 (and less than 110, as no single step adds more than 10). This occurs because the loop only stops once the condition becomes FALSE; if you now type z < 100 at the prompt, R will tell you that it is FALSE. Because you are adding a random value to z, the final value will be different every time you run this code.
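If you would like the same result each time, and want to know how many times the loop ran, one possible sketch is given below. It assumes the use of **set.seed**, a base R function not otherwise covered in this article, and a counter I have called **n.steps**, which is purely illustrative:

```r
set.seed(42)     # fix the random number generator so the run is repeatable
z <- 1
n.steps <- 0     # hypothetical counter recording the number of iterations

while(z < 100){
  z <- z + runif(1, 0, 10)   # add one random value between 0 and 10
  n.steps <- n.steps + 1
}

z         # the final value, no longer less than 100
n.steps   # how many times the loop ran
```

Running this code twice in a row, including the **set.seed** line, should give identical values of **z** and **n.steps**.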

### Common issues with loops

Although very useful in programming, there are a number of quirks about loops that I should briefly mention. Firstly, they can take a long time to run, especially if you have nested loops (for example, if you have a loop of ten containing a second loop of ten, the commands are being run 100 times in total). This may be unavoidable, but one tip is to leave outside the loop anything you only need to do once, which keeps the run-time shorter. Secondly, be careful not to have a conditional statement that will result in the loop running forever; an extreme example of this would be to use the following:

```
value <- 1
while(value > 0){
value <- value + 1
}
# Press ESC to stop the program otherwise it will keep running
```

Here, the condition of running the loop is if the scalar called value is greater than zero, and this is always the case as you are continually adding one to the original value of one. To see this in action include the line print(value) in the loop. This will also give you some idea of how quickly your computer does calculations!
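One way to guard against a runaway loop while you are still testing your code is to cap the number of iterations. The sketch below is only one possible approach: it assumes the base R command **break**, which stops a loop immediately, together with an **if** statement (described later in this article); the names **max.iterations** and **iterations** are my own.

```r
value <- 1
max.iterations <- 1000   # hypothetical safety limit
iterations <- 0

while(value > 0){
  value <- value + 1
  iterations <- iterations + 1
  # leave the loop once the safety limit has been reached
  if(iterations >= max.iterations){
    break
  }
}

value        # 1001: the starting value of 1 plus 1000 additions
iterations   # 1000
```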

On another note, one of the most common, but easily fixed, errors in programming is failing to close every statement you write. By this I mean every time you open a bracket ‘(’ or curly brace ‘{’ you have to end with the matching closing bracket ‘)’ or brace ‘}’. If you don’t do this R will consider that line or function incomplete and, instead of the command prompt ‘>’, will show a ‘+’ to indicate that it is waiting for the rest of the statement. In order to demonstrate this take any of the previous examples and run all the lines except for the final one, leaving out the last curly brace. You will see that the code does not run until you type in and run the last curly brace.

### Logical statements

Another important component in programming is the use of logical statements; these are questions about your data that result in an answer that is TRUE or FALSE which in turn can dictate what methods are applied to the data. This was mentioned briefly in the previous article but will now be discussed in greater detail. The syntax for the different logical arguments used in R is provided in Table 1.

Syntax | Explanation |
---|---|
`!` | Logical NOT |
`&` | Logical AND |
`\|` | Logical OR |
`<` | Less than |
`>` | Greater than |
`<=` | Less than or equal to |
`>=` | Greater than or equal to |
`==` | Equal to |
`&&` | AND when used with if |
`\|\|` | OR when used with if |

[Table 1. Logical commands in R.]

A simple example of a logical question would be to ask whether one value – e.g. 5 – is greater than another, e.g. 3, using the greater than symbol (>), by typing:

`5 > 3`

This will return the following:

`[1] TRUE`

Two or more statements can be joined together using either AND (&) which will return TRUE if all individual statements are true, or OR (|) which will return TRUE if any of the statements are true. Below are two examples using both the AND and OR statements, which in English can be read as “Is 5 greater than 3 AND is 10 less than 9?” and “Is 5 greater than 3 OR is 10 less than 9?” respectively. As only the first statement in each case is true the first example returns FALSE but the second example returns TRUE.

```
5 > 3 & 10 < 9
[1] FALSE
5>3 | 10 <9
[1] TRUE
```

If we ask a logical question of a dataset containing multiple values, in the form of either an array or a matrix, each element is treated separately. So if we create a vector called x with five values and want to know which values are greater than or equal to four we could use the following:

```
x <- c(1,2,3,4,5)
x >= 4
```

R will return:

`[1] FALSE FALSE FALSE TRUE TRUE`

Making things more complicated, say we wish to know which value is greater than two AND less than four:

```
x > 2 & x < 4
[1] FALSE FALSE TRUE FALSE FALSE
```
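The TRUE and FALSE values returned by these statements can themselves be stored and summarised. As a small sketch, the base R functions **which**, **sum**, **any** and **all** (none of which are covered elsewhere in this article) can be used as follows; the name **in.range** is my own:

```r
x <- c(1,2,3,4,5)
in.range <- x > 2 & x < 4    # element-wise: one answer per element of x

which(in.range)              # position of the TRUE element: 3
sum(in.range)                # number of TRUE elements: 1 (TRUE counts as 1)

# any() and all() each reduce a set of answers to a single TRUE or FALSE,
# which is the form that && expects
any(x > 4) && all(x > 0)     # TRUE
```

The last line illustrates the difference between & and && in Table 1: & compares element by element, whereas && joins two single TRUE/FALSE values, as needed in an **if** statement.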

It is important to note that the syntax for ‘greater than or equal to’ and ‘less than or equal to’ must be ‘>=’ and ‘<=’ respectively as this command varies across programming languages (for example .GE. and .LE. in Fortran). If you were to type ‘=>’ R would return an error saying that it did not expect the ‘>’ symbol.

The usefulness of logical statements becomes clear not just when you want to identify elements of your datasets that match certain criteria but also when you want to select, ignore or perform calculations on only selected elements. If any of the previous logical statements are used to select elements, only those in which the result is TRUE will be returned. So creating a new array called **y**:

`y <- c(1,10,12,3,90,8)`

If we then wanted to return all the values greater than 10 we could use:

```
y[y>10]
[1] 12 90
```

Rather than using specific values, say we wanted to know which values were greater than the mean value for **y**:

```
y[y > mean(y)]
[1] 90
```

As the mean value for **y** is 20.6667, only one element (90) is greater than the mean value.
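The logical NOT (!) from Table 1 can be used in the same way to select everything that does *not* meet a condition. A minimal sketch, in which the name **above** is my own:

```r
y <- c(1,10,12,3,90,8)
above <- y > mean(y)   # TRUE only for values greater than the mean

y[above]               # 90
y[!above]              # 1 10 12 3 8
mean(y[!above])        # 6.8: the mean once the largest value is excluded
```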

### If and else statements

There are times when you will want your code to perform a particular task but only when a specific condition is TRUE, or if that condition is not TRUE to run a secondary function. For this you will use what are known as if and else statements. In simple terms you can think of these as the following:

```
if(conditional statement){
command(s) to perform if conditional statement is TRUE
} else{
command(s) to perform if conditional statement is FALSE
}
```

Conditional statements are commonly either in the form of a logical statement such as in the previous examples or can be associated with one of the arguments of a function. When using comparative statements with an **if** command – such as “is x equal to y?” – you must use two equals signs (==) (Table 1). A single equals sign is an assignment operator in R (x = y would assign the value of y to the variable x) and R will report an error if it is used inside the condition of an **if** statement.
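As a minimal sketch of the “is x equal to y?” question, using two values and a variable called **result** that I have invented for illustration:

```r
x <- 5
y <- 5

if(x == y){
  result <- "x is equal to y"
} else {
  result <- "x is not equal to y"
}

result
```

Changing either value and re-running the code will switch the message stored in **result**.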

In order to demonstrate this we will use the dataset I provided in the previous article, called asaphidae that contains a series of measurements recorded for twenty-one trilobite genera in a matrix format, in which each column represents an individual genus. Details of where to find and how to set a working directory and load in this file are available in the previous article.

File Name | Size | Download |
---|---|---|
asaphidae.txt | 198kb | R for Palaeontologists - asaphidae.txt |

`asaphidae <- read.table("asaphidae.txt", header=TRUE)`

If we wanted to separate the genera here into two categories according to their mean size, large and small genera, using an arbitrary value of 30mm to differentiate them, one way to do this would be to use **if** and **else** statements. In English we would want to ask the following: “IF the mean size of a genus is greater than OR equal to 30mm save the genus name in an array ELSE save the genus name in a second array”. So for this we first need to set up two empty arrays in which we will save the genus names that match our criteria:

```
large.species <- array(dim=0)
small.species <- array(dim=0)
```

Second we will use a **for** loop, using the counter **t** to represent each column of data, to examine each genus in turn:

```
for(t in 1:length(asaphidae[1,])){
  if(mean(asaphidae[,t],na.rm=TRUE) >= 30){
    large.species <- c(large.species,colnames(asaphidae)[t])
  } else {
    small.species <- c(small.species,colnames(asaphidae)[t])
  }
}
```

If you now run both **large.species** and **small.species** R will provide lists of the genera that have a mean size of at least 30mm and of less than 30mm respectively.

Taking this further, you are not limited to using **if** and **else** once but can use them together to ask a second (or more) conditional statement(s). Using the previous example, if we wanted to add in a third category to represent the medium sized species that ranged in size between 10mm and 30mm we could amend our code in the following way:

```
large.species <- array(dim=0)
medium.species <- array(dim=0)
small.species <- array(dim=0)
for(t in 1:length(asaphidae[1,])){
  if(mean(asaphidae[,t],na.rm=TRUE) >= 30){
    large.species <- c(large.species,colnames(asaphidae)[t])
  } else if(mean(asaphidae[,t],na.rm=TRUE) >= 10 && mean(asaphidae[,t],na.rm=TRUE) < 30){
    # medium: at least 10mm but less than 30mm (exactly 30mm already counts as large)
    medium.species <- c(medium.species,colnames(asaphidae)[t])
  } else {
    small.species <- c(small.species,colnames(asaphidae)[t])
  }
}
```
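If you do not have the asaphidae file to hand, the same logic can be tried on a small invented dataset. The sketch below is an assumption-laden illustration rather than part of the original analysis: the data frame **toy** and its genus names are hypothetical, and I have stored the mean in a variable called **genus.mean** so that it is calculated only once per genus. Because the first test has already dealt with anything of 30mm or more, the second test in this version only needs to check the lower limit.

```r
# Hypothetical measurements (mm) for three invented genera; NA marks a missing value
toy <- data.frame(GenusA = c(35, 42, NA),
                  GenusB = c(12, 18, 21),
                  GenusC = c(4, NA, 6))

large.species <- array(dim=0)
medium.species <- array(dim=0)
small.species <- array(dim=0)

for(t in 1:length(toy[1,])){
  genus.mean <- mean(toy[,t], na.rm=TRUE)   # calculate the mean once per genus
  if(genus.mean >= 30){
    large.species <- c(large.species, colnames(toy)[t])
  } else if(genus.mean >= 10){
    medium.species <- c(medium.species, colnames(toy)[t])
  } else {
    small.species <- c(small.species, colnames(toy)[t])
  }
}

large.species    # "GenusA" (mean 38.5)
medium.species   # "GenusB" (mean 17)
small.species    # "GenusC" (mean 5)
```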

### Writing your own functions

Once you understand how to implement the basic programming techniques discussed here and in the previous article you will be ready to input your own data and write functions to analyse these datasets. Writing your own functions has two main advantages: first, it allows you to run identical analyses on multiple datasets without having to change any variable names, and second, it provides a much easier way for other researchers to use your code.

Using the same data as before let’s say that we are interested in knowing what the mean value of each genus is and want to store the resulting values as an array. A simple way to do this would be to create an array, here called **asaphidae.means**, with a length equal to the number of genera in the matrix, using **length(asaphidae[1,])** (*i.e.* the number of columns in the matrix).

```
asaphidae.means <- array(dim=length(asaphidae[1,]))
names(asaphidae.means) <- colnames(asaphidae)
```

Next we could systematically calculate the mean for each of the 21 genera, assigning the value to **asaphidae.means** individually as in the following:

```
asaphidae.means[1] <- mean(asaphidae[,1],na.rm=TRUE)
asaphidae.means[2] <- mean(asaphidae[,2],na.rm=TRUE)
asaphidae.means[3] <- mean(asaphidae[,3],na.rm=TRUE)
# etc…
```

As you can see, for 21 genera this would be an extremely cumbersome approach to solve this issue. However, we now know that if we wish to perform the same operation (e.g. calculating the mean) multiple times we can simply loop through each column in turn by using a counter (here called m) instead of a specific column number:

```
asaphidae.means <- array(dim=length(asaphidae[1,]))
names(asaphidae.means) <- colnames(asaphidae)
for(m in 1:length(asaphidae[1,])) {
asaphidae.means[m] <- mean(asaphidae[,m],na.rm=TRUE)
}
```

Another approach to this is to use the names of the genera to loop through rather than a list of values. So, in the first instance, rather than R seeing the number 1 as the first column it is looking for a column called “Isotelus”:

```
for(m in colnames(asaphidae)){
asaphidae.means[m] <- mean(asaphidae[,m],na.rm=TRUE)
}
```

You may have noticed that in all the previous uses of the **mean** function the argument **na.rm=TRUE** is included; this is used to tell R to exclude any cells containing the value NA (which stands for “not available”, i.e. a missing value) when calculating the mean. With this argument set to FALSE (the default) R cannot calculate the mean of a column that contains missing values; try running **mean(asaphidae[,1])** and you will see that R returns:

`[1] NA`
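The same behaviour can be seen with a small invented vector. In this sketch the values in **sizes** are hypothetical, and **is.na** is a base R function (not otherwise used in this article) that flags missing values:

```r
sizes <- c(12, NA, 18)      # hypothetical measurements with one missing value

mean(sizes)                 # NA: the missing value prevents the calculation
mean(sizes, na.rm=TRUE)     # 15: the mean of the two remaining values
sum(is.na(sizes))           # 1: the number of missing values
```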

So now you have a piece of code that works well in calculating the mean species values for one particular dataset. What if we wanted to use this regularly on a wide range of taxonomic groups? There are several options; we could assign any dataset we wished to use to **asaphidae**, or change any mentions of **asaphidae** in the code to the name of the other dataset. However, the best option is to create a function using this code. In order to do this, all the code is assigned a name using **function**. The rest of the code follows the same layout as for all other functions in R. Firstly, any arguments (or options) you wish to have within the function are included in the brackets after **function**. Secondly, the function **return** is used at the end of the code to specify the object you want to be returned to the user once the program is completed. Note that a function returns a single object; if you want to return multiple variables they must be combined together as one object such as a list using **list(variable1,variable2)**. Using the previous code for calculating mean species size as the basis of the new function **genus.means**, the array containing the resultant mean values, called **means**, is returned to the user.

```
genus.means <- function(dataset){
  means <- array(dim=length(dataset[1,]))
  names(means) <- colnames(dataset)
  for(m in 1:length(dataset[1,])){
    means[m] <- mean(dataset[,m],na.rm=TRUE)
  }
  return(means)
}
```

This can now be used in the same way as any other function in R. The dataset we wish to analyse, **asaphidae**, is assigned using the argument **dataset**:

`asaphidae.means.new <- genus.means(dataset = asaphidae)`

The results of this function can now be viewed by typing:

`asaphidae.means.new`
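Returning more than one result works in much the same way, as long as the results are combined into a single list. The following is a hedged sketch only: the function **genus.range** and the data frame **toy** are hypothetical names of my own, and this is not the **genus.stats** function provided with this article.

```r
# Hypothetical function returning two results (minimum and maximum) as one list
genus.range <- function(dataset){
  n <- length(dataset[1,])              # number of columns (genera)
  minimum <- array(dim=n)
  maximum <- array(dim=n)
  names(minimum) <- colnames(dataset)
  names(maximum) <- colnames(dataset)
  for(m in 1:n){
    minimum[m] <- min(dataset[,m], na.rm=TRUE)
    maximum[m] <- max(dataset[,m], na.rm=TRUE)
  }
  return(list(minimum=minimum, maximum=maximum))
}

# Invented measurements (mm) for two hypothetical genera
toy <- data.frame(GenusA = c(35, 42, NA), GenusB = c(12, 18, 21))
ranges <- genus.range(dataset = toy)

ranges$minimum             # the minimum values for all genera
ranges$maximum["GenusB"]   # the maximum value for one genus: 21
```

Naming the elements of the list, as done here, allows each result to be extracted with the $ symbol.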

As it happens there is a similar function for calculating the mean values for all columns of a matrix, called **colMeans** (the alternative for calculating row means is called **rowMeans**). You can compare the results of your new function with **colMeans** by using the following code:

`colMeans(asaphidae,na.rm=TRUE)`
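Rather than comparing the two sets of values by eye, you could ask R to do it for you. The sketch below repeats the **genus.means** function so that it can be run on its own, and uses an invented data frame called **toy** in place of the real dataset; **all.equal** and **as.vector** are base R functions not otherwise covered here.

```r
genus.means <- function(dataset){
  means <- array(dim=length(dataset[1,]))
  names(means) <- colnames(dataset)
  for(m in 1:length(dataset[1,])){
    means[m] <- mean(dataset[,m], na.rm=TRUE)
  }
  return(means)
}

# Invented measurements (mm) for two hypothetical genera
toy <- data.frame(GenusA = c(35, 42, NA), GenusB = c(12, 18, 21))

loop.result <- genus.means(dataset = toy)
builtin.result <- colMeans(toy, na.rm=TRUE)

# as.vector() strips the names so that only the values are compared
all.equal(as.vector(loop.result), as.vector(builtin.result))
```

If the two approaches agree, the final line returns TRUE.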

All the example code used here, which has been commented in detail, is available to download from the PalAss website. In addition I have provided another version of the **genus.means** function (called **genus.stats**) that calculates several additional statistics on each genus (minimum, maximum and median values).

File Name | Size | Download |
---|---|---|
code_part_2.R | 6kb | R for Palaeontologists - code_part_2.R |

### Summary

All programmers have their own coding philosophy and will find different and unique ways to solve problems, so it is important to state that there is no single correct approach to statistical programming and the examples here are just one of the possible solutions. However, the basic techniques introduced here are universally applied amongst programming languages, and with these you will be well on your way to understanding what pre-existing functions are doing as well as developing code for your own specific needs.

## Further Reading

- CRAWLEY, M. J., 2005. Statistics: an introduction using R. John Wiley and Sons. 342pp.
- FIELD, A., MILES, J. and FIELD, Z., 2012. Discovering statistics using R. SAGE Publications Ltd. 992pp.
- PARADIS, E., 2002. R for Beginners.