The programming structures that we will examine include control flow statements and user-defined functions.
In previous lessons we computed summary or descriptive statistics for single variables. In this lesson we will examine the interrelationships between variables. Interrelationship analysis comes in different forms, and here we will examine covariance, correlation, and linear regression techniques.
You have probably discovered that as we write more complex R scripts, there are scenarios that require decision making. R provides many flexible options to control the flow of execution using conditional and iteration expressions.
We also mentioned that R is a functional and an object-oriented programming language. In this lesson we will examine how we define and use functions during our analysis.
We will learn how to create different types of R functions.
We have learned how to examine data trends over a single variable using a histogram or over 2 variables using scatterplots. In the final part of this lesson we will learn how to analyze the relationship between 2 variables through learning about covariance, correlation, and linear regression.
Let's examine decision making in R.
In this example we generate one random number between 1 and 100 using runif(), the random uniform distribution function. This function returns a floating-point value, and since we would like to deal only with integer values, we use the as.integer() function to discard the decimal portion of the number.
To enable reproducible results we have decided in line 2 to set a static seed value to R's pseudo-random number generator prior to the request for a number.
If you would like a different result each time this script is executed, you would use a different seed value each time.
The purpose of the script is to print a statement that the random number is either odd or even.
We state the conditional expression inside parentheses following the if keyword. Here we are checking whether the result of applying the modulo operator (%%) with a divisor of 2 to the number is zero (0) or not. If there is no remainder, then we know that the number is even and R will execute the code defined within the curly brackets {}.
If the first conditional expression evaluates to FALSE, then the next conditional expression will be tested or, as in this case, the final else clause will be executed.
Note the location of the brackets and indentation shown in this example. This example is consistent with various R coding style guidelines.
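The script described above can be sketched as follows. This is a minimal reconstruction from the narration, not the original slide code: the seed value 42 and the variable names num and msg are assumptions chosen for illustration.

```r
# Fix the seed so the same "random" number is produced on every run.
# The value 42 is an arbitrary assumption; any fixed value works.
set.seed(42)

# runif() returns a floating-point value; as.integer() discards the decimals.
num <- as.integer(runif(1, min = 1, max = 100))

# The remainder of division by 2 is zero only for even numbers.
if (num %% 2 == 0) {
  msg <- sprintf("%d is even", num)
} else {
  msg <- sprintf("%d is odd", num)
}
print(msg)
```

Note that the else keyword sits on the same line as the closing bracket of the if block, which is the placement the common R style guides recommend.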
Throughout this lesson we will be using simulated data.
R has many methods of creating data that can be used for analysis. Two of the most common types of distributions are uniform distributions and normal distributions.
The functions runif and rnorm can be used to generate a vector of values with the corresponding distribution properties.
With normal distributions you provide a mean and a standard deviation. For example, if you wanted to create a data set of 100 possible test scores with a normal distribution around a mean mark of 75 and a standard deviation (a measure of spread) of 2, you could use the example as shown. To visualize and verify your data distribution you could use the plot or hist functions.
As with most programming languages, R provides program structures to control iteration or looping behaviour. We will examine each of these iteration operations in this lesson.
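As a rough sketch of the simulation just described, the code below draws 100 scores with rnorm and, for comparison, 100 values with runif. The seed value, the variable names scores and uniform_vals, and the 0 to 100 range for the uniform sample are assumptions made for illustration.

```r
# Assumed seed so the simulated data can be reproduced.
set.seed(123)

# 100 normally distributed test scores: mean 75, standard deviation 2.
scores <- rnorm(100, mean = 75, sd = 2)

# 100 uniformly distributed values between 0 and 100, for comparison.
uniform_vals <- runif(100, min = 0, max = 100)

# Check the sample against the requested parameters.
print(mean(scores))
print(sd(scores))

# A histogram of the scores should look roughly bell-shaped around 75.
hist(scores, main = "Simulated test scores", xlab = "Score")
```

The sample mean and standard deviation will be close to, but not exactly, 75 and 2, because they are estimated from a finite random sample.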
The repeat statement can be used to define a block of statements that will continue to iterate indefinitely.
The break statement is used to exit the loop. In this example we check whether the value stored in the variable x is greater than 9. If it is not, we print out the current value of x, increment the value by one, and the loop repeats. When the value stored in the variable x is greater than 9, we use the break statement to leave the loop and continue in the script.
A conditional looping structure, such as the while statement, defines the initial or precondition within parentheses.
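A minimal sketch of such a repeat loop is shown here. The starting value of 1 for x is an assumption, since the narration does not state where x begins.

```r
# Starting value is assumed for illustration.
x <- 1

repeat {
  # Without this check the loop would never terminate.
  if (x > 9) {
    break
  }
  print(x)
  x <- x + 1
}
```

With this starting value the loop prints the values 1 through 9, and x holds 10 once the loop has exited.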
In this scenario our entry condition to the looping code involves checking if the value stored in the variable curr is less than or equal to the number of marks in our integer vector called marks.
Recall that in R, code blocks are usually defined using curly brackets, or braces, {}. It is possible to omit the braces if the expression is a single line, and another option is to combine multiple expressions on a single line using the semicolon (;). In this example we are actually using both of these techniques. This may seem confusing, so style guidelines should be considered if you have multiple developers working on a single project.
On line 3, we initialized the value of the variables p and curr on a single line.
On line 6, we use an if statement to check whether a value in the marks vector is greater than or equal to 50. If it is, then we increment the value in the p variable by one.
Because the if statement on line 6 is a complete expression on its own, the code on line 7 is not part of it and therefore executes on every iteration.
Once the looping condition is no longer true the msg will be generated using the sprintf() function and then sent to the standard output display using the cat() function.
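The loop walked through above might look like the sketch below. The contents of the marks vector and the wording of the message are assumptions, and the physical line numbers here do not match the ones referenced in the narration.

```r
# Hypothetical integer marks; the original data is not shown in this lesson.
marks <- c(45L, 67L, 82L, 38L, 50L, 91L)

# Two assignments combined on one line with a semicolon.
p <- 0; curr <- 1

while (curr <= length(marks)) {
  # Single-expression if statement written without braces.
  if (marks[curr] >= 50) p <- p + 1
  # Not part of the if statement, so this runs on every iteration.
  curr <- curr + 1
}

# Build the message, then send it to standard output.
msg <- sprintf("%d of %d marks are 50 or more\n", p, length(marks))
cat(msg)
```

Forgetting to increment curr would leave the entry condition true forever, so the increment must sit outside the if statement.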
When the iteration scenario has a well-defined number of iterations, a for statement can be used in R.
The first example uses a sequence function to create a temporary vector of values. The value of i will begin with a value of 5 and then the value will be incremented by 5 until it reaches 15 for the final iteration.
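A short sketch of this first form follows, assuming the sequence is built with seq() and its by argument. The visited vector is a hypothetical helper added only to record which values the loop saw.

```r
# Hypothetical helper used to record the values the loop visits.
visited <- c()

# seq(5, 15, by = 5) creates the temporary vector 5, 10, 15.
for (i in seq(5, 15, by = 5)) {
  print(i)
  visited <- c(visited, i)
}
```

The loop body runs three times, once for each element of the temporary vector.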
The second example iterates over an existing vector of integers, and since the index value of the marks vector is not required, we can use a simplified version of the for statement. As we iterate through the values, the counter p will be incremented when the value is 50 or more. This style of for loop can be used to iterate over any vector data structure, and it will always examine each element of the vector from the first position to the final position.
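The second form can be sketched in the same way. The marks values and the loop variable name m are assumptions for illustration.

```r
# Hypothetical integer marks; the original data is not shown in this lesson.
marks <- c(45L, 67L, 82L, 38L, 50L, 91L)
p <- 0

# m takes each element of marks in turn, first to last; no index is needed.
for (m in marks) {
  if (m >= 50) {
    p <- p + 1
  }
}
print(p)
```

For this particular count, sum(marks >= 50) gives the same result without an explicit loop.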