Lesson 3 – Data Types and Structures in R

Welcome to the lesson on R data structures.
To perform any meaningful data analysis we need to collect our data into R data structures.
In this lesson we will explore the most frequently used data types and data structures.
R can be used to analyze many different forms of data. We will explore the built-in data types of R.
Data analysis usually requires an examination of large sets of similar data.
In this lesson we will explore various data structures we can use to hold and manipulate our datasets.
R can handle many different types of data. Here we examine the commonly used numeric, logical, and character data types. Numeric data is usually classified as either integer or real numbers.
By default, R will store numeric data as a real, otherwise known as a double-precision floating-point value.
If we want our data to be stored as an integer value we use the as.integer() function.
Logical or boolean data is any data that represents a true or false condition. TRUE or FALSE values may exist within a dataset or, as in this example, be returned by a function.
Character or string data is often provided within datasets. Character data supplied as a constant can be enclosed within single quotes or double quotes. If the character data represents a category then you might want R to treat it as a factor instead of a character string.
We will examine the use of factors in future lessons. Here we have decided that a letter grade should be treated as a factor and not as a character string.
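The data types described above can be sketched as follows. This is only an illustration: the variable names and values are assumptions and do not come from the original lesson slides.

```r
# Numeric data is stored as a double by default
x <- 42
class(x)         # "numeric"
typeof(x)        # "double"

# Convert to an integer explicitly
y <- as.integer(x)
class(y)         # "integer"

# A function that returns a logical value
is.numeric(x)    # TRUE

# Character data: single or double quotes both work
first <- 'Mary'
last <- "Smith"
class(first)     # "character"

# Treat a letter grade as a factor rather than a character string
grade <- factor("B", levels = c("A", "B", "C", "D", "F"))
class(grade)     # "factor"
```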
A vector is used to maintain an ordered collection of values of the same data type.
Any R data type can be used for a vector.
The elements within the vector can be accessed using indexes. The indexing starts with the value 1.
Here we have an example of a character vector of three names. Vectors can also be extended with new elements if required. The elements in a vector can also be modified.
The most common technique for creating a vector is to use the c() or combine function.
This function will combine its arguments to form a new vector.
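A small sketch of creating, extending, and modifying a character vector with c(). The names other than "Mary" are made up for illustration.

```r
# Create a character vector of three names with c()
students <- c("Mary", "John", "Ann")
students[1]             # "Mary" (indexing starts at 1)

# Extend the vector by assigning to the next index
students[4] <- "Tom"

# Modify an existing element
students[2] <- "Jon"

students                # "Mary" "Jon" "Ann" "Tom"
```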
There are many methods of accessing items within a vector.
In this first example we are using a single integer as the indexing method to retrieve the first element of the ages vector. The square brackets are used to access the elements from the data structures.
Negative integer values can be used to exclude elements from the selection.
A range of values, or another vector of indexes, can be specified within the square brackets to select specific items from the data structure.
Another very useful function that works with many different data structures is the length() function.
If you pass a vector to the length() function it will return the number of elements or items in the vector.
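The indexing methods above might look like this in practice. The contents of the ages vector are assumed here; the lesson only tells us that it includes the values 10 and 42.

```r
ages <- c(10, 25, 42, 17)   # hypothetical values

ages[1]          # 10        single positive index
ages[-1]         # 25 42 17  negative index excludes the first element
ages[2:3]        # 25 42     a range of indexes
ages[c(1, 3)]    # 10 42     a vector of indexes
length(ages)     # 4         number of elements
```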
Vectors can be created using any of the supported R data types.
Here we have decided to create a numeric vector called ages.
Instead of using numeric indexing we use a logical vector to access elements of the vector.
We create a logical vector with the same number of elements and call it s.ages.
When we access the elements of the ages vector using a logical vector, only the values from the ages vector with corresponding TRUE values in the logical vector are returned.
Our resulting vector 'm' contains the values 10 and 42 only.
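A sketch of the logical indexing described above. The names ages, s.ages, and m come from the lesson; the values other than 10 and 42 are assumed.

```r
ages <- c(10, 25, 42, 17)                 # 25 and 17 are hypothetical
s.ages <- c(TRUE, FALSE, TRUE, FALSE)     # one logical value per element

# Only elements with a corresponding TRUE are returned
m <- ages[s.ages]
m                                         # 10 42
```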
We will soon discover that conditional access to data structures is a very powerful feature of R.
Let's look at how we can dynamically access elements of a vector based on conditions.
We have a simple numeric vector called nums. We would like to extract the even numbers from the vector and store them in a new vector.
R can handle this task using conditional access.
To make this happen we will use the modulus and equality operators to help us determine whether each element is even or odd.
R evaluates the conditional expression for every element of the vector, producing a logical vector that is then used within the square brackets.
Each element for which the expression evaluates to TRUE is copied into the new vector called evens.
Finally, we examine the contents of the new vector to verify that the proper values have been copied.
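The conditional selection could be written as below. The contents of nums are an assumption; the operators %% (modulus) and == (equality) are the ones the lesson refers to.

```r
nums <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # hypothetical values

# The condition yields one logical value per element
is.even <- nums %% 2 == 0

# Use the condition directly inside the square brackets
evens <- nums[nums %% 2 == 0]
evens                                      # 2 4 6 8 10
```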
R provides a few different techniques to generate a sequence of values.
Here we use the seq() or sequence function. The function str() is used to display the structure of an object. In this case we notice that the initial structure is a vector of numbers.
We use the type conversion function called as.integer() to change the default double values into integers. There are various other type conversion functions available in R.
Another method of generating a simple sequence is to use the colon operator (:).
Here we are generating a vector of integers from 0 through 9.
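A brief sketch of both ways to generate a sequence. The arguments passed to seq() are assumptions chosen for illustration.

```r
# seq() with a step of 2 returns doubles
s <- seq(1, 10, by = 2)
str(s)               # reports a num vector: 1 3 5 7 9

# Convert the doubles to integers
s.int <- as.integer(s)
str(s.int)           # reports an int vector

# The colon operator generates integers from 0 through 9
d <- 0:9
length(d)            # 10
```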
R is a functional language with built-in support for vector operations. Whenever possible it is best to use vector operations instead of iterating with loops.
R uses very efficient native libraries for many vector operations, so your R script will execute much faster than it would if each operation were performed using loops.
For example, let's take a vector of hourly salaries.
We would like to increase all of the hourly salaries by 10%.
To accomplish this task we will multiply the existing salary by 1.10 to obtain the new salary.
The multiplication operator or asterisk (*) can be used.
In R, the constant value of 1.10 is multiplied with each element of the vector.
This is an example of R's vector recycling rule.
Note that since we did not store the new salaries the original values remain unchanged.
To replace the existing values we must use the assignment operator.
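The salary increase can be sketched like this. The salary figures are made up for illustration.

```r
salaries <- c(20, 25, 30)      # hypothetical hourly salaries

# The constant 1.10 is recycled across every element
salaries * 1.10                # 22.0 27.5 33.0

# The original vector is unchanged because the result was not stored
salaries                       # 20 25 30

# Assign the result to replace the existing values
salaries <- salaries * 1.10
```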
As we saw in the previous example, when R vector operations are performed using vectors of different sizes the smaller vector will be recycled or reused as the operation is completed.
In this example we are performing vector addition, but temps has 4 elements, and it is being summed with a 2 element vector called n.temps.
Following the operation we notice that the elements 3 and 4 are applied a second time for the last 2 elements of the vector temps.
This may or may not be the result that you expected, so be careful when using these built-in functions with vectors.
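A minimal sketch of recycling with vectors of different lengths. The values in temps are assumed; n.temps holds the 3 and 4 mentioned above.

```r
temps <- c(20, 22, 24, 26)     # hypothetical temperatures
n.temps <- c(3, 4)

# n.temps is reused: 20+3, 22+4, 24+3, 26+4
temps + n.temps                # 23 26 27 30
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning.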
When you are working with large datasets it is common to have some missing data values.
R has a special value that is used to represent missing data.
Missing data can be represented with the value NA, which simply means "Not Available". By default, R propagates NA values; in this example mean() returns NA because the data contains a missing value.
If you would like to ignore the missing data you can pass the optional argument of na.rm=TRUE.
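For illustration, here is how NA affects mean(). The marks vector is a hypothetical example, not data from the lesson.

```r
marks <- c(80, NA, 67, 85)     # hypothetical data with one missing value

mean(marks)                    # NA
mean(marks, na.rm = TRUE)      # 77.33333

# is.na() reports which elements are missing
is.na(marks)                   # FALSE TRUE FALSE FALSE
```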
There are other special values in R, but we will not discuss them in this lesson.
A matrix is simply a vector with 2 dimensions. We still have the restriction that all of the elements must be of the same data type.
In this example we create a matrix using the matrix() function to represent the marks for 2 different students. The dimensions of the matrix should be provided when it is created.
Here we would like to create a 2 by 3 matrix as we have 2 students and 3 different attributes.
The first column is used as a unique student identifier, and the remaining 2 elements in each row represent the results of 2 different tests.
When we use the matrix() function the data is provided as a single dimension vector, but we then also describe the number of rows and columns that R should use to represent the data.
By default the data in the vector fills the data structure by columns.
In our example here we have test scores of 80 and 67 for student 1 and scores of 85 and 56 for student 2.
Data within the matrix can be accessed as a single element or a range of elements by giving the row and column indexes, separated by a comma, within the square brackets.
Here we are calculating the mean or average of the first test for all students.
It is important to remember that the first index value used represents the row in the matrix and the second value represents the column index.
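The matrix described above could be built as follows. The name scores and the test values come from the lesson; the student identifiers 1 and 2 are assumed.

```r
# Data is supplied as one vector and fills the matrix column by column
scores <- matrix(c(1, 2, 80, 85, 67, 56), nrow = 2, ncol = 3)
scores
#      [,1] [,2] [,3]
# [1,]    1   80   67
# [2,]    2   85   56

scores[1, 2]         # 80      row 1, column 2
scores[1, 2:3]       # 80 67   both tests for student 1
mean(scores[, 2])    # 82.5    mean of the first test for all students
```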
We previously mentioned that the data stored in an R matrix must be of the same data type, but the rows and columns can be given names instead of simply using numeric index values.
In this example, we want to clarify that the columns really represent the student id, test1, and test2 elements.
We accomplish this using the colnames() function. This function accepts a vector with the terms to be associated with each column in the matrix.
Now, we can perform the same computation across the scores matrix using the column name reference of "test1". Using column names provides additional flexibility as you no longer need to worry about the index value changing over time.
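A sketch of naming the columns. The label "test1" comes from the lesson; "id" and "test2" are assumed spellings of the other two labels.

```r
scores <- matrix(c(1, 2, 80, 85, 67, 56), nrow = 2, ncol = 3)

# Attach a name to each column
colnames(scores) <- c("id", "test1", "test2")

# Same computation as before, now by column name
mean(scores[, "test1"])        # 82.5
```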
A list is an ordered collection of objects. Unlike vectors, the objects can be of mixed data types and they can also be of different lengths.
In this example, we start with 2 independent vectors. A character vector of student names and a numeric vector of ages.
A list structure called classroom is created using the list() function. The list consists of copies of the students and ages vectors.
A single set of square brackets [] is used to retrieve a copy of the data contained in a list.
With single brackets, the data is always returned in the form of a list data structure.
The student names can be retrieved using either single or double square brackets; double brackets [[ ]] return the component itself (here, the character vector) rather than a list.
If we want to modify a value stored in the classroom list we can use the double bracket indexing method.
Here we are replacing the first student, "Mary", with a new student, "Eva", in the classroom list structure.
The vector called students that was used to create the initial classroom data structure is left unchanged.
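The list operations above might be sketched as follows. The names classroom, students, and ages come from the lesson; the values other than "Mary" and "Eva" are assumed.

```r
students <- c("Mary", "John", "Ann")   # "John" and "Ann" are hypothetical
ages <- c(10, 25, 42)                  # hypothetical ages

# The list holds copies of both vectors
classroom <- list(students, ages)

class(classroom[1])      # "list"       single brackets return a list
class(classroom[[1]])    # "character"  double brackets return the vector

# Replace the first student inside the list
classroom[[1]][1] <- "Eva"
classroom[[1]]           # "Eva" "John" "Ann"

# The original vector is unchanged
students                 # "Mary" "John" "Ann"
```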
Just like how we were able to give columns a name with matrix data structures, we can name each component of our list.
Here we decided to label or name the list components of our classroom using the terms "students" and "ages".
Now we can reference the list elements using the dollar sign ($) followed by the component name, or by placing the name of the list component within double square brackets.
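A short sketch of naming list components and accessing them by name. The component names "students" and "ages" come from the lesson; the values are assumed.

```r
classroom <- list(c("Eva", "John", "Ann"), c(10, 25, 42))   # hypothetical values

# Label each component of the list
names(classroom) <- c("students", "ages")

classroom$students       # "Eva" "John" "Ann"
classroom[["ages"]]      # 10 25 42
```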
Since lists allow us the ability to group different types of data they are used frequently in data analysis tasks.
Data frames are useful data structures to represent tabular data.
Like lists, a data frame can consist of different types of data.
Unlike lists, data frames have a defined size, that is, a number of rows and columns, and every column has the same number of rows.
Here we have created a data frame called df1 using the data.frame() function.
The data frame represents our three students and their marks on a single test.
The cbind() or column bind function can be used to append another vector as a new column to our data frame.
In this case the new vector represents a set of ages of the 3 students.
If a new row is to be appended to the data frame the rbind() or row bind function can be used.
In this example we are appending a new student to our data frame.
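A sketch of building and growing the data frame. The name df1 comes from the lesson; the column names and all values are assumptions made for illustration.

```r
# Three students and their marks on a single test (hypothetical values)
df1 <- data.frame(name = c("Eva", "John", "Ann"),
                  mark = c(80, 85, 67),
                  stringsAsFactors = FALSE)

# Append a vector of ages as a new column
ages <- c(10, 25, 42)
df1 <- cbind(df1, age = ages)

# Append a new student as a new row; the column names must match
new.student <- data.frame(name = "Tom", mark = 90, age = 17,
                          stringsAsFactors = FALSE)
df1 <- rbind(df1, new.student)

dim(df1)                 # 4 3
```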

Sep 4 '14 #1

zmbd

nbiswas:
Please provide proper citations for these articles.
-z

Sep 29 '14 #2
