Lesson 3 – Data Types and Structures in R

Welcome to the lesson on R data structures.
To perform any meaningful data analysis we need to collect our data into R data structures.
In this lesson we will explore the most frequently used data types and data structures.
R can be used to analyze many different forms of data. We will explore the built-in data types of R.
Data analysis usually requires an examination of large sets of similar data.
In this lesson we will explore various data structures we can use to hold and manipulate our datasets.
R can handle many different types of data. Here we examine the commonly used numeric, logical, and character data types. Numeric data is usually classified as either integer or real numbers.
By default, R will store numeric data as a real number, otherwise known as a double precision floating point value.
If we want our data to be stored as an integer value we use the as.integer() function.
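As a minimal sketch of the default numeric storage and the conversion described above (the variable names x and y are only illustrative):

```r
# A numeric constant is stored as a double unless we convert it
x <- 42
class(x)       # "numeric"
typeof(x)      # "double"
is.integer(x)  # FALSE

# as.integer() converts the value to the integer type
y <- as.integer(x)
class(y)       # "integer"
typeof(y)      # "integer"
```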
Logical or boolean data is any data that represents a true or false condition. True or false values may exist within a dataset or, as in this example, may be returned by a function.
Character or string data is often provided within datasets. Character data supplied as a constant can be enclosed within single quotes or double quotes. If the character data represents a category then you might want R to treat it as a factor instead of a character string.
We will examine the use of factors in future lessons. Here we have decided that a letter grade should be treated as a factor and not as a character string.
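The three ideas above might look like this in practice. The values and names (passed, first, last, grades) are assumptions chosen for illustration, not data from the lesson:

```r
# A function that returns a logical value
is.numeric(3.14)      # TRUE

# A comparison also produces a logical value
passed <- 72 >= 50
passed                # TRUE

# Character constants can use single or double quotes
first <- 'Mary'
last <- "Smith"
class(first)          # "character"

# Letter grades treated as a factor instead of character strings
grades <- factor(c("A", "B", "A", "C"))
class(grades)         # "factor"
levels(grades)        # "A" "B" "C"
```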
A vector is used to maintain an ordered collection of values of the same data type.
Any R data type can be used for a vector.
The elements within the vector can be accessed using indexes. The indexing starts with the value 1.
Here we have an example of a character vector of three names. Vectors can also be extended with new elements if required. The elements in a vector can also be modified.
The most common technique for creating a vector is to use the c(), or combine, function.
This function will combine its arguments to form a new vector.
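A short sketch of creating, indexing, extending, and modifying a character vector with c(). The three names are placeholders assumed for this example:

```r
# Create a character vector of three names
students <- c("Mary", "John", "Anna")

# Indexing starts at 1
students[1]             # "Mary"

# Extend the vector by assigning to a new position
students[4] <- "Paul"

# Modify an existing element
students[2] <- "Jon"

students                # "Mary" "Jon" "Anna" "Paul"
```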
There are many methods of accessing items within a vector.
In this first example we are using a single integer as the indexing method to retrieve the first element of the ages vector. Square brackets are used to access the elements of a data structure.
Negative integer values can be used to exclude elements from the selection.
A range of values, or another vector of indexes, can be specified within the square brackets to select specific items from the data structure.
Another very useful function that works with many different data structures is the length() function.
If you pass a vector to the length() function it will return the number of elements or items in the vector.
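The indexing methods just described can be tried with a small vector. The contents of ages here are assumed values for illustration:

```r
ages <- c(10, 25, 42, 31)

ages[1]         # single index: 10
ages[-1]        # negative index excludes the first element: 25 42 31
ages[2:3]       # a range of indexes: 25 42
ages[c(1, 4)]   # a vector of indexes: 10 31

length(ages)    # number of elements: 4
```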
Vectors can be created using any of the supported R data types.
Here we have decided to create a vector called ages.
Instead of using numeric indexing we use a logical vector to access elements of the vector.
We create a logical vector with the same number of elements and call it s.ages.
When we access the elements of the ages vector using a logical vector, only the values from the ages vector with corresponding TRUE values in the logical vector are returned.
Our resulting vector 'm' contains the values 10 and 42 only.
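One way this logical indexing could look is sketched below. Only the result (10 and 42) comes from the lesson; the other two ages and the pattern of TRUE and FALSE values are assumed:

```r
ages <- c(10, 25, 42, 31)

# A logical vector with the same number of elements as ages
s.ages <- c(TRUE, FALSE, TRUE, FALSE)

# Only the positions marked TRUE are returned
m <- ages[s.ages]
m   # 10 42
```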
We will soon discover that the conditional access methods for analyzing data structures are a very powerful feature of R.
Let's look at how we can dynamically access elements of a vector based on conditions.
We have a simple numeric vector called nums. We would like to extract the even numbers from the vector and store them in a new vector.
R can handle this task using conditional access.
To make this happen we will use the modulus and equality operators to help us determine if the element is even or odd.
R will apply the conditional expression to each element of the vector because the vector is referenced within the square brackets.
If the expression evaluates as true, a copy of the element will be appended to the new vector called evens.
Finally, we examine the contents of the new vector to verify that the proper values have been copied.
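A minimal sketch of this conditional access, assuming nums holds the integers 1 through 8:

```r
nums <- c(1, 2, 3, 4, 5, 6, 7, 8)

# nums %% 2 == 0 is evaluated for every element and yields a logical vector;
# the square brackets then keep only the elements where it is TRUE
evens <- nums[nums %% 2 == 0]

evens   # 2 4 6 8
```

The expression inside the brackets is itself a logical vector, so this is the same mechanism as indexing with a hand-written vector of TRUE and FALSE values.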
R provides a few different techniques to generate a sequence of values.
Here we use the seq() or sequence function. The function str() is used to display the structure of an object. In this case we notice that the initial structure is a vector of numbers.
We use the type conversion function called as.integer() to change the default double values into integers. There are various other type conversion functions available in R.
Another method of generating a simple sequence is to use the colon operator (:).
Here we are generating a vector of integers from 0 through 9.
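The sequence functions can be compared side by side. The arguments passed to seq() are assumed, since the lesson does not state them:

```r
# seq() with a numeric step returns double values
s <- seq(0, 20, by = 5)
str(s)        # reported as a num vector
typeof(s)     # "double"

# Convert the doubles into integers
s.int <- as.integer(s)
str(s.int)    # reported as an int vector

# The colon operator generates the integers 0 through 9
d <- 0:9
typeof(d)     # "integer"
length(d)     # 10
```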
R is a functional language with built-in support for vector operations. Whenever possible it is best to use vector operations instead of explicit loops.
R uses very efficient native libraries for many vector operations, so your R script will execute much faster than it would if each operation were performed using loops.
For example, let's take a vector of hourly salaries.
We would like to increase all of the hourly salaries by 10%.
To accomplish this task we will multiply the existing salary by 1.10 to obtain the new salary.
The multiplication operator or asterisk (*) can be used.
In R, the constant value of 1.10 is multiplied with each element of the vector.
This is an example of R's vector recycling rule.
Note that since we did not store the new salaries the original values remain unchanged.
To replace the existing values we must use the assignment operator.
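A small sketch of the salary increase, using assumed hourly rates:

```r
salaries <- c(15.00, 20.00, 25.00)

# The result is displayed but not stored
salaries * 1.10   # 16.5 22.0 27.5
salaries          # still 15 20 25

# Assign the result to replace the existing values
salaries <- salaries * 1.10
salaries          # 16.5 22.0 27.5
```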
As we saw in the previous example, when R vector operations are performed using vectors of different sizes the smaller vector will be recycled or reused as the operation is completed.
In this example we are performing vector addition, but temps has 4 elements, and it is being summed with a 2 element vector called n.temps.
Following the operation we notice that the elements 3 and 4 are applied a second time to the last 2 elements of the vector temps.
This may or may not be the result that you expected, so be careful when using these built-in functions with vectors.
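The recycling behaviour can be reproduced as follows. The values in temps are assumed; n.temps holds the elements 3 and 4 mentioned above:

```r
temps <- c(20, 22, 24, 26)
n.temps <- c(3, 4)

# n.temps is reused as 3, 4, 3, 4 to match the length of temps
result <- temps + n.temps
result   # 23 26 27 30
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.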
When you are working with large datasets it is common to have some missing data values.
R has a special value that is used to represent missing data.
Missing data can be represented with the value of NA. NA simply means "Not Available". By default, R will recognize NA values, and in this example mean() returns NA because the mean cannot be computed when there are missing values.
If you would like to ignore the missing data you can pass the optional argument of na.rm=TRUE.
There are other special values in R, but we will not discuss them in this lesson.
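A brief sketch of how missing values affect mean(), using an assumed vector of marks with one missing entry:

```r
marks <- c(80, NA, 67, 85)

# Which elements are missing?
is.na(marks)                # FALSE TRUE FALSE FALSE

# The missing value propagates into the result
mean(marks)                 # NA

# Ignore the missing value when computing the mean
mean(marks, na.rm = TRUE)   # 77.33333
```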
A matrix is simply a vector with 2 dimensions. We still have the restriction that all of the elements must be of the same data type.
In this example we create a matrix using the matrix() function to represent the marks for 2 different students. The dimensions of the matrix should be provided when it is created.
Here we would like to create a 2 by 3 matrix as we have 2 students and 3 different attributes.
The first column is used as a unique student identifier and the 2 elements across each row represent the results of 2 different tests.
When we use the matrix() function the data is provided as a single dimension vector, but we then also describe the number of rows and columns that R should use to represent the data.
By default the data in the vector fills the data structure by columns.
In our example here we have test scores of 80 and 67 for student 1 and scores of 85 and 56 for student 2.
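The matrix described above could be built like this. The student identifiers 1 and 2 are assumed, while the test scores come from the lesson. Because the data fills by columns, the vector lists the identifiers first, then the first test, then the second test:

```r
scores <- matrix(c(1, 2, 80, 85, 67, 56), nrow = 2, ncol = 3)

scores
dim(scores)    # 2 3

scores[1, ]    # student 1: 1 80 67
scores[2, ]    # student 2: 2 85 56
```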
Data within the matrix can be accessed as a single element or a range of elements by separating the row and column indexes with a comma inside the square brackets.
Here we are calculating the mean or average of the first test for all students.
It is important to remember that the first index value used represents the row in the matrix and the second value represents the column index.
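A minimal sketch of row and column indexing, rebuilding the same scores matrix (with assumed identifiers 1 and 2) so that the example stands alone:

```r
scores <- matrix(c(1, 2, 80, 85, 67, 56), nrow = 2, ncol = 3)

# Row index first, column index second
scores[1, 2]       # first test of student 1: 80

# Leaving the row index empty selects every row of column 2
scores[, 2]        # 80 85

# Mean of the first test for all students
mean(scores[, 2])  # 82.5
```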
We previously mentioned that the data stored in an R matrix must be of the same data type, but the rows and columns can be given names instead of simply using numeric index values.
In this example, we want to clarify that the columns really represent the student id, test1, and test2 elements.
We accomplish this using the colnames() function. This function accepts a vector with the terms to be associated with each column in the matrix.
Now, we can perform the same computation across the scores matrix using the column name reference of "test1". Using column names provides additional flexibility as you no longer need to worry about the index value changing over time.
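Naming the columns might look like the sketch below. The label "id" for the student identifier column is an assumed spelling; "test1" and "test2" follow the lesson:

```r
scores <- matrix(c(1, 2, 80, 85, 67, 56), nrow = 2, ncol = 3)

# Associate a name with each column
colnames(scores) <- c("id", "test1", "test2")
scores

# The same computation, now using the column name
mean(scores[, "test1"])   # 82.5
```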
A list is an ordered collection of objects. Unlike vectors, the objects can be of mixed data types and they can also be of different lengths.
In this example, we start with 2 independent vectors. A character vector of student names and a numeric vector of ages.
A list structure called classroom is created using the list() function. The list consists of copies of the students and ages vectors.
A single set of square brackets [] is used to retrieve a copy of the data contained in a list.
With single brackets, the data is always returned in the form of a list data structure.
The student names can be retrieved using either single or double square brackets; double brackets [[ ]] return the component itself (here, the character vector) rather than a list.
If we want to modify a value stored in the classroom list we can use the double bracket indexing method.
Here we are replacing the first student, "Mary" with a new student "Eva", in the classroom list structure.
The original vector called students, which was used to create the classroom data structure, is left unchanged.
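The list operations above can be sketched as follows. Apart from "Mary" and "Eva", the names and ages are assumed values:

```r
students <- c("Mary", "John", "Anna")
ages <- c(10, 42, 25)

# The list holds copies of both vectors
classroom <- list(students, ages)

# Single brackets return a list containing the component
class(classroom[1])      # "list"

# Double brackets return the component itself
class(classroom[[1]])    # "character"

# Replace the first student inside the list
classroom[[1]][1] <- "Eva"
classroom[[1]]           # "Eva" "John" "Anna"

# The original vector is left unchanged
students                 # "Mary" "John" "Anna"
```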
Just like how we were able to give columns a name with matrix data structures, we can name each component of our list.
Here we decided to label or name the list components of our classroom using the terms "students" and "ages".
Now we can reference the list elements using the dollar sign ($) symbol followed by the name of the list component, or by using the name as a character string within double square brackets.
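A short sketch of naming and referencing list components, with the list contents assumed for illustration:

```r
classroom <- list(c("Mary", "John", "Anna"), c(10, 42, 25))

# Name each component of the list
names(classroom) <- c("students", "ages")

# Reference a component with the dollar sign
classroom$students       # "Mary" "John" "Anna"

# Or with the component name inside double brackets
classroom[["ages"]]      # 10 42 25
```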
Since lists allow us the ability to group different types of data they are used frequently in data analysis tasks.
Data frames are useful data structures to represent tabular data.
Like lists, a data frame can consist of different types of data.
Unlike lists, data frames have a defined size -- or number of rows and columns.
Here we have created a data frame called df1 using the data.frame() function.
The data frame represents our three students and their marks on a single test.
The cbind() or column bind function can be used to append another vector as a new column to our data.frame.
In this case the new vector represents a set of ages of the 3 students.
If a new row is to be appended to the data frame the rbind() or row bind function can be used.
In this example we are appending a new student to our data frame.
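The data frame steps could be sketched like this. The column names (name, mark), the student names, and all of the values are assumptions. stringsAsFactors = FALSE is set explicitly so the names stay character strings on older versions of R as well:

```r
# Three students and their marks on a single test
df1 <- data.frame(name = c("Mary", "John", "Anna"),
                  mark = c(80, 85, 67),
                  stringsAsFactors = FALSE)

# Append a vector of ages as a new column
ages <- c(10, 42, 25)
df1 <- cbind(df1, ages)

# Append a new student as a new row; the column names must match
new.student <- data.frame(name = "Paul", mark = 90, ages = 31,
                          stringsAsFactors = FALSE)
df1 <- rbind(df1, new.student)

df1
dim(df1)   # 4 3
```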
Sep 4 '14 #1
Please provide proper citations for these articles.
Sep 29 '14 #2