Lesson 4 (part1) – data management and visualization

nbiswas
In this lesson we will learn how to import external data from files into R. We will also learn how to export, or write, the data to files if needed.
Once we have our data loaded into memory we can perform data filtering or querying to focus on key aspects of our data set.
We will also learn how to reorder our data within data frames and finally we will explore data visualization techniques.
R provides extensive data visualization options. In this lesson we will examine how to plot data for trend analysis and distribution analysis.
In future lessons we will explore many other types of data visualization techniques.
Let's first examine how to read data from files into R.
Typically our data is provided as a semi-structured text file. The data is considered semi-structured because each line represents one observation together with the variables, or measurements, associated with it. The measurements are separated either by a delimiter or by position.
R can read data from many other formats including spreadsheets and other statistical software products, but we will focus on the most commonly found formats for our analysis.
Let's first examine some of the options for importing data from text files with observations that use delimiters.
The read.table function will import data assuming the first row and every subsequent row contains observations. If your data includes header information in the first line then simply include the argument header=TRUE.
If you specify the data has a header row then the values will be used as column names for the new data frame in R.
By default, the read.table function will assume that the values on each line of the text file are delimited with white-space characters. If your data is delimited using other means then use the sep argument.
R will attempt to automatically determine the data types for each measurement based on the type of data encountered. It is often a good idea to examine the data file and carefully consider the data types you plan to use for your data while it resides in R for analysis. To accomplish this simply use the colClasses argument to specify a character vector of class names such as "integer" or "Date" for example.

If you do not have a header in your data set, or you wish to define column names that are different from the first row of data, use the col.names argument to specify a character vector of column names for your data frame.
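The options described so far can be sketched as follows. This is only a minimal illustration: the sample data, the semicolon delimiter and the column names are made up, and the text argument is used so that no external file is needed.

```r
# Hypothetical sample data (station names and values are invented)
raw <- "station;day;temp
A;2014-01-01;3
B;2014-01-02;5"

obs <- read.table(text = raw,
                  header = TRUE,   # first line holds the column names
                  sep = ";",       # values are separated by semicolons
                  colClasses = c("character", "Date", "integer"))
str(obs)

# No header line in the data: supply the names with col.names instead
raw2 <- "A 3\nB 5"
obs2 <- read.table(text = raw2, col.names = c("station", "temp"))
str(obs2)
```

With a real file, the first argument would be the file name instead of text.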
Often the data file is delimited using special characters such as a comma ",". These comma separated values or CSV files are often created by spreadsheet applications such as Microsoft Excel. By default, the read.csv function will import data where the elements for each row are separated with a comma character. Since the comma character is often used as the decimal mark in non-English speaking countries, the dec (decimal) option can be used to specify which character marks the decimal point.
The read.delim function can be used for tab delimited files.
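As a small sketch of these functions, the following reads the same made-up values three ways. The names and scores are hypothetical, and the second example assumes a file that uses semicolons between values and a comma as the decimal mark.

```r
# Comma-separated values with the default decimal point
csv_text <- "name,score\nAnn,1.5\nBob,2.25"
scores <- read.csv(text = csv_text)

# Semicolon-separated values that use a comma as the decimal mark
eu_text <- "name;score\nAnn;1,5\nBob;2,25"
scores_eu <- read.csv(text = eu_text, sep = ";", dec = ",")

# Tab-delimited values
tab_text <- "name\tscore\nAnn\t1.5\nBob\t2.25"
scores_tab <- read.delim(text = tab_text)

scores_eu$score   # numeric, the same values as scores$score
```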
Other useful options include the ability to skip over non-observational data at the beginning of the file and also to ignore any comment lines using a special character in the first position of a line of text.
The skip option can be used to bypass any initial information in the file and the comment.char can be used to ignore any data where the first character represents information that is not part of the data set under analysis. The colClasses option can be used to set the class names for the data being read. For example, if the first elements are date values the colClasses vector can specify the Date Class as the first datatype.
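A minimal sketch of skip, comment.char and colClasses used together is shown below. The file contents are invented for illustration, and "#" is assumed to be the comment character.

```r
# Hypothetical file contents: one title line, a header, a comment, two rows
raw <- "Exported by a hypothetical tool
day,count
# this line is a comment and is not part of the data
2014-01-01,10
2014-01-02,12"

counts <- read.csv(text = raw,
                   skip = 1,                          # bypass the title line
                   comment.char = "#",                # ignore lines starting with #
                   colClasses = c("Date", "integer")) # first column holds dates
str(counts)
```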
If the data you have been provided has its observations defined in fixed positions on each line the read.fwf (Read Fixed Width Format) function can be used.
You can specify the width of each measurement by providing a numeric vector containing the width of each value.
The column names and/or data types are defined in the same manner as previously discussed.
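For example, a fixed width read might look like the sketch below. The two lines of data, the widths and the column names are assumptions chosen for illustration; the lines are first written to a temporary file so the example is self-contained.

```r
# Write two made-up fixed-width lines to a temporary file:
# 4 characters for the year, 5 for the name, 2 for the count
tmp <- tempfile(fileext = ".txt")
writeLines(c("2010Liam 42",
             "2010Noah 37"), tmp)

fw <- read.fwf(tmp,
               widths = c(4, 5, 2),
               col.names = c("Year", "Name", "Frequency"),
               strip.white = TRUE)   # drop the padding around the names
fw
unlink(tmp)   # remove the temporary file
```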
Now that we have the data in memory let's start exploring.
Let's explore public data using R.
In this example we will work with birth registrations for males born in the province of Ontario in Canada from 1917 until 2010. The data set is provided under an Open Data initiative by the Ontario government.
A snapshot of the Comma Separated Values file is shown here.
Upon examination of the file format we see that we should skip the first row and use the second row as a header, so that its descriptions become the column names of our data frame. We will use the read.csv function and load our data into a data frame called n.
After a successful read operation we often examine the structure of the new data frame. We have confirmed that n is a data.frame with 66,351 observations of 3 different variables.
The measurement or variable names are: Year, Name, and Frequency
R has automatically determined that the Year attribute and the Frequency attribute are both integer data types.
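Notice how R has defined the Name column to be a Factor and not a character vector. This was the default behaviour before R 4.0.0 and can be changed using the option as.is=TRUE or stringsAsFactors=FALSE. From R 4.0.0 onward, strings are kept as character vectors unless stringsAsFactors=TRUE is specified.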
In this scenario we want to consider each name as a Factor and therefore we understand that there are 3,736 levels or unique names in our data.
R users will often examine the first few rows of data using the head() function. In this case we also validate the number of rows and columns using the dimension, or dim(), function.
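The read and the checks described above can be sketched as follows. The real file is not reproduced here, so a tiny stand-in with made-up names and counts is used; only the column names Year, Name and Frequency and the data frame name n come from the lesson. stringsAsFactors = TRUE is set explicitly so the Name column is a Factor regardless of the R version.

```r
# Tiny stand-in for the real file: a title line, a header line, then data.
# All names and counts below are invented for illustration.
raw <- "Hypothetical title line that is not part of the data
Year,Name,Frequency
2009,AARAV,20
2009,LIAM,30
2010,AARAV,25
2010,LIAM,40
2010,NOAH,35"

n <- read.csv(text = raw, skip = 1, header = TRUE, stringsAsFactors = TRUE)

str(n)            # structure: observations, variables and their types
nlevels(n$Name)   # number of unique names
head(n, 3)        # first three rows
dim(n)            # number of rows and columns
```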
Now it is time to explore. Our goal is to find out the most popular names in the most recent year.
Therefore, we first need to determine the most recent year. We use the max function, with the dollar sign ($) selecting the Year measurement from the data frame, to request its largest value.
We discover that the most recent year of data is from 2010. Now, we will create a new data frame that contains only the male babies born and registered in 2010. This new data frame is called n.2010.
Notice how we use the bracket notation to specify the condition of our filter for the rows to be returned. If you are familiar with SQL this would be equivalent to the WHERE clause or selection operation. Since the indexing method for data frames involves rows and columns, separated by a comma (,), we must follow our conditional expression of n$Year==2010 with a comma. No condition is given for the columns, and therefore all of the columns are returned.
We use the nrow() function to check the number of names registered in 2010. Now we know that there were 1,503 male names registered. As an aside, according to our data source, if a name was registered fewer than five (5) times it was not included in the data set for privacy reasons.
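These filtering steps can be sketched on a small stand-in data frame. The names and counts are made up; the data frame names n and n.2010 follow the lesson.

```r
# Small stand-in for the real data (invented values)
n <- data.frame(Year = c(2009L, 2009L, 2010L, 2010L, 2010L),
                Name = c("AARAV", "LIAM", "AARAV", "LIAM", "NOAH"),
                Frequency = c(20L, 30L, 25L, 40L, 35L))

latest <- max(n$Year)           # most recent year in the data
latest

# Rows where Year is 2010; nothing after the comma means all columns
n.2010 <- n[n$Year == 2010, ]
nrow(n.2010)                    # number of names registered in that year
```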
Our goal was to find the most popular baby names in the most recent year so now we will need to sort the data to get our answer.
Before we sort the data and determine the 5 most popular names, note that another option for filtering data is the subset() function.
Here we pass the original data frame to the function and then we also provide a conditional expression. By default all of the columns are returned, but a subset of columns can be specified with the select argument.
We validate that we have the same set of 1,503 observations. Now let's sort the data.
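A minimal sketch of the subset() approach, again on a small made-up data frame, is shown below. The result name n.2010.subset is hypothetical.

```r
# Small stand-in for the real data (invented values)
n <- data.frame(Year = c(2009L, 2009L, 2010L, 2010L, 2010L),
                Name = c("AARAV", "LIAM", "AARAV", "LIAM", "NOAH"),
                Frequency = c(20L, 30L, 25L, 40L, 35L))

# Same filter as the bracket notation, written with subset()
n.2010.subset <- subset(n, Year == 2010)
nrow(n.2010.subset)

# Optionally keep only some of the columns
name.freq <- subset(n, Year == 2010, select = c(Name, Frequency))
name.freq
```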
Sep 4 '14 #1
zmbd
5,501 Expert Mod 4TB
nbiswas:
Please provide proper citations for these articles.
-z
Sep 29 '14 #2