Read CSV into data frame
If you're getting started with data analysis in R, two fundamental tools you need to master are data frames and CSV files. Data frames are R's structure for holding tabular data, like you'd see in a spreadsheet. CSV files are a common plain text format for storing tabular data. Together, they provide a powerful way to import, manipulate, analyze and visualize datasets in R.
In this guide, we'll cover everything you need to know to work effectively with data frames and CSV files in R. I'll explain the concepts in depth, share best practices and walk through analyzing a real dataset. Whether you're a beginner or have some experience with R already, you'll come away with a solid grasp of these essential tools.
Importing CSV Files and Creating Data Frames
Let's start with the basics: getting your CSV data into R so you can work with it. The most straightforward way is using the read.csv() function. Simply pass it the file path of your CSV:
student_scores <- read.csv("student_scores.csv")
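If your CSV file is in R's current working directory (check it with getwd()), you can just provide the file name. Otherwise, use a relative or absolute path to specify the file location. Relative paths are resolved against the working directory, which is not necessarily the folder that contains your script.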
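Here is a minimal sketch of checking a path before reading it. The "data" subfolder is a hypothetical name chosen for illustration, so adjust it to wherever your file actually lives.

```r
# Relative paths are resolved against the working directory.
getwd()

# Build the path portably; "data" is a hypothetical subfolder name.
csv_path <- file.path("data", "student_scores.csv")

# Check first so a wrong path produces a clear message instead of an error.
if (file.exists(csv_path)) {
  student_scores <- read.csv(csv_path)
} else {
  message("File not found: ", normalizePath(csv_path, mustWork = FALSE))
}
```

If the message appears, compare the printed path with the output of getwd() to see where R is actually looking.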
The read.csv() function has a few important arguments to be aware of:
- header: indicates whether the first row of the CSV contains column names. Default is TRUE.
- sep: the field separator character. Default is comma for CSV files.
- stringsAsFactors: whether to convert string columns to factors. The default is FALSE since R 4.0.0; in earlier versions it was TRUE. readr's read_csv() never converts strings to factors and has no such argument.
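To see these arguments in action, here is a small self-contained sketch. It writes a made-up, semicolon-separated file to a temporary location and reads it back; the names and grades are invented for illustration.

```r
# Write a tiny semicolon-separated file to a temporary location so the
# example runs anywhere (the data is made up for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name;grade", "Ana;91", "Ben;78"), csv_path)

# header = TRUE treats the first row as column names, sep = ";" overrides
# the comma default, and stringsAsFactors = FALSE keeps text as character.
demo_scores <- read.csv(csv_path, header = TRUE, sep = ";",
                        stringsAsFactors = FALSE)

# Inspect the column names and types that R inferred.
str(demo_scores)
```

Passing stringsAsFactors explicitly is redundant on R 4.0.0 and later, but it makes the behavior the same on older installations.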
There are other functions for importing tabular data like read.table() and read_csv() from the readr package. They operate similarly but have some different default arguments.
Once you've imported the CSV, R stores that tabular data in a data frame. It's a 2-dimensional data structure where each column can have a different data type (numeric, character, logical, etc). Many base R functions and popular packages like dplyr and ggplot2 are designed to work with data frames.
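As a quick illustration of mixed column types, the sketch below builds a small data frame by hand instead of importing one. The column names and values are invented for the example.

```r
# A hand-built data frame with one column per data type
# (names and values are made up for illustration).
demo_df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  grade = c(91.5, 78, 84),
  enrolled = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Number of rows and columns.
dim(demo_df)

# The class of each column: character, numeric and logical.
sapply(demo_df, class)
```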
Basic Data Frame Operations
Now that your data is in a data frame, let's explore some fundamental operations you'll use frequently when working with them.
Accessing Rows and Columns
To access specific rows or columns of a data frame, you can use square bracket [ ] indexing:
student_scores[1:5, ] # first 5 rows, all columns
student_scores[, c("name", "grade")] # all rows, specific columns
student_scores$grade # all rows, single column
The comma inside the brackets separates rows and columns. 1:5 selects the first 5 rows. Leaving one side blank selects all rows/columns. For columns you can provide names or numeric indices.
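The sketch below runs these indexing forms on a small made-up data frame. It also shows one detail that is easy to trip over: selecting a single column with brackets returns a plain vector unless you add drop = FALSE.

```r
# Made-up data for illustration.
demo_df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  grade = c(91, 78, 84),
  year = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

first_two <- demo_df[1:2, ]                  # first 2 rows, all columns
by_name <- demo_df[, c("name", "grade")]     # columns selected by name
by_position <- demo_df[, c(1, 2)]            # the same columns by index

# A single column comes back as a vector by default ...
grade_vector <- demo_df[, "grade"]
# ... while drop = FALSE keeps it as a one-column data frame.
grade_frame <- demo_df[, "grade", drop = FALSE]
```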
Filtering Rows
To filter rows that meet certain criteria, subset the data frame in the row index:
student_scores[student_scores$grade >= 90, ]
This returns only the rows where the "grade" column is greater than or equal to 90. You can use any logical expression to filter.
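One thing to watch for is missing values. In the made-up data below, a missing grade makes the logical test return NA, and bracket filtering then produces a row full of NAs. Wrapping the condition in which(), or using subset(), drops those rows instead.

```r
# Made-up data with one missing grade.
demo_df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  grade = c(91, NA, 84),
  stringsAsFactors = FALSE
)

# The condition evaluates to TRUE, NA, FALSE, so the result has two rows:
# Ana plus a row consisting entirely of NA.
with_na_row <- demo_df[demo_df$grade >= 90, ]

# which() keeps only the positions where the condition is TRUE.
no_na_rows <- demo_df[which(demo_df$grade >= 90), ]

# subset() also treats NA as "no match".
also_clean <- subset(demo_df, grade >= 90)
```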
Sorting
To sort a data frame by one or more columns, use the order() function:
student_scores[order(student_scores$grade), ] # ascending
student_scores[order(-student_scores$grade), ] # descending
Provide the column(s) to order by. Put a minus sign (-) in front of a numeric column to sort it in descending order; for non-numeric columns, use order(..., decreasing = TRUE) instead.
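Here is a short sketch of sorting by more than one column, using made-up data with a hypothetical section column. It also shows the decreasing argument for a character column, which cannot be negated with a minus sign.

```r
# Made-up data; the "section" column is hypothetical.
demo_df <- data.frame(
  name = c("Cy", "Ana", "Ben", "Dee"),
  section = c("B", "A", "B", "A"),
  grade = c(84, 91, 84, 78),
  stringsAsFactors = FALSE
)

# Section ascending, then grade descending within each section.
by_section_grade <- demo_df[order(demo_df$section, -demo_df$grade), ]

# Character columns cannot be negated, so use decreasing = TRUE.
by_name_desc <- demo_df[order(demo_df$name, decreasing = TRUE), ]
```

Rows that tie on every sort key (Cy and Ben here) keep their original relative order.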
Advanced Data Frame Manipulation
Those are the building blocks, but often you'll need to do more complex data manipulation to clean, rearrange or summarize your data. Here are some key techniques.
Adding or Removing Columns
To add a new column, assign into the data frame with $:
student_scores$pass_fail <- ifelse(student_scores$grade >= 60, "Pass", "Fail")
The ifelse() function is a concise way to perform conditional logic. Here it assigns either "Pass" or "Fail" based on the "grade" column.
To remove columns, you can assign them as NULL:
student_scores$temp_column <- NULL
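Putting both steps together on made-up data, this sketch adds a derived column, adds a scratch column, and then removes the scratch column again.

```r
# Made-up data for illustration.
demo_df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  grade = c(91, 55, 84),
  stringsAsFactors = FALSE
)

# Add a column derived from an existing one.
demo_df$pass_fail <- ifelse(demo_df$grade >= 60, "Pass", "Fail")

# Add a scratch column, then drop it by assigning NULL.
demo_df$temp_column <- demo_df$grade / 100
demo_df$temp_column <- NULL

names(demo_df)
```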
Merging Data Frames
Sometimes your data is spread across multiple CSVs/data frames and you need to combine them. You can do this with the merge() function:
student_info <- merge(student_scores, student_demographics, by="student_id")
Provide the data frames to join and the column(s) to join on. This does an inner join by default, but you can specify the join type with the "all", "all.x", or "all.y" arguments.
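To make the join types concrete, here is a sketch with two small made-up data frames whose student_id values only partly overlap.

```r
# Made-up data: ids 2 and 3 appear in both data frames.
scores_demo <- data.frame(student_id = c(1, 2, 3), grade = c(91, 78, 84))
demographics_demo <- data.frame(student_id = c(2, 3, 4), year = c(1, 2, 1))

# Inner join (default): only ids present in both (2 and 3).
inner <- merge(scores_demo, demographics_demo, by = "student_id")

# Left join: every row of the first data frame; year is NA for id 1.
left <- merge(scores_demo, demographics_demo, by = "student_id", all.x = TRUE)

# Right join: every row of the second data frame; grade is NA for id 4.
right <- merge(scores_demo, demographics_demo, by = "student_id", all.y = TRUE)

# Full outer join: ids 1 through 4.
full <- merge(scores_demo, demographics_demo, by = "student_id", all = TRUE)
```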
Reshaping Data
Data frames can be in "wide" format with values across multiple columns or "long" format with a column for the variable name and a column for the value. You can reshape between them with pivot_longer() and pivot_wider() from the tidyr package:
library(tidyr)
student_scores_long <- pivot_longer(student_scores, cols=c("midterm", "final"), names_to="exam", values_to="score")
student_scores_wide <- pivot_wider(student_scores_long, names_from="exam", values_from="score")
This is useful for creating data visualizations or calculating summary statistics across categories.
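The round trip can be sketched on a tiny made-up data frame, assuming the tidyr package is installed. Going long turns two students with two exams each into four rows, and going wide restores the original shape.

```r
library(tidyr)  # assumes tidyr is installed

# Made-up wide data: one row per student, one column per exam.
demo_wide <- data.frame(
  name = c("Ana", "Ben"),
  midterm = c(85, 72),
  final = c(92, 80),
  stringsAsFactors = FALSE
)

# Long format: one row per student-exam combination.
demo_long <- pivot_longer(demo_wide, cols = c("midterm", "final"),
                          names_to = "exam", values_to = "score")

# Back to wide format: one column per exam again.
demo_back <- pivot_wider(demo_long, names_from = "exam", values_from = "score")
```

Note that both functions return a tibble rather than a plain data frame, which prints a little differently but works the same way in most code.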
Applying Functions
To apply a function to one or more columns of your data frame, use the mutate() function from dplyr:
library(dplyr)
student_scores <- mutate(student_scores,
  midterm_z = round((midterm - mean(midterm)) / sd(midterm), 2),
  final_z = round((final - mean(final)) / sd(final), 2)
)
This adds two new columns with the standardized z-scores for the "midterm" and "final" exams, rounded to 2 decimal places.
To apply a function across multiple columns, you can use across():
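student_scores %>%
  mutate(across(c(midterm, final), ~ round(as.numeric(scale(.x)), 2)))
This standardizes and then rounds both the midterm and final columns in one step. The ~ defines an anonymous function in which .x stands for the current column, and as.numeric() turns the one-column matrix returned by scale() back into a plain vector. Note that, unlike the previous example, this overwrites midterm and final instead of adding new columns, and the result is only printed unless you assign it.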
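If you would rather keep the original columns, across() accepts a .names argument that controls the names of the new columns. The sketch below assumes dplyr 1.0.0 or later is installed; the data and the z_score helper are made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 is installed

# Made-up data chosen so the z-scores come out as round numbers.
demo_df <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  midterm = c(80, 90, 100),
  final = c(70, 80, 90),
  stringsAsFactors = FALSE
)

# Hypothetical helper: standardize a numeric vector and round to 2 places.
z_score <- function(x) round((x - mean(x)) / sd(x), 2)

# .names = "{.col}_z" writes the results to midterm_z and final_z
# instead of overwriting midterm and final.
demo_z <- demo_df %>%
  mutate(across(c(midterm, final), z_score, .names = "{.col}_z"))
```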
Data Frame and CSV Best Practices
Here are a few tips I've learned working with data frames and CSVs:
- Keep column names lowercase with underscores (snake_case) for consistency
- On R versions before 4.0.0, set stringsAsFactors=FALSE when reading CSVs to avoid accidentally converting character columns to factors; from R 4.0.0 onward this is already the default
- Use a consistent naming convention for related data frames/CSVs like student_scores and student_demographics
- When saving data frames to CSV, use row.names=FALSE unless the row names are meaningful
- Be cautious merging data frames with duplicate column names – R will silently add .x and .y suffixes which can cause confusion
- Consider dplyr and tidyr functions for multi-step manipulation, as many people find them more readable than the base R equivalents; base R remains a fine choice for simple operations
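The row.names tip is easiest to appreciate with a round trip. This sketch writes a made-up data frame to temporary files with and without row names and reads each file back.

```r
# Made-up data for illustration.
demo_df <- data.frame(
  name = c("Ana", "Ben"),
  grade = c(91L, 78L),
  stringsAsFactors = FALSE
)

# Without row names, the file round-trips cleanly.
clean_path <- tempfile(fileext = ".csv")
write.csv(demo_df, clean_path, row.names = FALSE)
round_trip <- read.csv(clean_path, stringsAsFactors = FALSE)

# With the default row.names = TRUE, the row numbers are written as an
# unnamed first column, which read.csv() then imports as a column named "X".
noisy_path <- tempfile(fileext = ".csv")
write.csv(demo_df, noisy_path)
with_row_names <- read.csv(noisy_path, stringsAsFactors = FALSE)

names(with_row_names)
```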
Real-World Example: Analyzing Student Test Scores
Let's put this all together with a real data analysis example. Say we have a CSV file, scores.csv, with student midterm and final exam scores:
name,midterm,final
John,85,92
Jane,88,88
Bob,72,80
Alice,90,85
We want to read in this data, calculate some summary statistics and visualize the relationship between the midterm and final scores. Here's how we could do that:
library(ggplot2)  # provides ggplot() for the plot below

# Import the data and preview the first rows
scores <- read.csv("scores.csv")
head(scores)

# Summary statistics for each exam
mean_midterm <- mean(scores$midterm)
sd_midterm <- sd(scores$midterm)
mean_final <- mean(scores$final)
sd_final <- sd(scores$final)
cat("Midterm mean:", mean_midterm, "SD:", sd_midterm, "\n")
cat("Final mean:", mean_final, "SD:", sd_final, "\n")

# Correlation between the two exams
cor(scores$midterm, scores$final)

# Scatterplot with a fitted linear regression line
ggplot(scores, aes(x=midterm, y=final)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(title="Midterm vs. Final Scores",
       x="Midterm", y="Final")
This imports the scores from the CSV file, calculates means, standard deviations and a correlation, then creates a scatterplot with a linear regression line to visualize the relationship between the midterm and final scores.
We could enhance this analysis further by reading in a separate CSV with demographic data, merging it with the test scores and comparing results across groups. The possibilities are endless!
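As a rough sketch of that idea, the code below merges the four scores from the example with a demographic table and compares mean scores by group. The demographic data, including the section column and its values, is entirely hypothetical and exists only to show the mechanics.

```r
# The exam scores from the example above.
scores_demo <- data.frame(
  name = c("John", "Jane", "Bob", "Alice"),
  midterm = c(85, 88, 72, 90),
  final = c(92, 88, 80, 85),
  stringsAsFactors = FALSE
)

# Hypothetical demographic data: the "section" column and its values are
# invented for this sketch and are not part of the original dataset.
demographics_demo <- data.frame(
  name = c("John", "Jane", "Bob", "Alice"),
  section = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Combine the two tables on the shared name column.
combined <- merge(scores_demo, demographics_demo, by = "name")

# Mean midterm and final score per section.
section_means <- aggregate(cbind(midterm, final) ~ section,
                           data = combined, FUN = mean)
section_means
```

With only two students per group this says nothing meaningful about the groups themselves, but the same two calls scale to a real dataset.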
Conclusion
Data frames and CSV files are indispensable tools for doing data analysis in R. With the techniques covered in this guide, you can now confidently import, clean, manipulate and visualize your tabular data.
The key concepts to master are:
- Importing CSVs with read.csv() and relatives
- Indexing and subsetting data frames with [] and $
- Sorting, filtering and applying functions to data frames
- Merging and reshaping data frames for analysis and visualization
- Using dplyr and tidyr packages for readable, efficient code
We walked through a real analysis example to solidify these concepts. But the best way to internalize them is applying them to your own data. Pick a topic you're passionate about, find some relevant CSV data and start exploring!
With a solid foundation in data frames and CSVs, you're well on your way to becoming a proficient data analyst in R. Happy coding!