library(tidyverse)
library(readxl)
library(here)
library(janitor)
library(haven)
library(sjPlot)
library(surveytoolbox)
# note surveytoolbox installs from github
# remotes::install_github("martinctc/surveytoolbox")
3 Import
4 Packages for this chapter
5 How to read data into R
The code you need to read your data into R depends on the kind of data you are dealing with. Here we will demo how to .csv files, Excel spreadsheets, and then go deep on how to deal with data from Qualtrics.
5.1 Reading in .csv files
Exporting data in its simplest form (comma separated values) means that your data is readable by most software, trackable by version control systems, and lightweight.
Use the following code to read in a csv file. Remember that we use the here()
function to tell R where to find the data file, relative to the top level of the project file.
<- read_csv(here("data", "my_csv_data.csv")) data1
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gender
dbl (2): age, score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NOTE: the message above prints in red in RStudio, but its not an error. Just a message telling you that this dataset has 5 rows and 3 columns. It also has information about the type of data that R thinks each variable is. Here R thinks the gender variable is character (strings/text) and the age and score variables are double (R speak for numeric).
5.2 Reading in Excel spreadsheets
Sometimes your data is in .xlsx format. You can use the readxl
package to read spreadsheets into R. You can get a sense for the first few rows of your dataframe using the head()
function.
<- read_xlsx(here("data", "my_excel_data.xlsx"))
data2
head(data2)
# A tibble: 5 × 3
age gender score
<dbl> <chr> <dbl>
1 23 F 65
2 34 F 78
3 22 M 67
4 19 F 44
5 23 M 77
5.3 Reading in Qualtrics data
If you are collecting survey data, you are probably using Qualtrics. You can export your Qualtrics data in lots of different formats, but we advocate for exporting it as a .sav file.
Yes this is typically the format used in SPSS. When you export from Qualtrics into .sav/SPSS format, it retains helpful information like question wording and response labels. If you export straight to .csv, you lose that info and will find yourself cross-checking back to Qualtrics. So, strong word of advice to always export to .sav. it is handy because it keeps extra information about your variables in a set of labels, that you can use down the track.
From here, we’ll be using data from a file called sampledata.sav, which you can find in the data
folder. We are using read_sav()
from the haven
package.
The glimpse
function gives a nice overview of the variables, their type, and a preview of the data.
<- read_sav(here("data", "sampledata.sav"))
data
glimpse(data)
Rows: 199
Columns: 13
$ participantid <dbl> 103, 109, 118, 121, 122, 126, 128, 129, 132, 133…
$ IPAddress <chr> "73.173.1", "97.102.1", "108.30.2", "75.128.9", …
$ Variable1 <dbl+lbl> 4, 1, 3, 2, 6, 3, 2, 1, 2, 2, 1, 7, 2, 1, 1,…
$ Variable2 <dbl+lbl> 4, 4, 6, 5, 2, 6, 2, 4, 4, 3, 5, 2, 6, 5, 5,…
$ Variable3 <dbl+lbl> 6, 4, 6, 5, 3, 6, 3, 4, 4, 3, 5, 2, 6, 5, 5,…
$ Variable4 <dbl> 5, 4, 2, 2, 5, 6, 5, 1, 3, 3, 4, 5, 6, 5, 6, 3, …
$ Variable5 <dbl> 5, 5, 3, 1, 6, 7, 6, 1, 4, 2, 5, 7, 6, 7, 4, 3, …
$ Variable6 <dbl> 6, 6, 2, 1, 5, 6, 7, 3, 3, 4, 5, 6, 5, 6, 7, 2, …
$ Demographicscateg <dbl+lbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1,…
$ condition12 <dbl+lbl> 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1,…
$ condition1234 <dbl+lbl> 3, 3, 1, 2, 4, 1, 1, 3, 3, 3, 1, 1, 1, 1, 4,…
$ comments <chr> "", "", "", "", "", "", "", "", "", "", "", "", …
$ CompletionTimeseconds <dbl> 527, 287, 391, 377, 359, 595, 296, 466, 296, 550…
These variable names won’t be very nice to work with with awkward and inconsistent capitalisation. Actual Qualtrics exports are even messier!
The clean_names
function from janitor
helps clean them up!
Here we take our data dataframe, and pipe %>% it into clean_names
, and then assign (<-) to a new object called data_cleanednames
Alternately, you could write it back to the same dataframe (e.g., data <- ), but this should be done very intentionally as when you overwrite dataframes, it can be difficult to debug your code when you get errors.
The general rule is to create a new dataframe each time you implement a big change on the data.
The glimpse
command here shows you that you effectively cleaned the variable names!
<- data %>%
data_cleanednames clean_names()
glimpse(data_cleanednames)
Rows: 199
Columns: 13
$ participantid <dbl> 103, 109, 118, 121, 122, 126, 128, 129, 132, 13…
$ ip_address <chr> "73.173.1", "97.102.1", "108.30.2", "75.128.9",…
$ variable1 <dbl+lbl> 4, 1, 3, 2, 6, 3, 2, 1, 2, 2, 1, 7, 2, 1, 1…
$ variable2 <dbl+lbl> 4, 4, 6, 5, 2, 6, 2, 4, 4, 3, 5, 2, 6, 5, 5…
$ variable3 <dbl+lbl> 6, 4, 6, 5, 3, 6, 3, 4, 4, 3, 5, 2, 6, 5, 5…
$ variable4 <dbl> 5, 4, 2, 2, 5, 6, 5, 1, 3, 3, 4, 5, 6, 5, 6, 3,…
$ variable5 <dbl> 5, 5, 3, 1, 6, 7, 6, 1, 4, 2, 5, 7, 6, 7, 4, 3,…
$ variable6 <dbl> 6, 6, 2, 1, 5, 6, 7, 3, 3, 4, 5, 6, 5, 6, 7, 2,…
$ demographicscateg <dbl+lbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1…
$ condition12 <dbl+lbl> 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1…
$ condition1234 <dbl+lbl> 3, 3, 1, 2, 4, 1, 1, 3, 3, 3, 1, 1, 1, 1, 4…
$ comments <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
$ completion_timeseconds <dbl> 527, 287, 391, 377, 359, 595, 296, 466, 296, 55…
A few things about working with files in SPSS format (.sav) before we continue. The reason why we bother with this is that the SPSS format maximises the information in the file. Unlike exporting to .csv or another spreadsheet format, .sav retains information about question wording (saved as a variable label) and response labelling (saved as a value label).
If you look at the variable types at the right of the glimpse
output, you’ll see the some of the variables are dbl (numeric) while some are dbl+lbl (numeric with labelled values). If you view the data
object (by clicking on it in the Environment or using view(data)
) you will see that some of the variables have the question wording below the variable name.
Having this information on hand is really helpful when working with your data!
5.4 Codebooks & Data dictionaries
When you have a really large survey dataset, information about what each variable refers to is essential to a reproducable analysis. Often it is helpful to create a codebook or data dictionary that you can share alongside the datafile, that helps the user understand what the numbers in the file refer to and where they came time.
The view_df
function from the sjPlot
package creates a really nicely formatted html file that includes variable names, question wording, response options, and response labelling. This code saves the html file to the output_files
folder using the here
package (which starts where your Rproj file is). This html file is nice as a reference for your own use or to share with a supervisor or collaborator!
view_df(data_cleanednames)
ID | Name | Label | Values | Value Labels |
1 | participantid | range: 103-996 | ||
2 | ip_address | <output omitted> | ||
3 | variable1 | variable 1 question wording | 1 7 |
strongly disagree strongly agree |
4 | variable2 | variable 2 question wording | 1 7 |
strongly disagree strongly agree |
5 | variable3 | variable 3 question wording | 1 7 |
strongly disgree strongly agree |
6 | variable4 | variable 4 question wording | range: 1-7 | |
7 | variable5 | variable 5 question wording | range: 1-7 | |
8 | variable6 | variable 6 question wording | range: 1-7 | |
9 | demographicscateg | demographics question | 1 2 |
category 1 category 2 |
10 | condition12 | two conditions | 1 2 |
condition 1 condition 2 |
11 | condition1234 | four conditions | 1 2 3 4 |
condition 1 condition 2 condition 3 condition 4 |
12 | comments | comments | ||
13 | completion_timeseconds | completion time (seconds) | range: 30-597 |
view_df(data_cleanednames, file=here("output_files","spsstest_codebook.html"))
ID | Name | Label | Values | Value Labels |
1 | participantid | range: 103-996 | ||
2 | ip_address | <output omitted> | ||
3 | variable1 | variable 1 question wording | 1 7 |
strongly disagree strongly agree |
4 | variable2 | variable 2 question wording | 1 7 |
strongly disagree strongly agree |
5 | variable3 | variable 3 question wording | 1 7 |
strongly disgree strongly agree |
6 | variable4 | variable 4 question wording | range: 1-7 | |
7 | variable5 | variable 5 question wording | range: 1-7 | |
8 | variable6 | variable 6 question wording | range: 1-7 | |
9 | demographicscateg | demographics question | 1 2 |
category 1 category 2 |
10 | condition12 | two conditions | 1 2 |
condition 1 condition 2 |
11 | condition1234 | four conditions | 1 2 3 4 |
condition 1 condition 2 condition 3 condition 4 |
12 | comments | comments | ||
13 | completion_timeseconds | completion time (seconds) | range: 30-597 |
The data_dict
function from surveytoolbox
makes a dataframe with all the variable and response labels - similar to the html created above, but this can be called upon later in R as it’s now part of the environment.
<- data_cleanednames %>%
datadictionary data_dict()
Let’s say you just want to know the question wording or response labels for a particular variable, you can do this with code rather than checking the whole dataset. The extract_vallab
command from surveytoolbox
returns the value labels for a given variable.
Here we are interested in what the values in the demographicscateg variable refer to.
%>%
data_cleanednames extract_vallab("demographicscateg")
# A tibble: 2 × 2
id demographicscateg
<dbl> <chr>
1 1 category 1
2 2 category 2
There are (evidently) times when packages do not like labelled data. So, here are a few tools for removing labels from the haven
package. Keep these up your sleeve for problem solving later!
zap_labels
and zap_label
each remove labels. Yes it would be nice if those functions were easier to distinguish! The first zaps variable labels, and the second zaps value labels.
The code below makes a new data dictionary of the zapped dataframe and glimpses the new dataframe to confirm the labels are gone.
<- data_cleanednames %>%
data_zapped zap_labels() %>%
zap_label()
<- data_zapped %>%
datadictionary_zapped data_dict()
glimpse(data_zapped)
Rows: 199
Columns: 13
$ participantid <dbl> 103, 109, 118, 121, 122, 126, 128, 129, 132, 13…
$ ip_address <chr> "73.173.1", "97.102.1", "108.30.2", "75.128.9",…
$ variable1 <dbl> 4, 1, 3, 2, 6, 3, 2, 1, 2, 2, 1, 7, 2, 1, 1, 3,…
$ variable2 <dbl> 4, 4, 6, 5, 2, 6, 2, 4, 4, 3, 5, 2, 6, 5, 5, 6,…
$ variable3 <dbl> 6, 4, 6, 5, 3, 6, 3, 4, 4, 3, 5, 2, 6, 5, 5, 7,…
$ variable4 <dbl> 5, 4, 2, 2, 5, 6, 5, 1, 3, 3, 4, 5, 6, 5, 6, 3,…
$ variable5 <dbl> 5, 5, 3, 1, 6, 7, 6, 1, 4, 2, 5, 7, 6, 7, 4, 3,…
$ variable6 <dbl> 6, 6, 2, 1, 5, 6, 7, 3, 3, 4, 5, 6, 5, 6, 7, 2,…
$ demographicscateg <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
$ condition12 <dbl> 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1,…
$ condition1234 <dbl> 3, 3, 1, 2, 4, 1, 1, 3, 3, 3, 1, 1, 1, 1, 4, 2,…
$ comments <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
$ completion_timeseconds <dbl> 527, 287, 391, 377, 359, 595, 296, 466, 296, 55…
For the rest of this script, we will work with the zapped dataframe. This is the recommended approach to save headaches with errors down the line.