3  Import

4 Packages for this chapter

library(tidyverse)
library(readxl)
library(here)
library(janitor)
library(haven)
library(sjPlot)
library(surveytoolbox) 

# note surveytoolbox installs from github
# remotes::install_github("martinctc/surveytoolbox")

5 How to read data into R

The code you need to read your data into R depends on the kind of data you are dealing with. Here we will demo how to .csv files, Excel spreadsheets, and then go deep on how to deal with data from Qualtrics.

5.1 Reading in .csv files

Exporting data in its simplest form (comma separated values) means that your data is readable by most software, trackable by version control systems, and lightweight.

Use the following code to read in a csv file. Remember that we use the here() function to tell R where to find the data file, relative to the top level of the project file.

data1 <- read_csv(here("data", "my_csv_data.csv"))
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gender
dbl (2): age, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

NOTE: the message above prints in red in RStudio, but its not an error. Just a message telling you that this dataset has 5 rows and 3 columns. It also has information about the type of data that R thinks each variable is. Here R thinks the gender variable is character (strings/text) and the age and score variables are double (R speak for numeric).

5.2 Reading in Excel spreadsheets

Sometimes your data is in .xlsx format. You can use the readxl package to read spreadsheets into R. You can get a sense for the first few rows of your dataframe using the head() function.

data2 <- read_xlsx(here("data", "my_excel_data.xlsx"))

head(data2)
# A tibble: 5 × 3
    age gender score
  <dbl> <chr>  <dbl>
1    23 F         65
2    34 F         78
3    22 M         67
4    19 F         44
5    23 M         77

5.3 Reading in Qualtrics data

If you are collecting survey data, you are probably using Qualtrics. You can export your Qualtrics data in lots of different formats, but we advocate for exporting it as a .sav file.

Yes this is typically the format used in SPSS. When you export from Qualtrics into .sav/SPSS format, it retains helpful information like question wording and response labels. If you export straight to .csv, you lose that info and will find yourself cross-checking back to Qualtrics. So, strong word of advice to always export to .sav. it is handy because it keeps extra information about your variables in a set of labels, that you can use down the track.

From here, we’ll be using data from a file called sampledata.sav, which you can find in the data folder. We are using read_sav() from the haven package.

The glimpse function gives a nice overview of the variables, their type, and a preview of the data.

data <- read_sav(here("data", "sampledata.sav")) 

glimpse(data)
Rows: 199
Columns: 13
$ participantid         <dbl> 103, 109, 118, 121, 122, 126, 128, 129, 132, 133…
$ IPAddress             <chr> "73.173.1", "97.102.1", "108.30.2", "75.128.9", …
$ Variable1             <dbl+lbl> 4, 1, 3, 2, 6, 3, 2, 1, 2, 2, 1, 7, 2, 1, 1,…
$ Variable2             <dbl+lbl> 4, 4, 6, 5, 2, 6, 2, 4, 4, 3, 5, 2, 6, 5, 5,…
$ Variable3             <dbl+lbl> 6, 4, 6, 5, 3, 6, 3, 4, 4, 3, 5, 2, 6, 5, 5,…
$ Variable4             <dbl> 5, 4, 2, 2, 5, 6, 5, 1, 3, 3, 4, 5, 6, 5, 6, 3, …
$ Variable5             <dbl> 5, 5, 3, 1, 6, 7, 6, 1, 4, 2, 5, 7, 6, 7, 4, 3, …
$ Variable6             <dbl> 6, 6, 2, 1, 5, 6, 7, 3, 3, 4, 5, 6, 5, 6, 7, 2, …
$ Demographicscateg     <dbl+lbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1,…
$ condition12           <dbl+lbl> 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1,…
$ condition1234         <dbl+lbl> 3, 3, 1, 2, 4, 1, 1, 3, 3, 3, 1, 1, 1, 1, 4,…
$ comments              <chr> "", "", "", "", "", "", "", "", "", "", "", "", …
$ CompletionTimeseconds <dbl> 527, 287, 391, 377, 359, 595, 296, 466, 296, 550…

These variable names won’t be very nice to work with with awkward and inconsistent capitalisation. Actual Qualtrics exports are even messier!

The clean_names function from janitor helps clean them up!

Here we take our data dataframe, and pipe %>% it into clean_names, and then assign (<-) to a new object called data_cleanednames

Alternately, you could write it back to the same dataframe (e.g., data <- ), but this should be done very intentionally as when you overwrite dataframes, it can be difficult to debug your code when you get errors.

The general rule is to create a new dataframe each time you implement a big change on the data.

The glimpse command here shows you that you effectively cleaned the variable names!

data_cleanednames <- data %>%
  clean_names()

glimpse(data_cleanednames)
Rows: 199
Columns: 13
$ participantid          <dbl> 103, 109, 118, 121, 122, 126, 128, 129, 132, 13…
$ ip_address             <chr> "73.173.1", "97.102.1", "108.30.2", "75.128.9",…
$ variable1              <dbl+lbl> 4, 1, 3, 2, 6, 3, 2, 1, 2, 2, 1, 7, 2, 1, 1…
$ variable2              <dbl+lbl> 4, 4, 6, 5, 2, 6, 2, 4, 4, 3, 5, 2, 6, 5, 5…
$ variable3              <dbl+lbl> 6, 4, 6, 5, 3, 6, 3, 4, 4, 3, 5, 2, 6, 5, 5…
$ variable4              <dbl> 5, 4, 2, 2, 5, 6, 5, 1, 3, 3, 4, 5, 6, 5, 6, 3,…
$ variable5              <dbl> 5, 5, 3, 1, 6, 7, 6, 1, 4, 2, 5, 7, 6, 7, 4, 3,…
$ variable6              <dbl> 6, 6, 2, 1, 5, 6, 7, 3, 3, 4, 5, 6, 5, 6, 7, 2,…
$ demographicscateg      <dbl+lbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1…
$ condition12            <dbl+lbl> 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1…
$ condition1234          <dbl+lbl> 3, 3, 1, 2, 4, 1, 1, 3, 3, 3, 1, 1, 1, 1, 4…
$ comments               <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
$ completion_timeseconds <dbl> 527, 287, 391, 377, 359, 595, 296, 466, 296, 55…

A few things about working with files in SPSS format (.sav) before we continue. The reason why we bother with this is that the SPSS format maximises the information in the file. Unlike exporting to .csv or another spreadsheet format, .sav retains information about question wording (saved as a variable label) and response labelling (saved as a value label).

If you look at the variable types at the right of the glimpse output, you’ll see the some of the variables are dbl (numeric) while some are dbl+lbl (numeric with labelled values). If you view the data object (by clicking on it in the Environment or using view(data)) you will see that some of the variables have the question wording below the variable name.

Having this information on hand is really helpful when working with your data!

5.4 Codebooks & Data dictionaries

When you have a really large survey dataset, information about what each variable refers to is essential to a reproducable analysis. Often it is helpful to create a codebook or data dictionary that you can share alongside the datafile, that helps the user understand what the numbers in the file refer to and where they came time.

The view_df function from the sjPlot package creates a really nicely formatted html file that includes variable names, question wording, response options, and response labelling. This code saves the html file to the output_files folder using the here package (which starts where your Rproj file is). This html file is nice as a reference for your own use or to share with a supervisor or collaborator!

view_df(data_cleanednames)
Data frame: data_cleanednames
ID Name Label Values Value Labels
1 participantid range: 103-996
2 ip_address <output omitted>
3 variable1 variable 1 question wording 1
7
strongly disagree
strongly agree
4 variable2 variable 2 question wording 1
7
strongly disagree
strongly agree
5 variable3 variable 3 question wording 1
7
strongly disgree
strongly agree
6 variable4 variable 4 question wording range: 1-7
7 variable5 variable 5 question wording range: 1-7
8 variable6 variable 6 question wording range: 1-7
9 demographicscateg demographics question 1
2
category 1
category 2
10 condition12 two conditions 1
2
condition 1
condition 2
11 condition1234 four conditions 1
2
3
4
condition 1
condition 2
condition 3
condition 4
12 comments comments
13 completion_timeseconds completion time (seconds) range: 30-597
view_df(data_cleanednames, file=here("output_files","spsstest_codebook.html"))
Data frame: data_cleanednames
ID Name Label Values Value Labels
1 participantid range: 103-996
2 ip_address <output omitted>
3 variable1 variable 1 question wording 1
7
strongly disagree
strongly agree
4 variable2 variable 2 question wording 1
7
strongly disagree
strongly agree
5 variable3 variable 3 question wording 1
7
strongly disgree
strongly agree
6 variable4 variable 4 question wording range: 1-7
7 variable5 variable 5 question wording range: 1-7
8 variable6 variable 6 question wording range: 1-7
9 demographicscateg demographics question 1
2
category 1
category 2
10 condition12 two conditions 1
2
condition 1
condition 2
11 condition1234 four conditions 1
2
3
4
condition 1
condition 2
condition 3
condition 4
12 comments comments
13 completion_timeseconds completion time (seconds) range: 30-597

The data_dict function from surveytoolbox makes a dataframe with all the variable and response labels - similar to the html created above, but this can be called upon later in R as it’s now part of the environment.

datadictionary <- data_cleanednames %>%
  data_dict()

Let’s say you just want to know the question wording or response labels for a particular variable, you can do this with code rather than checking the whole dataset. The extract_vallab command from surveytoolbox returns the value labels for a given variable.

Here we are interested in what the values in the demographicscateg variable refer to.

data_cleanednames %>%
  extract_vallab("demographicscateg")
# A tibble: 2 × 2
     id demographicscateg
  <dbl> <chr>            
1     1 category 1       
2     2 category 2       

There are (evidently) times when packages do not like labelled data. So, here are a few tools for removing labels from the haven package. Keep these up your sleeve for problem solving later!

zap_labels and zap_label each remove labels. Yes it would be nice if those functions were easier to distinguish! The first zaps variable labels, and the second zaps value labels.

The code below makes a new data dictionary of the zapped dataframe and glimpses the new dataframe to confirm the labels are gone.

data_zapped <- data_cleanednames %>%
  zap_labels() %>%
  zap_label()

datadictionary_zapped <- data_zapped %>%
  data_dict()

glimpse(data_zapped)
Rows: 199
Columns: 13
$ participantid          <dbl> 103, 109, 118, 121, 122, 126, 128, 129, 132, 13…
$ ip_address             <chr> "73.173.1", "97.102.1", "108.30.2", "75.128.9",…
$ variable1              <dbl> 4, 1, 3, 2, 6, 3, 2, 1, 2, 2, 1, 7, 2, 1, 1, 3,…
$ variable2              <dbl> 4, 4, 6, 5, 2, 6, 2, 4, 4, 3, 5, 2, 6, 5, 5, 6,…
$ variable3              <dbl> 6, 4, 6, 5, 3, 6, 3, 4, 4, 3, 5, 2, 6, 5, 5, 7,…
$ variable4              <dbl> 5, 4, 2, 2, 5, 6, 5, 1, 3, 3, 4, 5, 6, 5, 6, 3,…
$ variable5              <dbl> 5, 5, 3, 1, 6, 7, 6, 1, 4, 2, 5, 7, 6, 7, 4, 3,…
$ variable6              <dbl> 6, 6, 2, 1, 5, 6, 7, 3, 3, 4, 5, 6, 5, 6, 7, 2,…
$ demographicscateg      <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
$ condition12            <dbl> 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1,…
$ condition1234          <dbl> 3, 3, 1, 2, 4, 1, 1, 3, 3, 3, 1, 1, 1, 1, 4, 2,…
$ comments               <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
$ completion_timeseconds <dbl> 527, 287, 391, 377, 359, 595, 296, 466, 296, 55…

For the rest of this script, we will work with the zapped dataframe. This is the recommended approach to save headaches with errors down the line.