Some tricks I use for data manipulation in R

Sat, Apr 2, 2022 | 7-minute read | Tags: r, dplyr

Recently I have been working on a project that demands a great deal of effort to clean and manipulate a collection of longitudinal data sets. It took me a few months to understand the practical meaning and structure of the collected data and to refine the algorithms that generate core variables from the raw data. The task put my data wrangling skills through a comprehensive test, and some interesting challenges emerged along the way. In this post, I will share some of those challenges and give my solutions. The discussion is based on some key functions from the R package dplyr.

1. Date

When mutating a date column, the case_when() function is better than ifelse() because the former keeps the Date type intact.

The example data has two columns: DATE in Date type and LAB in numeric type. If we want to generate a new date variable, as shown in the example, ifelse() returns numeric values while case_when() returns dates. Of course one can convert the numeric values back to dates with as.Date(), but it's better to keep the code simple and straightforward.


library(tidyverse)
df <- data.frame(DATE = c(seq(as.Date("01-03-2021", format = "%d-%m-%Y"),
                              as.Date("02-03-2021", format = "%d-%m-%Y"),
                              by = "day"),
                          NA, NA),
                 LAB = c(1:3, NA))

#         DATE LAB
# 1 2021-03-01   1
# 2 2021-03-02   2
# 3       <NA>   3
# 4       <NA>   NA

# mutate to generate new Date variable
df %>% 
  mutate(
    DATE_IF = ifelse(LAB < 2, DATE, NA),
    DATE_CASE = case_when(LAB < 2 ~ DATE,
                          LAB >= 2 ~ NA_real_)
    )
#         DATE LAB DATE_IF  DATE_CASE
# 1 2021-03-01   1   18687 2021-03-01
# 2 2021-03-02   2      NA       <NA>
# 3       <NA>   3      NA       <NA>
# 4       <NA>  NA      NA       <NA>
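
One caveat: newer versions of dplyr (1.1.0 and later) are stricter about types in case_when(), so mixing a Date column with NA_real_ may now raise a type error. A typed missing value such as as.Date(NA), together with the .default argument introduced in dplyr 1.1.0, keeps the same idea working; a minimal sketch:

df %>% 
  mutate(
    # a Date-typed NA keeps every branch the same type
    DATE_CASE = case_when(LAB < 2 ~ DATE,
                          .default = as.Date(NA))
    )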

2. Missing

Personally I think it's very important to be clear about the definition and use of missing values (i.e., NA) before making any changes to the data, or even before collecting the data; otherwise they will cause confusion and inaccuracy down the road. Here are some questions worth discussing at the beginning.

  • Don’t know VS Null

    It’s very common in clinical trial and survey data that a subject would answer “don’t know” to the question or not be able to answer the question, which are two totally different responses. The best practice is to assign different values to them. For example, assign 99 to “don’t know” and assign NA to missing, in this way one can discover patterns such as how many people skipped the question, and among all the people who answered, how many do not know.

  • NA is a logical value in R

    One reason I often prefer case_when() over ifelse() is that the former allows us to spell out every logical value explicitly. ifelse() only distinguishes two outcomes, TRUE and FALSE, and silently propagates NA when the condition evaluates to NA. case_when(), in contrast, lets us write a branch for each of R's three logical values: TRUE, FALSE, and NA. In real data, the condition is often a chain of comparisons that can easily evaluate to NA, as the truth table below shows; a short example after it demonstrates the practical difference.

    
    NA | TRUE
    # [1] TRUE  
    # TRUE | TRUE is TRUE and FALSE | TRUE is also TRUE.
    
    NA | FALSE
    # [1] NA  
    # TRUE | FALSE is TRUE but FALSE | FALSE is FALSE.
    
    NA & TRUE
    # [1] NA  
    # TRUE & TRUE is TRUE but FALSE & TRUE is FALSE.
    
    NA & FALSE
    # [1] FALSE
    # TRUE & FALSE is FALSE and FALSE & FALSE is also FALSE.
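
    As a small illustration of the practical difference, with a made-up vector x:

    x <- c(0, 2, NA)
    ifelse(x > 1, "big", "small")
    # [1] "small" "big"   NA
    # the NA condition silently propagates NA
    case_when(x > 1 ~ "big",
              x <= 1 ~ "small",
              is.na(x) ~ "missing")
    # [1] "small" "big"   "missing"
    # the NA condition gets its own explicit branch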
    
  • sort() is better than max() or min()

    Sometimes I want to find the maximum or minimum value for each group, where some groups contain only missing (NA) values while others contain at least one non-missing value. In this case, sort() is better than max() or min(): the latter two produce a warning and return -Inf or Inf when the input is all NA, while sort() drops NAs by default, so taking the first element of its result simply yields NA.

    If you can be sure that every group contains at least one non-missing value, go ahead and use max() or min() with the argument na.rm = TRUE. However, that may not always be known in advance. A grouped sketch follows the example below.

    # find the maximum: sort in decreasing order and take the first element
    sort(c(NA, NA), decreasing = TRUE)[1]
    # [1] NA
    max(c(NA, NA), na.rm = TRUE)
    # [1] -Inf
    # Warning message:
    #   In max(c(NA, NA), na.rm = TRUE) :
    #   no non-missing arguments to max; returning -Inf
    
    # find the minimum: sort in increasing order and take the first element
    sort(c(NA, NA), decreasing = FALSE)[1]
    # [1] NA
    min(c(NA, NA), na.rm = TRUE)
    # [1] Inf
    # Warning message:
    #   In min(c(NA, NA), na.rm = TRUE) :
    #   no non-missing arguments to min; returning Inf
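
    The same trick scales to grouped data. A minimal sketch, with a made-up data frame in which group "b" is all NA:

    df_na <- data.frame(GROUP = c("a", "a", "b", "b"),
                        LAB = c(1, 2, NA, NA))
    df_na %>% 
      group_by(GROUP) %>% 
      summarise(LAB_MAX = sort(LAB, decreasing = TRUE)[1])
    #   GROUP LAB_MAX
    # 1 a           2
    # 2 b          NA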
    

3. rowwise() VS across()

It gave me a headache when I first encountered rowwise(), because it was not clear when to use it, as in the example below. The purpose is to calculate the sum of each row, but if I only apply the function sum(), the sum of the entire data set is returned. However, if I use the operator + or add rowwise(), I get the desired output. There are two reasons behind this: 1) sum() collapses all of its inputs into a single number and is not vectorized over rows like the operator +, so it does not work as expected in this case; 2) rowwise() performs operations row by row; it is slower than vectorized operations like + but it works with any function.


df <- data.frame(V1 = c(1,2,3), 
                 V2 = c(4,5,6))
#   V1 V2
# 1  1  4
# 2  2  5
# 3  3  6


# sum() collapses both columns into one grand total
# the vectorized operator + gives the sum of each row
df %>% 
  mutate(V_SUM = sum(V1, V2)) %>% 
  mutate(V_SUM2 = V1 + V2)
#   V1 V2 V_SUM V_SUM2
# 1  1  4    21      5
# 2  2  5    21      7
# 3  3  6    21      9

# using rowwise() + sum() gives the sum of each row
df %>% 
  rowwise() %>% 
  mutate(V_SUM = sum(V1, V2))
#      V1    V2 V_SUM
# 1     1     4     5
# 2     2     5     7
# 3     3     6     9

Contrary to rowwise(), which performs operations row by row, across() allows us to perform operations column-wise. More examples can be found in the documentation for across().

df %>% 
  mutate(across(starts_with("V"), ~ .x + 0.5))
#    V1  V2
# 1 1.5 4.5
# 2 2.5 5.5
# 3 3.5 6.5
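
across() also pairs well with row-oriented helpers. For row sums specifically, a fully vectorized alternative to rowwise() is to hand the selected columns to rowSums(), which is typically much faster on large data:

df %>% 
  mutate(V_SUM = rowSums(across(starts_with("V"))))
#   V1 V2 V_SUM
# 1  1  4     5
# 2  2  5     7
# 3  3  6     9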

4. Tidy Evaluation

I don’t think the concept Tidy Evaluation is mentioned as much often as other functions in dplyr, but it’s a very important idea that can potentially help you write more concise and compact functions. When I deal with multiple data sets, very often the data sets have the same data types but different column names. In this scenario, it might be very helpful to have one single function that accepts data with different columns names, and tidy evaluation does exactly what we want.

In the example we have two data sets with the same structure but different column names. We write a function with three arguments: 1) data is the data set; 2) name is the name of the existing variable on which we want to operate; 3) new_name is the name of the newly generated variable. The example shows that if we apply the function directly to a data set, an error is reported because mutate() looks for a column literally called name instead of the column we passed in. Tidy evaluation translates the argument value to the variable name through the "curly curly" operator {{}}, with which mutate() works properly. More details can be found in the "Programming with dplyr" vignette.

# two data sets with same data structure but different column names
df_var <- data.frame(var = c(1,2,3))
df_col <- data.frame(col = c(1,2,3))

df_function <- function(data, name, new_name){
  
  # data: data set
  # name: variable name
  # new_name: new variable name
  
  data %>% 
    # `name` is taken literally as a column name that does not exist in the data
    mutate(new_name = name*2)
}

df_function(df_var, var, var2)
# Error in `mutate()`:
#   ! Problem while computing `new_name = name * 2`.

# use tidy evaluation {{}} 
df_function <- function(data, name, new_name){
  
  # data: data set
  # name: variable name
  # new_name: new variable name
  
  data %>% 
    mutate({{ new_name }} := {{ name }} * 2)
}

df_function(df_var, var, var2)
#   var var2
# 1   1    2
# 2   2    4
# 3   3    6
df_function(df_col, col, col2)
#   col col2
# 1   1    2
# 2   2    4
# 3   3    6
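
One closing note: {{}} expects bare column names. If the names instead arrive as character strings (say, read from a config file), the .data pronoun together with !! and := handles that case too; a minimal sketch under that assumption:

df_function_chr <- function(data, name, new_name){
  
  # name and new_name are strings here, e.g. "var" and "var2"
  data %>% 
    mutate(!!new_name := .data[[name]] * 2)
}

df_function_chr(df_var, "var", "var2")
#   var var2
# 1   1    2
# 2   2    4
# 3   3    6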