Tips to prepare code without data at hand

R Cafe August

Aleksandra & Clementine

2022-08-31

TIP 1: Prepare your environment

  • Setup RStudio
  • Load packages
  • Use relative paths

Setup R studio

  • Create a project: File->New Project->New Directory->New Project
  • Create folder structure
  • Create first script: File->New File->RScript->Save in scripts folder

Source: Data Carpentry: R for Social Scientists

Load packages

# Load packages -----------------------------------------------
 library("data.table")
 library("here")
 library("tidyverse")
 library("tools")
# Load packages -----------------------------------------------
# Package names
packages <- c( "data.table", "here", "tidyverse",
               "highcharter", "data.table",
               "revealjs","tools")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

Choose relative paths!

Do not set absolute paths with set_wd(). here()package absolute path given current working directory.

here()
## [1] "C:/Users/awilczynski/Desktop/script-no-data"
here('data')
## [1] "C:/Users/awilczynski/Desktop/script-no-data/data"
here('data', 'dummy_LFS.csv')
## [1] "C:/Users/awilczynski/Desktop/script-no-data/data/dummy_LFS.csv"

TIP 2: Generalise your code

  • Create functions
  • Avoid hard-coding, assign values to objects
  • Avoid repetition, use e.g.:
    • for loops
    • apply() family functions
    • map() and map2() family functions()

Functions

my_function <- function(arg1, arg2, arg3=0) {
    x <- arg1 + arg2
    x
}

my_function(1,2)
## [1] 3

Functions

You can customise your function for your purpose or further generalise it further:

###### Function to join the 2 tables
join_tabs <- function(p_tab, hh_tab){
   single_tab <- left_join(p_tab, hh_tab, 
                           by=c("RINPERSOON" = "RINPERSOON"))
   
   return(single_tab)
}

single_tab <- join_tabs(p_tab, hh_tab)
###### Function to join the 2 tables
join_tabs <- function(p_tab, hh_tab, key = "RINPERSOON"){
   single_tab <- left_join(p_tab, hh_tab, by=c(key = key))
   return(single_tab)
}

single_tab <- join_tabs(p_tab, hh_tab)

(l)apply() it

#### lapply()
new_list <- lapply(old_list, function)

#### map2()
new_df <- map2_dfr(object1, object2, function )

#### mapply()
new_df <- mapply(function,arg1 = object1, arg2= object2,  SIMPLIFY=F ) %>% as.data.frame()

Create objects!

This will help you changing your code only in one place, rather than everywhere you used the value.

text1<-paste('Today, I have run 10.2 km. 10.2 km is 12000 m. It was a good result.')
print(text1)
## [1] "Today, I have run 10.2 km. 10.2 km is 12000 m. It was a good result."
today<-Sys.Date()
distance <- 10.2 
text2<- paste0('On ', today,', I have run ', distance ,'km. '
              , distance , 'km is ', distance*1000,'m. It was a '
              , ifelse(distance>=10, 'good','bad'),' result.')

print(text2)
## [1] "On 2022-08-31, I have run 10.2km. 10.2km is 10200m. It was a good result."

Avoid repetition

# Get file names of all our files
data_path = here('data')
files = here(data_path, dir(data_path))
# Read in first rows of each data.frame
data1 <- fread(files[1])
data2 <- fread(files[2])
# Read first rows of each data.frame
file_list = list()

for (i in files) { 
  file_list[[i]] == fread(i)
}
# Read first rows of each data.frame
file_list2 = lapply(files, fread )

TIP 3: Read(write) data documentation

In some case (like with CBS data), although you cannot access the data, you can already access the documentation.

If you are collecting data, it’s a good moment to create your data model and start writing documentation

E.g.: CBS

E.g.: EULFS

Create metadata object

Create a metadata data frame with info about the variables you are going to use. It can include fields:

  • Names of variables
  • Variable labels
  • Variable preferred data type (factor/numeric/date/(…))
  • Variable data type in the dataset
  • Coding of missing values
  • Possible values
  • (…)

Create metadata object

meta_table = data.table::data.table(
  var_name = c('NIGHTWK', 'SATWK', 'SUNWK', 'HWUSUAL'),
  var_label = c('Night work', 'Saturday work',
                'Sunday work', 'Hours worked'),
  pref_dtype = c('factor', 'factor', 'factor', 'numeric'),
  dset_dtype = c('numeric', 'numeric', 'numeric', 'numeric'),
  missing = list(c('9',''), c('9',''), c('9',''), c('00','99','')),
  possible_values =list(1:3, 1:3, 1:3, 1:98) 
  )

print(meta_table)
##    var_name     var_label pref_dtype dset_dtype missing possible_values
## 1:  NIGHTWK    Night work     factor    numeric      9,           1,2,3
## 2:    SATWK Saturday work     factor    numeric      9,           1,2,3
## 3:    SUNWK   Sunday work     factor    numeric      9,           1,2,3
## 4:  HWUSUAL  Hours worked    numeric    numeric  00,99, 1,2,3,4,5,6,...

TIP 4: Prepare to read data

  • Prepare script to read the data
  • Create dummy data

Prepare for a single file

my_data <- read_csv(here('data','dummy_LFS.csv'))

Get all the files from data directory

data_path = here('data')
files = here(data_path, dir(data_path))
files_no_ext = tools::file_path_sans_ext(files)

Custom for certain file paths (e.g. csv)

file_list_1 = lapply(files, read_csv)

Generalize (data.table output)

file_list_2 = lapply(files, fread)

# Get the first dataset 
data1 <- file_list_2[[1]]
nrows_1 <- nrow(data1)
head(data1)
##    ID EVENWK NIGHTWK SATWK SUNWK HWUSUAL SHIFTWK TEMPAGCY
## 1:  1      1      NA     3     3      35       9        9
## 2:  2      3       1     2     3      34       9        9
## 3:  3     NA      NA     2     2      40       9        9
## 4:  4      9       9     9     9      99       1        1
## 5:  5      9       9     9     9      99       1        0
## 6:  6      2       1     3     3      48      NA       NA

Prepare a dummy dataset

  • Manually
# Number of rows to generate
n_rows <- 10000

# Set seed to always get the same result in PRNGs  
set.seed(12345)

# Generate the data frame
dummy_data <- data.frame(
  ID = 1:n_rows,
  NIGHTWK = sample(as.numeric(c(1,2,3,9,'')),n_rows,rep=T),
  SATWK = sample(as.numeric(c(1,2,3,9,'')),n_rows,rep=T),
  SUNWK = sample(as.numeric(c(1,2,3,9,'')),n_rows,rep=T),
  HWUSUAL = sample(as.numeric(c(0:99,'')),n_rows,rep=T)
  )

Prepare a dummy dataset

  • From a metadata file
### Create variable with all possible values of a variable
meta_table$all_values <- mapply(c, meta_table$missing, meta_table$possible_values, SIMPLIFY = FALSE)

### Character vector of function names to adapt the data type
funs <- paste0("as.", meta_table$dset_dtype)

### Change the data type so that it's aligned with the documentation
meta_table[,all_values:= mapply(function(arg,f) f(arg),
                   meta_table$all_values, 
                   lapply(funs, get),
                   SIMPLIFY = F)]

### Number of rows to generate
n_rows <- 10000

### Set seed to always get the same result in PRNGs  
set.seed(12345)

### Sample 10,000 data points from all possible values
dummy_data2 <- lapply(meta_table$all_values, 
                      sample, n_rows, rep=T) %>% as.data.table()

### Name the variables
colnames(dummy_data2) <- meta_table$var_name

### Add ID variable
dummy_data2[, ID:= 1:n_rows]

### Reorder vars
setcolorder(dummy_data2, c('ID', meta_table$var_name))

And use it to automate cleaning - Recode NA

# Remove NAs different on variable-specific codes

### Convert a certain character to NA
recode_to_na <- function(x, char){
  out_x <- x
  out_x[which(out_x %in% char)] = NA
  return(out_x)
}

### Keep only the variables from the metadata 
meta_var <- meta_table$var_name
dummy_data_clean <- dummy_data2[,..meta_var]

### Keep only the vars that we chose in the metadata file
dummy_data_clean<-map2_dfr(dummy_data_clean,meta_table$missing, recode_to_na )

And use it to automate cleaning - Adapt data types

### Vector of function names to change data types 
funs <- paste0("as.", meta_table$pref_dtype)


### Convert all the columns in the data set to the preferred data type 
dummy_data_clean <- as.data.table(mapply(function(a,b) b(a), 
                                         dummy_data_clean, 
                                         lapply(funs, get), 
                                         SIMPLIFY = F))


#str(dummy_data_clean)

TIP 5: Prepare for data vises

  • Choose (and adapt) your theme
  • Choose or create your color palette
  • Create your custom functions

Choose and adapt your theme

Choose and adapt your theme

dcc_theme <- theme_minimal() + 
  theme(panel.grid.major.x = element_blank(),
        axis.title = element_blank(),
        text = element_text(color = 'gray60', size = rel(3)),
        axis.text = element_text(color = 'gray20'),
        legend.position="bottom"
        )

theme_set(dcc_theme)

Choose and adapt your theme

plot_iris()

Choose or create your color palette

  • Use pre-made palettes, e.g. from viridis()
  • Or save your own selection of colors in a vector
  • Inspect color palettes with scales::show_col()
  • Change ggplot default colors with options()

Choose or create your color palette

  • Use pre-made palettes, e.g. from viridis()
  • Or save your own selection of colors in a vector
tud_cols  <- c('#00A6D6','#FFFFFF', '#000000', '#0C2340', '#00B8C8'  )
cols_cont <- viridis::viridis(9)
tud_qual  <- c('#00B8C8', '#EC6842', '#FFB81C', '#EF60A3' )

Choose or create your color palette

  • Inspect color palettes with scales::show_col()
scales::show_col( tud_cols  ); scales::show_col( cols_cont  )

Choose or create your color palette

  • Change ggplot default colors with options()
opts <- options()  # save old options

options(ggplot2.continuous.colour="viridis")
options(ggplot2.continuous.fill = "viridis")

options(ggplot2.discrete.colour=tud_qual)
options(ggplot2.discrete.fill = tud_qual)

Choose or create your color palette

plot_iris()

Create your custom functions

###### Function to visualise the distribution of birthday months
viz_birth_month <- function(single_tab){
  
  levels(single_tab$GBAGEBOORTEMAAND) <- c(
    "January","February", "March", "April", 
    "May", "June", "July", "August",
    "September", "October", "November", "December"
    )
  
  q <- ggplot(data=single_tab, aes(x=GBAGEBOORTEMAAND)) + 
    geom_bar(fill="orange") + 
    ggtitle("Distribution of birthdays by month in register population") +
    coord_flip() + 
    scale_x_discrete(limits = rev(levels(single_tab$GBAGEBOORTEMAAND))) +
    theme_minimal()
  return(q)
}

TIP 6: When you get the data …

  • Store the data in the right location (e.g. data/ folder)
  • See the data structure with: summary() or str()
  • See if the the data you need is there (confront with metadata file?)
  • Do initial data cleaning, e.g.:
    • Recode NAs
    • Remove white spaces
    • Recode
    • Check and adapt data types
    • Make sure dates are read in the right format (try lubridate())

Example : CBS data

Let’s do some coding

Discussion

Questions? Comments? Your own examples?

Next R Cafe