Tips to prepare code without data at hand

R Cafe August

Aleksandra & Clementine

2022-08-31

TIP 1: Prepare your environment

Setup RStudio
Load packages
Use relative paths

Setup R studio

Create a project: File->New Project->New Directory->New Project
Create folder structure
Create first script: File->New File->RScript->Save in scripts folder

Source: Data Carpentry: R for Social Scientists

Load packages

# Load packages -----------------------------------------------
 library("data.table")
 library("here")
 library("tidyverse")
 library("tools")

# Load packages -----------------------------------------------
# Package names
packages <- c( "data.table", "here", "tidyverse",
               "highcharter", "data.table",
               "revealjs","tools")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

Choose relative paths!

Do not set absolute paths with set_wd(). here()package absolute path given current working directory.

here()

## [1] "C:/Users/awilczynski/Desktop/script-no-data"

here('data')

## [1] "C:/Users/awilczynski/Desktop/script-no-data/data"

here('data', 'dummy_LFS.csv')

## [1] "C:/Users/awilczynski/Desktop/script-no-data/data/dummy_LFS.csv"

TIP 2: Generalise your code

Create functions
Avoid hard-coding, assign values to objects
Avoid repetition, use e.g.:
- for loops
- apply() family functions
- map() and map2() family functions()

Functions

my_function <- function(arg1, arg2, arg3=0) {
    x <- arg1 + arg2
    x
}

my_function(1,2)

## [1] 3

Functions

You can customise your function for your purpose or further generalise it further:

###### Function to join the 2 tables
join_tabs <- function(p_tab, hh_tab){
   single_tab <- left_join(p_tab, hh_tab, 
                           by=c("RINPERSOON" = "RINPERSOON"))
   
   return(single_tab)
}

single_tab <- join_tabs(p_tab, hh_tab)

###### Function to join the 2 tables
join_tabs <- function(p_tab, hh_tab, key = "RINPERSOON"){
   single_tab <- left_join(p_tab, hh_tab, by=c(key = key))
   return(single_tab)
}

single_tab <- join_tabs(p_tab, hh_tab)

(l)apply() it

#### lapply()
new_list <- lapply(old_list, function)

#### map2()
new_df <- map2_dfr(object1, object2, function )

#### mapply()
new_df <- mapply(function,arg1 = object1, arg2= object2,  SIMPLIFY=F ) %>% as.data.frame()

Create objects!

This will help you changing your code only in one place, rather than everywhere you used the value.

text1<-paste('Today, I have run 10.2 km. 10.2 km is 12000 m. It was a good result.')
print(text1)

## [1] "Today, I have run 10.2 km. 10.2 km is 12000 m. It was a good result."

today<-Sys.Date()
distance <- 10.2 
text2<- paste0('On ', today,', I have run ', distance ,'km. '
              , distance , 'km is ', distance*1000,'m. It was a '
              , ifelse(distance>=10, 'good','bad'),' result.')

print(text2)

## [1] "On 2022-08-31, I have run 10.2km. 10.2km is 10200m. It was a good result."

Avoid repetition

# Get file names of all our files
data_path = here('data')
files = here(data_path, dir(data_path))

# Read in first rows of each data.frame
data1 <- fread(files[1])
data2 <- fread(files[2])

# Read first rows of each data.frame
file_list = list()

for (i in files) { 
  file_list[[i]] == fread(i)
}

# Read first rows of each data.frame
file_list2 = lapply(files, fread )

TIP 3: Read(write) data documentation

In some case (like with CBS data), although you cannot access the data, you can already access the documentation.

If you are collecting data, it’s a good moment to create your data model and start writing documentation

E.g.: CBS

E.g.: EULFS

Create metadata object

Create a metadata data frame with info about the variables you are going to use. It can include fields:

Names of variables
Variable labels
Variable preferred data type (factor/numeric/date/(…))
Variable data type in the dataset
Coding of missing values
Possible values
(…)

Create metadata object

meta_table = data.table::data.table(
  var_name = c('NIGHTWK', 'SATWK', 'SUNWK', 'HWUSUAL'),
  var_label = c('Night work', 'Saturday work',
                'Sunday work', 'Hours worked'),
  pref_dtype = c('factor', 'factor', 'factor', 'numeric'),
  dset_dtype = c('numeric', 'numeric', 'numeric', 'numeric'),
  missing = list(c('9',''), c('9',''), c('9',''), c('00','99','')),
  possible_values =list(1:3, 1:3, 1:3, 1:98) 
  )

print(meta_table)

##    var_name     var_label pref_dtype dset_dtype missing possible_values
## 1:  NIGHTWK    Night work     factor    numeric      9,           1,2,3
## 2:    SATWK Saturday work     factor    numeric      9,           1,2,3
## 3:    SUNWK   Sunday work     factor    numeric      9,           1,2,3
## 4:  HWUSUAL  Hours worked    numeric    numeric  00,99, 1,2,3,4,5,6,...

TIP 4: Prepare to read data

Prepare script to read the data
Create dummy data

Prepare for a single file

my_data <- read_csv(here('data','dummy_LFS.csv'))

Get all the files from data directory

data_path = here('data')
files = here(data_path, dir(data_path))
files_no_ext = tools::file_path_sans_ext(files)

Custom for certain file paths (e.g. csv)

file_list_1 = lapply(files, read_csv)

Generalize (data.table output)

file_list_2 = lapply(files, fread)

# Get the first dataset 
data1 <- file_list_2[[1]]
nrows_1 <- nrow(data1)

head(data1)

##    ID EVENWK NIGHTWK SATWK SUNWK HWUSUAL SHIFTWK TEMPAGCY
## 1:  1      1      NA     3     3      35       9        9
## 2:  2      3       1     2     3      34       9        9
## 3:  3     NA      NA     2     2      40       9        9
## 4:  4      9       9     9     9      99       1        1
## 5:  5      9       9     9     9      99       1        0
## 6:  6      2       1     3     3      48      NA       NA

Prepare a dummy dataset

Manually

# Number of rows to generate
n_rows <- 10000

# Set seed to always get the same result in PRNGs  
set.seed(12345)

# Generate the data frame
dummy_data <- data.frame(
  ID = 1:n_rows,
  NIGHTWK = sample(as.numeric(c(1,2,3,9,'')),n_rows,rep=T),
  SATWK = sample(as.numeric(c(1,2,3,9,'')),n_rows,rep=T),
  SUNWK = sample(as.numeric(c(1,2,3,9,'')),n_rows,rep=T),
  HWUSUAL = sample(as.numeric(c(0:99,'')),n_rows,rep=T)
  )

Prepare a dummy dataset

From a metadata file

### Create variable with all possible values of a variable
meta_table$all_values <- mapply(c, meta_table$missing, meta_table$possible_values, SIMPLIFY = FALSE)

### Character vector of function names to adapt the data type
funs <- paste0("as.", meta_table$dset_dtype)

### Change the data type so that it's aligned with the documentation
meta_table[,all_values:= mapply(function(arg,f) f(arg),
                   meta_table$all_values, 
                   lapply(funs, get),
                   SIMPLIFY = F)]

### Number of rows to generate
n_rows <- 10000

### Set seed to always get the same result in PRNGs  
set.seed(12345)

### Sample 10,000 data points from all possible values
dummy_data2 <- lapply(meta_table$all_values, 
                      sample, n_rows, rep=T) %>% as.data.table()

### Name the variables
colnames(dummy_data2) <- meta_table$var_name

### Add ID variable
dummy_data2[, ID:= 1:n_rows]

### Reorder vars
setcolorder(dummy_data2, c('ID', meta_table$var_name))

And use it to automate cleaning - Recode NA

# Remove NAs different on variable-specific codes

### Convert a certain character to NA
recode_to_na <- function(x, char){
  out_x <- x
  out_x[which(out_x %in% char)] = NA
  return(out_x)
}

### Keep only the variables from the metadata 
meta_var <- meta_table$var_name
dummy_data_clean <- dummy_data2[,..meta_var]

### Keep only the vars that we chose in the metadata file
dummy_data_clean<-map2_dfr(dummy_data_clean,meta_table$missing, recode_to_na )

And use it to automate cleaning - Adapt data types

### Vector of function names to change data types 
funs <- paste0("as.", meta_table$pref_dtype)


### Convert all the columns in the data set to the preferred data type 
dummy_data_clean <- as.data.table(mapply(function(a,b) b(a), 
                                         dummy_data_clean, 
                                         lapply(funs, get), 
                                         SIMPLIFY = F))


#str(dummy_data_clean)

TIP 5: Prepare for data vises

Choose (and adapt) your theme
Choose or create your color palette
Create your custom functions

Choose and adapt your theme

dcc_theme <- theme_minimal() + 
  theme(panel.grid.major.x = element_blank(),
        axis.title = element_blank(),
        text = element_text(color = 'gray60', size = rel(3)),
        axis.text = element_text(color = 'gray20'),
        legend.position="bottom"
        )

theme_set(dcc_theme)

Choose and adapt your theme

plot_iris()

Choose or create your color palette

Use pre-made palettes, e.g. from viridis()
Or save your own selection of colors in a vector
Inspect color palettes with scales::show_col()
Change ggplot default colors with options()

Choose or create your color palette

Use pre-made palettes, e.g. from viridis()
Or save your own selection of colors in a vector

tud_cols  <- c('#00A6D6','#FFFFFF', '#000000', '#0C2340', '#00B8C8'  )
cols_cont <- viridis::viridis(9)
tud_qual  <- c('#00B8C8', '#EC6842', '#FFB81C', '#EF60A3' )

Choose or create your color palette

Inspect color palettes with scales::show_col()

scales::show_col( tud_cols  ); scales::show_col( cols_cont  )

Choose or create your color palette

Change ggplot default colors with options()

opts <- options()  # save old options

options(ggplot2.continuous.colour="viridis")
options(ggplot2.continuous.fill = "viridis")

options(ggplot2.discrete.colour=tud_qual)
options(ggplot2.discrete.fill = tud_qual)

Choose or create your color palette

plot_iris()

Create your custom functions

###### Function to visualise the distribution of birthday months
viz_birth_month <- function(single_tab){
  
  levels(single_tab$GBAGEBOORTEMAAND) <- c(
    "January","February", "March", "April", 
    "May", "June", "July", "August",
    "September", "October", "November", "December"
    )
  
  q <- ggplot(data=single_tab, aes(x=GBAGEBOORTEMAAND)) + 
    geom_bar(fill="orange") + 
    ggtitle("Distribution of birthdays by month in register population") +
    coord_flip() + 
    scale_x_discrete(limits = rev(levels(single_tab$GBAGEBOORTEMAAND))) +
    theme_minimal()
  return(q)
}

TIP 6: When you get the data …

Store the data in the right location (e.g. data/ folder)
See the data structure with: summary() or str()
See if the the data you need is there (confront with metadata file?)
Do initial data cleaning, e.g.:
- Recode NAs
- Remove white spaces
- Recode
- Check and adapt data types
- Make sure dates are read in the right format (try lubridate())

Tips to prepare code without data at hand

R Cafe August

Aleksandra & Clementine

2022-08-31

TIP 1: Prepare your environment

Setup R studio

Load packages

Choose relative paths!

TIP 2: Generalise your code

Functions

Functions

(l)apply() it

Create objects!

Avoid repetition

TIP 3: Read(write) data documentation

E.g.: CBS

E.g.: EULFS

Create metadata object

Create metadata object

TIP 4: Prepare to read data

Prepare for a single file

Get all the files from data directory

Prepare a dummy dataset

Prepare a dummy dataset

And use it to automate cleaning - Recode NA

And use it to automate cleaning - Adapt data types

TIP 5: Prepare for data vises

Choose and adapt your theme

Choose and adapt your theme

Choose and adapt your theme

Choose or create your color palette

Choose or create your color palette

Choose or create your color palette

Choose or create your color palette

Choose or create your color palette

Create your custom functions

TIP 6: When you get the data …

Example : CBS data

Discussion

Next R Cafe