C R at GAD

The following are the GAD sensible defaults for R:

C.1 R Version & IDE

The dominant IDE for R is Rstudio, which comes packaged with R. For a new project you should use the latest version of Rstudio available from the software portal.

C.2 General

Default to packages from the Tidyverse.These have been carefully designed to work together effectively as part of a modern data analysis workflow. More info can be found here: R for Data Science by Hadley Wickham.

For example:

Prefer tibbles to data.frames
Use ggplot2 rather than base graphics
Use the pipe %>% rather than nesting function calls. (…but not always e.g. see here).
Prefer purrr to the apply family of functions. See here

C.3 Packages

Recommended Packages:

Data Manupulation and Tidying
- dplyr
- tidyr
Working with messy spreadsheets
- tidyxl
- unpivotr
Working with large datasets:
- data.table
- dtplyr
Database Interface
- dbplyr
Plotting
- ggplot2
- plotly
Mapping
- sf
- leaflet
Reproducibility
- renv
Producing gov.uk style output
- govstyle (Government Theme for ggplot2)
- govdown
Machine Learning
- tensorflow
- h2o

C.4 Project Workflow

Always work in a project. See the guide to Using Projects.

Projects functionality is broken in GAD’s packaged version of Rstudio - see the fix here

C.5 Packaging Your Code

Packages are the fundamental unit of reproducible R code. Therefore, if possible, build an R Package to share and document your code.

Hadley’s book on R Packages is an effective guide on how to produce a package.

The usethis package has lots of useful shortcuts for package builders.

C.6 Managing Dependencies

There are two key competing ways of managing dependencies for an R Project:

packrat - current established way to manage R dependencies
renv - rapidly maturing, successor to packrat.

C.6.1 Using old versions of packages

You may come across code which doesn’t work because it depends on a different version of a package to the one you have.

Fortunately, Microsoft keep daily snapshots of CRAN and store them on the Microsoft R Application Network.

The checkpoint package from Microsoft lets you use these snapshots to install packages as if it were any day since 2017-07-01.

Simply start your script with:

library(checkpoint)

checkpoint(snapshotDate = "2015-01-15",
           checkpointLocation = getwd())

This will download and fetch all the packages as they existed on the given date and install them to a library on your home drive.

Notes:

If the code depends on BH (a lot of tidyverse code will) then this will take some time!
By default checkpoint puts packages on your P drive - this will be slow.
- You can use the checkpointLocation argument to tell checkpoint to use the C drive.

C.7 Error Handling

Base R includes the try() and tryCatch() functions for handling errors. You can find an example of basic use of these on r-bloggers.

Effective error handling in R requires understanding the conditions system. There is a good chapter on this in Hadley’s Advanced R book

If you are iterating over many inputs, it is recommended that you use the safely() family of functions from purrr to create versions which return errors within a list for handling at a later stage.

C.8 Unit Testing

Use the testthat package for performing unit tests. For details see the ‘tests’ chapter of R Packages.

C.9 Assignment

C.9.1 The `<-` and `->` operators

Assignment should always involve assigning a value on the right-hand-side of the <- operator to the object on its left hand side. The -> operator should never be used, and the = operator should not be used for assignment.

#good
a <- 1

#bad
1 -> a

#also bad
a = 1

An exception is where the assign() function or <<- are used, in which case assignment can be done without <- or ->. This should only be done when <- cannot be used. And as with the above, never use ->> instead of <<- .

Note - if the code below is confusing, section 10.2.4 here may be of use: https://adv-r.hadley.nz/function-factories.html#stateful-funs

#only do this when `<-` will not work, e.g. when
#the name of the object being assigned to is stored 
#as a string:
assign('a', 1, envir = globalenv())

#if assigning to the scope of the enclosing environment, 
#use `<<-` 
create_counter <- function() {
  i <- 0
  
  function() {
    i <<- i + 1
    return(i)
  }
}

#but don't use `<<-` when `<-` would work just as well!

#good:
a <- 1

#bad:
a <<- 1

#even worse:
1 ->> a

C.10 Commenting

Code should always be well commented; the sections below provide guidance on good commenting practice.

C.10.1 Position of comments

Comments should appear above related sections of code.

#Good:

  #assigning value of 1 to object a:
  a <- 1

#Bad:

  a <- 1 #assigning value of 1 to object a

C.10.2 Detail of comments

It is generally preferable to err on the side of including more comments rather than not commenting enough. Nevertheless, you should be judicious in your use of comments - adding a comment to explain something obvious will unnecessarily clutter your code. The most important places to include comments are where the code is doing something complicated and non-obvious, particularly if it is using functions or syntax that others may not be familiar with. This being the case, comment usage should sometimes be adjusted based on the technical proficiency of others working on the project: if you know that the code will be checked by someone who has a lot of experience in R, you will rarely need comments that help the reader to understand the syntax being employed. Conversely, if it will be checked by a relative novice, it may be helpful to provide more details comments about the operations being performed by the code:

#The code chunk below changes the format of a dataframe 
#from a 'wide' format to a 'long' format. The extent 
#of commenting required may depend on the technical
#proficiency of other people who will work with the 
#code. If this is subject to uncertainty, err on the
#side of more thorough explanations

data.frame(
  animal_id = 1:10, 
  weight = rnorm(n = 10, mean = 5, sd = 3), 
  height = rnorm(n = 10, mean = 7, sd = 3) 
  ) %>%
  pivot_longer(
    c(weight, height)
  )

C.10.3 Outdated comments

When making changes to code, it is important to correspondingly update comments. Otherwise, comments may become misleading:

#Uh-oh! Somebody has updated the 'convert_units()' 
#function so that it converts from grams to 
#kilograms instead of  grams to tonnes. But they
#didn't update the comment!

  #The function below converts from grams to tonnes:
  convert_units <- function(value) {
    value <- value / 1000
    return(value)
  }

C.10.4 Different commenting contexts

In some cases, such as when commenting non-code sections of Rmarkdown documents or pre-pending comments to functions within your own packages using roxygen or similar, the above conventions will not suffice because different commenting syntax will be involved. These situations are likely to be rare in GAD’s work and are thus beyond the scope of this guidance.

C.11 Indentation and spacing

When coding in R, indentation and spacing generally has absolutely no impact on the way in which code is evaluated. We should use the freedom that this allows us to space and indent code in a manner that maximises readability. A few good practices to follow are:

Opening brackets should be followed by a new line plus tab indendtation
Closing brackets should be preceded by a new line plus removal of tab indentation
We can make exceptions to the above when brackets only enclose a small expression
The <-, = and <<- operators should be followed and preceded by a space, as should commas
Within tidyverse and magrittr, pipe operators like %>%, %$% and %T>% should come at the end of a line

#good:
data <- data.frame(
  animal_id = 1:10, 
  #it's OK to have rnorm() on a single line because the
  #brackets don't enclose much code.
  weight = rnorm(n = 10, mean = 5, sd = 3), 
  height = rnorm(n = 10, mean = 7, sd = 3) 
  ) %>%
  pivot_longer(
    c(weight, height)
  )

#bad:
data<-data.frame(animal_id=1:10,weight = rnorm(n = 10, 
mean = 5, sd = 3),height=rnorm(
  n = 10, mean = 7,sd=3
)) %>% pivot_longer(c(weight, height)
)

C.12 Data management

If possible, it is better to store data in tidy dataframes rather than, for example, in nested lists:

#good:
data <- data.frame(
  gender = c('male', 'male','female','female'), 
  scheme = c('legacy', 'care', 'legacy','care'),
  value = c(1000, 2000, 3000, 4000)
)

#bad:
data <- list(
  male = list(legacy = 1000, care = 2000),
  female = list(legacy = 3000, care = 4000)
)

#It often makes more sense to create a tibble
#rather than a standard dataframe, but this
#can depend on context:

#also good:
data <- dplyr::tibble(
  gender = c('male', 'male','female','female'), 
  scheme = c('legacy', 'care', 'legacy','care'),
  value = c(1000, 2000, 3000, 4000)
)

C.13 Naming conventions

Within a given piece of code, all objects and names within vectors, lists or dataframes should follow camel case or snake case convention, but not both. Camel case involves writing names with the first chunk of the name uncapitalised and all subseqeunt chunks capitalised; a ‘chunk’ in this context means a part of the name that would be surrounded by spaces in normal writing. Snake case involves replacing spaces with underscores and removing all capitalisation:

thisObjectNameIsWrittenInCamelCase

this_object_name_is_written_in_snake_case

In addition, we should try to avoid making names too long (unlike the above), but this should be balanced against the need to avoid shortening long strings into incomprehensible abbreviations

#bad - not camel case or snake case:
my_Object_Nm <- 1
AnotherOBJECT <- 2

#bad - mixing camel case and snake case in same code:
my_object_name <- 1
anotherObject <- 2

#bad - confusing abbreviations or arbitrary names make it
#difficult to easily discern what the objects are supposed
#to represent:
m_o_n <- 1
x <- 2

#good - exclusively camel case, 
#abbreviations comprehensible:
myObjName <- 1
anotherObj <- 2

#good - exclusively snake case, 
#abbreviations comprehensible:
my_obj_name <- 1
another_obj <- 2

C.14 Folder structure and path specification

In general, R code should be contained within folders that have a pre-arranged structure from the outset, allowing for the use of relative file paths. The advantage of this is that the relevant folder can contain all code, data and other necessary files and can be moved around to different parent folders and locations, without paths needing to be re-written when, for example, code is sent to a client:

###Assume we have a folder structure as follows and that
###the code below is found in 'our_code.R'

#project_folder
#   - data_folder
#     - dataset_1.csv
#     - dataset_2.csv
#   - code_folder
#     - our_code.R
#     - more_code.R

#good:
  
  #load packages
  library(readr)

  #read in data using relative paths
  dataset_1 <- read_csv('
      ../data_folder/dataset_1.csv'
      )
  dataset_2 <- read_csv('
      ../data_folder/dataset_2.csv'
      )
  #create user-defined functions
  source('./more_code.R')
  
#bad:
  
  #load packages
  library(readr)

  #read in data using absolute paths
  dataset_1 <- read_csv(
    '<long_and_convoluted_path>/dataset_1.csv'
    )
  dataset_2 <- read_csv(
    '<long_and_convoluted_path>/dataset_2.csv'
    )
  
  #create user-defined functions
  source('<long_and_convoluted_path>/more_code.R')

C.15 Reading in data from csv files

When reading in data from a csv file, it is preferable to use the read_csv() function from the readr package, rather than the base function read.csv() This is because the latter is liable to behave differently for different users, see: https://r4ds.had.co.nz/data-import.html