The big picture

Reproducibility

Author: Prof. Eric Friedlander

Published: Aug 23, 2024

Questions from last class?

The Big Picture

Topics

  • Data analysis life cycle
  • Reproducible data analysis

Source: R for Data Science with additions from The Art of Statistics: How to Learn from Data.

Reproducibility

Reproducibility checklist

What does it mean for an analysis to be reproducible?

Near term goals:

✔️ Can the tables and figures be exactly reproduced from the code and data?

✔️ Does the code actually do what you think it does?

✔️ In addition to what was done, is it clear why it was done?

Long term goals:

✔️ Can the code be used for other data?

✔️ Can you extend the code to do other things?
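As a minimal sketch of what these goals can look like in practice, the R snippet below wraps an analysis in a function, fixes the random seed, and checks that two runs give identical output. The names `summarize_column` and `run_analysis`, the simulated data, and the seed value are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: summarizes one numeric column of any data frame,
# so the same code can be reused on other data (a long-term goal).
summarize_column <- function(data, column) {
  values <- data[[column]]
  data.frame(
    n    = length(values),
    mean = mean(values),
    sd   = sd(values)
  )
}

# Hypothetical analysis: simulate data with a fixed seed so that every run
# produces exactly the same table (a near-term goal).
run_analysis <- function(seed = 2024) {
  set.seed(seed)
  sim <- data.frame(score = rnorm(50, mean = 70, sd = 10))
  summarize_column(sim, "score")
}

first_run  <- run_analysis()
second_run <- run_analysis()

# TRUE when the table is exactly reproduced from the code
identical(first_run, second_run)

# The same function works unchanged on a different (built-in) dataset
summarize_column(mtcars, "mpg")
```

Removing the `set.seed()` call would make the two runs differ, which is one simple way to see why a fixed seed matters for exact reproduction.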

Why is reproducibility important?

When things go wrong

| Reproducibility error | Consequence | Source(s) |
|---|---|---|
| Limitations in Excel data formats | Loss of 16,000 COVID case records in the UK | (Kelion 2020) |
| Automatic formatting in Excel | Important genes disregarded in scientific studies | (Ziemann, Eren, and El-Osta 2016) |
| Deletion of a cell caused rows to shift | Mix-up of which patient group received the treatment | (Wallensteen et al. 2018) |
| Using binary instead of explanatory labels | Mix-up of the intervention with the control group | (Aboumatar and Wise 2019) |
| Using the same notation for missing data and zero values | Paper retraction | (Whitehouse et al. 2021) |
| Incorrectly copying data in a spreadsheet | Delay in the opening of a hospital | (Picken 2020) |

Source: Ostblom and Timbers (2022)
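Two of the error types listed above, shared notation for missing data and zero, and binary instead of explanatory labels, are easy to illustrate in R. The values and group assignments below are made up for illustration; this is a sketch of the general problem, not a reconstruction of the cited incidents.

```r
# Hypothetical measurements: the third value was never recorded.
recorded_as_zero <- c(4.2, 5.1, 0, 4.8)   # missing value coded as 0
recorded_as_na   <- c(4.2, 5.1, NA, 4.8)  # missing value coded as NA

# Coding a missing value as 0 silently pulls the mean down
mean(recorded_as_zero)

# Coding it as NA forces an explicit decision about how to handle it
mean(recorded_as_na)                # NA
mean(recorded_as_na, na.rm = TRUE)  # mean of the three recorded values

# Hypothetical group assignment: 0/1 codes give no hint which group is
# which, while explanatory labels make the groups unambiguous.
group_code  <- c(0, 1, 1, 0)
group_label <- factor(group_code, levels = c(0, 1),
                      labels = c("control", "treatment"))
table(group_label)
```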

Toolkit

  • Scriptability → R

  • Literate programming (code, narrative, output in one place) → Quarto

  • Version control → Git / GitHub (beyond the scope of this course)

R and RStudio

  • R is a statistical programming language

  • RStudio is a convenient interface for R (an integrated development environment, IDE)


Figure: RStudio IDE


Quarto

  • Fully reproducible reports – the analysis is run from the beginning each time you render

  • Code goes in chunks and narrative goes outside of chunks

  • A visual editor makes editing a document feel similar to using a word processor (Google Docs, Word, Pages, etc.)
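To make the split between chunks and narrative concrete, here is a minimal sketch of what a Quarto document might contain. The title and the choice of the built-in `mtcars` dataset are placeholders, not taken from the course templates.

````markdown
---
title: "Example analysis"
format: html
---

This sentence is narrative, so it sits outside any chunk.

```{r}
# This is a code chunk; it is re-run every time the document is rendered
summary(mtcars$mpg)
```

More narrative can follow, for example to interpret the output above.
````

Rendering the file runs the chunk from a fresh start and places its output directly below the code in the finished report.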

How will we use Quarto?

  • Every application exercise and assignment is written in a Quarto document

  • You’ll have a template Quarto document to start with

  • The amount of scaffolding in the template will decrease over the semester

Rest of class

  • Work on HW 0!

References

Alexander, Rohan. 2023. “Telling Stories with Data,” June. https://doi.org/10.1201/9781003229407.
Ostblom, Joel, and Tiffany Timbers. 2022. “Opinionated Practices for Teaching Reproducibility: Motivation, Guided Instruction and Practice.” Journal of Statistics and Data Science Education 30 (3): 241–50. https://doi.org/10.1080/26939169.2022.2074922.