This book was published by SAGE in March 2021. It is available for purchase on Amazon and elsewhere. Below, you will find links to the RMarkdown source code of the book, copies of the data files used in the book, and preprints of all the chapters.


Doing Data Science in R is about statistical analysis of real-world data using modern tools. It is aimed at those who are currently engaged in, or planning to be engaged in, analysis of statistical data of the kind that might arise in scientific research at or beyond PhD level. It is ostensibly aimed at researchers in the social sciences, but in fact is equally applicable to many other scientific fields, particularly the biological, medical, and life sciences. The data in these fields is complex: there are many variables and complex relationships between them. Analyzing this data almost always requires data wrangling, exploration, and visualization. Above all, it involves statistical modelling of the data using flexible probabilistic models. These models are then used to reason and make predictions about the scientific phenomenon being studied. This book aims to address all of these topics.

Source code

This book was written entirely in RMarkdown and compiled to pdf using GNU Makefiles that are run inside a Docker container. The RMarkdown source code, the requisite R and Stan scripts and data files, and the shell scripts, GNU Makefiles, and Docker files for building the pdfs of the book are all available in this GitHub repository.

Data files

The csv data files used in each chapter are available in this zipfile, which has one subdirectory per chapter containing that chapter's data files.

Chapter preprints

There are 17 chapters in the book, listed below. For each one, you can download a preprint pdf. These are the versions of the chapters as they stood before the book manuscript was sent to the publisher, so there are differences between them and the chapters in the published book.