Introduction

What is a Data Science Workflow?

  • A data science workflow is a structured, repeatable process for analyzing data.

  • It typically includes stages such as data cleaning, exploration, modeling, and evaluation.

  • Rather than jumping straight into modeling, a good workflow emphasizes understanding and preparing the data first.

  • A clear workflow helps ensure that analyses are organized, transparent, and reproducible.

  • Reproducibility means that you (or someone else) can rerun the same analysis and get the same results.

  • Workflows are especially important when working on collaborative projects or scaling to larger datasets.

  • In this workshop, we will focus on a simple, practical workflow:

    • Clean → Explore → Model → Evaluate → Scale
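The five stages above can be sketched in a few lines of base R. This is an illustrative outline only, using the built-in airquality dataset as a stand-in for whatever data the workshop uses:

```r
# Illustrative sketch of the workflow stages in base R, using the
# built-in airquality dataset (an assumption; not the workshop's data).

# Clean: drop rows with missing values
aq <- na.omit(airquality)

# Explore: summarize the variable we plan to model
summary(aq$Ozone)

# Model: fit a simple linear regression
fit <- lm(Ozone ~ Temp, data = aq)

# Evaluate: inspect how well the model fits
summary(fit)$r.squared
```

The Scale stage reuses this same structure on larger data or bigger computing environments, which is exactly why a repeatable workflow matters.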

Why focus on workflows?

  • Many beginners focus on learning individual tools or models, but struggle to connect them into a coherent process.

  • A well-defined workflow helps you know:

    • where to start
    • what to do next
    • how to avoid common mistakes

  • Most of the time in real-world data science is spent on data cleaning and preparation, not modeling.

  • A consistent workflow makes your work easier to debug, share, and extend.

  • Proper workflows help prevent common pitfalls such as:

    • overfitting models
    • evaluating models on the wrong data
    • making unreliable predictions (extrapolation)
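One concrete habit that guards against the second pitfall (evaluating models on the wrong data) is holding out a test set. A minimal base R sketch, again using the built-in airquality dataset as a stand-in:

```r
# Minimal train/test split in base R (airquality is a stand-in dataset).
set.seed(42)                               # makes the split reproducible
aq  <- na.omit(airquality)                 # clean first
idx <- sample(seq_len(nrow(aq)), size = floor(0.8 * nrow(aq)))
train <- aq[idx, ]                         # 80% of rows for fitting
test  <- aq[-idx, ]                        # 20% held out for evaluation

fit  <- lm(Ozone ~ Temp, data = train)     # model sees only training rows
pred <- predict(fit, newdata = test)       # predictions on unseen data
rmse <- sqrt(mean((test$Ozone - pred)^2))  # honest error estimate
```

Computing the error on `test` rather than `train` gives an estimate that reflects performance on new data, which also makes overfitting visible.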

  • The same workflow you learn here can be applied to:

    • different datasets
    • different models
    • larger computing environments (including HPC)

  • The goal of this workshop is not just to build models, but to build good habits for working with data in R.