3 Data Cleaning

In this section, we clean and prepare the dataset for analysis. Data cleaning is a critical step in any analysis workflow, because real-world data is often incomplete or inconsistent.

3.1 Clean and Transform the Data

We will:

select relevant variables
convert categorical variables to factors
create a new binary feature, large_body (1 if body mass is above the median, 0 otherwise)
remove rows with missing values

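# Assumes dplyr is loaded and df_raw was created in an earlier section.
df <- df_raw %>%
  # Keep only the relevant variables
  select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm,
         body_mass_g, sex) %>%
  mutate(
    # Store categorical variables as factors
    species = as.factor(species),
    island = as.factor(island),
    sex = as.factor(sex),
    # 1 if body mass is above the median, 0 otherwise; the median is taken
    # over all non-missing values, before any rows are filtered out
    large_body = ifelse(body_mass_g > median(body_mass_g, na.rm = TRUE), 1, 0)
  ) %>%
  # Drop rows with a missing measurement or missing sex
  filter(
    !is.na(bill_length_mm),
    !is.na(bill_depth_mm),
    !is.na(flipper_length_mm),
    !is.na(body_mass_g),
    !is.na(sex)
  )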
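
Before filtering, it can help to see which columns actually contain missing values. The sketch below is one way to do that in base R. It uses a small made-up data frame, `toy`, as a stand-in for `df_raw`; its values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical stand-in for df_raw; the values are invented for illustration.
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  body_mass_g    = c(3750, 3800, NA, 3450),
  sex            = c("male", "female", NA, NA)
)

# Count missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Count rows with no missing value in any column; these are the rows
# that a filter on !is.na() for every column would keep.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running the same two calls on the selected columns of `df_raw` would show which variables account for the dropped rows.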

3.2 Inspect the Cleaned Data

str(df)

tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ large_body       : num [1:333] 0 0 0 0 0 0 1 0 0 1 ...

summary(df)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex        large_body    
 Min.   :172       Min.   :2700   female:165   Min.   :0.0000  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:0.0000  
 Median :197       Median :4050                Median :0.0000  
 Mean   :201       Mean   :4207                Mean   :0.4835  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:1.0000  
 Max.   :231       Max.   :6300                Max.   :1.0000
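
Because `large_body` is coded 0/1, its mean is the proportion of rows coded 1. The reported mean of 0.4835 corresponds to 161 of the 333 penguins. One likely reason the share is below one half is that the comparison is strict (`>`), so any penguin exactly at the median is coded 0. The sketch below illustrates this with a short vector of made-up body masses (hypothetical values, not from the dataset).

```r
# Hypothetical body masses in grams, invented for illustration.
body_mass <- c(3200, 3800, 4050, 4050, 4700, 5200)

# Same rule as in the cleaning step: 1 if strictly above the median, else 0.
large_body <- ifelse(body_mass > median(body_mass), 1, 0)

# The mean of a 0/1 indicator is the proportion of ones.
share_large <- mean(large_body)
print(large_body)
print(share_large)

# Reverse check for the reported summary: 161 of 333 rows coded 1.
print(round(161 / 333, 4))
```

Here the two values tied at the median are coded 0, so only two of the six values are flagged.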

3.3 Compare Before and After

nrow(df_raw)

[1] 344

nrow(df)

[1] 333

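Cleaning removed 11 of the 344 observations (about 3%), all of them rows with a missing value in one of the filtered columns. The loss is small here, but it is worth checking whenever rows are dropped, because it changes the sample that every later analysis is based on.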
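
A minimal sketch of that comparison as a calculation, using the two row counts printed above (the variable names are illustrative):

```r
# Row counts reported by nrow(df_raw) and nrow(df) above.
n_before <- 344L
n_after  <- 333L

# Absolute and relative number of rows dropped during cleaning.
n_removed   <- n_before - n_after
pct_removed <- round(100 * n_removed / n_before, 1)

cat(sprintf("%d rows removed (%.1f%% of the raw data)\n", n_removed, pct_removed))
```

In a live session, `nrow(df_raw)` and `nrow(df)` could replace the hard-coded counts so the figures update if the cleaning rules change.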