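3 Data Cleaning
In this section, we clean and prepare the dataset for analysis. Data cleaning is a critical step in any workflow because real-world data is often incomplete or inconsistent.
3.1 Clean and Transform the Data
We will:
- select the relevant variables
- convert the categorical variables to factors
- create a new feature, large_body, that flags observations whose body mass is above the median
- remove rows with missing values
# Assumes dplyr is loaded and df_raw holds the raw data
df <- df_raw %>%
  # Keep only the variables used in the analysis
  select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm,
         body_mass_g, sex) %>%
  mutate(
    # Convert the categorical variables to factors
    species = as.factor(species),
    island = as.factor(island),
    sex = as.factor(sex),
    # 1 if body mass exceeds the median of all non-missing body masses, else 0
    # (the median is computed before the rows with missing values are dropped)
    large_body = ifelse(body_mass_g > median(body_mass_g, na.rm = TRUE), 1, 0)
  ) %>%
  # Drop rows with a missing value in any measurement or in sex
  filter(
    !is.na(bill_length_mm),
    !is.na(bill_depth_mm),
    !is.na(flipper_length_mm),
    !is.na(body_mass_g),
    !is.na(sex)
  )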
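Before dropping rows, it can help to see where the missing values actually are. The following is a minimal sketch using only base R; `toy` is a hypothetical stand-in for `df_raw`, and its values are invented for illustration.

```r
# Hypothetical stand-in for df_raw; values are invented for illustration.
toy <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 47.5, 49.0),
  body_mass_g    = c(3750L, 3800L, NA, 3700L),
  sex            = c("male", NA, "female", "female")
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one missing value. These are the rows
# that a filter on !is.na(...) over every column would drop.
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)
```

Counting per column shows which variables drive the row loss, which is harder to see once the rows are already gone.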
3.2 Inspect the Cleaned Data
str(df)
tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ large_body       : num [1:333] 0 0 0 0 0 0 1 0 0 1 ...
summary(df)
      species         island    bill_length_mm  bill_depth_mm
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30
                                 Mean   :43.99   Mean   :17.16
                                 3rd Qu.:48.60   3rd Qu.:18.70
                                 Max.   :59.60   Max.   :21.50
 flipper_length_mm  body_mass_g       sex        large_body
 Min.   :172        Min.   :2700   female:165   Min.   :0.0000
 1st Qu.:190        1st Qu.:3550   male  :168   1st Qu.:0.0000
 Median :197        Median :4050                Median :0.0000
 Mean   :201        Mean   :4207                Mean   :0.4835
 3rd Qu.:213        3rd Qu.:4775                3rd Qu.:1.0000
 Max.   :231        Max.   :6300                Max.   :1.0000
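The properties read off the output above (no missing values, factor columns, a 0/1 indicator) can also be checked programmatically rather than by eye. This is a minimal sketch in base R; `toy_clean` is a hypothetical stand-in for `df` with invented values, not the real dataset.

```r
# Hypothetical stand-in for the cleaned data frame; values are invented.
toy_clean <- data.frame(
  sex         = factor(c("male", "female", "female", "male")),
  body_mass_g = c(3750L, 3800L, 4675L, 4400L),
  large_body  = c(0, 0, 1, 1)
)

# No missing values should remain after cleaning.
stopifnot(!anyNA(toy_clean))

# Categorical columns should be factors.
stopifnot(is.factor(toy_clean$sex))

# The indicator should contain only 0 and 1.
stopifnot(all(toy_clean$large_body %in% c(0, 1)))

# Number of observations in each class of the indicator.
large_body_counts <- table(toy_clean$large_body)
print(large_body_counts)
```

Each `stopifnot()` call halts with an error if its condition fails, so checks like these could sit directly after the cleaning step.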
3.3 Compare Before and After
This comparison shows how many observations were removed during cleaning. Dropping rows with missing values changes the sample size, and possibly the composition, of the dataset used in later analysis.
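One way to make such a comparison is to put the row counts before and after cleaning side by side. The sketch below uses base R on made-up data; `toy_raw`, `toy_clean`, and `comparison` are hypothetical names, and the values are invented.

```r
# Hypothetical raw data with some missing values; values are invented.
toy_raw <- data.frame(
  bill_length_mm = c(39.1, NA, 47.5, 49.0, 36.7),
  sex            = c("male", "female", NA, "female", "male")
)

# Keep only the rows without any missing value.
toy_clean <- toy_raw[complete.cases(toy_raw), ]

# Row counts before and after cleaning, plus the number of rows removed.
comparison <- data.frame(
  stage = c("before", "after"),
  rows  = c(nrow(toy_raw), nrow(toy_clean))
)
comparison$removed <- nrow(toy_raw) - comparison$rows
print(comparison)
```

With the objects from this section, the same idea amounts to comparing `nrow(df_raw)` with `nrow(df)`.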