Assignment 02: Data Wrangling & Tidying with Tidyverse

🎯 Learning Objectives

After completing this assignment, you should be able to:

Import and explore real-world datasets from the TidyTuesday project.
Use dplyr functions to filter, arrange, and select subsets of data.
Create new variables with mutate().
Summarize grouped data with group_by() + summarise().
Reshape messy datasets using pivot_longer() and pivot_wider().
Apply piping (|>) for readable workflows.

📂 Dataset

For this assignment, each student should choose any dataset from the TidyTuesday GitHub repository.

👉 Recommendations

Pick a dataset that interests you (e.g., food, sports, education, social trends, space, animals).
Avoid datasets that are too small (fewer than 100 rows) or too complicated (multiple nested tables).
Read the description and context of the dataset on the TidyTuesday page to understand its variables and structure.
In my experience, it is very tempting to keep searching for “the one” perfect dataset. Do not spend too much time on this: just pick one and start working with it.

Load the dataset directly from GitHub using readr::read_csv(), or follow the loading instructions on the TidyTuesday page for that week. For example, the following code uses the tidytuesdayR package to load the dataset for the week of October 22, 2024:

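# install.packages("tidytuesdayR")  # run once if the package is not installed
library(tidytuesdayR)

# Download every file published for the given TidyTuesday week into a list
tuesdata <- tidytuesdayR::tt_load('2024-10-22')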

---- Compiling #TidyTuesday Information for 2024-10-22 ----
--- There is 1 file available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 1: "cia_factbook.csv"

tuesdata$cia_factbook

# A tibble: 259 × 11
   country       area birth_rate death_rate infant_mortality_rate internet_users
   <chr>        <dbl>      <dbl>      <dbl>                 <dbl>          <dbl>
 1 Russia      1.71e7       11.9      13.8                   7.08       40853000
 2 Canada      9.98e6       10.3       8.31                  4.71       26960000
 3 United Sta… 9.83e6       13.4       8.15                  6.17      245000000
 4 China       9.60e6       12.2       7.44                 14.8       389000000
 5 Brazil      8.51e6       14.7       6.54                 19.2        75982000
 6 Australia   7.74e6       12.2       7.07                  4.43       15810000
 7 India       3.29e6       19.9       7.35                 43.2        61338000
 8 Argentina   2.78e6       16.9       7.34                  9.96       13694000
 9 Kazakhstan  2.72e6       19.6       8.31                 21.6         5299000
10 Algeria     2.38e6       24.0       4.31                 21.8              NA
# ℹ 249 more rows
# ℹ 5 more variables: life_exp_at_birth <dbl>, maternal_mortality_rate <dbl>,
#   net_migration_rate <dbl>, population <dbl>, population_growth_rate <dbl>
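
As a rough sketch of the readr::read_csv() route and of a first look at the data: the URL in the comment follows the pattern I assume TidyTuesday uses for its raw files, so confirm it against the README for your week. The runnable part reads a three-row stand-in (values copied, as rounded, from the printed tibble above) so that it works offline. The names csv_text and cia_sample are illustrative only.

```r
library(readr)
library(dplyr)

# Assumed URL pattern for the raw file; check the week's README before using it.
# cia_factbook <- read_csv(
#   "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-10-22/cia_factbook.csv"
# )

# Offline stand-in: three rows and four columns taken from the printed output.
csv_text <- "country,area,birth_rate,internet_users
Russia,17100000,11.9,40853000
Canada,9980000,10.3,26960000
Algeria,2380000,24.0,NA"

# I() tells read_csv() that the string is literal data, not a file path.
cia_sample <- read_csv(I(csv_text), show_col_types = FALSE)

glimpse(cia_sample)   # column names, types, and the first values
head(cia_sample, 2)   # first two rows
summary(cia_sample)   # per-column summaries, including the NA count
```

Note that read_csv() treats the string NA as a missing value by default, which is why internet_users stays numeric.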

📝 Tasks

Important

The tasks below are only a basic scenario. Take a researcher's initiative: ask your own questions about the data and find answers to them. This is much more interesting than simply “memorizing” the language’s syntax.

Import and Explore the Data:
- Load the dataset into R.
- Use functions like head(), glimpse(), and summary() to understand its structure and contents.
Filtering & Selecting:
- Apply at least two filtering conditions (e.g., values above a threshold, excluding missing data).
- Select 3–5 relevant variables. Show the first 10 rows.
- Find the top 5 records according to some numeric variable of your choice.
Mutating Variables:
- Create a new column that is a ratio, difference, or transformation of existing variables.
- Identify the record with the maximum value of this new variable.
Grouping & Summarizing:
- Group the data by a categorical variable and calculate summary statistics (mean, median, count) for a numeric variable.
- Present the results in a clear table.
- Interpret the findings.
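
To make the tasks above concrete, here is one possible sketch on a small sample built from the rows printed in the Dataset section (the row whose country name is truncated in the printout is left out). The thresholds, the derived columns natural_increase and size_class, and all object names are my own assumptions for illustration. Your dataset will call for different choices.

```r
library(dplyr)
library(tibble)

# Nine rows copied (as rounded) from the printed tibble.
cia_sample <- tribble(
  ~country,     ~area,  ~birth_rate, ~death_rate, ~internet_users,
  "Russia",     1.71e7, 11.9,        13.8,        40853000,
  "Canada",     9.98e6, 10.3,        8.31,        26960000,
  "China",      9.60e6, 12.2,        7.44,        389000000,
  "Brazil",     8.51e6, 14.7,        6.54,        75982000,
  "Australia",  7.74e6, 12.2,        7.07,        15810000,
  "India",      3.29e6, 19.9,        7.35,        61338000,
  "Argentina",  2.78e6, 16.9,        7.34,        13694000,
  "Kazakhstan", 2.72e6, 19.6,        8.31,        5299000,
  "Algeria",    2.38e6, 24.0,        4.31,        NA_real_
)

# Filtering & selecting: two conditions, then a subset of columns.
filtered <- cia_sample |>
  filter(birth_rate > 12, !is.na(internet_users)) |>
  select(country, birth_rate, death_rate, internet_users)
head(filtered, 10)

# Top 5 records by a numeric variable.
top5 <- cia_sample |>
  slice_max(internet_users, n = 5)

# Mutating: a difference of two existing columns (hypothetical name).
with_increase <- cia_sample |>
  mutate(natural_increase = birth_rate - death_rate)

# Record with the maximum value of the new variable.
top_increase <- with_increase |>
  slice_max(natural_increase, n = 1)

# Grouping & summarising: the sample has no categorical column,
# so one is derived from area with an arbitrary cut-off.
by_size <- cia_sample |>
  mutate(size_class = if_else(area >= 5e6, "large", "small")) |>
  group_by(size_class) |>
  summarise(
    mean_birth   = mean(birth_rate),
    median_birth = median(birth_rate),
    n            = n()
  )
by_size
```

With real data, prefer a categorical variable that already exists in the dataset over a derived one, and pass na.rm = TRUE to mean() and median() when the summarised column contains missing values.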

Note

For some datasets, you may need to join multiple tables or tidy the data first using pivot_longer() or pivot_wider(). Feel free to do so if necessary.
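
A minimal sketch of what such reshaping and joining might look like, using three rows from the printed sample. The column names rate_type and value and the small regions lookup table are hypothetical and exist only for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

rates <- tribble(
  ~country,  ~birth_rate, ~death_rate,
  "Canada",  10.3,        8.31,
  "Brazil",  14.7,        6.54,
  "Algeria", 24.0,        4.31
)

# Wide -> long: one row per country and rate type.
rates_long <- rates |>
  pivot_longer(
    cols      = c(birth_rate, death_rate),
    names_to  = "rate_type",
    values_to = "value"
  )

# Long -> wide: back to one row per country.
rates_wide <- rates_long |>
  pivot_wider(names_from = rate_type, values_from = value)

# Joining: attach a column from a second (hypothetical) lookup table.
regions <- tribble(
  ~country,  ~region,
  "Canada",  "Americas",
  "Brazil",  "Americas",
  "Algeria", "Africa"
)

rates_with_region <- rates |>
  left_join(regions, by = "country")
```

The long form is usually the more convenient one for group_by() + summarise(), since the former column names become values of a single grouping variable.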

📤 Submission

Submit your R script (.R file).
Your code must run without errors.
Add short comments explaining what each step does.