Assignment 02: Data Wrangling & Tidying with Tidyverse

🎯 Learning Objectives

After completing this assignment, you should be able to:

  • Import and explore real-world datasets from the TidyTuesday project.
  • Use dplyr functions to filter, arrange, and select subsets of data.
  • Create new variables with mutate().
  • Summarize grouped data with group_by() + summarise().
  • Reshape messy datasets using pivot_longer() and pivot_wider().
  • Apply piping (|>) for readable workflows.

📂 Dataset

For this assignment, each student should choose any dataset from the TidyTuesday GitHub repository.

Tip 👉 Recommendations
  • Pick a dataset that interests you (e.g., food, sports, education, social trends, space, animals).
  • Avoid datasets that are too small (fewer than 100 rows) or too complicated (multiple nested tables).
  • Read the description and context of the dataset on the TidyTuesday page to understand its variables and structure.
  • From my experience, it is very tempting to keep searching for “the one” dataset. Do not spend too much time on this: just pick one and start working with it.

Load the dataset directly from GitHub using readr::read_csv(), or follow the loading instructions on the TidyTuesday page for that week. For example, the following code uses the tidytuesdayR package to load the dataset for the week of October 22, 2024:

library(tidytuesdayR)

tuesdata <- tidytuesdayR::tt_load('2024-10-22')
---- Compiling #TidyTuesday Information for 2024-10-22 ----
--- There is 1 file available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 1: "cia_factbook.csv"
tuesdata$cia_factbook
# A tibble: 259 × 11
   country       area birth_rate death_rate infant_mortality_rate internet_users
   <chr>        <dbl>      <dbl>      <dbl>                 <dbl>          <dbl>
 1 Russia      1.71e7       11.9      13.8                   7.08       40853000
 2 Canada      9.98e6       10.3       8.31                  4.71       26960000
 3 United Sta… 9.83e6       13.4       8.15                  6.17      245000000
 4 China       9.60e6       12.2       7.44                 14.8       389000000
 5 Brazil      8.51e6       14.7       6.54                 19.2        75982000
 6 Australia   7.74e6       12.2       7.07                  4.43       15810000
 7 India       3.29e6       19.9       7.35                 43.2        61338000
 8 Argentina   2.78e6       16.9       7.34                  9.96       13694000
 9 Kazakhstan  2.72e6       19.6       8.31                 21.6         5299000
10 Algeria     2.38e6       24.0       4.31                 21.8              NA
# ℹ 249 more rows
# ℹ 5 more variables: life_exp_at_birth <dbl>, maternal_mortality_rate <dbl>,
#   net_migration_rate <dbl>, population <dbl>, population_growth_rate <dbl>

📝 Tasks

Important

The tasks below are only a basic scenario. Please take a researcher’s initiative: ask questions about the data and find answers to them. This is much more interesting than simply “memorizing” the language’s syntax.

  1. Import and Explore the Data:
    • Load the dataset into R.
    • Use functions like head(), glimpse(), and summary() to understand its structure and contents.
  2. Filtering & Selecting:
    • Apply at least two filtering conditions (e.g., values above a threshold, excluding missing data).
    • Select 3–5 relevant variables. Show the first 10 rows.
    • Find the top 5 records according to some numeric variable of your choice.
  3. Mutating Variables:
    • Create a new column that is a ratio, difference, or transformation of existing variables.
    • Identify the record with the maximum value of this new variable.
  4. Grouping & Summarizing:
    • Group the data by a categorical variable and calculate summary statistics (mean, median, count) for a numeric variable.
    • Present the results in a clear table.
    • Interpret the findings.
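
As one possible starting point, the four tasks above can be sketched on a tiny made-up table. Everything in the toy table (column names and values) is invented for illustration and does not come from any TidyTuesday dataset, so swap in your own data and variables. Because the toy table has only six rows, the sketch takes the top 3 records instead of the top 5.

```r
library(dplyr)

# Hypothetical toy data: all names and values are invented for illustration.
toy <- tibble::tibble(
  region     = c("North", "North", "South", "South", "East", "East"),
  unit       = c("A", "B", "C", "D", "E", "F"),
  population = c(1200, 300, 4500, NA, 800, 2600),
  area       = c(60, 10, 90, 40, 20, 130)
)

# 1. Explore the structure and contents.
glimpse(toy)
summary(toy)

# 2. Filter on two conditions, then keep only the relevant columns.
subset_data <- toy |>
  filter(!is.na(population), area >= 20) |>
  select(region, unit, population, area)

head(subset_data, 10)

# Top records by a numeric variable (top 3 here because the table is tiny).
top_units <- toy |>
  slice_max(population, n = 3)

# 3. Create a new ratio column and find the record with its maximum.
with_density <- subset_data |>
  mutate(density = population / area)

densest <- with_density |>
  slice_max(density, n = 1)

# 4. Group by a categorical variable and summarise a numeric one.
by_region <- with_density |>
  group_by(region) |>
  summarise(
    mean_density   = mean(density),
    median_density = median(density),
    n              = n()
  )

by_region
```

The interpretation step is not something code can do for you: look at the summary table and describe in a comment what the differences between groups might mean.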
Note

For some datasets, you may need to join multiple tables or tidy the data first using pivot_longer() or pivot_wider(). Feel free to do so if necessary.
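
The reshaping and joining mentioned in this note can look roughly like the sketch below. The tables wide and regions, and all of their values, are hypothetical and exist only to show the shape of the calls; your dataset will have different columns and keys.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per year (values invented).
wide <- tibble::tibble(
  country = c("A", "B"),
  `2020`  = c(10, 20),
  `2021`  = c(12, 18)
)

# Tidy it: one row per country-year combination.
long <- wide |>
  pivot_longer(
    cols            = -country,
    names_to        = "year",
    values_to       = "value",
    names_transform = list(year = as.integer)
  )

# Hypothetical lookup table, joined on the shared key column.
regions <- tibble::tibble(
  country = c("A", "B"),
  region  = c("North", "South")
)

joined <- long |>
  left_join(regions, by = "country")

# pivot_wider() reverses the reshape, e.g. for a presentation table.
back_to_wide <- long |>
  pivot_wider(names_from = year, values_from = value)
```

The long form is usually the easier one to filter, group, and summarise; the wide form tends to be more readable as a final table.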

📤 Submission

  • Submit your .r file.
  • Your code must run without errors.
  • Add short comments explaining what each step does.