Pick a dataset that interests you (e.g., food, sports, education, social trends, space, animals).
Avoid datasets that are too small (fewer than 100 rows) or too complicated (multiple nested tables).
Read the description and context of the dataset on the TidyTuesday page to understand its variables and structure.
From my experience, I know it is very tempting to keep looking for “the one” dataset. Do not spend too much time searching. Just pick one and start working with it.
Load the dataset directly from GitHub using readr::read_csv(), or follow the instructions on the TidyTuesday page for that week. For example, loading the dataset for the week of October 22, 2024 with the tidytuesdayR package prints the following:
---- Compiling #TidyTuesday Information for 2024-10-22 ----
--- There is 1 file available ---
── Downloading files ───────────────────────────────────────────────────────────
  1 of 1: "cia_factbook.csv"
tuesdata$cia_factbook
# A tibble: 259 × 11
   country       area birth_rate death_rate infant_mortality_rate internet_users
   <chr>        <dbl>      <dbl>      <dbl>                 <dbl>          <dbl>
 1 Russia      1.71e7       11.9      13.8                   7.08       40853000
 2 Canada      9.98e6       10.3       8.31                  4.71       26960000
 3 United Sta… 9.83e6       13.4       8.15                  6.17      245000000
 4 China       9.60e6       12.2       7.44                 14.8       389000000
 5 Brazil      8.51e6       14.7       6.54                 19.2        75982000
 6 Australia   7.74e6       12.2       7.07                  4.43       15810000
 7 India       3.29e6       19.9       7.35                 43.2        61338000
 8 Argentina   2.78e6       16.9       7.34                  9.96       13694000
 9 Kazakhstan  2.72e6       19.6       8.31                 21.6         5299000
10 Algeria     2.38e6       24.0       4.31                 21.8              NA
# ℹ 249 more rows
# ℹ 5 more variables: life_exp_at_birth <dbl>, maternal_mortality_rate <dbl>,
#   net_migration_rate <dbl>, population <dbl>, population_growth_rate <dbl>
Tasks
Important
The tasks below are only a basic scenario. Please take the initiative of a researcher: ask questions about the data and find answers to them. This is much more interesting than simply “memorizing” the language’s syntax.
Import and Explore the Data:
Load the dataset into R.
Use functions like head(), glimpse(), and summary() to understand its structure and contents.
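As a rough illustration of this first look, the sketch below runs the three functions on a hypothetical stand-in table: five rows and four columns copied from the cia_factbook output printed earlier. With a real dataset you would call them on the full table instead.

```r
library(dplyr)

# Hypothetical stand-in: a few rows copied from the printed cia_factbook output.
factbook <- tibble::tribble(
  ~country,  ~area,  ~birth_rate, ~internet_users,
  "Russia",  1.71e7, 11.9,         40853000,
  "Canada",  9.98e6, 10.3,         26960000,
  "China",   9.60e6, 12.2,        389000000,
  "Brazil",  8.51e6, 14.7,         75982000,
  "Algeria", 2.38e6, 24.0,        NA
)

head(factbook, 3)   # first three rows
glimpse(factbook)   # one line per column: name, type, first values
summary(factbook)   # per-column min, quartiles, mean, max, and NA counts
```

summary() is often the quickest way to spot missing values, which matters for the filtering step.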
Filtering & Selecting:
Apply at least two filtering conditions (e.g., values above a threshold, excluding missing data).
Select 3–5 relevant variables. Show the first 10 rows.
Find the top 5 records according to some numeric variable of your choice.
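One possible shape for these three steps, sketched with dplyr on a hypothetical nine-row sample copied from the printed cia_factbook output (the row with the truncated country name is left out). The threshold of 12 and the chosen columns are arbitrary illustrations, not part of the assignment.

```r
library(dplyr)

# Hypothetical stand-in: rows copied from the printed cia_factbook output.
factbook <- tibble::tribble(
  ~country,     ~area,  ~birth_rate, ~death_rate, ~internet_users,
  "Russia",     1.71e7, 11.9,        13.8,         40853000,
  "Canada",     9.98e6, 10.3,         8.31,        26960000,
  "China",      9.60e6, 12.2,         7.44,       389000000,
  "Brazil",     8.51e6, 14.7,         6.54,        75982000,
  "Australia",  7.74e6, 12.2,         7.07,        15810000,
  "India",      3.29e6, 19.9,         7.35,        61338000,
  "Argentina",  2.78e6, 16.9,         7.34,        13694000,
  "Kazakhstan", 2.72e6, 19.6,         8.31,         5299000,
  "Algeria",    2.38e6, 24.0,         4.31,        NA
)

# Two filtering conditions: drop missing values, keep birth rates above 12.
filtered <- factbook %>%
  filter(!is.na(internet_users), birth_rate > 12) %>%
  select(country, birth_rate, internet_users)

head(filtered, 10)  # shows up to the first 10 rows

# Top 5 records by a numeric variable, largest first.
top5 <- factbook %>%
  slice_max(internet_users, n = 5)
top5
```
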
Mutating Variables:
Create a new column that is a ratio, difference, or transformation of existing variables.
Identify the record with the maximum value of this new variable.
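A minimal sketch of this step, again on a hypothetical sample copied from the printed output. The new column natural_increase (birth rate minus death rate) is only one example of a difference; the name is made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in: rows copied from the printed cia_factbook output.
factbook <- tibble::tribble(
  ~country,     ~birth_rate, ~death_rate,
  "Russia",     11.9,        13.8,
  "Canada",     10.3,         8.31,
  "China",      12.2,         7.44,
  "India",      19.9,         7.35,
  "Kazakhstan", 19.6,         8.31,
  "Algeria",    24.0,         4.31
)

# New column: the difference between two existing variables.
with_increase <- factbook %>%
  mutate(natural_increase = birth_rate - death_rate)

# The single record with the largest value of the new column.
top_record <- with_increase %>%
  slice_max(natural_increase, n = 1)
top_record
```
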
Grouping & Summarizing:
Group the data by a categorical variable and calculate summary statistics (mean, median, count) for a numeric variable.
Present the results in a clear table.
Interpret the findings.
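The visible columns of the example dataset contain no ready-made categorical variable, so the sketch below first derives one. Both the size_class column and its cut-off of 5e6 are hypothetical choices made purely for illustration; in your own dataset you would group by an existing category where one is available.

```r
library(dplyr)

# Hypothetical stand-in: rows copied from the printed cia_factbook output.
factbook <- tibble::tribble(
  ~country,     ~area,  ~birth_rate,
  "Russia",     1.71e7, 11.9,
  "Canada",     9.98e6, 10.3,
  "China",      9.60e6, 12.2,
  "Brazil",     8.51e6, 14.7,
  "Australia",  7.74e6, 12.2,
  "India",      3.29e6, 19.9,
  "Argentina",  2.78e6, 16.9,
  "Kazakhstan", 2.72e6, 19.6,
  "Algeria",    2.38e6, 24.0
)

by_size <- factbook %>%
  # Derive an illustrative category from a numeric column.
  mutate(size_class = if_else(area >= 5e6, "large", "smaller")) %>%
  group_by(size_class) %>%
  summarise(
    mean_birth   = mean(birth_rate),
    median_birth = median(birth_rate),
    n            = n()
  )
by_size
```

In this nine-row sample the "large" group has the lower mean birth rate, but a sample this small says nothing reliable about the full data; the interpretation step should be based on the whole dataset.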
Note
For some datasets, you may need to join multiple tables or tidy the data first using pivot_longer() or pivot_wider(). Feel free to do so if necessary.
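To show what such tidying can look like, here is a small sketch, assuming the tidyr package is installed. It reshapes two rate columns from a hypothetical three-row sample (values copied from the printed output) into long form and back. The column names measure and rate are made up for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: rows copied from the printed cia_factbook output.
rates <- tibble::tribble(
  ~country, ~birth_rate, ~death_rate,
  "Russia", 11.9,        13.8,
  "Canada", 10.3,         8.31,
  "China",  12.2,         7.44
)

# Wide -> long: one row per country and measure.
rates_long <- rates %>%
  pivot_longer(
    cols      = c(birth_rate, death_rate),
    names_to  = "measure",
    values_to = "rate"
  )

# Long -> wide: back to one column per measure.
rates_wide <- rates_long %>%
  pivot_wider(names_from = measure, values_from = rate)
```

The long form is usually easier to group and summarise by measure, while the wide form is easier to read as a table.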
Submission
Submit your .r file.
Your code must run without errors.
Add short comments explaining what each step does.