Assignment 05: Web scraping and APIs

🎯 Learning Objectives

By the end of this assignment, you will be able to:

  • Scrape data from a webpage using the rvest package.
  • Retrieve data from a public API using the httr2 and jsonlite packages.
  • Clean and structure the collected data into a tidy format.
  • Visualize the data to explore and communicate insights.
  • Document your data collection and cleaning process.

📝 Task 1. Basic

In this task, you will work with a website where data is directly embedded in the HTML page. Your goal is to choose a website of interest and collect structured data from it.

Instructions

  1. Choose a source. Find a webpage that contains data in a table or a list that can be converted into a table. This can be anything that interests you.
    • Ideas for inspiration:
      • Sports competition results (e.g., a football league table, Olympic Games results).
      • A list of movies, books, or music albums with ratings (e.g., IMDb Top 250 movies, bestseller lists).
      • Data from Wikipedia (e.g., a list of the world’s tallest buildings, demographic data for countries).
      • A product catalog from a small online store (avoid large marketplaces that actively protect against scraping).
  2. Scrape the data using rvest. Use SelectorGadget or your browser’s developer tools to find the correct CSS selectors for the elements you need.
    • Use functions such as read_html(), html_elements(), and html_table() or html_text2() to extract the data.
  3. Clean and structure the data. Transform the raw data into a tidy tibble.
    • Use janitor::clean_names() to standardize column names.
    • Use dplyr and stringr to clean the data: remove unnecessary characters and convert data types (e.g., from text to numbers or dates).
  4. Save the cleaned data.
  5. Make some visualizations to explore the data.
  6. Document your process. Write a brief summary of the steps you took, any challenges you faced, and how you overcame them.
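The steps above can be sketched as follows. This is a minimal example, not a template to copy verbatim: the URL, the CSS selector, and the `height_m` column are all assumptions — adapt them to whatever site you choose.

```r
library(rvest)
library(janitor)
library(dplyr)
library(readr)

# 1-2. Read the page and extract a table.
#      The URL and "table.wikitable" selector are placeholders; find the
#      right selector for your site with SelectorGadget or dev tools.
page <- read_html("https://en.wikipedia.org/wiki/List_of_tallest_buildings")
raw <- page |>
  html_element("table.wikitable") |>
  html_table()

# 3. Clean: standardize column names and convert text to numbers.
#    `height_m` is an assumed column name; parse_number() strips
#    footnote markers, commas, and units before conversion.
tidy <- raw |>
  clean_names() |>
  mutate(height_m = parse_number(as.character(height_m)))

# 4. Save the cleaned data.
write_csv(tidy, "scraped_data.csv")
```

Saving the cleaned data as a CSV (step 4) means your visualizations and write-up can be re-run without re-scraping the page each time.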

📝 Task 2. Advanced

This task will teach you how to retrieve data from web services via their Application Programming Interfaces (APIs). This is a more reliable and “polite” way to collect data than HTML scraping: an API returns structured data in a stable format and documents its usage rules, whereas a page’s HTML can change at any time.

Instructions

  1. Find a public API. Countless services provide data through an API; access usually requires a free registration to obtain an API key.
  2. Register and get an API key. Store it securely (e.g., in your .Renviron file via usethis::edit_r_environ()) rather than hard-coding it in your script.
  3. Make a request. Using httr2 and jsonlite, send a request to the API, receive the response in JSON format, and parse it.
  4. Create a dataframe. Extract the necessary data from the resulting structure and convert it into a tidy table.
  5. Save the cleaned data.
  6. Make some visualizations to explore the data.
  7. Document your process. Write a brief summary of the steps you took, any challenges you faced, and how you overcame them.
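A minimal sketch of this workflow is below, assuming a hypothetical API at api.example.com that returns JSON and expects a bearer token; the endpoint, query parameter, and field names (`results`, `name`, `value`) are all placeholders you will replace with those of your chosen API.

```r
library(httr2)
library(dplyr)
library(readr)

# 2. Read the key from .Renviron (set it with usethis::edit_r_environ());
#    "EXAMPLE_API_KEY" is a placeholder variable name.
api_key <- Sys.getenv("EXAMPLE_API_KEY")

# 3. Build and send the request, then parse the JSON response.
resp <- request("https://api.example.com/v1/items") |>
  req_url_query(limit = 100) |>
  req_headers(Authorization = paste("Bearer", api_key)) |>
  req_perform()

parsed <- resp_body_json(resp, simplifyVector = TRUE)  # uses jsonlite internally

# 4. Convert the parsed structure into a tidy tibble.
#    `parsed$results` and the selected columns are assumptions about the
#    response shape -- inspect your API's actual JSON first.
items <- as_tibble(parsed$results) |>
  select(name, value)

# 5. Save the cleaned data.
write_csv(items, "api_data.csv")
```

Note that `resp_body_json(simplifyVector = TRUE)` collapses JSON arrays of records into a data frame where possible; with the default `FALSE` you get nested lists, which is often safer for irregular responses but requires more manual flattening.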

📤 Submission

Submit your .qmd source file and the rendered document (PDF) to the assignment submission portal.