Assignment 05: Web scraping and APIs
🎯 Learning Objectives
By the end of this assignment, you will be able to:
- Scrape data from a webpage using the `rvest` package.
- Retrieve data from a public API using the `httr2` and `jsonlite` packages.
- Clean and structure the collected data into a tidy format.
- Visualize the data to explore and communicate insights.
- Document your data collection and cleaning process.
📝 Task 1. Basic
In this task, you will work with a website where data is directly embedded in the HTML page. Your goal is to choose a website of interest and collect structured data from it.
Instructions
- Choose a source. Find a webpage that contains data in a table or a list that can be converted into a table. This can be anything that interests you.
- Ideas for inspiration:
  - Sports competition results (e.g., a football league table, Olympic Games results).
  - A list of movies, books, or music albums with ratings (e.g., IMDb Top 250 movies, bestseller lists).
  - Data from Wikipedia (e.g., a list of the world’s tallest buildings, demographic data for countries).
  - A product catalog from a small online store (avoid large marketplaces that actively protect against scraping).
- Scrape the data using `rvest`. Use SelectorGadget or your browser’s developer tools to find the correct CSS selectors for the elements you need.
  - Use functions like `read_html()`, `html_elements()`, `html_table()`, or `html_text2()` to extract the data.
- Clean and structure the data. Transform the raw data into a tidy tibble.
  - Use `janitor::clean_names()` to standardize column names.
  - Use `dplyr` and `stringr` to clean the data: remove unnecessary characters and convert data types (e.g., from text to numbers or dates).
- Save the cleaned data.
- Make some visualizations to explore the data.
- Document your process. Write a brief summary of the steps you took, any challenges you faced, and how you overcame them.
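The steps above can be sketched as follows. This is a minimal example, not a template to copy: the URL, the CSS selector, and the `height_m` column are hypothetical placeholders that you must adapt to the page you choose.

```r
# Sketch of the Task 1 workflow. URL, selector, and column names are
# placeholders -- inspect your own page with SelectorGadget first.
library(rvest)
library(dplyr)
library(janitor)
library(readr)

url <- "https://en.wikipedia.org/wiki/List_of_tallest_buildings"  # example page

page <- read_html(url)

# html_table() returns a list of tibbles, one per matching <table>
tables <- page |>
  html_elements("table.wikitable") |>   # hypothetical selector
  html_table()

buildings <- tables[[1]] |>
  clean_names() |>                      # standardize to snake_case names
  mutate(
    # hypothetical column: strip units/commas and convert text to numbers
    height_m = parse_number(as.character(height_m))
  )

# Save the cleaned data for later use
write_csv(buildings, "buildings.csv")
```

Note that `html_table()` only works for genuine HTML tables; for data embedded in lists or other elements, combine `html_elements()` with `html_text2()` and build the tibble yourself.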
📝 Task 2. Advanced
This task will teach you how to retrieve data from web services via their Application Programming Interfaces (APIs). This is a more reliable and “polite” way to collect data than HTML scraping.
Instructions
- Find a public API. There are countless services that provide data through an API. This usually requires a free registration to obtain an API key.
- Ideas for inspiration:
  - OpenWeatherMap — Weather data.
  - The Movie Database (TMDB) — Information about movies, actors, etc.
  - FRED — Economic data (as seen in the lecture).
  - NASA APIs — Data and photos from NASA.
- Register and get an API key. Store it securely, for example, using
usethis::edit_r_environ(). - Make a request. Using
httr2andjsonlite, send a request to the API, receive the response in JSON format, and parse it. - Create a dataframe. Extract the necessary data from the resulting structure and convert it into a tidy table.
- Save the cleaned data.
- Make some visualizations to explore the data.
- Document your process. Write a brief summary of the steps you took, any challenges you faced, and how you overcame them.
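As a rough sketch of this workflow, the example below queries the public OpenWeatherMap API with `httr2`. The environment variable name, the query parameters, and the response fields (`name`, `main$temp`, `main$humidity`) are assumptions based on that API; check the documentation of whichever service you pick.

```r
# Sketch of the Task 2 workflow. Endpoint, parameters, and response
# fields are illustrative -- adapt them to your chosen API.
library(httr2)
library(dplyr)

# Read the key from .Renviron (edit it with usethis::edit_r_environ()),
# so it never appears in your script or your Git history.
api_key <- Sys.getenv("OPENWEATHERMAP_KEY")

resp <- request("https://api.openweathermap.org/data/2.5/weather") |>
  req_url_query(q = "Vienna", units = "metric", appid = api_key) |>
  req_perform()

# Parse the JSON body into a nested R list (jsonlite under the hood)
parsed <- resp |> resp_body_json()

# Extract the fields of interest into a tidy one-row tibble
weather <- tibble(
  city        = parsed$name,          # assumed field names --
  temperature = parsed$main$temp,     # verify against the API docs
  humidity    = parsed$main$humidity
)
```

For APIs that return many records, you would typically map over the parsed list (e.g., with `purrr::map_dfr()`) instead of building the tibble field by field.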
📤 Submission
Submit the `.qmd` file and the rendered document (PDF) to the assignment submission portal.