Assignment 06: Advanced Imports and Data Manipulations
Learning Objectives
After completing this assignment, you will be able to:
- Work with large, partitioned datasets using DuckDB, Arrow, or Polars.
- Explore and visualize time-dependent or categorical patterns.
- Reproducibly document a complete analysis pipeline.
Dataset
In this assignment, you will work with New York City Taxi & Limousine Commission (TLC) Trip Record Data: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
These open datasets contain detailed trip records for NYC yellow and green taxis and for-hire vehicles (FHV).
Your task
You are a data analyst exploring taxi activity in New York. Your goal is to import, transform, and analyze one or more subsets of the NYC taxi data using advanced import and manipulation techniques.
You decide:
- which category of taxi trips to analyze (Yellow, Green, FHV, etc.);
- which time period to focus on (for example, one month, one quarter, or a comparison between months/years).
Step 1: Data import
Choose an import method suited for working with large files:
- Use arrow::open_dataset() to read directly from a Parquet/CSV directory.
- Or use DuckDB to query data locally.
- You can also use partitioned datasets (e.g. year/month folders) to analyze multiple files efficiently.
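To make the idea of year/month folders concrete, here is a minimal sketch of a partitioned layout using only the Python standard library. It is not how Arrow or DuckDB are implemented; it only illustrates why such a layout lets a reader skip whole files based on folder names. The `year=…/month=…` naming, the helper functions, and the toy values are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_partition(root: Path, year: int, month: int, rows: list[dict]) -> Path:
    """Write rows to root/year=YYYY/month=MM/part-0.csv (hypothetical layout)."""
    folder = root / f"year={year}" / f"month={month:02d}"
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "part-0.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["trip_distance", "fare_amount"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_partitions(root: Path, year: int, months: set[int]) -> list[dict]:
    """Read only the files whose folder names match the requested partitions."""
    selected = []
    for path in sorted(root.glob(f"year={year}/month=*/*.csv")):
        month = int(path.parent.name.split("=")[1])
        if month not in months:
            continue  # this file is skipped without ever being opened
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                # Partition values come from the folder names, not the file.
                row["year"], row["month"] = year, month
                selected.append(row)
    return selected


# Toy data (invented values, not real TLC records).
root = Path(tempfile.mkdtemp())
write_partition(root, 2024, 1, [{"trip_distance": 1.2, "fare_amount": 8.5}])
write_partition(root, 2024, 2, [{"trip_distance": 3.4, "fare_amount": 15.0},
                                {"trip_distance": 0.9, "fare_amount": 6.0}])
write_partition(root, 2024, 3, [{"trip_distance": 7.1, "fare_amount": 28.0}])

subset = read_partitions(root, 2024, {2, 3})
print(len(subset), sorted({r["month"] for r in subset}))  # prints: 3 [2, 3]
```

Tools such as Arrow and DuckDB apply the same principle to Parquet files at scale, so a filter on year or month can avoid reading most of the data.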
You can also save your selected subset as a .csv file to compare import performance between formats.
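One possible way to structure such a performance comparison is sketched below. The two loader functions are hypothetical stand-ins that do no real I/O; in your report you would replace them with your actual import calls. Taking the best of several runs is one common choice for reducing noise, not a requirement of the assignment.

```python
import time
from typing import Callable


def best_time(load: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time (seconds) over several calls to load()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        load()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-ins: replace the bodies with your real import code.
def load_csv_subset() -> int:
    return sum(range(200_000))


def load_parquet_subset() -> int:
    return sum(range(50_000))


timings = {
    "csv": best_time(load_csv_subset),
    "parquet": best_time(load_parquet_subset),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

When you report timings, note the file sizes and whether the data was already cached by the operating system, since both can change the result noticeably.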
Step 2: Data exploration and transformation
Perform data manipulation using SQL queries or with dplyr, arrow, or polars.
Possible ideas:
- Compute average trip distance or fare month-to-month.
- Compare tips by vendor or taxi type (Yellow vs Green).
- Identify peak hours or days with the highest number of trips.
- Detect outliers or anomalies in trip duration or amount.
These are only suggestions. Feel free to explore other interesting questions.
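As an illustration of two of the ideas above (peak hours and outliers in trip duration), here is a small sketch on invented data using only the Python standard library. The tuple layout and the 1.5 × IQR rule are assumptions chosen for the example; column names in the real TLC files differ by taxi type, and other outlier definitions may suit your question better.

```python
from collections import Counter
from datetime import datetime
from statistics import quantiles

# Toy trips as (pickup, dropoff) timestamps. Invented values, not real TLC records.
trips = [
    ("2024-01-05 08:10:00", "2024-01-05 08:22:00"),
    ("2024-01-05 08:40:00", "2024-01-05 08:55:00"),
    ("2024-01-05 08:45:00", "2024-01-05 08:59:00"),
    ("2024-01-05 17:05:00", "2024-01-05 17:18:00"),
    ("2024-01-05 17:30:00", "2024-01-05 17:46:00"),
    ("2024-01-05 23:50:00", "2024-01-06 04:50:00"),
    ("2024-01-06 09:00:00", "2024-01-06 09:11:00"),
    ("2024-01-06 12:15:00", "2024-01-06 12:32:00"),
]

FMT = "%Y-%m-%d %H:%M:%S"
parsed = [(datetime.strptime(p, FMT), datetime.strptime(d, FMT)) for p, d in trips]

# Peak hour: count trips by the hour of the pickup timestamp.
trips_per_hour = Counter(pickup.hour for pickup, _ in parsed)
peak_hour, peak_count = trips_per_hour.most_common(1)[0]

# Trip duration in minutes.
durations = [(dropoff - pickup).total_seconds() / 60 for pickup, dropoff in parsed]

# Flag durations outside the 1.5 * IQR fences around the quartiles.
q1, _, q3 = quantiles(durations, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [d for d in durations if d < lower or d > upper]

print(peak_hour, peak_count)  # prints: 8 3
print(outliers)               # prints: [300.0]
```

On the full dataset, the same grouping and filtering would normally be pushed down to DuckDB, Arrow, or Polars rather than done in a Python loop, so that only the aggregated result is loaded into memory.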
Step 3: Visualization and insights
Create one or two visualizations that best communicate your findings, such as:
- Line chart: monthly average trip distance or total rides.
- Bar chart: tip comparison across taxi types.
- Scatter plot: temperature vs. number of rides (this requires merging external weather data, since the trip records do not include temperature).
Make sure your plots have:
- clear titles, labels, and legends;
- an interpretation paragraph in your Quarto report.
Submission
Submit a Quarto report (.qmd and PDF/HTML) that includes:
- description of your question and selected dataset subset;
- data import process (DuckDB, Arrow, or Polars);
- cleaning, transformation, and merging steps;
- visualizations with comments;
- (optional) performance note comparing two import methods.
If you use HTML, publish your report on Quarto Pub or GitHub Pages and include the link in your submission.