Assignment 01: Welcome to R
Exercise 1: Pearson
Let’s look at how you can read formulas in math notation and translate them into R code.
Your first task is to compute the Pearson linear correlation coefficient [1] using its mathematical formula:
\[ r = \frac{\sum_{i=1}^n \left(x_i - \frac{1}{n}\sum_{j=1}^n x_j\right)\left(y_i - \frac{1}{n}\sum_{j=1}^n y_j\right)} {\sqrt{\sum_{i=1}^n \left(x_i - \frac{1}{n}\sum_{j=1}^n x_j\right)^2} \sqrt{\sum_{i=1}^n \left(y_i - \frac{1}{n}\sum_{j=1}^n y_j\right)^2}} \]
Let x and y be two vectors of identical length n. The next code chunk generates two such vectors of length n = 100:
n <- 100
x <- runif(n, 0, 10)
y <- 2 * x + runif(n, -1, 1)
Do not use the built-in cor() function.
- In this exercise, you need to calculate the correlation coefficient manually using the formula above. Please do not use loops, as they are not necessary here.
- Functions that may be useful: mean(), length(), sqrt().
- Read the documentation for runif(). Why does it return different values each time you run it? How can you make it return the same values each time?
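As a warm-up for reading formulas, here is a minimal sketch that translates a different, simpler formula into loop-free R: the sample variance \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (v_i - \bar{v})^2\). The vector v and the other names are made up for illustration and are not part of the assignment data. The same pattern (subtract the mean, combine element-wise, then sum) should carry over to the correlation formula above.

```r
# Hypothetical example vector (not part of the assignment data)
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

n_v   <- length(v)   # n in the formula
v_bar <- mean(v)     # (1/n) * sum of all v_i

# v - v_bar subtracts the mean from every element at once,
# ^2 squares element-wise, and sum() plays the role of the sigma sign
s2 <- sum((v - v_bar)^2) / (n_v - 1)

s2
```

Because R arithmetic is vectorized, no explicit loop over i is needed; the result can be checked against the built-in var(v).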
Exercise 2: Cryptocurrency Trading
You trade two types of cryptocurrencies: Bitcoin (BTC) and Ethereum (ETH). Over the course of a week of trading, you achieved the following results. This code generates random earnings/losses for each day of the week:
n <- 7
BTC <- round(rnorm(n, mean = 100, sd = 200))
ETH <- round(rnorm(n, mean = 100, sd = 200))
- Add names to the vectors BTC and ETH corresponding to the days of the week, starting from Monday.
- Calculate the total profit/loss for each cryptocurrency over the week.
- Calculate the average daily profit/loss for each cryptocurrency.
- Determine the day with the highest profit for each cryptocurrency.
- Which cryptocurrency was more profitable over the week?
In data science, it is essential to ask questions and find answers to them. Try to think of more questions you could ask about this data, and answer them using R.
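The tasks above rely on named vectors. Here is a minimal sketch of the relevant tools on a small made-up vector; the name sales and its values are hypothetical and only stand in for the randomly generated BTC and ETH vectors.

```r
# Hypothetical daily results, for illustration only
sales <- c(120, -40, 75)
names(sales) <- c("Monday", "Tuesday", "Wednesday")

total    <- sum(sales)                # overall result across all days
average  <- mean(sales)               # average result per day
best_day <- names(which.max(sales))   # name attached to the largest element

total
average
best_day
```

which.max() returns the position of the first maximum and keeps the element's name, so wrapping it in names() gives the day directly.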
Exercise 3: Planets
In this exercise, you will work with data [2] about the exoplanets in our galaxy. You will analyze their characteristics and compare them to the planets in our solar system.
In NASA’s Exoplanet Catalog you can find hypothetical visualizations of exoplanets.
exoplanets_df <- read.csv("data/exoplanets_unique.csv")
| loc_rowid | pl_name | hostname | pl_letter | sy_snum | sy_pnum | disc_year | pl_orbper | pl_rade | pl_masse | pl_eqt | st_spectype | st_teff | sy_dist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2816 | HD 20781 d | HD 20781 | d | 2 | 5 | 2019 | 29.15800 | NA | NA | NA | K0 V | 5256.0 | 35.9715 |
| 13421 | Kepler-1482 b | Kepler-1482 | b | 1 | 1 | 2016 | 12.25389 | NA | NA | NA | NA | 5567.9 | 567.5560 |
| 17217 | Kepler-1801 c | Kepler-1801 | c | 1 | 2 | 2023 | 116.58261 | 3.43 | NA | 371 | NA | 5738.0 | 839.3320 |
| 2399 | HD 158996 b | HD 158996 | b | 1 | 1 | 2018 | 820.20000 | NA | NA | NA | K5 III | 4069.0 | 283.6590 |
| 2805 | HD 206255 b | HD 206255 | b | 1 | 1 | 2019 | 96.04500 | NA | NA | NA | G5 IV/V | 5635.0 | 75.2363 |
The data frame exoplanets_df contains the following columns:
| Column Name | Description |
|---|---|
| loc_rowid | Unique identifier for each row in the dataset |
| pl_name | Name of the exoplanet |
| hostname | Name of the host star |
| pl_letter | Letter designation of the exoplanet |
| sy_snum | Number of stars in the host star system |
| sy_pnum | Number of planets in the host star system |
| disc_year | Year of discovery |
| pl_orbper | Orbital period of the exoplanet (in days) |
| pl_rade | Radius of the exoplanet (in Earth radii) |
| pl_masse | Mass of the exoplanet (in Earth masses) |
| pl_eqt | Equilibrium temperature of the exoplanet (in Kelvin) |
| st_spectype | Spectral type of the host star |
| st_teff | Effective temperature of the host star (in Kelvin) |
| sy_dist | Distance to the host star (in parsecs) |
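The preview of the data shows that several columns, such as pl_rade and pl_eqt, contain missing values (NA). Here is a minimal sketch of how NA affects summaries, using a small hand-typed data frame with values copied from the first three preview rows; the name preview_df is hypothetical and this is not the full dataset.

```r
# Small data frame typed in from the preview rows (illustration only)
preview_df <- data.frame(
  pl_name   = c("HD 20781 d", "Kepler-1482 b", "Kepler-1801 c"),
  disc_year = c(2019, 2016, 2023),
  pl_rade   = c(NA, NA, 3.43),
  pl_eqt    = c(NA, NA, 371),
  stringsAsFactors = FALSE
)

mean_with_na    <- mean(preview_df$pl_rade)                # NA propagates
mean_without_na <- mean(preview_df$pl_rade, na.rm = TRUE)  # NAs dropped first
n_missing_eqt   <- sum(is.na(preview_df$pl_eqt))           # count of missing values

# which.max() ignores NAs and returns the position of the largest value
hottest <- preview_df$pl_name[which.max(preview_df$pl_eqt)]

mean_with_na
mean_without_na
n_missing_eqt
hottest
```

Most summary functions, including mean(), sum(), and max(), accept na.rm = TRUE; without it a single NA makes the whole result NA.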
- Display the structure of the exoplanets_df data frame using the str() function. How many rows and columns does it have? What is the data type of each column?
- Calculate the average radius (pl_rade) and mass (pl_masse) of the exoplanets in the dataset. How do these values compare to Earth’s radius (1 Earth radius) and mass (1 Earth mass)?
- Identify the exoplanet with the highest equilibrium temperature (pl_eqt). What is its name and temperature? Convert the temperature from Kelvin to Celsius using the formula: \(C = K - 273.15\).
- How many exoplanets were discovered last year?
- Select 5 random (or not) exoplanets from the dataset. Compare their characteristics with those of the planets in our solar system from the table below:
| Planet | Diameter (km) | Rings | Moons | Mean Temperature (°C) | Orbit Period (days) |
|---|---|---|---|---|---|
| Mercury | 4879 | False | 0 | 167 | 88 |
| Venus | 12104 | False | 0 | 464 | 225 |
| Earth | 12756 | False | 1 | 15 | 365 |
| Mars | 6792 | False | 2 | -65 | 687 |
| Jupiter | 142984 | True | 95 | -110 | 4333 |
| Saturn | 120536 | True | 146 | -140 | 10759 |
| Uranus | 51118 | True | 28 | -195 | 30687 |
| Neptune | 49528 | True | 16 | -200 | 60190 |
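The two data sources use different units: the exoplanet data gives radii in Earth radii and temperatures in Kelvin, while the solar system table lists diameters in kilometres and temperatures in degrees Celsius. Here is a minimal sketch of the conversions, assuming Earth's diameter of 12756 km from the table as an approximate scale. The helper names are hypothetical.

```r
earth_diameter_km <- 12756   # Earth's diameter, taken from the solar system table

# Radius in Earth radii -> approximate diameter in km
# (the ratio of two radii equals the ratio of the corresponding diameters)
rade_to_diameter_km <- function(rade) rade * earth_diameter_km

# Kelvin -> degrees Celsius, using C = K - 273.15
kelvin_to_celsius <- function(k) k - 273.15

# Kepler-1801 c from the data preview: 3.43 Earth radii, 371 K
diameter_km <- rade_to_diameter_km(3.43)
temp_c      <- kelvin_to_celsius(371)

diameter_km
temp_c
```

Both helpers are vectorized, so they can be applied to a whole column at once; NA inputs simply stay NA.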
Footnotes
1. You will use this formula in a couple of courses, so it is worth memorizing.
2. The dataset is adapted from the NASA Exoplanet Archive. It contains information about various exoplanets discovered outside our solar system.