Linear Regression and Temperature#
In this notebook, we’ll look at using linear regression to study changes in temperature.
Setup#
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%config InlineBackend.figure_format ='retina'
Getting our data#
We’ll be getting data from the North America Land Data Assimilation System (NLDAS), which provides daily average temperatures for the United States from 1979 to 2011.
For the next step, you will need to choose some settings in the data request form. These are:
GroupBy: Month Day, Year
Your State
Export Results (check box)
Show Zero Values (check box)
Download the data for your home state (or state of your choosing) and upload it to M2 in your work directory.
Loading our data#
df = pd.read_csv('North America Land Data Assimilation System (NLDAS) Daily Air Temperatures and Heat Index (1979-2011).txt',delimiter='\t',skipfooter=14,engine='python')
df
Clean the data#
Drop any rows that have the value “Total” in the Notes column, then drop the Notes column
Make a column called Date that is in the pandas datetime format
Make columns for ‘Year’, ‘Month’, and ‘Day’ by splitting the column ‘Month Day, Year’
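The cleaning steps above can be sketched on a tiny stand-in table. This is only an illustration: the rows are made up, and the date format ("Jan 01, 1979") is an assumption about how the downloaded file writes the 'Month Day, Year' column, so check it against your own file. Instead of splitting the text column, the sketch derives Year, Month, and Day from the parsed Date column, which yields the numeric Month that the coloring code later in the notebook expects.

```python
import pandas as pd

# Tiny stand-in for the downloaded file; the values are made up.
raw = pd.DataFrame({
    "Notes": [None, None, "Total", None],
    "Month Day, Year": ["Jan 01, 1979", "Jan 02, 1979", None, "Feb 01, 1979"],
    "Avg Daily Max Air Temperature (F)": [30.1, 28.4, 29.3, 35.0],
})

# 1. Drop the "Total" rows, then drop the Notes column itself.
clean = raw[raw["Notes"] != "Total"].drop(columns="Notes").copy()

# 2. Parse the text dates (assumed format: "Jan 01, 1979").
clean["Date"] = pd.to_datetime(clean["Month Day, Year"], format="%b %d, %Y")

# 3. Derive numeric Year, Month, and Day columns from the parsed dates.
clean["Year"] = clean["Date"].dt.year
clean["Month"] = clean["Date"].dt.month
clean["Day"] = clean["Date"].dt.day

print(clean)
```

If your file uses a different date format, adjust the `format` string or let `pd.to_datetime` infer it.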
df['DateInt'] = df['Date'].astype('int64')/10e10 # Integer timestamp scaled down by 10e10 (= 1e11); used later as the x value for the regression
Generating a scatter plot#
Use df.plot.scatter to plot ‘Date’ vs ‘Avg Daily Max Air Temperature (F)’. You might want to add figsize=(50,5) as an argument to make it clearer what is happening.
Describe your plot.
Adding colors for our graph#
# No need to edit this unless you want to try different colors or a pattern other than colors by month
# plt.get_cmap is used because matplotlib.cm.get_cmap is no longer available in recent matplotlib releases
cmap = plt.get_cmap("nipy_spectral", len(df['Month'].unique())) # Builds a discrete color mapping using a built-in matplotlib color map
c = []
for i in range(cmap.N): # Converts our discrete map into hex values
    rgba = cmap(i)
    c.append(matplotlib.colors.rgb2hex(rgba))
df['color']=[c[int(i-1)] for i in df['Month'].astype(int)] # Adds a column to our dataframe with the color we want for each row
Make the same scatter plot as before, but color the points by month by adding the argument c=df['color'] to the plotting command.
Pick a subset of the data#
Select a 6-month period from the data. # Hint: use boolean logic and pd.Timestamp(YYYY, MM, DD)
Plot the subset using the same code you used for the colored scatter plot. You can change the figsize if needed.
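One way to pick a date window is a boolean mask built from two comparisons. The sketch below uses a stand-in table with placeholder temperatures and a hypothetical window (January through June 1990); the names `demo`, `start`, `end`, and `subset_demo` are illustrative only. It uses `pd.Timestamp` because `pd.datetime` is not available in recent pandas releases.

```python
import pandas as pd

# Stand-in data: one row per day for 1990 (temperatures are placeholders).
demo = pd.DataFrame({"Date": pd.date_range("1990-01-01", "1990-12-31", freq="D")})
demo["Avg Daily Max Air Temperature (F)"] = 50.0

# Hypothetical window: 1 January through 30 June 1990, inclusive.
start = pd.Timestamp(1990, 1, 1)
end = pd.Timestamp(1990, 6, 30)

# Combine the two comparisons with & (each side needs its own parentheses).
mask = (demo["Date"] >= start) & (demo["Date"] <= end)
subset_demo = demo[mask].copy()

print(len(subset_demo), subset_demo["Date"].min(), subset_demo["Date"].max())
```

Taking a `.copy()` of the selection avoids pandas warnings if you add columns to the subset later.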
Linear Regression#
We are going to use a very simple linear regression model. You may implement a more complex model if you wish.
The method described here is called the least squares method and is defined as:
\(m = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\)
\(b = \bar{y} - m\bar{x}\)
Here \(\bar{x}\) and \(\bar{y}\) are the mean values of \(x\) and \(y\), respectively, \(m\) is the slope, and \(b\) is the intercept.
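As a quick sanity check of these formulas, here is a minimal vectorized sketch on made-up points that lie exactly on the line y = 2x + 1. The function name `least_squares` and the demo arrays are illustrative and not part of the notebook; `np.polyfit` with `deg=1` is used only as an independent cross-check.

```python
import numpy as np

def least_squares(x, y):
    """Return slope m and intercept b from the closed-form formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                       # deviations of x from its mean
    dy = y - y.mean()                       # deviations of y from its mean
    m = np.sum(dx * dy) / np.sum(dx ** 2)   # slope
    b = y.mean() - m * x.mean()             # intercept
    return m, b

# Made-up points on the line y = 2x + 1.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
m_demo, b_demo = least_squares(x_demo, y_demo)

# Cross-check against NumPy's degree-1 polynomial fit.
m_ref, b_ref = np.polyfit(x_demo, y_demo, deg=1)
print(m_demo, b_demo, m_ref, b_ref)
```

The array operations replace the explicit loop over the sums but compute the same quantities.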
First we need to define our X and Y values.
X=subset['DateInt'].values
Y=subset['Avg Daily Max Air Temperature (F)'].values
def lin_reg(x,y):
    # Calculate the average x and y
    x_avg = np.mean(x)
    y_avg = np.mean(y)
    num = 0
    den = 0
    for i in range(len(x)): # This represents our sums
        num = num + (x[i] - x_avg)*(y[i] - y_avg) # Our numerator
        den = den + (x[i] - x_avg)**2 # Our denominator
    # Calculate slope
    m = num / den
    # Calculate intercept
    b = y_avg - m*x_avg
    print(m, b)
    # Calculate our predicted y values
    y_pred = m*x + b
    return y_pred
Y_pred = lin_reg(X,Y)
subset.plot.scatter(x='Date', y='Avg Daily Max Air Temperature (F)',c=subset['color'])
plt.plot(subset['Date'].values, Y_pred, color='red') # best fit line; plotting each predicted value against its own date keeps the sign of the slope correct
plt.show()
What are the slope and intercept of your best fit line?
What are the minimum and maximum Y values of your best fit line? Is your slope positive or negative?
Putting it all together#
Generate a best fit line for the full data set and plot the line over top of the data.
Is the slope positive or negative? What do you think that means?
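When interpreting the slope, it helps to convert it into familiar units. The sketch below assumes that the Date column is stored at nanosecond resolution, so that one DateInt unit (the integer timestamp divided by 10e10) corresponds to 100 seconds; check this against your own dataframe. The function name `slope_per_year` and the example slope value are hypothetical.

```python
# Assumption: DateInt = nanoseconds since 1970 divided by 10e10,
# so one DateInt unit is 10e10 ns = 100 seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # average year length
SECONDS_PER_DATEINT = 10e10 / 1e9          # nanoseconds per unit -> seconds per unit

def slope_per_year(m):
    """Convert a slope in degrees F per DateInt unit to degrees F per year."""
    return m * SECONDS_PER_YEAR / SECONDS_PER_DATEINT

# Hypothetical slope, not a result from real data.
example = slope_per_year(1e-6)
print(example)
```

Multiplying the per-year value by the number of years in the record gives the total change implied by the fitted line over that period.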