Linear Regression and Temperature#

In this notebook, we’ll look at using linear regression to study changes in temperature.

Setup#

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%config InlineBackend.figure_format ='retina'

Getting our data#

We’ll be getting data from North America Land Data Assimilation System (NLDAS), which provides the daily average temperature from 1979-2011 for the United States.

For the next step, you will need to choose some settings in the data request form. These are:

  • GroupBy: Month Day, Year

  • Your State

  • Export Results (check box)

  • Show Zero Values (check box)

  1. Download the data for your home state (or state of your choosing) and upload it to M2 in your work directory.

Loading our data#

df = pd.read_csv('North America Land Data Assimilation System (NLDAS) Daily Air Temperatures and Heat Index (1979-2011).txt',delimiter='\t',skipfooter=14,engine='python')
df

Clean the data#

  1. Drop any rows that have the value “Total” in the Notes column, then drop the Notes column


  1. Make a column called Date that is in the pandas datetime format


  1. Make columns for ‘Year’, ‘Month’, and ‘Day’ by splitting the column ‘Month Day, Year’


df['DateInt'] = df['Date'].astype(int)/10e10 # This will be used later

Generating a scatter plot#

  1. Use df.plot.scatter to plot ‘Date’ vs ‘Avg Daily Max Air Temperature (F)’. You might want to add figsize=(50,5) as an argument to make it more clear what is happening.


  1. Describe your plot.

Adding colors for our graph#

# No need to edit this unless you want to try different colors or a pattern other than colors by month

cmap = matplotlib.cm.get_cmap("nipy_spectral", len(df['Month'].unique())) # Builds a discrete color mapping using a built in matplotlib color map

c = []
for i in range(cmap.N): # Converts our discrete map into Hex Values
    rgba = cmap(i)
    c.append(matplotlib.colors.rgb2hex(rgba))

df['color']=[c[int(i-1)] for i in df['Month'].astype(int)] # Adds a column to our dataframe with the color we want for each row
  1. Make the same plot as 4) but add color by adding the argument c=df[‘color’] to our plotting command.


Pick a subset of the data#

  1. Select a 6 month period from the data. # Hint use logic and pd.datetime(YYYY, MM, DD)


  1. Plot the subset using the the same code you used in 6). You can change the figsize if needed.


Linear Regression#

We are going to use a very simple linear regression model. You may implement a more complex model if you wish.

The method described here is called the least squares method and is defined as:

\(m = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}))}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\)

\(b = \bar{y} - m\bar{x}\)

Where \(\bar{x}\) and \(\bar{y}\) are the average value of \(x\) and \(y\) respectively.

First we need to define our X and Y values.

X=subset['DateInt'].values
Y=subset['Avg Daily Max Air Temperature (F)'].values
def lin_reg(x,y):
    # Calculate the average x and y
    x_avg = np.mean(x)
    y_avg = np.mean(y)

    num = 0
    den = 0
    for i in range(len(x)): # This represents our sums
        num = num + (x[i] - x_avg)*(y[i] - y_avg) # Our numerator
        den = den + (x[i] - x_avg)**2 # Our denominator
    # Calculate slope
    m = num / den
    # Calculate intercept
    b = y_avg - m*x_avg

    print (m, b)
    
    # Calculate our predicted y values
    y_pred = m*x + b
    
    return y_pred
Y_pred = lin_reg(X,Y)
subset.plot.scatter(x='Date', y='Avg Daily Max Air Temperature (F)',c=subset['color'])
plt.plot([min(subset['Date'].values), max(subset['Date'].values)], [min(Y_pred), max(Y_pred)], color='red') # best fit line
plt.show()
  1. What are the slope and intercept of your best fit line?


  1. What are the minimum and maximum Y values of your best fit line? Is your slope positive or negative?


Putting it all together#

  1. Generate a best fit line for the full data set and plot the line over top of the data.



  1. Is the slope positive or negative? What do you think that means?