This blog post collects the projects I have worked on related to Data Science and Machine Learning, using programming tools such as Python 3 and Jupyter notebooks.
Boston House data pricing
Data description
Origin: The origin of the Boston housing data is Natural.
Usage: This dataset may be used for Assessment.
Number of Cases: The dataset contains a total of 506 cases.
Order: The order of the cases is mysterious.
Variables: There are 14 attributes in each case of the dataset. They are:
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000’s
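As a quick reference, the sketch below shows one way the 14 attributes listed above could be read into a pandas DataFrame. The file name housing.csv and the whitespace separator are assumptions about a local copy of the dataset, not part of the notebook itself.
import pandas as pd

# Column names taken from the attribute list above
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
           'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# 'housing.csv' is a hypothetical local copy of the dataset (506 rows, 14 columns);
# adjust the path and separator to match your own copy.
boston = pd.read_csv('housing.csv', names=columns, sep=r'\s+')
print(boston.shape)  # expected: (506, 14)
boston.head()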
Code
To download the notebook, visit my GitHub repository.
All of the code was developed in Python, inside a Jupyter notebook.
Importing libraries & First display
# import libraries and read the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# assign the headers 'Population' and 'Profit' to the columns
df = pd.read_csv('Population.csv', names=['Population', 'Profit'])
df.head()
Plotting Data (Exploratory data analysis)
Now we take a first look at the data with a quick exploratory analysis, defining a function that draws a scatter plot of Profit (in $10,000s) against Population (in 10,000s).
def PlotData(x, y):
    """Make a scatter plot of the x, y data,
    with preset figure size, color, marker, and axis labels.
    Args:
        x: x data (Population of City in 10,000s)
        y: y data (Profit in $10,000s)
    Returns:
        scatter plot of y vs x
    """
    plt.figure(figsize=(5, 4), dpi=78)
    plt.scatter(x, y, c='red', marker='x', alpha=0.5, label='training data')
    plt.xlabel('Population of City in 10,000s')
    plt.ylabel('Profit in $10,000s')
    plt.legend(loc='lower right')
    plt.grid()
    plt.show()
## ========== Part 2: Plotting =======================
# pass the 'Population' and 'Profit' columns of the DataFrame as x and y
PlotData(df.Population, df.Profit)
The resulting plot shows that the data follow a clear positive trend: profit tends to increase with population.
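To back up that visual impression with a number, we can compute the Pearson correlation between the two columns. This quick check is not part of the original notebook, just a sanity test:
# Pearson correlation between Population and Profit;
# a value close to +1 confirms the positive linear trend seen in the scatter plot
corr = np.corrcoef(df.Population, df.Profit)[0, 1]
print(f"Correlation between Population and Profit: {corr:.3f}")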
Cost Computation And Hypothesis
To measure how well a hypothesis fits the data, we need an appropriate cost function so that the cost computation can be interpreted correctly. Here the cost function used is the mean squared error.
Cost Function: \(J(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum^{m}_{i=1} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2}\)
In other words, we can establish a hypothesis, the parameters to be fitted, the cost function, and the goal of the minimization.
Model attributes
Hypothesis fit (linear function): \( h_{\theta}(x) = \theta_{0} + \theta_{1} x \)
Parameters used: \( \theta_{0}, \theta_{1} \)
Cost Function: \( J(\theta_{0},\theta_{1}) = \frac{1}{2m} \sum^{m}_{i=1} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2} \)
Goal: \( \underset{\theta_{0}, \theta_{1}}{\min} \; J(\theta_{0},\theta_{1}) \)
But first, the data needs to be prepared before it can be fed to the model.
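In vectorized form the hypothesis for all \( m \) training examples can be written as \( h_{\theta}(X) = X\theta \), where \( X \) is the \( m \times (n+1) \) design matrix whose first column is all ones (so that \( \theta_{0} \) acts as the intercept) and \( \theta \) is the \( (n+1) \times 1 \) parameter vector. This is why a column of ones is added to the DataFrame in the function below.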
# obtain the number of features, which is 1: 'Population of City in 10,000s'
n = len(df.columns) - 1  # subtract the target column

# Create a function to prepare the data.
def Prepare_Data(df, n):
    """
    Add a column of 1s, convert to matrices,
    and initialize theta.
    Args:
        df: the data frame read from the file
        n: int, number of features
    Returns:
        x: an m by n+1 matrix
        y: an m by 1 vector
        theta: an n+1 by 1 vector
    """
    # Add a column of ones (1's) at the start of the DataFrame
    # Ones|Population|Profit
    #  1  |   ...    | ...
    df.insert(0, 'Ones', 1)
    # define x and y, slicing the dataset with iloc by column index
    # x selects all rows and columns 0 to n+1 (exclusive) -> Ones|Population
    # y selects all rows and columns n+1 to n+2 (exclusive) -> Profit
    x = df.iloc[:, 0:n + 1]
    y = df.iloc[:, n + 1:n + 2]
    # convert to matrices and initialize the parameters theta to zeros
    # theta is an (n+1)*1 vector and its transpose is a 1*(n+1) vector,
    # where n is the number of features
    x = np.asmatrix(x.values)
    y = np.asmatrix(y.values)
    theta = np.asmatrix(np.zeros((n + 1, 1)))
    return x, y, theta

x, y, theta = Prepare_Data(df, n)
Next, initialize the number of iterations and the learning rate α.
# initialize the number of iterations and the learning rate alpha
iterations = 1500
alpha = 0.01

# Check the dimensions of the matrices.
x.shape, y.shape, theta.shape
This gives the following result, showing the shapes of x, y, and theta.
Out: ((97, 2), (97, 1), (2, 1))
Let us now implement the cost computation.
# Create a function to compute the cost.
def ComputeCost(x, y, theta):
    """
    Compute the cost function.
    Args:
        x: an m by n+1 matrix
        y: an m by 1 vector
        theta: an n+1 by 1 vector
    Returns:
        cost: float
    """
    m = len(x)  # number of training examples
    # vectorized operation with numpy matrices: all examples are handled in one step
    J_cost = 1 / (2 * m) * (np.sum(np.square((x * theta) - y)))
    # print(f"For theta {theta[0,0]:.2f} and {theta[1,0]:.2f}, the cost (J) is {J_cost:.2f}")  # uncomment for more info
    return J_cost

ComputeCost(x, y, theta)
This returns the value of J for the base case, with theta initialized to zeros.
Out: 32.072733877455676
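As a quick sanity check (not part of the original notebook): with theta set to zeros every prediction is zero, so the initial cost reduces to \( \frac{1}{2m} \sum^{m}_{i=1} (y^{(i)})^{2} \), which can be verified directly.
# with theta = 0 the predictions are all zero, so J(0, 0) = (1/2m) * sum(y^2);
# this should match the value returned by ComputeCost above
m = len(y)
print(np.sum(np.square(y)) / (2 * m))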
Now we are ready to implement gradient descent, so that the fit improves with every iteration as the model learns.
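At each iteration, both parameters are updated simultaneously with the standard batch gradient descent rule \( \theta_{j} := \theta_{j} - \frac{\alpha}{m} \sum^{m}_{i=1} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_{j}^{(i)} \), which is exactly what the inner loop below computes.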
# Create a function to implement gradient descent.
def gradientDescent(x, y, theta, alpha, iterations):
    """
    Implement gradient descent.
    Args:
        x: an m by n+1 matrix
        y: an m by 1 vector
        theta: an n+1 by 1 vector
        alpha: float, learning rate
        iterations: int, number of iterations
    Returns:
        theta: an n+1 by 1 vector
        J_vals: a list with one cost value per iteration
    """
    m = len(x)   # number of training examples
    J_vals = []  # initialize J as a list
    for i in range(iterations):
        error = (x * theta) - y
        # error is computed before the loop, so both parameters are updated simultaneously
        for j in range(len(theta)):
            theta.T[0, j] = theta.T[0, j] - (alpha / m) * np.sum(np.multiply(error, x[:, j]))
        J_vals.append(ComputeCost(x, y, theta))
    return theta, J_vals

theta, J_vals = gradientDescent(x, y, theta, alpha, iterations)
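As a side note, the inner loop over the parameters can be replaced by a single matrix product. The sketch below is an equivalent vectorized version, not the one used in the notebook:
def gradientDescent_vectorized(x, y, theta, alpha, iterations):
    """Same algorithm as above, with the parameter update written as one matrix product."""
    m = len(x)
    J_vals = []
    for _ in range(iterations):
        error = (x * theta) - y                      # m by 1 vector of residuals
        theta = theta - (alpha / m) * (x.T * error)  # simultaneous update of all parameters
        J_vals.append(ComputeCost(x, y, theta))
    return theta, J_vals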
Plotting Data and Fit
We are now ready to see how the model performed and to plot the fitted hypothesis over the scattered data.
theta_f = list(theta.flat)
xs = np.arange(5, 23)
ys = theta_f[0] + theta_f[1] * xs

plt.figure(figsize=(8, 5))
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.grid()
plt.plot(df.Population, df.Profit, 'rx', label='Training Data')
plt.plot(xs, ys, 'b-', label='Linear Regression: h(x) = %0.2f + %0.2fx' % (theta_f[0], theta_f[1]))
plt.legend(loc=4)
Which can be seen in the figure below.
Obtaining Predicted Values
Finally, we test the model and output predictions for a couple of example populations.
# Predict the profit for populations of 35,000 and 70,000.
print((theta_f[0] + theta_f[1] * 3.5) * 10000)
print((theta_f[0] + theta_f[1] * 7) * 10000)
The output gives an estimate of how the profit increases with population, according to the fitted model.
Out: 4519.767867701772
     45342.45012944714
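For convenience, the two predictions above could be wrapped in a small helper. The name predict_profit is hypothetical and not part of the notebook; it just repackages the same expression:
def predict_profit(population):
    # population is the raw figure (e.g. 35000); the model works in units of 10,000s,
    # and the output is converted back from $10,000s to dollars
    return (theta_f[0] + theta_f[1] * (population / 10000)) * 10000

print(predict_profit(35000))  # should match the first value above
print(predict_profit(70000))  # should match the second value above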
To better visualize how the cost function is minimized, we can plot \( J(\theta_{0}, \theta_{1}) \) over a grid of parameter values and mark the minimum found by gradient descent.
from mpl_toolkits.mplot3d import axes3d

# Create a meshgrid over theta_0 and theta_1.
xs = np.linspace(-10, 10, 100)
ys = np.linspace(4, -1, 100)
xx, yy = np.meshgrid(xs, ys)

# Initialize J values to a matrix of 0's.
J_vals = np.zeros((xs.size, ys.size))

# Fill in the J values for every (theta_0, theta_1) pair on the grid.
for index, v in np.ndenumerate(J_vals):
    J_vals[index] = ComputeCost(x, y, [[xx[index]], [yy[index]]])

# Create a set of subplots.
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122)

# Create the surface plot.
hh = ax1.plot_surface(xx, yy, J_vals, alpha=0.5, cmap='jet')
ax1.set_zlabel('Cost', fontsize=12, rotation=90)
ax1.set_title('Surface plot of cost function')
cbar = fig.colorbar(hh)

# Create the contour plot.
ax2.contour(xx, yy, J_vals, np.logspace(-2, 3, 20), cmap='jet')
ax2.plot(theta_f[0], theta_f[1], 'rx')
ax2.set_title('Contour plot of cost function, showing minimum')

# Create labels for both plots.
for ax in fig.axes:
    ax.set_xlabel(r'$\theta_0$', fontsize=14)
    ax.set_ylabel(r'$\theta_1$', fontsize=14)

ax1.view_init(20, -135)
The figure shows the 3D surface of the cost function J and the contour plot marking the minimum reached by the fitted model.