Week 3: Computational Basics – Numerical computation and optimization, Introduction to Machine Learning packages

  • How machines store numbers
    • A machine can only store numbers to a finite precision
    • The floating-point format: ±S × b^e
    • The IEEE 754 64-bit scheme: 1 sign bit, an 11-bit (biased) exponent, and a 52-bit mantissa
    • Underflow and overflow
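The limits above can be inspected directly with NumPy's `finfo`; a minimal sketch (the specific example values are illustrative):

```python
import numpy as np

# IEEE 754 double precision: 1 sign bit, 11-bit exponent, 52-bit mantissa
info = np.finfo(np.float64)
print(info.eps)    # machine epsilon, about 2.22e-16 (= 2**-52)
print(info.max)    # largest finite double, about 1.8e308
print(info.tiny)   # smallest positive normal double, about 2.2e-308

# Finite precision: 0.1 + 0.2 is not stored exactly as 0.3
print(0.1 + 0.2 == 0.3)   # False

# Overflow produces inf; underflow silently flushes to zero
with np.errstate(over='ignore', under='ignore'):
    print(np.float64(1e308) * 10)     # inf (overflow)
    print(np.float64(1e-308) / 1e20)  # 0.0 (underflow)
```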
  • Condition number
    • The condition number of a function with respect to an argument measures how much the output value of the function can change for a small change in the input argument.
    • For symmetric matrices, condition number is the ratio of the largest to smallest eigenvalue.
    • cond(A) = ‖A‖ ‖A⁻¹‖, where A is a square, non-singular matrix.
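NumPy computes this directly with `np.linalg.cond` (2-norm by default); the matrices below are illustrative examples of a well-conditioned and a nearly singular case:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])        # well-conditioned
B = np.array([[1.0, 1.0],
              [1.0, 1.0001]])     # nearly singular

# cond(A) = ||A|| * ||A^-1||; with the 2-norm this is the ratio of the
# largest to smallest singular value
print(np.linalg.cond(A))   # 2.0
print(np.linalg.cond(B))   # ~4e4 -- output is very sensitive to the input

# For a symmetric matrix, the eigenvalue ratio gives the same value
eig = np.linalg.eigvalsh(A)
print(eig.max() / eig.min())   # 2.0
```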
  • Derivative
    • The derivative of a function of a real variable measures the sensitivity to change of the function value with respect to a change in its argument.
    • Partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.
  • Gradient
    • The gradient is a multi-variable generalization of the derivative. While a derivative can be defined on functions of a single variable, for functions of several variables, the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.
    • The gradient of a function ƒ, denoted as ∇ƒ, is the collection of all its partial derivatives into a vector.
  • Hessian
    • It is the Jacobian of the gradient (informally, the gradient of the gradient).
    • It is a square matrix of second-order partial derivatives of a scalar-valued function.
  • Jacobian
    • The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function.
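The gradient and Jacobian can be approximated numerically with central differences; a minimal sketch (the helper functions `grad` and `jacobian` and the example functions are assumptions for illustration, not library APIs):

```python
import numpy as np

def grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar f."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def jacobian(F, x, h=1e-6):
    """Central-difference Jacobian of a vector-valued F."""
    cols = []
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        cols.append((F(x + e) - F(x - e)) / (2 * h))
    return np.column_stack(cols)

f = lambda x: x[0]**2 + 3 * x[0] * x[1]          # scalar-valued
F = lambda x: np.array([x[0]**2, x[0] * x[1]])   # vector-valued

x = np.array([1.0, 2.0])
print(grad(f, x))       # approx [2*1 + 3*2, 3*1] = [8, 3]
print(jacobian(F, x))   # approx [[2, 0], [2, 1]]
```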
  • Taylor Series
    • A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point.
    • A function can be approximated by using a finite number of terms of its Taylor series.
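A quick illustration of truncating the series: approximating e^x around 0 by its first few terms (the function name `taylor_exp` is an assumption for illustration):

```python
import math

def taylor_exp(x, n_terms):
    """Approximate e**x by the first n_terms of its Taylor series at 0:
    e**x = sum_{n>=0} x**n / n!"""
    return sum(x**n / math.factorial(n) for n in range(n_terms))

# More terms -> smaller error at x = 1
for n in (2, 4, 8):
    approx = taylor_exp(1.0, n)
    print(n, approx, abs(approx - math.e))
```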
  • Optimization
    • The general optimization task is to maximize or minimize a function f(x) by varying x.
    • The function f(x) is called objective function, cost function or loss function.
    • Any maximization problem can be written as minimization of -f(x).
  • Unconstrained problem
    • Find x that minimizes f(x) with x ∊ ℝ
    • Any local extremum will have the property f ′(x)=0
    • Such points are called stationary points or critical points
    • The stationary points may be a (local) minimum, maximum or saddle point.
    • If f ′′(x)>0, it is a local minimum.
    • If f ′′(x)<0, it is a local maximum.
    • If f ′′(x)=0, it could be a saddle point.
    • The absolute lowest/highest value of f(x) is called the global minimum/maximum
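The second-derivative test above can be worked through on a concrete example, f(x) = x³ − 3x, whose stationary points solve f′(x) = 3x² − 3 = 0 (a sketch; the `classify` helper is an assumption for illustration):

```python
# f(x) = x**3 - 3*x has stationary points at x = -1 and x = +1,
# since f'(x) = 3x^2 - 3 = 0 there; f''(x) = 6x decides their type.
def classify(x):
    f2 = 6 * x            # second derivative at the stationary point
    if f2 > 0:
        return "local minimum"
    if f2 < 0:
        return "local maximum"
    return "inconclusive (possible saddle point)"

print(classify(-1.0))  # f''(-1) = -6 < 0 -> local maximum
print(classify(1.0))   # f''(1)  =  6 > 0 -> local minimum
```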
  • Optimization - Multivariate x
    • Since x is now a vector quantity, we need to evaluate the gradient.
    • The type of critical point is decided by the nature of the Hessian.
    • If H is positive definite, it is a local minimum.
    • If H is negative definite, it is a local maximum.
    • If H is indefinite (neither positive nor negative definite), then it is a saddle point.
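Since the definiteness of a symmetric Hessian is determined by the signs of its eigenvalues, the classification can be sketched as follows (the `classify_hessian` helper is an assumption for illustration):

```python
import numpy as np

def classify_hessian(H):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)      # eigenvalues of a symmetric matrix
    if np.all(eig > 0):
        return "local minimum"       # positive definite
    if np.all(eig < 0):
        return "local maximum"       # negative definite
    return "saddle point"            # indefinite (zero eigenvalues aside)

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at its critical point (0, 0)
print(classify_hessian(np.diag([2.0, -2.0])))   # saddle point
print(classify_hessian(np.diag([2.0, 4.0])))    # local minimum
print(classify_hessian(np.diag([-2.0, -4.0])))  # local maximum
```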
  • Constrained optimization
    • The general constrained optimization task is to maximize or minimize a function f(x) by varying x given certain constraints on x.
    • All constraints can be converted to two types of constraints:
      • Equality constraints
      • Inequality constraints
    • Canonical form of optimization problems.
  • Generalized Lagrange function
    • It is a strategy for finding the local maxima and minima of a function subject to equality constraints.
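For a quadratic objective with a linear equality constraint, setting the gradient of the Lagrangian to zero gives a linear system that can be solved directly; a minimal worked example (the specific problem below is an assumption for illustration):

```python
import numpy as np

# Minimize f(x, y) = x^2 + y^2 subject to x + y = 1.
# With L = f - lam*(x + y - 1), setting grad L = 0 plus the constraint gives:
#   2x - lam = 0,   2y - lam = 0,   x + y = 1
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
b = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)   # 0.5 0.5 1.0
```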
  • Gradient Descent
    • Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
    • It is possible for the gradient descent algorithm to:
      • Diverge
      • Oscillate without diverging or converging
      • Converge slowly
      • Converge rapidly
    • When should iteration stop?
  • Gradient Descent Procedure
    1. Decide on the learning rate (α), stopping precision and stopping criteria.
    2. Make an initial guess w = w₀
    3. Calculate wₖ₊₁ = wₖ − α ∇J(wₖ)
    4. If stopping criteria not satisfied, go to step 3
    5. Stop
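The steps above can be sketched with a gradient-norm stopping criterion, which is one common answer to "when should iteration stop?" (the tolerance `tol` and the function name `gradient_descent` are assumptions for illustration):

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, tol=1e-6, max_iter=10_000):
    """Iterate w <- w - alpha * grad J(w) until the gradient norm
    falls below tol (or max_iter is reached)."""
    w = np.asarray(w0, dtype=float)
    for k in range(max_iter):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        w = w - alpha * g
    return w, k

# J(w) = w0^2 + w1^2 + 4 has gradient 2w and its minimum at the origin
w_min, iters = gradient_descent(lambda w: 2 * w, [3.0, 4.0])
print(w_min, iters)   # converges to approximately [0, 0]
```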
  • Packages/Tools
    • Python
    • Scikit-Learn
    • Tensorflow
    • Keras
    • PyTorch
    • Caffe
    • Google Colab
    • MATLAB

Assignment Tips


  • Question 1: The first question is similar to the example shown in the video lecture. The Python code for the example problem is given below; try running it in Google Colab.
-----------------------------------------------------------
import numpy as np

def J(w):
    return w[0]**2 + w[1]**2 + 4 

def gradientJ(w):
    return 2*w # same as [2*w[0], 2*w[1]]

w = np.array([3, 4])
alpha = 0.1

print("%-2s %-16s %-16s %-7s %-16s" 
      % ("k", "w", "gradient w", "J(w)", "new w"))
print("-----------------------------------------------------------")

for k in range(32):
    print("%02d %-16s %-16s %05.2f " 
          % (k, np.round(w,4), np.round(gradientJ(w),4), J(w)), end="")
    w = w - alpha * gradientJ(w)   
    print(" %-16s" % (np.round(w,4)))
-----------------------------------------------------------

Output:

-----------------------------------------------------------
k  w                gradient w       J(w)    new w           
-----------------------------------------------------------
00 [3 4]            [6 8]            29.00  [2.4 3.2]       
01 [2.4 3.2]        [4.8 6.4]        20.00  [1.92 2.56]     
02 [1.92 2.56]      [3.84 5.12]      14.24  [1.536 2.048]   
03 [1.536 2.048]    [3.072 4.096]    10.55  [1.2288 1.6384] 
... 
29 [0.0046 0.0062]  [0.0093 0.0124]  04.00  [0.0037 0.005 ] 
30 [0.0037 0.005 ]  [0.0074 0.0099]  04.00  [0.003 0.004]   
-----------------------------------------------------------
-----------------------------------------------------------
import numpy as np
from numpy import linalg as LA
x = np.array([
    [1, 0],
    [1, 0.25],
    [1, 0.5],
    [1, 0.75],
    [1, 1.00]
]) # a column of 1's is added to each input to accommodate the bias/intercept

y = np.array([
    [0.88822],
    [1.2165],
    [1.3171],
    [1.7930],
    [1.9826]
])
# normal equation
wAccordingToNormalEquation = LA.inv(x.T.dot(x)).dot(x.T).dot(y)

print("W according to normal equation: %s " % wAccordingToNormalEquation)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x[:,1:],y) # drop the column of 1's; LinearRegression fits the intercept itself
wAccordingToLinearRegression = lr.intercept_,lr.coef_
print("W according to LinearRegression: %s %s" % wAccordingToLinearRegression)
-----------------------------------------------------------

Output:

-----------------------------------------------------------
W according to normal equation: [[0.886432] [1.106104]] 
W according to LinearRegression: [0.886432] [[1.106104]]
-----------------------------------------------------------
