 How machines store numbers
 Machines can only store numbers to a finite precision.
 The floating-point format: ±S × b^e (sign, significand S, base b, exponent e)
 For the 64-bit IEEE 754 storage scheme: 1 sign bit, 11 exponent bits (stored with a bias), 52 mantissa bits
 Underflow and overflow
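These limits can be inspected directly in Python (a minimal sketch using only the standard library; exact printed values depend on the platform's double type):

```python
import sys

# Finite precision: 0.1 and 0.2 have no exact binary representation,
# so their sum is not exactly 0.3
print(0.1 + 0.2 == 0.3)          # False
print(abs((0.1 + 0.2) - 0.3))    # a tiny nonzero rounding error

# Limits of the IEEE 754 double-precision format
print(sys.float_info.max)        # largest finite double, ~1.8e308
print(sys.float_info.min)        # smallest positive normal double, ~2.2e-308
print(1e308 * 10)                # inf: overflow
print(1e-308 / 1e100)            # 0.0: underflow past the subnormal range
```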
 Condition number
 The condition number of a function with respect to an argument measures how much the output value of the function can change for a small change in the input argument.
 For symmetric matrices, condition number is the ratio of the largest to smallest eigenvalue.
 cond(A) = ‖A‖ ‖A⁻¹‖, where A must be a square, nonsingular matrix.
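For example, with NumPy (a sketch; `np.linalg.cond` uses the 2-norm by default):

```python
import numpy as np

# cond(A) = ||A|| * ||A^{-1}||, here in the 2-norm
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])
print(np.linalg.cond(A))                   # ≈ 4.0

# For a symmetric matrix this equals
# |largest eigenvalue| / |smallest eigenvalue|
eigs = np.linalg.eigvalsh(A)
ratio = abs(eigs).max() / abs(eigs).min()
print(ratio)                               # ≈ 4.0
```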
 Derivative
 The derivative of a function of a real variable measures the sensitivity to change of the function value with respect to a change in its argument.
 Partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.
 Gradient
 The gradient is a multivariable generalization of the derivative. While a derivative can be defined on functions of a single variable, for functions of several variables, the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.
 The gradient of a function ƒ, denoted as ∇ƒ, is the collection of all its partial derivatives into a vector.
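A sketch of this definition, approximating each partial derivative by central differences (the function f below is an arbitrary example chosen for illustration):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3.0 * y   # example scalar-valued function of two variables

def numerical_gradient(f, v, h=1e-6):
    """Central-difference approximation of the gradient of f at v."""
    g = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        step = np.zeros_like(v, dtype=float)
        step[i] = h
        g[i] = (f(v + step) - f(v - step)) / (2 * h)
    return g

v = np.array([2.0, 1.0])
print(numerical_gradient(f, v))   # ≈ [4.0, 3.0], since ∇f = (2x, 3)
```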
 Hessian
 Informally, it is the gradient of the gradient (more precisely, the Jacobian of the gradient).
 It is a square matrix of second-order partial derivatives of a scalar-valued function.
 Jacobian
 The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function.
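A minimal sketch, again via central differences (the vector-valued function F below is an arbitrary example):

```python
import numpy as np

def F(v):
    x, y = v
    # example vector-valued function F: R^2 -> R^2
    return np.array([x**2 * y, 5.0 * x + np.sin(y)])

def numerical_jacobian(F, v, h=1e-6):
    """Central-difference Jacobian: J[i, j] = dF_i / dv_j."""
    v = np.asarray(v, dtype=float)
    m = len(F(v))
    J = np.zeros((m, len(v)))
    for j in range(len(v)):
        step = np.zeros_like(v)
        step[j] = h
        J[:, j] = (F(v + step) - F(v - step)) / (2 * h)
    return J

v = np.array([1.0, 2.0])
print(numerical_jacobian(F, v))
# analytic Jacobian: [[2xy, x^2], [5, cos(y)]] -> ≈ [[4, 1], [5, cos(2)]]
```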
 Taylor Series
 A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point.
 A function can be approximated by using a finite number of terms of its Taylor series.
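For example, truncating the Taylor series of e^x about 0 (a minimal sketch):

```python
import math

def taylor_exp(x, n_terms):
    """Approximate e^x by the first n_terms of its Taylor series at 0:
    sum of x^k / k! for k = 0 .. n_terms - 1."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

# More terms give a better approximation of exp(1)
for n in (2, 4, 8):
    print(n, taylor_exp(1.0, n), math.exp(1.0))
```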
 Optimization
 The general optimization task is to maximize or minimize a function f(x) by varying x.
 The function f(x) is called the objective function, cost function, or loss function.
 Any maximization problem can be rewritten as a minimization: maximizing f(x) is equivalent to minimizing −f(x).
 Unconstrained problem
 Find x that minimizes f(x) with x ∊ ℝ
 Any local extremum will have the property f ′(x)=0
 Such points are called stationary points or critical points
 The stationary points may be a (local) minimum, maximum or saddle point.
 If f ′′(x)>0, it is a local minimum.
 If f ′′(x)<0, it is a local maximum.
 If f ′′(x)=0, the test is inconclusive: the point may be a saddle (inflection) point, but could also be a minimum or maximum.
 The absolute lowest/highest value of f(x) is called the global minimum/maximum
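The second-derivative test above can be illustrated on f(x) = x^3 − 3x, whose stationary points solve f′(x) = 3x^2 − 3 = 0, i.e. x = ±1:

```python
# f(x) = x^3 - 3x: f'(x) = 3x^2 - 3 vanishes at x = -1 and x = 1
def f_pp(x):
    return 6.0 * x            # second derivative of x^3 - 3x

for x in (-1.0, 1.0):
    if f_pp(x) > 0:
        kind = "local minimum"
    elif f_pp(x) < 0:
        kind = "local maximum"
    else:
        kind = "inconclusive"
    print(x, kind)            # x = -1 is a local maximum, x = 1 a local minimum
```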
 Optimization: Multivariate x
 Since x is now a vector quantity, we need to evaluate the gradient.
 The type of critical point is decided by the nature of the Hessian.
 If H is positive definite, it is a local minimum.
 If H is negative definite, it is a local maximum.
 If H is indefinite (it has both positive and negative eigenvalues, so it is neither positive nor negative definite), then it is a saddle point.
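These three cases can be checked from the Hessian's eigenvalues (a sketch; the saddle example is the Hessian of f(x, y) = x^2 − y^2 at the origin):

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of its Hessian H."""
    eigs = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    if np.all(eigs > 0):
        return "local minimum"            # H positive definite
    if np.all(eigs < 0):
        return "local maximum"            # H negative definite
    if np.any(eigs > 0) and np.any(eigs < 0):
        return "saddle point"             # H indefinite
    return "inconclusive"                 # some eigenvalue is zero

# Hessian of f(x, y) = x^2 - y^2 at its critical point (0, 0)
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])
print(classify_critical_point(H_saddle))  # saddle point
```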
 Constrained optimization
 The general constrained optimization task is to maximize or minimize a function f(x) by varying x given certain constraints on x.
 All constraints can be converted to two types of constraints:
 Equality constraints
 Inequality constraints
 Canonical form of optimization problems.
 Generalized Lagrange function
 It is a strategy for finding the local maxima and minima of a function subject to constraints; for equality constraints it is the classical method of Lagrange multipliers, and the generalized form also covers inequality constraints.
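As a worked sketch (an illustrative problem, not one from the lecture): minimize f(x, y) = x^2 + y^2 subject to x + y = 1. Setting the gradient of the Lagrangian L = f + λ(x + y − 1) to zero, together with the constraint, gives a linear system:

```python
import numpy as np

# Stationarity: 2x + lam = 0, 2y + lam = 0; constraint: x + y = 1
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(K, rhs)
print(x, y, lam)   # ≈ 0.5, 0.5, -1.0: the constrained minimum is at (1/2, 1/2)
```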
 Gradient Descent
 Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
 It is possible for the gradient descent algorithm to:
 Diverge
 Oscillate without diverging or converging
 Converge slowly
 Converge rapidly
 When should iteration stop?
 Gradient Descent Procedure
 1. Decide on the learning rate (α), stopping precision, and stopping criteria.
 2. Make an initial guess w = w^{0}.
 3. Calculate w^{k+1} = w^{k} − α ∇J(w^{k}).
 4. If the stopping criteria are not satisfied, go to step 3.
 5. Stop.
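The steps above can be sketched as follows (the objective J(w) = ‖w‖^2, with gradient 2w and minimizer w = 0, is a stand-in chosen for illustration):

```python
import numpy as np

def grad_J(w):
    return 2.0 * w                 # gradient of J(w) = ||w||^2

alpha = 0.1                        # learning rate
precision = 1e-8                   # stopping precision on the step size
w = np.array([3.0, -4.0])          # initial guess w^0

for _ in range(10_000):
    step = alpha * grad_J(w)
    w = w - step                   # w^{k+1} = w^k - alpha * grad J(w^k)
    if np.linalg.norm(step) < precision:   # stopping criterion
        break

print(w)                           # close to the minimizer [0, 0]
```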
 Packages/Tools
 Python
 Scikit-Learn
 TensorFlow
 Keras
 PyTorch
 Caffe
 Google Colab
 MATLAB
Assignment Tips
 Question 1: The first question is similar to the example shown in the video lecture. The Python code for the example problem is given below; try to run it in Google Colab.
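The original listing is not reproduced here; the following is a minimal stand-in sketch of a one-dimensional gradient-descent run (the objective f(x) = (x − 3)^2 is assumed for illustration and may differ from the lecture's example):

```python
# Hypothetical stand-in for the lecture example:
# minimize f(x) = (x - 3)^2 by gradient descent.
def df(x):
    return 2.0 * (x - 3.0)        # derivative of (x - 3)^2

alpha = 0.2                        # learning rate
x = 0.0                            # initial guess
for k in range(50):
    x = x - alpha * df(x)          # gradient step

print(round(x, 6))                 # converges toward the minimizer x = 3
```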
Question 2 and 3: It is always better to cross-verify your answer. Use the known minimum of the function (here f(x, y) = x^2 + y^2 + 4x − 6y − 7, minimized with respect to x and y) to verify whether your gradient descent algorithm implementation is correct or not.
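A sketch of such a cross-check (assuming the function is f(x, y) = x^2 + y^2 + 4x − 6y − 7; adjust the signs to match the assignment). Setting the gradient to zero gives the closed-form minimizer x = −2, y = 3, which gradient descent should reproduce:

```python
import numpy as np

# Assumed objective: f(x, y) = x^2 + y^2 + 4x - 6y - 7
def f(v):
    x, y = v
    return x**2 + y**2 + 4*x - 6*y - 7

def grad_f(v):
    x, y = v
    return np.array([2*x + 4, 2*y - 6])   # gradient; zero at (-2, 3)

w = np.array([0.0, 0.0])                  # initial guess
for _ in range(5000):
    w = w - 0.1 * grad_f(w)               # gradient descent step

print(w, f(w))   # ≈ [-2, 3] with f = -20, matching the closed-form minimum
```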

Question 7 to 10: The objective function here is the mean squared error (divided by 2 to simplify the derivative). There are two ways to cross-verify your gradient descent implementation: using the normal equation, and using linear regression (scikit-learn).
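A sketch of the normal-equation cross-check on synthetic data (the data below is made up for illustration; scikit-learn's `LinearRegression(fit_intercept=False)` on the same X and y would return the same coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(-1, 1, 50)]   # design matrix with bias column
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * rng.standard_normal(50)  # noisy targets

# Normal equation: w = (X^T X)^{-1} X^T y, solved without forming the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on J(w) = (1/2m) ||Xw - y||^2, whose gradient is X^T(Xw - y)/m
m = len(y)
w = np.zeros(2)
for _ in range(20_000):
    w = w - 0.1 * (X.T @ (X @ w - y)) / m

print(w_normal, w)   # the two estimates should agree closely
```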