
Exercises and notes of ML course on Coursera







Gradient Descent

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
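
As a minimal sketch of the update rule above, the loop below applies it to every θj simultaneously. The function name `gradient_descent`, the toy cost, and the default values of `alpha` and `iterations` are illustrative assumptions, not part of the course material.

```python
def gradient_descent(grad, theta, alpha=0.1, iterations=100):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j for all j at once."""
    for _ in range(iterations):
        g = grad(theta)  # all partial derivatives at the current theta
        # Build the new list from the old theta so the update is simultaneous.
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical cost J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2, minimized at (3, -1).
def example_grad(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta_min = gradient_descent(example_grad, [0.0, 0.0])
```

With this step size the distance to the minimum shrinks by a constant factor per iteration; a much larger `alpha` would make the same loop diverge.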

Feature scaling:

  • Features being on similar scales is better for GD, making it converge faster.
  • Rule of thumb: rescale each feature so that its range is no narrower than about [-1/3, 1/3] and no wider than about [-3, 3].

Mean normalization: replace $x_i$ with $x_i - \mu_i$ so that each feature has approximately zero mean (this is not applied to the constant feature $x_0 = 1$).
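
A small sketch that combines mean normalization with feature scaling for one feature column. Dividing by the range (max minus min) is one common choice and is an assumption here; the standard deviation is another. The helper name `mean_normalize` and the sample values are hypothetical.

```python
def mean_normalize(column):
    """Return (scaled column, mean, spread) for a single feature column."""
    mu = sum(column) / len(column)
    spread = max(column) - min(column)  # assumption: scale by the range
    if spread == 0:
        spread = 1.0  # constant feature: avoid division by zero
    return [(x - mu) / spread for x in column], mu, spread


sizes = [2104.0, 1416.0, 1534.0, 852.0]  # hypothetical feature values
scaled, mu, spread = mean_normalize(sizes)
```

The returned `mu` and `spread` should be kept, since new inputs have to be transformed with the same values before prediction.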

Linear Regression

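$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Vectorized: $\theta := \theta - \frac{\alpha}{m} X^T \, \mathrm{diff}$ with $\mathrm{diff} = X\theta - y$.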
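
The vectorized update can be sketched in plain Python, with nested lists standing in for the matrix X (whose first column is the constant feature 1). The function names, the toy data, and the default `alpha` and `iterations` are assumptions chosen for illustration.

```python
def predict(X, theta):
    """Compute X * theta, one prediction per row of X."""
    return [sum(xj * tj for xj, tj in zip(row, theta)) for row in X]


def linear_regression_gd(X, y, alpha=0.1, iterations=2000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        diff = [h - yi for h, yi in zip(predict(X, theta), y)]  # X*theta - y
        # (1/m) * X^T * diff, one entry per parameter
        grad = [sum(diff[i] * X[i][j] for i in range(m)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy data generated from y = 1 + 2x, so theta should approach [1, 2].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = linear_regression_gd(X, y)
```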

Normal equation (analytic method):

  • $\theta = (X^T X)^{-1} X^T y$
  • Slow when n (the number of features) is very large, because it requires inverting an n×n matrix. n around 10,000 can still be fine; beyond that, GD is usually more suitable.
  • $X^T X$ might be noninvertible when some features are linearly dependent, or when there are too many features (n ≥ m).
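
A rough sketch of the normal equation without external libraries. Instead of forming the inverse explicitly, it solves $(X^T X)\theta = X^T y$ by Gauss-Jordan elimination; that solver, the helper names, and the singularity tolerance `1e-12` are assumptions made for this example.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (X^T X is noninvertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_equation(X, y):
    m, n = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(m)) for b in range(n)]
           for a in range(n)]
    Xty = [sum(X[i][a] * y[i] for i in range(m)) for a in range(n)]
    return solve(XtX, Xty)


# Toy data generated from y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
```

If two feature columns are identical (linearly dependent), `solve` raises `ValueError`, which mirrors the noninvertible case listed above.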

Logistic Regression

  • $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ and $\theta^T x = \ln\frac{h}{1-h}$.
  • The odds of an event are the ratio of its probability to the probability of its complement (in betting, the ratio between the amounts staked by the parties to a bet). Here the odds are $h/(1-h)$.
  • The likelihood is a function of θ: the probability of observing the labels of all training examples under the model with parameters θ. In the discrete case each example can be treated as a binomial count, so the probability of an example is the PMF of seeing $y_i$ positive outcomes in $n_i$ trials, which is $\binom{n_i}{y_i} h_i^{y_i} (1-h_i)^{n_i - y_i}$. For binary labels, $n_i = 1$ and this reduces to $h_i^{y_i} (1-h_i)^{1-y_i}$. The likelihood is the product of the probabilities of all examples.
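  • Equivalently, we can use GD by defining a cost for each example (its negative log-likelihood) and then minimizing the total cost function J:
    $\mathrm{cost}(h, y) = -y \ln(h) - (1-y) \ln(1-h)$, i.e. $-\ln(h)$ if y = 1 and $-\ln(1-h)$ if y = 0, and $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{cost}(h^{(i)}, y^{(i)})$
    $\frac{\partial}{\partial \theta_j} J(\theta)$ has the same form as in Linear Regression (but $h^{(i)}$ is now the sigmoid hypothesis).
    $\theta := \theta - \frac{\alpha}{m} X^T \, \mathrm{diff}$ with $\mathrm{diff} = h - y$.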
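
The cost and the update above can be sketched as follows, using only the standard library. The function names, the toy data set (deliberately not linearly separable), and the default `alpha` and `iterations` are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(X, theta):
    """h = sigmoid(X * theta), one value per example."""
    return [sigmoid(sum(xj * tj for xj, tj in zip(row, theta))) for row in X]


def logistic_cost(X, y, theta):
    """J(theta) = (1/m) * sum(-y*ln(h) - (1-y)*ln(1-h))."""
    h = hypothesis(X, theta)
    return sum(-yi * math.log(hi) - (1 - yi) * math.log(1 - hi)
               for hi, yi in zip(h, y)) / len(X)


def logistic_regression_gd(X, y, alpha=0.1, iterations=2000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        diff = [hi - yi for hi, yi in zip(hypothesis(X, theta), y)]  # h - y
        grad = [sum(diff[i] * X[i][j] for i in range(m)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data: first column is x_0 = 1, labels overlap around x = 2..3.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]
theta = logistic_regression_gd(X, y)
```

At θ = 0 every prediction is 0.5, so the starting cost is ln 2; training should bring the cost below that value.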

Maximum Likelihood Estimation

  • The likelihood function is the joint probability distribution of observed data expressed as a function of parameters.
  • The likelihood function has the same form as the PMF (for discrete data) or the PDF (for continuous data), but it is treated as a function of the parameters with the observed y held fixed, not as a function of y.
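
As a short worked example, assuming binary labels ($n_i = 1$) and writing $h_i = h_\theta(x^{(i)})$, the log of the likelihood is the logistic regression cost J up to a factor of $-m$. Maximizing the likelihood therefore picks the same θ as minimizing J.

```latex
\begin{aligned}
L(\theta) &= \prod_{i=1}^{m} h_i^{y_i} (1 - h_i)^{1 - y_i} \\
\ln L(\theta) &= \sum_{i=1}^{m} \left[ y_i \ln h_i + (1 - y_i) \ln(1 - h_i) \right]
              = -m \, J(\theta)
\end{aligned}
```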


Regularization

  • Used to fight overfitting: keep all features but reduce the magnitude of the parameters θj.
  • Underfitting: high bias; overfitting: high variance.
  • We add a term weighted by a regularization parameter λ to the cost function, so that minimizing the cost also pushes the values of the θj's down. It acts as a penalty on large parameters.
  • By convention, the parameter θ0 is not regularized.
  • Setting λ too large can make the model underfit (all θj for j ≥ 1 are pushed toward 0).
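
A sketch of a regularized cost, assuming the squared-error cost of linear regression with the conventional $\frac{1}{2m}$ scaling and an L2 penalty of $\frac{\lambda}{2m} \sum_{j \ge 1} \theta_j^2$. Both the scaling and the function name `regularized_cost` are assumptions for illustration.

```python
def regularized_cost(X, y, theta, lam):
    """Squared-error cost plus an L2 penalty on theta_1..theta_n."""
    m = len(X)
    errors = [sum(xj * tj for xj, tj in zip(row, theta)) - yi
              for row, yi in zip(X, y)]
    fit = sum(e * e for e in errors) / (2 * m)
    # theta[1:] skips theta_0, which is not regularized.
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return fit + penalty


# Toy data generated from y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
cost_plain = regularized_cost(X, y, [1.0, 2.0], lam=0.0)
cost_penalized = regularized_cost(X, y, [1.0, 2.0], lam=4.0)
```

Even a perfect fit pays the penalty once λ > 0, which is what pulls the minimizer toward smaller θj.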

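Using L2 regularization (the penalty is λ times the sum of the squared parameters, $\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$):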

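  • Gradient Descent: for j ≥ 1 the update becomes $\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \frac{\alpha}{m} \sum_{i=1}^{m} (h^{(i)} - y^{(i)}) x_j^{(i)}$. The factor $1 - \alpha \lambda / m$ is usually slightly less than 1, so each step shrinks θj a little before applying the usual gradient step.
  • Linear Regression - Normal Equation:
    • $\theta = (X^T X + \lambda L)^{-1} X^T y$, where L is a diagonal matrix of size (n+1)×(n+1) whose diagonal elements are all 1 except the first, which is 0.
    • $X^T X + \lambda L$ is invertible when λ > 0.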
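
A minimal sketch of one regularized gradient descent step for linear regression, written to make the shrink factor explicit. The function name `regularized_gd_step` and the toy values are hypothetical.

```python
def regularized_gd_step(X, y, theta, alpha, lam):
    """One update: theta_j * (1 - alpha*lam/m) - (alpha/m) * sum(diff * x_j)."""
    m = len(X)
    diff = [sum(xj * tj for xj, tj in zip(row, theta)) - yi
            for row, yi in zip(X, y)]
    new_theta = []
    for j, tj in enumerate(theta):
        grad_j = sum(diff[i] * X[i][j] for i in range(m)) / m
        # theta_0 is not regularized, so its shrink factor stays at 1.
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(tj * shrink - alpha * grad_j)
    return new_theta


# Toy data that theta = [1, 2] fits exactly, so only the shrinkage acts.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
stepped = regularized_gd_step(X, y, [1.0, 2.0], alpha=0.1, lam=4.0)
```

Here the data term of the gradient is zero, so the step leaves θ0 unchanged and multiplies θ1 by 1 - 0.1 * 4 / 4 = 0.9.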








