CS229 Lecture Notes

Andrew Ng

Part IX: The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables.

Deep learning (from the notes by Andrew Ng and Kian Katanforoosh; backpropagation updated by Anand Avati)

We now begin our study of deep learning. In this set of notes, we give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation. We will start small and slowly build up a neural network, step by step.

Supervised learning

Let's start by talking about a few examples of supervised learning problems. To describe the supervised learning problem slightly more formally, we will use x(i) to denote the input features, and y(i) to denote the "output" or target variable that we are trying to predict. A pair (x(i), y(i)) is called a training example, and the dataset that we'll be using to learn, a list of n training examples {(x(i), y(i)); i = 1, ..., n}, is called a training set. Note that the superscript "(i)" in this notation is simply an index into the training set. Given x(i), the corresponding y(i) is also called the label for the training example. We will also use X to denote the space of input values, and Y the space of output values.

As a notational aside, we will write "a := b" to denote an operation that overwrites a with the value of b; in contrast, we will write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.

Suppose we have a dataset giving the living areas and prices of 47 houses in Portland; for instance, one entry lists a living area of 2400 square feet and a price of 369 (in thousands of dollars).
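To make the notation concrete, here is a minimal sketch of a linear hypothesis hθ(x) = θ0 + θ1x and the least-squares cost J(θ) = ½ Σi (hθ(x(i)) − y(i))². Apart from the (2400, 369) pair quoted above, the numbers are made-up stand-ins for the Portland dataset.

```python
# Minimal sketch of the notation above. Except for (2400, 369),
# the entries are made-up stand-ins for the housing training set.
xs = [2104.0, 1600.0, 2400.0]   # living areas x(i), in square feet
ys = [400.0, 330.0, 369.0]      # prices y(i), in thousands of dollars

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    theta0, theta1 = theta
    return theta0 + theta1 * x

def J(theta):
    """Least-squares cost J(theta) = 1/2 * sum_i (h(x(i)) - y(i))^2."""
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys))
```

Evaluating J at different choices of θ is exactly the quantity that gradient descent, discussed below, drives down.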
Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

When the target variable that we're trying to predict is continuous, as in our housing example, we call the learning problem a regression problem. What if the target is instead discrete-valued, taking on only a small number of values? Then we call it a classification problem. For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of a piece of email, and y(i) may be 1 if it is spam and 0 otherwise.

Locally weighted linear regression

In the original linear regression algorithm, to make a prediction at a query point x, we would fit θ to minimize Σi (y(i) − θᵀx(i))², and then output θᵀx. In this section, let us talk briefly about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. (Later, we'll also see algorithms for automatically choosing a good set of features.) In LWR, we instead fit θ to minimize Σi w(i)(y(i) − θᵀx(i))², where the w(i)'s are non-negative weights that are large for training examples close to the query point x and small for examples far from it. A standard choice is w(i) = exp(−(x(i) − x)²/(2τ²)); if x is vector-valued, this is generalized to w(i) = exp(−(x(i) − x)ᵀ(x(i) − x)/(2τ²)).

Note that to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need in order to represent the hypothesis grows linearly with the size of the training set; this is in contrast to ordinary linear regression, which, once θ is fit, no longer needs the training data to make predictions.
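The LWR procedure above can be sketched as follows. The four-point dataset and the choice τ = 0.5 are illustrative assumptions, and the weighted line fit is solved via the 2×2 weighted normal equations.

```python
import math

# Sketch of locally weighted linear regression (LWR) in one input
# variable, using the Gaussian-shaped weights from the notes:
#   w(i) = exp(-(x(i) - x)^2 / (2 tau^2)).
# The tiny dataset is a made-up stand-in for real training data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

def lwr_predict(x_query, tau=0.5):
    """Fit theta = (theta0, theta1) minimizing the *weighted* squared
    error around x_query, then return theta0 + theta1 * x_query."""
    w = [math.exp(-(xi - x_query) ** 2 / (2 * tau ** 2)) for xi in xs]
    # Weighted least squares for a line: solve the 2x2 normal equations.
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, xs))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    swy = sum(wi * yi for wi, yi in zip(w, ys))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = sw * swxx - swx * swx
    theta0 = (swxx * swy - swx * swxy) / det
    theta1 = (sw * swxy - swx * swy) / det
    return theta0 + theta1 * x_query
```

Note that every prediction re-fits θ from scratch around the query point, which is exactly why the whole training set must be kept around.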
A probabilistic interpretation

In this section, we will give a set of probabilistic assumptions under which least-squares regression is derived as a very natural algorithm. Assume that the targets and the inputs are related via y(i) = θᵀx(i) + ε(i), where the ε(i) are i.i.d. Gaussian noise terms with mean zero and variance σ². The probability of the data is then given by p(y|X; θ). Note that we should not condition on θ (as in "p(y|X, θ)"), since θ is not a random variable. This quantity is typically viewed as a function of y (and perhaps X), for a fixed value of θ; when we wish to explicitly view it as a function of θ, we will instead call it the likelihood function, L(θ) = p(y|X; θ).

The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible, i.e., we should choose θ to maximize L(θ). Under the Gaussian noise assumptions above, maximizing L(θ) turns out to be the same as minimizing the least-squares cost J(θ); this is how least-squares regression can be derived as a maximum likelihood estimator.
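A quick numerical check of the claim above, under the assumed model y = θx + ε with ε ~ N(0, σ²) (toy data, a single feature, no intercept): ranking candidate θ's by log-likelihood is the reverse of ranking them by J(θ).

```python
import math

# Under y = theta*x + eps, eps ~ N(0, sigma^2), the log-likelihood is
#   log L(theta) = sum_i [ -log(sqrt(2*pi)*sigma)
#                          - (y(i) - theta*x(i))^2 / (2*sigma^2) ],
# i.e. a constant minus J(theta)/sigma^2, so maximizing it is the same
# as minimizing J. Toy made-up data, one feature, no intercept term.
xs = [1.0, 2.0, 3.0]
ys = [0.9, 2.1, 2.9]
sigma = 1.0

def J(theta):
    return 0.5 * sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

def log_likelihood(theta):
    return sum(-math.log(math.sqrt(2 * math.pi) * sigma)
               - (y - theta * x) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))
```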
The hypothesis and gradient descent

Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y. Pictorially, the process is therefore like this: x → h → predicted y (e.g., predicted price of a house). For linear regression we represent the hypothesis as hθ(x) = θᵀx. How do we pick, or learn, the parameters θ? A reasonable method seems to be to make h(x) close to y, at least for the training examples we have; that is, we choose the θ that minimizes J(θ). (What it means more generally for a hypothesis to be good or bad is something we will return to when we talk about learning theory.)

Specifically, let's consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update

θj := θj − α ∂J(θ)/∂θj

(performed simultaneously for all values of j; α is called the learning rate). This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J. To implement it, we have to work out what is the partial derivative term on the right hand side. Let's first work it out for the case where the training set consists of only a single training example (x, y); we then obtain the update rule

θj := θj + α (y − hθ(x)) xj.

This rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. The rule has several properties that seem natural and intuitive; for instance, the magnitude of the update is proportional to the error term (y − hθ(x)).

We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to sum the error over every example in the training set on each step; by grouping the updates of the coordinates into an update of the vector θ, this method looks at the entire training set before taking a single step, and is called batch gradient descent. For instance, on our housing data with a second input feature added alongside living area, batch gradient descent gives θ0 = 89, θ1 = 0.1392, θ2 = −8.738; the above results were obtained with batch gradient descent.

The normal equations

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. Written in vectorial notation, we minimize J by taking its derivatives with respect to the θj's and setting them to zero. This therefore gives us the closed-form solution θ = (XᵀX)⁻¹Xᵀy.
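Batch gradient descent with the LMS update can be sketched as below; the three-point dataset (generated from y = 1 + x1) and the settings α = 0.1 and 2000 iterations are illustrative assumptions.

```python
# A minimal batch gradient descent sketch for the LMS update above,
# on a made-up one-feature dataset (intercept handled via x0 = 1).
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # each x = (x0=1, x1)
ys = [2.0, 3.0, 4.0]                        # generated by y = 1 + x1

def h(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def batch_gradient_descent(alpha=0.1, iters=2000):
    theta = [0.0, 0.0]
    for _ in range(iters):
        # theta_j := theta_j + alpha * sum_i (y(i) - h(x(i))) * x_j(i),
        # i.e. the single-example LMS update summed over the whole set.
        grad = [sum((y - h(theta, x)) * x[j] for x, y in zip(xs, ys))
                for j in range(len(theta))]
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta
```

On this noiseless toy set the iterates approach the exact fit θ = (1, 1), which the normal equations would give in one shot.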
Stochastic gradient descent

There is an alternative to batch gradient descent that also works very well. In stochastic gradient descent (also called incremental gradient descent), we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if n is large), stochastic gradient descent can start making progress right away, and often gets θ "close" to the minimum much faster than batch gradient descent. For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

Classification and logistic regression

Let's now talk about the classification problem. This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x; however, this method performs very poorly. Instead, we change the form of our hypothesis to hθ(x) = g(θᵀx), where g(z) = 1/(1 + e⁻ᶻ) is called the logistic function or the sigmoid function. (Other functions that smoothly increase from 0 to 1 can also be used; is there a deeper reason behind this particular choice? We'll answer this when we talk about GLMs.) Following a derivation similar to the one in the case of linear regression, we can use gradient ascent to maximize the log likelihood, and if we use the resulting update rule we obtain

θj := θj + α (y(i) − hθ(x(i))) x(i)j.

This looks identical to the LMS update rule; but this is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i). (One could also modify the logistic regression method to "force" it to output values that are either 0 or 1 exactly, which leads to the perceptron learning algorithm.)
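The logistic-regression update above can be sketched with batch gradient ascent; the four-point linearly separable dataset and α = 0.5 are made-up illustrative choices.

```python
import math

# Sketch of logistic regression trained by (batch) gradient ascent on
# the log likelihood, using the update rule above. Tiny made-up
# linearly separable dataset: x = (x0=1, x1), label y in {0, 1}.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0.0, 0.0, 1.0, 1.0]

def g(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return g(sum(t * xi for t, xi in zip(theta, x)))

def fit(alpha=0.5, iters=1000):
    theta = [0.0, 0.0]
    for _ in range(iters):
        # theta_j := theta_j + alpha * sum_i (y(i) - h(x(i))) * x_j(i)
        grad = [sum((y - h(theta, x)) * x[j] for x, y in zip(xs, ys))
                for j in range(len(theta))]
        theta = [t + alpha * gj for t, gj in zip(theta, grad)]
    return theta
```

The loop is line-for-line the LMS loop from linear regression; only the definition of h has changed, which is exactly the point made above.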
Generalized linear models and the exponential family

So far, we've seen a regression example and a classification example. In both, the conditional distribution of y given x (the quantity p(y|X; θ)) came from a familiar class: the Gaussian and the Bernoulli, respectively. Both the Bernoulli and the Gaussian distributions are examples of exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

p(y; η) = b(y) exp(ηᵀT(y) − a(η)).

Here η is called the natural parameter of the distribution. A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family. For instance, as we vary the mean φ of a Bernoulli, we obtain Bernoulli distributions with different means, and this class corresponds to a particular choice of T, a and b. We will also show how other models in the exponential family can be cast as generalized linear models (GLMs), and how GLMs can be derived and applied to regression and classification problems. (The presentation of the material in this section takes inspiration from Michael I. Jordan.)

Part V: Support Vector Machines

This part of the notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) "off-the-shelf" supervised learning algorithms.
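The claim that the Bernoulli is in the exponential family can be checked numerically, using the standard choices T(y) = y, b(y) = 1, η = log(φ/(1 − φ)) and a(η) = log(1 + e^η):

```python
import math

# Numerical check that Bernoulli(phi) fits the exponential family form
#   p(y; eta) = b(y) * exp(eta * T(y) - a(eta)),
# with T(y) = y, b(y) = 1, eta = log(phi / (1 - phi)) (the log-odds),
# and a(eta) = log(1 + e^eta).
def bernoulli_pmf(y, phi):
    return phi ** y * (1 - phi) ** (1 - y)

def exp_family_pmf(y, phi):
    eta = math.log(phi / (1 - phi))     # natural parameter
    a = math.log(1 + math.exp(eta))     # log partition function
    return 1.0 * math.exp(eta * y - a)  # b(y) = 1, T(y) = y
```

Note that inverting η = log(φ/(1 − φ)) gives φ = 1/(1 + e⁻ᶜ) with c = η, i.e. the sigmoid, which is the "deeper reason" behind the logistic function promised earlier.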
Newton's method

Returning to logistic regression, let's now talk about a different algorithm for maximizing ℓ(θ). Recall that since the logarithm is a strictly increasing function of L(θ), maximizing the log likelihood ℓ(θ) = log L(θ) gives the same answer as maximizing L(θ) itself. Newton's method gives a way of finding a zero of a function; since the maxima of ℓ are points where the first derivative ℓ′(θ) is zero, we can maximize ℓ by applying Newton's method to solve ℓ′(θ) = 0. For a single (scalar) parameter, the update is

θ := θ − ℓ′(θ)/ℓ″(θ).

In the example plotted in the original notes, successive iterations take θ to about 1.8 and then rapidly approach θ = 1.3, the zero being sought. Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting a Hessian. In our logistic regression setting, θ is vector-valued, so we need to generalize Newton's method to this setting; the generalization, θ := θ − H⁻¹∇θℓ(θ) where H is the Hessian of ℓ, is also called the Newton-Raphson method. When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring.
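A one-dimensional sketch of Newton's method follows. The concrete ℓ below (the log likelihood of a Bernoulli in its natural parameter, after a hypothetical 3 ones in 4 draws) is a made-up example, not the function plotted in the notes.

```python
import math

# Sketch of one-dimensional Newton's method: to maximize l(theta) we
# solve l'(theta) = 0 with the update theta := theta - l'(theta)/l''(theta).
# Made-up concave example: Bernoulli log likelihood in the natural
# parameter eta after seeing 3 ones in 4 draws,
#   l(eta) = 3*eta - 4*log(1 + e^eta),
# whose maximum is at eta = log(3), i.e. sigmoid(eta) = 3/4.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dl(eta):
    """First derivative l'(eta) = 3 - 4*sigmoid(eta)."""
    return 3.0 - 4.0 * sigmoid(eta)

def d2l(eta):
    """Second derivative l''(eta) = -4*sigmoid(eta)*(1 - sigmoid(eta))."""
    s = sigmoid(eta)
    return -4.0 * s * (1.0 - s)

def newton_max(theta0=0.0, iters=10):
    theta = theta0
    for _ in range(iters):
        theta -= dl(theta) / d2l(theta)
    return theta
```

A handful of iterations suffices here, illustrating the fast convergence discussed above; each step, though, needs the second derivative, the scalar analogue of forming and inverting the Hessian.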