# CS 194-10, Fall 2011: Introduction to Machine Learning Reading list

This list is still under construction. An empty bullet item indicates more readings to come for that week.

Readings marked in blue are ones you should cover; readings marked in green are alternatives that are often helpful but probably not essential.

### Background reading (review of prerequisite material):

The following stuff is what you "should" already know from Math 53 and 54, CS 70, and CS 188. Just in case some of it has rusted away, the TAs will cover the essentials in the Week 1 discussion section (8/24). Assignment 0 will also help to scrape off the rust and introduce some basic Python tools for the course.
• RN: 4.1-4.2 (local search and optimization; 4.1.4 is not needed); 7.4 (propositional logic).
• M: Probability: 2.3.1-3 (discrete); 2.4.1-4 (continuous); 2.5.1-2, 2.5.4-5 (means, variances, etc.); 2.6.1-6 (joint); 2.7.1-2 (covariance); 2.10.3-4 (change of variables). Basic math: 31.4.1-2 (derivatives); 31.5.1-2 (convexity); 31.8.1 (Lagrange multipliers). Linear algebra: 32.1-32.4.4; 32.5.3-4; 32.6.1-3.
Includes a lot of detail; useful as a reference as well as a refresher.
• B: 1.2 and Appendix B (probability); Appendix C (linear algebra); Appendix E (Lagrange multipliers).
A reasonable alternative to the material in Murphy. We won't need all of the distributions in App.B.
• RN: 13.2-13.5 (probability); A.2-A.3 (matrices, probability).
13 is quite leisurely, but provides intuition; A.2 and A.3 cover less than you need. Murphy's supplementary chapters are more thorough.

### Week 1 (8/25 only): Introduction to machine learning

• M: 1.1-1.2.4
After a very brief introduction, introduces decision trees and k-nearest-neighbor.
• RN: 18.1-18.2
A broader introduction to learning from the point of view of its role in intelligent systems.
• HTF: 1, 2.1-2.3.
A brief introduction to simple forms of attribute-based statistical learning. 2.3 introduces linear regression and k-nearest-neighbor methods, but in a bit more detail than we need for this week!
• B: 1.1-1.3
A solid if somewhat technical introduction to statistical learning. 1.2 provides some useful background on probability.
• WFH: Ch.1, Ch.2.
Nontechnical introduction stressing applications. Sec. 1.2 provides helpful examples; Ch.2 covers basic issues with handling input data.

### Week 2 (8/30, 9/1): Linear regression, least squares, methodology

• M: 1.3
Covers the basics quite quickly, including the maximum-likelihood interpretation for least squares; also introduces robust regression, which uses a non-Gaussian error model to avoid giving too much weight to outliers. Later chapters (3.7, 5.5, 13, 15) go into much more detail; revised versions are expected from the author by 8/31.
• RN: 18.4 (general concepts for supervised learning), 18.6.1-2 (linear regression), 20.2.3 (MLE interpretation)
18.4 introduces several general topics: errors, loss functions, overfitting, cross-validation, and regularization; 18.6 applies them to regression.
• HTF: 3 (skip 3.2.2, 3.3, 3.4.6)
A statistically oriented discussion reflecting 50 years of experience with linear regression. Several excellent figures. The material on principal-components regression and partial least squares is interesting but also skippable for now; we will come back to these ideas later in the course.
• B: 1.2.5-6 (MLE, MAP, Bayesian regression), 1.5.5 (loss functions for regression), 3.1, 3.3, 3.5.
The material in Ch.1 is brief and conceptual, while Ch.3 is a nice exposition with just enough technical detail.
• WFH: 5.1-5.7.
A fairly thorough and practical treatment of measurement and methodology.
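As a concrete companion to this week's readings, here is a minimal sketch of least-squares fitting in pure Python, using the closed-form solution for a single feature; the data are made up for illustration.

```python
# Ordinary least squares for one feature: minimize sum_i (y_i - (a*x_i + b))^2.
# The closed-form solution is a = Cov(x, y) / Var(x) and b = ybar - a * xbar.

def least_squares_fit(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    var = sum((x - xbar) ** 2 for x in xs)
    a = cov / var
    b = ybar - a * xbar
    return a, b

# Noiseless toy data generated from y = 2x + 1; the fit recovers (2.0, 1.0).
slope, intercept = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
```

With noisy data the same formula gives the minimum-squared-error line rather than an exact recovery; the maximum-likelihood interpretation in the readings explains why squared error corresponds to a Gaussian noise model.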

### Week 3 (9/6, 9/8): Linear classifiers, logistic regression

For the first lecture (machine learning methodology), the material is covered in the readings for Week 2. For the second lecture (linear classifiers):

### Week 4 (9/13, 9/15): Max-margin learning, SVMs, kernels

• HTF: 4.5.2 (introduces max-margin separator); 12.1-12.3.2 (SVMs and kernels).
You can skim the remainder of 12.3 for further algorithmic details and mathematical analysis.
• RN: 18.9.
Gives the basic ideas without too much math.
• M: 18.
Includes 18.3 (relevance vector machines), which is worth looking at (see also Hastie's discussion).
• B: 6.1-6.2 (kernels), 7.1 (SVMs).
Begins with a discussion of kernels and kernel construction before introducing SVMs.
• WFH: 6.4 (pages 223-232).
A practical, non-mathematical guide to SVMs.
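To make the "kernel trick" from this week concrete: a kernel evaluates an inner product in some feature space without ever constructing the features. A small check in Python, using the standard quadratic kernel in 2-D (the example data are made up):

```python
import math

# For the quadratic kernel k(x, z) = (x . z)^2 on 2-D inputs, the implicit
# feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and
# k(x, z) = phi(x) . phi(z) -- same number, never forming phi explicitly.

def quadratic_kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = quadratic_kernel(x, z)   # kernel evaluated directly
rhs = dot(phi(x), phi(z))      # explicit inner product in feature space
```

The two numbers agree (up to floating-point error); for higher-degree polynomial or RBF kernels the implicit feature space is much larger (or infinite-dimensional), which is why SVMs work with kernels rather than explicit features.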

### Week 5 (9/20, 9/22): Decision trees, ensemble learning

• HTF: 9.2 (decision trees; skim 9.1 too); 8.7 (bagging); 10.1-10.9 (boosting).
Simple and clear discussion of the CART (classification and regression tree) methodology. Unfortunately the material on bagging and boosting is scattered and bound up with other technical matters. 10.6 (robust loss functions) and 10.7 (general properties of good machine learning algorithms) are useful, but touch on more general issues than just trees and boosting.
• RN: 18.3 (decision trees), 18.10 (bagging and boosting).
The material on decision trees focuses more on the entropy-based heuristic for growing trees. Figure 18.33 is helpful in explaining boosting (even though I say so myself).
• M: 19.5 (boosting), 19.7.1 (bagging, random forests).
Although decision trees are the first classification method mentioned in Ch.1, there is as yet not much coverage except for a Bayesian analysis later on (which we will probably not get to). Future versions will include classification and regression trees.
• B: 14.4 (decision trees), 14.2-3 (bagging, boosting).
A concise summary of the material in Hastie.
• WFH: 6.1 (decision trees), 6.6 (model trees), 8.1-2, 8.4 (ensemble learning, bagging, boosting).
Solid, detailed description of decision tree learning and ensemble learning. Includes extensive description of "model trees", which are decision trees with linear separators defining the splits at each node.
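The entropy-based splitting heuristic that the RN reading focuses on can be sketched in a few lines of Python; the split data below are hypothetical.

```python
import math

# Decision-tree splitting: choose the attribute with the largest information
# gain, i.e. the largest drop in entropy of the class labels after the split.

def entropy(labels):
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# A perfectly separating attribute gains the full parent entropy (1 bit here);
# an uninformative attribute gains nothing.
parent = [0, 0, 1, 1]
perfect = information_gain(parent, [[0, 0], [1, 1]])
useless = information_gain(parent, [[0, 1], [0, 1]])
```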

### Week 6 (9/27, 9/29): Instance-based learning, neural networks

• HTF: 13.3-5 (nearest-neighbor methods); 11 (neural networks).
Good coverage at about the right level. Material on nearest neighbors includes details on methods that adapt the distance metric.
• RN: 18.8 (instance-based/nonparametric methods), 18.7 (neural networks).
18.8 includes locally weighted regression and some explanations of efficient methods for finding nearest neighbors.
• M: 1.2.4-8 (nearest neighbors, very briefly), 19.6 (neural networks).
• B: 2.5.2 (nearest neighbors), 5.1-3 (neural networks).
k-NN material is very brief, buried in a density estimation section. Neural network material emphasizes optimization methods; 5.4 looks at the second derivatives in depth.
• WFH: 4.7, 6.5 (nearest neighbors), 6.4 pp.232-243 (neural networks).
Includes a useful description of methods for finding nearest neighbors as well as adapting the distance metric. The neural network material is gentle compared to Bishop.
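The basic k-nearest-neighbor classifier from this week fits in a few lines; a sketch in Python with made-up two-cluster data.

```python
from collections import Counter

# k-NN classification: predict the majority label among the k training points
# closest to the query (squared Euclidean distance here).

def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; point and query are tuples."""
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    nearest = sorted(train, key=lambda pl: dist(pl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: one cluster per class.
train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'),
         ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
pred_near_a = knn_predict(train, (0.5, 0.5))
pred_near_b = knn_predict(train, (5.5, 5.5))
```

Sorting all points is O(n log n) per query; the readings' material on efficient nearest-neighbor search (k-d trees, etc.) is about avoiding exactly this brute-force scan.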

### Week 8 (10/11, 10/13): Learning theory, probabilistic methods, naive Bayes

• RN: 18.5 and 18.10.1 (learning theory); 20.1-20.2.2 (probabilistic methods).
18.5 gives a very simple proof of PAC learning for discrete hypothesis spaces, while 18.10.1 describes an online learning algorithm that makes a bounded number of mistakes. Chapter 20 takes a view of learning as probabilistic inference, including maximum likelihood, maximum a posteriori, and Bayesian methods; 20.2.2 introduces the naive Bayes classifier.
• HTF: 7.9 (learning theory); 6.6.3 (naive Bayes).
7.9 covers Vapnik-Chervonenkis bounds (which can be applied to continuous hypothesis spaces), but does not provide much explanation or insight. Being a frequentist book, HTF also has limited coverage of probabilistic/Bayesian ideas.
• M: 10.3.9 (learning theory); 3.1-3, 6.1-3 (probabilistic methods, naive Bayes).
10.3.9 provides only a brief mention and proof of a PAC bound. 3.1-2 gives an interesting viewpoint on probabilistic methods, emphasizing Bayesian ideas and human learning; 3.3 describes a nice example of Bayesian learning in a discrete hypothesis space, like the one in Russell and Norvig except that learning is done with positive examples only.
• B: 7.1.5 (learning theory); 1.2.3, 4.2.1-3 (probabilistic methods).
Learning theory coverage is minimal. Bayesian analysis (much of it quite heavy) pervades the book, so it is hard to pick out specific sections. 4.2 includes Gaussian discriminants.
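The naive Bayes classifier introduced in RN 20.2.2 assumes features are conditionally independent given the class, so the posterior is proportional to the prior times a product of per-feature likelihoods. A minimal sketch in Python; the spam/ham parameters are invented purely for illustration.

```python
# Naive Bayes: P(c | x) is proportional to P(c) * prod_j P(x_j | c).
# All numbers below are made up.

priors = {'spam': 0.4, 'ham': 0.6}
# P(word present | class) for two binary features.
likelihood = {
    'spam': {'offer': 0.8, 'meeting': 0.1},
    'ham':  {'offer': 0.1, 'meeting': 0.7},
}

def posterior(features):
    """features maps feature name -> 1 (present) or 0 (absent)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for f, present in features.items():
            p = likelihood[c][f]
            score *= p if present else (1 - p)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}   # normalize

post = posterior({'offer': 1, 'meeting': 0})
```

In practice one works in log space to avoid underflow, and estimates the likelihood tables from counts (with smoothing); the readings cover both points.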

### Week 9 (10/18, 10/20): Gaussian discriminants, logistic regression, Bayesian parameter learning

• M: 6.4 (Gaussian discriminants), 7.4 (logistic regression); 3.5-6, 7.3, 7.5 (Bayesian parameter learning and regression).
Provides thorough coverage of all the required topics for this week.
• RN: 18.6.4 (logistic regression); 20.2.4 (Bayesian parameter learning).
18.6.4 is a simple maximum-likelihood treatment of logistic regression (mainly review). 20.2.4 covers the beta/Bernoulli model for Bayesian parameter learning. Does not cover Gaussian discriminants or Bayesian regression.
• HTF: 8.1-8.4 (Bayesian regression).
Describes a form of Bayesian regression on a spline basis; its main interest is the attempt to relate Bayesian and bootstrap methods.
• B: 4.2-4.3.3 (Gaussian discriminants, logistic regression); 2.1, 3.3, 4.5 (Bayesian parameter learning and regression).
Thorough coverage, a good supplement/alternative to Murphy. 4.3.3 (IRLS algorithm) is optional.
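The beta/Bernoulli model covered in RN 20.2.4 has a particularly clean conjugate update, easy to verify in Python (the coin-flip data below are hypothetical).

```python
# Beta/Bernoulli Bayesian parameter learning: with a Beta(a, b) prior on the
# Bernoulli parameter theta, the posterior after observing h heads and t tails
# is Beta(a + h, b + t), with posterior mean (a + h) / (a + b + h + t).

def beta_update(a, b, heads, tails):
    return a + heads, b + tails

def beta_mean(a, b):
    return a / (a + b)

# Uniform Beta(1, 1) prior; made-up data: 7 heads, 3 tails.
a, b = beta_update(1, 1, 7, 3)
theta_hat = beta_mean(a, b)   # posterior mean, between the MLE 0.7 and the prior mean 0.5
```

Note how the prior acts like pseudo-counts: the posterior mean 8/12 is pulled slightly from the maximum-likelihood estimate 7/10 toward the prior mean 1/2.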

### Week 10 (10/25, 10/27): Mixture models, EM, density estimation

• M: 9.1-2, 9.5.1-3 (mixtures, EM, and kernel density estimation).
Introduces mixtures etc. in the context of latent variable models. 9.4 and 9.5.6 provide additional theoretical insight.
• RN: 20.2.6 (density estimation), 20.3 (EM, subsections 1, 2, 4).
Covers the basic ideas for kernel and k-NN density estimation and introduces EM through a series of examples.
• HTF: 6.6-6.8 (kernel density estimation), 8.5, 13.2, 14.3.6-7 (k-means, Gaussian mixtures, and EM).
The material is good but a bit scattered and repetitive across the different sections.
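k-means shares the two-step alternating structure of EM: assign points to the nearest center, then recompute each center as the mean of its assigned points. A 1-D sketch in Python on made-up data.

```python
# k-means in 1-D: alternate (1) assignment to the nearest center and
# (2) recomputing each center as the mean of its cluster. EM for Gaussian
# mixtures replaces the hard assignment with soft responsibilities.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: index of the nearest center for each point.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (kept in place if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two well-separated made-up clusters around 1.0 and 9.0.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```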

### Week 11 (11/1, 11/3): Bayesian networks

• RN: 14.1-3 (representation using Bayes nets).
Defines syntax, semantics, and discusses representational issues for conditional distributions.
• M: 4.1-2 (Bayes nets).
A short introduction including the genetic linkage example, phylogenetic trees, and HMMs.
• HTF: 17.1-2 (Markov nets).
Does not cover Bayes nets (directed graphical models); but this material is a useful adjunct.
• B: 8.1-2 (representation using Bayes nets).
Mathematically solid, but mostly "anonymous" networks with no knowledge-based structure. 8.3 covers undirected networks.
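The core semantics from RN 14.2 (the joint is the product of each variable's conditional given its parents) can be checked directly in Python. The tiny rain/sprinkler/wet-grass network below is made up for illustration, with invented CPT numbers.

```python
# A three-node Bayes net: Rain -> WetGrass <- Sprinkler.
# Joint semantics: P(r, s, w) = P(r) * P(s) * P(w | r, s).
# All probabilities below are invented for the example.

p_rain = 0.2
p_sprinkler = 0.4
# P(WetGrass = true | Rain, Sprinkler)
p_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    pw = p_wet[(rain, sprinkler)]
    return pr * ps * (pw if wet else 1 - pw)

# Inference by enumeration: P(WetGrass = true) = sum over the parents.
p_wet_total = sum(joint(r, s, True)
                  for r in (True, False) for s in (True, False))
```

This brute-force enumeration is exponential in the number of variables; the Week 12 readings on variable elimination are about doing the same sums efficiently.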

### Week 12 (11/8, 11/10): Bayes net inference and learning, temporal models

• RN: 14.4-5, 20.2-3 (inference and learning in Bayes nets), 15 (temporal models).
Covers the basics of variable elimination and Monte Carlo inference, EM and Bayesian learning, HMMs, DBNs, Kalman filters.
• M: 4.4, 13.1-4, 14.1-4 (exact and approximate Bayes net inference); 4.5 (Bayesian learning), 2.6, 12.3, 20.1-4 (Markov models, Kalman filters, HMMs).
Extensive material on inference and learning, often intertwined. 2.6 includes nice n-gram examples.
• B: 8.4.1-6 (exact inference), 11.1.1-4, 11.2-3 (Monte Carlo), 13 (temporal models).
Very thorough analytical material; Bayes net parameter learning not covered separately.
• HTF: 17.3-4 (learning in Markov nets).
Covers both parameter and structure learning for Markov nets; not much on inference. No material on temporal models.
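HMM filtering, the centerpiece of the temporal-models reading, is a one-step recursion: predict forward through the transition model, weight by the evidence likelihood, normalize. A Python sketch using the transition and sensor numbers from the umbrella example in RN Ch.15.

```python
# Forward (filtering) step for a two-state HMM: predict, weight by evidence,
# normalize. Transition and sensor probabilities follow the RN Ch.15
# umbrella-world example.

states = ('rainy', 'sunny')
transition = {'rainy': {'rainy': 0.7, 'sunny': 0.3},
              'sunny': {'rainy': 0.3, 'sunny': 0.7}}
# P(umbrella observed | state)
emission = {'rainy': 0.9, 'sunny': 0.2}

def forward_step(belief, saw_umbrella):
    # Prediction: push the current belief through the transition model.
    predicted = {s: sum(belief[p] * transition[p][s] for p in states)
                 for s in states}
    # Update: weight each state by how well it explains the observation.
    weighted = {s: predicted[s] * (emission[s] if saw_umbrella else 1 - emission[s])
                for s in states}
    z = sum(weighted.values())
    return {s: w / z for s, w in weighted.items()}

belief = {'rainy': 0.5, 'sunny': 0.5}   # uniform prior
belief = forward_step(belief, saw_umbrella=True)   # P(rainy) rises to ~0.818
```

Iterating this step over an observation sequence is the forward algorithm; the Kalman filter in the same chapter is the continuous-state analogue, with Gaussians in place of the discrete belief table.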