CS 194-10, Fall 2011: Introduction to Machine Learning
Reading list
This list is still under construction. An empty bullet item indicates more readings to come for that week.
Readings marked in blue are ones you should cover; readings marked in green are alternatives that are often helpful but probably not essential.
Books
 HTF: Trevor Hastie, Rob Tibshirani, and Jerry Friedman, The Elements of Statistical Learning, Second Edition, Springer, 2009. (Full PDF available for download.)
 M: Kevin P. Murphy, Machine Learning: A Probabilistic Perspective. Unpublished. Access information will be provided.
 RN: Stuart Russell and Peter Norvig, Artificial Intelligence: A
Modern Approach, Third Edition, Prentice Hall, 2010.
The machine learning chapters were substantially revised in the third edition; previous editions are not usable for this course.
 B: Christopher Bishop,
Pattern Recognition and Machine Learning, Springer, 2006.
 WFH: Ian Witten, Eibe Frank, and Mark Hall, Data Mining: Practical Machine Learning Tools and Techniques,
Third Edition, Morgan Kaufmann, 2011.
Background reading (review of prerequisite material):
The following material is what you "should" already know from Math 53 and 54, CS 70, and CS 188.
Just in case some of it has rusted away, the TAs will cover the essentials in the Week 1 discussion section (8/24).
Assignment 0 will also help to scrape off the rust, as well as introducing some basic Python tools for the course.
 RN: 4.1–4.2 (local search and optimization; 4.1.4 is not needed); 7.4 (propositional logic).
 M: Probability: 2.3.1–3 (discrete); 2.4.1–4 (continuous); 2.5.1–2, 2.5.4–5 (means, variances, etc.); 2.6.1–6 (joint); 2.7.1–2 (covariance); 2.10.3–4 (change of variables).
Basic math: 31.4.1–2 (derivatives); 31.5.1–2 (convexity); 31.8.1 (Lagrange multipliers).
Linear algebra: 32.1–32.4.4; 32.5.3–4; 32.6.1–3.
Includes a lot of detail; useful as a reference as well as refresher.
 B: 1.2 and Appendix B (probability); Appendix C (linear algebra); Appendix E (Lagrange multipliers).
A reasonable alternative to the material in Murphy. We won't need all of the distributions in App.B.
 RN: 13.2–13.5 (probability); A.2–A.3 (matrices, probability).
Chapter 13 is quite leisurely, but provides intuition; A.2 and A.3 cover less than you need. Murphy's supplementary chapters are more thorough.
Week 1 (8/25 only): Introduction to machine learning
 M: 1.1–1.2.4
After a very brief overview, introduces decision trees and k-nearest-neighbor.
 RN: 18.1–18.2
A broader introduction to learning from the point of view of its role in intelligent systems.
 HTF: 1, 2.1–2.3.
A brief introduction to simple forms of attribute-based statistical learning. 2.3 introduces linear regression and k-nearest-neighbor methods, but in a bit more detail than we need for this week!
 B: 1.1–1.3
A solid if somewhat technical introduction to statistical learning. 1.2 provides some useful background on probability.
 WFH: Ch.1, Ch.2.
Nontechnical introduction stressing applications. Sec. 1.2 provides helpful examples; Ch.2 covers basic issues with handling input data.
Week 2 (8/30, 9/1): Linear regression, least squares, methodology
 M: 1.3
Covers the basics quite quickly, including the maximum-likelihood interpretation for least squares; also introduces robust regression, which uses a non-Gaussian error model to avoid giving too much weight to outliers. Later chapters (3.7, 5.5, 13, 15) go into much more detail; revised versions are expected from the author by 8/31.
 RN: 18.4 (general concepts for supervised learning), 18.6.1–2 (linear regression), 20.2.3 (MLE interpretation)
18.4 introduces several general topics: errors, loss functions, overfitting, cross-validation, and regularization; 18.6 applies them to regression.
 HTF: 3 (skip 3.2.2, 3.3, 3.4.6)
A statistically oriented discussion reflecting 50 years of experience with linear regression. Several excellent figures. The material on principal-components regression and partial least squares is interesting but also skippable for now; we will come back to these ideas later in the course.
 B: 1.2.5–6 (MLE, MAP, Bayesian regression), 1.5.5 (loss functions for regression), 3.1, 3.3, 3.5.
The material in Ch.1 is brief and conceptual, while Ch.3 is a nice exposition with just enough technical detail.
 WFH: 5.1–5.7.
A fairly thorough and practical treatment of measurement and methodology.
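The least-squares fit that runs through all of this week's readings has a closed-form solution via the normal equations. The sketch below (function and variable names are my own, not from any of the texts) shows that solution in a few lines of NumPy:

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations (X'X)w = X'y for the weight vector w
    that minimizes the squared error ||Xw - y||^2."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Recover y = 1 + 2x from noiseless data.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # bias column plus the feature
y = 1.0 + 2.0 * x
w = fit_least_squares(X, y)                # w is approximately [1, 2]
```

In practice `np.linalg.lstsq` is preferred for numerical stability, but the normal-equations form matches the derivation in the readings most directly.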
Week 3 (9/6, 9/8): Linear classifiers, logistic regression
For the first lecture (machine learning methodology), the material is covered in the readings for Week 2. For the second lecture (linear classifiers):
 HTF: 4.1, 4.3 (except 4.3.3), 4.4, 4.5.1.
Pretty thorough coverage, with many connections to other machine learning topics.
 RN: 18.6.3–4 (linear classifiers, logistic regression).
Straightforward treatments of perceptron learning and of logistic regression via MLE; does not cover linear discriminant analysis.
 M: 1.2.9 (logistic regression), 1.2.13 (applications), in the new version of Chapter 1; 6.1, 6.2, 6.4 (linear discriminant analysis); 7.1, 7.4 (logistic regression); 11.5.1–3 (online algorithms, including the perceptron algorithm).
Ch.1 provides a brief introduction for logistic regression only; full treatments come later in the book. Access to later chapters will probably be by hardcopy only (TBD).
 B: 4.1–4.3.4.
Thorough, somewhat mathematically oriented discussion of linear discriminants, logistic regression, and linear separators (perceptrons).
 WFH: 4.6 (pages 125–129).
Very quick summary of linear classifiers and logistic regression.
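The perceptron learning rule covered in these readings is mistake-driven: whenever a training point is misclassified, add (or subtract) it to the weight vector. A minimal sketch (toy data and names my own) of the update w ← w + yᵢxᵢ:

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Classic perceptron: on each mistake, update w += y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w = w + yi * xi
    return w

# Linearly separable toy data; the bias is folded in as a constant feature.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.5, 2.0],
              [1.0, -1.0, -1.5],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
preds = np.sign(X @ w)   # matches y once the data are separated
```

Since the data are linearly separable, the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates.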
Week 4 (9/13, 9/15): Max-margin learning, SVMs, kernels
 HTF: 4.5.2 (introduces the max-margin separator); 12.1–12.3.2 (SVMs and kernels).
You can skim the remainder of 12.3 for further algorithmic details and mathematical analysis.
 RN: 18.9.
Gives the basic ideas without too much math.
 M: 18.
Includes 18.3 (relevance vector machines), which is worth looking at (see also Hastie's discussion).
 B: 6.1–6.2 (kernels), 7.1 (SVMs).
Begins with a discussion of kernels and kernel construction before introducing SVMs.
 WFH: 6.4 (pages 223–232).
A practical, non-mathematical guide to SVMs.
Week 5 (9/20, 9/22): Decision trees, ensemble learning
 HTF: 9.2 (decision trees; skim 9.1 too); 8.7 (bagging); 10.1–10.9 (boosting).
Simple and clear discussion of the CART (classification and regression tree) methodology. Unfortunately the material on bagging and boosting is scattered and bound up with other technical matters. 10.6 (robust loss functions) and 10.7 (general properties of good machine learning algorithms) are useful, but touch on more general issues than just trees and boosting.
 RN: 18.3 (decision trees), 18.10 (bagging and boosting).
The material on decision trees focuses more on the entropybased heuristic for growing trees. Figure 18.33 is helpful in explaining boosting (even though I say so myself).
 M: 19.5 (boosting), 19.7.1 (bagging, random forests).
Although decision trees are the first classification method mentioned in Ch.1, there is as yet not much coverage except for a Bayesian analysis
later on (which we will probably not get to). Future versions will include classification and regression trees.
 B: 14.4 (decision trees), 14.2–14.3 (bagging, boosting).
A concise summary of the material in Hastie.
 WFH: 6.1 (decision trees), 6.6 (model trees), 8.1–2, 8.4 (ensemble learning, bagging, boosting).
Solid, detailed description of decision tree learning and ensemble learning. Includes extensive description of "model trees", which are decision trees with linear separators defining the splits at each node.
Week 6 (9/27, 9/29): Instance-based learning, neural networks
 HTF: 13.3–5 (nearest-neighbor methods); 11 (neural networks).
Good coverage at about the right level. Material on nearest neighbors includes details on methods that adapt the distance metric.
 RN: 18.8 (instance-based/nonparametric methods), 18.7 (neural networks).
18.8 includes locally weighted regression and some explanations of efficient methods for finding nearest neighbors.
 M: 1.2.4–8 (nearest neighbors, very briefly), 19.6 (neural networks).
Ch.1 mentions nearest neighbors only very briefly; 19.6 provides more substantial coverage of neural networks.
 B: 2.5.2 (nearest neighbors), 5.1–5.3 (neural networks).
kNN material is very brief, buried in a density estimation section. Neural network material emphasizes optimization methods; 5.4 looks at the second derivatives in depth.
 WFH: 4.7, 6.5 (nearest neighbors), 6.4 pp. 232–243 (neural networks).
Includes a useful description of methods for finding nearest neighbors as well as adapting the distance metric. The neural network material is gentle compared to Bishop.
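The core k-nearest-neighbor prediction rule in these readings takes only a few lines. A minimal sketch (brute-force distances on toy data; names my own, and real implementations use the tree-based search structures the readings describe):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Return the majority label among the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]              # majority vote

# Two well-separated toy clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([4.8, 5.0]))   # lands in cluster 1
```

The brute-force scan is O(n) per query; the k-d tree and ball tree methods discussed in RN 18.8 and WFH exist precisely to avoid it.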
Week 7 (10/4, 10/6): Instance-based learning contd. (10/4 only)
Week 8 (10/11, 10/13): Learning theory, probabilistic methods
 RN: 18.5 and 18.10.1 (learning theory); 20.1–20.2.2 (probabilistic methods).
18.5 gives a very simple proof of PAC learning for discrete hypothesis spaces, while 18.10.1 describes an online learning algorithm that makes a bounded number of mistakes. Chapter 20 takes a view of learning as probabilistic inference, including maximum likelihood, maximum a posteriori, and Bayesian methods; 20.2.2 introduces the naive Bayes classifier.
 HTF: 7.9 (learning theory); 6.6.3 (naive Bayes).
7.9 covers Vapnik-Chervonenkis bounds (which can be applied to continuous hypothesis spaces), but does not provide much explanation or insight. Being a frequentist book, HTF also has limited coverage of probabilistic/Bayesian ideas.
 M: 10.3.9 (learning theory); 3.1–3, 6.1–3 (probabilistic methods, naive Bayes).
10.3.9 provides only a brief mention and proof of a PAC bound. 3.1–2 gives an interesting viewpoint on probabilistic methods, emphasizing Bayesian ideas and human learning; 3.3 describes a nice example of Bayesian learning in a discrete hypothesis space, like the one in Russell and Norvig except that learning is done with positive examples only.
 B: 7.1.5 (learning theory); 1.2.3, 4.2.1–3 (probabilistic methods).
Learning theory coverage is minimal. Bayesian analysis (much of it quite heavy) pervades the book, so it is hard to pick out specific sections. 4.2 includes Gaussian discriminants.
Week 9 (10/18, 10/20): Gaussian discriminants, Bayesian parameter learning and regression

 M: 6.4 (Gaussian discriminants), 7.4 (logistic regression); 3.5–6, 7.3, 7.5 (Bayesian parameter learning and regression).
Provides thorough coverage of all the required topics for this week.
 RN: 18.6.4 (logistic regression); 20.2.4 (Bayesian parameter learning).
18.6.4 is a simple maximumlikelihood treatment of logistic regression (mainly review). 20.2.4 covers the beta/Bernoulli model for Bayesian parameter learning. Does not cover Gaussian discriminants or Bayesian regression.
 HTF: 8.1–8.4 (Bayesian regression).
Describes a form of Bayesian regression on a spline basis; its main interest is the attempt to relate Bayesian and bootstrap methods.
 B: 4.2–4.3.3 (Gaussian discriminants, logistic regression); 2.1, 3.3, 4.5 (Bayesian parameter learning and regression).
Thorough coverage, a good supplement/alternative to Murphy. 4.3.3 (IRLS algorithm) is optional.
Week 10 (10/25, 10/27): Mixture models, EM, density estimation
 M: 9.1–2, 9.5.1–3 (mixtures, EM, and kernel density estimation).
Introduces mixtures etc. in the context of latent variable models. 9.4 and 9.5.6 provide additional theoretical insight.
 RN: 20.2.6 (density estimation), 20.3 (EM, subsections 1, 2, 4).
Covers the basic ideas for kernel and kNN density estimation and introduces EM through a series of examples.
 HTF: 6.6–6.8 (kernel density estimation), 8.5, 13.2, 14.3.6–7 (k-means, Gaussian mixtures, and EM).
The material is good but a bit scattered and repetitive across the different sections.
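The k-means algorithm covered this week alternates two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. A minimal sketch of this loop (Lloyd's algorithm) on toy data; names, data, and the empty-cluster guard are my own:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Alternate assignment and mean-update steps (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Update step: each center moves to the mean of its points
        # (left in place if it currently owns no points).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# Two well-separated toy clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [9.0, 9.0], [9.2, 8.9], [8.9, 9.1]])
centers, labels = kmeans(X, k=2)
```

Each iteration is exactly the hard-assignment limit of the EM updates for a Gaussian mixture, which is how the HTF and Murphy readings connect the two algorithms.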
Week 11 (11/1, 11/3): Graphical models: representation
 RN: 14.1–3 (representation using Bayes nets).
Defines syntax and semantics, and discusses representational issues for conditional distributions.
 M: 4.1–2 (Bayes nets).
A short introduction including the genetic linkage example, phylogenetic trees, and HMMs.
 HTF: 17.1–2 (Markov nets).
Does not cover Bayes nets (directed graphical models), but this material is a useful adjunct.
 B: 8.1–2 (representation using Bayes nets).
Mathematically solid, but mostly "anonymous" networks with no knowledge-based structure. 8.3 covers undirected networks.
Week 12 (11/8, 11/10): Graphical models: inference and learning; temporal models
 RN: 14.4–5, 20.2–3 (inference and learning in Bayes nets), 15 (temporal models).
Covers the basics of variable elimination and Monte Carlo inference, EM and Bayesian learning, HMMs, DBNs, Kalman filters.
 M: 4.4, 13.1–4, 14.1–4 (exact and approximate Bayes net inference); 4.5 (Bayesian learning); 2.6, 12.3, 20.1–4 (Markov models, Kalman filters, HMMs).
Extensive material on inference and learning, often intertwined. 2.6 includes nice n-gram examples.
 B: 8.4.1–6 (exact inference), 11.1.1–4, 11.2–3 (Monte Carlo), 13 (temporal models).
Very thorough analytical material; Bayes net parameter learning not covered separately.
 HTF: 17.3–4 (learning in Markov nets).
Covers both parameter and structure learning for Markov nets; not much on inference. No material on temporal models.
Week 13 (11/15, 11/17):
Week 14 (11/22 only):
Week 15 (11/29, 12/1):