CS 281, Spring 1998, Machine Learning
Questions on readings




Please answer these questions concisely and precisely. Try to keep your answers to at most one page per week; answers are to be turned in on the Monday of the corresponding week.

Week 2 (1/26): Models of learning; function learning; version spaces

Mitchell Ex. 1.2, Bishop Ex. 1.5

Week 3 (2/2): Function learning (decision trees)

1. Suppose that a learning algorithm is trying to find a consistent hypothesis when the classifications of examples are actually being generated randomly with equal probability for positive and negative examples. There are n Boolean attributes, and examples are drawn uniformly from the set of all possible examples. Calculate the number of examples required before the probability of finding a contradiction in the data reaches 0.5.
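
For question 1, a small Monte Carlo check can be run alongside the analytic calculation. This is only a sketch: the function name and the trial count are illustrative choices, not from the readings, and "contradiction" here means the same attribute vector appearing with both labels.

    import random

    def contradiction_probability(n_attrs, n_examples, trials=10000):
        """Estimate the probability that n_examples randomly labelled
        examples over n_attrs Boolean attributes contain a contradiction,
        i.e. the same attribute vector appearing with both labels."""
        hits = 0
        for _ in range(trials):
            seen = {}
            for _ in range(n_examples):
                x = random.getrandbits(n_attrs)    # uniform attribute vector
                y = random.random() < 0.5          # fair-coin label
                if x in seen and seen[x] != y:
                    hits += 1
                    break
                seen[x] = y
        return hits / trials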

2. In the recursive construction of decision trees, it sometimes occurs that a mixed set of p positive and n negative examples remains at a leaf node, even after all the attributes have been used.
(a) Show that the solution used by ID3, which picks the majority classification, minimizes the absolute error over the set of examples at the leaf.
(b) Show that returning the class probability p/(p+n) minimizes the sum of squared errors.
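
For question 2, one way to set up the two error measures (assuming, as a convention not stated above, that positive examples are coded as 1, negative as 0, and the leaf returns a single value \hat{y}):

    E_{\mathrm{abs}}(\hat{y}) \;=\; p\,|1 - \hat{y}| + n\,|\hat{y}|,
    \qquad
    E_{\mathrm{sq}}(\hat{y}) \;=\; p\,(1 - \hat{y})^2 + n\,\hat{y}^2 .

Part (a) then amounts to showing that the first expression is minimized by the majority class, and part (b) that the second is minimized at \hat{y} = p/(p+n).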

3. What capability is missing from Sammut's system for analyzing traces of flight behaviour, so that it must instead be provided by a human? How might this failing be addressed?

4. In what cases will subset selection fail to improve the learning rate of a decision tree algorithm?

Week 4 (2/9): Function learning, theoretical analysis

1. Compare multi-interval discretization with simple binary splits.

2. Explain exactly how ADtrees would be used as part of a subset selection process for decision tree learning.

Mitchell Ex. 7.7(a), (b)

Week 6 (2/23): Bayesian learning, Naive Bayes classifier

1. Suppose there are five "species" of bags of marbles with the following percentages of blue and red marbles in each: 100/0, 75/25, 50/50, 25/75, and 0/100. Suppose these species occur with relative frequency 0.1, 0.2, 0.4, 0.2, 0.1 respectively. Given a bag of unknown species, a marble is extracted and found to be red. Compute the posterior probability distribution for the species of the bag, and the probability that the next marble is red. Repeat this for the case where the first four marbles are red.
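
For question 1, the mechanics of the update can be checked with a short script. This is only a sketch of the computation, assuming marbles are drawn i.i.d. given the species; the names are illustrative.

    priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(species)
    p_red  = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(red | species), from the blue/red percentages

    def bayes_update(prior, likelihood):
        """One application of Bayes' rule: posterior proportional to prior * likelihood."""
        unnorm = [p * l for p, l in zip(prior, likelihood)]
        z = sum(unnorm)
        return [u / z for u in unnorm]

    posterior = bayes_update(priors, p_red)                    # after one red marble
    p_next_red = sum(p * r for p, r in zip(posterior, p_red))  # predictive prob. of red
    # For the four-red-marble case, apply bayes_update with likelihoods p_red four times.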

Bishop Ex. 2.3, 2.5, Mitchell Ex. 6.3

Week 7 (3/2): Mixture models, Bayesian networks

1. When fitting mixture models, what happens to the likelihood when one of the Gaussians is centered on a single data point and its variance goes to zero? What happens when two Gaussians are initialized with identical means and variances? How can these problems be avoided?
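
For the first part of question 1, a tiny experiment illustrates the effect being asked about (a sketch only; the data point at 0.0 and the sequence of variances are arbitrary choices):

    import math

    def gauss_pdf(x, mu, sigma):
        """Univariate Gaussian density."""
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # Density contribution at the collapsed component's own data point as sigma shrinks:
    for sigma in [1.0, 0.1, 0.01, 0.001]:
        print(sigma, gauss_pdf(0.0, 0.0, sigma))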

2. When learning a standard, fully observable Bayesian network with known structure, what difficulties arise when a node has many parents? How might these be overcome?

3. Binder et al. derive the gradient for noisy-OR networks. Derive a similar expression for the gradient of a sigmoid network of binary nodes. In such networks, the conditional distribution of the child is a sigmoid function (see Bishop, p. 82) of a weighted sum of the parent values; the weights are the parameters to be tuned.
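
For question 3, a concrete form of the conditional distribution just described, assuming binary parent values u_1, ..., u_k in {0, 1} and no separate bias weight (that simplification is not part of the question):

    P(X = 1 \mid u_1, \dots, u_k) \;=\; \sigma\!\Big(\sum_{j=1}^{k} w_j u_j\Big),
    \qquad
    \sigma(a) \;=\; \frac{1}{1 + e^{-a}} .

The w_j are the parameters with respect to which the log-likelihood gradient is to be derived.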

Week 8 (3/9): Learning Bayesian networks contd.

1. There are many approximate inference algorithms for Bayesian networks. Discuss the issues that arise with respect to their use in learning.

2. Compare the EM update rule for Bayesian network learning to the gradient method.

3. Suppose SEM is given an initial network with several observable nodes and one hidden node that is not connected. Can it recover the true structure? Are there other initial networks with the same problem?

Week 9 (3/16): Probabilistic temporal models, speech

1. What are the basic computational tasks that must be solved in order to train an HMM?
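
As a reference point for question 1, one of the relevant computations is the forward pass. A minimal sketch in plain Python (no scaling to guard against underflow; names are illustrative):

    def forward(pi, A, B, obs):
        """Forward pass of an HMM: alpha[t][i] = P(o_1..o_t, state_t = i).
        pi: initial state distribution, A[i][j]: transition probabilities,
        B[i][o]: emission probabilities, obs: list of observation indices."""
        n = len(pi)
        alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
        for t in range(1, len(obs)):
            alpha.append([
                B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(n))
                for j in range(n)
            ])
        return alpha    # summing the last row gives the likelihood of the observations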

2. In a sparse HMM, each state has non-zero transition probability to only a constant number of other states. In a sparse DBN, each state variable has only a constant number of parents in the preceding time step. Are these notions of sparseness the same? How many parameters does each type of sparse model have? How many parameters are needed in an HMM that is equivalent to a sparse DBN that has NO links at all?

3. In the model of Zweig and Russell, why would it help to have measurements of the actual articulator state? How would they be used? What if the measurements were noisy?

Week 10 (3/23): Spring break

How would you persuade the UC President to extend Spring Break to two weeks?

Week 11 (3/30): Instance-based methods

1. Bishop states that the probability density function resulting from K-nearest-neighbour estimation is not a proper density because its integral diverges. Is this true for all K and d? Explain.
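
For reference, the estimator in question has the form below, where V_K(x) is the volume of the smallest sphere centred on x that contains K of the N data points:

    \hat{p}(x) \;=\; \frac{K}{N\,V_K(x)} .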

2. What is the boundary shape for 2-nearest-neighbour classification with two classes in two dimensions?

3. In inverse-square-distance-weighted K-nearest-neighbour learning, what happens if we simply let K = N? Is Mitchell correct in saying that the distant points have little or no effect?
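
For question 3, a sketch of the prediction rule with K = N (regression form; function and variable names are illustrative, and a query that coincides with a training point simply returns that point's value, the limit of an infinite weight):

    def weighted_knn_predict(query, points, values):
        """Inverse-square-distance-weighted prediction using ALL training
        points, i.e. K = N."""
        num, den = 0.0, 0.0
        for x, v in zip(points, values):
            d2 = sum((a - b) ** 2 for a, b in zip(query, x))
            if d2 == 0.0:
                return v                 # query coincides with a training point
            w = 1.0 / d2                 # inverse-square-distance weight
            num += w * v
            den += w
        return num / den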

4. Discuss the relationship between locally weighted regression and distance-weighted K-nearest-neighbour learning.

Week 12 (4/6): Linear models

1. Explain exactly the sense in which a logistic activation function allows the outputs of a one-layer network to be interpreted as probabilities.

2. Explain why it might be interesting to consider extracting a large number of computed features from the raw input data, and using the computed features for training a one-layer network instead of the raw data.

Bishop Ex. 3.3

Week 13 (4/13): Neural networks

1. What advantages do neural nets have over networks of Boolean gates?

2. Suppose you had a neural network with linear activation functions. That is, for each unit the output is some constant c times the weighted sum of the inputs. First, assume that the network has one hidden layer. For a given assignment to the weights w, write down equations for the values of the units in the output layer as a function of w and the input values x, without any explicit mention of the outputs of the hidden layer. Show that there is a network with no hidden units that computes the same function. Show whether this result extends to networks with any number of hidden layers. What can you conclude about linear activation functions?
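
A quick numerical check of the collapse argued for in question 2 (a sketch using NumPy; the layer sizes, the constant c, and the random seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))      # input -> hidden weights
    W2 = rng.normal(size=(2, 4))      # hidden -> output weights
    c = 1.5                           # the constant in the linear activation
    x = rng.normal(size=3)            # an input vector

    hidden = c * (W1 @ x)             # hidden layer with linear activation
    out_two_layer = c * (W2 @ hidden)

    W_equiv = c * (W2 @ W1)           # weights for a network with no hidden units
    out_one_layer = c * (W_equiv @ x) # same activation, no hidden layer

    print(np.allclose(out_two_layer, out_one_layer))   # True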

Bishop Ex. 4.3

Week 14 (4/20): RBFs, SVMs

1. How does exact interpolation (Bishop, section 5.1) relate to the direct weighted averaging method (Atkeson et al., Eq.5) when the latter uses an inverse-squared weighting function? Why doesn't the latter method require matrix inversion?
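
To make the contrast in question 1 concrete, here are sketches of the two computations side by side (NumPy; the function names and the generic phi/weight arguments are illustrative, and both are assumed to accept arrays elementwise):

    import numpy as np

    def exact_interpolation_weights(X, t, phi):
        """Exact RBF interpolation (Bishop, section 5.1): solve Phi w = t,
        where Phi[i, j] = phi(||x_i - x_j||)."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        Phi = phi(d)
        return np.linalg.solve(Phi, t)        # requires solving a linear system

    def direct_weighted_average(x, X, t, weight):
        """Direct weighted averaging in the spirit of Atkeson et al., Eq. 5:
        a normalized weighted sum, with no linear system to solve."""
        d = np.linalg.norm(X - x, axis=-1)
        w = weight(d)
        return w @ t / w.sum()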

2. Based on the analysis of Bishop pp.180-1, is there an obvious translation from an RBF network to a Bayesian network with hidden variables? If not, explain where it goes wrong.

3. Explain the motivation for choosing the maximal margin separating hyperplane in SVMs. Does this have anything to do with classification probabilities?

Week 15 (4/27): Ensemble methods

1. Can you construct a precise mapping of Weighted Majority onto full Bayesian learning?
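
For question 1, it may help to have the algorithm itself in front of you. A sketch of the basic (non-randomized) Weighted Majority procedure with 0/1 predictions and demotion factor beta; names are illustrative:

    def weighted_majority(expert_predictions, outcomes, beta=0.5):
        """Run Weighted Majority: expert_predictions[t][i] is expert i's
        0/1 prediction at step t, outcomes[t] is the true 0/1 label."""
        n = len(expert_predictions[0])
        w = [1.0] * n                          # one weight per expert
        mistakes = 0
        for preds, y in zip(expert_predictions, outcomes):
            vote_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
            vote_0 = sum(wi for wi, p in zip(w, preds) if p == 0)
            guess = 1 if vote_1 >= vote_0 else 0
            if guess != y:
                mistakes += 1
            # demote every expert that was wrong on this round
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
        return w, mistakes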

2. Suppose we train k classifiers, each from an equal-sized set of examples drawn independently from the same underlying distribution. Will this ensure that the errors made by the classifiers are uncorrelated? Provide a proof or a counterexample.

3. Why can boosting on H give a predictor that is better than any single hypothesis in H? Why does it seem to do better than bagging?

Week 16 (5/4): Rule learning, inductive logic programming

1. Suppose that FOIL is considering adding a literal to a clause using a binary predicate P, and that previous literals (including the head of the clause) contain five different variables.
(a) How many functionally different literals can be generated? Notice that two literals are functionally identical if they differ only in the names of the new variables that they contain.
(b) Can you find a general formula for the number of different literals with a predicate of arity r when there are n variables previously used?
(c) Why does FOIL not allow literals that contain no previously used variables?
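
For part (a), a brute-force enumeration can be used to check a hand count. This sketch ignores negated literals and assumes only variables (no constants) appear as arguments; literals differing only in the names of their new variables are identified by renaming new variables in order of first appearance.

    from itertools import product

    def count_literals(n_old, arity):
        """Count functionally different literals for one predicate of the
        given arity, when n_old variables are already in use.  Literals
        with no previously used variable are excluded, as in FOIL."""
        old = [f"x{i}" for i in range(n_old)]
        new = [f"_v{i}" for i in range(arity)]   # at most `arity` new variables needed
        seen = set()
        for args in product(old + new, repeat=arity):
            if not any(a in old for a in args):
                continue                          # must reuse an existing variable
            rename, canon = {}, []
            for a in args:
                if a in old:
                    canon.append(a)
                else:
                    rename.setdefault(a, f"new{len(rename)}")
                    canon.append(rename[a])
            seen.add(tuple(canon))
        return len(seen)

    # e.g. count_literals(5, 2) corresponds to part (a); vary the arguments for part (b).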

Mitchell Ex. 10.5, 10.6, 10.7