2. In the recursive construction of decision trees, it sometimes occurs that a mixed set of p positive and n negative examples remains at a leaf node, even after all the attributes have been used.
(a) Show that the solution used by ID3, which picks the majority classification, minimizes the absolute error over the set of examples at the leaf.
(b) Show that returning the class probability p/(p+n) minimizes the sum of squared errors.
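A quick numeric check of both claims (a sketch, not a proof; the counts p = 7, n = 3 are arbitrary illustrative values):

```python
import numpy as np

p, n = 7, 3  # arbitrary counts of positive and negative examples
labels = np.array([1.0] * p + [0.0] * n)

# Evaluate every candidate prediction y_hat on a fine grid over [0, 1].
candidates = np.linspace(0.0, 1.0, 1001)
abs_err = np.array([np.sum(np.abs(labels - y)) for y in candidates])
sq_err = np.array([np.sum((labels - y) ** 2) for y in candidates])

best_abs = candidates[np.argmin(abs_err)]  # minimizer of absolute error
best_sq = candidates[np.argmin(sq_err)]    # minimizer of squared error

print(best_abs)  # 1.0 -> the majority class, since p > n
print(best_sq)   # 0.7 -> p / (p + n)
```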
3. What capability is missing from Sammut's system for analyzing traces of flight behaviour, and must instead be supplied by a human? How might this shortcoming be addressed?
4. In what cases will subset selection fail to improve the learning rate of a decision tree algorithm?
2. Explain exactly how ADtrees would be used as part of a subset selection process for decision tree learning.
Mitchell Ex. 7.7(a),(b)
Bishop Ex. 2.3, 2.5; Mitchell Ex. 6.3
2. When learning a standard, fully observable Bayesian network with known structure, what difficulties arise when a node has many parents? How might these be overcome?
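One concrete way to see the scale of the difficulty (a sketch; binary nodes assumed): a full conditional probability table for a node with k binary parents needs 2^k entries, while a parametric model such as noisy-OR needs only k.

```python
# Parameters needed to specify P(child | parents) for a binary child
# with k binary parents, under two representations.
def full_cpt_size(k: int) -> int:
    return 2 ** k  # one probability per joint assignment of the parents

def noisy_or_size(k: int) -> int:
    return k       # one inhibition probability per parent

for k in (2, 5, 10, 20):
    print(k, full_cpt_size(k), noisy_or_size(k))
```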
3. Binder et al. derive the gradient for noisy-OR networks. Derive a similar expression for the gradient of a sigmoid network of binary nodes. In such networks, the conditional distribution of the child is a sigmoid function (see Bishop, p. 82) of a weighted sum of the parent values; the weights are the parameters to be tuned.
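For concreteness, the conditional model in question is P(X = 1 | parents u, weights w) = sigma(sum_i w_i u_i). A minimal sketch of evaluating it (the weights and parent values are made up for illustration; the gradient derivation itself is the exercise):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def child_prob(weights, parent_values) -> float:
    """P(child = 1 | parents) for a sigmoid CPD: sigma of a weighted sum."""
    z = sum(w * u for w, u in zip(weights, parent_values))
    return sigmoid(z)

# Hypothetical weights and binary parent values, for illustration only.
print(child_prob([2.0, -1.0, 0.5], [1, 1, 0]))  # sigma(1.0) ~ 0.731
```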
2. Compare the EM update rule for Bayesian network learning to the gradient method.
3. Suppose SEM is given an initial network with several observable nodes and one hidden node that is not connected. Can it recover the true structure? Are there other initial networks with the same problem?
2. In a sparse HMM, each state has non-zero transition probability to only a constant number of other states. In a sparse DBN, each state variable has only a constant number of parents in the preceding time step. Are these notions of sparseness the same? How many parameters does each type of sparse model have? How many parameters are needed in an HMM that is equivalent to a sparse DBN that has NO links at all?
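To get a feel for the scales involved, here is a sketch of the counting under illustrative assumptions (binary state variables; c and k are the sparseness constants from the question). This illustrates the bookkeeping only, not the answer:

```python
# Illustrative parameter counts for sparse sequence models (binary variables).
def sparse_hmm_params(num_states: int, c: int) -> int:
    # Each state has non-zero transition probability to at most c others.
    return num_states * c

def sparse_dbn_params(n_vars: int, k: int) -> int:
    # Each binary variable has a CPT over its k binary parents.
    return n_vars * 2 ** k

def equivalent_hmm_states(n_vars: int) -> int:
    # An HMM equivalent to a DBN over n binary variables needs one state
    # per joint assignment of the variables.
    return 2 ** n_vars

for n in (5, 10, 20):
    print(n, sparse_dbn_params(n, 3), equivalent_hmm_states(n))
```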
3. In the model of Zweig and Russell, why would it help to have measurements of the actual articulator state? How would they be used? What if the measurements were noisy?
2. What is the boundary shape for 2-nearest-neighbour classification with two classes in two dimensions?
3. In inverse-square-distance-weighted K-nearest-neighbour learning, what happens if we simply let K = N? Is Mitchell correct in saying that the distant points have little or no effect?
4. Discuss the relationship between locally weighted regression and distance-weighted K-nearest-neighbour learning.
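A minimal sketch of inverse-square-distance-weighted K-nearest-neighbour prediction, to make the setting of questions 3 and 4 concrete (toy one-dimensional data; tie-breaking among equidistant points is glossed over):

```python
def knn_predict(train, query, k):
    """Inverse-square-distance-weighted K-NN regression in one dimension.

    train: list of (x, y) pairs; query: an x value; returns the weighted
    mean of the k nearest neighbours' y values.
    """
    # Sort by distance to the query and keep the k nearest points.
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    weighted = []
    for x, y in nearest:
        d = abs(x - query)
        if d == 0:
            return y  # an exact match dominates (infinite weight)
        weighted.append((1.0 / d ** 2, y))
    total = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
print(knn_predict(data, 1.5, 2))  # x=1 and x=2 get equal weight -> 2.5
```

Setting k = len(data) gives exactly the K = N case asked about in question 3.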
2. Explain why it might be interesting to consider extracting a large number of computed features from the raw input data, and using the computed features for training a one-layer network instead of the raw data.
Bishop Ex.3.3
2. Suppose you had a neural network with linear activation functions. That is, for each unit the output is some constant c times the weighted sum of the inputs. First, assume that the network has one hidden layer. For a given assignment to the weights w, write down equations for the values of the units in the output layer as a function of w and the input values x, without any explicit mention of the outputs of the hidden layer. Show that there is a network with no hidden units that computes the same function. Show whether this result extends to networks with any number of hidden layers. What can you conclude about linear activation functions?
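A numeric illustration of the collapse being asked about (a sketch with random weights and hypothetical layer sizes; no substitute for the algebra): composing linear layers is itself linear, so a single weight matrix reproduces the two-layer network's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 2.0                             # the constant scaling each unit's output
W1 = rng.standard_normal((4, 3))    # hidden-layer weights (hypothetical sizes)
W2 = rng.standard_normal((2, 4))    # output-layer weights
x = rng.standard_normal(3)

hidden = c * (W1 @ x)               # linear "activation": c times weighted sum
two_layer = c * (W2 @ hidden)

# The same function, computed by a network with no hidden units.
W_single = (c * c) * (W2 @ W1)
one_layer = W_single @ x

print(np.allclose(two_layer, one_layer))  # True
```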
Bishop Ex. 4.3
2. Based on the analysis of Bishop, pp. 180-181, is there an obvious translation from an RBF network to a Bayesian network with hidden variables? If not, explain where it goes wrong.
3. Explain the motivation for choosing the maximal margin separating hyperplane in SVMs. Does this have anything to do with classification probabilities?
2. Suppose we train k classifiers, each from equal-size sets of examples drawn independently from the same underlying distribution. Will this ensure that the errors made by each are uncorrelated? Provide a proof or counterexample.
3. Why can boosting on H give a predictor that is better than any single hypothesis in H? Why does it seem to do better than bagging?
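Relevant to both questions above: if the individual errors really were independent, a majority vote would beat any single classifier. A quick binomial calculation (a sketch; independence is exactly the assumption question 2 asks you to examine):

```python
from math import comb

def majority_vote_error(k: int, eps: float) -> float:
    """P(majority of k classifiers is wrong), assuming each errs
    independently with probability eps (k odd)."""
    return sum(comb(k, m) * eps ** m * (1 - eps) ** (k - m)
               for m in range(k // 2 + 1, k + 1))

print(majority_vote_error(5, 0.3))  # ~ 0.163, well below the individual 0.3
```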
Mitchell Ex. 10.5, 10.6, 10.7