CS 281, Spring 1998, Machine Learning
Project 1: Inductive Learning, due 3/16




In this assignment, you will write some generic facilities for inductive learning and an induction algorithm. Then you will apply the algorithm to solve a learning problem.

1. Define a data structure for an induction problem. It should have components for the examples, the attributes, and the goal attribute. Attributes and goals should include their ranges. An example is a list of attributes and corresponding values. It is unclassified if it has no goal value.
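One possible shape for the structure in item 1, sketched in Python (the handout does not prescribe a language). The names Attribute, InductionProblem, is_classified, and validate are illustrative assumptions, as is the choice to represent an example as a dict that omits the goal key when it is unclassified.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumption: an example maps attribute names to values; it is
# unclassified when the goal attribute's name is absent.
Example = Dict[str, str]


@dataclass(frozen=True)
class Attribute:
    name: str
    values: Tuple[str, ...]  # the attribute's range


@dataclass
class InductionProblem:
    attributes: List[Attribute]
    goal: Attribute
    examples: List[Example] = field(default_factory=list)

    def is_classified(self, example: Example) -> bool:
        return self.goal.name in example

    def validate(self) -> None:
        """Raise ValueError if any value that is present lies outside its range."""
        for ex in self.examples:
            for attr in [*self.attributes, self.goal]:
                if attr.name in ex and ex[attr.name] not in attr.values:
                    raise ValueError(
                        f"{ex[attr.name]!r} is outside the range of {attr.name}"
                    )


# Hypothetical toy problem, not part of the assignment data.
outlook = Attribute("outlook", ("sunny", "rain"))
windy = Attribute("windy", ("yes", "no"))
play = Attribute("play", ("yes", "no"))
problem = InductionProblem(
    [outlook, windy],
    play,
    [
        {"outlook": "sunny", "windy": "no", "play": "yes"},  # classified
        {"outlook": "rain", "windy": "yes"},                 # unclassified
    ],
)
problem.validate()
```

Storing ranges on the attribute objects keeps them available to later steps, such as random example generation, without re-deriving them from the data.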

2. An induction algorithm takes an induction problem as input and returns a hypothesis, i.e., a function that takes an unclassified example and returns a value for the goal attribute. Write a classify function that takes a set of unclassified examples, a goal, and a hypothesis and returns a set of classified examples.
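A minimal sketch of the classify function from item 2, assuming examples are dicts from attribute names to values and a hypothesis is any callable on such a dict. The hypothesis play_if_calm and the attribute names are hypothetical.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Hypothesis = Callable[[Example], str]


def classify(examples: List[Example], goal: str, hypothesis: Hypothesis) -> List[Example]:
    """Return copies of the examples with the goal value filled in."""
    return [{**ex, goal: hypothesis(ex)} for ex in examples]


def play_if_calm(example: Example) -> str:
    # Hypothetical hand-written hypothesis for illustration only.
    return "yes" if example["windy"] == "no" else "no"


unlabeled = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labeled = classify(unlabeled, "play", play_if_calm)
```

Returning copies leaves the unclassified inputs untouched, so the same set can be classified again under a different hypothesis.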

3. Write a function that generates random, classified examples given a set of attributes, a goal, and a hypothesis. The example distribution should be uniform on the example space.
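The generator in item 3 might look like the sketch below. It assumes attributes are given as a dict from name to a finite range, which is a representation chosen here for brevity. Drawing each attribute independently and uniformly from its range yields a uniform distribution over the full example space (the Cartesian product of the ranges).

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def random_examples(
    attributes: Dict[str, Sequence[str]],
    goal: str,
    hypothesis: Callable[[Dict[str, str]], str],
    n: int,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Draw n examples uniformly and label each one with the hypothesis."""
    if rng is None:
        rng = random.Random()
    examples = []
    for _ in range(n):
        ex = {name: rng.choice(values) for name, values in attributes.items()}
        ex[goal] = hypothesis(ex)  # classify after all attributes are set
        examples.append(ex)
    return examples


# Hypothetical attributes and target function.
attrs = {"outlook": ("sunny", "rain"), "windy": ("yes", "no")}
target = lambda ex: "yes" if ex["windy"] == "no" else "no"
sample = random_examples(attrs, "play", target, 100, random.Random(0))
```

Passing a seeded random.Random makes a run repeatable, which helps when comparing learning curves.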

4. Write a learning curve function that takes an induction algorithm, an induction problem, and an error measure and generates a learning curve (a list of (x, y) pairs giving the number of training examples and the corresponding prediction error). Use any accuracy estimation method from the literature, providing the necessary parameters as inputs to the learning curve function.
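As one illustration of item 4, the sketch below uses repeated random holdout as the accuracy estimation method; any other method from the literature would do equally well. The signature is an assumption: the induction problem is passed as a list of classified dict examples plus the goal name, and majority_learner is only a stand-in baseline so that the block runs on its own.

```python
import random
from collections import Counter


def error_rate(hypothesis, test_set, goal):
    """Fraction of test examples whose predicted goal value is wrong."""
    wrong = 0
    for ex in test_set:
        unclassified = {k: v for k, v in ex.items() if k != goal}
        if hypothesis(unclassified) != ex[goal]:
            wrong += 1
    return wrong / len(test_set)


def majority_learner(train_set, goal):
    """Stand-in algorithm: always predict the most common training label."""
    label = Counter(ex[goal] for ex in train_set).most_common(1)[0][0]
    return lambda ex: label


def learning_curve(algorithm, examples, goal, error_measure, sizes, trials=20, rng=None):
    """Return [(training set size, mean held-out error), ...]."""
    if rng is None:
        rng = random.Random()
    curve = []
    for m in sizes:
        if not 0 < m < len(examples):
            raise ValueError("each size must leave at least one test example")
        total = 0.0
        for _ in range(trials):
            shuffled = examples[:]
            rng.shuffle(shuffled)
            train, test = shuffled[:m], shuffled[m:]
            total += error_measure(algorithm(train, goal), test, goal)
        curve.append((m, total / trials))
    return curve


# Hypothetical toy data: play is "yes" exactly when it is not windy.
data = [{"windy": w, "play": "yes" if w == "no" else "no"} for w in ["no", "yes"] * 10]
curve = learning_curve(majority_learner, data, "play", error_rate,
                       sizes=[2, 5, 10], trials=10, rng=random.Random(0))
```

Here sizes and trials are the estimation parameters exposed to the caller; a cross-validation variant would expose the number of folds instead.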

5. Implement an induction algorithm of your choice (either from those discussed in class or some other algorithm). It should be capable of dealing with the data described in item 8 below.

6. Construct an artificial learning problem with a small number of attributes and a simple target function (e.g., a smallish decision tree). Generate a set of 100 examples and use them to produce a learning curve.

7. Compare the results of accuracy estimation on data sets of various sizes (up to 100) to the "actual" prediction accuracy of a hypothesis, measured, for example, by testing the hypothesis on a separate, very large test set.
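The comparison in item 7 could be set up roughly as follows. Everything here is an assumption made for illustration: the three binary attributes, the small decision-tree target, the one-level "decision stump" learner, and the use of k-fold cross-validation as the estimation method.

```python
import random
from collections import Counter

ATTRS = {"a": (0, 1), "b": (0, 1), "c": (0, 1)}  # hypothetical attributes


def target(ex):
    # A small decision tree: test a, then read off b or c.
    return ex["b"] if ex["a"] == 1 else ex["c"]


def sample(n, rng):
    """Draw n uniformly random examples labeled by the target under key 'y'."""
    out = []
    for _ in range(n):
        ex = {name: rng.choice(vals) for name, vals in ATTRS.items()}
        ex["y"] = target(ex)
        out.append(ex)
    return out


def stump_learner(train):
    """Split on the single attribute that makes the fewest training errors."""
    default = Counter(ex["y"] for ex in train).most_common(1)[0][0]
    best = None
    for name in ATTRS:
        by_value = {}
        for ex in train:
            by_value.setdefault(ex[name], Counter())[ex["y"]] += 1
        table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(1 for ex in train if table[ex[name]] != ex["y"])
        if best is None or errors < best[0]:
            best = (errors, name, table)
    _, name, table = best
    return lambda ex: table.get(ex[name], default)


def accuracy(h, data):
    return sum(h(ex) == ex["y"] for ex in data) / len(data)


def cross_val_accuracy(learner, data, k=5):
    """Mean held-out accuracy over k folds (requires len(data) >= k)."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(accuracy(learner(train), folds[i]))
    return sum(scores) / k


rng = random.Random(0)
big_test = sample(10000, rng)  # stands in for the "very large test set"
for n in (10, 25, 50, 100):
    data = sample(n, rng)
    estimated = cross_val_accuracy(stump_learner, data)
    actual = accuracy(stump_learner(data), big_test)
    print(n, round(estimated, 3), round(actual, 3))
```

Because the target is known, the large test set can be made as big as needed, so the "actual" column has little sampling noise compared with the estimate.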

8. Given the following data set and its description, come up with the best hypothesis you can, using whatever techniques you wish. Notice that the data includes continuous values, so you may want to discretize them, either by hand or using some automated technique. It also contains nominal attributes with many values; you may want to preprocess these values into groups. This is a fairly realistic data set, and your task is similar to those one might face in a real application setting. Your project report should document all the methods you employed and include a learning curve.
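Two simple preprocessing steps of the kind item 8 suggests are sketched below: equal-frequency binning for a continuous attribute and pooling of rare nominal values. Both are just one option among many, and the function names, the "OTHER" label, and the sample values are made up for illustration.

```python
import bisect
from collections import Counter


def equal_frequency_cutpoints(values, bins):
    """Choose up to bins - 1 cut points so each bin holds about as many values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, bins):
        cut = ordered[(i * n) // bins]
        if not cuts or cut > cuts[-1]:  # skip duplicate cut points
            cuts.append(cut)
    return cuts


def discretize(value, cutpoints):
    """Return the index of the bin that contains value."""
    return bisect.bisect_right(cutpoints, value)


def group_rare_values(values, min_count, other="OTHER"):
    """Replace nominal values seen fewer than min_count times by one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical continuous and nominal columns.
ages = [22, 25, 31, 38, 45, 52, 60, 67]
cuts = equal_frequency_cutpoints(ages, 4)
age_bins = [discretize(a, cuts) for a in ages]
jobs = group_rare_values(["clerk", "clerk", "pilot", "diver", "clerk"], 2)
```

Cut points should be computed on the training data only and then reused for test examples, so that preprocessing does not leak information from the test set.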

9. Given the test set (to be distributed later), report the prediction quality for your hypothesis.