Hierarchical Beta Processes and the Indian Buffet Process
Matlab code

Romain Thibaux and Michael I. Jordan

   
         
What it does
 

This matlab code was used to run the experiments of our paper (here) on hierarchical beta processes (hbp). Only the simplest 2-level hierarchies are implemented (i.e. §6.1 and §6.2 but not §6.3). The code allows you to run Gibbs sampling (which reduces to rejection sampling for 2-level hierarchies) to compute the posterior probability of each feature (e.g. word), as well as the remaining frequency of unknown features, in each category (e.g. topic), after observing data.

The data, preformatted, is available as a Matlab .mat file, or as a perl script to execute on the original data. The code can then read from either source.

If you find a bug, or have trouble installing, let me know.

   
         
Main functions
 

hbp: takes a training set of N documents (in sparse binary matrix format) split into n categories, a test set (m documents with unknown categories), and the parameters of an hbp prior. It returns the posterior probability that each test document belongs to each category, as an n*m matrix.

test_hbp: a scripts which loads an example dataset, chooses reasonable parameters for the hbp and runs the above function. It also compares the performance with that of naive Bayes.

   
         
Other functions
  choose_hbp_parameters: picks reasonable parameters c0 and gamma for the hbp prior.
hbp_cross_validation
: picks a good c parameter by cross-validation.
beta_hierarchy_inference
: inference in a singleton slice of the hbp model (eq. 9).
fit_gamma: finds the parameters of a gamma variable that best approximates the posterior of b in a slice of the hbp (eq. 10 and following). It uses a third degree fixed point equation for fast convergence (not mentioned in the paper, type "help fit_gamma" for more information) but binary search is also implemented for comparison.
posterior_beta_mode: finds the maximum a posteriori for b in the slice model, i.e. it finds the maximum of eq. 10. This function works in tandem with fit_gamma().
rejection_sampling: draws a random sample from the posterior of b (eq. 10) by rejection sampling using a gamma approximation as proposal distribution.
naive_bayes: naive bayes classification with Laplace smoothing.
and various helper functions
   
         
Code
  Dowload the code, compressed with winzip [zip] or tar [tar]. Uncompress it and set the path of matlab (File > Set Path...) so it can see it. The code is licensed under BSD. See the readme file [readme] for a list of conditions attached to this license.    
         
Data
 

The preprocessed data in a zipped .mat format is here [5 Mb, 30Mb uncompressed]. This dictionary gives the words associated to each feature (features are ordered by frequency to help with interpretability). As can be seen by browsing the dictionary, very minimal preprocessing was applied. Stop words and rare words alike were left in.

Alternatively, to reproduce this data from the original 20 newsgroup data, first download the data 20news-bydate.tar.gz from there. Once uncompressed it contains two directories 20news-bydate-test/ and 20news-bydate-train/ inside a toplevel directory 20news-bydate/. Then place this perl script and this makefile in 20news-bydate/. Run 'make' (under Windows this requires cygwin). The script takes a long time to run, and generates many auxillary files. In the end the directories contain about 100 Mb, which can amount to about 300 Mb of disk space because of the large number of files. Among other things, it generates the dictionary mentioned above.

The script translates each document into a feature list: the list of word ids that this document contains, where words have a unique id for the entire data. It should be easy to modify to include many other types of binary features (pairs of words, etc).