--------------------------------------------------------
Divide-and-Conquer Matrix Factorization on QSUB Clusters
--------------------------------------------------------
This package contains an implementation of DFC, a framework to perform
divide-and-conquer matrix factorizations, as detailed in
http://www.cs.berkeley.edu/~ameet/dfc. The code is meant for use on clusters
in which jobs are submitted via the ***qsub command*** The code was initially
written in MATLAB (version R2008a) and subsequently compiled using the MATLAB
compiler, thus no MATLAB license is required in order to execute the code in
parallel. Note that this code has only been tested in UNIX with 64-bit
machines. Additionally, two variants of the code exist for Torque/PBS and SGE
job schedulers. Although the discussions below (regarding installation,
input/output arguments, examples, etc.) explicitly refer to the Torque/PBS
script (run-dfc.sh), they are equally applicable to the SGE script
(run-dfc-sge.sh).
The script run-dfc.sh takes as input a matrix to be factorized along with a
variety of input parameters. It submits jobs via qsub to a cluster, and
outputs files in a set of subdirectories in a specified output directory. The
most important subdirectory is the 'mat' directory which contains the results
of the base matrix factorization algorithm on each submatrix as well as the
final DFC result. The user can specify the output format using the out_type
parameter. If out_type = 1, then the results are stored in a file called
'projection_*.mat'. This MATLAB file contains several variables including
'L_proj', which stores a factored representation of the recovered low-rank
component of the input matrix. If out_type = 2, then the results are stored in
multiple text files, using the Matrix Market format (see
http://math.nist.gov/MatrixMarket/formats.html#MMformat for details about this
format and for I/O code in C, Python, Fortran and Matlab). Details about the
inputs to run-dfc.sh and the final outputs of this script are described
below.
We hope you find this DFC software useful. Please cite the following work in
any publications that make use of it:
[MaTaJo11] Lester Mackey, Ameet Talwalkar, Michael I. Jordan.
Divide-and-Conquer Matrix Factorization.
Neural Information Processing Systems (NIPS) 2011.
-------------------------
Installation instructions
-------------------------
To run this code, perform the following steps:
1) Download dfc.tar.gz.
2) Extract the underlying dfc/ directory, i.e., run 'tar -xvf dfc.tar.gz'.
3) Check that the compiled matlab binaries run on your platform by running
'./run-matlab-binary-test.sh' in the dfc directory, and then verifying that
dfc/test_matlab_binaries/partition_method-NNLS_...job-1-of-4.mat exists.
-> This code has only has been tested in UNIX with 64-bit machines.
4) Call './run-dfc.sh ...' and verify the outputs in MATLAB by running
check_results(...) using an example dataset (described below).
-> In the noiseless settings errors should be smaller than 1E-3 and in the
noisy settings errors should be in the 1E-2 range.
---------------------------------------
Input arguments to run-dfc.sh script
---------------------------------------
- The full path to an out directory where results will be stored.
- The number of rows in the matrix, m.
- The number of columns in the matrix, n; ***note that we require m <= n***.
- The number of parallel jobs to run: the matrix columns will be partitioned
into this many groupings, and the base-MF algorithm will be run in parallel on
each submatrix; the parallelism is achieved by submitting multiple jobs to the
cluster; the various results are then combined via a final projection job.
- The path to the input data matrix, M: the matrix must be stored in a .MAT
file containing a variable M with dimensions m x n.
- The type of matrix factorization to be performed (1 = matrix completion; 2 =
robust matrix factorization).
- The format of the input matrix (1 = MATLAB .mat file; 2 = text file in matrix market format)
- The format of the output matrix (1 = MATLAB .mat file; 2 = text file in matrix market format)
NOTE: The code works with base MF algorithms that are instances of two methods
(APG and ALM) for solving constrained optimization problems. The algorithms are
listed below, along with links to third-party implementations that we use:
Accelerated Proximal Gradient (APG) Method [DEFAULT]
- 'NNLS' (MC)
-> http://www.math.nus.edu.sg/~mattohkc/NNLS.html
- 'APG_RPCA' (RMF)
-> http://perception.csl.uiuc.edu/matrix-rank/Files/apg_partial.zip
Augmented Lagrange Multiplier (ALM) Method
- 'ALM' (MC)
-> http://perception.csl.uiuc.edu/matrix-rank/Files/inexact_alm_mc.zip
- 'ALM_RPCA' (RMF)
-> http://perception.csl.uiuc.edu/matrix-rank/Files/inexact_alm_rpca.zip
-------------------------------------------------------------------
Outputs stored in '/mat/projection_*.mat' (if out_type = 1)
-------------------------------------------------------------------
- L_proj: a struct representing the recovered low-rank component of the input
matrix using the projection method to combine the outputs from all submatrices.
L_proj has two fields, A and B, representing the left and right factors, so
that L_proj.A.' * L_proj.B equals the low rank reconstruction of interest.
- elapsed_proj_L: stores the time, in seconds, required to recover this
low-rank component; this timing statistic equals the time required by the final
projection step plus the maximum time taken by the parallel partition jobs.
- L_proj_ens, L_part (optional): structs representing the recovered low-rank
component of the input matrix using the ensemble projection or partition
methods to combine the outputs from all submatrices. These structs, which have
the same fields as L_proj, are only returned when the estimated ranks of these
matrices are less than the input parameter ens_rank_cutoff. As a default, these
structs are not computed since ens_rank_cutoff is set to -1, and in practice,
large values of ens_rank_cutoff (e.g., above 5K) will significantly slow down
the final projection step.
- elapsed_proj_L_ens, elapsed_part_L (optional): timing results for the
optional ensemble projection and partition methods, with timing computed in the
same way as for elapsed_proj_L.
--------------------------------------------------
Outputs in Matrix Market format (if out_type = 2)
--------------------------------------------------
The factored components of L_proj (and optionally L_proj_ens, L_part) are each
stored in Matrix Market text files in /mat. These components, i.e.,
L_proj.A and L_proj.B are described in the previous section. See
http://math.nist.gov/MatrixMarket/formats.html#MMformat for details about this
format and for I/O code in C, Python, Fortran and Matlab.
----------------
Example Datasets
----------------
Example datasets are included in /dfc/examples. Example datasets
were created in MATLAB using create_mc_examples.m and create_rpca_examples.m
(create_mc_examples.m also uses index.c which is used as a MEX file). Examples
can be run as follows:
1) cd /dfc
2) run on qsub cluster: ./run-dfc.sh /dfc/rpca_test_noise 500 1000 4 /dfc/examples/test_rpca_matrix_with_noise.mat 2 1 1
-> this partitions the input matrix into 4 submatrices, runs the base MF
algorithm on each submatrix in parallel, and then combines the outputs from
all submatrices via the projection method (and optionally via the ensemble
projection and partition methods).
3) after all 5 jobs complete, check results:
3a) cd /dfc/examples
3b) launch MATLAB
3c) run: check_results('test_rpca_matrix_with_noise.mat', '../rpca_test_noise/mat/projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1.mat')
NOTES:
- To output the optional low-rank matrices generated via the ensemble
projection and partition methods, set ens_rank_cutoff=2000 in
run-dfc.sh
- The above instructions provide commands to run the 'RMF with noise' example.
Similar steps can be performed to run other examples:
- 'RMF without noise':
-> ./run-dfc.sh /dfc/rpca_test 500 1000 4 /dfc/examples/test_rpca_matrix.mat 2 1 1
-> check_results('test_rpca_matrix.mat', '../rpca_test/mat/projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1.mat')
- 'MC with noise':
-> ./run-dfc.sh /dfc/mc_test_noise 2000 4000 4 /dfc/examples/test_mc_matrix_with_noise.mat 1 1 1
-> check_results('test_mc_matrix_with_noise.mat', '../mc_test_noise/mat/projection_method-NNLS_tol-1e-07_maxiter--1_m-2000_n-4000_l-1000_r--1_lambda--1.mat')
- 'MC without noise':
-> ./run-dfc.sh /dfc/mc_test 2000 4000 4 /dfc/examples/test_mc_matrix.mat 1 1 1
-> check_results('test_mc_matrix.mat', '../mc_test/mat/projection_method-NNLS_tol-1e-07_maxiter--1_m-2000_n-4000_l-1000_r--1_lambda--1.mat')
- The following command illustrates how to use Matrix Market text files as inputs and outputs:
-> ./run-dfc.sh /dfc/rpca_test_noise 500 1000 4 /dfc/examples/test_rpca_matrix_with_noise.mat 2 2 2
-> Outputs can be found in the /rpca_test_noise/mat directory, with the files for the projection method called:
- projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1_LprojA.m
- projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1_LprojB.m
-> The I/O scripts found in the link below can be used to read / postprocess these outputs in various programming languages:
- http://math.nist.gov/MatrixMarket/formats.html#MMformat