-------------------------------------------------------- Divide-and-Conquer Matrix Factorization on QSUB Clusters -------------------------------------------------------- This package contains an implementation of DFC, a framework to perform divide-and-conquer matrix factorizations, as detailed in http://www.cs.berkeley.edu/~ameet/dfc. The code is meant for use on clusters in which jobs are submitted via the ***qsub command*** The code was initially written in MATLAB (version R2008a) and subsequently compiled using the MATLAB compiler, thus no MATLAB license is required in order to execute the code in parallel. Note that this code has only been tested in UNIX with 64-bit machines. Additionally, two variants of the code exist for Torque/PBS and SGE job schedulers. Although the discussions below (regarding installation, input/output arguments, examples, etc.) explicitly refer to the Torque/PBS script (run-dfc.sh), they are equally applicable to the SGE script (run-dfc-sge.sh). The script run-dfc.sh takes as input a matrix to be factorized along with a variety of input parameters. It submits jobs via qsub to a cluster, and outputs files in a set of subdirectories in a specified output directory. The most important subdirectory is the 'mat' directory which contains the results of the base matrix factorization algorithm on each submatrix as well as the final DFC result. The user can specify the output format using the out_type parameter. If out_type = 1, then the results are stored in a file called 'projection_*.mat'. This MATLAB file contains several variables including 'L_proj', which stores a factored representation of the recovered low-rank component of the input matrix. If out_type = 2, then the results are stored in multiple text files, using the Matrix Market format (see http://math.nist.gov/MatrixMarket/formats.html#MMformat for details about this format and for I/O code in C, Python, Fortran and Matlab). Details about the inputs to run-dfc.sh and the final outputs of this script are described below. We hope you find this DFC software useful. Please cite the following work in any publications that make use of it: [MaTaJo11] Lester Mackey, Ameet Talwalkar, Michael I. Jordan. Divide-and-Conquer Matrix Factorization. Neural Information Processing Systems (NIPS) 2011. ------------------------- Installation instructions ------------------------- To run this code, perform the following steps: 1) Download dfc.tar.gz. 2) Extract the underlying dfc/ directory, i.e., run 'tar -xvf dfc.tar.gz'. 3) Check that the compiled matlab binaries run on your platform by running './run-matlab-binary-test.sh' in the dfc directory, and then verifying that dfc/test_matlab_binaries/partition_method-NNLS_...job-1-of-4.mat exists. -> This code has only has been tested in UNIX with 64-bit machines. 4) Call './run-dfc.sh ...' and verify the outputs in MATLAB by running check_results(...) using an example dataset (described below). -> In the noiseless settings errors should be smaller than 1E-3 and in the noisy settings errors should be in the 1E-2 range. --------------------------------------- Input arguments to run-dfc.sh script --------------------------------------- - The full path to an out directory where results will be stored. - The number of rows in the matrix, m. - The number of columns in the matrix, n; ***note that we require m <= n***. - The number of parallel jobs to run: the matrix columns will be partitioned into this many groupings, and the base-MF algorithm will be run in parallel on each submatrix; the parallelism is achieved by submitting multiple jobs to the cluster; the various results are then combined via a final projection job. - The path to the input data matrix, M: the matrix must be stored in a .MAT file containing a variable M with dimensions m x n. - The type of matrix factorization to be performed (1 = matrix completion; 2 = robust matrix factorization). - The format of the input matrix (1 = MATLAB .mat file; 2 = text file in matrix market format) - The format of the output matrix (1 = MATLAB .mat file; 2 = text file in matrix market format) NOTE: The code works with base MF algorithms that are instances of two methods (APG and ALM) for solving constrained optimization problems. The algorithms are listed below, along with links to third-party implementations that we use: Accelerated Proximal Gradient (APG) Method [DEFAULT] - 'NNLS' (MC) -> http://www.math.nus.edu.sg/~mattohkc/NNLS.html - 'APG_RPCA' (RMF) -> http://perception.csl.uiuc.edu/matrix-rank/Files/apg_partial.zip Augmented Lagrange Multiplier (ALM) Method - 'ALM' (MC) -> http://perception.csl.uiuc.edu/matrix-rank/Files/inexact_alm_mc.zip - 'ALM_RPCA' (RMF) -> http://perception.csl.uiuc.edu/matrix-rank/Files/inexact_alm_rpca.zip ------------------------------------------------------------------- Outputs stored in '/mat/projection_*.mat' (if out_type = 1) ------------------------------------------------------------------- - L_proj: a struct representing the recovered low-rank component of the input matrix using the projection method to combine the outputs from all submatrices. L_proj has two fields, A and B, representing the left and right factors, so that L_proj.A.' * L_proj.B equals the low rank reconstruction of interest. - elapsed_proj_L: stores the time, in seconds, required to recover this low-rank component; this timing statistic equals the time required by the final projection step plus the maximum time taken by the parallel partition jobs. - L_proj_ens, L_part (optional): structs representing the recovered low-rank component of the input matrix using the ensemble projection or partition methods to combine the outputs from all submatrices. These structs, which have the same fields as L_proj, are only returned when the estimated ranks of these matrices are less than the input parameter ens_rank_cutoff. As a default, these structs are not computed since ens_rank_cutoff is set to -1, and in practice, large values of ens_rank_cutoff (e.g., above 5K) will significantly slow down the final projection step. - elapsed_proj_L_ens, elapsed_part_L (optional): timing results for the optional ensemble projection and partition methods, with timing computed in the same way as for elapsed_proj_L. -------------------------------------------------- Outputs in Matrix Market format (if out_type = 2) -------------------------------------------------- The factored components of L_proj (and optionally L_proj_ens, L_part) are each stored in Matrix Market text files in /mat. These components, i.e., L_proj.A and L_proj.B are described in the previous section. See http://math.nist.gov/MatrixMarket/formats.html#MMformat for details about this format and for I/O code in C, Python, Fortran and Matlab. ---------------- Example Datasets ---------------- Example datasets are included in /dfc/examples. Example datasets were created in MATLAB using create_mc_examples.m and create_rpca_examples.m (create_mc_examples.m also uses index.c which is used as a MEX file). Examples can be run as follows: 1) cd /dfc 2) run on qsub cluster: ./run-dfc.sh /dfc/rpca_test_noise 500 1000 4 /dfc/examples/test_rpca_matrix_with_noise.mat 2 1 1 -> this partitions the input matrix into 4 submatrices, runs the base MF algorithm on each submatrix in parallel, and then combines the outputs from all submatrices via the projection method (and optionally via the ensemble projection and partition methods). 3) after all 5 jobs complete, check results: 3a) cd /dfc/examples 3b) launch MATLAB 3c) run: check_results('test_rpca_matrix_with_noise.mat', '../rpca_test_noise/mat/projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1.mat') NOTES: - To output the optional low-rank matrices generated via the ensemble projection and partition methods, set ens_rank_cutoff=2000 in run-dfc.sh - The above instructions provide commands to run the 'RMF with noise' example. Similar steps can be performed to run other examples: - 'RMF without noise': -> ./run-dfc.sh /dfc/rpca_test 500 1000 4 /dfc/examples/test_rpca_matrix.mat 2 1 1 -> check_results('test_rpca_matrix.mat', '../rpca_test/mat/projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1.mat') - 'MC with noise': -> ./run-dfc.sh /dfc/mc_test_noise 2000 4000 4 /dfc/examples/test_mc_matrix_with_noise.mat 1 1 1 -> check_results('test_mc_matrix_with_noise.mat', '../mc_test_noise/mat/projection_method-NNLS_tol-1e-07_maxiter--1_m-2000_n-4000_l-1000_r--1_lambda--1.mat') - 'MC without noise': -> ./run-dfc.sh /dfc/mc_test 2000 4000 4 /dfc/examples/test_mc_matrix.mat 1 1 1 -> check_results('test_mc_matrix.mat', '../mc_test/mat/projection_method-NNLS_tol-1e-07_maxiter--1_m-2000_n-4000_l-1000_r--1_lambda--1.mat') - The following command illustrates how to use Matrix Market text files as inputs and outputs: -> ./run-dfc.sh /dfc/rpca_test_noise 500 1000 4 /dfc/examples/test_rpca_matrix_with_noise.mat 2 2 2 -> Outputs can be found in the /rpca_test_noise/mat directory, with the files for the projection method called: - projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1_LprojA.m - projection_method-APG_RPCA_tol-1e-07_maxiter--1_m-500_n-1000_l-250_r--1_lambda--1_LprojB.m -> The I/O scripts found in the link below can be used to read / postprocess these outputs in various programming languages: - http://math.nist.gov/MatrixMarket/formats.html#MMformat