Dimensionality Reduction via Sparse Support Vector Machines

Kristin Bennett, Jinbo Bi, Mark Embrechts, Curt Breneman and Minghu Song

Departments of Mathematics, DSES, and Chemistry
Rensselaer Polytechnic Institute

Abstract
We describe a methodology for performing variable ranking and selection using support vector machines (SVMs). The method constructs a series of sparse linear SVMs to generate linear models that can generalize well, and uses a subset of nonzero weighted variables found by the linear models to produce a final nonlinear model. The method exploits the fact that a linear SVM (no kernels) with $\ell_1$-norm regularization inherently performs variable selection as a side-effect of minimizing capacity of the SVM model. The distribution of the linear model weights provides a mechanism for ranking and interpreting the effects of variables. Starplots are used to visualize the magnitude and variance of the weights for each variable. We illustrate the effectiveness of the methodology on synthetic data, benchmark problems and challenging regression problems in drug design. This method can dramatically reduce the number of variables, and outperforms SVMs trained using all attributes and using the attributes selected according to correlation coefficients. The visualization of the resulting models is useful for understanding the role of underlying variables.
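
The selection idea in the abstract can be illustrated with a small sketch. This is not the authors' CPLEX implementation: as a stand-in for the $\ell_1$-norm linear SVM, it fits an $\ell_1$-regularized least-squares model by coordinate descent, which drives many weights to exactly zero in a similar way. The function names, the synthetic data, the penalty value, and the subsampling scheme are all assumptions made for illustration. Several sparse linear models are trained on random subsamples, and the variables are ranked by the mean magnitude of their weights across models.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; this step is what produces exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_linear(X, y, penalty, sweeps=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + penalty*||w||_1 (no intercept).

    Assumption: squared loss replaces the SVM loss used in the paper.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # residual = y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes variable j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def rank_variables(X, y, penalty=0.1, n_models=10, frac=0.8, seed=0):
    """Train sparse linear models on random subsamples and rank the variables."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    weights = []
    for _ in range(n_models):
        rows = rng.sample(range(n), int(frac * n))
        weights.append(
            fit_sparse_linear([X[i] for i in rows], [y[i] for i in rows], penalty)
        )
    mean_abs = [sum(abs(w[j]) for w in weights) / n_models for j in range(d)]
    # Keep every variable that received a nonzero weight in at least one model.
    selected = [j for j in range(d) if any(w[j] != 0.0 for w in weights)]
    ranking = sorted(range(d), key=lambda j: -mean_abs[j])
    return ranking, selected, weights


def make_data(n=40, d=10, seed=1):
    """Synthetic data: only variables 0 and 1 influence the response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = make_data()
ranking, selected, weights = rank_variables(X, y)
print("top-ranked variables:", ranking[:2])
print("variables with a nonzero weight in some model:", selected)
```

In the methodology described above, the selected subset would then be passed to a nonlinear model, and the per-variable spread of the weights across models is the kind of information a starplot would display.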

This paper has been accepted for the JMLR special issue on variable/feature selection.
A longer version of the paper than the one accepted by JMLR can be found here. It comprises two chapters of Jinbo Bi’s PhD thesis and gives a more thorough description of the QSAR application.

A data set used in this paper

The raw Caco2 dataset (gzipped) was generated in the ongoing Automated Drug Discovery project. The dataset consists of 27 molecules and 713 descriptors calculated using the RECON, PEST and MOE programs. These descriptors encode the molecular shape, topology, subdivided surface-area and electron-density properties, which have been widely applied in Quantitative Structure-Activity Relationship (QSAR) studies. The property LogPC (the last column) is the response, representing the Caco2 permeability of the compounds. See the longer version for details.

The preprocessed Caco2 dataset (gzipped) was generated by applying commonly used chemometric screening techniques (see the JMLR paper or the longer version). Our variable selection and induction algorithms were tested on the preprocessed dataset.

CPLEX programs

All of our algorithms were implemented using the commercial optimization software CPLEX 6.6. The programs are available on request (contact Dr. Kristin Bennett, bennek@rpi.edu). An appropriate version of CPLEX is required to run them.

Contact Jinbo Bi (jinbo@engr.uconn.edu) for information about this page.