Dr. Yann Guermeur
Laboratoire Lorrain de Recherche en Informatique et ses Applications
Université Henri Poincaré
Title: A new SVM for multi-category discriminant analysis
Abstract. Vapnik's statistical learning theory provides a theoretical framework for studies dealing with the fundamental problems of pattern recognition. The theory of bounds and its natural extension, the Structural Risk Minimization (SRM) inductive principle, make it possible to assess and control the generalization performance of discriminant models at the same time. However, bounds on the rate of convergence of the empirical risk have mainly been derived for models computing dichotomies. The Multi-category Support Vector Machines (M-SVMs) developed so far are not explicitly related to any of these bounds, and thus, strictly speaking, do not implement the SRM principle. Building upon a uniform convergence result specifically derived for real-valued multi-category discriminant models, we propose a theoretical foundation for M-SVMs. This framework, which incorporates the aforementioned models, also includes a new machine with appealing statistical properties, which is assessed in this work on protein secondary structure prediction. Precisely, this machine is used to combine the outputs of three prediction methods based on different statistical principles: SOPMA, GOR IV and SIMPA96. The implementation raises technical difficulties, which stem from the size of the database required. These difficulties are bypassed by appropriately processing the objective function of the convex programming problem to be solved. To estimate the prediction accuracy, a two-level cross-validation procedure inspired by Wolpert's stacked generalization is implemented on a release of the PDBSELECT database made up of 629 chains. The recognition rate achieved is 71.7%, 2.0% higher than that of the best individual method, SOPMA. An additional increase of 0.5% results from post-processing the outputs with a non-stationary DP algorithm.
The overall performance of 72.2% represents an improvement of 0.9% over the best result obtained so far on the same set of protein chains, which is statistically significant with confidence exceeding 0.95. In short, three benefits result from this combination. First, the combined prediction is consistently and statistically significantly better than the prediction of any of the individual classifiers. Second, the bound provides us with an estimate of the generalization error. Third, the outputs can be exploited successfully by higher-level treatments. Furthermore, our SVM can be applied to other tasks in biocomputing, for instance the identification of protein coding regions in genomic DNA.
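
The two-level cross-validation mentioned in the abstract can be illustrated with a minimal Python sketch of the index bookkeeping only. The abstract does not give the number of folds, so `k_outer=5` and `k_inner=4` are assumptions, and the function names `k_fold` and `two_level_splits` are hypothetical. The idea, in the spirit of stacked generalization, is that the inner folds supply held-out predictions of the individual methods on which the combiner is trained, while each outer test fold stays untouched until the final accuracy estimate.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold(indices: List[int], k: int) -> Iterator[Split]:
    """Yield (train, held_out) index lists for k contiguous folds."""
    n = len(indices)
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        held_out = indices[bounds[i]:bounds[i + 1]]
        train = indices[:bounds[i]] + indices[bounds[i + 1]:]
        yield train, held_out


def two_level_splits(
    n_chains: int, k_outer: int = 5, k_inner: int = 4
) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    Splitting is done per chain (an assumption), so that all residues of
    a chain fall on the same side of every split.
    """
    for outer_train, outer_test in k_fold(list(range(n_chains)), k_outer):
        # Inner folds partition the outer training chains only: the base
        # predictors are fitted on `inner_train` and their outputs on
        # `inner_held_out` form the training set of the combiner.
        inner_splits = list(k_fold(outer_train, k_inner))
        yield outer_train, outer_test, inner_splits


if __name__ == "__main__":
    # 629 is the number of chains quoted in the abstract.
    for outer_train, outer_test, inner in two_level_splits(629):
        print(len(outer_train), len(outer_test), [len(h) for _, h in inner])
```

Under this scheme no chain used to score the combined predictor has been seen by either level during training, which is what allows the outer-fold accuracy to serve as an estimate of generalization performance.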