Sunday, July 5, 2009

Implementing SVM on a large dataset

Support vector machines (SVM) are a set of related supervised learning methods used for classification and regression. Our aim is to compare several free implementations of SVM in terms of accuracy and computation time. Because of the heuristic nature of the algorithm, different tools can produce different results on the same dataset. Publications describing the performance of SVM should therefore specify not only the parameters of the algorithm but also the tool used, since the tool itself can influence the results.

SVM is effective in domains with a very high number of predictive variables, even when the ratio between the number of variables and the number of observations is unfavorable. The problem treated in this tutorial is particularly favorable to SVM. We want to discriminate between two families of proteins from their description in terms of amino acids. We use sequences of 4 characters (4-grams) as descriptors, so the number of descriptors (31,809) is large compared with the number of examples (135 instances).
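To make the 4-gram description concrete, here is a minimal Python sketch of how such descriptors could be derived from raw sequences. It is an illustration only, not the procedure used to build the tutorial's dataset: the function names, the two toy sequences, and the choice of occurrence counts (rather than presence/absence indicators) are all assumptions.

```python
from collections import Counter

def n_grams(sequence, n=4):
    """Count the overlapping n-grams of a sequence (n=4 gives 4-grams)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def build_matrix(sequences, n=4):
    """Return (vocabulary, matrix): one row per sequence, one column per n-gram."""
    counts = [n_grams(s, n) for s in sequences]
    # The vocabulary is every n-gram seen in at least one sequence.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c.get(gram, 0) for gram in vocabulary] for c in counts]
    return vocabulary, matrix

# Two made-up amino acid sequences (hypothetical, not taken from the dataset).
sequences = ["MKVLAAGIV", "MKVLSAGIV"]
vocabulary, matrix = build_matrix(sequences)
print(len(vocabulary), "descriptors for", len(matrix), "examples")
```

Even on two short sequences the number of columns exceeds the number of rows, which is how a set of 135 proteins can end up with tens of thousands of descriptors.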

We compare Tanagra 1.4.27, Orange 1.0b2, RapidMiner Community Edition 4.2 and Weka 3.5.6.
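Comparing tools on accuracy and computation time amounts to running the same cross-validation with each learner and timing it. The sketch below shows the idea in plain Python under strong assumptions: `cross_validate`, `fit_majority` and `predict_majority` are hypothetical names, and the majority-class learner is only a stand-in (it is not an SVM, and it is not how the tools above run their experiments).

```python
import random
import time
from collections import Counter

def fit_majority(X, y):
    """Stand-in learner (NOT an SVM): remember the most frequent label."""
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, X):
    """Predict the remembered label for every row."""
    return [model for _ in X]

def cross_validate(fit, predict, X, y, k=5, seed=0):
    """Return (pooled accuracy, elapsed seconds) for a k-fold cross-validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out sets
    start = time.perf_counter()
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predicted = predict(model, [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(predicted, test_idx))
    elapsed = time.perf_counter() - start
    return correct / len(X), elapsed

# Toy data (made up): 15 examples of class "a" and 5 of class "b".
X = [[i] for i in range(20)]
y = ["a"] * 15 + ["b"] * 5
accuracy, seconds = cross_validate(fit_majority, predict_majority, X, y)
print(f"accuracy={accuracy:.2f}, time={seconds:.4f}s")
```

Swapping in a different `fit`/`predict` pair while keeping the same folds and seed is what makes the accuracy and timing figures comparable across learners.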

Keywords: svm, support vector machine
Components: C-SVC, SVM, SUPERVISED LEARNING, CROSS-VALIDATION
Tutorial: en_Tanagra_Perfs_Comp_SVM.pdf
Dataset: wide_protein_classification.zip
Reference:
Wikipedia (en), « Support vector machine »