1 Introduction
One of the main goals of the school is to offer practical and theoretical tools for developing new ensemble methods, and to stimulate an informal discussion among the participants about problems involving the design, implementation and application of ensemble methods to real-world problems.
During the week of the school, the Ensemble-lab will make available to the participants some personal computers with data bases and software packages, including ensemble methods, already installed, so that the different ensemble methods can be tested and compared on a uniform set of data bases. Moreover, a wireless 11 Mbit/s 802.11b connection will allow participants who arrive with their laptops and the appropriate wireless communication card to stay connected while at the meeting area.
In particular, we encourage the participants to apply their methods to the data sets listed below, to present demos of their own ensemble software using the PCs available at IIASS, and to give a brief presentation at the end of the conference. The contributions can include existing or novel ensemble methods, as well as benchmark methods and other real data sets.
On this web page we plan to collect all the software that will be made available on the PCs at the school. All participants in the school (and all other researchers in the field) are invited to contribute to this page by sending pointers to their own data bases/programs, or to software of particular interest already available on the web, to Francesco Masulli (masulli@disi.unige.it).
Participants contributing data sets will give a brief explanation of their data set at the beginning of the workshop, so that the other participants can get a feeling for the type of information and the type of noise that might be in the data.
In any case, we suggest that the participants start working on these data bases and programs before the school.
2 Real-World Data Bases
DELVE (Data for Evaluating Learning in Valid Experiments), developed at the University of Toronto (see the Delve Development Group), contains:
- A software environment, which allows you to manipulate datasets and do statistical analysis of learning-method performance.
- A number of datasets for regression and classification.
- Learning methods.
Most data sets have performance results (see the Datasets Summary Table).
We suggest studying some of the large data sets in DELVE. Some of them are difficult, such as those on the dynamics of the robot arm. Moreover, we suggest applying the software environment to test learning-method performances.
This is a gene expression data set. Data sets of this kind are often characterized by low cardinality and high dimensionality. It is maintained, among others concerning gene expression, at the Stanford Microarray Database. For details, see the paper by Alizadeh et al., Nature 403: 503-511 (2000).
Some preprocessed data can be downloaded by anonymous ftp from:
ftp://ftp.disi.unige.it/person/ValentiniG/Data/lymphoma/
In that directory, the file README explains how to use the data.
Other gene expression data sets are maintained at the M.I.T. Whitehead Institute Center for Genome Research.
The address of the Leukemia data set is:
http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html
A paper on it is "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring" by T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, available at http://www-genome.wi.mit.edu/mpr/pubs.html
Stefano Rovetta has organized the data base into a training set and a test set that can be used for testing your ensemble methods. These data files are in the following format:
- first number: number of patterns
- second number: dimensionality of the patterns
- subsequent numbers: a matrix with patterns in rows, pattern components (variables) in columns, and target labels (+1/-1) in the last column
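A minimal Python sketch of a reader for this format (our own illustration; the function name and file path are hypothetical):

```python
# Minimal reader for the format described above: the first number is
# the pattern count, the second the dimensionality, and the rest is a
# matrix whose last column holds the +1/-1 target labels.
def read_patterns(path):
    with open(path) as f:
        tokens = f.read().split()
    n, d = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:2 + n * (d + 1)]]
    patterns, labels = [], []
    for i in range(n):
        row = values[i * (d + 1):(i + 1) * (d + 1)]
        patterns.append(row[:d])       # pattern components (variables)
        labels.append(int(row[d]))     # target label in the last column
    return patterns, labels
```

The reader only relies on the two counts in the header, so the whitespace layout between the numbers (spaces or newlines) is irrelevant.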
2.5 Letter Recognition database (suggested by Tom Dietterich)
Available at the repository of the University of California at Irvine:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/letter-recognition/
Ensembles give really remarkable improvements on this data set.
2.6 Deterding Vowel Recognition Data (suggested by Shimon Cohen)
Data set from Carnegie Mellon University.
It is available at the repository of the University of California at
Irvine:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/connectionist-bench/vowel/
2.7 Coffee analysis data from the Pico Electronic Nose (donated by Matteo Pardo)
Data set from the University of Brescia (Italy). It is available at:
http://tflab.ing.unibs.it/staff/pardo/dataset.html
2.8 Data Base on Remote Sensing, suggested by Palma Blonda (to come)
2.9 Data Base on GIS, donated by Cesare Furlanello (to come)
3 Software packages for ensemble methods
3.1 PRtools (suggested by L. Kuncheva)
Author: Robert (Bob) Duin
Home Page: http://www.ph.tn.tudelft.nl/prtools/
Language: Matlab
Main Features:
- Datasets and Mappings
- Data Generation
- Linear and Higher Degree Polynomial Classifiers
- Nonlinear Classification
- Normal Density Based Classification
- Feature Selection
- Classifiers and Tests (general)
- Mappings
- Combining classification rules
- Clustering and Distances
3.2 Weka 3
Contact: wekasupport@cs.waikato.ac.nz
Contributors: Eibe Frank, Mark Hall, Len Trigg, Richard Kirkby, Gabi Schmidberger, Malcolm Ware, Xin Xu, Remco Bouckaert, Yong Wang, Stuart Inglis, Ian H. Witten
Home page: http://www.cs.waikato.ac.nz/~ml/weka/
Language: Java
Main Features:
- Implemented schemes for classification include:
  - decision tree inducers
  - rule learners
  - naive Bayes
  - decision tables
  - locally weighted regression
  - support vector machines
  - instance-based learners
  - logistic regression
  - voted perceptrons
  - multi-layer perceptron
- Implemented schemes for numeric prediction include:
  - linear regression
  - model tree generators
  - locally weighted regression
  - instance-based learners
  - decision tables
  - multi-layer perceptron
- Implemented "meta-schemes" include:
  - bagging
  - stacking
  - boosting
  - regression via classification
  - classification via regression
  - cost-sensitive classification
3.3 Torch
Author: Ronan Collobert
Collaborators: Samy Bengio and Johnny Mariethoz
Home page: http://www.torch.ch/
Language: C++
Main Features:
- A lot of things in gradient machines, that is, machines that can be trained with gradient descent. This includes Multi-Layered Perceptrons, Radial Basis Functions and Mixtures of Experts. In fact, there are a lot of small "modules" available (Linear module, Tanh module, SoftMax module...) that you can plug together as you want to get what you want.
- Support Vector Machines, for classification and regression.
- A Distribution package, which includes K-means, Gaussian Mixture Models, Hidden Markov Models and a Bayes Classifier. Moreover, classes for speech recognition with embedded training are available in this package.
- Ensemble models such as Bagging and AdaBoost.
- A few non-parametric models such as K-nearest-neighbours, Parzen Regression and the Parzen Density Estimator.
3.4 Random Forests Software
Author: Leo Breiman
Home page: http://www.stat.Berkeley.EDU/users/breiman/
Language: Fortran 77
Main Features:
Random forests are a combination of tree predictors such that each
tree depends on the values of a random vector sampled independently and
with the same distribution for all trees in the forest.
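As an illustration of that idea, here is a rough pure-Python sketch (our own, not Breiman's Fortran code): each tree, reduced to a one-level decision stump for brevity, is grown on a bootstrap sample and restricted to a random subset of the features, and the forest predicts by majority vote.

```python
import random
from collections import Counter

def train_stump(X, y, features):
    # exhaustively pick the feature/threshold/sign with fewest errors
    best = None
    for f in features:
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                errors = sum((sign if x[f] >= t else -sign) != yi
                             for x, yi in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return lambda x: sign if x[f] >= t else -sign

def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    # majority vote over all trees in the forest
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]
```

In Breiman's actual software the base learners are full decision trees and the random feature subset is redrawn at every split, so this sketch only mirrors the overall structure of the method.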
3.5 NEURObjects
Author: Giorgio Valentini
Collaborator: Francesco Masulli
Home page: http://www.disi.unige.it/person/ValentiniG/NEURObjects/
Language: C++
Main Features:
- I/O and data set pre-processing classes.
- Classes for automatic data set generation.
- Classes for neural network training and testing.
- Classes implementing learning algorithms.
- Classes implementing ECOC ensembles of learning machines.
- Classes for statistical evaluation of neural network performances.
3.6 BoosTexter
Authors: Erin Allwein, Robert Schapire, and Yoram Singer.
Home page: http://www.research.att.com/~schapire/BoosTexter/
Main Features:
BoosTexter is a general-purpose machine-learning program, based on boosting, for building a classifier from text and/or attribute-value data.
3.7 Other implementations of Boosting
See http://www.boosting.org/
3.8 ASNN Associative Neural Network
Author: Igor Tetko
Home page: http://www.vcclab.org/lab/asnn
Language: code on-line; standalone version available on request
Main Features:
ASNN represents a combination of an ensemble of
feed-forward
neural networks and the k-nearest neighbour technique.
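The following rough Python sketch (our own illustration, not Tetko's code) conveys the idea under simplifying assumptions: bootstrap least-squares line fits stand in for the feed-forward networks, and the plain ensemble average is corrected by the mean residual of the k nearest neighbours, with nearness measured in the space of the ensemble's outputs.

```python
import random

def fit_line(xs, ys):
    # ordinary least-squares line; a trivial stand-in for a trained net
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:                      # degenerate resample: constant model
        return lambda x: my
    b = sum((x - mx) * (yv - my) for x, yv in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def train_asnn(xs, ys, n_models=10, seed=0):
    # an ensemble of models, each fitted on a bootstrap resample
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def asnn_predict(models, xs, ys, x, k=3):
    ens = lambda q: [m(q) for m in models]     # vector of ensemble outputs
    avg = lambda outs: sum(outs) / len(outs)   # plain ensemble average
    out_x = ens(x)
    # squared distance measured in the space of ensemble outputs
    dist = lambda q: sum((a - b) ** 2 for a, b in zip(ens(q), out_x))
    neighbours = sorted(range(len(xs)), key=lambda i: dist(xs[i]))[:k]
    # correct the average by the mean residual of the k nearest neighbours
    correction = sum(ys[i] - avg(ens(xs[i])) for i in neighbours) / k
    return avg(out_x) + correction
```

The local residual correction is what distinguishes ASNN from a plain averaged ensemble: a systematic bias of the ensemble near the query point is estimated from its neighbours and subtracted out.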
3.9 "R" Statistical Computing Programming Language
Contributors: many, including Douglas Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, Luke Tierney, and Alexandros Karatzoglou
For questions about R: the R-help mailing list
Home page: http://www.r-project.org/
Main Features:
It is a free, open-source GNU implementation (under the GPL license) of the S programming language, and provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time-series analysis, classification, clustering, support vector machines, neural networks, ensembles, etc.).
3.10 ENTOOL
Authors: C. Merkwirt and J.D. Wichard
For questions about ENTOOL: entool@web.de
Home page: http://chopin.zet.agh.edu.pl/~wichtel/
ENTOOL is a software package for ensemble regression modelling. It is implemented mainly in Matlab, with some time-critical parts written in C/C++ (as mex-functions).
Objectives:
- Extending the ensemble learning approach to several types of models
- An object-oriented implementation that yields a transparent mixture of different models and allows users to add their own model classes
Methods:
The toolbox is equipped with several model classes for out-of-the-box usage:
- Radial basis functions (RBF)
- Linear regression
- Polynomial regression
- K-nearest-neighbour models with adaptive metric
- Multilayer perceptron (MLP)
- Adaptive Regression Splines (ARES)