Ensemble Lab

 

1 Introduction

One of the main goals of the school is to offer practical and theoretical tools for developing new ensemble methods, and to stimulate informal discussion among the participants about the design, implementation and application of ensemble methods to real-world problems.

During the week of the school, the Ensemble-lab will provide participants with personal computers on which data bases and software packages implementing ensemble methods are installed, so that the different ensemble methods can be tested and compared on a uniform set of data bases. In addition, an 11 Mbit/s 802.11b wireless connection will allow participants who bring a laptop with an appropriate wireless card to stay connected while in the meeting area.

In particular, we encourage the participants to apply their methods to the data sets listed below, present demos of their own ensemble software using the PCs available at IIASS, and give a brief presentation at the end of the conference. Contributions may include existing or novel ensemble methods, as well as benchmarking methods and other real-world data sets.

This web page will collect all the software that will be made available on the PCs at the school. All participants in the school (and all other researchers in the field) are invited to contribute to this page by sending pointers to their own data bases or programs, or to software of particular interest already available on the web, to Francesco Masulli (masulli@disi.unige.it).

Participants contributing data sets will briefly describe them at the beginning of the workshop, so that other participants can get a feeling for the type of information and the type of noise that the data may contain.

We suggest that participants start working with these data bases and programs before the school.

2 Real-World Data Bases

2.1 Delve Data Sets (contact: delve@cs.toronto.edu)

DELVE (Data for Evaluating Learning in Valid Experiments), developed at the University of Toronto (see Delve Development Group), contains:
    • A software environment, which allows you to manipulate datasets and do statistical analysis of learning method performance.
    • A number of datasets for regression and classification.
    • Learning methods.
Most data sets have performance results (see the Datasets Summary Table).

We suggest studying some of the large DELVE data sets. Some of them are difficult, such as those on the dynamics of the robot arm. We also suggest using the software environment to test the performance of learning methods.
 

2.2 Lymphoma (suggested by Giorgio Valentini)

This is a gene expression data set. Such data sets are often characterized by low cardinality (few samples) and high dimensionality (many measured genes per sample). It is maintained, along with other gene expression data sets, at the Stanford Microarray Database.

For details, see Alizadeh et al., Nature 403: 503-511 (2000).

Some preprocessed data are downloadable by anonymous ftp from: ftp://ftp.disi.unige.it/person/ValentiniG/Data/lymphoma/

In the directory, the file README explains  how to use  the data.

2.3 Leukemia (suggested by Stefano Rovetta)

Other gene expression data sets are maintained at the M.I.T. Whitehead Institute Center for Genome Research.

The address of  the Leukemia data set  is
http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html

The data set is described in the paper "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring" by T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, available at http://www-genome.wi.mit.edu/mpr/pubs.html

Stefano Rovetta has split the data base into a training set and a test set that can be used for testing your ensemble methods. The data files are in the following format:

  • first number: number of patterns
  • second number: dimensionality of patterns
  • subsequent numbers: a matrix with patterns in rows, pattern components (variables) in columns, and target labels (+1/-1) in the last column
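
The format above can be read with a few lines of code. The following is a minimal sketch in Python, not a tool distributed with the data. It assumes that the files are plain text with whitespace-separated numbers and that the stated dimensionality does not count the label column; both points should be checked against the actual files. The function name parse_patterns is hypothetical.

```python
from typing import List, Tuple


def parse_patterns(text: str) -> Tuple[List[List[float]], List[int]]:
    """Parse a data file given as one string.

    Assumed layout (see the list above): number of patterns,
    dimensionality, then one row per pattern with the target
    label (+1/-1) in the last column.
    """
    tokens = text.split()
    n_patterns = int(tokens[0])
    dim = int(tokens[1])
    values = tokens[2:]

    # Assumption: the label is an extra column beyond `dim`.
    row_len = dim + 1
    if len(values) != n_patterns * row_len:
        raise ValueError(
            "expected %d values, found %d" % (n_patterns * row_len, len(values))
        )

    patterns, labels = [], []
    for i in range(n_patterns):
        row = values[i * row_len:(i + 1) * row_len]
        patterns.append([float(v) for v in row[:dim]])
        # float() accepts "+1", "-1", "1.0", etc.
        labels.append(int(float(row[dim])))
    return patterns, labels
```

A file would be read with `parse_patterns(open(path).read())`, where `path` is the location of the downloaded training or test file. If the length check fails, the dimensionality probably already includes the label column and `row_len` should be set to `dim`.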

2.5 Letter Recognition database (suggested by Tom Dietterich)

Available at the repository of the University of California at Irvine:

ftp://ftp.ics.uci.edu/pub/machine-learning-databases/letter-recognition/

Ensembles are reported to give large improvements over single classifiers on this data set.
 

2.6 Deterding Vowel Recognition Data (suggested by Shimon Cohen)

Data set from Carnegie Mellon University. It is available at the repository of the University of California at Irvine:

 ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/connectionist-bench/vowel/

2.7 Coffee analysis data from Pico Electronic Nose (donated by Matteo Pardo)

Data set from the University of Brescia (Italy).  It is available at

 http://tflab.ing.unibs.it/staff/pardo/dataset.html 

2.8 Data Base on Remote Sensing (suggested by Palma Blonda; to come)

2.9 Data Base on GIS (donated by Cesare Furlanello; to come)

 

3 Software Packages for Ensemble Methods

3.1 PRtools (suggested by  L. Kuncheva )

Author:  Robert (Bob) Duin
Home Page:  http://www.ph.tn.tudelft.nl/prtools/
Language: Matlab
Main Features:
  • Datasets and Mappings
  • Data Generation
  • Linear and Higher Degree Polynomial Classifiers
  • Nonlinear Classification
  • Normal Density Based Classification
  • Feature Selection
  • Classifiers and Tests (general)
  • Mappings
  • Combining classification rules
  • Clustering and Distances

3.2 Weka 3 

Contact: wekasupport@cs.waikato.ac.nz
Contributors: Eibe Frank, Mark Hall, Len Trigg, Richard Kirkby, Gabi Schmidberger, Malcolm Ware, Xin Xu, Remco Bouckaert, Yong Wang, Stuart Inglis, Ian H. Witten
Home page:  http://www.cs.waikato.ac.nz/~ml/weka/
Language: Java
Main Features:
  • Implemented schemes for classification include:
    •      decision tree inducers 
    •      rule learners 
    •      naive Bayes
    •      decision tables
    •      locally weighted regression
    •      support vector machines
    •      instance-based learners
    •      logistic regression
    •      voted perceptrons
    •      multi-layer perceptron
  • Implemented schemes for numeric prediction include:
    •      linear regression
    •      model tree generators 
    •      locally weighted regression
    •      instance-based learners
    •      decision tables
    •      multi-layer perceptron
  •  Implemented "meta-schemes" include:
    •      bagging
    •      stacking
    •      boosting 
    •      regression via classification
    •      classification via regression
    •       cost sensitive classification

3.3 Torch

Author: Ronan Collobert
Collaborators: Samy Bengio and Johnny Mariethoz
Home page:  http://www.torch.ch/
Language: C++
Main Features: 
  • Gradient machines, that is, machines that can be trained by gradient descent. These include Multi-Layered Perceptrons, Radial Basis Functions and Mixtures of Experts. Many small "modules" are available (Linear module, Tanh module, SoftMax module...) and can be plugged together as needed.
  • Support Vector Machines, for classification and regression.
  • A Distribution package that includes Kmeans, Gaussian Mixture Models, Hidden Markov Models and a Bayes Classifier. Classes for speech recognition with embedded training are also available in this package.
  • Ensemble models such as Bagging and Adaboost.
  • A few non-parametric models such as K-nearest-neighbors, Parzen Regression and Parzen Density Estimator.

3.4 Random Forests Software

Author:   Leo Breiman
Home page: http://www.stat.Berkeley.EDU/users/breiman/
Language: Fortran 77 
Main Features: 
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.
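
To make the definition above concrete, here is a minimal sketch in Python. It is not Breiman's Fortran 77 code. The random vector of each tree is taken to be a bootstrap sample of the training set plus a random choice of candidate features at every node. The Gini split criterion, the depth limit and all function names are assumptions made for illustration, and class labels are assumed not to be tuples.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, features):
    """Return the (feature, threshold) that most reduces Gini, or None."""
    best, best_score = None, gini(y)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, n_try, rng, depth=0, max_depth=5):
    """Grow one tree; each node only looks at n_try random features."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return majority
    features = rng.sample(range(len(X[0])), n_try)
    split = best_split(X, y, features)
    if split is None:
        return majority
    f, t = split
    left = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return (
        f,
        t,
        grow_tree([r for r, _ in left], [lab for _, lab in left],
                  n_try, rng, depth + 1, max_depth),
        grow_tree([r for r, _ in right], [lab for _, lab in right],
                  n_try, rng, depth + 1, max_depth),
    )


def predict_tree(node, row):
    """Follow the splits until a leaf (a bare class label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def grow_forest(X, y, n_trees=25, n_try=1, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                n_try, rng))
    return forest


def predict_forest(forest, row):
    """Combine the tree predictors by majority vote."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A forest is grown with `grow_forest(X, y)`, where `X` is a list of feature rows and `y` the list of class labels, and queried with `predict_forest(forest, row)`.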
 

3.5 NEURObjects 

Author:   Giorgio Valentini
Collaborator:  Francesco Masulli
Home page:  http://www.disi.unige.it/person/ValentiniG/NEURObjects/
Language: C++
Main Features:
  • I/O and data sets pre-processing classes.
  • Classes for automatic data set generation.
  • Classes for neural networks training and testing
  • Classes implementing learning algorithms
  • Classes implementing ECOC ensembles of learning machines
  • Classes for statistical evaluation of neural networks performances

3.6 BoosTexter 

Authors: Erin Allwein, Robert Schapire, and Yoram Singer.
Home page:  http://www.research.att.com/~schapire/BoosTexter/
Main Features: 
BoosTexter is a general purpose machine-learning program based on boosting for building a classifier from text and/or attribute-value data. 
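
As a generic illustration of boosting, the sketch below implements plain binary AdaBoost with threshold stumps on attribute-value data, in Python. It is not BoosTexter's code and makes no claim about the boosting variants that BoosTexter implements. Labels are assumed to be +1/-1, and all function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Find the threshold stump with the lowest weighted error.

    Returns (error, feature, threshold, polarity); the stump predicts
    `polarity` when row[feature] <= threshold and `-polarity` otherwise.
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                err = sum(
                    wi for row, yi, wi in zip(X, y, w)
                    if (polarity if row[f] <= t else -polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best


def stump_predict(stump, row):
    _, f, t, polarity = stump
    return polarity if row[f] <= t else -polarity


def adaboost(X, y, rounds=10):
    """Return a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # start from uniform example weights
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-12)  # avoid division by zero
        if err >= 0.5:  # no stump beats chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(stump, row))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Weighted vote of the stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in model)
    return 1 if score >= 0 else -1
```

On the one-dimensional toy set with inputs 1 to 6 and labels +1, +1, -1, -1, +1, +1, no single stump classifies every example correctly, but the weighted vote of the three stumps returned by `adaboost(X, y, rounds=3)` does.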

3.7 Other implementations of Boosting

See  http://www.boosting.org/

3.8 ASNN Associative Neural Network

Author:  Igor Tetko
Home page: http://www.vcclab.org/lab/asnn
Availability: code on-line; standalone version available on request
Main Features: 
ASNN represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbour technique.

3.9 "R" Statistical Computing Programming Language 

Contributors: Many, including Douglas Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, Luke Tierney, and Alexandros Karatzoglou
For questions about R:  R help mailing list
Home page: http://www.r-project.org/
Main Features: 
R is a free, open-source GNU implementation of the S programming language, released under the GPL. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, support vector machines, neural networks, ensembles, etc.).

3.10 ENTOOL

Authors: C. Merkwirth and J.D. Wichard
For questions about ENTOOL:  entool@web.de
Home page: http://chopin.zet.agh.edu.pl/~wichtel/

ENTOOL is a software package for ensemble regression modelling.

It is implemented mainly in Matlab, with some time-critical parts written in C/C++ (as mex-functions).

Objectives

  • Extending the ensemble learning approach to several types of models
  • Object-oriented implementation, which makes it transparent to mix different models and lets users add their own model classes

Methods

The toolbox comes with several model classes that can be used out of the box:

  • Radial basis functions (RBF)
  • Linear regression
  • Polynomial regression
  • K-nearest-neighbour models with adaptive metric
  • Multilayer perceptron (MLP)
  • Adaptive Regression Splines (ARES)