1 Introduction
One of the main goals of the school is to offer practical and theoretical tools for developing new ensemble methods, and to stimulate an informal discussion among the participants about problems involving the design, implementation and application of ensemble methods to real-world problems.
During the week of the school, the Ensemble-lab will make available to the participants some personal computers with data bases and software packages, including ensemble methods, already installed, so that the different ensemble methods can be tested and compared on a uniform set of data bases. Moreover, a wireless 11 Mbit/s 802.11b connection will allow participants who arrive with their laptops and the appropriate wireless communication card to stay connected while at the meeting area.
In particular, we encourage the participants to apply their methods to the data sets listed below, to present demos of their own ensemble software using the PCs available at IIASS, and to give a brief presentation at the end of the conference. The contributions can include existing or novel ensemble methods, as well as benchmark methods and other real data sets.
On this web page we plan to collect all the software that will be made available on the PCs at the school. All participants in the school (and all other researchers in the field) are invited to contribute to this page by sending pointers to their own data bases/programs, or to software of particular interest already available on the web, to Francesco Masulli (masulli@disi.unige.it).
Participants contributing data sets will give a brief explanation of their data set at the beginning of the workshop, so that the other participants can get a feeling for the type of information and the type of noise that might be in the data.
In any case, we suggest that the participants start working on these data bases and programs before the school.
2 Real-World Data Bases
DELVE (Data for Evaluating Learning in Valid Experiments), developed at the University of Toronto (see the Delve Development Group), contains:
- A software environment, which allows you to manipulate datasets and do statistical analysis of learning-method performance.
- A number of datasets for regression and classification.
- Learning methods.
Most data sets have performance results (see the Datasets Summary Table).
We suggest studying some of the large data sets in DELVE. Some of them are difficult, such as those on the dynamics of the robot arm. Moreover, we suggest applying the software environment to test learning-method performances.
This is a gene expression data set. Data sets of this kind are often characterized by low cardinality and high dimensionality. It is maintained, among others concerning gene expression, at the Stanford Microarray Database. For details, see the paper by Alizadeh et al., Nature 403: 503-511 (2000).
Some preprocessed data can be downloaded by anonymous ftp from:
ftp://ftp.disi.unige.it/person/ValentiniG/Data/lymphoma/
In that directory, the file README explains how to use the data.
Other gene expression data sets are maintained at the M.I.T. Whitehead Institute Center for Genome Research.
The address of the Leukemia data set is:
http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html
A paper on it is "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring" by T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, available at http://www-genome.wi.mit.edu/mpr/pubs.html
Stefano Rovetta has organized the data base into a training set and a test set that can be used for testing your ensemble methods. These data files are in the following format:
- first number: number of patterns
- second number: dimensionality of the patterns
- subsequent numbers: a matrix with patterns in rows, pattern components (variables) in columns, and target labels (+1/-1) in the last column
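A minimal Python sketch of a reader for this format (our own illustration; the function name and file path are hypothetical):

```python
# Minimal reader for the format described above: the first number is
# the pattern count, the second the dimensionality, and the rest is a
# matrix whose last column holds the +1/-1 target labels.
def read_patterns(path):
    with open(path) as f:
        tokens = f.read().split()
    n, d = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:2 + n * (d + 1)]]
    patterns, labels = [], []
    for i in range(n):
        row = values[i * (d + 1):(i + 1) * (d + 1)]
        patterns.append(row[:d])       # pattern components (variables)
        labels.append(int(row[d]))     # target label in the last column
    return patterns, labels
```

The reader only relies on the two counts in the header, so the whitespace layout between the numbers (spaces or newlines) is irrelevant.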
2.5 Letter Recognition database (suggested by Tom Dietterich)
Available at the repository of the University of California at Irvine:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/letter-recognition/
Ensembles give really remarkable improvements on this data set.
2.6 Deterding Vowel Recognition Data (suggested by Shimon Cohen)
Data set from Carnegie Mellon University.
It is available at the repository of the University of California at
Irvine:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/connectionist-bench/vowel/
2.7 Coffee analysis data from the Pico Electronic Nose (donated by Matteo Pardo)
Data set from the University of Brescia (Italy). It is available at:
http://tflab.ing.unibs.it/staff/pardo/dataset.html
2.8 Data Base on Remote Sensing, suggested by Palma Blonda (to come)
2.9 Data Base on GIS, donated by Cesare Furlanello (to come)
3 Software packages for ensemble methods
3.1 PRtools (suggested by L. Kuncheva)
Author: Robert (Bob) Duin
Home Page: http://www.ph.tn.tudelft.nl/prtools/
Language: Matlab
Main Features:
- Datasets and Mappings
- Data Generation
- Linear and Higher Degree Polynomial Classifiers
- Nonlinear Classification
- Normal Density Based Classification
- Feature Selection
- Classifiers and Tests (general)
- Mappings
- Combining classification rules
- Clustering and Distances
3.2 Weka 3
Contact: wekasupport@cs.waikato.ac.nz
Contributors: Eibe Frank, Mark Hall, Len Trigg, Richard Kirkby, Gabi Schmidberger, Malcolm Ware, Xin Xu, Remco Bouckaert, Yong Wang, Stuart Inglis, Ian H. Witten
Home page: http://www.cs.waikato.ac.nz/~ml/weka/
Language: Java
Main Features:
- Implemented schemes for classification include:
  - decision tree inducers
  - rule learners
  - naive Bayes
  - decision tables
  - locally weighted regression
  - support vector machines
  - instance-based learners
  - logistic regression
  - voted perceptrons
  - multi-layer perceptron
- Implemented schemes for numeric prediction include:
  - linear regression
  - model tree generators
  - locally weighted regression
  - instance-based learners
  - decision tables
  - multi-layer perceptron
- Implemented "meta-schemes" include:
  - bagging
  - stacking
  - boosting
  - regression via classification
  - classification via regression
  - cost-sensitive classification
3.3 Torch
Author: Ronan Collobert
Collaborators: Samy Bengio and Johnny Mariethoz
Home page: http://www.torch.ch/
Language: C++
Main Features:
- A lot of things in gradient machines, that is, machines that can be trained with gradient descent. This includes Multi-Layered Perceptrons, Radial Basis Functions and Mixtures of Experts. In fact, there are a lot of small "modules" available (Linear module, Tanh module, SoftMax module...) that you can plug together as you want to get what you want.
- Support Vector Machines, for classification and regression.
- A Distribution package, which includes K-means, Gaussian Mixture Models, Hidden Markov Models and a Bayes Classifier. Moreover, classes for speech recognition with embedded training are available in this package.
- Ensemble models such as Bagging and AdaBoost.
- A few non-parametric models such as K-nearest-neighbours, Parzen Regression and the Parzen Density Estimator.
3.4 Random Forests Software
Author: Leo Breiman
Home page: http://www.stat.Berkeley.EDU/users/breiman/
Language: Fortran 77
Main Features:
Random forests are a combination of tree predictors such that each
tree depends on the values of a random vector sampled independently and
with the same distribution for all trees in the forest.
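As an illustration of that idea, here is a rough pure-Python sketch (our own, not Breiman's Fortran code): each tree, reduced to a one-level decision stump for brevity, is grown on a bootstrap sample and restricted to a random subset of the features, and the forest predicts by majority vote.

```python
import random
from collections import Counter

def train_stump(X, y, features):
    # exhaustively pick the feature/threshold/sign with fewest errors
    best = None
    for f in features:
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                errors = sum((sign if x[f] >= t else -sign) != yi
                             for x, yi in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return lambda x: sign if x[f] >= t else -sign

def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    # majority vote over all trees in the forest
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]
```

In Breiman's actual software the base learners are full decision trees and the random feature subset is redrawn at every split, so this sketch only mirrors the overall structure of the method.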
3.5 NEURObjects
Author: Giorgio Valentini
Collaborator: Francesco Masulli
Home page: http://www.disi.unige.it/person/ValentiniG/NEURObjects/
Language: C++
Main Features:
- I/O and data set pre-processing classes.
- Classes for automatic data set generation.
- Classes for neural network training and testing.
- Classes implementing learning algorithms.
- Classes implementing ECOC ensembles of learning machines.
- Classes for statistical evaluation of neural network performances.
3.6 BoosTexter
Authors: Erin Allwein, Robert Schapire, and Yoram Singer.
Home page: http://www.research.att.com/~schapire/BoosTexter/
Main Features:
BoosTexter is a general-purpose machine-learning program, based on boosting, for building a classifier from text and/or attribute-value data.
3.7 Other implementations of Boosting
See http://www.boosting.org/
3.8 ASNN Associative Neural Network
Author: Igor Tetko
Home page: http://www.vcclab.org/lab/asnn
Language: code on-line; standalone version available on request
Main Features:
ASNN represents a combination of an ensemble of
feed-forward
neural networks and the k-nearest neighbour technique.
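The following rough Python sketch (our own illustration, not Tetko's code) conveys the idea under simplifying assumptions: bootstrap least-squares line fits stand in for the feed-forward networks, and the plain ensemble average is corrected by the mean residual of the k nearest neighbours, with nearness measured in the space of the ensemble's outputs.

```python
import random

def fit_line(xs, ys):
    # ordinary least-squares line; a trivial stand-in for a trained net
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:                      # degenerate resample: constant model
        return lambda x: my
    b = sum((x - mx) * (yv - my) for x, yv in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def train_asnn(xs, ys, n_models=10, seed=0):
    # an ensemble of models, each fitted on a bootstrap resample
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def asnn_predict(models, xs, ys, x, k=3):
    ens = lambda q: [m(q) for m in models]     # vector of ensemble outputs
    avg = lambda outs: sum(outs) / len(outs)   # plain ensemble average
    out_x = ens(x)
    # squared distance measured in the space of ensemble outputs
    dist = lambda q: sum((a - b) ** 2 for a, b in zip(ens(q), out_x))
    neighbours = sorted(range(len(xs)), key=lambda i: dist(xs[i]))[:k]
    # correct the average by the mean residual of the k nearest neighbours
    correction = sum(ys[i] - avg(ens(xs[i])) for i in neighbours) / k
    return avg(out_x) + correction
```

The local residual correction is what distinguishes ASNN from a plain averaged ensemble: a systematic bias of the ensemble near the query point is estimated from its neighbours and subtracted out.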
3.9 "R" Statistical Computing Programming Language
Contributors: many, including Douglas Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, Luke Tierney, and Alexandros Karatzoglou
For questions about R: the R-help mailing list
Home page: http://www.r-project.org/
Main Features:
It is a free, open-source GNU implementation (under the GPL license) of the S programming language, and provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time-series analysis, classification, clustering, support vector machines, neural networks, ensembles, etc.).
3.10 ENTOOL
Authors: C. Merkwirt and J.D. Wichard
For questions about ENTOOL: entool@web.de
Home page: http://chopin.zet.agh.edu.pl/~wichtel/
ENTOOL is a software package for ensemble regression modelling. It is implemented mainly in Matlab, with some time-critical parts written in C/C++ (as mex-functions).
Objectives:
- Extending the ensemble learning approach to several types of models
- An object-oriented implementation that yields a transparent mixture of different models and allows users to add their own model classes
Methods:
The toolbox is equipped with several model classes for out-of-the-box usage:
- Radial basis functions (RBF)
- Linear regression
- Polynomial regression
- K-nearest-neighbour models with adaptive metric
- Multilayer perceptron (MLP)
- Adaptive Regression Splines (ARES)