Lab 1 - BINARY CLASSIFICATION AND MODEL SELECTION
This lab is about binary classification and model selection on synthetic and real data. The goal of the lab is to get familiar with the learning algorithms and to get a practical grasp of what we have discussed in class. Follow the instructions below. Think hard before you call the instructors!
Download the file releaseLab1.zip. It includes all the code you need!
Overture: Warm up
Run the file gui_filter.m and a GUI will start. Have a look at the various components.
With the data simulation option, generate a dataset of type "linear" [press "load data" to generate].
Observe the generated data [the button "plot training/plot test" lets you toggle between the training and test sets].
Choose the "regularized least squares" filter and the "linear" kernel.
Have a look at the parameter selection part and the various options of KCV; to choose the regularization parameter "t" you can either use KCV or set a fixed value.
Press "run" to perform training and classification; observe the plot of the KCV error and the balance between training and test errors. Also have a look at the plot area on the left, where a separating function has appeared [again, the button "plot training/plot test" lets you switch between the two sets].
Interlude: The Geek Part
Back in the MATLAB shell, have a look at the contents of the directory "spectral_reg_toolbox". There you will find, among others, the code for the commands "learn" (used for training), "patt_rec" (used for testing), and "kcv" (used for model selection on the training set).
For more information about the parameters and the usage of those scripts, type:
help learn
help patt_rec
help kcv
Finally, you may want to have a look at the contents of the directory "dataset_scripts", and in particular at the file "create_dataset", which will allow you to generate synthetic data of different types.
Allegro con brio: Analysis
Carry out the following experiments either using the GUI, when it is possible, or writing appropriate scripts.
(1) Generate data of "Linear" type.
(2) Considering linear-RLS, observe how the training and test errors change as
the regularization parameter changes (increases or decreases)
the training set size grows (try various choices of n in [10:....] for as long as MATLAB can handle it!)
the amount of noise on the generated data grows
(run training and test for various choices of the suggested parameters)
(3) Leaving all the other parameters fixed, choose an appropriate range [lambda_min:lambda_step:lambda_max] and plot the training error and the test error for each lambda. Then use the KCV option to select the optimal lambda and see how it relates to the previous plot.
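The sweep in (3) can be sketched roughly as follows. This is a minimal illustration written from scratch, not the toolbox code: the data generator, the lambda*n scaling of the regularizer, the log-spaced grid and all variable names are assumptions made for the example, and the toolbox commands may use different conventions.

```matlab
% Sketch only: linear RLS and K-fold CV written from scratch (NOT the toolbox code).
% Data generator, sizes, lambda grid and the lambda*n scaling are illustrative assumptions.
rng(0);                                             % reproducible random numbers
n = 100; nTest = 1000; d = 2; noise = 0.5; k = 5;
wTrue = [1; -1];                                    % hypothetical true direction
Xtr = randn(n, d);     ytr = sign(Xtr * wTrue + noise * randn(n, 1));
Xte = randn(nTest, d); yte = sign(Xte * wTrue + noise * randn(nTest, 1));

% Closed-form RLS weights and 0-1 classification error as anonymous helpers
rls = @(X, y, lam) (X' * X + lam * size(X, 1) * eye(size(X, 2))) \ (X' * y);
err = @(X, y, w) mean(sign(X * w) ~= y);

lambdas = logspace(-5, 2, 30);                      % candidate regularization values
fold = mod(randperm(n)', k) + 1;                    % random fold label in 1..k per point
errTr = zeros(size(lambdas)); errTe = errTr; errCv = errTr;

for i = 1:numel(lambdas)
    w = rls(Xtr, ytr, lambdas(i));                  % fit on the full training set
    errTr(i) = err(Xtr, ytr, w);
    errTe(i) = err(Xte, yte, w);
    for f = 1:k
        val = (fold == f);                          % held-out fold
        wf = rls(Xtr(~val, :), ytr(~val), lambdas(i));
        errCv(i) = errCv(i) + err(Xtr(val, :), ytr(val), wf) / k;
    end
end

[~, best] = min(errCv);                             % lambda with lowest average CV error
fprintf('lambda selected by %d-fold CV: %g\n', k, lambdas(best));
semilogx(lambdas, errTr, 'b-o', lambdas, errTe, 'r-s', lambdas, errCv, 'k-^');
xlabel('lambda'); ylabel('classification error');
legend('training', 'test', 'K-fold CV');
```

A log-spaced grid is used here instead of a fixed step because useful values of lambda often span several orders of magnitude; a linear range lambda_min:lambda_step:lambda_max can be substituted directly.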
(4) Leaving all the other parameters fixed, choose an appropriate range [n_min:n_step:n_max] and plot the training and test errors (what do you observe as n goes to infinity?)
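One possible way to script the experiment in (4) is sketched below. It is again a from-scratch linear RLS rather than the toolbox implementation; the range of n, the noise level, the fixed lambda and the data generator are all assumptions chosen for illustration.

```matlab
% Sketch only: linear RLS from scratch; every value below is an illustrative assumption.
rng(0);                                             % reproducible random numbers
d = 2; noise = 0.5; lambda = 1e-3; nTest = 2000;
wTrue = [1; -1];                                    % hypothetical true direction
Xte = randn(nTest, d);
yte = sign(Xte * wTrue + noise * randn(nTest, 1));  % one fixed test set for all n

ns = 10:10:500;                                     % plays the role of n_min:n_step:n_max
errTr = zeros(size(ns));
errTe = zeros(size(ns));
for i = 1:numel(ns)
    n = ns(i);
    Xtr = randn(n, d);                              % fresh training set of size n
    ytr = sign(Xtr * wTrue + noise * randn(n, 1));
    w = (Xtr' * Xtr + lambda * n * eye(d)) \ (Xtr' * ytr);
    errTr(i) = mean(sign(Xtr * w) ~= ytr);          % fraction of misclassified points
    errTe(i) = mean(sign(Xte * w) ~= yte);
end

plot(ns, errTr, 'b-o', ns, errTe, 'r-s');
xlabel('training set size n'); ylabel('classification error');
legend('training', 'test');
```

Because a new training set is drawn for every n, the curves are noisy; averaging over several random draws per value of n would give smoother plots.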
Crescendo: Advanced Analysis
(5) Consider Gaussian-RLS and perform parameter tuning -- this time together with lambda you'll have to choose an appropriate sigma
try a few values of sigma and lambda and compare the resulting training_error and test_error
fix lambda and observe the effect of changing sigma
fix sigma and observe the effect of changing lambda
do you notice (and if so, when) any overfitting/oversmoothing effect?
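For scripted experiments, the "fix lambda, vary sigma" step of (5) might look like the sketch below. It implements Gaussian-kernel RLS from scratch and does not call the toolbox; the XOR-like dataset, the kernel parametrization exp(-||x-z||^2/(2*sigma^2)), the lambda*n scaling and the sigma values are assumptions, and the toolbox may define its Gaussian kernel differently.

```matlab
% Sketch only: Gaussian-kernel RLS from scratch (NOT the toolbox code).
% Dataset, kernel parametrization and parameter values are illustrative assumptions.
rng(0);                                             % reproducible random numbers
n = 200; nTest = 1000; lambda = 1e-3;
Xtr = 2 * rand(n, 2) - 1;      ytr = sign(Xtr(:, 1) .* Xtr(:, 2));   % XOR-like labels
Xte = 2 * rand(nTest, 2) - 1;  yte = sign(Xte(:, 1) .* Xte(:, 2));

% Pairwise squared distances, then Gaussian kernel exp(-||x - z||^2 / (2*sigma^2))
sqd  = @(A, B) bsxfun(@plus, sum(A.^2, 2), sum(B.^2, 2)') - 2 * (A * B');
gker = @(A, B, sigma) exp(-sqd(A, B) / (2 * sigma^2));

sigmas = [0.01 0.05 0.2 1 5];
for s = sigmas
    K = gker(Xtr, Xtr, s);                          % n-by-n training kernel matrix
    c = (K + lambda * n * eye(n)) \ ytr;            % kernel RLS coefficients
    errTr = mean(sign(K * c) ~= ytr);               % training error
    errTe = mean(sign(gker(Xte, Xtr, s) * c) ~= yte);  % test error
    fprintf('sigma = %5.2f   train err = %.3f   test err = %.3f\n', s, errTr, errTe);
end
```

Swapping the roles of the two parameters (loop over lambda with sigma fixed) gives the other half of the comparison; a large gap between training and test error points to overfitting, while both errors being high points to oversmoothing.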
(6) Consider Polynomial-RLS and perform parameter tuning as in (5).
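A comparable sketch for the polynomial kernel is given below. The kernel form (x'z + 1)^p, the degrees tried, the fixed lambda and the XOR-like data are assumptions for illustration; the toolbox's polynomial kernel may be parametrized differently. The labels here are the sign of a degree-2 polynomial, so a degree-1 kernel cannot be expected to separate them.

```matlab
% Sketch only: polynomial-kernel RLS from scratch (NOT the toolbox code).
% Kernel form (x'z + 1)^p, dataset and parameter values are illustrative assumptions.
rng(0);                                             % reproducible random numbers
n = 200; nTest = 1000; lambda = 1e-3;
Xtr = 2 * rand(n, 2) - 1;      ytr = sign(Xtr(:, 1) .* Xtr(:, 2));   % XOR-like labels
Xte = 2 * rand(nTest, 2) - 1;  yte = sign(Xte(:, 1) .* Xte(:, 2));

pker = @(A, B, p) (A * B' + 1) .^ p;                % inhomogeneous polynomial kernel

degrees = [1 2 3 5 8];
for p = degrees
    K = pker(Xtr, Xtr, p);                          % n-by-n training kernel matrix
    c = (K + lambda * n * eye(n)) \ ytr;            % kernel RLS coefficients
    errTr = mean(sign(K * c) ~= ytr);               % training error
    errTe = mean(sign(pker(Xte, Xtr, p) * c) ~= yte);  % test error
    fprintf('degree = %d   train err = %.3f   test err = %.3f\n', p, errTr, errTe);
end
```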
(7) How does the kernel choice affect the learning behaviour of the algorithm? In particular, compare the performance of the polynomial and Gaussian kernels on the spiral and moons datasets with respect to the number of examples in the training set (e.g. [10, 20, 50, 100, 1000]) and the amount of regularization ("fixed value" in the GUI, e.g. [0.5, 0.1, 0.01, 0.001, 0.0001]).
Finale: Challenge
The challenge consists of a learning task on a real dataset, namely "USPS", which contains a number of images of handwritten digits. The problem is to train "the best classifier" able to discriminate between the digits "1" and "7".
Have a look at the script "demo_lab1.m". It contains a code snippet that performs a simple binary classification task by means of the previously presented scripts.
You should understand what the scripts are supposed to do, and train the classifiers in order to perform a binary classification task for the digits "1", "7".
Once the classifiers are trained, the model must be exported in a matrix file by means of the "save_challenge_1.m" script (to see how to use it please try the command 'help save_challenge_1').
By the end of the challenge session you should submit the result of your script using the link http://www.dropitto.me/regmet with password regmet2013. The result file is a MATLAB matrix file named name-surname.mat. The results will be presented during the next class. The score of the challenge is based on the accuracy of the classifier on a completely independently sampled test set.