RECreadme

README for the MATLAB package for Regression Error Characteristic (REC) Curves
Jinbo Bi (bij2@rpi.edu) 6/4/2003
Kristin Bennett (bennek@rpi.edu)

This package requires MATLAB and the MATLAB Statistics Toolbox to run
the included functions.
The package contains 4 MATLAB .m files: CDF.m, rec_curve.m, syndata.m
and sample_curves.m (which defines two functions, sample_curves and
draw_syn_curves).
The functions relate to one another as follows:
1) function CDF is the lowest-level function; function rec_curve calls
it to estimate the cumulative distribution function.
2) function syndata is a stand-alone function that generates synthetic
data with additive noise.
3) function rec_curve is the core function that plots an REC curve.
4) function draw_syn_curves calls function rec_curve to plot several
REC curves.
5) function sample_curves calls function syndata to generate Gaussian
noise data and Laplacian noise data, and then calls function
draw_syn_curves to draw REC curves for 4 different models on both data
sets.  See our paper for a description of the 4 models.

The help information is available within MATLAB through the help
command. For example, typing "help rec_curve" displays an explanation
of the function together with its inputs and outputs, which is a quick
way to look up a function's input arguments.

Four of the functions are described in their help headers as follows:
1) 
%The function [x_sort,area_over,area_under]=CDF(x) is used to 
%estimate the cumulative distribution function of the random 
%variable x and compute the areas under and over the CDF curve.
%Inputs: x -- a vector of real numbers as a sample of random
%        variable x.
%Outputs: x_sort -- a matrix of two columns: the first column
%        is the original x sorted in ascending order and the second
%        column is the probability of x.
%        area_over -- the area over the CDF curve, a real number.
%        area_under -- the area under the CDF curve.
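
As an illustration of the outputs described in the help header above,
the following is a minimal sketch of an empirical CDF estimate in plain
MATLAB. It is not the code in CDF.m: the sample values are made up, the
probability column uses the k/n estimate, and measuring both areas over
the interval [min(x), max(x)] is an assumption, since this README does
not state how the package defines them.

```matlab
% Sketch only: an empirical CDF estimate, not the package's CDF.m.
x = [2; 7; 3; 5];                 % hypothetical sample of a random variable

n  = numel(x);
xs = sort(x(:));                  % sample sorted in ascending order
p  = (1:n)' / n;                  % estimated P(X <= xs(k)) = k/n
x_sort = [xs, p];                 % two columns, as in the help header

% Assumption: the CDF is treated as a step function equal to p(k) on
% [xs(k), xs(k+1)), and both areas are measured over [xs(1), xs(end)].
widths     = diff(xs);                        % lengths of the steps
area_under = sum(p(1:end-1) .* widths);       % area below the steps
area_over  = (xs(end) - xs(1)) - area_under;  % remainder of the unit-height box
```

With the packaged function, the equivalent call would be
[x_sort,area_over,area_under] = CDF(x), whose results may differ from
this sketch if CDF.m uses a different estimator or integration range.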

2)
%The function AOC = rec_curve(error_metric,y,yhat,lineSpec) is used
%to draw an REC curve based on the residual y-yhat information and
%return the area over the REC curve. Note this REC plot is scaled 
%by the mean model, i.e., the mean of the actual response.
%Inputs: error_metric -- the type of the error metric, if it is 
%        'AD', the REC curve is based on absolute deviation; if 
%        it is 'SE', the REC curve is based on squared residual.
%        y -- the actual values of response.
%        yhat -- the predicted values of response.
%        lineSpec -- the line specification of the REC curve; for
%        example, 'r-' draws the REC curve as a solid red line.
%        See the MATLAB line specification syntax for details.
%Outputs: AOC -- the area over the REC curve.
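
To make the quantities in the help header above concrete, here is a
minimal sketch of how the points of an REC curve and the area over it
can be computed from residuals. It is an illustration under stated
assumptions, not the code in rec_curve.m: the data values are made up,
the accuracy at a tolerance is taken to be the fraction of examples
whose error does not exceed that tolerance, and the scaling by the mean
model mentioned in the help header is not reproduced.

```matlab
% Sketch only: REC curve points and area over the curve, not rec_curve.m.
y    = [3.0; 5.0; 2.5; 7.0; 4.0];   % hypothetical actual responses
yhat = [2.5; 5.5; 4.0; 6.0; 4.0];   % hypothetical predicted responses
error_metric = 'AD';                % 'AD' or 'SE', as in the help header

r = y - yhat;                       % residuals
switch error_metric
    case 'AD'
        e = abs(r);                 % absolute deviation
    case 'SE'
        e = r .^ 2;                 % squared residual
    otherwise
        error('Unknown error metric: %s', error_metric);
end

m   = numel(e);
es  = sort(e(:));                   % error tolerances at which accuracy jumps
acc = (1:m)' / m;                   % fraction of examples with error <= es(k)

% Step curve starting at tolerance 0 with accuracy 0.
tol      = [0; es];
acc_step = [0; acc];

% Assumption: the area over the curve is measured on [0, max(e)] with
% no mean-model scaling. For this step curve it equals mean(e).
AOC = sum((1 - acc_step(1:end-1)) .* diff(tol));
```

The curve itself could then be drawn with stairs(tol, acc_step, 'r-').
With the packaged function, the corresponding call would be
AOC = rec_curve('AD', y, yhat, 'r-'), which may return a different
value because of the mean-model scaling.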

3)
%The function [x,yn,y]=syndata(noise_type,A,B,C,n) is used to
%generate synthetic data with additive noise. The independent x
%is randomly generated in a 20-dimensional space from a uniform
%distribution on [A,B]. The dependent variable y is generated
%using the function y = sum_i C*x_i + r where C is a constant, i 
%runs from 1 to 10, and hence the last 10 independent variables 
%are noise. The variable r is the additive noise which can follow 
%the Gaussian, uniform, Laplacian, Gamma or Weibull distribution 
%depending on the choice of the 'noise_type'. 
%Inputs: noise_type -- the type of distributions that the additive
%         noise follows. If it is 'Gaussian', then Gaussian random
%         variable r will be generated. Similarly, other choices
%         include 'uniform', 'Laplacian', 'Gamma' and 'Weibull'.
%        A -- the left end of the interval for the uniform dist of x.
%        B -- the right end of the interval for the uniform dist of x.
%        C -- the constant coefficient used in the model to generate y.
%        n -- the number of the synthetic data examples.
%Outputs: x -- the n samples of the 20 independent variables.
%         yn -- the response y generated using the above function on x.
%         y -- the raw response y generated using y = sum_i C*x_i
%              without additive noise.
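
The generating model in the help header above can be sketched in a few
lines of base MATLAB. This is an illustration, not the code in
syndata.m: the parameter values are made up, and the noise parameters
(unit-variance Gaussian, and a Laplacian scale b of 1) are assumptions,
since this README does not state which parameters the package uses.

```matlab
% Sketch only: synthetic data with additive noise, not the package's syndata.m.
A = 0; B = 1; C = 2; n = 200;       % hypothetical parameter values
d = 20;                             % number of independent variables
k = 10;                             % only the first k variables affect y

x = A + (B - A) * rand(n, d);       % each entry uniform on [A, B]
y = C * sum(x(:, 1:k), 2);          % noise-free response, y = sum_i C*x_i

% Assumption: Gaussian noise with zero mean and unit variance.
r_gauss = randn(n, 1);
yn      = y + r_gauss;              % noisy response

% Assumption: Laplacian noise with zero mean and scale b, drawn by
% inverting the Laplace CDF on a uniform sample.
b      = 1;
u      = rand(n, 1) - 0.5;          % uniform on (-0.5, 0.5)
r_lap  = -b * sign(u) .* log(1 - 2 * abs(u));
yn_lap = y + r_lap;
```

With the packaged function, the corresponding call would be
[x,yn,y] = syndata('Gaussian',A,B,C,n), or 'Laplacian' in place of
'Gaussian' for the second noise type.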

4)
%The function sample_curves is used to generate sample REC curves
%based on the absolute deviation and squared error on synthetic
%data with Gaussian noise and Laplacian noise.
%This function will generate 3 figures: the first one shows info
%about the response in Gaussian noise data; the second one shows 
%info about the response in Laplacian noise data; the third one 
%shows 4 REC curves based on the absolute deviation and squared 
%error for Gaussian noise data (upper two) and Laplacian noise 
%data (lower two).

Consult our paper "Regression Error Characteristic Curves" for a more
complete description of the REC curve plot and our experiments on
synthetic data.  This package provides preliminary results on REC
curve analysis. Contact either Dr. Kristin Bennett (bennek@rpi.edu)
or Jinbo Bi (bij2@rpi.edu) about ongoing progress.