README for the MATLAB package of Regression Error Characteristic Curves

Jinbo Bi (bij2@rpi.edu)
Kristin Bennett (bennek@rpi.edu)
6/4/2003

This package requires MATLAB and the MATLAB Statistics Toolbox to run
the included functions.

The package includes 4 MATLAB .m files: CDF.m, rec_curve.m, syndata.m
and sample_curves.m (which contains two functions, sample_curves and
draw_syn_curves). The functions relate to each other as follows:

1) function CDF is the basic function, called by function rec_curve
   to estimate the cumulative distribution function.
2) function syndata is a stand-alone function for generating synthetic
   data with additive noise.
3) function rec_curve is the core program for plotting an REC curve.
4) function draw_syn_curves calls function rec_curve to plot various
   REC curves.
5) function sample_curves calls function syndata to generate Gaussian
   noise data and Laplacian noise data, and calls function
   draw_syn_curves to draw REC curves for 4 different models on both
   data sets. See our paper for a description of the 4 different
   models.

Help is available inside MATLAB through the help command. For example,
typing "help rec_curve" displays an explanation of the function and of
its inputs and outputs. This is useful when you forget the input
arguments of a function.

The help headers of the four main functions read as follows:

1)
%The function [x_sort,area_over,area_under]=CDF(x) is used to
%estimate the cumulative distribution function of the random
%variable x and compute the areas under and over the CDF curve.
%Inputs:  x -- a vector of real numbers as a sample of random
%              variable x.
%Outputs: x_sort -- a matrix of two columns: the first column
%              is the original x sorted in ascending order and the
%              second column is the probability of x.
%         area_over -- the area over the CDF curve, a real number.
%         area_under -- the area under the CDF curve.

2)
%The function AOC = rec_curve(error_metric,y,yhat,lineSpec) is used
%to draw an REC curve based on the residual y-yhat information and
%return the area over the REC curve. Note this REC plot is scaled
%by the mean model, i.e., the mean of the actual response.
%Inputs:  error_metric -- the type of the error metric: if it is
%              'AD', the REC curve is based on absolute deviation; if
%              it is 'SE', the REC curve is based on squared residual.
%         y -- the actual values of the response.
%         yhat -- the predicted values of the response.
%         lineSpec -- the line specification of the REC curve; for
%              example, if it is 'r-', the REC curve will be a solid
%              red line. Please see the MATLAB line specification
%              syntax for details.
%Outputs: AOC -- the area over the REC curve.

3)
%The function [x,yn,y]=syndata(noise_type,A,B,C,n) is used to
%generate synthetic data with additive noise. The independent x
%is randomly generated in a 20-dimensional space from a uniform
%distribution on [A,B]. The dependent variable y is generated
%using the function y = sum_i C*x_i + r where C is a constant, i
%runs from 1 to 10, and hence the last 10 independent variables
%are noise. The variable r is the additive noise which can follow
%the Gaussian, uniform, Laplacian, Gamma or Weibull distribution
%depending on the choice of the 'noise_type'.
%Inputs:  noise_type -- the type of distribution that the additive
%              noise follows. If it is 'Gaussian', then a Gaussian
%              random variable r will be generated. Similarly, other
%              choices include 'uniform', 'Laplacian', 'Gamma' and
%              'Weibull'.
%         A -- the left end of the interval for the uniform dist of x.
%         B -- the right end of the interval for the uniform dist of x.
%         C -- the constant coefficient used in the model to generate y.
%         n -- the number of synthetic data examples.
%Outputs: x -- the n samples of the 20 independent variables.
%         yn -- the response y generated using the above function on x.
%         y -- the raw response y generated using y = sum_i C*x_i
%              without additive noise.

4)
%The function sample_curves is used to generate sample REC curves
%based on the absolute deviation and squared error on synthetic
%data with Gaussian noise and Laplacian noise.
%This function will generate 3 figures: the first one shows info
%about the response in Gaussian noise data; the second one shows
%info about the response in Laplacian noise data; the third one
%shows 4 REC curves based on the absolute deviation and squared
%error for Gaussian noise data (upper two) and Laplacian noise
%data (lower two).

Consult our paper "Regression Error Characteristic Curves" for a more
complete description of the REC curve plot and our experiments on
synthetic data. This package provides preliminary results concerning
REC curve analysis. Contact either Dr. Kristin Bennett (bennek@rpi.edu)
or Jinbo Bi (bij2@rpi.edu) for ongoing progress.
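
Example session (a sketch, not part of the package). The calls below
use only the signatures given in the help headers above. Everything
else is an assumption made for illustration: the values of A, B, C and
n, the least-squares linear model, the layout of x as an n-by-20
matrix, and the use of "figure" to keep the two plots apart. Check
"help syndata" and "help rec_curve" before relying on it.

```matlab
% Assumed values; the README does not prescribe any of them.
A = -1; B = 1;   % interval of the uniform distribution of x
C = 1;           % coefficient of the 10 relevant variables
n = 200;         % number of synthetic examples

% Generate data with additive Gaussian noise (signature from the help header).
[x, yn, y] = syndata('Gaussian', A, B, C, n);

% Assumption: x should be n-by-20; transpose if it comes back 20-by-n.
if size(x, 1) ~= n
    x = x';
end
yn = yn(:);      % force a column vector

% Illustrative model: ordinary least squares with an intercept.
X = [ones(n, 1) x];
beta = X \ yn;
yhat = X * beta;

% REC curve on absolute deviation, drawn as a solid red line.
figure;
AOC_ad = rec_curve('AD', yn, yhat, 'r-');

% REC curve on squared error, drawn as a dashed blue line.
figure;
AOC_se = rec_curve('SE', yn, yhat, 'b--');

fprintf('AOC (AD) = %g, AOC (SE) = %g\n', AOC_ad, AOC_se);
```

A smaller area over the curve (AOC) indicates a better model under the
chosen error metric, so the two returned values can be used to compare
models fitted to the same data.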