Adaptive Gaussian Filtering
Machine learning with Gaussian kernels.
Status: Beta
Brought to you by: peteysoft
New in Version 0.9.2:
- In the direct classification routines (classify_a, classify_knn), there is now
an option (-j) to print out joint probabilities instead of conditional
probabilities. Of course this could already be done by calculating the total
probability and multiplying it by the conditional probability, but that means
redundant calculation (the exact relationship is written out after this list).
- In class_borders, added the option (-r) to solve for a class border
other than at R=0 (see the note on R after this list). This is useful if your
classes are of significantly different sizes, especially when the training
data does not reflect this.
- There is now a simple clustering analysis program (cluster_knn) based on a
threshold density; the algorithm is sketched in code after this list. It works
by first finding a point at which the density is greater than the threshold.
Using the k nearest neighbours of this point, it recursively finds all other
points above the threshold and assigns them the same cluster number.
- There is now an option to use a metric other than Cartesian. Since many
of the calculations, especially the PDF estimation, are specifically based
on a Cartesian space, this should be applied with some caution.
- Added an option to use different names for the files containing
normalization data. It's a pretty minor point, so it's only been implemented
in two or three programs, chiefly the class_borders and classify_b modules.
I'm too lazy to do them all...
- Added an n-fold cross-validation program that works with all the classification
algorithms.
- Added a small utility that just normalizes the data and that's it. Also
cleaned up and properly renamed a utility (vecfile2lvq) that converts the
binary files to Kohonen's LVQ format.
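
A note on the joint probabilities printed by the -j option: they relate to
the conditional probabilities and the total density through the usual product
rule, written out here (in LaTeX) for reference:

    % joint probability of class c_j and feature vector x:
    P(c_j, \vec x) = P(c_j \mid \vec x) \, P(\vec x)

Printing them directly simply avoids estimating P(\vec x) in a second pass.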
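
On the -r option: in the borders method, R is (assuming the usual convention
for this method) the difference between the conditional probabilities of the
two classes, and the border is normally its zero level set:

    % the class border as a level set of the discriminant R:
    R(\vec x) = P(2 \mid \vec x) - P(1 \mid \vec x) = 0

Solving instead at R = r for some r != 0 shifts the border towards one class,
compensating for unequal class sizes.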
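
The threshold-density clustering in cluster_knn can be sketched as below.
This is a minimal illustration only: the density and knn callbacks are
hypothetical stand-ins for the real estimation routines, and the cluster is
grown iteratively rather than recursively.

    #include <functional>
    #include <stack>
    #include <vector>

    // density(i): estimated density at sample i
    // knn(i):     indices of the k nearest neighbours of sample i
    std::vector<int> cluster(int n, double thresh,
        std::function<double(int)> density,
        std::function<std::vector<int>(int)> knn) {
      std::vector<int> cls(n, -1);        // -1 = unassigned
      int ncls = 0;
      for (int i = 0; i < n; i++) {
        if (cls[i] >= 0 || density(i) <= thresh) continue;
        std::stack<int> todo;             // seed a new cluster at i...
        todo.push(i);
        cls[i] = ncls;
        while (!todo.empty()) {           // ...and grow it outward
          int j = todo.top(); todo.pop();
          for (int nb : knn(j)) {
            if (cls[nb] < 0 && density(nb) > thresh) {
              cls[nb] = ncls;             // same cluster number
              todo.push(nb);
            }
          }
        }
        ncls++;                           // next seed starts a new cluster
      }
      return cls;                         // sub-threshold points keep -1
    }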
New in Version 0.9.3:
- The libpetey library is no longer part of the libagf distribution
- The class borders codes can no longer generate duplicate samples.
There are two versions: one for large training datasets and one for small
ones. Once all combinations of pairs of training samples have been used up,
the codes will generate no more border samples (the pairwise sampling idea
is sketched after this list).
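
For background on why pairs can run out: each border sample is found from one
pair of training points on opposite sides of the border, by root-finding along
the line joining them. Assuming R is the difference in conditional
probabilities as above, the relationship is:

    % parameterize the line between opposite-class points x_1 and x_2,
    % then solve for the border crossing:
    \vec x(t) = (1 - t)\,\vec x_1 + t\,\vec x_2, \qquad
    R(\vec x(t^*)) = 0, \quad t^* \in [0, 1]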
New in Version 0.9.4:
- Most importantly, everything except the I/O routines has been templated.
This means you can do your work in single or double precision and you can
represent your classes as 8-bit integers (bytes), 16-bit integers,
32-bit integers, etc. -- whatever size you want.
- With the exception of those used in external routines, variable types
in the main routines are now controlled with global typedefs, with each
class of variable having a different type (see the sketch after this list).
This means you can tightly control the typing for optimal use of space or
CPU cycles. Class labels have a default type of 32-bit integers, while
floating-point operations are done in single precision by default.
- Different metrics are now only supported in the routines where they make
sense: KNN classification and KNN interpolation. The functions now
require a pointer to the desired metric (a sketch of what this looks like
also follows the list).
- The nfold routine now supports interpolation. Note that this is still not
well tested (if at all).
- The file conversion utilities as well as the test class routines have now
been integrated into the main distribution simply by linking the two
makefiles more closely, thus allowing easier testing and more user-friendly
files.
- A routine that performs AGF PDF estimation with an optimal error rate
is currently being tested but is not yet ready. We hope to include it
in a new release very shortly.
- Also in the next release: multi-class classification using the class-borders
method. Stay tuned!
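
The global typedef scheme works roughly as in this sketch. The particular
names used here (real_a, cls_ta) are illustrative assumptions; the
authoritative definitions live in the library's include files.

    #include <stdint.h>

    // one typedef per class of variable; changing these retypes the
    // whole library consistently:
    typedef float   real_a;   // floating point: single precision by default
    typedef int32_t cls_ta;   // class labels: 32-bit integers by default

    // since the main routines are templated, other instantiations are a
    // recompile away, e.g.:
    // typedef double  real_a;   // double precision
    // typedef int16_t cls_ta;   // 16-bit class labels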
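
The "pointer to the desired metric" for the KNN routines amounts to something
like the following. This is a hedged sketch; the actual types and signatures
in the library may differ.

    #include <cmath>

    // a metric maps two n-dimensional vectors to a scalar distance:
    typedef float (*metric_t)(float *x, float *y, int n);

    // the default Cartesian metric (squared Euclidean distance):
    float metric2(float *x, float *y, int n) {
      float d = 0;
      for (int i = 0; i < n; i++) d += (x[i] - y[i]) * (x[i] - y[i]);
      return d;
    }

    // a Manhattan metric that could be passed in its place:
    float manhattan(float *x, float *y, int n) {
      float d = 0;
      for (int i = 0; i < n; i++) d += std::fabs(x[i] - y[i]);
      return d;
    }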
New in Version 0.9.5:
Sadly, neither the multi-class classifier using the "borders" method nor
the optimal AGF routine has been perfected yet. However, there are quite
a few other good improvements to sweeten the mix...
- The routine for finding the k nearest neighbours has been changed from one
based on a binary tree to one based on a quicksort algorithm (see the
selection sketch after this list). Speed improvements are expected to be on
the order of 25%. To change back to the old version, use the KLEAST_FUNC
macro in the agf_defs.h include file.
- The routine for calculating the weights for the AGF algorithm now matches
the filter variance to the W parameter using the supernewton root-finding
algorithm instead of by squaring the initial weights; the condition being
solved is written out after this list. This means that there are now two
bounds for the filter variance, set by the -v and -V options for the lower
and upper bounds respectively. Since it is trivial to push the bounds
outward if they do not bracket the root, and since these changes are
"sticky", it does not matter if the high bound is too low or the low bound
too high. Rather, the user should try to avoid the opposite extreme, as this
will mean a larger number of iterations to reach the root. The default
bounds are [sigma^2/n^(2/D), sigma^2], where sigma^2 is the total variance
of the data.
- The new weight-calculating routine is more accurate and should be more
robust as well, although at the cost of a slight speed penalty. As with
the kleast subroutine, however, the old version can be reinstated by
changing the AGF_CALC_W_FUNC macro. The initial filter variance, since it is
an upper bound, is now set with the -V option instead of the -v option.
- For maximum control of the weight-calculating routine, several new options
have been added. To change the maximum number of iterations in the
supernewton root-finding algorithm, use the -I option. This changes it
both for the calculation of weights and for searching for the class borders.
To change it for one or the other, use -i for the weight-calculation routine
and -h for the class-borders routine. The default number of iterations for
both is 100, which may not be sufficient for some problems.
- To change the tolerance of W (the total of the weights), use the -l option.
The default is 0.005, which should be more than sufficient. Since the
accuracy of W is not that critical, the tolerance can be degraded, probably
as high as 1, for a slight savings in speed.
- The parameter W is now set with the -W option (uppercase double-u) instead of
the -w option (lowercase double-u).
- The optimal AGF may not work yet, but it's a lot more user friendly!
Check the documentation...
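
For the curious, quicksort-based selection of the k smallest distances is the
same idea as the standard library's partial selection. A minimal stand-in
(not the library's own routine):

    #include <algorithm>
    #include <vector>

    // return the k smallest values of d (in no particular order);
    // average O(n), versus O(n log n) for a full sort; assumes k <= d.size()
    std::vector<float> kleast(std::vector<float> d, size_t k) {
      std::nth_element(d.begin(), d.begin() + k, d.end());
      d.resize(k);
      return d;
    }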
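
The condition that the supernewton root-finder solves when matching the
filter variance to W can be written as follows (standard AGF notation: d_i is
the distance from the test point to training sample i):

    % total weight as a function of the filter variance \sigma^2:
    W(\sigma^2) = \sum_{i=1}^{n} w_i, \qquad
    w_i = \exp\left( -\frac{d_i^2}{2 \sigma^2} \right)

    % solved for \sigma^2 within the bracket set by -v and -V:
    W(\sigma^2) = W_0, \qquad
    \sigma^2 \in \left[ \sigma_{tot}^2 / n^{2/D},\; \sigma_{tot}^2 \right]

Here W_0 is the target total weight set with -W, sigma_tot^2 is the total
variance of the data, n is the number of training samples and D is the
dimensionality.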
New in Version 0.9.6:
- New Bash script to validate probability density function (PDF) calculations:
validate_pdf.sh
- New command for generating a simulated data set with approximately the
same PDF as the training data: pdf_sim. Used by validate_pdf.sh.
- New file utility that randomly splits a dataset into two or more divisions:
agf_split_data. Also used by validate_pdf.sh.
- New command generates the Relative Operating Characteristic (ROC) curve:
roc_curve
New in Version 0.9.7:
- Command to generate and browse a dendrogram (hierarchical clustering),
called browse_cluster_tree, as well as associated libraries
- Multi-class classification based on class borders! Two executables:
  - multi_borders trains the model based on a control file
  - classify_m performs classifications using the output from multi_borders
- The libraries can be generalized (with some effort) for any binary
classification method
- Normalization has been generalized to a linear transformation and moved
to a stand-alone executable (agf_precondition) for pre-processing (see the
note at the end of this list). This executable includes Singular Value
Decomposition (SVD) as an option, as well as feature selection.
- Also for pre-processing: agf_preprocess. The command performs operations
on both features and classes: splitting the dataset for validation purposes,
selecting, re-labelling and partitioning classes.
- Namespace added
- Reorganized the directory structure.
- Added an 'examples' directory for test suites and applications.
- Routine to sample the class borders has been made completely general in that
it can use any algorithm that provides continuous, differentiable
estimates of the conditional probabilities.
- All operations--classification, PDF estimation and interpolation--have been
grouped together under two executables: one for AGF, called agf, and one for
KNN, called knn.
- Classification programs that output conditional/joint probabilities for all
classes (agf, knn and classify_m--for some cases) now print them in an
LVQ-compatible format.
- Temporary files from nfold and roc_curve are now deleted on exit rather than
cluttering the directory.
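
Finally, a note on what "normalization generalized to a linear
transformation" means in agf_precondition. The exact conventions are in the
documentation; this is just the shape of the idea, in LaTeX:

    % every pre-processing step has the form x' = A (x - m);
    % plain normalization is the diagonal special case:
    A_{norm} = \mathrm{diag}(1/\sigma_1, \dots, 1/\sigma_D)

    % with the SVD option, one common construction projects onto the
    % right singular vectors of the centred training matrix:
    X - \bar{X} = U \Sigma V^T, \qquad \vec x' = V^T (\vec x - \bar x)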