\documentclass[twoside, a4paper, final]{article}
\usepackage[english]{babel}
\usepackage{a4wide}
\usepackage{eurosym}
\usepackage{times}
\usepackage{type1cm}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{amssymb, amsmath}
\usepackage{xr-hyper}
\usepackage[colorlinks=true,linkcolor=blue,pdftex]{hyperref}

\title{PANDA MVA documentation}
\author{M. Babai}
\date{14 May 2010}
%\date{\today}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\begin{center}
\rule[0.1mm]{5.0cm}{2.0mm}
\end{center}
\tableofcontents
\hrulefill
\newpage

%========================= KNN =====================================
\section{k-nearest neighbor algorithm (KNN)}
The current package, located in ``pandaroot/PndTools/MVA'', contains two
different implementations of this algorithm:
\begin{description}
\item [KNN:] A very simple implementation of KNN, based on linear search.
  For large data sets and a large number of features (parameters) this
  implementation becomes very slow and almost impractical; it is included
  for validation purposes only.
\item [TMVAkd\_KNN:] This implementation uses a kd-tree to store and search
  through the available examples. This data structure is fast to construct,
  and search and traversal operations have, on average, $O(\log n)$
  complexity, where $n$ is the number of examples inserted into the
  structure. It is very fast, but it needs many examples in order to build
  a representative and useful database, and it requires a large amount of
  memory.
\end{description}
Both implementations run in linear time in the worst case, and the result is
an estimate of the probability density (pdf) for each class, normalized over
the available classes (labels).

\subsection{Using KNN}
Both directories, ``KNN'' and ``TMVAkd\_KNN'', contain examples that show how
to use these implementations and are a good starting point; one can modify
them to perform pattern classification.\newline
{\bf Notes:}
\begin{description}
\item [Training:] KNN does not require training of the classifier. If the
  input weight file has already been pre-processed (normalized, decorrelated,
  \dots), it can be used directly for classification. Note that the same
  transformations need to be applied to the new, yet to be classified,
  patterns.
\item [Weights:] Up-to-date example weight files for this algorithm can be
  fetched using the scripts available in the directory ``macro/scripts/''.
  These files are not available via svn, merely because of their size
  ($\pm\ 800$~MB).
\end{description}
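For illustration, the following minimal, self-contained sketch shows the idea
behind the linear-search variant: the $k$ nearest neighbors of a query
pattern are found by brute force, and the label counts among them are
normalized to give the per-class probability estimates described above. This
is only a sketch of the technique, not the interface of the classes in this
package; all names in it are hypothetical.

\begin{verbatim}
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One training example: a feature (parameter) vector plus its class label.
struct Example {
  std::vector<double> features;
  std::string label;
};

// Squared Euclidean distance between two feature vectors of equal length.
static double dist2(const std::vector<double>& a, const std::vector<double>& b) {
  double d = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    d += (a[i] - b[i]) * (a[i] - b[i]);
  return d;
}

// Linear-search kNN: return, for every label, the fraction of the k nearest
// training examples carrying that label (a normalized probability estimate).
std::map<std::string, double> knnClassify(const std::vector<Example>& train,
                                          const std::vector<double>& query,
                                          std::size_t k) {
  // Distance of the query to every stored example: O(n).
  std::vector<std::pair<double, std::size_t> > dists;
  dists.reserve(train.size());
  for (std::size_t i = 0; i < train.size(); ++i)
    dists.push_back(std::make_pair(dist2(train[i].features, query), i));

  // Keep only the k smallest distances.
  k = std::min(k, dists.size());
  std::partial_sort(dists.begin(), dists.begin() + k, dists.end());

  // Count the labels among the k nearest neighbors and normalize.
  std::map<std::string, double> prob;
  for (std::size_t i = 0; i < k; ++i)
    prob[train[dists[i].second].label] += 1.0;
  for (std::map<std::string, double>::iterator it = prob.begin();
       it != prob.end(); ++it)
    it->second /= static_cast<double>(k);
  return prob;
}
\end{verbatim}

The kd-tree based variant replaces the brute-force scan by a tree search, but
produces the same kind of normalized per-class output.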
%========================= LVQ =====================================
\section{Learning Vector Quantization (LVQ)}
This directory contains a very simple implementation of the LVQ1 and LVQ2.1
algorithms. One can use these functions directly to create a set of
prototypes and perform the training with either of the two algorithms. Tools
are also provided for creating ROC curves and for performing k-fold
cross-validation. The classifier itself is also implemented. At this point
the classifier only returns the shortest distance to each class type,
normalized by the sum of the outputs; no decision is made. If one wants the
classifier to make the decision, the user needs to modify the code and add a
discriminator based on this output.

\subsection{Using LVQ}
\begin{description}
\item [Training:] The directory ``LVQ'' contains an example implementation
  (LVQtrain.cpp). This sample program can be used as a starting point for a
  training scheme, cross-validation, training error estimation or error
  evolution.
\item [Weights:] Up-to-date example weight files for this algorithm can be
  fetched using the scripts available in the directory ``macro/scripts/''.
  These are not available via svn, merely because of their number.
\item [Classification:] The directory ``LVQ'' contains an example
  implementation (LVQclassify.cpp). This sample program can be used as a
  starting point for a classification scheme or for ROC production. The
  smallest MVA value indicates the \underline{{\it{\bf best match}}}: {\it in
  other words, the current example is most likely to belong to the label with
  the smallest MVA value}. Furthermore, the output is normalized by the sum
  of the outputs over all labels.
\end{description}

%========================= K-means Clustering ===================
\section{K-means Clustering}
The directory ``Clusters'' contains an implementation of this algorithm.
Given a set of parameter vectors and the number of expected clusters
(centroids), k-means generates for each cluster a centroid that represents
the mean vector of the parameter vectors assigned to that cluster. At the
moment only ``hard k-means'' is implemented, which means that each
initialized center belongs to a single cluster. It is also possible to
generate weighted centroids that are partial members of two or more clusters;
this variant is called ``soft k-means'' and is not implemented yet.

\section{TMVA\_MCL}
This directory contains the implementation of a number of wrappers. These are
interfaces to multi-class implementations of the algorithms available in TMVA
(distributed as part of the ROOT package), and they are provided in order to
have a common interface to all algorithms available in pandaroot. After
creation of the objects, control is fully passed to TMVA; this means that
setting and changing the parameters follows the TMVA guidelines. For further
information on how to use, train and understand these methods, see the TMVA
manual.
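As an orientation only, the following sketch shows how a method would
typically be booked and trained through TMVA's classic two-class ``Factory''
interface; the file, tree and variable names are placeholders, and the option
strings follow the TMVA guidelines. The wrappers in this directory add the
multi-class handling on top of this scheme; consult the TMVA manual for the
authoritative set of options.

\begin{verbatim}
// Hypothetical stand-alone ROOT macro; all file, tree and variable
// names are placeholders and have to be adapted to the actual data.
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookKnnExample() {
  TFile* input = TFile::Open("patterns.root");
  TTree* sig   = (TTree*) input->Get("signalTree");
  TTree* bkg   = (TTree*) input->Get("backgroundTree");

  TFile* output = TFile::Open("TMVA_output.root", "RECREATE");
  TMVA::Factory factory("PandaMVA", output,
                        "!V:AnalysisType=Classification");

  // Declare the features (parameters) used for the classification.
  factory.AddVariable("p",   'F');
  factory.AddVariable("emc", 'F');

  factory.AddSignalTree(sig, 1.0);
  factory.AddBackgroundTree(bkg, 1.0);
  factory.PrepareTrainingAndTestTree("",
                                     "SplitMode=Random:NormMode=NumEvents");

  // Book a kNN method; the option string follows the TMVA conventions.
  factory.BookMethod(TMVA::Types::kKNN, "KNN", "nkNN=20");

  factory.TrainAllMethods();
  factory.TestAllMethods();
  factory.EvaluateAllMethods();

  output->Close();
}
\end{verbatim}

\end{document}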