\documentclass[twoside, a4paper, final]{article}
\usepackage[english]{babel}
\usepackage{a4wide}
\usepackage{eurosym}
\usepackage{times}
\usepackage{type1cm}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{amssymb, amsmath}
\usepackage{xr-hyper}
\usepackage[colorlinks=true,linkcolor=blue,pdftex]{hyperref}

\title{PANDA MVA documentation}
\author{M. Babai}
\date{14 May 2010}
%\date{\today}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\maketitle
\begin{center}
  \rule[0.1mm]{5.0cm}{2.0mm}
\end{center}
\tableofcontents
\hrulefill
\newpage

%========================= KNN =====================================
\section{k-nearest neighbor algorithm (KNN)}
The current package, located in ``pandaroot/PndTools/MVA'', contains two
different implementations of this algorithm:
\begin{description}
\item [KNN:] A very simple implementation of KNN based on linear search.
  For large data sets and a large number of features (parameters) this
  implementation becomes very slow and almost impractical; it is included
  for validation purposes only.
\item [TMVAkd\_KNN:] This implementation uses a kd-tree to store and search
  the available examples. The data structure is fast to construct, and all
  search and traversal operations have $O(\log n)$ complexity, where $n$ is
  the number of examples inserted into the data structure. It is very fast,
  but it needs a large number of examples to build a representative and
  useful database, and therefore a large amount of memory.
\end{description}
The current implementations run in linear time in the worst case. The result
is an estimate of the probability densities, normalized over the available
classes (labels).

\subsection{Using KNN}
The directories ``KNN'' and ``TMVAkd\_KNN'' both contain examples that show
how to use these implementations; they are a good starting point and can be
modified to perform pattern classification.\newline
{\bf Notes:}
\begin{description}
\item [Training:] For KNN there is no need to train the classifier. If the
  input weight file has already been pre-processed (normalized,
  decorrelated, \ldots), it can be used directly for classification.
\item [Weights:] Up-to-date example weight files for this algorithm can be
  fetched with the scripts available in the directory ``macro/scripts/''.
  They are not available via svn merely because of their size
  ($\pm\,800$~MB).
\end{description}

%========================= LVQ =====================================
\section{Learning Vector Quantization (LVQ)}
This directory contains a very simple implementation of the LVQ1 and LVQ2.1
algorithms. These functions can be used directly to create a set of
prototypes and to train them with either of the two algorithms. The
classifier itself is also implemented. At this point the classifier only
returns the shortest distance to every class type; no decision is made. If
the decision is to be made by the classifier, the user needs to modify the
code and add a discriminator based on this output.

\subsection{Using LVQ}
\begin{description}
\item [Training:] The directory LVQ contains an example implementation
  (LVQtrain.cpp). This sample program can be used as a starting point for a
  training scheme.
\item [Weights:] Up-to-date example weight files for this algorithm can be
  fetched with the scripts available in the directory ``macro/scripts/''.
  They are not available via svn merely because of their size.
\item [Classification:] The directory LVQ contains an example implementation
  (LVQclassify.cpp). This sample program can be used as a starting point for
  a classification scheme. The smallest MVA value indicates a
  \underline{{\it{\bf better match}}}: {\it in other words, the current
  example is most likely to belong to the label with the smallest MVA
  value} (see the sketch after this list).
\end{description}
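As an illustration of this output, the sketch below shows a minimal,
self-contained nearest-prototype classification in C++. It is {\it not} the
code in the LVQ directory; all names (\texttt{Prototype},
\texttt{lvqDistances}) are hypothetical. For every class (label) it returns
the distance to the closest prototype of that class; a user-supplied
discriminator would then simply pick the label with the smallest value.
\begin{verbatim}
// Illustrative sketch only (not the PndTools/MVA code); all names are
// hypothetical. Computes, per class, the distance to the closest prototype.
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Prototype {
  std::vector<double> pos;  // prototype (codebook) vector
  std::string label;        // class this prototype represents
};

// Squared Euclidean distance between two vectors of equal length.
double dist2(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return s;
}

// For each label, the distance to its closest prototype; the smallest
// entry corresponds to the best match.
std::map<std::string, double> lvqDistances(const std::vector<Prototype>& protos,
                                           const std::vector<double>& x) {
  std::map<std::string, double> best;
  for (const Prototype& p : protos) {
    double d = std::sqrt(dist2(x, p.pos));
    auto it = best.find(p.label);
    if (it == best.end() || d < it->second) best[p.label] = d;
  }
  return best;  // a discriminator would pick the label with the smallest value
}
\end{verbatim}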
%========================= K-means Clustering ===================
\section{K-means Clustering}
The directory ``Clusters'' contains an implementation of this algorithm.
Given a set of parameter vectors and the number of expected clusters
(centroids), ``k-Means'' generates for each cluster a centroid that
represents the mean vector of that particular set of parameter vectors. At
the moment only ``Hard K-means'' is implemented, which means that each
parameter vector is assigned to exactly one cluster; a schematic sketch of
the procedure is given below. It is also possible to generate weighted
centroids whose member vectors belong partially to two or more clusters;
the latter is called ``Soft K-means'' and is not implemented yet.
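The sketch below illustrates hard K-means; it is {\it not} the code in the
``Clusters'' directory, and all names (\texttt{kMeans}) are hypothetical.
Each iteration assigns every parameter vector to its nearest centroid (hard
membership) and then replaces each centroid by the mean of the vectors
assigned to it.
\begin{verbatim}
// Illustrative sketch of hard K-means only (not the "Clusters" code);
// all names are hypothetical. "data" must be non-empty and "centroids"
// must hold the k initial centers.
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

static double dist2(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return s;
}

void kMeans(const std::vector<Vec>& data, std::vector<Vec>& centroids,
            int iterations = 50) {
  const std::size_t k = centroids.size();
  const std::size_t dim = data.front().size();
  for (int it = 0; it < iterations; ++it) {
    std::vector<Vec> sum(k, Vec(dim, 0.0));
    std::vector<std::size_t> count(k, 0);
    // Assignment step: each vector belongs to exactly one (nearest) centroid.
    for (const Vec& x : data) {
      std::size_t best = 0;
      for (std::size_t c = 1; c < k; ++c)
        if (dist2(x, centroids[c]) < dist2(x, centroids[best])) best = c;
      for (std::size_t d = 0; d < dim; ++d) sum[best][d] += x[d];
      ++count[best];
    }
    // Update step: each centroid becomes the mean of its assigned vectors.
    for (std::size_t c = 0; c < k; ++c)
      if (count[c] > 0)
        for (std::size_t d = 0; d < dim; ++d)
          centroids[c][d] = sum[c][d] / count[c];
  }
}
\end{verbatim}
\end{document}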