\clearpage
\section{Evaluation of the XDAQ framework} \label{XDAQ-Mainchapter}
{\em XDAQ} \cite{XDAQ-wiki} is a data acquisition framework being developed for the {\em Compact Muon Solenoid} (CMS) experiment \cite{CMS-home} at CERN. The developers give the following mission statement: ``XDAQ is a middleware that eases the tasks of designing, programming and managing data acquisition applications by providing a simple, consistent and integrated distributed programming environment. The framework builds upon industrial standards, open protocols and libraries.'' \cite{XDAQ-wiki}. The XDAQ programming framework consists mainly of C++ classes and namespaces, although some Java based tools are also applied. Since the data acquisition of the CMS experiment is comparable in size to what is planned for CBM, XDAQ seems well suited as an example of such a system. Moreover, the existing XDAQ system might be directly adaptable as a framework for the first CBM data acquisition test set-up. Because of this, XDAQ was evaluated with respect to its usability for CBM.

XDAQ was installed on the standard GSI Linux and on the InfiniBand test cluster (section \ref{XDAQ-IB-install}). There, several tests were carried out using standard examples of the XDAQ release (section \ref{XDAQ-tests}). Since fast data transfer over the network is crucial for CBM, special emphasis was put on the data transport. The performance of different peer transport layers was measured by means of a round trip application (section \ref{RoundTripBasics}). The applicability and the limits of InfiniBand transport within the XDAQ framework were studied thoroughly. In the course of this work, a new XDAQ peer transport module was developed on top of the uDAPL library for InfiniBand (section \ref{ptDAPL-Imp}).

\subsection{Installation on the InfiniBand cluster} \label{XDAQ-IB-install}
%Installation problems, patches for 64 bit, environment set up.
The first installation attempts of XDAQ on the InfiniBand test machines encountered some problems due to the 64 bit architecture and due to differences between our Linux installation (SuSE 9.3) and the CERN standard (Scientific Linux). Thus some adjustments were necessary to build at least the basic components. The XDAQ framework is released as the major packages {\em Coretools}, {\em Powerpack}, and {\em Worksuite}. The {\em Worksuite} is specialized for CMS experiment hardware and requires specific drivers and kernel mode compilation; it was largely excluded from these tests. Concerning the 64 bit compatibility, at first the central Makefile parameters had to be modified to support position independent code (option {\tt -fPIC}). This allowed the {\em Coretools} bundle to be built. For the XDAQ version of November 2005, it was also necessary to disable some hardware oriented modules of the {\em Powerpack}\footnote{The XDAQ release of May 2006 automatically disables these CMS specific parts.}. Then the compilation seemed to be successful. However, some standard examples showed strange behaviour at runtime, e.g. the http port of the HyperDAQ web server (see section~\ref{HyperDAQtest}) was not ready while consuming 99\% CPU time; other operations led to segmentation violations of the XDAQ executive. Since this functionality was working well with a reference installation on standard GSI Linux (Debian 3.0, 32 bit), the sources were investigated for parts that are not 64 bit safe. In fact, modifications at several places in the code involving critical pointer types made these problems disappear.
These patches were reported to the XDAQ developer community\footnote{This initiated advanced efforts of the developers to port the XDAQ code to 64 bit; meanwhile this has been achieved with the release of June 2006.}. This installation of November 2005 was used for all general tests and developments. In May 2006, the IB cluster installation was updated to the new XDAQ releases ({\em Coretools 3.5.2}, {\em Powerpack 1.4.3}). Our 64 bit corrections were still necessary on top of this release. Additionally, some other small compilation problems occurred that could be solved manually.

%\clearpage
\subsection{XDAQ tests overview} \label{XDAQ-tests}
%Here description of existing xdaq components that have been tested.
The installation of the {\em Coretools} and {\em Powerpack} components on the IB cluster has been tested by means of standard examples as delivered with the XDAQ distribution. Source code and configuration files of the examples were copied to a user working directory and slightly adjusted to the cluster environment. Additionally, some test applications were developed further and supplemented with other XDAQ features under investigation. Starting with the given {\em HelloWorld} example, it was already possible to learn about the {\em XML} configuration files and to explore the features of the {\bf HyperDAQ} web interface. Other tested examples covered:
\begin{compactitem}[$\bullet$]
\item synchronous and asynchronous {\bf state machines} (\ref{StateMachineTest});
\item multithreading with the XDAQ {\bf workloop} mechanism;
\item data import/export by means of the XDAQ {\bf InfoSpace} (\ref{XDAQ-Monitor});
\item monitoring of control variables with the XDAQ {\bf monitoring} tool (\ref{XDAQ-Monitor});
\item command messaging with the {\bf SOAP} protocol over the {\bf peer transport http} layer (\ref{SOAPtest});
\item data transport with the {\bf I2O} protocol over the {\bf peer transport TCP} (pt/tcp) layer (\ref{I2O}).
\end{compactitem}
Most of these features work together in the distributed XDAQ {\em RoundTrip} benchmark application for data transport performance measurements. Therefore our further investigations were focused on this example (i.e. the slightly modified {\em MyRoundTrip}). The experiences with the different XDAQ aspects gained in this work are described in the following.

\subsection{Cluster configuration} \label{XMLConfig}
% Here some details how xml configuration works: all nodes know all registered
% applications
XDAQ is designed to run many processes distributed over a data acquisition cluster. The configuration of each process is set up by an XML description schema \cite{XDAQ-XML}. Certain hierarchical units are defined by XML tags, such as {\em Partition}, {\em Context}, and {\em Application}. Additionally, the {\em Endpoint} tag may define a network connection. Library modules can be specified by means of the {\em Module} tag; they are loaded automatically when the configuration is activated. The cluster itself may be divided into several {\em Partition}s; however, XDAQ currently supports only one single partition. All XDAQ applications of a {\em Partition} are grouped into different {\em Context}s on the network cluster. Each {\em Context} is usually identified by a URL, i.e. a node name and a port number. All XDAQ {\em Application}s of the same {\em Context} run as threads in one process on the node, controlled by the {\em XDAQ Executive} application.
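As an illustration, the fragment below sketches such a configuration with two contexts. The node names, port numbers, library path, and the exact schema attributes are only illustrative and are not taken from an actual set-up used here.
\begin{verbatim}
<xc:Partition
    xmlns:xc="http://xdaq.web.cern.ch/xdaq/xsd/2004/XMLConfiguration-30">
  <!-- one Context per Executive process, identified by node name and port -->
  <xc:Context url="http://node01.example.net:1972">
    <!-- network connection used by the applications of this context -->
    <xc:Endpoint protocol="tcp" service="i2o"
                 hostname="node01.example.net" port="40000"/>
    <!-- application running as a thread inside this Executive -->
    <xc:Application class="MyRoundTrip" id="10" instance="0" network="tcp"/>
    <!-- shared library loaded automatically when the configuration is activated -->
    <xc:Module>/opt/xdaq/lib/libMyRoundTrip.so</xc:Module>
  </xc:Context>
  <xc:Context url="http://node02.example.net:1972">
    <!-- second context with the partner application, endpoint and module -->
  </xc:Context>
</xc:Partition>
\end{verbatim}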
It is possible to pass individual default parameters to each {\em Application} using a {\em properties} tag, including variable names as defined in the application code. At first, the default configuration for all contexts is loaded from the {\tt etc/profile.xml} file. This defines at least the controlling {\em Executive} application environment and the peer transport via http for external access to the node. In addition, on startup of each XDAQ process an individual configuration file can be passed on the command line, e.g. {\tt xdaq.sh -c myconfiguration.xml}. It is also possible to load a configuration file from the {\em HyperDAQ} web interface (section \ref{HyperDAQtest}) into an {\em Executive} after the XDAQ start-up. It should be pointed out that this configuration file may be the same for all nodes of the cluster. By means of the {\em Context} tag, each node applies only the settings for its own context, but is aware of the other contexts. Thus each {\em Executive} has a local registry of {\em all} applications in the network. This is used for addressing messages (section \ref{SOAPtest}) and for browsing the cluster (section \ref{HyperDAQtest}).

\subsection{{\em HyperDAQ} web interface} \label{HyperDAQtest}
%Feature description, xrelay controller,
By default, each XDAQ {\em Context} process (section \ref{XMLConfig}) runs a web server provided by the {\em HyperDAQ} application. The home page of this web server (Fig. \ref{fig:hyperdaq}) offers a user interface for all known XDAQ applications in the cluster. It can be viewed with any HTML browser at the {\em Context} http address and port number.
%{\tt http://depcp001.gsi.de:1972}.
To secure the http access, XDAQ provides the {\em XAccess} module, which applies basic http authentication and rejects unknown client addresses. Additionally, {\tt tinyproxy} is suggested as a lightweight http proxy to control access from outside a private DAQ cluster \cite{XDAQ-wiki}.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {hyperdaq-screen.png}
\caption{A screenshot of the {\em HyperDAQ} web interface.}
\label{fig:hyperdaq}
\end{figure}
The properties of any application may be inspected or even changed via this web interface. Moreover, an XDAQ application may be defined as a {\em WebApplication}, with the possibility to show an individual home page when selected from the {\em HyperDAQ}. Using the XDAQ {\em xgi} library for CGI scripts, the developer may provide a dynamic web-client user interface here. This feature was applied, for instance, for the {\em state machine} (\ref{StateMachineTest}) and {\em MyRoundTrip} (\ref{RoundTripBasics}) examples.
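The xgi binding follows a simple callback pattern, sketched below as a fragment for a hypothetical web application. The class and member names are our own, and the calls are quoted from the common XDAQ examples rather than verified against the release used here.
\begin{verbatim}
#include "xgi/Method.h"

void MyWebApp::bindWebPages()   // e.g. called from the constructor
{
  // bind a member function to the "Default" page of this application
  xgi::bind(this, &MyWebApp::Default, "Default");
}

// the bound callback writes HTML into the reply of an http request
void MyWebApp::Default(xgi::Input* in, xgi::Output* out)
  throw (xgi::exception::Exception)
{
  *out << "<h1>MyWebApp status</h1>" << std::endl;
  *out << "<p>frames sent: " << frameCounter_ << "</p>" << std::endl;
}
\end{verbatim}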
Additionally, it is possible to send SOAP messages \cite{SOAP} to any known XDAQ application by means of the {\em XRelay Controller} web interface. Like the {\em HyperDAQ}, the {\em XRelay Controller} is one of the XDAQ default applications running in each {\em Context} process. The experiences with the {\em XRelay Controller} are described in section \ref{SOAPtest}.

Other features available from the {\em HyperDAQ} interface are:
\begin{compactitem}[$\bullet$]
\item the {\em Upload Application} form to start new applications on the cluster by class name, id number, and library location;
\item the {\em Configure Executive} form to upload and apply a new XML configuration script;
\item the {\em Cluster Explore} tool that allows jumping to any other {\em HyperDAQ} server in the cluster;
\item lists of all loaded libraries on the {\em Display Modules} page;
\item lists of all known applications with links to their properties and home pages on the {\em Application Descriptors} page.
\end{compactitem}
To summarize, the {\em HyperDAQ} turned out to be very useful in this evaluation and development phase, as the tested applications could be inspected and controlled easily. Except for an editor for the XML configuration files, no other user interface was necessary for set-up and monitoring of the examples. However, the existing web interfaces will most probably not be sufficient for a data acquisition system ``in production''.

\subsection{State machines} \label{StateMachineTest}
%Synchronous/asynchronous state machines. Description, work well.
A fundamental way to describe and control the behaviour of complex entities, e.g. data acquisition components, is the concept of the {\em Finite State Machine} (FSM) \cite{Wikipedia-Statemachine}. The XDAQ framework already delivers some powerful classes to realize such finite state machines. The {\tt toolbox::fsm} package contains the {\bf FiniteStateMachine} and {\bf AsynchronousFiniteStateMachine} classes, the latter using a dedicated workloop (thread) to process the state transition functions in the background. Both classes allow the states and transition functions of the state machine object to be configured arbitrarily: the user may define any kind of states and their transitions by text identifiers in the source code, and any method of a user class may be bound to a specific state transition. XDAQ events with user defined command names may be passed to an existing FSM to trigger state changes at runtime. Additionally, the {\tt xgi} package offers the {\bf WSM} class as a {\em web dialog state machine}. This alternative FSM implementation is also user configurable at compile time, but has a web page GUI with buttons to show and change the state.

The data transport {\em RoundTrip} example (section \ref{RoundTripBasics}) uses both FSM implementations in parallel. The web state machine is just applied as user interface, whereas the FiniteStateMachine governs the application state and may also be toggled by SOAP messages. Both state machines are always kept synchronized. As predefined in the XDAQ examples, the states ``Halted'', ``Ready'', and ``Enabled'' are used here, with the transition commands ``Configure'', ``Enable'', and ``Halt'', respectively. Since this kind of state machine is generally applicable to many cases, the corresponding SOAP commands are already foreseen in the {\em XRelay} Controller web interface (section \ref{SOAPtest}). Therefore we kept this state machine definition as a ``template'' to control all further test programs (sections \ref{RDMA-XDAQ}, \ref{RoundTripBasics}, \ref{ptDAPL-SendRec}). However, this definition is just a suggestion and may be extended or fully redefined for a production system later on. The XDAQ state machine concept seems powerful and flexible enough to cope with all requirements for data acquisition set-up and control.
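A minimal sketch of how such a state machine might be set up with the states and commands named above is given below. The method names and signatures follow the common XDAQ examples and are quoted from memory, so details may differ from the release used here; {\tt MyApp} and its member functions are placeholders.
\begin{verbatim}
#include "toolbox/fsm/FiniteStateMachine.h"
#include "toolbox/Event.h"

// fsm_ is a toolbox::fsm::FiniteStateMachine member of the application

void MyApp::setupStateMachine()   // e.g. called from the constructor
{
  // states are identified by a character and a text label
  fsm_.addState('H', "Halted",  this, &MyApp::stateChanged);
  fsm_.addState('R', "Ready",   this, &MyApp::stateChanged);
  fsm_.addState('E', "Enabled", this, &MyApp::stateChanged);

  // user methods bound to the transitions triggered by the command names
  fsm_.addStateTransition('H', 'R', "Configure", this, &MyApp::ConfigureAction);
  fsm_.addStateTransition('R', 'E', "Enable",    this, &MyApp::EnableAction);
  fsm_.addStateTransition('E', 'H', "Halt",      this, &MyApp::HaltAction);

  fsm_.setInitialState('H');
  fsm_.reset();
}

void MyApp::requestConfigure()
{
  // a state change is requested at runtime by firing a named event,
  // e.g. on reception of a "Configure" SOAP command
  toolbox::Event::Reference ev(new toolbox::Event("Configure", this));
  fsm_.fireEvent(ev);
}
\end{verbatim}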
Issues that have not been covered yet in our investigations are the performance (latency, CPU load) of state transitions, the reliability, and the stability, especially in the case of large distributed systems.

\subsection{Reading and monitoring of variables} \label{XDAQ-Monitor}
%There is a monitor application for monitoring values. short!
Any data acquisition system requires the possibility to monitor the state and important values of the functional components, e.g. data rate, memory and buffer consumption, etc. Though this task may be covered by independent, full featured control systems like EPICS \cite{EPICS} or SMI++/DIM \cite{DIM}, XDAQ at least offers some mechanisms for the monitoring of variables. By means of the {\em InfoSpace} concept \cite{XDAQ-wiki}, it is possible to offer any value for external readout by name. When a variable is requested from another application, a user defined callback may ensure that its value is updated, e.g. from a hardware component or by calculation from other variables.

Moreover, XDAQ provides a default {\em Monitor} application that may frequently request variables from any other application \cite{XDAQ-Monitor-wiki}. These variables must be part of a known monitorable {\em InfoSpace}. By means of the {\tt mon:flashlists} tag in the XML configuration, the variables to be monitored from any InfoSpace are defined for the {\em Monitor} application. Additionally, the {\tt mon:collectorsettings} tag may specify the monitoring frequency and properties for recording a history of the values to a file. Inspecting the monitored values remotely may be done via a {\em Monitor CGI interface}. The {\em Monitor} application example offers an xgi callback function that reacts on {\tt http::get} requests for a collected value from the flashlist. The result of such a variable retrieval may be displayed as a table in a web browser (when invoked from there), or printed to the terminal (when invoked with the {\tt curl} command line tool \cite{CURL}).

The provided monitoring application example worked as intended for a small number of variables. However, monitoring here always requires an {\em active} request of the visualizing client at the http server of the XDAQ {\em Executive} that runs the monitoring application. This procedure might soon reach its limits when applied to a large number of process variables on many nodes. Furthermore, the table display of the monitored variables in the requesting browser, as provided by the example, is very primitive compared to the possibilities of any slow control GUI. Despite this, the example demonstrates the XDAQ possibilities for variable retrieval. These can be applied for a more advanced monitoring system later on. For example, using the {\tt curl} library API \cite{CURL}, it would be possible to connect an independent EPICS {\em InputOutputController} process to the XDAQ http interface to fill process variables. There are also ideas \cite{XDAQ-Monitor-wiki} to link the XDAQ variables via http or SOAP interfaces to National Instruments LabVIEW \cite{Labview} displays. For the CMS experiment, XDAQ provides an interface to the LHC standard control system PVSS via SOAP messaging (section \ref{SOAPtest}) \cite{XDAQ-wiki}.
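To illustrate the InfoSpace export mechanism described at the beginning of this section, the following fragment sketches how an application might publish a counter and refresh it on demand. The variable and method names are our own, and the xdata calls are quoted from the common XDAQ examples rather than verified against the release used here.
\begin{verbatim}
#include "xdata/UnsignedInteger32.h"
#include "xdata/InfoSpace.h"
#include "xdata/Event.h"

// member variable of the application, exported by name (illustrative):
//   xdata::UnsignedInteger32 eventCounter_;

void MyApp::exportVariables()   // e.g. called from the constructor
{
  // publish the variable in the application InfoSpace so that it
  // can be read out externally by its name
  getApplicationInfoSpace()->fireItemAvailable("eventCounter", &eventCounter_);

  // register a listener so the value can be refreshed whenever it
  // is retrieved by another application
  getApplicationInfoSpace()->addItemRetrieveListener("eventCounter", this);
}

// callback of the xdata::ActionListener interface
void MyApp::actionPerformed(xdata::Event& e)
{
  // update the exported value, e.g. from a hardware register
  eventCounter_ = readCounterFromHardware();
}
\end{verbatim}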
\subsection{Job Control} \label{XDAQ-JobControl}
%There is a job control. What is it good for? possible starting point
%for real cluster control system\ldots?
The control of the data acquisition cluster not only requires monitoring of process variables, but also changing the state of the applications. During the lifetime of the XDAQ {\em Executive}, this can be achieved by passing commands via SOAP messages to the controlled XDAQ application, as described in \ref{SOAPtest}. Moreover, to launch and cancel the {\em Executive} process itself, a higher level control mechanism is necessary. This should be capable of starting any operating system process on a machine of the DAQ cluster at initialization time; it may cancel any crashed process and restart it again on the fly; finally, a regular shutdown of the DAQ processes should be possible on remote request. One could also think of a supervising system that detects problems of the DAQ processes and recovers them automatically; these ideas lead to systems like the {\em ARMOR}s (\underline{A}daptive, \underline{R}econfigurable, and \underline{M}obile objects for \underline{R}eliability), as proposed elsewhere \cite{Armor-paper}, \cite{Armor-survey}.

The XDAQ framework does not yet provide such an elaborate system. However, there is a {\em JobControl} application \cite{XDAQ-wiki} for starting and killing any process remotely by means of SOAP requests (see \ref{SOAPtest}). The {\em JobControl} runs within an independent, reserved XDAQ {\em Executive} process. This may be launched before all other XDAQ processes, e.g. as a system service at Linux boot time. The {\em JobControl} application can then launch another XDAQ process with its own environment that runs the actual data acquisition applications. The {\em JobControl} can also kill the other XDAQ process on request. There are several SOAP commands to initiate these actions from the controlling client:
\begin{description}
\item[executeCommand:] Start an executable on this machine. May define environment variables and user settings. Note that any program may be invoked here, not only XDAQ processes.
\item[killExec:] Kill an executable by id number.
\item[killAll:] Kill all executables started by this job control.
\end{description}
On our InfiniBand cluster, the {\em JobControl} has not been tested completely yet. It seems a good idea, though, to provide a meta application for process control. For a future DAQ production system, the {\em JobControl} concept could be developed further into a semi-automatic node controller process. However, it still must be investigated whether there are other process controlling systems with similar or even better properties.

\subsection{SOAP messaging} \label{SOAPtest}
%Here controller application that sends/receives configure messages to all exisiting xdaq applications in cluster.
XDAQ uses the {\em Simple Object Access Protocol} (SOAP) \cite{SOAP} to exchange commands and messages between the applications. From the {\em XRelay Controller} web interface (\ref{HyperDAQtest}), for example, it is possible to interactively submit a SOAP message to any registered application in the cluster.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {xrelay-screen.png}
\caption{A screenshot of the {\em XRelay Controller} web interface.}
\label{fig:xrelay}
\end{figure}
By means of a Java based web GUI (Fig. \ref{fig:xrelay}), the {\em XRelay Controller} displays all known applications and allows the user to select the message receiver, or to filter multiple receivers by a name pattern. Thus multiple receivers can be addressed with one click. Some default messages, i.e. the standard commands for the XDAQ state machine examples (e.g. ``Configure'', ``Enable'', ``Halt'', ``Suspend'', ``Resume'', see \ref{StateMachineTest}), can be selected from a list.

Besides these predefined messages, the web interface allows the SOAP statements to be edited before they are sent. The SOAP response message as returned from the receiver is displayed. The {\em XRelay Controller} GUI was used as a simple command interface for most of our tests, since the receiver actions for these commands are fully user defined. The web user interface turned out to be suitable for a limited number of applications, but is not expected to be sufficient as a ``real'' controls system. However, by means of the {\em xoap} library, XDAQ offers a powerful C++ API for inter-application SOAP messaging. This could be used as an interface for more advanced controls applications. A cluster set-up controller application would be possible that communicates by SOAP messages with all other registered applications. A first simple test program, {\em MyController}, has been developed to gain experience with this API. Using the XDAQ application registry and SOAP messaging, it was possible to detect all running XDAQ user applications from the controller and to initialize them.
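The following fragment sketches how such a controller might build and post a ``Configure'' command with the {\em xoap} API. It is assembled from the common XDAQ examples and is not taken from {\em MyController}; the class name {\tt MyControlApp} is a placeholder, and the descriptor lookup as well as the exact {\tt postSOAP()} signature differ between XDAQ releases.
\begin{verbatim}
#include "xoap/MessageFactory.h"
#include "xoap/SOAPEnvelope.h"
#include "xoap/SOAPBody.h"

// build an empty SOAP message carrying a "Configure" command element
xoap::MessageReference MyControlApp::makeConfigureMessage()
{
  xoap::MessageReference msg = xoap::createMessage();
  xoap::SOAPEnvelope envelope = msg->getSOAPPart().getEnvelope();
  xoap::SOAPName command =
      envelope.createName("Configure", "xdaq", "urn:xdaq-soap:3.0");
  envelope.getBody().addBodyElement(command);
  return msg;
}

void MyControlApp::configureTarget()
{
  // destination_ is an xdaq::ApplicationDescriptor* taken from the
  // local application registry (lookup omitted here)
  xoap::MessageReference reply =
      getApplicationContext()->postSOAP(makeConfigureMessage(), *destination_);
  // ... inspect the SOAP reply ...
}
\end{verbatim}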
\subsection{I2O messaging} \label{I2O}
% here i2o experiences and summary
The XDAQ concept clearly separates the messaging protocol from the transport implementation layer. The user code just posts messages of a certain format to a target application, or receives messages in dedicated callback functions. The transport between the XDAQ applications is meant to be transparent for the user code. In fact, just by changing the peer transport definition in the XDAQ cluster configuration file, the same application may use different peer transport implementations without the need to recompile the code. Fig. \ref{fig:i2omessaging} illustrates the XDAQ messaging architecture.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {xdaqmessaging2.png}
\caption{{\em I2O} messaging and peer transport layers \cite{XDAQ-wiki}.}
\label{fig:i2omessaging}
\end{figure}
For the exchange of commands and status information, XDAQ applies SOAP \cite{SOAP} as protocol, while the transport over the network is handled by the {\em PeerTransportHTTP} library (see section \ref{SOAPtest}). For the experimental data stream, XDAQ chose the {\em Intelligent Input Output} ({\bf I2O}) format \cite{I2O} as the standard binary message format. The I2O structure features a standard data header, containing numerical sender and receiver information, fields for certain flags, and a variable function code number indicating which actions should be taken on receiving this message. The data field appended to the header is arbitrarily definable by the user. The XDAQ {\tt i2o} and {\tt i2o::utils} namespaces deliver methods to format I2O messages and to bind a user method as callback function to an incoming I2O message. The substructure of the I2O message and the reaction on receiving a message are left to the user. As default peer transport implementations for I2O, XDAQ offers the {\bf Peer Transport FIFO} for data exchange within one node and the {\bf Peer Transport TCP} for tcp/ip transport between cluster nodes. Additionally, an asynchronous peer transport for tcp/ip exists, the {\bf Peer Transport ATCP}, which uses multithreaded senders to gain performance.
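The user-side pattern for I2O messaging is sketched below for a hypothetical application. The binding and posting calls follow the standard XDAQ examples and are quoted from memory; the function code constant {\tt I2O\_TOKEN\_CODE}, the class name, and the member names are placeholders of our own.
\begin{verbatim}
#include "i2o/Method.h"
#include "toolbox/mem/MemoryPoolFactory.h"

void MyApp::bindCallbacks()   // e.g. called from the constructor
{
  // bind token() as callback for incoming I2O frames carrying the
  // user-chosen function code I2O_TOKEN_CODE
  i2o::bind(this, &MyApp::token, I2O_TOKEN_CODE, XDAQ_ORGANIZATION_ID);
}

// sending side: take a frame from a memory pool, fill the I2O
// header and payload, and post it to the destination application
void MyApp::sendFrame()
{
  toolbox::mem::Reference* ref =
      toolbox::mem::getMemoryPoolFactory()->getFrame(pool_, frameSize_);
  // ... fill I2O header (target address, function code) and data field ...
  getApplicationContext()->postFrame(ref, getApplicationDescriptor(), destination_);
}

// receiving side: the callback bound above is invoked by the peer
// transport for every arriving frame of this message type
void MyApp::token(toolbox::mem::Reference* ref)
  throw (i2o::exception::Exception)
{
  // ... process the payload ...
  ref->release();   // give the buffer back to its memory pool
}
\end{verbatim}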
\subsection{The {\em RoundTrip} benchmark} \label{RoundTripBasics}
%Description of Roundtrip example, what is measured, theory of my
%figures of merit ($c$, $\tau_{0}$).
Most tests of the I2O messaging were run with the standard {\em RoundTrip} example as distributed with the XDAQ release. This example was further adjusted (and renamed to {\em MyRoundTrip}), since special tests were necessary in the course of the development of the peer transport for uDAPL (section \ref{ptDAPL-Imp}).

{\bf Functionality:} The round trip benchmark is implemented in one class, {\em MyRoundTrip}, which runs both as sender and as receiver application on different nodes. The sender/receiver roles are distinguished by the instance number in the XML configuration file. The test run is controlled by finite state machines (section \ref{StateMachineTest}) that are triggered either by SOAP messages or by the GUI elements of the local web state machine. The {\em Configure} command executes the method {\tt ConfigureAction()}, which allocates memory and retrieves the handle for the connection from the environment set-up. On the {\em Enable} command, the sender posts a number of I2O message frames to the receiver. The {\em pipeLine} parameter in the set-up defines how many frames are sent initially. The {\tt token()} method, both on the sender and on the receiver side, is bound as a callback to the type of I2O message defined for this test. The receiver executes this method for each message arriving from the sender. To realize the round trip, {\tt token()} just exchanges the sender and receiver addresses in the message header and posts the same message frame back to the original sender. Thus the same messages travel back and forth between both {\tt token()} functions once the benchmark has been started. Additionally, the original sender uses the XDAQ {\em toolbox::PerformanceMeter} object to evaluate time and bandwidth for each transfer. The size of the transferred frames is increased systematically during the benchmark, with ranges and intervals as defined by the user in the XDAQ configuration file. All measurements are recorded in a measurement history map and are displayed on the {\em MyRoundTrip} web application ``home page''. These results can be viewed and stored from any web browser via the {\em HyperDAQ} interface (see \ref{HyperDAQtest}).

{\bf Relation between transfer time and bandwidth:} The bandwidth $B$ as calculated by the XDAQ rate meter is obtained from the current package size $P$ and the total package transfer time $\tau$:
\begin{equation} \label{eq-bpdef} B(P) = {\genfrac{}{}{}{}{P} {\tau}} \end{equation}
In the web display of the RoundTrip benchmark, the term ``latency'' is used for $\tau$. It should be pointed out that this value in fact means the complete time between initiating the package send (using the {\em postFrame} method) and the arrival of the package in the bound I2O callback function of the receiver process. When plotting $\tau$ as a function of the package size $P$, one observes that the transfer time grows linearly with the package size for larger packages:
\begin{equation} \label{eq-dTaudP} {\genfrac{}{}{}{}{d\tau} {dP}} = c \end{equation}
with $c$ being a constant in units of $\mu$s/kByte. This is equivalent to:
\begin{equation} \label{eq-taup} \tau(P) = c \cdot P + \tau_{0} \end{equation}
Combining (eq. \ref{eq-bpdef}) and (eq. \ref{eq-taup}), one gets for the bandwidth versus package size:
\begin{equation} \label{eq-bp} B(P) = {\genfrac{}{}{}{}{1}{{\left(c+{\tau_{0}/P}\right)}}} \end{equation}
The bandwidth versus package size can then be parametrized by two constants: the minimum latency $\tau_{0}$ and the slope $c$ of the latency increase with package size. For big packages, $P \to\infty$, the bandwidth converges to the inverse latency slope:
\begin{equation} \label{eq-bp-infty} B(P)_{P \to\infty} = {\genfrac{}{}{}{}{1} {c}} \end{equation}
Thus the latency slope $c$ directly reflects the limits of the transport layer (network) and is not due to performance losses of the messaging framework. This is actually observed in the measurements (sections \ref{RoundTripTestTCP}, \ref{RDMA-XDAQ}, \ref{ptDAPL-RoundTrip}, \ref{ptDAPL-SendRec}, \ref{ptDAPL-SendMultRec}, and \ref{ptDAPL-MultSendRec}). For small packages, $ P\cdot c \ll \tau_{0}$, the slope of the bandwidth is approximately:
\begin{equation} \label{eq-bp-zero} {\genfrac{}{}{}{}{dB(P)}{ dP} }_{P \to 0}= {\genfrac{}{}{}{}{1}{\tau_{0}}} \end{equation}
thus ruled only by the minimum latency.

\subsection{Roundtrip benchmark for Peer Transport TCP} \label{RoundTripTestTCP}
%Here native roundtrip benchmarks and results
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {RoundtripTCP.png}
\caption{Bandwidth versus package size for the peer transport tcp implementation, compared between the Ethernet and InfiniBand tcp stacks.}
\label{fig:tcproundtripbw}
\end{figure}
At first, the round trip was performed with the plain tcp peer transport between two nodes. It turned out that the decoupling between message and protocol layer worked well for this example: the peer transport tcp could be switched at startup between the Ethernet and InfiniBand networks just by editing the configuration file. Moreover, peer transport tcp and atcp could be exchanged just as easily without changing the {\em MyRoundTrip} source code. As a typical result of these measurements, Fig.~\ref{fig:tcproundtripbw} shows the bandwidth versus the frame size for tcp over Ethernet, in comparison with tcp over InfiniBand. The {\em pipeLine} length was set to 1 here, i.~e. there was only one package being reflected between both applications. As expected, InfiniBand shows the better performance. However, the theoretical limit of about $1$~GByte/s was not reached even for big packages. From a linear fit of (eq. \ref{eq-taup}) to the measured $\tau(P)$ data, one obtains the figures of merit presented in table \ref{tab-ether-ib-tcp}.
\begin{table}[htb]
\begin{center}
\begin{tabular}{|l|c|c|c|}\hline
Transport & $\tau_{0} [\mu\mbox{s}]$ & $c [\mu\mbox{s}/\mbox{kByte}]$ & $B_{P\to\infty} [\mbox{MByte/s}]$ \\ \hline\hline
Ethernet & 67.7 & 8.52 & 117 \\ \hline
InfiniBand & 60.5 & 3.53 & 283 \\ \hline
\end{tabular}
\caption{Zero latency and maximum bandwidth as derived from the fit of (eq. \ref{eq-taup}) to $\tau(P)$ for {\em RoundTrip} measurements with peer transport tcp. \label{tab-ether-ib-tcp}}
\end{center}
\end{table}
Together with eq. \ref{eq-bp-infty}, the transport limited bandwidth for the tcp stack of InfiniBand only reaches about $0.3$~\mbox{GByte/s}.
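Explicitly, inserting the fitted slopes of table \ref{tab-ether-ib-tcp} into (eq. \ref{eq-bp-infty}) gives the quoted bandwidth limits:
\begin{displaymath}
B_{P\to\infty}^{\mbox{\scriptsize Ethernet}} = \frac{1}{8.52~\mu\mbox{s/kByte}} \approx 117~\mbox{MByte/s},
\qquad
B_{P\to\infty}^{\mbox{\scriptsize InfiniBand}} = \frac{1}{3.53~\mu\mbox{s/kByte}} \approx 283~\mbox{MByte/s}.
\end{displaymath}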
Moreover, the minimum latency of $\approx 60~\mu\mbox{s}$ is also much worse than the $\approx 5~\mu\mbox{s}$ measured with the plain InfiniBand library transport benchmarks. Results of round trip measurements for the peer transport via uDAPL implementation are presented in section \ref{ptDAPL-RoundTrip}.

\subsection{XDAQ Roundtrip for Remote DMA with uDAPL} \label{RDMA-XDAQ}
%Short form: first attempt to combine both libs. Result is that they
%are compatible.
The benchmark results (section \ref{RoundTripTestTCP}) show that the InfiniBand data transport within the XDAQ framework requires a better peer transport layer than tcp/ip to exploit the capabilities of the network. The goal was to implement an XDAQ peer transport library based on the uDAPL interface (section \ref{ptDAPL-Imp}). The C++ wrapper classes that had been developed for the general InfiniBand tests (section \ref{uDAPL-cpp}) should be applied for this purpose. As a first test, the round trip benchmark as described in section \ref{RoundTripBasics} was modified such that the performance measurement and state machine controls were kept as before, but the data transport did {\em not} happen by I2O messaging over the peer transport layer. Instead, the basic uDAPL interface class (\ref{uDAPL-cpp}) was used directly. It turned out that there was no conflict between the XDAQ and uDAPL libraries when compiled and executed together.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {RoundtripRDMA.png}
\caption{Bandwidth versus package size from the roundtrip benchmark of plain RDMA within the XDAQ framework}
\label{fig:rdmaroundtripbw}
\end{figure}
For the benchmark, the sender {\em ConfigureAction()} opens a remote DMA connection to the receiver application. A thread in the sender application then transfers packages of increasing size and measures bandwidth and transfer time. From these measurements, one gets values as shown in Fig.~\ref{fig:rdmaroundtripbw}. The corresponding parameters from the fit (eq. \ref{eq-taup}) are:
\centerline{$\tau_{0}=25~\mu\mbox{s}$ and $B_{P\to\infty} = 955~\mbox{MByte/s}$.}
Compared with table \ref{tab-ether-ib-tcp}, this is already a big improvement. However, the performance of the plain uDAPL tests (section \ref{uDAPL-testapp}) was not reached, since the set-up of the uDAPL buffer queues was not optimized here. Moreover, for a real data transfer application in XDAQ, the I2O messaging should be used. Thus the development of a peer transport uDAPL library was a strong requirement.

\subsection{PeerTransportDAPL implementation} \label{ptDAPL-Imp}
%We did it! Explain how, only show last implementation, brief form.
%Maybe UML class diagram here?
Our implementation of a peer transport library for uDAPL was originally based on the architecture of the existing library for tcp/ip, {\em pt/tcp} \cite{XDAQ-wiki}. Most of the classes and their relationships were at first just adopted and renamed from namespace ``tcp'' to namespace ``ptdapl''. The usage of tcp/ip {\em socket} calls was replaced by the corresponding functionality as provided by the uDAPL library. In particular, our general uDAPL interface class {\em TBasic} (\ref{uDAPL-cpp}) was embedded into the {\em ptdapl::Channel} class. Figure \ref{fig:ptdaplclasses} shows the UML class diagram overview of the final implementation. In the following, we describe the ptdapl classes in detail.
\begin{description}
\item[PeerTransportDAPL:] The owner of {\em PeerTransportSender} and {\em PeerTransportReceiver}. These are registered in the general {\em PeerTransportAgent} as available from the pt framework. Offers a web and CGI interface for interactive configuration. Some connection parameters for uDAPL may be set in the configuration file. These are passed via the peer transport sender/receiver instances to the channels that use them. Method {\tt reset()} applies changes to the sender and receiver components.
\item[ptdapl::Channel:] This is the base class of {\em ptdapl::Transmitter} and {\em ptdapl::ReceiverLoop}.
It aggregates the {\em TBasic} class as interface to the uDAPL functionality.
\end{description}
\begin{figure}[htb]
\includegraphics[angle=90,width=1.0\textwidth, height=1.0\textheight] {ptdapl.png}
\caption{Class diagram of the peer transport DAPL ({\em ptdapl}) implementation for XDAQ}
\label{fig:ptdaplclasses}
\end{figure}
\begin{description}
\item[ptdapl::PeerTransportSender:] The sender of data.
\begin{compactitem}[$\bullet$]
\item Method {\tt svc()} runs in a waiting workloop (as background thread). It waits on the output queue that contains the {\em ptdapl::PostDescriptor} objects. The memory frames defined in such a descriptor are transferred by method {\tt send()} of the corresponding {\em ptdapl::Transmitter} (also referenced in the post descriptor structure).
\item Method {\tt post()} is used by the outside client to generate a new {\em PostDescriptor} in the output queue.
\item {\tt getMessenger()} is a factory method for the appropriate messenger that shall manage the sending. This is finally used by the application context messenger cache.
\end{compactitem}
\item[ptdapl::Transmitter:] The sending client. Methods {\tt connect()} and {\tt send()} are implemented with \newline {\tt TBasic::SendConnectionRequest()} and {\tt TBasic::PostSendBuffer()}, respectively.
\item[ptdapl::I2OMessenger:] Manages the sending functionality. The owner of the {\em ptdapl::Transmitter}.
\begin{compactitem}[$\bullet$]
\item It is used by {\tt xdaq::ApplicationContextImpl::postFrame()}. This is the recommended user interface to send I2O frames. The application context instance has a {\em messengerCache} which keeps the messengers by originator and destination tags (the {\em xdaq::ApplicationDescriptor}s of the {\tt postFrame()} arguments). For each pair of originator and destination there is one messenger. If a messenger for a pair does not exist, the messenger cache creates it, using the factory method {\tt getMessenger()} of {\em PeerTransportSender}.
\item Method {\tt send()} internally calls {\tt post()} of the corresponding {\em PeerTransportSender} that created it.
\item Method {\tt createAddress()} generates address references for the sender and receiver nodes from URL and service.
\end{compactitem}
\item[ptdapl::Address:] Keeps the connection address parameters. We added the method {\tt getPortNum()} to obtain the connection port number by value. This parameter is used here as the port number for the uDAPL connection.
\end{description}
%\clearpage
\begin{description}
\item[ptdapl::PeerTransportReceiver:] The receiver of data on each node. It has a vector of {\em ptdapl::ReceiverLoop} objects; these are activated in method {\tt start()} and cancelled in {\tt stop()}. Method {\tt config()} generates a new {\em ReceiverLoop} object in this vector for each receiver node address (network endpoint). Additionally, the base class {\em pt::PeerTransportReceiver} has a map of {\em i2o::Listener} objects by name. The listener of name ``i2o'' is used for all {\em ptdapl::ReceiverLoop} objects, which get its reference during {\tt config()}.
\item[ptdapl::ReceiverLoop:] The receiving server. It has two working threads:
\begin{compactitem}[$\bullet$]
\item Thread {\tt connectloop()} establishes the connection: it creates a uDAPL endpoint whenever requested by a new sender client. The {\em ReceiverLoop} instance may have many endpoints, each connected to another remote {\em Transmitter}.
\item Thread {\tt process()} does the receiving. It waits for a uDAPL receive event from any existing endpoint.
On receive, the incoming endpoint receive buffer is passed to the {\em i2o::Listener}, which forwards the message to the destination callback function.
\end{compactitem}
\item [i2o::Listener:] Handles incoming I2O messages (frames). This is a general messaging framework class, not part of the {\em ptdapl} library.
\begin{compactitem}[$\bullet$]
\item The listener is implemented in the subclass {\em i2o::utils::Dispatcher}. This dispatcher is created and added to the peer transport agent in the constructor of the XDAQ {\em Executive} class. There are also other dispatchers for {\em xgi} and {\em SOAP}.
\item Method {\tt processIncomingMessage(msg)} does the actual work: the I2O message contains target and function ids. The dispatcher finds the {\em i2o::MethodSignature} reference by these ids from the application. With {\tt i2o::MethodSignature::invoke(msg)} the I2O message {\tt msg} is passed to the callback method, as registered in the user code by {\tt i2o::bind()}.
\end{compactitem}
\end{description}
\clearpage
{\bf During several development cycles with continuous improvements, additional classes and functionality dedicated to the uDAPL transport were introduced:}
\begin{description}
\item [ptdapl::EndpointBuffers:] New class that manages the uDAPL send and receive buffers of a connection, embedded into regular XDAQ memory frames.
\begin{compactitem}[$\bullet$]
\item It is owned by a {\em Channel} both in the sender and in the receiver implementation: the {\em Transmitter} has one {\em EndpointBuffers} pool, whereas a {\em ReceiverLoop} holds one {\em EndpointBuffers} object for each incoming endpoint.
\item Pool of XDAQ memory frames that internally wrap already assigned uDAPL send and receive buffers, to minimize copying between XDAQ and uDAPL memory. It holds parallel vectors of the uDAPL buffers, the XDAQ frames, and their {\em toolbox::mem::Reference}s.
\item Subclass of the XDAQ memory pool {\em toolbox::mem::Pool}. Overrides the standard methods {\tt alloc} and {\tt release}. When the application calls {\tt release()} on an XDAQ memory frame from this pool (after processing the received package), the allocated memory is not freed; instead, the buffer index is pushed into a queue of free buffers and the XDAQ reference is refreshed. Similarly, after the send of a frame has completed, it is just marked as ``free'' by queueing its index. So in both use cases the buffer is immediately available again, without performance loss due to re-allocation and registration as uDAPL endpoint buffer (see the sketch at the end of this subsection).
\item Method {\tt findBuffer(toolbox::mem::Reference*)}: checks whether a memory reference belongs to an already created uDAPL input or output buffer. In this case, the buffer can be sent without first copying the contents from the XDAQ frame to the uDAPL buffer. The owner class {\em Channel} also offers a virtual {\tt findBuffer()}, which is implemented for {\em Transmitter} and {\em ReceiverLoop}, scanning all contained {\em EndpointBuffers}. From outside, the {\tt findBuffer()} method is accessible in the {\em PeerTransportReceiver} and {\em PeerTransportDAPL} aggregations, too.
\item Method {\tt addBuffer(toolbox::mem::Reference*)}: adds a new uDAPL endpoint buffer that uses an existing XDAQ memory frame. This is especially useful for the {\em RoundTrip} application, where a buffer from the receiver endpoint can also be assigned to the sender endpoint. On the next round trip cycle, this reference will be found as an existing send buffer and transferred without any data copying.
\end{compactitem}
\medskip
\item [ptdapl::Transmitter:] Direct buffer sending and asynchronous releasing:
\begin{compactitem}[$\bullet$]
\item New method {\tt send(TMemorySpace* dbuf, int len)}: posts an existing uDAPL buffer directly; the old method {\tt send(void* buf, int len)} still performs a {\tt memcpy} into a buffer of the internal {\em EndpointBuffers} pool before sending. In the final implementation, both methods return before the send completion has been acknowledged by uDAPL. After the send call, the {\em PeerTransportSender} does not {\tt release()} the frame itself; this is done asynchronously by a second thread.
\item New thread (workloop) {\tt sendCheckLoop()}: waits for any send complete event from the uDAPL interface. This required adding an appropriate implementation in the basic uDAPL interface, \newline {\tt TBasic::WaitSendEVD()}. On a send complete event, the thread gets the buffer reference from the posted uDAPL cookie and {\tt release()}s this frame. {\em Transmitter} had to be changed to inherit from {\em toolbox::lang::Class} in order to run a method as an XDAQ workloop.
\end{compactitem}
\item [PeerTransportDAPL:] \label{ptdaplgetsendframe} Additional method {\tt getSendFrame(context, from, to)}: returns the memory reference of the next free send buffer for the given descriptors of the sender (``from'') and receiver (``to'') application. The syntax is similar to the {\tt ApplicationContext::postFrame(...)} called by the user code for sending. Thus, the user code may directly obtain a reference to a uDAPL send buffer from the peer transport implementation. This is useful to avoid another copying of data, since the user may directly modify the assigned uDAPL memory contents before posting. However, to use this feature, downcasting of the application pointer to {\em PeerTransportDAPL} is necessary in the user code. This breaks the XDAQ philosophy of modularity and the decoupling between application and peer transport, though. In fact, the {\em MyRoundTrip} application (section \ref{ptDAPL-RoundTrip}) had to be modified such that the {\em PeerTransportDAPL} instance is retrieved at configuration time via its id number from the XDAQ application registry. Therefore, the XML configuration parameter {\em peerTransportID} was introduced in the {\em MyRoundTrip} set-up. Finally, the configuration parameters of the {\em PeerTransportDAPL} allow the number of send and receive buffers in the pool to be specified, and the {\tt memcpy} functionality for sending and receiving to be switched on or off.
\end{description}
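The central idea of the {\em EndpointBuffers} pool, recycling pre-registered buffers instead of re-allocating them, is sketched below in a strongly simplified form. This is only an illustration of the strategy, not the actual {\em ptdapl} code, which additionally synchronizes the free-buffer queue between threads and keeps the uDAPL registration data per buffer.
\begin{verbatim}
#include <queue>
#include <vector>
#include <cstddef>

// Simplified stand-in for a pool of pre-registered transfer buffers:
// release() does not free memory, it only re-queues the buffer index,
// so the buffer is immediately available for the next frame.
class RecyclingPool
{
public:
  RecyclingPool(std::size_t nbuf, std::size_t bufsize)
    : buffers_(nbuf, std::vector<char>(bufsize))
  {
    for (std::size_t i = 0; i < nbuf; ++i)
      free_.push(i);                       // initially all buffers are free
  }

  // hand out the next free, already registered buffer (no allocation)
  char* alloc(std::size_t& index)
  {
    if (free_.empty()) return 0;           // caller has to wait or retry
    index = free_.front();
    free_.pop();
    return &buffers_[index][0];
  }

  // "releasing" a frame only marks its buffer as free again
  void release(std::size_t index)
  {
    free_.push(index);
  }

private:
  std::vector< std::vector<char> > buffers_; // stands in for uDAPL buffers
  std::queue<std::size_t>          free_;    // indices of unused buffers
};
\end{verbatim}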
\subsection{RoundTrip with PeerTransportDAPL} \label{ptDAPL-RoundTrip}
%Roundtrip, etc. main-test.pdf
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {RoundtripPTDAPL.png}
\caption{Bandwidth versus package size from the roundtrip benchmark of I2O over {\em PeerTransportDAPL} (var.~9)}
\label{fig:ptdaplroundtripbw}
\end{figure}
In the course of the development of the {\em PeerTransportDAPL}, the {\em MyRoundTrip} benchmark (section \ref{RoundTripBasics}) was applied to check the performance and to spot implementation disadvantages. The round trip application was mostly unchanged with respect to the tcp/ip peer transport measurements (section \ref{RoundTripTestTCP}). For the final implementations, however, access to the {\em PeerTransportDAPL} was introduced into {\em MyRoundTrip}, allowing the user code to write directly into the preassigned uDAPL send buffers (section \ref{ptdaplgetsendframe}).

The following are the results of the best optimized {\em PeerTransportDAPL} implementation (variant 9). Figure \ref{fig:ptdaplroundtripbw} shows the bandwidth versus package size for the round trip of 5 initial packages and 20 receive buffers. The maximum number of uDAPL buffers was set to 32 here. Figure \ref{fig:ptdaplroundtriptau} plots the corresponding transport times $\tau(P)$. From the fit (eq. \ref{eq-taup}) to this curve, we get the parameters of merit:
\centerline{$\tau_{0}=5.5~\mu\mbox{s}$ and $B_{P\to\infty} = 935~\mbox{MByte/s}$.}
This is obviously better than the peer transport via tcp and almost reaches the speed of the plain uDAPL tests.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {RoundtripPTDAPL-tau.png}
\caption{Transport time versus package size from the roundtrip benchmark of I2O over {\em PeerTransportDAPL} (var.~9). The red line shows the fit of (eq. \ref{eq-taup}) to the linear part of the curve.}
\label{fig:ptdaplroundtriptau}
\end{figure}
However, the {\em real} minimum latency $\tau_{min}\simeq 20~\mu\mbox{s}$ is larger than the fitted characteristic $\tau_{0}\simeq 5~\mu\mbox{s}$. For small packages, the transport is just limited by the XDAQ framework latency, which is invariant with package size. The fitted equation (\ref{eq-taup}) is valid for big packages only, where the uDAPL maximum speed rules the transfer time. Since this peer transport implementation uses two threads for asynchronous posting and releasing of the send buffers, the overall transport time shows only the maximum of both effects: if the uDAPL transfer is faster (small packages), the latency to invoke {\em postFrame} and to get the next free buffer from the pool is the limit. Otherwise (big packages), the XDAQ buffer management is always ready before the network data transfer has completed. In contrast to this, a previous implementation of the {\em PeerTransportDAPL} (variant 8) with only one sender thread showed a curve $\tau(P)$ that was the {\em sum} of both effects over the complete range. This was due to the fact that the sender thread had to wait for the uDAPL transfer complete event of each frame before it could process the next send frame. Thus, the linear part of the curve was shifted upwards by the constant offset $\tau_{min}\simeq 20~\mu\mbox{s}$.

\clearpage
\subsection{Sender-Receiver with PeerTransportDAPL} \label{ptDAPL-SendRec}
%Roundtrip, etc. main-test.pdf
Although the RoundTrip is useful for benchmarking, it differs from the future applications of the DAQ system, where data is sent by one application and received by another. For such a set-up, the shared buffering for sender and receiver endpoints, as implemented for the {\em PeerTransportDAPL}, will not yield any performance gain. Instead, the management of the buffer pool becomes crucial for the latency. To measure such a situation, we implemented a pure sender application as class {\em MyDataSource}, and a mere receiver as {\em MyDataDrain}. Both were initially developed as modifications of the {\em MyRoundTrip} class:
\begin{description}
\item [MyDataSource:] Once enabled by the controlling state machine, the {\tt Benchmark()} workloop (thread) posts frames to the receiver {\em MyDataDrain}. The sending pipeline loop of {\em MyRoundTrip} is still used. Integer numbers corresponding to the pipeline position are written to the I2O data field. The performance meter is called at the beginning of each pipeline cycle.
When the performance meter has reached its number of samples, the package size is increased (as in {\tt MyRoundTrip::token()}). The new send frames are directly requested from the peer transport DAPL instance (section \ref{ptdaplgetsendframe}). If the peer transport instance is not found (e.g. because of a wrong application id number in the set-up file), {\em MyDataSource} allocates send frames from the standard XDAQ memory pool\footnote{In this case, a problematic polling behaviour was observed, because the standard memory pool soon runs out of preallocated buffers. The {\tt Benchmark()} function then has to react on the exception thrown and retry to allocate a new frame. This drops the performance by a factor of 2.}.
\item [MyDataDrain:] Reacts on the received I2O packages in the {\tt token()} callback only. With knowledge of the sender pipeline length (from the configuration file), it invokes the performance meter for each leading packet of the pipeline cycle. The contents of each I2O frame (pipeline entry number) are read and may be checked for data integrity.
\end{description}
As in the RoundTrip, both classes provide a web interface to display the benchmark results. The independent performance meter measurements in the sender and receiver applications yielded the same results. This is to be expected for ``conservation of data current'' reasons. Moreover, also for several senders and receivers (``one to all'', see \ref{ptDAPL-SendMultRec}; ``all to one'', see \ref{ptDAPL-MultSendRec}), the sums of the sender and receiver results matched. Therefore, for the ``one to one'' set-up, we evaluate only the data source performance.
%\clearpage
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.82\textwidth] {var9_q100_rbufs_bw.png}
\caption{Bandwidth versus package size for the 1 to 1 data transfer using {\em PeerTransportDAPL} (var.~9). $N_{send}=50$, $N_{q}=100$.}
\label{fig:ptdaplsourcedrainbw}
\end{figure}
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.82\textwidth] {var9_q100_rbufs_lat.png}
\caption{Transfer time versus package size for the 1 to 1 data transfer using {\em PeerTransportDAPL} (var.~9). $N_{send}=50$, $N_{q}=100$.}
\label{fig:ptdaplsourcedraintau}
\end{figure}
%\clearpage
Several measurements were done with variation of the send and receive buffer numbers. Besides the number of buffers in the endpoint pools, $N_{send}$ and $N_{rcv}$, the predefined maximum uDAPL endpoint buffer queue length $N_{q}$ is crucial. It is not possible to post more than $N_{q}$ buffers to an endpoint simultaneously, which limits the effectively usable buffer number of the {\em EndpointBuffers} pool. By default, half of the existing {\em EndpointBuffers} frames are assigned to each endpoint in advance, up to $N_{q}$. $N_{q}$ is set at compile time of the uDAPL wrapper library. While we used $N_{q}=32$ for the initial round trip benchmarks, we increased it to $N_{q}=100$ for the sender-receiver measurements. It turned out that the performance improved with an increasing number of receive buffers, up to $N_{rcv}\simeq 500$. Beyond that it drops again, probably due to the overhead of searching the next free buffer in the pool (values for $N_{q}=100$). Similar behaviour was found for the send buffers. Figure \ref{fig:ptdaplsourcedrainbw} shows the bandwidth versus the frame size for these measurements; figure \ref{fig:ptdaplsourcedraintau} shows the corresponding transport time.
Observations:
\begin{compactitem}
\item As for the RoundTrip, the $B(P)$ characteristic has two different domains:
\begin{enumerate}
\item For smaller packages ($< 10$~kByte), the bandwidth increases approximately linearly with the package size, i.e. the transport time per package is constant. Here the transfer is limited by the XDAQ buffer management, because the uDAPL transfer is faster than the finding/releasing of the buffers. In this domain some points show big fluctuations from the ``ideal curve''.
\item For bigger packages ($>10$~kByte), the bandwidth saturates very quickly towards the theoretical transfer limit. Here the transport is limited by the uDAPL interface. Even here we see some points with big fluctuations, but the overall line is straighter than in domain 1.
\end{enumerate}
\item For few receive buffers (fewer than the uDAPL queue length), the linear increase region of $\tau$ starts later and the ``constant $\tau$'' region shows more fluctuations. Here the uDAPL receive queue is not fully used, since only half of the XDAQ buffers are posted in advance; full uDAPL queueing is only reached for $N_{rcv}> 200$.
\item For $N_{rcv} \geq 1000$, there is almost a ``state transition'' step between the constant latency region and the linear latency region. At $P\simeq 14$~kByte, the transfer time drops by almost $20~\mu$s. This corresponds to a very steep edge in the bandwidth characteristics. As an explanation, one can think of an extra delay when suddenly one of the involved threads has to wait for a synchronizing signal. These can be the sender and clean-up threads (sender application), and here most likely the receiver loop thread and the I2O callback thread (receiver application). For big packages, the receiver loop thread (or the sender, respectively) never has to wait for the queue of next free buffers, because there is always a free buffer ready in the queue before the uDAPL transfer has finished. At a certain package size (the minimum of the linear $\tau$ region), both threads are ``synchronized'': for uDAPL it takes the same time to receive (send, respectively) a package as for XDAQ to release another frame and to handle the memory pool management. If the package size decreases just a little, the uDAPL transfer is completed before the queue of free buffers has a new buffer ready. Then the receiver (sender, respectively) thread goes into a thread condition {\em wait()} when trying to get the next queue entry. It continues no sooner than the other thread pushes another frame into the queue, thus waking up the receiver (sender, respectively) by a condition signal. The observed latency difference at the edge might correspond to the time required for the queue to schedule this signal, in comparison to a plain {\tt pop()} of an existing queue element.
\item For a big XDAQ receive buffer pool, the constant latency is large ($\tau_{min} > 32~\mu\mbox{s}$). This can be explained by an increasing average search time for the next free buffer in the XDAQ memory pool.
\item The fit of the latency curve (eq. \ref{eq-taup}) gives the parameter $\tau_{0} < 0.1~\mu\mbox{s}$, i.e. the linear fit almost crosses the origin. The slope yields $c\approx 1.07~\mu\mbox{s/kByte}$, thus a bandwidth limit of $B_{max}\approx 935~\mbox{MByte/s}$ is achieved.
\end{compactitem}

\subsection{One sender for multiple receivers with PeerTransportDAPL} \label{ptDAPL-SendMultRec}
A DAQ application will most probably not transfer the data between just two nodes; instead, each sender node has to post parts of the event to many different builder nodes.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {var9_1to3_bw.png}
\caption{Bandwidth versus package size for the 1 to 3 data transfer using {\em PeerTransportDAPL} (var.~9). $N_{send}=50$, $N_{rcv}=300$, $N_{q}=100$.}
\label{fig:ptdapl123bw}
\end{figure}
To test such a situation, the set-up with a slightly modified sender application was as follows:
\begin{compactitem}
\item The {\em MyDataSource} application checks from the {\em destinationID} number where to send the data. In {\tt ConfigureAction()}, the application descriptors with matching {\em tid}s are found by iteration over the {\em ContextTable}; if a target application exists as described in the configuration file, it is put into a {\em std::vector} of destinations.
\item The {\tt Benchmark()} sender thread distributes the data in a ``round robin'' fashion to all known destinations. Each receiver application gets a complete output pipeline before the next receiver is served.
\item If a {\em tid} is unknown in the context, an exception indicates a warning (e.g. if one of the known nodes has only a sender, but no receiver).
\item Performance measurements are done independently in the sender and in all receivers.
\end{compactitem}
A typical result is plotted in figure \ref{fig:ptdapl123bw}. Observations:
\begin{compactitem}
\item The sender reaches the same performance as in the one-to-one set-up before (latency and bandwidth limit).
\item Each receiver gets one third of the sender bandwidth, as expected.
\item The values of all three receivers match very exactly (bandwidth, latency). All curves show the same structures and deviations from the ``expected line''. This may stem from overall fluctuations of the network or the switch, since it seems not to be related to the individual load of the receiver machines.
\end{compactitem}

\clearpage
\subsection{Multiple senders for one receiver with PeerTransportDAPL} \label{ptDAPL-MultSendRec}
Vice versa, performance tests were done with several sender applications addressing one receiver. Here it shows whether the {\em ReceiverLoop} organization of the peer transport works well for more than one incoming endpoint. The application set-up was the same as described in section \ref{ptDAPL-SendMultRec}.
\begin{figure}[htb]
\centering\includegraphics[angle=0,width=.8\textwidth] {var9_3to1_bw.png}
\caption{Bandwidth versus package size for the 3 to 1 data transfer using {\em PeerTransportDAPL} (var.~9). $N_{send}=50$, $N_{rcv}=300$, $N_{q}=100$.}
\label{fig:ptdapl321bw}
\end{figure}
Figure \ref{fig:ptdapl321bw} illustrates some results. Especially for small package sizes, the multiple senders show larger differences in bandwidth than the multiple receivers do in the opposite case (\ref{ptDAPL-SendMultRec}). Still, the sum of all sender bandwidths matches the receiver bandwidth very exactly. Above a threshold of $P\simeq 8$~kByte, each sender contributes about one third of the receiver bandwidth.