# Circle Hough CUDA Core Code

This repository holds the code for the GPU version of the Circle Hough algorithm. Just the GPU core; the rest is at the internal IKP repository.

## Source

Currently at this BitBucket repo:

## Compilation

After downloading/cloning the source, follow the usual steps for compilation with CMake:

```bash
mkdir -p build && cd build
cmake ../
```

The code is split into the Circle Hough core (`CircleHough.cu`) and the standalone interface (`chStandalone.cu`). CMake takes care of compiling the correct version with the appropriate compiler variable definitions, but note that this project is also intended to be run from within `PandaRoot`.

## Input data

By default, the file containing hit data is in the `data/` directory. A small example file is provided. A much larger one can be found at `/home/ikp1/bianchi/ikpgpu/data-ch/1e3-evt-10x.txt`.

The following parts of this readme seem to be **outdated**.

## Workflow

The `scripts/` folder contains the prototypes for the `run.sh` run script. Each runfile is associated with a `runid` (run01, run02, ...) corresponding to a different configuration (GPU card, type of test, ...). The `CMakeLists.txt` file contains the instructions for CMake to copy the particular runfile to `run.sh` in the build folder.

Usually this means proceeding as follows:

1. Copy/create the runfile in the `scripts/` folder.
2. Edit `CMakeLists.txt` to select the correct file to copy.
3. From the `build` directory, run `make && ./run.sh` to create `run.sh`, compile the CUDA file (if needed) and run the profiling.
4. CSV files are written by default to `${CMAKE_SOURCE_DIR}/csv/`.

## Example runfiles

- `run03` loops over the number of hits read from the input and prints the GPU trace (relevant `nvprof` flag: `--print-gpu-trace`). This creates a timeline of the GPU activity and is the basis for analyzing the timing of the various portions of the code.
- `run08` collects all available kernel metrics (relevant `nvprof` flag: `--metrics all`). This has turned out to be the most convenient method: collect all possible metrics to CSV files, then select the interesting ones later, during the analysis phase.

## Extending to other code

Modifications to the CUDA source are required to run a similar system with other applications.

- To select the portion of the code to be profiled, and to exclude the CUDA initialization timing overhead, use the following in combination with the `--profile-from-start off` flag:

```c++
cudaFree(0);         // First call to the CUDA API: initializes drivers, creates contexts, ...
cudaProfilerStart(); // Exclude the init overhead from the performance metrics

// Portion of the program to be timed

cudaProfilerStop();
```

Other useful flags:

- To use the same time unit consistently: `--normalized-time-unit us` (the same option is unfortunately not available for the size units, which requires a workaround during the analysis phase).
- To avoid the "API call" `nvprof` mode (messier to analyze): `--profile-api-trace none`

## Notes

- Check that the `nvcc` flags in `CMakeLists.txt` are compatible with the particular card (i.e. a supported architecture).
- At the moment, settable parameters (number of angles, number of hits, ...) are set with environment variables, so the program must be run from `run.sh`; otherwise it segfaults. I intend to add logic with default values and/or to set the parameters via command-line flags to circumvent this.
- At the moment, the `remove-nvprof-header()` function, which uses `sed` to remove the lines starting with `===` at the beginning of the CSV file, is essential for reading the files with `pynvprof`. In the future, I'll implement a more robust way of doing so.
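The header-stripping step described in the last note can be sketched as a small shell function. This is a minimal sketch, not the actual `remove-nvprof-header()` from the run scripts, which may differ; the underscore spelling is used here for portability.

```bash
#!/usr/bin/env bash
# Sketch of a header-stripping helper like the one described in the notes.
# nvprof prepends banner lines (here assumed to start with "===") to its
# CSV output; dropping them leaves a file that CSV readers can parse.
remove_nvprof_header() {
    sed -i '/^===/d' "$1"   # delete every line starting with "===", in place
}

# Example (hypothetical path): clean a profiling CSV before analysis
# remove_nvprof_header csv/run08-1000.csv
```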