# Parallel Processing

Note

Julia supports high-performance parallel processing from the ground up. Depending on the configuration, Caesar.jl can utilize a combination of four styles of multiprocessing: i) separate-memory multi-process; ii) shared-memory multi-threading; iii) asynchronous shared-memory (forced-atomic) co-routines; and iv) multi-architecture such as JuliaGPU. As of Julia 1.4, the most reliable method of loading all code into all contexts (for multi-processor speedup) is as follows.

## Multiprocessing

Make sure the environment variable `JULIA_NUM_THREADS` is set, either as a default or per call; 4 is a recommended starting point.

```bash
JULIA_NUM_THREADS=4 julia -O3
```

In addition to multithreading, Caesar.jl uses multiprocessing to distribute computation during the inference steps. Following standard Julia practice, more processes can be added as follows:

```julia
# load the required packages into procid()==1
using Flux, RoME, Caesar, RoMEPlotting

# then start more processes
using Distributed
addprocs(4) # or start Julia with the -p flag, e.g. julia -p4

# now make sure all code is loaded everywhere (for separate memory cases)
@everywhere using Flux, RoME, Caesar
```

It might also be convenient to warm up some of the just-in-time (JIT) compilation:

```julia
# solve a few graphs, etc., to get the majority of the solve code compiled before running a robot
[warmUpSolverJIT() for i in 1:3];
```
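
For a quick end-to-end check that the worker processes are participating, a canonical example graph can be built and solved. This is a minimal sketch assuming a recent RoME version where `generateGraph_Hexagonal` is available (older releases name it `generateCanonicalFG_Hexagonal`):

```julia
# build RoME's canonical hexagonal example graph and solve it;
# the solver distributes inference work across the available processes
fg = generateGraph_Hexagonal()
solveTree!(fg);
```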

## Start-up Time

The best way to avoid compile time (when not developing) is to use the established Julia "time to first plot" approach based on PackageCompiler.jl; more details are provided at Ahead of Time compiling.
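
As a rough sketch (assuming Caesar.jl is installed in the active environment; the image file name is arbitrary), a custom system image can be built once and reused across sessions:

```julia
using PackageCompiler

# bake Caesar into a custom system image; building is slow, but it
# removes most package load and compile latency at start-up
create_sysimage([:Caesar]; sysimage_path="CaesarSysimage.so")
```

Julia is then started with that image via `julia -J CaesarSysimage.so`.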

## Multithreading

Julia has strong support for shared-memory multithreading. The most sensible breakdown of threaded work is either within each factor calculation or across the individual samples of a factor calculation. Either of these cases requires some special consideration.

A factor residual function can itself be broken down further into threaded operations; for example, see the many features available in JuliaSIMD/LoopVectorization.jl. It is recommended to keep memory allocations at zero, since the solver code will call the factor sampling and residual functions multiple times in random access. Also keep in mind the interaction between conventional thread-pool balancing and the newer PARTR cache-sensitive automated thread scheduling.
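
As an illustrative sketch (the function and buffer names here are hypothetical), a simple residual kernel can be vectorized in place without allocating:

```julia
using LoopVectorization

# zero-allocation SIMD kernel over preallocated, equal-length buffers
function residual!(res::Vector{Float64}, pred::Vector{Float64}, meas::Vector{Float64})
  @turbo for i in eachindex(res)
    res[i] = pred[i] - meas[i]
  end
  return res
end
```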

IncrementalInference.jl internally has the capability to span threads across samples in parallel computations during convolution operations. Keep in mind which parts of the residual factor computation touch shared memory. Likely the best course of action is for the factor definition to pre-allocate `Threads.nthreads()` many memory blocks for the factor's in-place operations.

To use this feature, IIF must be told that there are no data-race concerns with a factor. The current API uses a keyword argument on `addFactor!`:

```julia
addFactor!(fg, [:x0; :x1], MyFactor(...); threadmodel=MultiThreaded)
```

Warning

The current IIF factor multithreading interface is likely to be reworked/improved in the near future (penciled in for 1H2022).

See the page Custom Factors for details on how factor computations are represented in code. Regarding threading, consider for example `OtherFactor.userdata`. Residual calculations from different threads might create a data race on `userdata` during some volatile internal computation. In that case it is recommended to instead use `Threads.nthreads()` and `Threads.threadid()` to make sure the shared-memory issues are avoided:

```julia
struct MyThreadSafeFactor{T <: SamplableBelief} <: IIF.AbstractManifoldMinimize
  Z::T
  inplace::Vector{MyInplaceMem}
end

# helper constructor, pre-allocating one scratch memory block per thread
MyThreadSafeFactor(z::SamplableBelief) = MyThreadSafeFactor(z, [MyInplaceMem() for _ in 1:Threads.nthreads()])

# in the residual function, just use: thr_inplace = cfo.factor.inplace[Threads.threadid()]
```