GPU-based inference in Bayesian networks requires performing a tensor product of several functions and summation (contraction) over shared dimensions.
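To make the operation concrete, here is a minimal CPU-side sketch of a sum-product over small discrete factors. It is not the archived CUDA/OpenMP kernel; the function name `sum_product`, the dictionary-based table layout, and the example factors `f` and `g` are all assumptions made for illustration.

```python
from itertools import product

def sum_product(factors, card, keep):
    """Multiply all factors and sum out every variable not listed in `keep`.

    factors: list of (vars, table); table maps a tuple of values to a float.
    card:    dict mapping each variable name to its number of states.
    keep:    tuple of variables that remain in the result.
    (Hypothetical layout chosen for readability, not for speed.)
    """
    all_vars = sorted({v for vs, _ in factors for v in vs})
    summed = [v for v in all_vars if v not in keep]
    order = list(keep) + summed
    result = {}
    # Enumerate every joint assignment of all variables involved.
    for assignment in product(*(range(card[v]) for v in order)):
        env = dict(zip(order, assignment))
        term = 1.0
        for vs, table in factors:          # tensor product of the factors
            term *= table[tuple(env[v] for v in vs)]
        key = tuple(env[v] for v in keep)  # contraction over shared dimensions
        result[key] = result.get(key, 0.0) + term
    return result

# Illustrative factors f(A, B) and g(B, C); B is the shared dimension.
card = {"A": 2, "B": 2, "C": 2}
f = (("A", "B"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75})
g = (("B", "C"), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 0.5})
h = sum_product([f, g], card, keep=("A", "C"))
```

Here `h[(a, c)]` holds the sum over `b` of `f(a, b) * g(b, c)`. With two factors this reduces to a matrix product, but the same loop handles any number of factors with arbitrary overlapping variables.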
The challenge is to recognize the seemingly irregular memory access pattern at runtime and to prefetch the right data based on a pre-processing step.
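One way to picture such a pre-processing step is to compute, ahead of time, the flat index each input table will be read at for every joint assignment. The sketch below assumes row-major flat tables; the helper names `strides`, `build_offsets`, and `contract` are hypothetical, and this is a guess at the general idea rather than the method used in the archived kernel.

```python
from itertools import product

def strides(vars_, card):
    """Row-major strides for a flat table over vars_ (last variable varies fastest)."""
    out, step = {}, 1
    for v in reversed(vars_):
        out[v] = step
        step *= card[v]
    return out

def build_offsets(table_vars, order, card):
    """Pre-processing: flat index into one table for every joint assignment of `order`."""
    st = strides(table_vars, card)
    return [sum(st[v] * val for v, val in zip(order, assignment) if v in st)
            for assignment in product(*(range(card[v]) for v in order))]

def contract(tables, offsets, out_offsets, out_size):
    """Runtime: the access pattern is already known, so the loop only gathers and accumulates."""
    out = [0.0] * out_size
    for i, o in enumerate(out_offsets):
        term = 1.0
        for table, offs in zip(tables, offsets):
            term *= table[offs[i]]
        out[o] += term
    return out

# Illustrative flat tables: f over (A, B), g over (B, C), result over (A, C).
card = {"A": 2, "B": 2, "C": 2}
order = ("A", "B", "C")
f = [0.5, 0.5, 0.25, 0.75]
g = [1.0, 0.0, 0.5, 0.5]
f_off = build_offsets(("A", "B"), order, card)
g_off = build_offsets(("B", "C"), order, card)
out_off = build_offsets(("A", "C"), order, card)
result = contract([f, g], [f_off, g_off], out_off, out_size=4)
```

Individually, `f_off` and `g_off` look irregular, yet they are fully determined before the contraction runs. On a GPU, tables like these could tell each thread block which elements to stage into fast memory in advance; how the archived code actually does this is not described here.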
The paper on this was presented at ICS08 in Greece.
Here is the talk I gave on this subject at Microsoft Research, Redmond.
The following archive contains the updated version of the sum-product kernel, implemented using NVIDIA CUDA and OpenMP.
Note that the code is no longer maintained.
Download SumProduct kernel for GPUs.