Overview
The motivation for Bayesian Flow Networks is that autoregressive models are inefficient: generating a sequence of T elements requires T sequential inference steps. A diffusion model can generate the same sequence in D denoising steps, where D << T.
Unfortunately, diffusion models do not work well for discrete data. So, instead of denoising the data point itself, a Bayesian Flow Network models each element of the output sequence as being generated by its own probability distribution and iteratively refines those distributions. The parameters of these distributions are continuous (for example, the class probabilities of a categorical distribution), even when the data themselves are discrete.
Initially, each element's probability distribution is set to the maximum entropy distribution: the uniform distribution for discrete data, and the standard normal distribution for continuous data. At each iteration, the parameters of each element's distribution are updated to bring them closer to representing the training data.
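As an illustrative sketch (not the paper's notation), the maximum entropy priors for a sequence of discrete tokens and for a sequence of real values might look like this; the function names `init_discrete_prior` and `init_continuous_prior` are ours.

```python
import numpy as np

def init_discrete_prior(seq_len: int, num_classes: int) -> np.ndarray:
    """Maximum-entropy prior for discrete data: one uniform categorical
    distribution per sequence element (shape: seq_len x num_classes)."""
    return np.full((seq_len, num_classes), 1.0 / num_classes)

def init_continuous_prior(seq_len: int) -> tuple[np.ndarray, np.ndarray]:
    """Maximum-entropy prior (for fixed mean and variance) for continuous
    data: a standard normal per element, stored as (mean, precision)."""
    means = np.zeros(seq_len)
    precisions = np.ones(seq_len)
    return means, precisions

# Example: priors for a sequence of 8 tokens drawn from a 4-symbol alphabet,
# and for a sequence of 8 real values.
theta_0 = init_discrete_prior(seq_len=8, num_classes=4)
mu_0, rho_0 = init_continuous_prior(seq_len=8)
```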
If each element were independent, training this model would be easy: sample a sequence x from the training set and compute the distribution parameters that might have been used to generate each element of x. Bayesian inference provides closed-form solutions to the update equations required to bring the model's parameters closer to representing the training data.
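For continuous data, for instance, the closed-form update is just conjugate Gaussian inference: if an element's current distribution has mean mu and precision rho, and we observe a noisy version y of the true value with precision alpha, the posterior is again Gaussian. Below is a minimal sketch under that Gaussian assumption (the function name is ours); an analogous closed-form update exists for categorical parameters in the discrete case.

```python
import numpy as np

def bayesian_update_continuous(mu, rho, y, alpha):
    """Conjugate Gaussian update: combine the current belief N(mu, 1/rho)
    with a noisy observation y of the data, made with precision alpha.
    Both the posterior mean and precision have closed forms."""
    rho_new = rho + alpha
    mu_new = (rho * mu + alpha * y) / rho_new
    return mu_new, rho_new

# Toy example: the belief about one element moves towards the data value 2.0
# as noisy observations of it accumulate.
rng = np.random.default_rng(0)
x_true, mu, rho = 2.0, 0.0, 1.0
for alpha in [0.2, 0.5, 1.0, 2.0]:
    y = x_true + rng.normal(scale=1.0 / np.sqrt(alpha))  # noisy observation of x_true
    mu, rho = bayesian_update_continuous(mu, rho, y, alpha)
print(mu, rho)  # mu approaches 2.0; rho grows as information accumulates
```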
Unfortunately, the elements are not independent, so the relationships between them must also be modelled. This is done by training a neural network which, at each iteration of the generative process, takes the parameters of every element's distribution as input and modifies them to account for the other elements.
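Putting the pieces together, the generative process for continuous data might be sketched roughly as below. The `network` callable here is a stand-in for the trained model, and the noise schedule `alphas` and sampling details are simplified relative to the actual algorithm; everything else is the same closed-form Bayesian update as above.

```python
import numpy as np

def generate(network, seq_len, alphas, rng):
    """Sketch of the generative loop for continuous data.

    `network(mu, rho, step)` is assumed to return a refined estimate of every
    element, using the context of all the other elements; the parameter
    updates themselves are the independent-element Bayesian updates."""
    mu = np.zeros(seq_len)   # maximum-entropy prior: standard normal
    rho = np.ones(seq_len)
    for step, alpha in enumerate(alphas):
        x_hat = network(mu, rho, step)                 # context-aware estimate of the data
        noise = rng.normal(scale=1.0 / np.sqrt(alpha), size=seq_len)
        y = x_hat + noise                              # simulated noisy observation
        rho_new = rho + alpha                          # closed-form Bayesian update
        mu = (rho * mu + alpha * y) / rho_new
        rho = rho_new
    return network(mu, rho, len(alphas))               # final estimate of the sequence

# Dummy "network" that ignores context, just to make the sketch runnable.
dummy_network = lambda mu, rho, step: mu
sample = generate(dummy_network, seq_len=8,
                  alphas=[0.5, 1.0, 2.0, 4.0], rng=np.random.default_rng(0))
```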