Unmasking the Noise: Bayesian Measurement Error Modeling

1. Study Context: HMP2 Multi-Omics

We analyze Enzyme Commission (EC) transcription profiles to quantify functional activity in the gut microbiome. The analysis is currently being piloted on the HMP2 dataset and is being extended to UniRef90 gene families.

Dataset: HMP2 IBDMDB (Supplementary Table 9)

Cohort: N = 785 samples from 109 subjects

Reference:
Lloyd-Price, J., Arze, C., Ananthakrishnan, A.N. et al.
Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases.
Nature 569, 655–662 (2019). 10.1038/s41586-019-1237-9

2. The Methodology: Why Bayesian?

Why Negative Binomial?

Count data in microbiome studies exhibit overdispersion (variance > mean), which a Poisson model cannot capture because it constrains the variance to equal the mean. We use the Negative Binomial (NB) distribution, whose dispersion parameter models this biological variability explicitly.
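The contrast between the two count models can be checked with a short simulation. This is a minimal sketch, not the analysis code: it assumes the common mean/dispersion parameterization in which the NB variance is $\mu + \mu^2/\theta$, and the values of `mu` and `theta` are illustrative choices, not estimates from HMP2.

```python
import numpy as np


def simulate_counts(mu, theta, size, seed=0):
    """Draw Poisson and Negative Binomial counts that share the same mean.

    Assumed parameterization: Var(NB) = mu + mu**2 / theta, so a smaller
    theta means stronger overdispersion.
    """
    rng = np.random.default_rng(seed)
    poisson = rng.poisson(mu, size=size)
    # NumPy's negative_binomial takes (n, p); convert from (mu, theta).
    p = theta / (theta + mu)
    negbin = rng.negative_binomial(theta, p, size=size)
    return poisson, negbin


def variance_to_mean(counts):
    """Ratio of sample variance to sample mean (about 1 for Poisson data)."""
    return counts.var() / counts.mean()


# Illustrative values only.
poisson, negbin = simulate_counts(mu=5.0, theta=0.5, size=20000)
print("Poisson variance/mean:", variance_to_mean(poisson))
print("NegBin  variance/mean:", variance_to_mean(negbin))
```

Under these assumptions the Poisson ratio stays near 1, while the NB ratio is expected to sit near $1 + \mu/\theta$, which is the excess variance a Poisson fit has no parameter to absorb.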

Why Bayesian Joint Modeling?

Standard regression treats the DNA count ($Y$) as a fixed truth, measured without error. At low abundance, however, we observe a paradox: RNA reads ($Z$) are present even when DNA is undetected ($Y=0$). We resolve this with a joint generative model:

Latent Variable ($u_i$): Represents the true biological abundance, shared by both DNA and RNA.
Identifiability (Lumping): Biological variation in transcription rates ($r_i$) is indistinguishable from technical noise without extra data. We "lump" these into the NB dispersion parameter rather than making unverifiable assumptions.

Bayesian Hierarchical Structure

$$ \begin{aligned} Y_i &\sim \text{NegBin}(\mu_i^Y, \theta_Y) \\ Z_i &\sim \text{NegBin}(\mu_i^Z, \theta_Z) \end{aligned} $$

$$ \log(\mu_i^Y) = \beta_{0Y} + X_i\beta_Y + u_i $$

$$ \log(\mu_i^Z) = \beta_{0Z} + X_i\beta_Z + \log(\mu_i^Y) $$

*We use Hamiltonian Monte Carlo (HMC, via Stan) instead of variational inference (VI): HMC is asymptotically exact, whereas VI approximates the posterior, which matters for these sparse, heavy-tailed distributions.
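To make the hierarchical structure above concrete, the following sketch simulates data from it forward. Several pieces are assumptions for illustration only: a single binary covariate $X_i$, a Normal distribution $u_i \sim \mathcal{N}(0, \sigma_u^2)$ for the latent term, the mean/dispersion NB parameterization, and every numeric value passed to the hypothetical `simulate_joint` helper.

```python
import numpy as np


def sample_negbin(rng, mu, theta):
    """NB draws with mean mu and dispersion theta (Var = mu + mu**2/theta)."""
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p)


def simulate_joint(n, beta0_y, beta_y, beta0_z, beta_z,
                   sigma_u, theta_y, theta_z, seed=0):
    """Forward-simulate DNA counts Y and RNA counts Z from the joint model."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n).astype(float)   # assumed binary covariate
    u = rng.normal(0.0, sigma_u, size=n)           # assumed Normal latent term
    log_mu_y = beta0_y + beta_y * x + u            # log(mu_i^Y)
    log_mu_z = beta0_z + beta_z * x + log_mu_y     # log(mu_i^Z)
    y = sample_negbin(rng, np.exp(log_mu_y), theta_y)
    z = sample_negbin(rng, np.exp(log_mu_z), theta_z)
    return x, y, z


# Illustrative low-abundance setting; none of these values come from HMP2.
x, y, z = simulate_joint(n=2000, beta0_y=-1.0, beta_y=0.3,
                         beta0_z=1.0, beta_z=0.5,
                         sigma_u=1.0, theta_y=1.0, theta_z=1.0)

# Samples where RNA is observed although DNA was not detected.
n_paradox = int(((y == 0) & (z > 0)).sum())
print("samples with Y = 0 and Z > 0:", n_paradox)
```

Because $Z_i$ depends on the latent mean $\mu_i^Y$ rather than on the observed count $Y_i$, the simulation yields samples with $Y = 0$ and $Z > 0$ without any special handling, which is the pattern a regression on observed $Y$ cannot represent.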

3. Bias Analysis: The Laws of Noise

Law I: The Sparsity Paradox

Standard models fail at low abundance. The "Sparsity Law" shows that the bias, measured as $|\beta_{\text{Naive}} - \beta_{\text{Bayes}}|$, grows sharply as abundance decreases.
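One way to inspect this relationship is to bin features by mean abundance and summarize the absolute gap between the two estimates within each bin. The helper below, `bias_by_abundance`, is a hypothetical name, and the toy inputs are made-up numbers that only exercise the function; they are not results from this study.

```python
import numpy as np


def bias_by_abundance(beta_naive, beta_bayes, mean_abundance, n_bins=3):
    """Median |beta_naive - beta_bayes| within equal-sized abundance bins.

    Returns a list of (mean abundance in bin, median absolute bias) tuples,
    ordered from the lowest-abundance bin to the highest.
    """
    beta_naive = np.asarray(beta_naive, dtype=float)
    beta_bayes = np.asarray(beta_bayes, dtype=float)
    mean_abundance = np.asarray(mean_abundance, dtype=float)

    bias = np.abs(beta_naive - beta_bayes)
    order = np.argsort(mean_abundance)  # feature indices, low to high abundance

    summary = []
    for idx in np.array_split(order, n_bins):
        summary.append((float(mean_abundance[idx].mean()),
                        float(np.median(bias[idx]))))
    return summary


# Toy inputs (fabricated for illustration, NOT study results).
toy = bias_by_abundance(
    beta_naive=[0.9, 0.7, 0.4, 0.35, 0.21, 0.2],
    beta_bayes=[0.3, 0.3, 0.3, 0.3, 0.2, 0.2],
    mean_abundance=[0.5, 1.0, 10.0, 20.0, 200.0, 400.0],
)
for abundance, median_bias in toy:
    print(f"mean abundance {abundance:8.2f} -> median |bias| {median_bias:.3f}")
```

With real per-feature estimates from both models, a summary of this kind would show whether the bias is concentrated in the lowest-abundance bins.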

Law II: The Dispersion Law

Bias is driven by Total Dispersion. This metric sums Technical Noise (NB $\theta$) and Biological Heterogeneity (Random Effects $\sigma_u$), acting as a "Total Uncertainty Budget".
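The additive form of this budget can be motivated with a short variance decomposition. The derivation below is a sketch under assumptions the text does not state: a count $Y \mid u \sim \text{NegBin}(\mu e^{u}, \theta)$ with conditional variance $m + m^2/\theta$ for $m = \mu e^{u}$, and a Normal random effect $u \sim \mathcal{N}(0, \sigma_u^2)$. The exact definition of Total Dispersion used in the analysis may differ.

$$ \begin{aligned} \mathbb{E}[Y] &= \mu\, e^{\sigma_u^2/2} \\ \text{Var}(Y) &= \mathbb{E}[\text{Var}(Y \mid u)] + \text{Var}(\mathbb{E}[Y \mid u]) \\ &= \mu\, e^{\sigma_u^2/2} + \frac{\mu^2 e^{2\sigma_u^2}}{\theta} + \mu^2 e^{\sigma_u^2}\left(e^{\sigma_u^2} - 1\right) \end{aligned} $$

Dividing by $\mathbb{E}[Y]^2$ gives the squared coefficient of variation:

$$ \frac{\text{Var}(Y)}{\mathbb{E}[Y]^2} = \frac{1}{\mathbb{E}[Y]} + \frac{e^{\sigma_u^2}}{\theta} + \left(e^{\sigma_u^2} - 1\right) \approx \frac{1}{\mathbb{E}[Y]} + \frac{1}{\theta} + \sigma_u^2 \quad \text{for small } \sigma_u^2 $$

Under these assumptions, the technical term $1/\theta$ and the biological term $\sigma_u^2$ enter roughly additively on top of the Poisson sampling term $1/\mathbb{E}[Y]$, which is consistent with treating their sum as a single uncertainty budget.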

Duke CBB: MTX Bayesian Analysis

4. The Solution: Adjusting Estimates

M1 (Bayesian) vs M3 (Naive)

5. Final Inference Impact

Standard Approach (M3)

Corrected Inference (M1)

Selected Gene Details
