Duke CBB: MTX Bayesian Analysis

Interactive Measurement Error Modeling Dashboard (V4)
Xuexin Li
Candidate

1. Study Context: HMP2 Multi-Omics

We analyze Enzyme Commission (EC) transcription profiles to quantify functional activity in the gut microbiome. The work is being piloted on the HMP2 dataset and extended to UniRef90 gene families.

Dataset: HMP2 IBDMDB (Supplementary Table 9)
Cohort: N = 785 samples from 109 subjects
Reference: Lloyd-Price, J., Arze, C., Ananthakrishnan, A.N. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019). doi:10.1038/s41586-019-1237-9

2. The Methodology: Why Bayesian?

Why Negative Binomial?

Count data in microbiome studies exhibit overdispersion (variance > mean), which a Poisson model cannot capture because it constrains the variance to equal the mean. We use the Negative Binomial (NB) distribution to model this biological variability explicitly.
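The contrast between Poisson and NB counts can be checked with a short simulation. This is a minimal sketch, assuming the mean/dispersion parameterization $\text{NegBin}(\mu, \theta)$ with variance $\mu + \mu^2/\theta$ (the convention of Stan's `neg_binomial_2`). The helper `sample_negbin` and the values of `mu` and `theta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_negbin(rng, mu, theta, size):
    """Draw NB counts with mean mu and dispersion theta.

    Under this parameterization the variance is mu + mu**2 / theta.
    NumPy uses (n, p), so we map n = theta and p = theta / (theta + mu).
    """
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p, size=size)


rng = np.random.default_rng(0)
mu, theta = 5.0, 0.5  # hypothetical values, not estimates from HMP2

poisson_counts = rng.poisson(mu, size=200_000)
nb_counts = sample_negbin(rng, mu, theta, size=200_000)

# Poisson: variance tracks the mean. NB: variance is far larger than the mean.
print("Poisson mean / variance:", poisson_counts.mean(), poisson_counts.var())
print("NB mean / variance:     ", nb_counts.mean(), nb_counts.var())
print("NB theoretical variance:", mu + mu**2 / theta)
```

Smaller values of `theta` give heavier overdispersion; as `theta` grows, the NB variance approaches the Poisson variance.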

Why Bayesian Joint Modeling?

Standard regression treats DNA abundance ($Y$) as a fixed, error-free truth. At low abundance, however, we observe a paradox: RNA reads ($Z$) are present even when DNA is undetected ($Y=0$). We resolve this with a Joint Generative Model:

  • Latent Variable ($u_i$): Represents the true biological abundance, shared by both DNA and RNA.
  • Identifiability (Lumping): Biological variation in transcription rates ($r_i$) is indistinguishable from technical noise without extra data. We "lump" these into the NB dispersion parameter rather than making unverifiable assumptions.

Bayesian Hierarchical Structure

$$ \begin{aligned} Y_i &\sim \text{NegBin}(\mu_i^Y, \theta_Y) \\ Z_i &\sim \text{NegBin}(\mu_i^Z, \theta_Z) \end{aligned} $$
$$ \begin{aligned} \log(\mu_i^Y) &= \beta_{0Y} + X_i\beta_Y + u_i \\ \log(\mu_i^Z) &= \beta_{0Z} + X_i\beta_Z + \log(\mu_i^Y) \end{aligned} $$
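*We use Hamiltonian Monte Carlo (HMC, via Stan) instead of variational inference (VI) because HMC gives asymptotically exact posterior inference on these sparse, heavy-tailed distributions, whereas VI is an approximation.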
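Simulating the hierarchical structure above forward shows how the model produces RNA reads where DNA is undetected. This is a sketch under stated assumptions, not the dashboard's fitted model. It assumes $u_i \sim \text{Normal}(0, \sigma_u)$, a single binary covariate for $X_i$, and made-up parameter values. The function name `simulate_joint` is hypothetical.

```python
import numpy as np


def simulate_joint(rng, n, beta0_y, beta_y, beta0_z, beta_z,
                   theta_y, theta_z, sigma_u):
    """Forward-simulate DNA counts Y and RNA counts Z from the joint model."""
    x = rng.binomial(1, 0.5, size=n)      # assumed binary covariate X_i
    u = rng.normal(0.0, sigma_u, size=n)  # assumed Normal latent abundance u_i

    # Linear predictors as written in the hierarchical structure.
    log_mu_y = beta0_y + x * beta_y + u
    log_mu_z = beta0_z + x * beta_z + log_mu_y
    mu_y, mu_z = np.exp(log_mu_y), np.exp(log_mu_z)

    # NegBin(mu, theta) mapped to NumPy's (n, p): n = theta, p = theta / (theta + mu).
    y = rng.negative_binomial(theta_y, theta_y / (theta_y + mu_y))
    z = rng.negative_binomial(theta_z, theta_z / (theta_z + mu_z))
    return x, y, z


rng = np.random.default_rng(1)
x, y, z = simulate_joint(
    rng, n=5000,
    beta0_y=-0.5, beta_y=0.3,  # hypothetical low-abundance DNA regime
    beta0_z=1.0, beta_z=0.2,   # hypothetical RNA offsets
    theta_y=1.0, theta_z=1.0, sigma_u=1.0,
)

# Share of samples showing the "paradox": DNA undetected but RNA observed.
paradox = np.mean((y == 0) & (z > 0))
print(f"Fraction with Y = 0 and Z > 0: {paradox:.3f}")
```

Because $Y_i$ and $Z_i$ are separate noisy draws that share the same latent $u_i$, a zero DNA count does not force a zero RNA count, which is the behavior a regression on observed $Y$ cannot represent.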

3. Bias Analysis: The Laws of Noise

Law I: The Sparsity Paradox

Standard models fail at low abundance. The "Sparsity Law" shows that the bias, measured as the gap between the naive and Bayesian estimates ($|\beta_{\text{Naive}} - \beta_{\text{Bayes}}|$), grows sharply as abundance decreases.
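One way to summarize this relationship is to average the absolute gap between the two estimates within abundance bins. The sketch below is illustrative only: `bias_by_abundance` is a hypothetical helper, quantile binning on log abundance is an assumed choice, and the toy inputs are constructed by hand to exercise the function. They are not results from the HMP2 analysis.

```python
import numpy as np


def bias_by_abundance(beta_naive, beta_bayes, mean_abundance, n_bins=3):
    """Mean |beta_naive - beta_bayes| per abundance bin, lowest bin first."""
    beta_naive = np.asarray(beta_naive, dtype=float)
    beta_bayes = np.asarray(beta_bayes, dtype=float)
    log_ab = np.log1p(np.asarray(mean_abundance, dtype=float))

    bias = np.abs(beta_naive - beta_bayes)

    # Quantile edges give bins with roughly equal numbers of features.
    edges = np.quantile(log_ab, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.searchsorted(edges, log_ab, side="right") - 1
    bin_idx = np.clip(bin_idx, 0, n_bins - 1)  # put the maximum in the last bin

    return [float(bias[bin_idx == b].mean()) for b in range(n_bins)]


# Toy inputs (made up): the gap is built to shrink as abundance grows.
abundance = np.array([1, 2, 5, 10, 50, 100, 500, 1000, 5000], dtype=float)
beta_bayes_toy = np.full(abundance.shape, 0.5)
beta_naive_toy = beta_bayes_toy + 1.0 / np.sqrt(abundance)

binned = bias_by_abundance(beta_naive_toy, beta_bayes_toy, abundance)
print("Mean bias per bin (low to high abundance):", binned)
```

With real estimates, a monotone decrease across bins would be the pattern the Sparsity Law describes.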

Law II: The Dispersion Law

Bias is driven by Total Dispersion. This metric sums Technical Noise (NB $\theta$) and Biological Heterogeneity (Random Effects $\sigma_u$), acting as a "Total Uncertainty Budget".

4. The Solution: Adjusting Estimates

5. Final Inference Impact

Standard Approach (M3)

Corrected Inference (M1)

Selected Gene Details