Loading HMP2 EC Dataset & Bayesian Statistics
We analyze Enzyme Commission (EC) transcription profiles to quantify functional activity in the gut microbiome. This work is currently being piloted on the HMP2 dataset and is being extended to UniRef90 gene families.
Count data in microbiome studies exhibit overdispersion (variance > mean), which Poisson models cannot capture because they force the variance to equal the mean. We use the Negative Binomial (NB) distribution to model this biological variability explicitly.
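To make the variance-to-mean contrast concrete, here is a minimal simulation sketch. The mean, the NB size value, and the sample size are arbitrary assumptions chosen for illustration and are not HMP2 estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 5.0             # assumed mean count (illustrative only)
nb_size = 0.5        # assumed NB size (inverse-dispersion) value; smaller = noisier
n_samples = 100_000

# Poisson: variance is tied to the mean.
poisson_counts = rng.poisson(mu, size=n_samples)

# NumPy parameterizes NB by (n, p). Setting n = nb_size and
# p = nb_size / (nb_size + mu) gives mean mu and variance mu + mu**2 / nb_size.
nb_counts = rng.negative_binomial(nb_size, nb_size / (nb_size + mu), size=n_samples)

poisson_ratio = poisson_counts.var() / poisson_counts.mean()
nb_ratio = nb_counts.var() / nb_counts.mean()

print(f"Poisson variance/mean: {poisson_ratio:.2f}")
print(f"NB variance/mean:      {nb_ratio:.2f}")
```

A variance-to-mean ratio close to 1 is what a Poisson model assumes; a ratio well above 1 is the overdispersion the NB model is meant to absorb.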
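Standard regression treats DNA ($Y$) as a fixed, error-free truth. However, at low abundance, we observe a paradox: RNA reads ($Z$) exist even when DNA is undetected ($Y=0$). We solve this using a Joint Generative Model, which treats both $Y$ and $Z$ as noisy observations rather than conditioning on $Y$ as known.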
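As an illustration of how the $Y=0$, $Z>0$ pattern can arise when both counts are noisy draws, the sketch below simulates DNA and RNA reads from a shared latent abundance. This is a toy model under stated assumptions, not the model used in this work: the log-normal latent abundance, the `expr_ratio` value, and the NB size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features = 50_000


def draw_nb(mean, size, rng):
    """Draw NB counts with the given mean and size (variance = mean + mean**2 / size)."""
    return rng.negative_binomial(size, size / (size + mean))


# Hypothetical latent DNA abundance (expected reads per feature), low-abundance regime.
latent = rng.lognormal(mean=np.log(0.5), sigma=1.0, size=n_features)

expr_ratio = 4.0   # hypothetical RNA/DNA expression ratio
nb_size = 2.0      # hypothetical NB size shared by both assays

y = draw_nb(latent, nb_size, rng)                # observed DNA reads (Y)
z = draw_nb(latent * expr_ratio, nb_size, rng)   # observed RNA reads (Z)

# Fraction of features where DNA is undetected but RNA reads are present.
paradox_fraction = np.mean((y == 0) & (z > 0))
print(f"Fraction with Y = 0 and Z > 0: {paradox_fraction:.3f}")
```

In this toy setting every feature has nonzero true abundance, so each $Y=0$ is a sampling zero; a regression that conditions on $Y$ as exact has no way to represent that.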
Standard models fail at low abundance. The "Sparsity Law" shows that the bias, $|\beta_{Naive} - \beta_{Bayes}|$ (the gap between the naive regression estimate and the Bayesian estimate), grows sharply as abundance decreases.
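One way to inspect this relationship is to bin features by abundance and average the absolute bias within each bin. The helper below is a hypothetical sketch: the function name, its inputs, and the toy numbers are placeholders for illustration, not results from HMP2.

```python
import numpy as np


def bias_by_abundance(abundance, beta_naive, beta_bayes, n_bins=4):
    """Mean |beta_naive - beta_bayes| within quantile bins of abundance.

    Returns a list of (bin_low, bin_high, mean_bias) tuples, lowest abundance first.
    """
    abundance = np.asarray(abundance, dtype=float)
    bias = np.abs(np.asarray(beta_naive, dtype=float) - np.asarray(beta_bayes, dtype=float))

    # Quantile-based bin edges so each bin holds a similar number of features.
    edges = np.quantile(abundance, np.linspace(0.0, 1.0, n_bins + 1))

    # Map each feature to a bin index in [0, n_bins - 1]; the maximum goes in the last bin.
    bin_idx = np.clip(np.searchsorted(edges, abundance, side="right") - 1, 0, n_bins - 1)

    summary = []
    for i in range(n_bins):
        in_bin = bin_idx == i
        if in_bin.any():
            summary.append((float(edges[i]), float(edges[i + 1]), float(bias[in_bin].mean())))
    return summary


# Toy placeholder values, chosen only to exercise the function.
toy_abundance = [1, 2, 100, 200]
toy_beta_naive = [0.2, 0.4, 0.9, 1.0]
toy_beta_bayes = [1.0, 1.0, 1.0, 1.0]

for low, high, mean_bias in bias_by_abundance(toy_abundance, toy_beta_naive, toy_beta_bayes, n_bins=2):
    print(f"abundance {low:g} to {high:g}: mean |bias| = {mean_bias:.2f}")
```

Quantile bins are used here rather than fixed-width bins because abundance is typically heavy-tailed, and fixed-width bins would leave most features in the lowest bin.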
Bias is driven by Total Dispersion. This metric sums Technical Noise (NB $\theta$) and Biological Heterogeneity (Random Effects $\sigma_u$), acting as a "Total Uncertainty Budget".
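Written out literally, with $D_{\text{total}}$ as an assumed name for the metric:

$$D_{\text{total}} = \theta + \sigma_u$$

This takes "sums" at face value. It assumes $\theta$ is parameterized so that larger values mean more technical noise, and it leaves open whether the random-effect term enters as $\sigma_u$ or as the variance $\sigma_u^2$.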