Help & Documentation
Complete reference for the Microbiome Analysis Platform — pipeline, statistics, outputs, and scientific reporting.
1. Overview
The Microbiome Analysis Platform is a full-stack web application purpose-built for researchers who study the relationship between the gut (or other) microbiome and time-to-event clinical outcomes such as progression-free survival (PFS), overall survival, or other endpoints. It automates a reproducible, configurable multi-step analysis pipeline and produces publication-quality outputs in standard open formats.
Data ingestion
Upload patient metadata and taxonomy abundance files (CSV, Excel). Multiple files per dataset, versioned and traceable.
Automated pipeline
Cohort selection, extreme value handling, attribute and microbial filtering, stratification, clustering, dimension reduction, and feature scaling — all configurable.
Scientific outputs
Coefficient tables (CSV & JSON), forest plots, volcano plots, Kaplan-Meier curves, box plots, correlation heatmaps, and pipeline summaries — ready for papers.
Scientific context
Linking microbiome composition to survival outcomes presents unique methodological challenges: the feature space is high-dimensional (often thousands of taxa), compositional, sparse, and correlated. The platform addresses this by:
- Microbial discarding: removing taxa with near-zero variance or low prevalence, reducing noise before modelling.
- Compositional transformation: optional Centered Log-Ratio (CLR) scaling to handle the compositional nature of abundance data.
- Microbial clustering: grouping correlated taxa into clusters and using cluster representatives, drastically reducing the effective dimensionality while preserving biological signal.
- Events-per-variable (EPV) awareness: each analysis method documents its EPV requirements so users can assess feasibility.
- Multiple survival models: Cox PH, AFT, Frailty, Competing Risks, and Bayesian models — selectable per analysis, with parallel method-comparison runs supported.
2. Biological entity classification service
We provide a dedicated service to support manuscript preparation and supplementary materials: for any list of biological entities (e.g. taxon IDs, gene symbols, protein accessions, or organism names), we can generate a detailed classification table with full taxonomic or functional annotations and traceable references to the scientific literature.
2.1 What we provide
Classification table
Each entity is annotated with its official classification (e.g. kingdom, phylum, class, order, family, genus, species), standard identifiers (NCBI Taxonomy ID, where applicable), and a short description. The table is formatted for direct use in papers or as a supplementary file.
Literature references (DOI & PMID)
For each classification or nomenclatural decision, we attach the DOI (Digital Object Identifier) and PMID (PubMed ID) of the primary literature that documents that classification. This supports reproducibility and meets journal requirements for cited taxonomy and nomenclature.
2.2 DOI & PMID references
Every classification or nomenclatural assignment in the table is linked to the primary scientific publication that documents it. We supply both DOI (e.g. 10.1038/...) and PMID (PubMed ID) so you can cite the source in your Methods or supplementary materials and meet journal requirements for traceable taxonomy and nomenclature.
2.3 Use cases
- Supplementary table of all microbial taxa (e.g. genera or species) reported in your analysis, with full taxonomy and references.
- Gene or protein lists with functional classification and literature supporting each assignment.
- Organism or strain lists with standardised names and the publication(s) establishing the current classification.
2.4 How to request
Contact the platform maintainer with your entity list (one identifier per line or as a CSV column) and the type of classification desired (taxonomic, functional, or both). Deliverables include a CSV/Excel table and an optional PDF summary with DOI and PMID links for easy verification.
3. Getting Started
- Login via Google OAuth. Your data is private to your account.
- Create a dataset from the Dashboard → New Dataset. Give it a descriptive name.
- Upload files in the Files tab: at minimum one patient/metadata file and one taxonomy abundance file.
- Create an analysis in the Analysis tab → New Analysis. Configure Data Sources, Pre-Analysis, Analysis method, Post-Analysis, and Output options.
- Run the analysis. A draggable progress dialog shows each pipeline step in real time. Secondary analyses (per-stratum, per-method) run in parallel automatically.
- Download results from the Reports tab: CSV tables, JSON, and image files (JPG/PNG) are available for direct import into R, Python, or statistical software.
4. Workflow Diagram
The full pipeline runs in a fixed order. Secondary runs (per population sector and per analysis method) branch off automatically after the cohort is established at microbial grouping.
5. Data Sources & File Requirements
5.1 File formats
Each dataset holds an arbitrary number of versioned files. In the Analysis configuration (Data Sources tab) you select:
| Role | Typical content | Rows | Key columns | Accepted formats |
|---|---|---|---|---|
| Patient data file | Clinical & survival metadata | One row per patient | duration_pfs, pfs_status, clinical variables, stratification variables | CSV, XLSX |
| Taxonomy / abundance file | Bracken/Kraken read counts or relative abundances per taxon | One row per sample (must be joinable to patient) | Taxon IDs (NCBI or custom) as column names; sample identifier column | CSV, XLSX |
5.2 Survival columns
The two mandatory survival columns are configured in project metadata (metadata/COLUMNS.py, KAPLAN_MEIER key):
- duration (e.g. duration_pfs): numeric, strictly > 0. Units are arbitrary (months, days) but must be consistent.
- event (e.g. pfs_status): integer; 0 = right-censored, 1 = event of interest. For competing risks, 2, 3, … denote competing event types.
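The two column rules above can be checked before upload. The following is a minimal sketch using only the Python standard library; the function names and the default column names (taken from the examples above) are illustrative assumptions, not part of the platform's API.

```python
import math


def _to_float(value):
    """Return value as a finite float, or None if it cannot be parsed."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def validate_survival_rows(rows, duration_col="duration_pfs",
                           event_col="pfs_status", competing_risks=False):
    """Return (row_index, message) pairs for rows that break the survival-column rules.

    rows is a list of dicts, e.g. as produced by csv.DictReader.
    """
    problems = []
    for i, row in enumerate(rows):
        # Duration: numeric and strictly positive.
        duration = _to_float(row.get(duration_col))
        if duration is None or not duration > 0:
            problems.append((i, f"{duration_col} must be numeric and > 0"))

        # Event: non-negative integer code; only 0/1 unless competing risks are used.
        event = _to_float(row.get(event_col))
        if event is None or event != int(event) or event < 0:
            problems.append((i, f"{event_col} must be a non-negative integer code"))
        elif not competing_risks and event > 1:
            problems.append((i, f"{event_col} must be 0 or 1 without competing risks"))
    return problems


rows = [
    {"duration_pfs": "12.5", "pfs_status": "1"},
    {"duration_pfs": "0", "pfs_status": "0"},   # duration not strictly positive
    {"duration_pfs": "3", "pfs_status": "2"},   # valid only for competing risks
]
for index, message in validate_survival_rows(rows):
    print(index, message)
```

Passing competing_risks=True accepts event codes 2, 3, … as competing event types.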
6. Pipeline Stages
The pipeline runs steps in numeric order. Each step logs row and column counts; the pipeline summary CSV reports the state at every stage. Parameters are set in the Analysis editor under five tabs: Data Sources, Pre-Analysis, Analysis, Post-Analysis, Output.
6.1 Pre-analysis steps
Generates Kaplan-Meier survival curves for each stratification variable before any filtering. Used to visualise baseline group differences.
Selects which Bracken/abundance timepoints to include (e.g. baseline only, or multiple time-points merged). Resolves and joins source files into a single analysis frame.
Optional winsorisation or exclusion of extreme-value patients. Configurable percentile thresholds (e.g. below the 5th or above the 95th percentile). The example analysis started with 56 patients; 34 were preserved after extremes handling (lower edge = 17, upper edge = 17).
Reads column-group metadata (e.g. clinical vs. microbial vs. stratification columns). Ensures downstream steps apply the correct filtering policy to each group. In the example, 69 clinical columns were classified.
Restricts the cohort to the selected strata within each stratification variable (e.g. keep only "High-Risk" and "Intermediate-Risk" from FISH indicators). Removes rows not belonging to any selected stratum. The result is the analysis cohort for all downstream steps.
Produces an alluvial (Sankey-style) diagram showing how patients flow through stratification layers.
Drops non-microbial columns by configurable policy: constant columns (zero variance), near-constant, high missingness, etc. In the example, 2 constant columns were removed (melphalanmgperm2_1, melphalanmgperm2).
Drops microbial (taxonomy) features by policy: constant, near-zero abundance, low prevalence across samples, etc. In the example, 1,710 microbial columns were removed, leaving a tractable feature set for clustering.
Visualises abundance distributions of retained microbial features using ridgeline plots.
CSV table of missing-value counts per column after the discarding steps. Useful for reporting data quality in Methods sections.
Univariate Cox (or selected-method) screening of each covariate independently. Results exported to CSV/JSON and visualised as a volcano plot (−log₂ p vs. log hazard ratio).
Applies microbial grouping policy (e.g. aggregate to genus level, use Bracken multi-sample grouping). This is the last shared step before secondary runs fork off.
Clusters remaining microbial features using the chosen method (hierarchical, k-means, DBSCAN, etc.). Produces cluster assignments and representatives. Dramatically reduces the effective dimensionality while preserving biological groupings.
Assigns human-readable labels to clusters (e.g. by dominant taxon or user-defined names). Labels appear in all downstream plots and tables.
Reduces the design matrix to cluster representatives (one feature per cluster) plus retained clinical variables. This is the final feature matrix passed to normalization and the survival model.
Correlation heatmap of the reduced feature matrix. Saved as PNG. Reveals residual collinearity between retained features.
Scales the feature matrix before the survival model.
Runs the selected method on the prepared, scaled matrix. Produces the primary coefficient table: coef, exp(coef), se(coef), 95% CI, z, p, −log₂(p). Also reports the 10 most statistically significant covariates.
6.2 Feature scaling
Applied to numeric covariates before fitting the survival model. The scaling method is set in the Analysis → Pre-Analysis tab:
| Method | Formula | Best for |
|---|---|---|
| Z-score (default) | x′ = (x − μ) / σ | Cox PH, Frailty, Competing Risks. Puts all features on the same scale; coefficients are interpretable as per-SD effects. |
| CLR | x′ = log(x / g(x)), where g(x) is the geometric mean | Microbial (compositional) abundance data. Removes the unit-sum constraint and log-scales counts. |
| Min-Max | x′ = (x − min) / (max − min) | When bounded [0, 1] inputs are preferred (e.g. some Bayesian priors). |
| None | — | When data are already suitably scaled, or for manual inspection. |
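The formulas in the table can be sketched in plain Python as follows. This is an illustration, not the platform's implementation: the function names are hypothetical, the z-score uses the population standard deviation, and the CLR pseudocount for zero counts is an assumption the table does not specify. Z-score and min-max are applied per feature (column); CLR is applied per sample (row).

```python
import math
from statistics import mean, pstdev


def z_score(values):
    """x' = (x - mu) / sigma for one feature (population SD assumed)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("constant feature: z-score is undefined")
    return [(x - mu) / sigma for x in values]


def min_max(values):
    """x' = (x - min) / (max - min) for one feature, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature: min-max scaling is undefined")
    return [(x - lo) / (hi - lo) for x in values]


def clr(composition, pseudocount=0.5):
    """x' = log(x / g(x)) for one sample, g(x) being the geometric mean.

    The pseudocount (an assumption) keeps log() defined when counts are zero.
    """
    logs = [math.log(x + pseudocount) for x in composition]
    log_gmean = sum(logs) / len(logs)  # log of the geometric mean
    return [value - log_gmean for value in logs]


print(z_score([1.0, 2.0, 3.0]))
print(min_max([2.0, 4.0, 6.0]))
print(clr([10.0, 100.0, 1000.0], pseudocount=0.0))
```

CLR-transformed values of a sample always sum to zero, which is a quick sanity check on any implementation.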
6.3 Clustering methods
Microbial feature clustering reduces dimensionality while preserving co-abundance structure. Available methods:
7. Survival Analysis Methods
All methods model the relationship between the covariate vector X and a time-to-event outcome (T, δ), where T is the observed time and δ ∈ {0, 1} is the event indicator. Every method produces the same standardised output table (Section 9.1). Configure the method and its parameters in the Analysis → Analysis tab.
Cox Proportional Hazards Regression
Model: The semi-parametric Cox model factorises the hazard into a non-parametric baseline hazard h₀(t) (unspecified) and a parametric relative risk term:
log h(t | X) = log h₀(t) + Xβ
Estimation: β is estimated by maximising the partial likelihood:
L(β) = ∏_{i: δᵢ = 1} exp(Xᵢβ) / Σ_{j ∈ R(tᵢ)} exp(Xⱼβ)
where R(tᵢ) is the risk set at event time tᵢ (all subjects still under observation). No baseline hazard parameters are estimated.
Regularization: An elastic-net penalty can be added: −log L(β) + λ [α ||β||₁ + (1−α)/2 · ||β||₂²], where α is the L1 ratio and λ the penalizer strength.
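To make the objective concrete, the sketch below evaluates the penalised negative log partial likelihood exactly as written above, in plain Python. It is illustrative only: the function name is hypothetical, ties are handled in the Breslow style (an assumption), and the fitting library used by the platform may scale the penalty differently.

```python
import math


def neg_log_partial_likelihood(times, events, X, beta, penalizer=0.0, l1_ratio=0.0):
    """-log L(beta) + penalizer * [l1_ratio * ||beta||_1 + (1 - l1_ratio)/2 * ||beta||_2^2].

    times:  observed times, one per subject
    events: 1 = event, 0 = censored
    X:      list of covariate rows, one per subject
    beta:   coefficient vector
    """
    # Linear predictor X_i * beta for every subject.
    eta = [sum(x * b for x, b in zip(row, beta)) for row in X]

    nll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored subjects only contribute through risk sets
        # Risk set R(t_i): everyone whose observed time is at least t_i.
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        nll -= eta[i] - math.log(risk_sum)

    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return nll + penalizer * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
X = [[0.0], [1.0], [2.0]]
print(neg_log_partial_likelihood(times, events, X, beta=[0.0]))
```

With β = 0 and three distinct event times, each event contributes the log of its risk-set size, so the value equals log(3) + log(2) + log(1) = log 6. The double loop is O(n²), which is fine for illustration but not for large cohorts.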
Output quantities reported per covariate:
- coef (β̂): log-hazard ratio.
- exp(coef): Hazard Ratio (HR) = exp(β̂). HR > 1 → increased hazard; HR < 1 → reduced hazard.
- se(coef): standard error of β̂ from the inverse observed Fisher information matrix.
- coef lower/upper 95%: Wald CI = β̂ ± 1.96 · SE.
- exp(coef) lower/upper 95%: exponentiated CI for HR.
- z: Wald z-statistic = β̂ / SE.
- p: two-sided p-value from standard normal: p = 2(1 − Φ(|z|)).
- −log₂(p): negative binary logarithm of p. Values ≥ 4.32 correspond to p ≤ 0.05.
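All of the Wald quantities above follow from coef and se(coef) alone. A minimal standard-library sketch (the function name wald_summary is illustrative, not a platform function), checked here against the coef and se(coef) reported for covariate 1359 in the example table of Section 9.1:

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Derive the per-covariate Wald quantities from a coefficient and its SE."""
    z = coef / se
    # Two-sided normal p-value: erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    lower, upper = coef - z_crit * se, coef + z_crit * se
    return {
        "coef": coef,
        "exp(coef)": math.exp(coef),
        "se(coef)": se,
        "coef lower 95%": lower,
        "coef upper 95%": upper,
        "exp(coef) lower 95%": math.exp(lower),
        "exp(coef) upper 95%": math.exp(upper),
        "z": z,
        "p": p,
        "-log2(p)": -math.log2(p),  # fails if p underflows to exactly 0
    }


summary = wald_summary(coef=2.891, se=1.012)
for name, value in summary.items():
    print(f"{name:>20}: {value:.5g}")
```

Small differences from the published row in the third or fourth significant digit are expected because the inputs are already rounded.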
Assumptions & requirements
- Proportional hazards: HR is constant over time.
- EPV ≥ 10 events per variable (rule of thumb).
- Independent observations (use Frailty if clustered).
Configurable parameters
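- α: significance level (default 0.05). This is distinct from the elastic-net mixing weight, which is set by l1_ratio.
- penalizer: overall penalty strength λ (default 0, no regularisation).
- l1_ratio: mixing weight between the L1 and L2 penalties; 0 = pure L2 (ridge), 1 = pure L1 (lasso) (default 0).
- max_iter: maximum number of fitting iterations (default 1000).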
- tolerance: convergence threshold (default 1e-6).
Accelerated Failure Time (AFT) Model
Model: The AFT model is fully parametric. It assumes that covariates act multiplicatively on survival time (equivalently, additively on log-time):
log T = Xβ + σW
where σ is a scale parameter and W follows a specified error distribution. The parametric family determines the baseline survival function:
| Distribution | W follows | Baseline S₀(t) |
|---|---|---|
| Weibull (default) | Extreme value (Gumbel) | exp(−(t/λ)^ρ) |
| Exponential | Extreme value (Gumbel), ρ=1 | exp(−t/λ) |
| Log-normal | Normal | 1 − Φ((log t − μ)/σ), with μ and σ the location and scale of log T |
| Log-logistic | Logistic | 1/(1+(t/λ)^ρ) |
Interpretation: exp(β) is the time ratio (TR): the factor by which expected survival time is multiplied per unit increase in the covariate. TR > 1 → longer expected survival; TR < 1 → shorter.
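The time-ratio interpretation can be verified numerically for the Weibull case. In the sketch below (helper names are illustrative), a covariate value x rescales the Weibull scale parameter to λ·exp(βx), which is the AFT assumption stated above; the median survival time then changes by exactly exp(β) per unit of x.

```python
import math


def weibull_survival(t, lam, rho):
    """Weibull survival function S(t) = exp(-(t / lam) ** rho)."""
    return math.exp(-((t / lam) ** rho))


def weibull_median(lam, rho):
    """Time at which S(t) = 0.5."""
    return lam * math.log(2.0) ** (1.0 / rho)


def scaled_lambda(lam0, beta, x):
    """AFT effect of a covariate: the time scale is multiplied by exp(beta * x)."""
    return lam0 * math.exp(beta * x)


lam0, rho, beta = 10.0, 1.5, 0.4
median_x0 = weibull_median(scaled_lambda(lam0, beta, 0.0), rho)
median_x1 = weibull_median(scaled_lambda(lam0, beta, 1.0), rho)

print("time ratio from medians:", median_x1 / median_x0)
print("exp(beta):              ", math.exp(beta))
```

The same ratio holds for every quantile, not only the median, because the whole time axis is stretched by the factor exp(β).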
Comparison with Cox: AFT is more efficient when the distribution is correctly specified, but sensitive to distributional misspecification. Use information criteria (AIC, BIC) to select the distribution.
Frailty Model
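Model: An extension of Cox regression that introduces a latent (unobserved) random effect uᵢ, the frailty, to account for within-cluster correlation or unobserved heterogeneity:
h(t | Xᵢ, uᵢ) = uᵢ · h₀(t) · exp(Xᵢβ)
Subjects with uᵢ > 1 are more "frail" (higher baseline risk); those with uᵢ < 1 are more resistant. The frailty term is integrated out of the likelihood:
L(β, θ) = ∏ᵢ ∫₀^∞ Lᵢ(β | u) · f(u; θ) du
where Lᵢ(β | u) is the likelihood contribution of cluster i conditional on frailty u, and f(u; θ) is the frailty density with variance θ.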
Frailty distributions:
- Gamma (default): conjugate to the Poisson process; Laplace transform is analytically tractable.
- Log-normal: heavier tails; may be more flexible but requires numerical integration.
- Inverse Gaussian: intermediate tail behaviour.
Cluster-robust alternative: If no cluster column is specified, the platform falls back to Cox regression with a sandwich (robust) variance estimator, which provides consistent standard errors under heterogeneity without requiring a frailty distribution.
EPV recommendation: ≥ 15 events per variable; more clusters improve frailty variance estimation.
Competing Risks Analysis
Setting: When subjects are at risk of multiple mutually exclusive event types (e.g. progression vs. death without progression), standard methods that treat competing events as censoring are biased. The event column must encode: 0 = censored, 1 = event of interest, 2, 3, … = competing events.
The platform supports two complementary frameworks:
Cause-specific hazard (CSH)
hₖ(t | X) = h₀,ₖ(t) · exp(Xβₖ)
Estimated by a separate Cox model for each cause k, treating other causes as censored. Measures the rate of the event among those still at risk; useful for aetiology.
Sub-distribution hazard — Fine-Gray (SDH)
CIF₁(t) = 1 − exp(−H̃₁(t)) = P(T ≤ t, K = 1), where H̃₁(t) is the cumulative sub-distribution hazard for the event of interest.
Direct model of the cumulative incidence function (CIF). Subjects who experience a competing event remain in the risk set. Coefficients directly describe effects on the probability of the event of interest.
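Two building blocks of the frameworks above can be sketched without any library. The first function recodes the event column for a cause-specific model by treating other causes as censored. The second computes a nonparametric cumulative incidence curve (an Aalen-Johansen-type estimator). This is not the Fine-Gray regression itself, only the quantity that regression models, and both function names are illustrative.

```python
def cause_specific_indicator(events, cause=1):
    """Recode events for a cause-specific model: other causes become censored (0)."""
    return [1 if e == cause else 0 for e in events]


def cumulative_incidence(times, events, cause=1):
    """Nonparametric CIF for one cause; returns (time, CIF) at each event time.

    events: 0 = censored, 1, 2, ... = event types.
    """
    surv = 1.0   # probability of being free of any event just before the current time
    cif = 0.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        d_cause = sum(1 for u, e in zip(times, events) if u == t and e == cause)
        d_any = sum(1 for u, e in zip(times, events) if u == t and e != 0)
        if d_any:
            cif += surv * d_cause / at_risk   # event-free so far, then fail from cause
            surv *= 1.0 - d_any / at_risk     # any event type removes subjects
            curve.append((t, cif))
    return curve


times = [1.0, 2.0, 3.0, 4.0]
events = [1, 2, 0, 1]   # progression, competing event, censored, progression
print(cause_specific_indicator(events, cause=1))
print(cumulative_incidence(times, events, cause=1))
print(cumulative_incidence(times, events, cause=2))
```

Unlike one minus a Kaplan-Meier estimate that censors competing events, the cause-wise CIFs from this estimator can never sum to more than 1.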
Bayesian Survival Model
Model: Bayesian inference places prior distributions over the regression coefficients β. The posterior is:
p(β | data) ∝ L(β) · p(β)
where L is the Cox partial likelihood and p(β) is the prior (typically independent normals with configurable scale). Sampling uses Markov Chain Monte Carlo (MCMC) via PyMC.
Reported quantities:
- Posterior mean/median of β (used as coef).
- Posterior SD (used as se(coef)).
- 95% credible interval (equal-tailed, i.e. 2.5th–97.5th posterior percentiles; reported as coef lower/upper 95%).
- MCMC convergence: check that effective sample size ≫ 100 and R̂ ≈ 1.0 for all parameters.
Configurable parameters: n_samples (default 2000, min 100, max 10 000) and prior_scale (default 1.0). Regularisation via prior: a smaller scale yields more shrinkage toward zero.
Advantages: Full uncertainty quantification; robust to small EPV due to prior regularisation; model comparison via WAIC/LOO-CV.
Choosing a method
| Method | When to use | Key output | Min EPV |
|---|---|---|---|
| Cox PH | Default; PH assumption plausible; moderate EPV | Hazard ratio (HR) | 10 |
| AFT | Distribution known; prefer time ratios | Time ratio (TR) | 10 |
| Frailty | Clustered data; unobserved heterogeneity | HR + frailty variance | 15 |
| Competing risks | Multiple event types (progression + death) | CSH or SDH | 15 |
| Bayesian | Small EPV; uncertainty quantification needed | Posterior mean + credible interval | 3 |
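As a quick feasibility check, the minimum-EPV column of the table can be compared against a planned analysis. The sketch below copies the thresholds from the table; the function names are hypothetical and the check is a rule-of-thumb screen, not a substitute for judgement about the model.

```python
# Minimum events per variable, copied from the "Min EPV" column above.
MIN_EPV = {
    "Cox PH": 10,
    "AFT": 10,
    "Frailty": 15,
    "Competing risks": 15,
    "Bayesian": 3,
}


def events_per_variable(n_events, n_covariates):
    """EPV = number of observed events / number of covariates in the model."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return n_events / n_covariates


def feasible_methods(n_events, n_covariates, thresholds=MIN_EPV):
    """Return the methods whose minimum EPV is met, in table order."""
    epv = events_per_variable(n_events, n_covariates)
    return [method for method, minimum in thresholds.items() if epv >= minimum]


print(events_per_variable(20, 4))   # 20 events, 4 covariates
print(feasible_methods(20, 4))
print(feasible_methods(60, 4))
```

Only events count towards EPV; censored patients do not, so a cohort with many patients but few events can still fall below every frequentist threshold.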
8. Secondary Analyses
After the main analysis finishes, the platform automatically launches secondary runs in parallel. Two types are supported:
8.1 Population-sector analyses
One run per stratum of each stratification variable. These run from clustering onwards, sharing the same cohort (established at microbial grouping) but restricted to a single stratum. Example strata from a real analysis:
- FISH indicators: Intermediate-Risk (n=13), High-Risk (n=5), Favorable (n=16)
- Disease characteristics: Low Risk (n=15), Intermediate Risk (n=11), High Risk (n=8)
- Demographics (age): Middle-aged 51-65 (n=19), Elderly >65 (n=13), Young ≤50 (n=2)
- Genomic markers: No Markers (n=11), Cyclin D1 t(11;14) (n=10), 1q Gain (n=9), TP53 del17p (n=7), MAF rearranged (n=5), FGFR3/MMSET t(4;14) (n=3)
8.2 Method-comparison analyses
If multiple analysis methods are selected in the Analysis → Analysis tab (under Analysis Methods Comparison), each method runs independently in its own subfolder under analysis_method/. Results include all standard outputs (coefficient table, forest plot, volcano, box plot, radar) for each method, enabling direct comparison.
For example, AFT results are written to analysis_method/accelerated_failure_time/ with files prefixed analysis_method_1_accelerated_failure_time_*. Population-sector results live under population_sector/<stratification>/<stratum>/.
9. Outputs & Scientific Reporting
All outputs are written as open, machine-readable files (CSV, JSON) and high-resolution images (JPG/PNG). They are designed to be directly imported into statistical software (R, Python/pandas, SPSS) or inserted into manuscript figures and supplementary tables.
9.1 Primary result table
File: *_results_main_<method>.csv (CSV) and *_results_main_<method>.json (JSON).
One row per covariate. Columns:
| Column | Type | Description | Statistical meaning |
|---|---|---|---|
| covariate | string | Covariate name (taxon ID or clinical variable) | Identifies the predictor |
| coef | float | Fitted coefficient β̂ | Log-hazard ratio (Cox/Frailty/CR) or log-time ratio (AFT) or posterior mean (Bayesian) |
| exp(coef) | float | exp(β̂) | Hazard ratio (HR) or time ratio (TR); effect size on the original scale |
| se(coef) | float | Standard error of β̂ | Estimated sampling variability; used to compute Wald CI and z |
| coef lower 95% | float | Lower Wald 95% CI: β̂ − 1.96·SE | Lower confidence / credible interval bound on log scale |
| coef upper 95% | float | Upper Wald 95% CI: β̂ + 1.96·SE | Upper confidence / credible interval bound on log scale |
| exp(coef) lower 95% | float | exp(coef lower 95%) | Lower bound of HR/TR CI |
| exp(coef) upper 95% | float | exp(coef upper 95%) | Upper bound of HR/TR CI |
| cmp to | float | Reference value (usually 0) | Value against which β is compared (always 0 for log-scale) |
| z | float | Wald statistic: z = β̂ / SE | Standard normal test statistic. \|z\| > 1.96 ↔ p < 0.05 |
| p | float | Two-sided p-value: 2·(1 − Φ(\|z\|)) | Probability of observing \|z\| ≥ observed under H₀: β = 0 |
| -log2(p) | float | −log₂(p) | Manhattan/volcano plot scale. Value ≥ 4.32 ↔ p ≤ 0.05; ≥ 9.97 ↔ p ≤ 0.001 |
Example rows from a real Cox PH analysis (the 4 most significant covariates, followed by one non-significant covariate for contrast):
| covariate | coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | z | p | -log2(p) |
|---|---|---|---|---|---|---|---|---|
| functional_hr | 27.919 | 1.39×10¹² | 8.417 | 11.42 | 44.42 | 3.316 | 0.00091 | 10.10 |
| 1359 | 2.891 | 18.02 | 1.012 | 0.908 | 4.875 | 2.857 | 0.00427 | 7.87 |
| 227942 | 36.450 | 6.76×10¹⁵ | 13.797 | 9.408 | 63.492 | 2.642 | 0.00825 | 6.92 |
| weight_kg | −10.084 | 4.15×10⁻⁵ | 4.362 | −18.63 | −1.54 | −2.311 | 0.02082 | 5.59 |
| beta2microglobulin | 0.047 | 1.049 | 1.962 | −3.798 | 3.893 | 0.024 | 0.981 | 0.028 |
Green rows: p < 0.001 (−log₂(p) > 9.97). Yellow rows: p < 0.05. Protective covariates have negative coef (HR < 1).
9.2 Additional tabular outputs
Pipeline summary CSV
One row per pipeline step. Columns: step name, patients in, patients out, features in, features out, duration (s), status.
File: *_pipeline_summary.csv
Cluster definition table
Maps each retained feature to its cluster label, representative status, and summary statistics.
File: *_cluster_definition_table.csv
Missingness table
Per-column missing value counts and percentages. Report directly in the data-quality section of your Methods.
File: *_missingness_table.csv
Discordance table
Compares directional concordance across analysis methods or strata. Useful for sensitivity analyses.
File: *_discordance_table.csv
Univariate screening CSV
Results of individual (univariate) analysis for each covariate before multivariate modelling.
File: *_screening_univariate.csv
Step-level CSVs
Intermediate data frames saved after key steps (e.g. *_10_main.csv, *_90_reduced_clusters.csv) for full reproducibility audit.
9.3 Visualizations
Kaplan-Meier curves
Generated per stratification variable (demographics, FISH indicators, disease characteristics, genomic markers, laboratory values, etc.). File per stratum: *_KM_<stratum>.jpg + *.json.
Box plots
Distribution of each covariate (min, Q1, median, Q3, max) across patient groups. File: *_box_plot.jpg + *.json.
Radar (spider) plot
Clinical profile of each microbial cluster centroid. Helps characterise clusters clinically. File: *_radar_clinical_cluster.jpg.
Correlation heatmap
Pearson correlation matrix of the reduced feature set (post-clustering). Reveals residual collinearity. File: *_correlation_heatmap.png.
Alluvial (Sankey) plot
Flow of patients across stratification layers. Useful for CONSORT-style participant flow diagrams. File: *_alluvial_stratos.jpg.
Ridgeline abundance
Distribution of microbial abundance across patients for each retained taxon. Good for supplementary data quality figures.
9.4 Using results in scientific papers
All CSV outputs can be loaded directly into R (read.csv), Python (pandas.read_csv), or SPSS with no reformatting needed.
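As an illustration of working with the primary result table, the sketch below reads a results CSV with Python's standard csv module and lists covariates below a significance threshold. The inline sample reuses three rows and a subset of the columns from the example table in Section 9.1; the comma delimiter and the helper name are assumptions, and a real file would be opened from disk instead.

```python
import csv
import io

# Inline stand-in for a *_results_main_<method>.csv file (subset of columns).
SAMPLE = """covariate,coef,exp(coef),se(coef),z,p,-log2(p)
1359,2.891,18.02,1.012,2.857,0.00427,7.87
weight_kg,-10.084,4.15e-05,4.362,-2.311,0.02082,5.59
beta2microglobulin,0.047,1.049,1.962,0.024,0.981,0.028
"""


def significant_covariates(handle, alpha=0.05):
    """Return result rows with p < alpha, sorted from most to least significant."""
    reader = csv.DictReader(handle)
    hits = [row for row in reader if float(row["p"]) < alpha]
    return sorted(hits, key=lambda row: float(row["p"]))


hits = significant_covariates(io.StringIO(SAMPLE))
for row in hits:
    direction = "risk" if float(row["coef"]) > 0 else "protective"
    print(f'{row["covariate"]}: HR = {row["exp(coef)"]}, p = {row["p"]} ({direction})')
```

To use a file on disk, replace io.StringIO(SAMPLE) with an open file handle, for example open(path, newline="").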
Recommended reporting practice for the Methods section:
- State which microbiome data pipeline was used (Bracken/Kraken, classification level, timepoints selected).
- Describe pre-processing: attribute and microbial discarding criteria, feature scaling method (e.g. "CLR-transformed abundances, then z-scored"), clustering method and number of clusters.
- Report EPV: "N events / P covariates = EPV".
- Identify the survival model and software library (e.g. "Cox proportional hazards regression implemented via lifelines v0.27").
- State significance threshold (α, e.g. 0.05) and correction for multiple testing if applicable.
Recommended reporting for Results:
- Primary table: covariate, HR (95% CI), z, p. The CSV is ready to paste into a table editor.
- Forest plot: insert the JPG directly; caption with the model name and n events.
- Volcano plot: use as a supplementary figure showing the full covariate landscape; label significant hits.
- KM curves: one per key stratification; include number at risk and log-rank test p-value in the caption.
Example Methods sentence:
"Microbiome features were first aggregated to genus level and filtered to remove constant-value taxa. The resulting feature matrix was clustered using hierarchical clustering (Ward linkage, Euclidean distance, k = 4 clusters) and reduced to cluster representatives. Abundances were CLR-transformed and z-scored. Multivariate survival analysis was performed using Cox proportional hazards regression (lifelines, penalizer = 0, α = 0.05). A total of N events across P covariates (EPV = N/P) were analysed. Statistical significance was defined as p < 0.05 (two-sided Wald test). Secondary analyses were conducted for each disease-risk stratum and for four additional survival models (AFT, Frailty, Competing Risks, Bayesian)."
10. Interpreting Results
10.1 Hazard ratio (Cox / Frailty / Competing Risks)
- HR = 1.0: no association with hazard.
- HR > 1.0: each unit increase in the covariate is associated with higher instantaneous event rate (risk factor). Example: HR = 18.0 for covariate 1359 → 18-fold higher hazard per unit increase.
- HR < 1.0: associated with lower hazard (protective). Example: HR = 4.15×10⁻⁵ for weight_kg → markedly protective.
- Wide CI: high uncertainty, often due to small EPV or collinearity. Use penalization (L2/L1) or the Bayesian model.
10.2 The −log₂(p) scale
Volcano plots use −log₂(p) on the y-axis. Key thresholds:
| p-value | −log₂(p) | Significance |
|---|---|---|
| 0.05 | 4.32 | Standard α = 0.05 |
| 0.01 | 6.64 | 1% |
| 0.001 | 9.97 | 0.1% (Bonferroni for ~50 tests at α = 0.05) |
| 0.0001 | 13.29 | 0.01% (Bonferroni for ~500 tests at α = 0.05) |
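The −log₂(p) values in the table, and a Bonferroni-adjusted threshold for any number of tests, can be reproduced with a few lines of standard-library Python. The helper names are illustrative; Bonferroni is shown only as one common multiple-testing correction, not as the platform's built-in behaviour.

```python
import math


def neg_log2_p(p):
    """Volcano-plot y-axis value for a p-value."""
    return -math.log2(p)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


for p in (0.05, 0.01, 0.001, 0.0001):
    print(f"p = {p:<7} -> -log2(p) = {neg_log2_p(p):.2f}")

# Example: 50 covariates screened at a family-wise alpha of 0.05.
threshold = bonferroni_threshold(0.05, 50)
print(f"Bonferroni threshold: {threshold:.4g}, -log2 = {neg_log2_p(threshold):.2f}")
```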
10.3 EPV and model reliability
10.4 Proportional hazards check
The Cox model assumes that the HR is constant over time. Violation (time-varying HR) can be detected by plotting Schoenfeld residuals vs. time or using Grambsch-Therneau tests. If PH is violated for key covariates, consider time-varying covariates, stratified Cox, or AFT models.
10.5 Bayesian convergence
Examine the MCMC trace plots and R̂ statistics in the process JSON (*_B0_bayesian_info.json if present). R̂ > 1.05 suggests poor mixing; increase n_samples or adjust the prior scale.
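For intuition about what R̂ measures, the sketch below computes the classic Gelman-Rubin statistic for one parameter from several chains of equal length. It is a simplified illustration: the function name is hypothetical, and current MCMC tooling typically reports a more robust split, rank-normalised variant, so values will not match those tools exactly.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic (non-split) R-hat for one parameter.

    chains: list of equally long lists of posterior draws, one list per chain.
    """
    m, n = len(chains), len(chains[0])
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")

    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)                    # B: between-chain variance
    pooled = (n - 1) / n * within + between / n            # pooled posterior variance estimate
    return (pooled / within) ** 0.5


mixed = [[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]        # chains agree
stuck = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]    # chains explore different regions

print("agreeing chains: R-hat =", gelman_rubin(mixed))
print("disagreeing chains: R-hat =", gelman_rubin(stuck))
```

When chains sample different regions, the between-chain variance dominates and R̂ rises well above the 1.05 warning level mentioned above.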
11. Glossary
- Censoring (right)
- Observation where the event had not occurred by end of follow-up, loss to follow-up, or study end. Event indicator = 0. Survival analysis uses all available follow-up time without treating censoring as an event.
- Hazard function h(t)
- Instantaneous rate of the event at time t given survival to t: h(t) = limΔt→0 P(t ≤ T < t+Δt | T ≥ t) / Δt. Related to survival: S(t) = exp(−∫₀ᵗ h(u)du).
- Hazard ratio (HR)
- Ratio of hazards between two covariate values: exp(β·Δx). Constant over time under the proportional hazards assumption.
- Time ratio (TR)
- In AFT models: the factor by which median (or mean) survival time is multiplied per unit covariate increase. exp(β) in the AFT parameterisation.
- Events per variable (EPV)
- Number of observed events divided by number of covariates. Rule of thumb: EPV ≥ 10 for Cox to limit small-sample bias. Use penalization or Bayesian methods with lower EPV.
- Proportional hazards (PH)
- Assumption that hazard ratios are constant over time. Formally: h(t|X₁)/h(t|X₂) = c for all t. Testable via Schoenfeld residuals or log-log plots.
- Stratification
- Partition of the cohort into mutually exclusive subgroups by a categorical variable (e.g. disease risk, genomic markers). Population-sector analyses run one model per stratum.
- Frailty
- Unobserved random multiplicative factor on the hazard, representing unobserved individual or cluster-level heterogeneity. Variance θ ≥ 0; θ = 0 reduces to Cox.
- Competing risks
- Setting where multiple mutually exclusive events can occur. The event of interest is prevented (or its observation altered) by competing events. Cause-specific Cox and Fine-Gray sub-distribution models are the two main approaches.
- Cumulative Incidence Function (CIF)
- CIF₁(t) = P(T ≤ t, K = 1): the probability of experiencing the event of interest by time t in the presence of competing risks. Modelled directly by Fine-Gray.
- CLR (Centered Log-Ratio)
- Log-ratio transformation for compositional data: x′ᵢ = log(xᵢ / g(x)), where g(x) is the geometric mean of the composition. Removes the unit-sum constraint of relative abundances. Not to be confused with the isometric log-ratio (ILR) transformation, which is a different method.
- Credible interval (Bayesian)
- Posterior interval [a, b] such that P(a ≤ β ≤ b | data) = 0.95. Directly probabilistic interpretation (unlike frequentist CI).
- MCMC / R̂
- Markov Chain Monte Carlo: simulation-based posterior sampling. R̂ (Gelman-Rubin) is a convergence diagnostic: R̂ ≈ 1 indicates chains have mixed; R̂ > 1.05 suggests convergence issues.
- Wald test
- Test statistic z = β̂ / SE(β̂), compared to standard normal. Used by default for all frequentist methods in the platform. Alternative: likelihood ratio test (more accurate for small samples).
- EPV-adjusted penalization
- Adding a ridge (L2) or lasso (L1) penalty to the log-likelihood reduces overfitting when EPV is low. The penalty λ shrinks coefficients toward zero; λ = 0 yields standard (unpenalised) MLE.
12. Key References
- Cox DR. Regression Models and Life-Tables. J R Stat Soc Ser B. 1972;34(2):187–220.
- Fine JP, Gray RJ. A Proportional Hazards Model for the Subdistribution of a Competing Risk. J Am Stat Assoc. 1999;94(446):496–509.
- Vaupel JW, Manton KG, Stallard E. The Impact of Heterogeneity in Individual Frailty on the Dynamics of Mortality. Demography. 1979;16(3):439–454.
- Wei LJ. The Accelerated Failure Time Model: A Useful Alternative to the Cox Regression Model in Survival Analysis. Stat Med. 1992;11(14–15):1871–1879.
- Gelman A, et al. Bayesian Data Analysis. 3rd ed. CRC Press; 2013.
- Davidson-Pilon C. lifelines: survival analysis in Python. JOSS. 2019;4(40):1317. lifelines.readthedocs.io
- Aitchison J. The Statistical Analysis of Compositional Data. Chapman & Hall; 1986.
- Pedregosa F, et al. Scikit-learn: Machine Learning in Python. JMLR. 2011;12:2825–2830.
This documentation is generated from the platform source code and analysis metadata. Statistical formulas reflect the implementations in step_B0_analysis.py and metadata/ANALYSIS_METHODS.py. For questions or issue reports, contact glevcovich@gmail.com.