This document describes the data sources, model structure, and implementation details for the wastewater-informed hospital admissions forecasts submitted to the COVID-19 Forecast Hub under the name `cfa-wwrenewal`.
The methods described will evolve as we continue to develop and improve the model. We use the model to produce forecasts for 53 jurisdictions: the 50 U.S. states, the District of Columbia, Puerto Rico, and the United States as a whole.
We welcome feedback on the model structure and implementation. Please feel free to submit an issue directly on GitHub or reach out via this form.
For an example of how to fit the model to simulated data, we recommend the toy data vignette.
We use two data sources as input to the model: viral genome concentrations in wastewater ("genome concentrations") and incident hospital admissions ("admissions"). We describe each of these data sources in detail below.
We source viral genome concentration in wastewater from the CDC National Wastewater Surveillance System (NWSS) full analytic dataset. Concentrations are typically reported in units of estimated genome copies per unit volume wastewater or per unit mass of dry sludge. At present, we use only measurements per unit volume wastewater. The dataset is defined in the data dictionary and is described in plain text on the NWSS website. The full dataset is available to researchers upon request but a dashboard with summary statistics is publicly available.
To generate each forecast, we use the most recent data available at the time of submission. These data are typically only partly complete for the most recent week.
These data are complex and have many sources of variability and uncertainty. For example,
- Each jurisdiction (for our forecasts, typically a U.S. state or territory) has a different number of wastewater sampling sites (e.g., wastewater treatment plants) that each cover a catchment area. This means that different sites cover different populations and that, combined across a jurisdiction, the sites represent only some subset of the jurisdiction's total population. The fraction of the total population covered by sampling sites varies significantly among the jurisdictions we consider.
- Different wastewater sites may use different sample collection methodologies. These methodologies may also vary through time at an individual site.
- Different sites sample at different cadences. Some sites sample nearly daily, while others sample less than once a week. Some sites report their data within the week, while others have reporting latencies on the order of weeks.
- Different samples are processed in different labs that may use different methodologies for extraction, concentration, and quantification. These labs may also vary their methodologies through time.
- Not all of the jurisdictions we forecast submit genome concentrations to CDC's National Wastewater Surveillance System (NWSS).
These complexities mean that wastewater data are often difficult to interpret. For example, a change in genome concentration sampled at a site could reflect a change in the underlying number of infections, a change in the number of samples collected or processed, a change in the collection or processing methodologies, a change in the reporting cadence or latency, and/or a change in the viral shedding kinetics of the population (e.g., due to a new variant or a change in the age distribution of cases).
Details about data preprocessing, outlier detection, and aggregation are in the Appendix.
We source hospital admissions data for each jurisdiction from healthdata.gov. These data are updated on Friday afternoon with admissions through the previous Saturday, and we submit forecasts on the following Monday, so there is a 9-day delay between the most recent admissions data and the forecast submission date.
Details about data preprocessing and outlier detection are in the Appendix.
We use three models to forecast COVID-19 hospital admissions. The models share a common structure and are fit using the same codebase. The models differ in the data they use as input, in whether or not they include a wastewater component, and in the aggregation of the estimates they produce.
Model 1 is a wastewater-informed, site-level infection dynamics model. It uses genome concentrations at the site level and admissions at the jurisdictional level as input and produces forecasts of admissions at the jurisdictional level. We use Model 1 to generate forecasts for jurisdictions with any wastewater data available. However, if a jurisdiction has not reported any genome concentration data in the past 21 days, or if any site has fewer than 5 data points, we flag this in our metadata, as it is unlikely that the wastewater data is meaningfully informing the forecast in that jurisdiction.
Model 2 is a no-wastewater, jurisdiction-level infection dynamics model. It uses jurisdiction-level admissions as input and produces forecasts of jurisdiction-level admissions as output. We use Model 2 to generate forecasts for jurisdictions without any wastewater data.
Model 3 is a nationwide aggregated wastewater model. It uses nationally-aggregated genome concentrations and national-level admissions as input and produces forecasts of national-level admissions. We use Model 3 to generate nationwide forecasts for the entire United States. Of the forecasts being submitted to the COVID-19 Forecast Hub, only the nationwide (U.S.) forecasts use this observation model.
Model | Input WW data | Input hospitalization data | Infection dynamics | Predicted hospitalizations |
---|---|---|---|---|
Model 1 | Site-level | Jurisdiction-level | Coupled site- and jurisdiction-level | Jurisdiction-level |
Model 2 | None | Jurisdiction-level | Jurisdiction-level | Jurisdiction-level |
Model 3 | Nationally aggregated | Nationally aggregated | National | National |
Alongside each forecast, we publish a corresponding metadata table with jurisdiction-by-jurisdiction notes on wastewater data status and model used. If a jurisdiction has no wastewater data during the fitting period (the past 90 days), we fit Model 2 and indicate this in the metadata. If a jurisdiction has minimal wastewater data or no recent wastewater measurements, we still fit Model 1, but we indicate in the metadata that the wastewater data may be insufficient to inform the forecast.
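The selection rules above can be sketched as a small function. This is an illustrative sketch only, not the production logic: the `Sample` record, the function name `choose_model`, and the exact handling of window boundaries are assumptions, while the 90-day, 21-day, and 5-point thresholds come from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FIT_WINDOW_DAYS = 90       # fitting period stated in the text
RECENCY_DAYS = 21          # "past 21 days" threshold stated in the text
MIN_POINTS_PER_SITE = 5    # "fewer than 5 data points" threshold stated in the text


@dataclass
class Sample:
    """Hypothetical minimal wastewater record: one measurement at one site."""
    site: str
    sample_date: date


def choose_model(samples, forecast_date):
    """Return (model_name, insufficient_ww_flag) for one jurisdiction."""
    window_start = forecast_date - timedelta(days=FIT_WINDOW_DAYS)
    in_window = [s for s in samples
                 if window_start <= s.sample_date <= forecast_date]
    if not in_window:
        # No wastewater data in the fitting period: admissions-only model.
        return "Model 2", False

    recent_cutoff = forecast_date - timedelta(days=RECENCY_DAYS)
    has_recent = any(s.sample_date >= recent_cutoff for s in in_window)

    counts = {}
    for s in in_window:
        counts[s.site] = counts.get(s.site, 0) + 1
    sparse_site = any(n < MIN_POINTS_PER_SITE for n in counts.values())

    # Model 1 is still fit, but the metadata notes limited wastewater data.
    return "Model 1", (not has_recent) or sparse_site
```

A jurisdiction with six recent samples at a single site would return `("Model 1", False)` under these assumptions, while the same six samples taken two months earlier would return `("Model 1", True)`.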
Our models are constructed from a set of generative components. These are:
- Infection component: A renewal model for the infection dynamics, which generates estimates of incident latent infections per capita.
- Hospital admissions component: A model for the expected number of hospital admissions given incident latent infections per capita.
- Viral genome concentration in wastewater: A model for the expected genome concentration given incident infections per capita.
Depending on the model, these components are implemented at different spatial scales and with different observation processes. In particular, the link to the observables depends on the model and the form of the observables. See below for detailed descriptions of each component and the following sections for model-specific details.
See the notation section for an overview of the mathematical notation we use to describe the model components, including how probability distributions are parameterized.
This component assumes that latent (unobserved) expected incident infections per capita
This process is initialized by estimating an initial exponential growth [2] of infections for 50 days prior to the calibration start time
where
We decompose the instantaneous reproduction number
We assume that the unadjusted reproduction number
where
The damping term we use is based on Asher et al. 2018 [5] but extended to be applicable to a renewal process. It assumes that the instantaneous reproduction number is damped by recent infections weighted by the generation interval. This is a simple way to account for the fact that the instantaneous reproduction number is likely to decrease when there are many infections in the population, due to factors such as immunity, behavioral changes, and public health interventions. The damping term is defined as:
where
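As a rough illustration of a renewal process with this kind of infection feedback, the sketch below simulates per capita infections. It assumes, for illustration only, that the damped reproduction number is the undamped one multiplied by `exp(-gamma * force)`, where `force` is the generation-interval-weighted sum of recent infections. The function name, the argument names, and this exponential form are assumptions, not the model's definition.

```python
import math


def simulate_renewal(r_undamped, gen_interval, seed_infections, gamma):
    """Simulate per capita incident infections under a damped renewal process.

    r_undamped: undamped reproduction number for each simulated day
    gen_interval: generation interval PMF; gen_interval[0] is a lag of 1 day
    seed_infections: per capita infections before the simulation, oldest
        first; must be at least as long as gen_interval
    gamma: non-negative feedback strength (hypothetical name)
    """
    if len(seed_infections) < len(gen_interval):
        raise ValueError("need at least len(gen_interval) seed days")
    infections = list(seed_infections)
    r_damped = []
    for r_u in r_undamped:
        # Recent infections weighted by the generation interval.
        force = sum(g * infections[-lag]
                    for lag, g in enumerate(gen_interval, start=1))
        # Assumed damping form: more recent infections lower R(t).
        r_t = r_u * math.exp(-gamma * force)
        r_damped.append(r_t)
        infections.append(r_t * force)
    return infections[len(seed_infections):], r_damped
```

With `gamma = 0` the sketch reduces to an ordinary renewal process. Larger values of `gamma` pull the reproduction number down as infections accumulate.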
Following other semi-mechanistic renewal frameworks, we model the expected hospital admissions per capita
To account for day-of-week effects in hospital reporting, we use an estimated weekday effect
Where
We define the discrete hospital admissions delay distribution
In the models that include fits to wastewater data, we allow the population-level infection-hospitalization rate (IHR) to change over time. An inferred change in the IHR could reflect either a true change in the rate at which infections result in hospital admissions (e.g. the age distribution of cases could shift, a more or less severe variant could emerge, or vaccine coverage could shift) or a change in the relationship between infections and genomes shed in wastewater
Therefore, we model the proportion of infections that give rise to hospital admissions
The values
where
where
We chose a relatively strong prior of
In hospital admissions only models, we model the IHR as constant. We assign this constant IHR the same prior distribution that we assign
We model the observed hospital admission counts
where the jurisdiction population size
Currently, we do not explicitly model the delay from hospital admission to reporting of hospital admissions. In reality, corrections (upwards or downwards) in the admissions data after the report date are possible and do happen. See outlier detection and removal for further details.
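To make the structure of the hospital admissions component concrete, the sketch below computes expected daily admission counts from per capita infections. It assumes the expected per capita admissions are the product of a weekday multiplier, the IHR, and a convolution of past infections with the infection-to-admission delay distribution, then scaled by the jurisdiction population. All names, the delay indexing from 0 days, and the treatment of infections before the first day as zero are assumptions made for illustration.

```python
def expected_admissions(infections, delay_pmf, ihr, weekday_effect, population):
    """Expected daily hospital admission counts (illustrative sketch).

    infections: per capita incident infections, oldest first
    delay_pmf: infection-to-admission delay PMF; delay_pmf[0] is a 0-day delay
    ihr: probability that an infection leads to a hospital admission
    weekday_effect: 7 multipliers; index 0 matches the weekday of infections[0]
    population: jurisdiction population size
    """
    expected = []
    for t in range(len(infections)):
        # Infections before the first simulated day are treated as zero here.
        convolved = sum(delay_pmf[tau] * infections[t - tau]
                        for tau in range(min(t + 1, len(delay_pmf))))
        expected.append(population * weekday_effect[t % 7] * ihr * convolved)
    return expected
```

In a full model these expectations would feed an overdispersed count likelihood such as the Negative Binomial described in the notation section. That step is omitted here.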
We model viral genome concentrations in wastewater
where
This approach assumes that
We model the shedding kinetics
where
We model the log observed genome concentrations as Normally distributed:
This component does not mechanistically simulate each step involved in sample collection, processing, and reporting. Instead, it aims to account for these processes at a summary level. Future iterations of this model will evaluate the utility of mechanistic modeling of wastewater collection and processing.
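The sketch below illustrates one way such a summary-level wastewater component can be written. It assumes the expected genome concentration is the convolution of per capita infections with a normalized shedding kinetics curve, scaled by the total genomes shed per infection and divided by the wastewater produced per person-day, with Normally distributed log observations. The function names, argument names, and this functional form are assumptions for illustration.

```python
import math


def expected_concentration(infections, shedding_kinetics,
                           genomes_per_infection, wastewater_ml_per_person_day):
    """Expected genome copies per mL for each day (illustrative sketch).

    shedding_kinetics: fraction of total shedding on each day since infection,
        starting at day 0 and summing to 1
    """
    scale = genomes_per_infection / wastewater_ml_per_person_day
    conc = []
    for t in range(len(infections)):
        shed = sum(shedding_kinetics[tau] * infections[t - tau]
                   for tau in range(min(t + 1, len(shedding_kinetics))))
        conc.append(scale * shed)
    return conc


def log_concentration_logpdf(observed_conc, expected_conc, sigma):
    """Normal log density of log(observed) around log(expected)."""
    z = (math.log(observed_conc) - math.log(expected_conc)) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
```

The numeric inputs in any call to this sketch are placeholders. The actual parameter values and priors are those listed in the parameter tables.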
In this model, we represent hospital admissions at the jurisdictional level and viral genome concentrations at the site level. We use the components described above but divide the jurisdiction's total population into subpopulations representing sampled wastewater sites' catchment populations, with an additional subpopulation to represent individuals who do not contribute to sampled wastewater.
We model infection dynamics in these subpopulations hierarchically: subpopulation infection dynamics are distributed about a central jurisdiction-level infection dynamic, and the jurisdiction's total infections are simply the sum of the subpopulation-level infections.
In Model 1, a jurisdiction consists of
Whenever the sum of the wastewater catchment population sizes
The total number of subpopulations is then
This amounts to modeling the wastewater catchment populations as approximately non-overlapping; every infected individual either does not contribute to measured wastewater or contributes principally to one wastewater catchment. This approximation is reasonable because we only use samples taken from primary wastewater treatment plants, which avoids the possibility that an individual might be sampled once in a sample taken upstream and then sampled again in a more aggregated sample taken further downstream; see data filtering for further details.
If the sum of the wastewater site catchment populations meets or exceeds the reported jurisdiction population (
When converting from predicted per capita incident hospital admissions
This amounts to making two key additional modeling assumptions:
- Any individuals who contribute to wastewater measurements but are not part of the jurisdiction population are distributed among the catchment populations approximately proportional to catchment population size.
- Whenever $\sum n_k \ge n$, the fraction of individuals in the jurisdiction not covered by wastewater is small enough to have minimal impact on the jurisdiction-wide per capita infection dynamics.
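The subpopulation bookkeeping described above can be sketched as follows. The function names are hypothetical, and the population-weighted average used to combine subpopulation per capita infections is an assumption that follows from treating total infections as the sum over subpopulations.

```python
def subpopulation_sizes(site_populations, jurisdiction_population):
    """Return the subpopulation sizes used for one jurisdiction."""
    covered = sum(site_populations)
    sizes = list(site_populations)
    if covered < jurisdiction_population:
        # Extra subpopulation for people not covered by sampled wastewater.
        sizes.append(jurisdiction_population - covered)
    # Otherwise the catchments alone are treated as the whole jurisdiction.
    return sizes


def jurisdiction_infections_per_capita(subpop_infections, sizes):
    """Population-weighted mean of per capita infections across subpopulations."""
    total = sum(sizes)
    return sum(i * n for i, n in zip(subpop_infections, sizes)) / total
```

For example, two catchments of 100 and 200 people in a jurisdiction of 1,000 give three subpopulations of sizes 100, 200, and 700, while catchments of 600 and 500 give only the two catchment subpopulations.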
We couple the subpopulation- and jurisdiction-level infection dynamics at the level of the un-damped instantaneous reproduction number
We model the subpopulations as having infection dynamics that are similar to one another but can differ from the overall jurisdiction-level dynamic.
We represent this with a hierarchical model where we first model a jurisdiction-level un-damped effective reproductive number
The jurisdiction-level model for the undamped instantaneous reproductive number
where
where
We chose a prior of
The subpopulation
From
To obtain the number of infections per capita
We infer the site level initial per capita incidence
We model site-specific viral genome concentrations in wastewater
Genome concentration measurements can vary between sites, and even within a site through time, because of differences in sample collection and lab processing methods. To account for this variability, we add a scaling term
Both
In the rare cases when a site submits multiple concentrations for a single date and lab method, we treat each record as an independent observation.
Lab processing methods have a finite limit of detection (LOD), such that not all wastewater measurements can be modeled using the log-normal approach above. This limit of detection varies across sites, between methods, and potentially also over time.
If an observed value
where
(This is mathematically equivalent to integrating the probability density function of the log-normal distribution from zero to the LOD.)
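The equivalence noted above can be checked numerically. The sketch below computes the censored log likelihood as the Normal CDF of the log LOD and provides the log-normal density so the result can be compared against direct integration. The function and argument names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def censored_log_likelihood(lod, expected_conc, sigma):
    """Log probability that a log-normally distributed measurement is below the LOD."""
    return math.log(normal_cdf(math.log(lod), math.log(expected_conc), sigma))


def lognormal_pdf(c, mu, sigma):
    """Density of a concentration c whose log is Normal(mu, sigma)."""
    z = (math.log(c) - mu) / sigma
    return math.exp(-0.5 * z * z) / (c * sigma * math.sqrt(2.0 * math.pi))
```

Integrating `lognormal_pdf` from zero to the LOD, for instance with a midpoint rule, should reproduce `exp(censored_log_likelihood(...))` up to numerical error. For expected concentrations far above the LOD the probability can underflow to zero, so a production implementation would work with a log CDF function directly.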
If a sample is flagged in the NWSS data as below the LOD (field `pcr_target_below_lod`) but is missing a reported LOD (field `lod_sewage`), the 95th percentile of LOD values across the entire dataset is used as the integral's upper limit.
If a sample has a reported concentration (field `pcr_target_avg_conc`) above the corresponding reported LOD, but the sample is nevertheless flagged as below the LOD (field `pcr_target_below_lod`), we assume the flag takes precedence and treat the sample as below LOD for the purposes of censoring.
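These two rules can be sketched as a preprocessing step. The record layout (dictionaries keyed by the NWSS field names, with a boolean flag and `None` for a missing LOD), the function names, and the linear-interpolation percentile are assumptions for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100] (assumed method)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values to summarize")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def resolve_censoring(records):
    """Attach an LOD and a censoring indicator to each record.

    Each record is a dict with keys pcr_target_avg_conc, lod_sewage
    (None if missing), and pcr_target_below_lod (bool in this sketch).
    """
    reported = [r["lod_sewage"] for r in records if r["lod_sewage"] is not None]
    fallback = percentile(reported, 95) if reported else None
    resolved = []
    for r in records:
        censored = bool(r["pcr_target_below_lod"])  # the flag takes precedence
        lod = r["lod_sewage"]
        if censored and lod is None:
            lod = fallback  # 95th percentile of all reported LODs
        resolved.append({"conc": r["pcr_target_avg_conc"],
                         "lod": lod, "censored": censored})
    return resolved
```

Note that a flagged sample stays censored even when its reported concentration exceeds its reported LOD, matching the precedence rule in the text.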
The no-wastewater, jurisdiction-level infection dynamics model is the simplest model, because it does not include wastewater data. Each jurisdiction is modeled as a separate population according to the infection component and hospitalization component described above. We use this model when no wastewater data is available for a jurisdiction. This model also serves as a comparative baseline for evaluating wastewater-informed models.
In this model, the entire United States is treated as a single population. This model uses the general infection, hospitalization, and wastewater components described above, but with the following modifications:
We first generate a thresholded population-weighted average viral genome concentration in wastewater:
- For each week and site, if the site has more than one sample in that week, compute the mean genome concentration across those samples. This associates each site and week with a single genome concentration.
- For each week, compute the weighted mean genome concentration across all sites. The weights are the site population, or 300,000, whichever is lower. The aim of this thresholding is to prevent large sites from dominating the average concentration.
This approach is similar to the algorithm used by Biobot Analytics to derive regional and national aggregate genome concentration. Future iterations of the model will evaluate the validity of this approach.
We then use this thresholded population-weighted average concentration as the input to the wastewater viral concentration component.
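The two aggregation steps can be sketched as follows. The input layout (tuples of week, site, concentration, and site population) and the function name are assumptions, while the 300,000 cap comes from the text.

```python
from collections import defaultdict
from statistics import mean

POPULATION_CAP = 300_000  # weight threshold stated in the text


def national_weekly_concentration(samples):
    """Thresholded population-weighted mean concentration for each week.

    samples: iterable of (week, site, concentration, site_population)
    """
    concs = defaultdict(list)
    populations = {}
    for week, site, conc, pop in samples:
        concs[(week, site)].append(conc)
        populations[(week, site)] = pop  # last reported value for that week

    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for (week, site), values in concs.items():
        weight = min(populations[(week, site)], POPULATION_CAP)
        numerator[week] += weight * mean(values)  # step 1: site-week mean
        denominator[week] += weight               # step 2: capped weights
    return {week: numerator[week] / denominator[week] for week in numerator}
```

For example, a site serving 1,000,000 people with a weekly mean of 20 and a site serving 100,000 people with a weekly mean of 60 average to 30, because the larger site's weight is capped at 300,000.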
We use informative priors for parameters that have been well characterized in the literature and weakly informative priors for parameters that have been less well characterized.
Parameter | Prior distribution | Source |
---|---|---|
Initial hospitalization probability | | Perez-Guzman et al. 2023 [8] |
Time to peak fecal shedding | | Russell et al. 2023 [9], Huisman et al. 2022 [10], Cavany et al. 2022 [11] |
Peak viral shedding | | Miura et al. 2021 [12] |
Duration of shedding | | Cevik et al. 2021 [13], Russell et al. 2023 [9] |
Total genomes shed per infected individual | | Watson et al. 2023 [14] |
Initial infections per capita | | |
Initial exponential growth rate | | Chosen to assume flat dynamics prior to observations |
Infection feedback term | | Weakly informative prior chosen to have a mode of 500 in natural scale, based on posterior estimates of peaks from prior seasons in a few jurisdictions |
Parameter | Value | Source |
---|---|---|
Maximum generation interval | | |
Maximum infection to hospital admissions delay | | |
Wastewater produced per person-day | | Ortiz 2024 [15] |
The discrete generation interval probability mass function
We derive the distribution
We model the incubation period with a discretized, modified Weibull distribution [6] with probability mass function
We model the symptom onset to hospital admission delay distribution with a Negative Binomial distribution with probability mass function
The infection-to-hospitalization delay distribution
This resulting infection to hospital admission delay distribution has a mean of 12.2 days and a standard deviation of 5.67 days.
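A delay made of two consecutive stages can be obtained by convolving the stage distributions, assuming the stages are independent. The sketch below shows the mechanics with generic discrete PMFs. The function names and the independence assumption are illustrative, and the sketch does not reproduce the specific Weibull and Negative Binomial parameters behind the reported mean and standard deviation.

```python
def convolve_pmfs(first, second):
    """PMF of the sum of two independent non-negative integer delays.

    Index k of each input is the probability of a delay of k days.
    """
    out = [0.0] * (len(first) + len(second) - 1)
    for i, p_first in enumerate(first):
        for j, p_second in enumerate(second):
            out[i + j] += p_first * p_second
    return out


def pmf_mean_sd(pmf):
    """Mean and standard deviation of a discrete delay distribution."""
    mean = sum(k * p for k, p in enumerate(pmf))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, variance ** 0.5
```

Under independence the means and variances of the two stages add, which gives a quick sanity check on any combined delay distribution.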
Our framework is an extension of the widely used [18] [19] semi-mechanistic renewal framework `{EpiNow2}` [20] [2], using a Bayesian latent variable approach implemented in the probabilistic programming language Stan [21], using CmdStanR [22] to interface with R.
For submission to the COVID-19 Forecast Hub, the model is run on Saturday to generate forecasts each Monday.
For each jurisdiction, we run 4 chains for 750 warm-up iterations and 500 sampling iterations, with a target average acceptance probability of 0.95 and a maximum tree depth of 12.
To generate forecasts per the hub submission guidelines, we calculate the necessary quantiles from the 2,000 draws from the posterior of the expected observed hospital admissions 28 days ahead of the Monday forecast date.
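With 4 chains of 500 sampling iterations each, there are 2,000 posterior draws per forecast day. The sketch below shows one way to turn such draws into quantiles. The function names, the linear-interpolation rule, and the layout of draws as a list of trajectories are assumptions. The quantile levels required by the hub are not reproduced here and are passed in as an argument.

```python
def quantile(sorted_draws, q):
    """Linear-interpolation quantile of pre-sorted draws for q in [0, 1]."""
    pos = (len(sorted_draws) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    return sorted_draws[lo] + (sorted_draws[hi] - sorted_draws[lo]) * (pos - lo)


def forecast_quantiles(trajectories, levels):
    """Quantiles of predicted admissions for each forecast day.

    trajectories: one list per posterior draw, each with one value per day
    levels: quantile levels in [0, 1]
    """
    horizon = len(trajectories[0])
    result = []
    for day in range(horizon):
        day_draws = sorted(traj[day] for traj in trajectories)
        result.append({q: quantile(day_draws, q) for q in levels})
    return result
```

Each element of the returned list maps a quantile level to the predicted admissions value for that forecast day.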
The notation
We parameterize Normal distributions in terms of their mean and standard deviation:
We parameterize Beta distributions in terms of their two standard shape parameters
We parameterize Negative Binomial distributions in terms of their mean and their positive-constrained dispersion parameter (often denoted
We write
Observed data are labeled by data source:
Field names are references to fields in the analytic dataset.
- Primary wastewater treatment plants only (field `sample_location` equals `wwtp`)
- SARS-CoV-2 amplification targets only (field `pcr_target` equals `sars-cov-2`)
- No solid samples (field `sample_matrix` is not `primary_sludge`, and field `pcr_target_units` is not `copies/g dry sludge`)
- No samples flagged for quality issues (field `quality_flag` is not `yes`)
- Outliers removed (see below)
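We identify potential outlier genome concentrations for each unique site and lab pair with an approach based on $z$-scores.
Briefly, we compute $z$-scores for the log concentrations and for their changes per unit time, and remove observations that exceed a threshold on either metric: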
- For purposes of outlier detection, exclude wastewater observations below the LOD.
- For purposes of outlier detection, exclude observations more than 90 days before the forecast date.
- For each site $i$, compute the change per unit time between successive observations $t$ and $t'$: $(\log[c_{it'}] - \log[c_{it}])/(t' - t)$.
- Compute $z$-scores for $\log[c_{it}]$ across all sites $i$ and timepoints $t$. Flag values with $z$-scores over 3 as outliers and remove them from model calibration.
- Compute $z$-scores for the change per unit time values across all sites and pairs of timepoints. For values with $z$-scores over 2, flag the corresponding wastewater concentrations $c_{it}$ as outliers and remove them from model calibration.
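The steps above can be sketched as follows. Several details are assumptions made for illustration: the input is taken to be already restricted to above-LOD observations within the 90-day window, the comparison uses the signed $z$-score rather than its absolute value, a flagged change is attributed to the later observation of the pair, and the sample standard deviation is used. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def zscores(values):
    """Z-scores using the sample standard deviation; zeros if undefined."""
    if len(values) < 2:
        return [0.0] * len(values)
    center, spread = mean(values), stdev(values)
    if spread == 0:
        return [0.0] * len(values)
    return [(v - center) / spread for v in values]


def flag_outliers(obs, conc_threshold=3.0, change_threshold=2.0):
    """Return indices of obs flagged as outliers.

    obs: list of (site, day, concentration), pre-filtered as described above
    """
    log_conc = [math.log(c) for _, _, c in obs]
    flagged = {i for i, z in enumerate(zscores(log_conc)) if z > conc_threshold}

    by_site = {}
    for i, (site, day, _) in enumerate(obs):
        by_site.setdefault(site, []).append((day, i))

    changes = []  # (change per unit time, index of the later observation)
    for entries in by_site.values():
        entries.sort()
        for (day0, i0), (day1, i1) in zip(entries, entries[1:]):
            if day1 > day0:
                rate = (log_conc[i1] - log_conc[i0]) / (day1 - day0)
                changes.append((rate, i1))

    for z, (_, index) in zip(zscores([r for r, _ in changes]), changes):
        if z > change_threshold:
            flagged.add(index)
    return flagged
```

Observations whose indices are returned would then be excluded from model calibration.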
The
- For consistency, the reported viral genome concentration in wastewater (field `pcr_target_avg_conc`) and LOD (field `lod_sewage`) are converted to genome copies per mL wastewater using the reported measurement units (field `pcr_target_units`).
- For wastewater catchment population $n_{it}$, we use field `population_served`.
- We identify the unique combinations of sites and labs and add this as a column to our pre-processed dataset.
We visually inspect the hospital admissions data for each jurisdiction before producing a forecast, identifying anomalies that seem implausible. We then remove these observations from the model inference, treating them as missing data (i.e. as NA values in the model). Often these implausible observations are due to reporting errors, such as a hospital reporting a large number of admissions on a single day that should have been spread out over multiple days and are later corrected. When this happens, we add the corrected data back into the model inference when it gets updated.
Footnotes
1. Cori, A., Ferguson, N. M., Fraser, C., & Cauchemez, S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am. J. Epidemiol. 178, 1505-1512 (2013). https://doi.org/10.1093/aje/kwt133
2. Abbott, S. et al. EpiNow2: Estimate real-time case counts and time-varying epidemiological parameters. https://doi.org/10.5281/zenodo.3957489
3. Fraser, C. Estimating individual and household reproduction numbers in an emerging epidemic. PLoS One 2(8), e758 (2007). https://doi.org/10.1371/journal.pone.0000758
4. Gostic, K.M. et al. Practical Considerations for Measuring the Effective Reproductive Number, Rt. PLoS Comput Biol. 16(12) (2020). https://doi.org/10.1371/journal.pcbi.1008409
5. Asher, J. Forecasting Ebola with a regression transmission model. Epidemics 22, 50-55 (2018). https://doi.org/10.1016/j.epidem.2017.02.009
6. Park, S.W. et al. Inferring the differences in incubation-period and generation-interval distributions of the Delta and Omicron variants of SARS-CoV-2. Proc Natl Acad Sci U S A. 120(22):e2221887120 (2023). https://doi.org/10.1073/pnas.2221887120
7. Larremore, D.B. et al. Test sensitivity is secondary to frequency and turnaround time for COVID-19 screening. Science Advances (2021). https://doi.org/10.1126/sciadv.abd5393
8. Perez-Guzman, P.N. et al. Epidemiological drivers of transmissibility and severity of SARS-CoV-2 in England. Nat Commun 14, 4279 (2023). https://doi.org/10.1038/s41467-023-39661-5
9. Russell, T.W. et al. Within-host SARS-CoV-2 viral kinetics informed by complex life course exposures reveals different intrinsic properties of Omicron and Delta variants. medRxiv (2023). https://doi.org/10.1101/2023.05.17.23290105
10. Huisman, J.S. et al. Estimation and worldwide monitoring of the effective reproductive number of SARS-CoV-2. eLife 11:e71345 (2022). https://doi.org/10.7554/eLife.71345
11. Cavany, S. et al. Inferring SARS-CoV-2 RNA shedding into wastewater relative to the time of infection. Epidemiology and Infection 150:e21 (2022). https://doi.org/10.1017/S0950268821002752
12. Miura, F., Kitajima, M., & Omori, R. Duration of SARS-CoV-2 viral shedding in faeces as a parameter for wastewater-based epidemiology: Re-analysis of patient data using a shedding dynamics model. Sci Total Environ 769:144549 (2021). https://doi.org/10.1016/j.scitotenv.2020.144549
13. Cevik, M. et al. SARS-CoV-2, SARS-CoV, and MERS-CoV viral load dynamics, duration of viral shedding, and infectiousness: a systematic review and meta-analysis. Lancet Microbe 2(1), e13-e22 (2021). https://doi.org/10.1016/S2666-5247(20)30172-5
14. Watson, L.M. et al. Improving estimates of epidemiological quantities by combining reported cases with wastewater data: a statistical framework with applications to COVID-19 in Aotearoa New Zealand. medRxiv (2023). https://doi.org/10.1101/2023.08.14.23294060
15. Ortiz, P. Wastewater facts - statistics and household data in 2024. https://housegrail.com/wastewater-facts-statistics/
16. Park, S.W. et al. Estimating epidemiological delay distributions for infectious diseases. medRxiv (2024). https://doi.org/10.1101/2024.01.12.24301247
17. Danaché, C. et al. Baseline clinical features of COVID-19 patients, delay of hospital admission and clinical outcome: A complex relationship. PLoS One 17(1):e0261428 (2022). https://doi.org/10.1371/journal.pone.0261428
18. US Centers for Disease Control and Prevention. Current Epidemic Growth Status (Based on Rt) for States and Territories (2024). https://www.cdc.gov/forecast-outbreak-analytics/about/rt-estimates.html
19. US Centers for Disease Control and Prevention. Technical Blog: Improving CDC’s Tools for Assessing Epidemic Growth (2024). https://www.cdc.gov/forecast-outbreak-analytics/about/technical-blog-rt.html
20. Abbott, S. et al. Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. Wellcome Open Res. 5:112 (2020). https://doi.org/10.12688/wellcomeopenres.16006.2
21. Stan Development Team. Stan Modeling Language Users Guide and Reference Manual. (2023). https://mc-stan.org
22. CmdStanR: the R interface to CmdStan. (2024). https://mc-stan.org/cmdstanr/index.html