---
title: "Case Definition"
editor: visual
---

## 1. Learning outcomes

At the end of the session, participants will be able to:

-   Apply case definition criteria to a dataset using logical arguments in R

## 2. Story/plot description

You will now create a new column in the data set to hold the case definition you decided in a previous step during your investigation. You can call this column `case` and set it to `TRUE` if the individual meets the case definition criteria and `FALSE` if not. You will use this column later on for any calculations needed (descriptive statistics, two-by-two tables to compute measures of association, etc.) to figure out the culprit of this outbreak.

## 3. Questions/Assignments

If you closed your R project, please open it again and load the clean dataset. We strongly recommend using `Spetses_clean1_2024.rds`, so that every change the code below relies on has already been applied (you may have cleaned your own file in a slightly different way than we did).

To be sure we all start from the same case definition, let's agree that a case was defined as a person who:

-   [x] attended the school dinner on 5 October 2024 (i.e. is on the linelist)
-   [x] ate a meal at the school dinner (i.e. was exposed)
-   [x] fell ill after the start of the meal
-   [x] fell ill within the time period of interest after the school dinner
-   [x] suffered from diarrhoea with or without blood, or vomiting

Non-cases (not ill) were defined as people who:

-   [x] attended the school dinner on 5 October 2024 (i.e. are on the linelist)
-   [x] ate a meal at the school dinner (i.e. were exposed)
-   [x] did not fall ill within the time period of interest
-   [x] did not develop diarrhoea (with or without blood) or vomiting

## 3.1 Install packages (if needed) and load libraries

```{r, Packages}
# Load the required libraries into the current R session:
pacman::p_load(rio, 
               here, 
               tidyverse, 
               skimr,
               plyr,
               janitor,
               lubridate,
               gtsummary, 
               flextable,
               officer,
               epikit, 
               apyramid, 
               scales)

```

## 3.2 Import your data

```{r, Import_data}
# Import the clean data set:
copdata <- rio::import(here::here("data", "Spetses_clean1_2024.rds"))
```

## 3.3 Identify the variables you need to apply the case definition criteria.

::: {.callout-tip title="Need a little bit of help?" collapse="true"}
The variables we need from the dataset to apply the above case definition are: `meal`, `onset_datetime`, `diarrhoea`, `bloody` and `vomiting`.
:::
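
Before applying the criteria, it can help to confirm that these variables exist and to check their type. The sketch below uses a small toy data frame (`toy_linelist`, with invented values) in place of the real line list; only the five column names come from this exercise.

```{r}
# Toy stand-in for the line list (invented values, real column names)
toy_linelist <- data.frame(
  meal           = c(TRUE, FALSE),
  onset_datetime = as.POSIXct(c("2024-10-05 22:00", NA), tz = "UTC"),
  diarrhoea      = c(TRUE, NA),
  bloody         = c(FALSE, NA),
  vomiting       = c(TRUE, FALSE)
)

needed_vars <- c("meal", "onset_datetime", "diarrhoea", "bloody", "vomiting")

# Any name returned here is missing from the data
missing_vars <- setdiff(needed_vars, names(toy_linelist))
missing_vars

# First class of each needed column (logical or date-time)
sapply(toy_linelist[needed_vars], function(x) class(x)[1])
```

Running the same two checks on `copdata` instead of `toy_linelist` should show whether the clean file contains everything the case definition needs.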

## 3.4 Create a new `case` column to hold the binary case definition variable. Let's think about how to do this little by little:

### a) Ate a meal at the school dinner

You decide to exclude from the cohort anyone who didn't eat at the dinner, because you specifically hypothesised a food item to be the vehicle of infection in this outbreak. Keep in your dataset only those who ate a meal.

::: {.callout-tip title="Need a little bit of help?" collapse="true"}
`filter` by those with `meal == TRUE`.
:::

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What are some of the implications this decision may lead to? (excluding any people from the cohort who didn't eat at the dinner)
:::

```{r}
copdata <- copdata %>% 
  filter(meal == TRUE)
```

::: {.callout-note title="Once you've thought about the above, have a look here" collapse="true"}
Seven of the respondents said they did not eat the meal, yet they provided answers to the questions about which food items they ate! This issue could have been minimised in two ways:

-   At the survey stage, the electronic questionnaire could be designed so that key questions cannot be skipped. This comes with both pros and cons. Allow fellows to discuss if time allows.

-   At the cleaning stage, explore your data further, realise this is the case, and recode the `meal` variable for these individuals as `TRUE`. This would be the way to go, but it is not what we did in our example, because we tried to keep it simple and because it is good to show that you may not always clean the data perfectly, and that this has consequences. You can highlight the importance of really exploring your data in depth.

By making the above decision, we may be missing both cases and non-cases, and thus modifying the final estimate of our measure of association. It is very important to know your data, explore it deeply and clean it as well as possible. Every data cleaning step may have a consequence, and we should be aware of it both when making data cleaning decisions and when interpreting the results.
:::
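
One way to spot this kind of inconsistency is to cross-check `meal` against the food item answers. This is a minimal sketch on toy data: the food columns `pasta` and `veal` are hypothetical placeholders, so substitute the food item columns of your own dataset.

```{r}
library(dplyr)

# Toy data: `pasta` and `veal` are hypothetical food item columns
toy_meal <- data.frame(
  id    = 1:4,
  meal  = c(TRUE, FALSE, FALSE, NA),
  pasta = c(TRUE, TRUE, FALSE, TRUE),
  veal  = c(FALSE, NA, FALSE, FALSE)
)

# dplyr:: prefixes avoid clashes with plyr, which is also loaded in this session
toy_meal <- toy_meal %>%
  dplyr::mutate(
    # TRUE if at least one food item was reported as eaten (NA counts as "not reported")
    any_food = dplyr::coalesce(pasta, FALSE) | dplyr::coalesce(veal, FALSE),
    # Flag records where meal is not TRUE although a food item was reported
    inconsistent = any_food & !dplyr::coalesce(meal, FALSE)
  )

# Records to review before deciding whether to recode `meal`
toy_meal %>% dplyr::filter(inconsistent)
```

In this toy example, persons 2 and 4 are flagged: they reported eating pasta, but `meal` is `FALSE` or missing.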

### b) Fell ill after the start of the meal

We define "fell ill" as any person having had diarrhoea with OR without blood, OR vomiting. To capture this information easily, you will create a new `gastrosymptoms` variable. This variable will indicate that the person had one OR more of the clinical symptoms in your definition ("or" in R is written as `|`).

Note that the concept of having eaten a meal is already covered by one of the steps above.

::: {.callout-tip title="Need a little bit of help?" collapse="true"}
Create a new `gastrosymptoms` column in your line list with `mutate()` and `case_when()`.
:::

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What are some of the implications this decision may lead to? (defining "fell ill" as any person who reported having diarrhoea with OR without blood, OR vomiting)
:::

::: {.callout-note title="If you've thought about the above, have a look here" collapse="true"}
Treating a single clinical symptom as enough to be a potential case may be too unspecific (low specificity). For example, a person who ate at the dinner party and developed diarrhoea for reasons other than food poisoning (say, they recently started on antibiotics known for unbalancing the intestinal flora and causing diarrhoea) could be misclassified as a potential case. (Note that we talk about a **potential case**, and not a **case**: we are not yet talking about cases per se, but this decision has implications when applying the case definition below.)

Moreover, those who did not report clinical symptoms will be defined as non-cases. Thus, we are assuming that these individuals did not develop symptoms because they didn't report them. The missing values could be due to, for example, them skipping the questions in the questionnaire. Some individuals may be reluctant to report symptoms, due to shame, fear of repercussion, or other reasons. It is important to think ahead, before the interview, about ways to minimise these situations. For example, through questionnaire design, you may prevent questions from being skipped; you could promote trust by using the right interviewers (in some cases this will be someone from the community, in others someone from specific NGOs, someone of a specific race or gender, etc.); or you could choose to carry out online questionnaires rather than in-person ones (or vice versa, depending on the situation).
:::

```{r}
copdata <- copdata %>% 
  mutate(gastrosymptoms = case_when(
    # Those who had diarrhoea...
    diarrhoea == TRUE |
      # or bloody diarrhoea...
    bloody == TRUE |
      # or vomiting, are marked as TRUE (fell ill after the meal)
    vomiting == TRUE ~ TRUE,
    # The rest are FALSE. This includes those who ate a meal but had no symptoms (did not fall ill after the meal)
    .default = FALSE)
    )
```
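
To see how missing answers flow through this logic, here is a small sketch on invented data (`toy_symptoms`). In R, `TRUE | NA` is `TRUE`, but `FALSE | NA` is `NA`, and `case_when()` sends any condition that is `NA` to `.default`. The intermediate column `any_symptom_raw` is added here only for illustration.

```{r}
library(dplyr)

# Toy data with missing answers (invented values)
toy_symptoms <- data.frame(
  diarrhoea = c(TRUE, FALSE, NA, NA),
  bloody    = c(NA, FALSE, NA, FALSE),
  vomiting  = c(FALSE, FALSE, NA, TRUE)
)

# dplyr:: prefixes avoid clashes with plyr, which is also loaded in this session
toy_symptoms <- toy_symptoms %>%
  dplyr::mutate(
    # Raw "or": stays NA when nothing is TRUE and at least one answer is missing
    any_symptom_raw = diarrhoea | bloody | vomiting,
    # case_when() treats an NA condition as "not met", so it falls to .default
    gastrosymptoms = dplyr::case_when(any_symptom_raw ~ TRUE,
                                      .default = FALSE)
  )

toy_symptoms
```

The third person answered none of the questions and still ends up with `gastrosymptoms = FALSE`, which is exactly the assumption discussed above.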

### c) Fell ill within the time period of interest

-   Hmmm... what is the time period of interest? You could start by calculating the incubation period, that is, the time between exposure (the meal) and onset of symptoms, and then look at the distribution of these time differences. In this outbreak, incubation periods are easy to calculate, because everyone was exposed at (roughly) the same time and on the same day (eating the meal at the school dinner party).
-   Dinner was served at roughly 18:00 for everyone. Create a new `meal_datetime` variable holding this information for all people in your dataset (i.e. 05 Oct 2024 at 18:00).

```{r}
# Start with copdata:
copdata <- copdata %>% 
  # Create new column for meal date and time:
  mutate(meal_datetime = lubridate::ymd_hm("2024-10-05 18:00"))
```

-   You can calculate the incubation period for each participant by subtracting `meal_datetime` from `onset_datetime`.
-   Then, take the `median()` of that column to calculate a median incubation period. This will help you start forming hypotheses about the type of pathogen we are dealing with until the lab results come back.

::: {.callout-tip title="Need a little bit of help?" collapse="true"}
Be aware of the type of variable `incubation` is (check it with `class()`), as well as of missing values (`NA`).
:::

::: {.callout-note title="Note" collapse="true"}
The **incubation period** of a disease is the time from an exposure resulting in infection until the onset of disease. Because infection usually occurs immediately after exposure, the incubation period is generally the duration of the period from the onset of infection to the onset of disease -- Rothman, Greenland, Lash (2017): Modern Epidemiology, 3rd edition
:::

```{r}

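copdata <- copdata %>% 
  # Ask for hours explicitly: plain subtraction lets R choose the units
  mutate(incubation = difftime(onset_datetime, meal_datetime, units = "hours"),
         incubation = as.numeric(incubation))

# Median incubation period in hours:
median(copdata$incubation, na.rm = TRUE)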

```
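
A note on units: subtracting two date-times returns a `difftime` object, and R chooses its units automatically from the size of the smallest gap. The sketch below, with invented onset times, shows why it is safer to request hours explicitly with `difftime(..., units = "hours")` before converting to a number.

```{r}
library(lubridate)

toy_meal_time <- ymd_hm("2024-10-05 18:00")

# Onsets 3 and 15 hours after the meal: R reports the gap in hours
gap_short <- ymd_hm(c("2024-10-05 21:00", "2024-10-06 09:00")) - toy_meal_time
class(gap_short)
units(gap_short)

# Onsets 2 and 3 days after the meal: R switches to days
gap_long <- ymd_hm(c("2024-10-07 18:00", "2024-10-08 18:00")) - toy_meal_time
units(gap_long)

# Requesting hours explicitly gives 48 for the 2-day gap, whatever R would pick
hours_explicit <- as.numeric(
  difftime(ymd_hm("2024-10-07 18:00"), toy_meal_time, units = "hours")
)
hours_explicit
```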

::: {.callout-caution title="Now we are messing up with your brain..." collapse="true"}
If you pay attention, there were two individuals who don't have a time recorded, only a date: the day of the dinner. These people probably got sick the same night they had the dinner, but the exact time was not recorded. The way we've managed the data inserts "00:00" where the time is missing. This means that the two people who got sick on the day of the dinner have been recorded as sick *before* the dinner (at midnight at the start of that day). These two people have a negative incubation period! This should not happen. If you feel the force is with you, you can fix the error (if you do, your results will be slightly different from your colleagues', but similar). We will continue with this "erroneous" data, for simplicity. But note that in real life, as soon as you realise there is a coding error like this, you would have to go back to your cleaning code and fix it! This may happen a couple of times during your outbreak investigation, and is totally normal. Cleaning data is not an easy job!
:::
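
If you want to look for such records yourself, a quick check is to list everyone with a negative incubation period. The sketch below uses invented values in `toy_incubation`. Setting the negative values to missing is only one possible way of handling them, and it is not what the rest of this case study does.

```{r}
library(dplyr)

# Toy incubation periods in hours (invented values)
toy_incubation <- data.frame(
  id         = 1:4,
  incubation = c(4, -18, 15, NA)
)

# Records whose onset was recorded before the meal
toy_incubation %>% dplyr::filter(incubation < 0)

# One possible (optional) handling: treat these incubation periods as unknown
toy_fixed <- toy_incubation %>%
  dplyr::mutate(incubation = dplyr::if_else(incubation < 0, NA_real_, incubation))

median(toy_fixed$incubation, na.rm = TRUE)
```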

We see that the median incubation time is `r median(as.numeric(copdata$incubation), na.rm = TRUE)` hours. This is useful information, as incubation periods tend to be relatively pathogen-specific.

Based on this, say you limit the maximum incubation period to 48 hours after the meal, as the data point to a fast-acting bacterial toxin or a virus. That is, dinner participants must have developed at least one symptom (`diarrhoea`, `bloody` or `vomiting`) with an onset (`onset_datetime`) within 48 hours after `meal_datetime` to become a `case`.

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What are some of the implications this decision may lead to? (implications of this case definition)
:::

::: {.callout-note title="If you've thought about the above, have a look here" collapse="true"}
All those not developing at least one symptom (diarrhoea with OR without blood, OR vomiting) within 48 hours after the dinner are considered non-cases. This could (depending on how you decide to analyse your data) include those who had no symptoms at all, those who have missing data on the `onset_datetime` variable, and/or those who had symptoms before eating the meal. *This is a reminder that you need to be both careful and aware of the implications of your data analysis decisions.* If a person had clinical symptoms before eating the meal, they are considered a non-case. However, it could be that a person had symptoms before the meal and still got infected by the pathogen when eating their meal (bad luck, we know...). According to this definition, we would be missing that case.
:::
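
The 48-hour window can also be expressed on the numeric incubation period (in hours). The following is a sketch on invented values, assuming an inclusive window from 0 to 48 hours; `dplyr::between()` returns `NA` when the incubation period is missing, so those records still need an explicit decision.

```{r}
library(dplyr)

# Toy incubation periods in hours (invented values)
toy_window <- data.frame(
  id         = 1:6,
  incubation = c(-18, 0, 12, 48, 60, NA)
)

# dplyr:: prefixes avoid clashes with plyr, which is also loaded in this session
toy_window <- toy_window %>%
  dplyr::mutate(in_window = dplyr::between(incubation, 0, 48))

toy_window
```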

### d) Finally, with this information you can create a new `case` column to hold the binary (`TRUE`/`FALSE`) case definition variable.

```{r}

copdata <- copdata %>% 
  mutate(case = case_when(
    # Those whose symptoms started within 48 h after the meal are cases (TRUE)
    gastrosymptoms == TRUE & 
      onset_datetime >= meal_datetime &
      onset_datetime <= (meal_datetime + days(2)) ~ TRUE,
    # Those whose symptoms started more than 48 h after the meal are non-cases (FALSE)
    gastrosymptoms == TRUE & 
      onset_datetime > (meal_datetime + days(2)) ~ FALSE,
    # The rest are considered non-cases, including those who had no symptoms at all, those with missing onset_datetime, and those whose symptoms started before the meal
    .default = FALSE)
  )
```
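
To check that the logic behaves as intended, you can run it on a handful of invented people covering each situation. This is a sketch only: `toy_cases`, `toy_meal_datetime` and the values are made up, and the redundant "more than 48 h" branch is left out because `.default` already covers it.

```{r}
library(dplyr)
library(lubridate)

toy_meal_datetime <- ymd_hm("2024-10-05 18:00")

# One invented person per situation
toy_cases <- data.frame(
  person         = c("in window", "too late", "before meal", "no onset", "no symptoms"),
  gastrosymptoms = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  onset_datetime = ymd_hm(c("2024-10-06 02:00", "2024-10-08 09:00",
                            "2024-10-05 00:00", NA, NA))
)

# dplyr:: prefixes avoid clashes with plyr, which is also loaded in this session
toy_cases <- toy_cases %>%
  dplyr::mutate(case = dplyr::case_when(
    gastrosymptoms == TRUE &
      onset_datetime >= toy_meal_datetime &
      onset_datetime <= (toy_meal_datetime + days(2)) ~ TRUE,
    # Everyone else, including missing onsets, is a non-case
    .default = FALSE
  ))

toy_cases
```

Only the first person is classified as a case; the person with symptoms but no recorded onset is silently counted as a non-case.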

Note that we may be introducing misclassification bias with the code above. The last branch of the `case_when()` means that a person who had clinical symptoms before eating the meal is considered a non-case. However, it could be that a person had symptoms before the meal and still got infected by the pathogen when eating their meal (bad luck, we know...).

Moreover, if you remember, there were a couple of people with a `dayonset` but no `starthour`. The code we used (`lubridate::ymd_h` with argument `truncated = 2`) converts dates with a missing `starthour` to date-times with the time set to `00:00` (midnight). This means that these two people don't fulfil the case definition criteria, because we recorded their symptoms as starting at 00:00 on the day of the dinner, before the meal time (18:00), and thus they did not "fall ill within the time period of interest".

The two situations above are a reminder that you need to be both careful and aware of the implications of your data analysis decisions.
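
The midnight behaviour is easy to reproduce. The sketch below uses `lubridate::ymd_hms()` with `truncated = 3` on invented strings, as an analogue of the cleaning code rather than a copy of it: whatever time component is missing is filled with zeros.

```{r}
library(lubridate)

# Invented onset strings with decreasing amounts of time information
toy_onsets <- c("2024-10-05 21:30", "2024-10-05 21", "2024-10-05")

# truncated = 3 allows up to three trailing components (hour, minute, second) to be missing
parsed <- ymd_hms(toy_onsets, truncated = 3)
parsed

# The date-only record is placed at midnight, i.e. before an 18:00 meal
hour(parsed)
parsed < ymd_hm("2024-10-05 18:00")
```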

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What do you think are the risks of mis-classifying cases as non-cases in your analysis?
:::

::: {.callout-note title="If you've thought about the above, have a look here" collapse="true"}
We will have bias either towards or away from the null, depending on the proportions of subjects misclassified.
:::
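
A small numeric illustration, using invented counts, shows how the direction of the bias depends on who is misclassified. Suppose the true data are 40 cases among 100 exposed and 10 cases among 100 unexposed. The helper `risk_ratio()` is defined here only for this sketch.

```{r}
# Risk ratio from the counts of a two-by-two table
risk_ratio <- function(cases_exp, n_exp, cases_unexp, n_unexp) {
  (cases_exp / n_exp) / (cases_unexp / n_unexp)
}

# True situation (invented counts)
rr_true <- risk_ratio(40, 100, 10, 100)

# 10 exposed cases wrongly counted as non-cases: estimate moves towards the null
rr_missed_exposed <- risk_ratio(30, 100, 10, 100)

# 5 unexposed cases wrongly counted as non-cases: estimate moves away from the null
rr_missed_unexposed <- risk_ratio(40, 100, 5, 100)

c(true = rr_true,
  missed_exposed = rr_missed_exposed,
  missed_unexposed = rr_missed_unexposed)
```

With these counts the true risk ratio is 4, and it becomes 3 or 8 depending on which group loses cases.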

```{r}
# Tabulate cases:
janitor::tabyl(dat = copdata, case)
```

Let's have a look at how many people ate a meal, had symptoms, and were considered as cases after applying our case definition:

```{r overview}
copdata %>% 
  summarise(atemeal = sum(meal == TRUE),
            hadsympt = sum(gastrosymptoms == TRUE),
            nb_cases = sum(case == TRUE)
            )
```

## 4. Export clean data

Finally, we can save the cleaned data set as a new file, `Spetses_clean2_YOURINITIALS`, in the data folder, before proceeding with the descriptive analysis.

```{r export_clean_data}

rio::export(x = copdata, 
            file = here::here("data", "Spetses_clean2_2024.rds"))


```
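
If you want to reassure yourself that an `.rds` export preserves your new column, you can try a round trip on a toy data frame written to a temporary file. This is an optional sketch: `toy_export` and the temporary path are invented for illustration and do not touch your data folder.

```{r}
# Toy data frame with a logical case column (invented values)
toy_export <- data.frame(id = 1:3, case = c(TRUE, FALSE, NA))

# Write to a temporary .rds file and read it back
toy_path <- tempfile(fileext = ".rds")
rio::export(x = toy_export, file = toy_path)
toy_reimported <- rio::import(toy_path)

# TRUE if values and column types survived the round trip
all.equal(toy_export, toy_reimported)
```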