fix dates
anders-kolstad committed Mar 22, 2024
1 parent 1ee84fa commit 31b8324
Showing 2 changed files with 137 additions and 27 deletions.
164 changes: 137 additions & 27 deletions ch_data_exploration.qmd
@@ -1,3 +1,9 @@
---
editor:
markdown:
wrap: sentence
---

# Data exploration and cleaning {#data-exp}

```{r setup}
@@ -18,16 +24,21 @@ The data are split into different tabs based on year.
dat <- read_excel("data/growthData.xlsx",
sheet = "y2017") |>
# rm some measurements from October. These were measured again in November
# along with the rest of the quadrats.
select(-DateFALL2,
-HeightFALL2,
-ObserverFALL2) |>
bind_rows(
y2018 <- read_excel("data/growthData.xlsx",
read_excel("data/growthData.xlsx",
sheet = "y2018") |>
mutate(DateFALL = as.Date(DateFALL, "%d.%m.%Y"))) |>
bind_rows(
read_excel("data/growthData.xlsx",
sheet = "y2019") |>
mutate(DateSPRING = as.Date(DateSPRING, "%d/%m/%Y"),
DateSUMMER = as.Date(DateSUMMER, "%d/%m/%Y"),
DateFALL = as.Date(DateFALL, "%d/%m/%Y"))
mutate(
DateSUMMER = as.Date(DateSUMMER, "%d/%m/%Y"),
DateFALL = as.Date(DateFALL, "%d/%m/%y"))
) |>
bind_rows(
read_excel("data/growthData.xlsx",
@@ -55,11 +66,11 @@ Here's what the data looks like after I just row bind them:
DT::datatable(dat)
```

</br>

I need to make this into a long format.
I need to make this into a long format.
There are multiple date and height columns that I want to combine.
I will split the spring and fall data (ignoring the summer data)
into separate sets, and then combine them again later.
I will split the spring and fall data (ignoring the summer data) into separate sets, and then combine them again later.

```{r}
#| code-summary: "Turn into long format"
@@ -141,50 +152,149 @@ dat_long <- dat_spring |>
```
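
As an alternative to splitting the spring and fall data and binding them again, the same reshaping can be sketched with a single call to `tidyr::pivot_longer()`.
This is only a minimal sketch on made-up toy data: the column names `DateSPRING`, `HeightSPRING`, `DateFALL` and `HeightFALL` are assumed from the date columns used when reading the data, and the real data set has more height columns than shown here.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values are invented for illustration only.
toy <- tibble(
  ID           = c("8.14", "16.6"),
  DateSPRING   = as.Date(c("2019-05-20", "2019-05-21")),
  HeightSPRING = c(12.5, 10.0),
  DateFALL     = as.Date(c("2019-10-02", "2019-10-03")),
  HeightFALL   = c(14.0, 11.5)
)

# The special ".value" name keeps Date and Height as separate columns,
# while the second capture group of the pattern becomes the season.
long <- toy |>
  pivot_longer(
    cols = matches("^(Date|Height)(SPRING|FALL)$"),
    names_to = c(".value", "season"),
    names_pattern = "^(Date|Height)(SPRING|FALL)$"
  ) |>
  mutate(season = tolower(season))
```

Each input row becomes one row per season, with `Date` and `Height` side by side.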

The long data is `r nrow(dat_long)` rows.
This is too much to display as an html table on this web site, but here is a
random sample of 100 rows just to illustrate.
This is too much to display as an html table on this web site, but here is a random sample of 100 rows just to illustrate.

```{r}
DT::datatable(dat_long[sample(1:nrow(dat_long), 100),])
```

I still need to figure out what to do about that second measurement on some
pins in the fall of 2017.
</br>

I still need to figure out what to do about that second measurement on some pins in the fall of 2017.

## Looking for data problems

### Dates

#### Seasons

Are the dates entered correctly to match the seasons?

```{r fig-datesCheck}
#| fig.cap: "Distribution of measurement dates (months)."
dat_long |>
ggplot() +
geom_bar(aes(x = month(date)))+
facet_grid(year(date)~season)
```

There are some measurements in 2020 that are wrong.
It turns out the month and day have been switched for plots 8 and 9:

```{r fig-dateCheck2}
#| fig.cap: "Checking inconsistency in date entries."
dat_long |>
mutate(year = year(date)) |>
filter(year == 2020,
season == "fall") |>
ggplot() +
geom_bar(aes(x = date),
color = "yellow",
fill = "orange")+
theme(axis.text.y = element_blank()) +
facet_grid(Plot_no~.)
```

```{r}
#| eval: false
#| code-summary: "Confirming that day and month have been switched."
dat_long |>
count(Treatment)
filter(Plot_no %in% c(8, 9),
season == "fall",
year(date) == 2020) |>
View()
```

How can Treatment be NA?
I will reverse these now.

```{r}
#| code-summary: "Fix date mistake"
dat_long <- dat_long |>
mutate(date = case_when(
Plot_no %in% c(8,9) & date == date("2020-06-10") ~ date("2020-10-06"),
.default = date
))
```
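
The fix above hard-codes the one affected date.
A more general check could look something like the sketch below, which flags fall measurements recorded outside the fall months and swaps day and month only when the swapped date lands in the fall.
The toy data and the assumption that fall means September to November are mine, not taken from the field protocol.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; column names mirror dat_long, values are invented.
toy <- tibble(
  Plot_no = c(7, 8, 9),
  season  = "fall",
  date    = as.Date(c("2020-10-06", "2020-06-10", "2020-06-10"))
)

# Assumption: fall measurements are taken in September to November.
fall_months <- 9:11

fixed <- toy |>
  mutate(
    # A fall measurement dated outside the fall months is suspicious.
    suspicious = season == "fall" & !month(date) %in% fall_months,
    # Candidate date with day and month swapped.
    swapped = make_date(year(date), month = day(date), day = month(date)),
    # Only accept the swap when it moves the date into the fall months.
    date = if_else(suspicious & month(swapped) %in% fall_months, swapped, date)
  ) |>
  select(-suspicious, -swapped)
```

Dates that are already in the expected months are left untouched, so the check can be run on all plots rather than just plots 8 and 9.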

```{r fig-dateCheck3}
#| fig.cap: "Checking measurement dates after fixing mistake."
dat_long |>
filter(is.na(Treatment))
mutate(year = year(date)) |>
filter(year == 2020,
season == "fall") |>
ggplot() +
geom_bar(aes(x = date),
color = "yellow",
fill = "orange")+
theme(axis.text.y = element_blank()) +
facet_grid(Plot_no~.)
```

These are all new pins, and all from the fall.
These row also dont have Pin_no and Plot_no.
I can assume that the Treatment of the new pins are the same as the original
pins.
#### Year

```{r fig-distYears}
#| fig.cap: "Distribution of data points over the years and seasons"
dat_long |>
ggplot() +
geom_bar(aes(
x = factor(year(date)),
fill = season)) +
labs(x = "Year")
```

I wonder why there are relatively few observations in 2020.

### Height variable

A closer look at the height variable.

Here's the time series for a single pin, measured from the west.

```{r}
temp <- dat_long |>
select(c(ID,
Plot_no,
Pin_no,
Treatment)) |>
distinct() |>
sepert
dat_long |>
filter(grepl("^8.14", ID),
pinPosition == "W2") |>
mutate(
year = year(date),
month = month(date),
day = day(date)) |>
arrange(year, month, day) |>
select(-Plot_no,
-Pin_no,
-Treatment,
-Species_W,
-Species_E,
-date) |>
datatable()
```

It appears the pin was replaced in the first fall in 2019.
In 2020 there is no data from the spring, and in the fall the ID is back to the original (the *new* part is removed).
Next spring (2021) the wire seems to have been replaced again.
In the spring of 2022 it was replaced a fourth(?) time, but the ID is again back to the original code.

How do we make sense of this?

### Treatment

A closer look at the Treatment variable.

```{r}
dat_long |>
filter(ID == "16.6") |>
View()
count(Treatment)
```

How can Treatment be NA?

```{r}
dat_long |>
filter(is.na(Treatment)) |>
datatable()
```

These are all new pins, and all from the fall.
These rows also don't have Pin_no and Plot_no.
Year is 2018 or 2019.
I can assume that the Treatment of the new pins is the same as that of the original pins (same Plot_no but without the *new* suffix).

I need to make a table with original IDs matched with the correct plot and pin number, and the correct treatment.
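
One possible way to build such a table is sketched below.
It assumes that replaced pins carry a literal `new` suffix in `ID` (for example `8.14new`); the exact format in the spreadsheet may differ, and the treatment labels in the toy data are invented.
The helper column `origID` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data; the ID suffix format and treatment labels are assumptions.
toy <- tibble(
  ID        = c("8.14", "8.14new", "16.6", "16.6new"),
  Plot_no   = c(8, NA, 16, NA),
  Pin_no    = c(14, NA, 6, NA),
  Treatment = c("control", NA, "exclosure", NA)
)

# Lookup table: one row per original ID with known plot, pin and treatment.
lookup <- toy |>
  filter(!is.na(Treatment)) |>
  distinct(ID, Plot_no, Pin_no, Treatment) |>
  rename(origID = ID)

# Strip the assumed suffix, then join plot, pin and treatment back on.
filled <- toy |>
  mutate(origID = sub("new$", "", ID)) |>
  select(ID, origID) |>
  left_join(lookup, by = "origID")
```

Any ID that still has a missing Treatment after the join would point to a pin whose original code is not in the lookup table and needs a manual check.
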
Binary file modified data/growthData.xlsx
Binary file not shown.
