---
title: "CARPS Reproducibility Report"
output:
  html_document:
    toc: true
    toc_float: true
---
```{r}
articleID <- "8-7-2014_PS" # insert the article ID code here e.g., "10-3-2015_PS"
reportType <- 'final'
pilotNames <- "Mufan Luo, Sean Raymond Zion" # insert the pilot's name here e.g., "Tom Hardwicke". If there are multiple pilots enter both names in a character string e.g., "Tom Hardwicke, Bob Dylan"
copilotNames <- "Tom Hardwicke, Manuel Bohn" # insert the co-pilot's name here e.g., "Michael Frank". If there are multiple co-pilots enter both names in a character string e.g., "Tom Hardwicke, Bob Dylan"
pilotTTC <- 240 # insert the pilot's estimated time to complete (in minutes, fine to approximate) e.g., 120
copilotTTC <- 200 # insert the co-pilot's estimated time to complete (in minutes, fine to approximate) e.g., 120
pilotStartDate <- as.Date("01/11/17", format = "%m/%d/%y") # insert the pilot's start date in US format e.g., as.Date("01/25/18", format = "%m/%d/%y")
copilotStartDate <- as.Date("06/06/18", format = "%m/%d/%y") # insert the co-pilot's start date in US format e.g., as.Date("01/25/18", format = "%m/%d/%y")
completionDate <- as.Date("06/27/18", format = "%m/%d/%y") # copilot insert the date of final report completion (after any necessary rounds of author assistance) in US format e.g., as.Date("01/25/18", format = "%m/%d/%y")
```
-------
#### Methods summary:
This is a within-subjects experiment in which 24 kindergarten children learned in two classroom environments (decorated classroom vs. sparse classroom). Learning was assessed by comparing pre- and post-test performance.
------
#### Target outcomes:
> Effect of classroom type on learning
Pretest accuracy was statistically equivalent in the sparse classroom
condition (M = 22%) and the decorated-classroom
condition (M = 23%), paired-samples t(22) < 1, and
accuracy in both conditions was not different from
chance, both one-sample ts (22) < 1.3, ps > .21. The children’s
posttest scores were significantly higher than their
pretest scores in both experimental conditions, both
paired-samples ts(22) > 4.72, ps ≤ .0001 (Fig. 4). Therefore,
in both experimental conditions, the children successfully
learned from the instruction. However, their learning
scores were higher in the sparse-classroom condition
(M = 55%) than in the decorated-classroom condition
(M = 42%), paired-samples t(22) = 2.95, p = .007; this
effect was of medium size, Cohen’s d = 0.65.
Analysis of gain scores corroborated the results of the
analysis of the posttest scores. Gain scores were calculated
by subtracting each participant’s pretest score from
his or her posttest score. Pairwise comparisons indicated
that the children’s learning gains were higher in the
sparse-classroom condition (M = 33%, SD = 22) than in
the decorated-classroom condition (M = 18%, SD = 19),
paired-sample t(22) = 3.49, p = .002, Cohen’s d = 0.73.
------
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
# Step 1: Load packages
```{r}
library(tidyverse)
library(knitr) # for kable table formatting
library(haven) # import and export SPSS, Stata, and SAS files
library(readxl) # import Excel files
library(ReproReports) # custom report functions
library(ggplot2) # attached with tidyverse; loaded explicitly for clarity
library(lsr) # for cohensD
library(dplyr) # for data munging; attached with tidyverse
```
```{r}
# Prepare report object. This will be updated automatically by the reproCheck function each time values are compared.
reportObject <- data.frame(dummyRow = TRUE, reportedValue = NA, obtainedValue = NA, valueType = NA, percentageError = NA, comparisonOutcome = NA, eyeballCheck = NA)
```
# Step 2: Load data
```{r}
# sheet 2 holds the condition-level test scores; sheets 3 and 4 hold the
# lesson-level pretest and posttest scores used for the gain-score analysis
d <- read_excel("data/VisualEnv-Attn-Learning-Data.xlsx", 2)
d.gain.pre <- read_excel("data/VisualEnv-Attn-Learning-Data.xlsx", 3, skip = 2)
d.gain.post <- read_excel("data/VisualEnv-Attn-Learning-Data.xlsx", 4, skip = 2)
```
# Step 3: Tidy data
```{r}
d.tidy <- d %>%
  # exclude participant #15 as reported in the paper
  filter(ID != "cs15") %>%
  # keep only the columns needed for the target analyses
  select("ID",
         "Pre-Test-Sparse-Classroom",
         "Pre-Test-Decorated Classroom",
         "Post-Test-Sparse-Classroom",
         "Post-Test-Decorated Classroom") %>%
  gather(condition, score, -ID) %>%
  mutate(classroom = ifelse(grepl("Sparse", condition), "sparse", "decorated"),
         test = ifelse(grepl("Pre-Test", condition), "pre", "post"),
         score = as.numeric(score)) %>%
  arrange(ID) %>%
  select(-condition)

# helper used below: map lesson column names to short lesson labels
lesson_type <- function(x) {
  case_when(
    grepl("Plate", x) ~ "plate",
    grepl("Volcano", x) ~ "volcano",
    grepl("Bugs", x) ~ "bugs",
    grepl("Stone", x) ~ "stone",
    grepl("Solar", x) ~ "solar",
    grepl("Flight", x) ~ "flight",
    TRUE ~ "total"
  )
}

d.tidy.gain.pre <- d.gain.pre %>%
  select(-...11, -`#`) %>%
  gather(pre_test_type, score, -ID) %>%
  arrange(ID) %>%
  mutate(classroom = ifelse(grepl("Sparse", pre_test_type), "sparse", "decorated"),
         test = "pre",
         type = lesson_type(pre_test_type)) %>%
  filter(type != "total") %>%
  select(-pre_test_type)

d.tidy.gain.post <- d.gain.post %>%
  select(-...11, -`#`) %>%
  gather(pre_test_type, score, -ID) %>%
  arrange(ID) %>%
  mutate(classroom = ifelse(grepl("Sparse", pre_test_type), "sparse", "decorated"),
         test = "post",
         type = lesson_type(pre_test_type),
         # some children were absent for individual lessons; treat as missing
         score = ifelse(score == "Absent", NA, score),
         score = as.numeric(score)) %>%
  filter(type != "total") %>%
  select(-pre_test_type)

# combine lesson-level pre and post scores, drop lessons with missing data,
# and average the per-lesson gains into one gain score per child and condition
d.tidy.gain <- bind_rows(d.tidy.gain.pre, d.tidy.gain.post) %>%
  filter(ID != "cs15") %>%
  arrange(ID) %>%
  spread(test, score) %>%
  na.omit() %>%
  mutate(gain_by_type = post - pre) %>%
  group_by(ID, classroom) %>%
  summarise(gain = mean(gain_by_type))
```
There should be 23 participants in the final data file.
```{r}
d.tidy %>%
summarise(n = length(unique(ID))) %>%
kable()
```
# Step 4: Run analysis
## Descriptive statistics
Pretest:
> ... accuracy was statistically equivalent in the sparse classroom condition (M = 22%) and the decorated-classroom condition (M = 23%)...
Posttest:
> However, their learning scores were higher in the sparse-classroom condition (M = 55%) than in the decorated-classroom condition (M = 42%)...
Check values from data:
```{r}
descriptives <- d.tidy %>%
  group_by(classroom, test) %>%
  summarise(mean = mean(score))

descriptives %>%
  kable(digits = 2)
```
```{r}
reportObject <- reproCheck(reportedValue = "0.22", obtainedValue = descriptives %>% filter(classroom == "sparse", test == "pre") %>% pull(mean), valueType = "mean")
reportObject <- reproCheck(reportedValue = "0.55", obtainedValue = descriptives %>% filter(classroom == "sparse", test == "post") %>% pull(mean), valueType = "mean")
reportObject <- reproCheck(reportedValue = "0.23", obtainedValue = descriptives %>% filter(classroom == "decorated", test == "pre") %>% pull(mean), valueType = "mean")
reportObject <- reproCheck(reportedValue = "0.42", obtainedValue = descriptives %>% filter(classroom == "decorated", test == "post") %>% pull(mean), valueType = "mean")
```
Minor deviation for one reported value.
Gain scores:
> ...gains were higher in the sparse-classroom condition (M = 33%, SD = 22) than in the decorated-classroom condition (M = 18%, SD = 19).
Obtained values:
```{r}
gain.descriptives <- d.tidy.gain %>%
  group_by(classroom) %>%
  summarise(mean = mean(gain),
            sd = sd(gain))

gain.descriptives %>%
  kable(digits = 2)
```
```{r}
reportObject <- reproCheck(reportedValue = "0.33", obtainedValue = gain.descriptives %>% filter(classroom == "sparse") %>% pull(mean), valueType = "mean")
reportObject <- reproCheck(reportedValue = "0.22", obtainedValue = gain.descriptives %>% filter(classroom == "sparse") %>% pull(sd), valueType = "sd")
reportObject <- reproCheck(reportedValue = "0.18", obtainedValue = gain.descriptives %>% filter(classroom == "decorated") %>% pull(mean), valueType = "mean")
reportObject <- reproCheck(reportedValue = "0.19", obtainedValue = gain.descriptives %>% filter(classroom == "decorated") %>% pull(sd), valueType = "sd")
```
Minor deviations from reported values.
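The `reproCheck` calls above do this comparison automatically. For transparency, here is a minimal sketch of the percentage-error rule we assume underlies the `percentageError` column of `reportObject` (the `pct_error` helper is hypothetical; the MATCH/MINOR_ERROR cutoffs are internal to `ReproReports`):

```{r}
# hypothetical helper mirroring the percentageError column of reportObject;
# the classification cutoffs themselves live inside reproCheck
pct_error <- function(reported, obtained) {
  abs(obtained - reported) / abs(reported) * 100
}

pct_error(reported = 0.33,
          obtained = gain.descriptives %>%
            filter(classroom == "sparse") %>%
            pull(mean))
```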
## Inferential statistics
> Pretest accuracy was statistically equivalent in the sparse classroom condition (M = 22%) and the decorated-classroom
condition (M = 23%), paired-samples t(22) < 1...
```{r}
pre <- d.tidy %>%
  filter(test == "pre")

pre.comparison <- t.test(pre$score ~ pre$classroom, paired = TRUE, alternative = "two.sided")
```
The paper reports no exact value to compare against; the obtained t-value is smaller than 1, as reported.
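As a quick sanity check, independent of `reproCheck`, the eyeball comparison can be confirmed directly from the `htest` object returned by `t.test`:

```{r}
# the paired t-statistic should be smaller than 1 in absolute value
abs(unname(pre.comparison$statistic)) < 1
```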
```{r}
reportObject <- reproCheck(reportedValue = "<1", obtainedValue = pre.comparison$statistic, valueType = "t", eyeballCheck = TRUE)
```
Check df:
```{r}
reportObject <- reproCheck(reportedValue = "22", obtainedValue = pre.comparison$parameter[['df']], valueType = "df")
```
> ...and accuracy in both conditions was not different from chance, both one-sample ts (22) < 1.3, ps > .21.
```{r}
pre.chance.sparse <- t.test(pre$score[pre$classroom == "sparse"], mu = .25, alternative = "two.sided")
pre.chance.decorated <- t.test(pre$score[pre$classroom == "decorated"], mu = .25, alternative = "two.sided")
```
Both t-values are smaller than the reported 1.3 in absolute value; however, they are negative, since both means are numerically below chance (25%). The obtained p-values are larger than .21, as reported (the smallest is .22).
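To make the sign of those t-statistics concrete, the pretest means can be inspected relative to chance:

```{r}
# both pretest means sit numerically below the 25% chance level,
# which is why the one-sample t-statistics come out negative
pre %>%
  group_by(classroom) %>%
  summarise(mean_accuracy = mean(score),
            vs_chance = mean(score) - .25) %>%
  kable(digits = 3)
```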
```{r}
reportObject <- reproCheck(reportedValue = "22", obtainedValue = pre.chance.sparse$parameter[['df']], valueType = "df")
reportObject <- reproCheck(reportedValue = "<1.3", obtainedValue = pre.chance.sparse$statistic, valueType = "t",eyeballCheck = TRUE)
reportObject <- reproCheck(reportedValue = "> .21", obtainedValue = pre.chance.sparse$p.value, valueType = "p",eyeballCheck = TRUE)
reportObject <- reproCheck(reportedValue = "22", obtainedValue = pre.chance.decorated$parameter[['df']], valueType = "df")
reportObject <- reproCheck(reportedValue = "<1.3", obtainedValue = pre.chance.decorated$statistic, valueType = "t",eyeballCheck = TRUE)
reportObject <- reproCheck(reportedValue = "> .21", obtainedValue = pre.chance.decorated$p.value, valueType = "p",eyeballCheck = TRUE)
```
> The children’s posttest scores were significantly higher than their pretest scores in both experimental conditions, both paired-samples ts(22) > 4.72, ps ≤ .0001 (Fig. 4).
```{r}
pre.pos.comp.sparse <- t.test(d.tidy$score[d.tidy$classroom == "sparse"] ~ d.tidy$test[d.tidy$classroom == "sparse"], paired = TRUE, alternative = "two.sided")
pre.pos.comp.decorated <- t.test(d.tidy$score[d.tidy$classroom == "decorated"] ~ d.tidy$test[d.tidy$classroom == "decorated"], paired = TRUE, alternative = "two.sided")
```
The smaller of the two obtained t-values (decorated classroom) is 4.70 and therefore not larger than 4.72 as reported. The p-values are smaller than or equal to .0001, as reported.
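For reference, the two obtained t-statistics side by side:

```{r}
# the decorated-classroom t-statistic falls just short of the reported bound of 4.72
c(sparse = unname(pre.pos.comp.sparse$statistic),
  decorated = unname(pre.pos.comp.decorated$statistic))
```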
```{r}
reportObject <- reproCheck(reportedValue = "22", obtainedValue = pre.pos.comp.sparse$parameter[['df']], valueType = "df")
reportObject <- reproCheck(reportedValue = "> 4.72", obtainedValue = pre.pos.comp.sparse$statistic, valueType = "t",eyeballCheck = TRUE)
reportObject <- reproCheck(reportedValue = "≤ .0001", obtainedValue = pre.pos.comp.sparse$p.value, valueType = "p",eyeballCheck = TRUE)
reportObject <- reproCheck(reportedValue = "22", obtainedValue = pre.pos.comp.decorated$parameter[['df']], valueType = "df")
#reportObject <- reproCheck(reportedValue = "4.72", obtainedValue = pre.pos.comp.decorated$statistic, valueType = "t")
# only minor deviation from reported value.
reportObject <- reproCheck(reportedValue = "> 4.72", obtainedValue = pre.pos.comp.decorated$statistic, valueType = "t",eyeballCheck = TRUE)
reportObject <- reproCheck(reportedValue = "≤ .0001", obtainedValue = pre.pos.comp.decorated$p.value, valueType = "p",eyeballCheck = TRUE)
```
> However, their learning scores were higher in the sparse-classroom condition (M = 55%) than in the decorated-classroom condition (M = 42%), paired-samples t(22) = 2.95, p = .007; this effect was of medium size, Cohen’s d = 0.65.
```{r}
post <- d.tidy %>%
  filter(test == "post")

# use the two-vector form (sparse first) so the t-statistic carries the
# same sign as the comparison reported in the paper
post.comparison <- t.test(post$score[post$classroom == "sparse"],
                          post$score[post$classroom == "decorated"],
                          paired = TRUE, alternative = "two.sided")
post.comparison.d <- cohensD(post$score ~ post$classroom)
```
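A paired t-test is algebraically a one-sample t-test on the within-child difference scores. As a cross-check (a minimal sketch; it relies on the rows being aligned by ID, which `arrange(ID)` in Step 3 guarantees):

```{r}
# same test, formulated on the per-child (sparse - decorated) differences
diffs <- post$score[post$classroom == "sparse"] -
  post$score[post$classroom == "decorated"]
t.test(diffs, mu = 0)
```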
The t- and p-values match the reported values. The obtained effect size is slightly smaller than reported (0.63 vs. 0.65).
```{r}
reportObject <- reproCheck(reportedValue = "22", obtainedValue = post.comparison$parameter[['df']], valueType = "df")
reportObject <- reproCheck(reportedValue = "2.95", obtainedValue = post.comparison$statistic, valueType = "t")
reportObject <- reproCheck(reportedValue = ".007", obtainedValue = post.comparison$p.value, valueType = "p")
reportObject <- reproCheck(reportedValue = "0.65", obtainedValue = post.comparison.d, valueType = "d")
```
> Pairwise comparisons indicated that the children’s learning gains were higher in the sparse-classroom condition (M = 33%, SD = 22) than in the decorated-classroom condition (M = 18%, SD = 19), paired-sample t(22) = 3.49, p = .002, Cohen’s d = 0.73.
```{r}
gain.comparison <- t.test(d.tidy.gain$gain[d.tidy.gain$classroom == "sparse"],
                          d.tidy.gain$gain[d.tidy.gain$classroom == "decorated"],
                          paired = TRUE, alternative = "two.sided")
gain.comparison.d <- cohensD(d.tidy.gain$gain ~ d.tidy.gain$classroom)
```
The t-value and effect size are as reported.
```{r}
reportObject <- reproCheck(reportedValue = "22", obtainedValue = gain.comparison$parameter[['df']], valueType = "df")
reportObject <- reproCheck(reportedValue = "3.49", obtainedValue = gain.comparison$statistic, valueType = "t")
reportObject <- reproCheck(reportedValue = ".002", obtainedValue = gain.comparison$p.value, valueType = "p")
reportObject <- reproCheck(reportedValue = "0.73", obtainedValue = gain.comparison.d, valueType = "d")
```
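To make the effect-size computation transparent, here is a manual check of the pooled-SD Cohen's d (a sketch assuming `lsr::cohensD`'s default pooled method; with equal group sizes the pooled SD reduces to the root mean of the two variances, and `cohensD` reports the magnitude):

```{r}
# manual pooled-SD Cohen's d for the gain-score comparison
g_sparse <- d.tidy.gain$gain[d.tidy.gain$classroom == "sparse"]
g_decorated <- d.tidy.gain$gain[d.tidy.gain$classroom == "decorated"]
sd_pooled <- sqrt((var(g_sparse) + var(g_decorated)) / 2)
(mean(g_sparse) - mean(g_decorated)) / sd_pooled
```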
### Replicate Figure 4 in the paper
```{r}
library(ggthemes) # for theme_few()

# condition means and standard errors, expressed in percentage points
p1 <- d.tidy %>%
  mutate(classroom = relevel(as.factor(classroom), ref = "sparse"),
         test = relevel(as.factor(test), ref = "pre")) %>%
  group_by(classroom, test) %>%
  summarise(mean = mean(score) * 100,
            se = sd(score) / sqrt(n()) * 100)

ggplot(p1, aes(x = classroom, y = mean, fill = test)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  labs(y = "Percentage Correct (%)") +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                position = position_dodge(width = 0.9), width = .1) +
  ylim(0, 100) +
  geom_hline(yintercept = 25, lty = 2) + # dashed line marks chance (25%)
  theme_few()
```
# Step 5: Conclusion
Initially we encountered major errors in the analysis of gain scores: the obtained t-value and effect size were considerably smaller than the ones reported in the paper. We had originally calculated the gain scores by averaging each subject's scores in each test phase (pre and post) within each condition and then subtracting the pretest average from the posttest average. This reading was based on the following passage of the paper:
> Gain scores were calculated by subtracting each participant’s pretest score from his or her posttest score.
When we queried the authors about these discrepancies, they informed us that the gain scores had been calculated differently: pretest scores were subtracted from posttest scores for each specific lesson (e.g., plate tectonics), and these lesson-level difference scores were then averaged to yield a mean gain score per subject and condition. This also involved excluding some individual lessons for some participants due to absence. This way of calculating the gain scores was not obvious from the information in the paper, and the supplementary material did not include any additional information about gain scores. After correcting the way the gain scores were calculated, all major issues were resolved.
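A minimal sketch contrasting the two readings, both computable from the objects built in Step 3 (the remaining small differences stem from the lesson-level absences dropped by `na.omit()`):

```{r}
# (1) our original reading: condition-level posttest minus pretest
d.tidy %>%
  spread(test, score) %>%
  mutate(gain = post - pre) %>%
  group_by(classroom) %>%
  summarise(mean_gain = mean(gain)) %>%
  kable(digits = 2)

# (2) the authors' method: lesson-level gains averaged per child, then per condition
d.tidy.gain %>%
  group_by(classroom) %>%
  summarise(mean_gain = mean(gain)) %>%
  kable(digits = 2)
```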
```{r}
Author_Assistance <- TRUE # was author assistance provided? (if so, enter TRUE)
Insufficient_Information_Errors <- 0 # how many discrete insufficient information issues did you encounter?
# Assess the causal locus (discrete reproducibility issues) of any reproducibility errors. Note that there doesn't necessarily have to be a one-to-one correspondence between discrete reproducibility issues and reproducibility errors. For example, it could be that the original article neglects to mention that a Greenhouse-Geisser correction was applied to ANOVA outcomes. This might result in multiple reproducibility errors, but there is a single causal locus (discrete reproducibility issue).
locus_typo <- 0 # how many discrete issues did you encounter that related to typographical errors?
locus_specification <- 4 # how many discrete issues did you encounter that related to incomplete, incorrect, or unclear specification of the original analyses?
locus_analysis <- 0 # how many discrete issues did you encounter that related to errors in the authors' original analyses?
locus_data <- 0 # how many discrete issues did you encounter that related to errors in the data files shared by the authors?
locus_unidentified <- 0 # how many discrete issues were there for which you could not identify the cause
# How many of the above issues were resolved through author assistance?
locus_typo_resolved <- 0 # how many of the typographical issues were resolved through author assistance?
locus_specification_resolved <- 4 # how many of the specification issues were resolved through author assistance?
locus_analysis_resolved <- 0 # how many of the original-analysis issues were resolved through author assistance?
locus_data_resolved <- 0 # how many of the data-file issues were resolved through author assistance?
locus_unidentified_resolved <- 0 # how many of the unidentified issues were resolved through author assistance?
Affects_Conclusion <- FALSE # do any reproducibility issues encountered appear to affect the conclusions of the original article? This is a subjective judgement, but you should take into account multiple factors, such as the presence/absence of decision errors, the number of target outcomes that could not be reproduced, the type of outcomes that could or could not be reproduced, the difference in magnitude of effect sizes, and the predictions of the specific hypothesis under scrutiny.
```
```{r}
reportObject <- reportObject %>%
  filter(dummyRow == FALSE) %>% # remove the dummy row
  select(-dummyRow) %>% # remove the dummy-row designation column
  mutate(articleID = articleID) %>% # add the articleID
  select(articleID, everything()) # make articleID the first column

# decide on the final outcome
if (any(!(reportObject$comparisonOutcome %in% c("MATCH", "MINOR_ERROR"))) | Insufficient_Information_Errors > 0) {
  finalOutcome <- "Failure without author assistance"
  if (Author_Assistance == TRUE) {
    finalOutcome <- "Failure despite author assistance"
  }
} else {
  finalOutcome <- "Success without author assistance"
  if (Author_Assistance == TRUE) {
    finalOutcome <- "Success with author assistance"
  }
}
# collate report extra details
reportExtras <- data.frame(articleID, pilotNames, copilotNames, pilotTTC, copilotTTC, pilotStartDate, copilotStartDate, completionDate, Author_Assistance, finalOutcome, Insufficient_Information_Errors, locus_typo, locus_specification, locus_analysis, locus_data, locus_unidentified, locus_typo_resolved, locus_specification_resolved, locus_analysis_resolved, locus_data_resolved, locus_unidentified_resolved)
# save report objects
if(reportType == "pilot"){
write_csv(reportObject, "pilotReportDetailed.csv")
write_csv(reportExtras, "pilotReportExtras.csv")
}
if(reportType == "final"){
write_csv(reportObject, "finalReportDetailed.csv")
write_csv(reportExtras, "finalReportExtras.csv")
}
```
# Session information
```{r session_info, include=TRUE, echo=TRUE, results='markup'}
devtools::session_info()
```