---
title: "CARPS Reproducibility Report"
output:
html_document:
toc: true
toc_float: true
---
```{r}
articleID <- "7-3-2015_PS" # insert the article ID code here e.g., "10-3-2015_PS"
reportType <- 'final'
pilotNames <- "Danielle Boles, Michael Ko" # insert the pilot's name here e.g., "Tom Hardwicke". If there are multiple pilots enter both names in a character string e.g., "Tom Hardwicke, Bob Dylan"
copilotNames <- "Ben Peloquin" # insert the co-pilot's name here e.g., "Michael Frank". If there are multiple co-pilots enter both names in a character string e.g., "Tom Hardwicke, Bob Dylan"
pilotTTC <- 150 # insert the pilot's estimated time to complete (in minutes, fine to approximate) e.g., 120
copilotTTC <- 200 # insert the co-pilot's estimated time to complete (in minutes, fine to approximate) e.g., 120
pilotStartDate <- as.Date("10/27/17", format = "%m/%d/%y") # insert the pilot's start date in US format e.g., as.Date("01/25/18", format = "%m/%d/%y")
copilotStartDate <- as.Date("06/13/18", format = "%m/%d/%y") # insert the co-pilot's start date in US format e.g., as.Date("01/25/18", format = "%m/%d/%y")
completionDate <- as.Date("04/20/19", format = "%m/%d/%y") # copilot insert the date of final report completion (after any necessary rounds of author assistance) in US format e.g., as.Date("01/25/18", format = "%m/%d/%y")
```
-------
#### Methods summary:
Participants (N = 21) completed a series of trials that required them either to switch between two tasks or to repeat the same task. One task was to choose the numerically larger of two values when they were surrounded by a green box; the other was to choose the value printed in the larger font when they were surrounded by a blue box. A subliminal cue followed by a mask was presented before each trial. Cues included "O" (nonpredictive cue), "M" (switch predictive cue), and "T" (repeat predictive cue). Reaction times and accuracy were measured.
------
#### Target outcomes:
> Performance on switch trials, relative to repeat trials, incurred a switch cost that was evident in longer RTs (836 vs. 689 ms) and lower accuracy rates (79% vs. 92%). If participants were able to learn the predictive value of the cue that preceded only switch trials and could instantiate relevant anticipatory control in response to it, the performance on switch trials preceded by this cue would be better than on switch trials preceded by the nonpredictive cue. This was indeed the case (mean RT-predictive cue: 819 ms; nonpredictive cue: 871 ms; mean difference = 52 ms, 95% confidence interval, or CI = [19.5, 84.4]), two-tailed paired t(20) = 3.34, p < .01. However, error rates did not differ across these two groups of switch trials (predictive cue: 78.9%; nonpredictive cue: 78.8%), p = .8.
------
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
## Step 1: Load packages
```{r}
library(tidyverse) # for data munging
library(knitr) # for kable table formatting
library(haven) # import and export 'SPSS', 'Stata' and 'SAS' files
library(readxl) # import Excel files
library(ReproReports) # custom report functions
library(broom) # tidy model output into data frames
```
```{r}
# Prepare report object. This will be updated automatically by the reproCheck function each time values are compared.
reportObject <- data.frame(dummyRow = TRUE, reportedValue = NA, obtainedValue = NA, valueType = NA, percentageError = NA, comparisonOutcome = NA, eyeballCheck = NA)
```
## Step 2: Load data
```{r warning=FALSE}
# Read in each participant's data (each is in a separate xls file) and combine them into one data frame.
# Each xls file contains 250 rows of trial data; the remaining rows hold the participants' own Excel calculations, which we don't want in the data.
files <- dir('data/Experiment 1')
data <- data.frame()
id <- 1
for (file in files){
  if(file != 'Codebook.xls'){
    temp_data <- read_xls(file.path('data/Experiment 1', file))
    temp_data$id <- id
    id <- id + 1
    temp_data <- temp_data[1:250, ]  # keep only the 250 trial rows
    data <- rbind(data, temp_data)
  }
}
```
```{r bp-load-codebook, echo=FALSE}
code_book_path <- 'data/Experiment 1/Codebook.xls'
d_codebook <- read_xls(code_book_path, col_names=FALSE) # Note that 2, 4, 8 aren't listed in codebook for Primes
```
Do we have data from 21 participants as expected?
```{r bp-check-num-participants, echo=FALSE}
assertthat::are_equal(length(unique(data$id)), 21)
```
## Step 3: Tidy data
Each row is an observation. The data is already in tidy format.
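As a quick structural check (a sketch we added, not part of the original analysis), we can glimpse the combined data frame to confirm that each row corresponds to a single trial:
```{r check-tidy-structure, echo=FALSE, eval=FALSE}
# Quick look at the combined data: one row per trial, one column per variable
glimpse(data)
```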
## Step 4: Run analysis
### Pre-processing
The codebook for Experiment 1 lists `O`, `T`, and `M` as the only primes used. However, we found that some participants had primes `2`, `4`, and `8` instead.
Based on how the other columns were named, we inferred that `2` is the nonpredictive cue, `4` is the repeat predictive cue, and `8` is the switch predictive cue.
We therefore proceed with the analysis under this assumption, recoding the primes accordingly.
```{r bp-check-prime-codes, echo=FALSE, eval=FALSE}
# Note: this chunk is not part of the evaluation, just for sanity checking.
data %>% select(Block_Number) %>% unique

# Prime coding
# ------------
# Participants with numbered primes 2, 4, and 8
number_prime_ids <- data %>%
  filter(Prime %in% c(2, 4, 8)) %>%
  select(id) %>%
  unique

# Participants with lettered primes O, M, and T
letter_prime_ids <- data %>%
  filter(Prime %in% c('O', 'M', 'T')) %>%
  select(id) %>%
  unique

# One person seems to have seen both???
intersect(number_prime_ids$id, letter_prime_ids$id)

# Occurrences of numbered primes...
data %>%
  filter(Prime %in% c(2, 4, 8)) %>%
  select(Event_Number) %>%
  group_by(Event_Number) %>%
  summarise(cnt = n()) %>%
  ggplot(aes(x = Event_Number, y = cnt)) +
  geom_bar(stat = 'identity')
```
```{r recode}
# Recode variables to make referencing easier
data$originalPrime <- data$Prime
data$originalTrialType <- data$TrialType
data$Prime <- recode(data$Prime, '2' = "O", '4' = "T", '8' = "M")
data$Prime <- recode(data$Prime, 'O' = "Nonpredictive Cue", 'M' = "Switch Predictive Cue", 'T' = "Repeat Predictive Cue")
data$TrialType <- recode(data$TrialType, '0' = "Neither", '1' = "Repeat Trials", '2' = "Switch Trials")
```
**Note**: `TrialType=0` is not listed in codebook.
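As a small check on this undocumented code (a sketch we added, not part of the original analysis), the chunk below counts trials per recoded `TrialType`, including the `Neither` category that corresponds to `TrialType=0`:
```{r check-trialtype-counts, echo=FALSE, eval=FALSE}
# Count trials in each recoded TrialType; "Neither" corresponds to the
# undocumented TrialType = 0 code
data %>%
  count(TrialType)
```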
### Descriptive statistics
We will first try to reproduce the median reaction times for switch trials and repeat trials.
> Performance on switch trials, relative to repeat trials, incurred a switch cost that was evident in longer RTs (836 vs. 689 ms)
We used the median because the authors instructed that statistical tests use median reaction times unless stated otherwise:
> unless stated otherwise, the statistical tests were performed on the more stable median values rather than mean values.
```{r median_RT}
med_RT <- data %>%
  group_by(TrialType) %>%
  summarise(median_RT = median(RT),
            mean_RT = mean(RT))
kable(med_RT[-1, ])  # drop the "Neither" row
```
```{r median_RT diff}
reportObject <- reproCheck(reportedValue = "836", obtained = filter(med_RT, TrialType == "Switch Trials")$median_RT, valueType = 'median')
reportObject <- reproCheck(reportedValue = "689", obtained = filter(med_RT, TrialType == "Repeat Trials")$median_RT, valueType = 'median')
```
Note that means would not have matched either.
Next we check whether including only participants with 'letter' primes reproduces the target values...
```{r bp-check-RTs-filtered-on-OMT, echo=FALSE, eval=FALSE}
data %>%
  filter(originalPrime %in% c("O", "M", "T")) %>%
  group_by(TrialType) %>%
  summarise(median_RT = median(RT),
            n = n())
```
This would not have matched either.
-----------------
Next we will try to reproduce the accuracy of switch trials and repeat trials.
> Performance on switch trials, relative to repeat trials, incurred a switch cost that was evident in [...] lower accuracy rates (79% vs. 92%)
```{r mean_Correct_Response}
mean_RespCorr <- data %>%
  group_by(TrialType) %>%
  summarise(accuracy = mean(RespCorr))
kable(mean_RespCorr[-1, ])  # drop the "Neither" row
```
```{r mean_acc diff}
reportObject <- reproCheck(reportedValue = "0.79", obtained = filter(mean_RespCorr, TrialType == "Switch Trials")$accuracy, valueType = 'mean')
reportObject <- reproCheck(reportedValue = "0.92", obtained = filter(mean_RespCorr, TrialType == "Repeat Trials")$accuracy, valueType = 'mean')
```
Minor errors: these values are extremely close, although the rounded mean accuracy for repeat trials is slightly different.
--------------------
Now we will analyze Predictive Switch Cues vs. Nonpredictive Switch Cues. Let's start with reaction time.
> This was indeed the case (mean RT-predictive cue: 819 ms; nonpredictive cue: 871 ms; ... )
Later the authors report a t test with 20 degrees of freedom, so we assume these mean values come from individual RT averages.
```{r mean Prime RT}
mean_Prime_RT_Ind <- data %>%
  filter(TrialType == "Switch Trials") %>%
  group_by(id, Prime) %>%
  summarise(meanRT = mean(RT),
            medianRT = median(RT))  # individual means
mean_Prime_RT <- mean_Prime_RT_Ind %>%
  group_by(Prime) %>%
  summarise(grandmeanRT = mean(meanRT),
            grandMedianRT = median(medianRT))  # grand means
kable(mean_Prime_RT)
```
There are minor errors here, reported below:
```{r mean Prime RT diff}
npc_mean <- mean_Prime_RT %>% filter(Prime == "Nonpredictive Cue") %>% pull(grandmeanRT)
reportObject <- reproCheck(reportedValue = "871", obtained = npc_mean, valueType = 'mean')
spc_mean <- mean_Prime_RT %>% filter(Prime == "Switch Predictive Cue") %>% pull(grandmeanRT)
reportObject <- reproCheck(reportedValue = "819", obtained = spc_mean, valueType = 'mean')
```
```{r bp-check-switch-times, echo=FALSE, eval=FALSE}
# Check coding of `Prime` type. No difference when we filter on lettered primes
# (e.g. `Prime = O, M, T`). We do see a difference filtering on `Prime = 2, 4, 8`...
# Filter with original prime (O, M, T)
data %>%
  mutate(prime_code_type = ifelse(originalPrime %in% c("O", "M", "T"), 'letters', 'numbers')) %>%
  filter(TrialType == "Switch Trials") %>%
  group_by(id, prime_code_type, originalPrime) %>%
  summarise(meanRT = mean(RT),
            medianRT = median(RT)) %>%
  group_by(prime_code_type, originalPrime) %>%
  summarise(grandmeanRT = mean(meanRT),
            grandMedianRT = median(medianRT))  # grand means
```
These would not have matched either.
---------------------------
Next we will try to reproduce the error rates for switch predictive cues vs. switch nonpredictive cues.
> However, error rates did not differ across these two groups of switch trials (predictive cue: 78.9%; nonpredictive cue: 78.8%)
```{r Prime Accuracy}
mean_Prime_RespCorr_Ind <- data %>%
  filter(TrialType == "Switch Trials") %>%
  group_by(id, Prime) %>%
  summarise(meanCorr = mean(RespCorr))  # individual means
mean_Prime_RespCorr <- mean_Prime_RespCorr_Ind %>%
  group_by(Prime) %>%
  summarise(grandmeanCorr = mean(meanCorr))  # grand means
kable(mean_Prime_RespCorr)
```
```{r bp-check-prime-accuracy, echo=FALSE, eval=FALSE}
df_accuracy <- data %>%
  filter(TrialType == "Switch Trials") %>%
  mutate(prime_coding = ifelse(originalPrime %in% c(2, 4, 8), 'numbers', 'letters')) %>%
  group_by(prime_coding, id, originalPrime) %>%
  summarise(meanCorr = mean(RespCorr))  # individual means
df_accuracy %>%
  group_by(originalPrime) %>%
  summarise(grandmeanCorr = mean(meanCorr))  # grand means
```
These numbers are fairly close to the reported numbers.
```{r Prime Accuracy Diff}
reportObject <- reproCheck(reportedValue = ".789", obtained = .7994, valueType = 'mean') # predictive cue
reportObject <- reproCheck(reportedValue = ".788", obtained = .7803, valueType = 'mean') # nonpredictive cue
```
### Inferential statistics
The first claim is that in switch trials, predictive cues lead to statistically significant faster reaction times than nonpredictive cues.
> ... the performance on switch trials preceded by this cue would be better than on switch trials preceded by the nonpredictive cue. This was indeed the case (mean RT-predictive cue: 819 ms; nonpredictive cue: 871 ms; mean difference = 52 ms, 95% confidence interval, or CI = [19.5, 84.4]), two-tailed paired t(20) = 3.34, p < .01.
```{r Prime RT test}
mean_Prime_RT_Ind <- mean_Prime_RT_Ind %>%
  select(id, Prime, meanRT) %>%
  spread(Prime, meanRT)  # spread so that the cues are easier to compare
test <- t.test(mean_Prime_RT_Ind[['Nonpredictive Cue']],
               mean_Prime_RT_Ind[['Switch Predictive Cue']], paired = TRUE)
kable(tidy(test))
```
```{r}
reportObject <- reproCheck(reportedValue = "<.01", obtained = test$p.value, valueType = 'p', eyeballCheck = FALSE)
reportObject <- reproCheck(reportedValue = "19.5", obtained = test$conf.int[1], valueType = 'ci')
reportObject <- reproCheck(reportedValue = "84.4", obtained = test$conf.int[2], valueType = 'ci')
reportObject <- reproCheck(reportedValue = "20", obtained = test$parameter, valueType = 'df')
reportObject <- reproCheck(reportedValue = "3.34", obtained = test$statistic, valueType = 't')
```
We do not find the same p value as the original paper. There is ambiguity in how to calculate this statistic, which we detail below.
[INSUFFICIENT INFORMATION ERROR]
**Check** - note that differences in reaction time could be due to differences in `Prime` coding. This just supports the notion that we need to better understand the coding differences before trying to reproduce all the analyses.
```{r bp-prime-rt-ttest}
# Differences in reaction time: t.test by prime coding
df_rt_ttest <- data %>%
  filter(TrialType == "Switch Trials") %>%
  group_by(id, originalPrime) %>%
  summarise(meanRT = mean(RT)) %>%
  spread(originalPrime, meanRT)
test1 <- t.test(df_rt_ttest$M, df_rt_ttest$O, paired = TRUE)
test2 <- t.test(df_rt_ttest$`2`, df_rt_ttest$`8`, paired = TRUE)
kable(tidy(test1))
kable(tidy(test2))
```
However, none of these ways of doing the analysis produces outcomes matching those reported.
-----------------
Next we will test the second claim.
> However, error rates did not differ across these two groups of switch trials (predictive cue: 78.9%; nonpredictive cue: 78.8%), p = .8.
```{r mean Prime accuracy test}
mean_Prime_RespCorr_Ind <- mean_Prime_RespCorr_Ind %>%
  spread(Prime, meanCorr)  # spread so that the cues are easier to compare
test <- t.test(mean_Prime_RespCorr_Ind[['Nonpredictive Cue']],
               mean_Prime_RespCorr_Ind[['Switch Predictive Cue']], paired = TRUE)
thisP <- test$p.value
kable(tidy(test))
```
```{r bp-check-prime-accuracy-ttest, echo=FALSE}
mean_Prime_RespCorr_Ind %>%
  ungroup() %>%
  summarise(nonPredMean = mean(`Nonpredictive Cue`),
            predMean = mean(`Switch Predictive Cue`))
```
Although still not significant, the p value is very different from what was reported.
```{r mean Prime accuracy test diff}
reportObject <- reproCheck(reportedValue = ".8", obtained = thisP, valueType = 'p')
```
## Step 5: Conclusion
In our initial attempts we were not able to reproduce several target outcomes. Generally, all the reaction-time statistics (whether we tried means or medians) differed from what was reported, although these were classified as 'minor numerical errors'. Ultimately, though, there were major errors for the inferential test statistics, which could stem from a problem earlier in the analysis pipeline.
There are a number of aspects of the original analysis and data files we are unclear about. We have e-mailed the original authors several times but have not received a response to our questions (we have received replies saying they will get back to us, but this has not happened more than 9 months after the last message). We have thus classified these issues as 4 'insufficient information errors'. The issues we are unclear about are as follows:
**Unclear labels for variables**
In the data file, there is a variable "CorrResp" (1 or 0) and another variable "RespCorr" (TRUE or FALSE). We used "RespCorr" because it was the only one of the two included in the codebook, where TRUE = accurate response and FALSE = error. We still don't know what "CorrResp" is or whether it was used in the original analyses.
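One way to probe this ambiguity (a sketch we did not rely on, assuming both columns survive into the combined data frame) would be to cross-tabulate the two variables and see whether they agree:
```{r check-corrresp-vs-respcorr, eval=FALSE}
# If CorrResp (1/0) simply duplicates RespCorr (TRUE/FALSE), all counts should
# land in agreeing cells; off-diagonal counts would suggest CorrResp encodes
# something else (e.g., the correct response key rather than accuracy).
data %>%
  count(RespCorr, CorrResp) %>%
  spread(CorrResp, n, fill = 0)
```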
**Unclear recoding of variables**
In the data, there is one Excel file per participant with their reaction times for all 250 trials. For some participants, the Prime was coded as the actual prime shown ("O", "T", or "M"). For other participants, the Prime was coded as "2", "4", and "8". We had to infer which number corresponded to each letter by looking at the variable names assigned to trial type and which cue followed ("stay_2", "stay_4", "swt_2", "swt_8").
We coded 2 = O, 4 = T, 8 = M, but we are still unsure whether this is consistent with how the authors coded the prime variable. We should note that some differences seem exaggerated when subsetting by these two coding schemes: data coded with `Prime in c(2, 4, 8)` appeared to show larger `RT` and `Acc` differences than data coded with `Prime in c("O", "T", "M")`.
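A rough consistency check on this inferred mapping (a sketch, using the recoded `Prime` and `TrialType` columns from the recode chunk above): if `4 = T` (repeat predictive) and `8 = M` (switch predictive) are correct, the repeat predictive cue should precede only repeat trials and the switch predictive cue only switch trials.
```{r check-prime-mapping, eval=FALSE}
# Cross-tabulate the recoded cue by the recoded trial type; each predictive cue
# should be confined to its corresponding trial type if the 2/4/8 -> O/T/M
# mapping is correct
data %>%
  count(Prime, TrialType) %>%
  spread(TrialType, n, fill = 0)
```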
**Ambiguity between using means or medians**
The article states that, unless otherwise noted, statistical tests were performed on median values rather than mean values. We followed the paper's protocol. However, we were not able to reproduce the following findings using either means or medians:
* "performance on switch trials, relative to repeat trials, incurred a switch cost that was evident in longer RTs (836 vs. 689 ms)"
* "mean RT - predictive cue: 819 ms; 95% confidence interval, or CI = [19.5, 84.4], two-tailed paired t(20) = 3.34, p < .01"
**Unclear whether descriptive statistics are means/medians of individual means/medians, or means/medians across all trials**
When we tried to reproduce the reaction-time medians, we realized that the reported value could have been obtained by calculating a value (mean or median) for each individual and then summarizing those values into a single value (mean or median), OR by calculating a single value (mean or median) across ALL trials. We tried several combinations of means or medians, across individuals or across trials, and still could not reproduce the descriptive reaction times.
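For reference, the four aggregation schemes we tried for the switch/repeat RTs can be sketched as follows (individual-level mean or median summarized across participants, versus a single statistic pooled over all trials); none of them reproduced the reported 836 ms and 689 ms values.
```{r rt-aggregation-schemes, eval=FALSE}
# Per-participant mean and median RT, then summarized across participants
data %>%
  group_by(TrialType, id) %>%
  summarise(mean_RT = mean(RT), median_RT = median(RT)) %>%
  group_by(TrialType) %>%
  summarise(mean_of_means = mean(mean_RT),
            median_of_medians = median(median_RT))
# Mean and median RT pooled across all trials
data %>%
  group_by(TrialType) %>%
  summarise(pooled_mean = mean(RT), pooled_median = median(RT))
```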
```{r}
Author_Assistance = FALSE # was author assistance provided? (if so, enter TRUE)
Insufficient_Information_Errors <- 4 # how many discrete insufficient information issues did you encounter?
# Assess the causal locus (discrete reproducibility issues) of any reproducibility errors. Note that there doesn't necessarily have to be a one-to-one correspondence between discrete reproducibility issues and reproducibility errors. For example, it could be that the original article neglects to mention that a Greenhouse-Geisser correction was applied to ANOVA outcomes. This might result in multiple reproducibility errors, but there is a single causal locus (discrete reproducibility issue).
locus_typo <- 0 # how many discrete issues did you encounter that related to typographical errors?
locus_specification <- 1 # how many discrete issues did you encounter that related to incomplete, incorrect, or unclear specification of the original analyses?
locus_analysis <- 0 # how many discrete issues did you encounter that related to errors in the authors' original analyses?
locus_data <- 0 # how many discrete issues did you encounter that related to errors in the data files shared by the authors?
locus_unidentified <- 3 # how many discrete issues were there for which you could not identify the cause
# How many of the above issues were resolved through author assistance?
locus_typo_resolved <- 0 # how many discrete issues did you encounter that related to typographical errors?
locus_specification_resolved <- 0 # how many discrete issues did you encounter that related to incomplete, incorrect, or unclear specification of the original analyses?
locus_analysis_resolved <- 0 # how many discrete issues did you encounter that related to errors in the authors' original analyses?
locus_data_resolved <- 0 # how many discrete issues did you encounter that related to errors in the data files shared by the authors?
locus_unidentified_resolved <- 0 # how many discrete issues were there for which you could not identify the cause
Affects_Conclusion <- TRUE # Do any reproducibility issues encountered appear to affect the conclusions made in the original article? This is a subjective judgement, but you should take into account multiple factors, such as the presence/absence of decision errors, the number of target outcomes that could not be reproduced, the type of outcomes that could or could not be reproduced, the difference in magnitude of effect sizes, and the predictions of the specific hypothesis under scrutiny.
```
```{r}
reportObject <- reportObject %>%
filter(dummyRow == FALSE) %>% # remove the dummy row
select(-dummyRow) %>% # remove dummy row designation
mutate(articleID = articleID) %>% # add the articleID
select(articleID, everything()) # make articleID first column
# decide on final outcome
if(any(!(reportObject$comparisonOutcome %in% c("MATCH", "MINOR_ERROR"))) | Insufficient_Information_Errors > 0){
finalOutcome <- "Failure without author assistance"
if(Author_Assistance == T){
finalOutcome <- "Failure despite author assistance"
}
}else{
finalOutcome <- "Success without author assistance"
if(Author_Assistance == T){
finalOutcome <- "Success with author assistance"
}
}
# collate report extra details
reportExtras <- data.frame(articleID, pilotNames, copilotNames, pilotTTC, copilotTTC, pilotStartDate, copilotStartDate, completionDate, Author_Assistance, finalOutcome, Insufficient_Information_Errors, locus_typo, locus_specification, locus_analysis, locus_data, locus_unidentified, locus_typo_resolved, locus_specification_resolved, locus_analysis_resolved, locus_data_resolved, locus_unidentified_resolved)
# save report objects
if(reportType == "pilot"){
write_csv(reportObject, "pilotReportDetailed.csv")
write_csv(reportExtras, "pilotReportExtras.csv")
}
if(reportType == "final"){
write_csv(reportObject, "finalReportDetailed.csv")
write_csv(reportExtras, "finalReportExtras.csv")
}
```
## Session information
```{r session_info, include=TRUE, echo=TRUE, results='markup'}
devtools::session_info()
```