Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

comments JOSS 241018 #58

Closed
3 tasks done
melodiemonod opened this issue Oct 21, 2024 · 14 comments · Fixed by #59
Closed
3 tasks done

comments JOSS 241018 #58

melodiemonod opened this issue Oct 21, 2024 · 14 comments · Fixed by #59
Assignees

Comments

@melodiemonod
Copy link
Collaborator

melodiemonod commented Oct 21, 2024

Address WeakCha's comments from 241018

openjournals/joss-reviews#7341 (comment)

  • Adding an example after you introduce the functionality in the paper
  • There are some typos in your paper.
  • show what helpers_introduction is
@melodiemonod
Copy link
Collaborator Author

@tcoroller , I think for action 3, we can just add the functions from helpers_introduction in blocks of the notebook?

@tcoroller
Copy link
Collaborator

@melodiemonod I kinda disagree, especially since the helpers are located in the same folder as in the notebooks. They are standalone function meant to keep the notebooks lean. There are no TorchSurv related function there but simply boilerplate ones.

Let me reply to the reviewers with those details and maybe highlighting the link to the folder in the text (maybe listing deps if needed) and if this is not suitable we can find another solution. What do you think?

@tcoroller
Copy link
Collaborator

@tcoroller
Copy link
Collaborator

Maybe the simplest solution:

# PyTorch boilerplate - see https://github.com/Novartis/torchsurv/blob/main/docs/notebooks/helpers_introduction.py
from helpers_introduction import Custom_dataset, plot_losses

@melodiemonod
Copy link
Collaborator Author

Sure, sounds good! Please do reply with to the reviewer with these proposed amendments and where to find the file and let's see if it's enough for them. Many thanks

@melodiemonod
Copy link
Collaborator Author

@tcoroller please review the example in paper.md

@melodiemonod
Copy link
Collaborator Author

@tcoroller
Copy link
Collaborator

@melodiemonod

Two big comments: I

  1. Your branch is behind main (like by a lot), so we can either update it or cherry pick the change and do a new PR from current main (its only the .md anyway)
  2. I like the text updates but I don't understand the example:

Minor questions here:

  • All your events are 1, is that a mistake or by design?
  • The fixed values are not explained (e.g., exponantial value, end_time, etc..)
  • What is the goal of the weights multiplication for the time?

I tinkered a lot with the example, but could not get results as good as yours. Mostly because you are testing on the trained datasets. Here's the same but now using a generator (sample as needed the distribution)..

import torch
from torch.utils.data import DataLoader, Dataset
from torchsurv.loss import cox
from torchsurv.metrics.cindex import ConcordanceIndex
from torchsurv.metrics.auc import Auc

torch.manual_seed(42)

# 1. Simulate batch data. Ensure that your data are formatted in the same way
n_samples = 1000  # number of observations
batch_size = 32  # batch size


# Define the generator function
def tte_generator(batch_size: int, n_features: int = 10):
    while True:
        x = torch.randn(batch_size, n_features)  # 10 features for each observation
        weights = torch.randn(n_features)  # weights associated with the features
        event = torch.randint(
            low=0, high=2, size=(batch_size,), dtype=torch.bool
        )  # event indicator
        time = torch.abs(800 + x @ weights + torch.randn(batch_size) * 25)  # event time
        yield x, event, time


# 2. Define the PyTorch dataset class
class TTE_dataset(Dataset):
    def __init__(self, generator: callable, batch_size: int):
        self.generator = generator(batch_size=batch_size)

    def __len__(self):
        return n_samples // batch_size  # Number of batches

    def __getitem__(self, index):
        return next(self.generator)


# 3. Define the backbone model on the log hazards.
class MyPyTorchCoxModel(torch.nn.Module):
    def __init__(self):
        super(MyPyTorchCoxModel, self).__init__()
        self.fc = torch.nn.Linear(10, 1, bias=False)  # Simple linear model

    def forward(self, x):
        return self.fc(x)


# 4. Instantiate the model, optimizer, dataset and dataloader
cox_model = MyPyTorchCoxModel()
optimizer = torch.optim.Adam(cox_model.parameters(), lr=0.01)
dataset = TTE_dataset(tte_generator, batch_size=batch_size)
dataloader = DataLoader(
    dataset, batch_size=1, shuffle=True
)  # Batch size of 1 because dataset yields batches

# 5. Training loop
for epoch in range(100):
    for i, batch in enumerate(dataloader):
        x, event, time = [t.squeeze() for t in batch]  # Squeeze extra dimension
        optimizer.zero_grad()
        log_hzs = cox_model(x)  # torch.Size([16, 1])
        loss = cox.neg_partial_log_likelihood(
            log_hzs, event, time, reduction="mean"
        )  # use mean log-likelihood
        loss.backward()
        optimizer.step()

# 6. Evaluate the model
# Generate a new batch of data to evaluate
new_batch = next(tte_generator(batch_size=n_samples))
x, event, time = [t.squeeze() for t in new_batch]
log_hzs = cox_model(x)  # Trained log hazards

# AUC at time point 800
auc = Auc()
print(
    "AUC:", auc(log_hzs, event, time, new_time=torch.tensor(800.0))
)  # tensor([0.5531])
print("AUC Confidence Interval:", auc.confidence_interval())  # tensor([0.5220, 0.5841])
print("AUC p-value:", auc.p_value(alternative="greater"))  # tensor([0.0004])

# C-index
cindex = ConcordanceIndex()
print("C-Index:", cindex(log_hzs, event, time))  # tensor(0.5465)
print(
    "C-Index Confidence Interval:", cindex.confidence_interval()
)  # tensor([0.4897, 0.6033])
print("C-Index p-value:", cindex.p_value(alternative="greater"))  # tensor(0.0544)

@melodiemonod
Copy link
Collaborator Author

Hello @tcoroller!

I just pushed a new update of the paper with the example (Here https://github.com/Novartis/torchsurv/blob/58-comments-joss-241018/paper/paper.md). And also added the example below for reference.

Your branch is behind main (like by a lot), so we can either update it or cherry pick the change and do a new PR from current main (its only the .md anyway)

I used the JOSS branch as the source for the review branch. So after we are happy with the updates we can merge 58 -> 45. If you want, we can then update 45 with the update from main, 45 <- main?

All your events are 1, is that a mistake or by design?

You removed from the updated example the true underlying generative mechanism for time-to-event data, i.e., each individual is associated with a time-to-event (event_time) and time-to-censoring (censoring_time), and only the first of the two (time) is observed. It is important to reproduce this mechanism so we can create a correlation between the covariates and the time to event, event_time not time. We cannot create a correlation with time directly as you did there (time = torch.abs(800 + x @ weights + torch.randn(batch_size) * 25)) because it would mean that the time of censoring (which time is partly composed of) would also correlated with the covariates.
--> I added back the true mechanism in the generator function. Also, the events were all 1 because I used a mean of time-to-censoring that was too large (I reduced it now).

What is the goal of the weights multiplication for the time?

The time-to-event (event_time) depends linearly on the features weighted by weights. You cannot sample from them every time in the generator function, they need to be fixed for all individuals. This is why you did not get good metrics.
--> I put the generation of weights outside of the generator function.

Updated example:

import torch
from torch.utils.data import DataLoader, Dataset
from torchsurv.loss import cox
from torchsurv.metrics.cindex import ConcordanceIndex
from torchsurv.metrics.auc import Auc

torch.manual_seed(42)


# 1. Simulate batch data.
# Ensure that your x, event and time are formatted in the same way
n_features = 10  # int, number of features per observation
time_end = torch.tensor(
    2000.0
)  # float, end of observational period after which all observations are censored
weights = (
    torch.randn(n_features) * 5
)  # float, weights associated with the features ~ normal(0, 5^2)


# Define the generator function
def tte_generator(batch_size: int):
    while True:
        x = torch.randn(batch_size, n_features)  # features

        mean_event_time, mean_censoring_time = 1000.0 + x @ weights, 1000.0

        event_time = (
            mean_event_time + torch.randn(batch_size) * 50
        )  # event time ~ normal(mean_event_time, 50^2)
        censoring_time = torch.distributions.Exponential(
            1 / mean_censoring_time
        ).sample(
            (batch_size,)
        )  # censoring time ~ Exponential(mean = mean_censoring_time)
        censoring_time = torch.minimum(
            censoring_time, time_end
        )  # truncate censoring time to time_end

        event = (event_time <= censoring_time).bool()  # event indicator
        time = torch.minimum(event_time, censoring_time)  # observed time

        yield x, event, time


# 2. Define the PyTorch dataset class
class TTE_dataset(Dataset):
    def __init__(self, generator: callable, batch_size: int):
        self.batch_size = batch_size
        self.generatated_data = generator(batch_size=batch_size)

    def __len__(self):
        return self.batch_size

    def __getitem__(self, index):
        return next(self.generatated_data)


# 3. Define the backbone model on the log hazards.
class MyPyTorchCoxModel(torch.nn.Module):
    def __init__(self):
        super(MyPyTorchCoxModel, self).__init__()
        self.fc = torch.nn.Linear(10, 1, bias=False)  # Simple linear model

    def forward(self, x):
        return self.fc(x)


# 4. Instantiate the model, optimizer, dataset and dataloader
cox_model = MyPyTorchCoxModel()
optimizer = torch.optim.Adam(cox_model.parameters(), lr=0.01)

batch_size = 64  # int, batch size
dataset = TTE_dataset(tte_generator, batch_size=batch_size)
dataloader = DataLoader(
    dataset, batch_size=1, shuffle=True
)  # Batch size of 1 because dataset yields batches


# 5. Training loop
for epoch in range(100):
    for i, batch in enumerate(dataloader):
        x, event, time = [t.squeeze() for t in batch]  # Squeeze extra dimension
        optimizer.zero_grad()
        log_hzs = cox_model(x)  # torch.Size([batch_size, 1])
        loss = cox.neg_partial_log_likelihood(log_hzs, event, time)
        loss.backward()
        optimizer.step()


# 6. Evaluate the model
n_samples_test = 1000  # int, number of observations in test set

data_test = next(tte_generator(batch_size=n_samples_test))
x, event, time = [t.squeeze() for t in data_test]  # test set
log_hzs = cox_model(x)  # log hazards evaluated on test set

# AUC at time point 1000
auc = Auc()
print(
    "AUC:", auc(log_hzs, event, time, new_time=torch.tensor(1000.0))
)  # tensor([0.5902])
print("AUC Confidence Interval:", auc.confidence_interval())  # tensor([0.5623, 0.6180])
print("AUC p-value:", auc.p_value(alternative="greater"))  # tensor([0.])

# C-index
cindex = ConcordanceIndex()
print("C-Index:", cindex(log_hzs, event, time))  # tensor(0.5774)
print(
    "C-Index Confidence Interval:", cindex.confidence_interval()
)  # tensor([0.5086, 0.6463])
print("C-Index p-value:", cindex.p_value(alternative="greater"))  # tensor(0.0138)

@tcoroller
Copy link
Collaborator

Nice, it ran with me - updating the output and linting:

import torch
from torch.utils.data import DataLoader, Dataset
from torchsurv.loss import cox
from torchsurv.metrics.cindex import ConcordanceIndex
from torchsurv.metrics.auc import Auc

torch.manual_seed(42)


# 1. Simulate batch data.
# Ensure that your x, event and time are formatted in the same way
n_features = 10  # int, number of features per observation
time_end = torch.tensor(
    2000.0
)  # float, end of observational period after which all observations are censored
weights = (
    torch.randn(n_features) * 5
)  # float, weights associated with the features ~ normal(0, 5^2)


# Define the generator function
def tte_generator(batch_size: int):
    while True:
        x = torch.randn(batch_size, n_features)  # features

        mean_event_time, mean_censoring_time = 1000.0 + x @ weights, 1000.0

        event_time = (
            mean_event_time + torch.randn(batch_size) * 50
        )  # event time ~ normal(mean_event_time, 50^2)
        censoring_time = torch.distributions.Exponential(
            1 / mean_censoring_time
        ).sample(
            (batch_size,)
        )  # censoring time ~ Exponential(mean = mean_censoring_time)
        censoring_time = torch.minimum(
            censoring_time, time_end
        )  # truncate censoring time to time_end

        event = (event_time <= censoring_time).bool()  # event indicator
        time = torch.minimum(event_time, censoring_time)  # observed time

        yield x, event, time


# 2. Define the PyTorch dataset class
class TTE_dataset(Dataset):
    def __init__(self, generator: callable, batch_size: int):
        self.batch_size = batch_size
        self.generatated_data = generator(batch_size=batch_size)

    def __len__(self):
        return self.batch_size

    def __getitem__(self, index):
        return next(self.generatated_data)


# 3. Define the backbone model on the log hazards.
class MyPyTorchCoxModel(torch.nn.Module):
    def __init__(self):
        super(MyPyTorchCoxModel, self).__init__()
        self.fc = torch.nn.Linear(10, 1, bias=False)  # Simple linear model

    def forward(self, x):
        return self.fc(x)


# 4. Instantiate the model, optimizer, dataset and dataloader
cox_model = MyPyTorchCoxModel()
optimizer = torch.optim.Adam(cox_model.parameters(), lr=0.01)

batch_size = 64  # int, batch size
dataset = TTE_dataset(tte_generator, batch_size=batch_size)
dataloader = DataLoader(
    dataset, batch_size=1, shuffle=True
)  # Batch size of 1 because dataset yields batches


# 5. Training loop
for epoch in range(100):
    for i, batch in enumerate(dataloader):
        x, event, time = [t.squeeze() for t in batch]  # Squeeze extra dimension
        optimizer.zero_grad()
        log_hzs = cox_model(x)  # torch.Size([batch_size, 1])
        loss = cox.neg_partial_log_likelihood(log_hzs, event, time)
        loss.backward()
        optimizer.step()


# 6. Evaluate the model
n_samples_test = 1000  # int, number of observations in test set

data_test = next(tte_generator(batch_size=n_samples_test))
x, event, time = [t.squeeze() for t in data_test]  # test set
log_hzs = cox_model(x)  # log hazards evaluated on test set

# AUC at time point 1000
auc = Auc()
print(
    "AUC:", auc(log_hzs, event, time, new_time=torch.tensor(1000.0))
)  # tensor([0.6278])
print("AUC Confidence Interval:", auc.confidence_interval())  # tensor([0.5998, 0.6558])
print("AUC p-value:", auc.p_value(alternative="greater"))  # tensor([0.])

# C-index
cindex = ConcordanceIndex()
print("C-Index:", cindex(log_hzs, event, time))  # tensor(0.5913)
print(
    "C-Index Confidence Interval:", cindex.confidence_interval()
)  # tensor([0.5222, 0.6604])
print("C-Index p-value:", cindex.p_value(alternative="greater"))  # tensor(0.0048)

@tcoroller
Copy link
Collaborator

I used the JOSS branch as the source for the review branch. So after we are happy with the updates we can merge 58 -> 45. If you want, we can then update 45 with the update from main, 45 <- main?

I wonder if we are better off just making a new PR from main and just carry over the .md. Lots of changes.

I can do this piece

@tcoroller tcoroller linked a pull request Oct 22, 2024 that will close this issue
@tcoroller
Copy link
Collaborator

@melodiemonod please check #59 to make sure I didn't forgot anything.

@melodiemonod
Copy link
Collaborator Author

@tcoroller looks all good to me! I just left one comment about a typo (f6d624c#r1811249877), do you think you could fix it before merging?

Also, for the JOSS submission, we had to specify the branch to do the review on, which for now was 45-XX. The bot is looking in this branch to generate the paper proof. We need to ask the editor how we can change this "branch to review" to main.

@tcoroller
Copy link
Collaborator

Let me close this comment and merge to main. I will also tell the reviewer that they should be looking at main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants