
About "detection adjustment" in the line 339-360 of solver.py #14

Closed
wuhaixu2016 opened this issue Jul 19, 2022 · 7 comments

Comments

@wuhaixu2016
Member

Since some researchers are confused about the "detection adjustment", we provide some clarification here.

(1) Why use "detection adjustment"?

First, I strongly suggest researchers read the original paper (Xu et al., 2018), which gives a comprehensive explanation of this operation.

In our paper, we follow this convention for the following reasons:

  • Fair comparison: As stated in the Implementation details section of our paper, the adjustment is a widely used convention in time series anomaly detection. In particular, on the benchmarks used in our paper, the previous methods all apply the adjustment for evaluation (Shen et al., 2020). Thus, we also adopt the adjustment for model evaluation.
  • Real-world meaning: One abnormal event usually causes a contiguous segment of abnormal time points. The adjustment therefore corresponds to the "abnormal event detection" task, which evaluates a model's ability to detect abnormal events within the whole record. This is a very meaningful task for real-world applications: once an abnormal event has been detected, a worker can be sent to check that time segment for safety.

In summary, you can view the adjustment as an "evaluation protocol" that measures the capability of models in "abnormal event detection".
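For readers who prefer code to prose, here is a minimal sketch of the adjustment protocol described above (the standard "point-adjust" rule: if any point inside a ground-truth anomaly segment is flagged, the whole segment is counted as detected). The function and variable names are illustrative; the loop in lines 339-360 of solver.py is equivalent in effect but written differently.

```python
import numpy as np

def point_adjust(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """If any point inside a ground-truth anomaly segment is predicted
    anomalous, mark the entire segment as detected (evaluation only)."""
    pred = pred.copy().astype(bool)
    gt = gt.astype(bool)
    n = len(gt)
    i = 0
    while i < n:
        if gt[i]:
            # find the end of this contiguous anomaly segment
            j = i
            while j < n and gt[j]:
                j += 1
            if pred[i:j].any():      # at least one hit inside the segment
                pred[i:j] = True     # credit the whole event
            i = j
        else:
            i += 1
    return pred

# Example: one event spanning indices 2-4, detected at a single point.
pred = np.array([0, 0, 0, 1, 0, 0], dtype=bool)
gt   = np.array([0, 0, 1, 1, 1, 0], dtype=bool)
adjusted = point_adjust(pred, gt)
# adjusted -> [False, False, True, True, True, False]
```

Precision, recall, and F-score are then computed on the adjusted predictions, which is what makes the metric event-oriented rather than point-oriented.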

(2) We have provided a comprehensive and fair comparison in our paper.

  • All the baselines compared in our paper are also evaluated with this "adjustment". Note that this evaluation is widely used in previous papers for experiments on SMD, SWaT, and other benchmarks. Thus, the comparison is fair.
  • For a more comprehensive analysis, we also provide a benchmark on the UCR dataset (from the KDD Cup) in Appendix L. The anomalies in this dataset are mostly recorded at only a single time point, so it can provide some intuition about single-time-point anomaly detection.

If you still have questions about the adjustment, feel free to email me to discuss further (whx20@mails.tsinghua.edu.cn).

@liuchengzhi314159

Hello, a quick question: if I want to apply this work to another field, how can the detected anomalous values be exported and then filtered out of the original data?

@xiaobiao998

Does anyone know how to solve this problem when the test labels are unknown?

@wuhaixu2016
Member Author

Hi @xiaobiao998, this adjustment is only used for computing the metric. If you want to deploy the model, you can simply comment it out.
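As a sketch of what "commenting it out" amounts to: in deployment you would threshold the anomaly score and compute point-wise metrics directly, without the ground truth ever touching the predictions. The function and array names below are illustrative, not taken from the repository.

```python
import numpy as np

def point_wise_metrics(pred, gt):
    """Precision/recall/F1 computed directly on per-point predictions;
    the labels are used only for scoring, never to modify `pred`."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))       # anomalous points correctly flagged
    fp = int(np.sum(pred & ~gt))      # normal points wrongly flagged
    fn = int(np.sum(~pred & gt))      # anomalous points missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 4 points, 1 true positive, 1 false positive, 1 false negative.
# point_wise_metrics([1, 1, 0, 0], [1, 0, 1, 0]) -> (0.5, 0.5, 0.5)
```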

@xiaobiao998

xiaobiao998 commented Apr 18, 2024 via email

@Steven0706

The performance of this code relies entirely on an adjustment made after looking at the GT. In the real world there is no GT, so the true performance is what you get after commenting the adjustment out. If I were a reviewer, I would strongly demand that this adjustment be removed; it is an unrealistic step done purely to make the numbers look good. This cannot be called a fair comparison, because the adjustment may favor the authors' proposed method, and since it does not exist in the real world, it is meaningless.

@wuhaixu2016
Member Author

wuhaixu2016 commented Jun 28, 2024

The performance of this code relies entirely on an adjustment made after looking at the GT. In the real world there is no GT, so the true performance is what you get after commenting the adjustment out. If I were a reviewer, I would strongly demand that this adjustment be removed; it is an unrealistic step done purely to make the numbers look good. This cannot be called a fair comparison, because the adjustment may favor the authors' proposed method, and since it does not exist in the real world, it is meaningless.

Please read the explanation above carefully. For clarity, a summary:
(1) **Fair comparison:** Since the Xu et al., 2018 paper, most works have followed this adjustment, so we also use this technique.
(2) **Practical meaning:** Consider the following deployment scenario: our model flags an anomalous time point, and we can send a worker to inspect the time segment around it. No ground truth is needed, so real-world deployment is still feasible. With the adjustment, the metric can therefore be understood as an "abnormal-event-based" metric.
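The worker-dispatch scenario described above suggests an event-level reading that needs no modification of the predictions at all: count a ground-truth event as detected if any point inside it is flagged. A minimal sketch (the function name is hypothetical, not from the repository):

```python
import numpy as np

def event_recall(pred, gt):
    """Fraction of ground-truth anomaly events containing at least one
    flagged point; the labels are used only to delimit events for scoring."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    events = detected = 0
    i, n = 0, len(gt)
    while i < n:
        if gt[i]:
            # walk to the end of this contiguous anomaly segment
            j = i
            while j < n and gt[j]:
                j += 1
            events += 1
            if pred[i:j].any():   # one hit anywhere in the event counts
                detected += 1
            i = j
        else:
            i += 1
    return detected / events if events else 0.0

# Two events (indices 1-2 and index 5); only the first is hit.
# event_recall([0, 0, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1, 0]) -> 0.5
```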

@enazari

enazari commented Oct 23, 2024

Hi @wuhaixu2016 ,

Thank you for sharing your code and making it easy to use and reproduce the results.

I’d like to clarify my understanding of the code. From a theoretical perspective, the lines in question introduce information leakage from the ground-truth labels into the evaluation: the model's predictions (pred) are adjusted directly based on the ground-truth labels (gt). This means information from the actual labels is used to modify the predicted outcomes, resulting in evaluation metrics that may be overly optimistic. Adjusting pred using the ground truth during evaluation allows information that would not be available in a real deployment scenario to influence the results.

Practically speaking, since we wouldn’t have access to ground truth labels during deployment, I believe these lines should be omitted. After removing them, here are the results I obtained:

SMD dataset:
Accuracy: 0.9920, Precision: 0.8894, Recall: 0.9235, F-score: 0.9061 (With data leakage)
Accuracy: 0.9543, Precision: 0.1047, Recall: 0.0133, F-score: 0.0236 (After omitting the lines)

SMAP dataset:
Accuracy: 0.9901, Precision: 0.9361, Recall: 0.9900, F-score: 0.9623 (With data leakage)
Accuracy: 0.8648, Precision: 0.1275, Recall: 0.0097, F-score: 0.0180 (After omitting the lines)

PSM dataset:
Accuracy: 0.9854, Precision: 0.9729, Recall: 0.9745, F-score: 0.9737 (With data leakage)
Accuracy: 0.7181, Precision: 0.2854, Recall: 0.0110, F-score: 0.0212 (After omitting the lines)

If my understanding is incorrect, could you please clarify? Alternatively, if you have any suggestions for addressing this issue in a practical way, I would greatly appreciate your input.

Thank you!
