Changing ArrayBoundV2 callback type increased symbol complexity #89720

NagyDonat · 2024-04-23T08:41:49Z

While investigating #89045 @steakhal noticed that the complexity of symbols seen by the ArrayBoundV2 checker increased significantly when that checker switched to using check::PostStmt callbacks instead of the previously used check::Location.

[...] prior to this change, the maximum Sym->computeComplexity() within getTaintedSymbolsImpl was significantly lower [4-5] than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.

That slowdown was fixed by #89606 which restored the efficiency of getTaintedSymbolsImpl and ensured that it runs quickly even when the symbols are complex. However, it would be still good to investigate the cause of this complexity increase.

The text was updated successfully, but these errors were encountered:

NagyDonat · 2024-04-23T09:05:11Z

@steakhal By the way, did you happen to have ArrayBound (V1) enabled during the bisection?

I'm asking this because one very significant effect of my commit #72107 Switch to PostStmt callbacks in ArrayBoundV2 was that it "transferred" the analysis of many out-of-bounds situations from the V1 checker to ArrayBoundV2. (When they were both Location checkers, V1 activated first AFAIK because it appears earlier in Checkers.td; after the change the PostStmt callbacks of ArrayBoundV2 happen before the Location callback of V1.)

If V1 was enabled in the bisection, then perhaps the observed increase in the complexity is just caused by the fact that previously the V2 checker didn't see those complex symbols (because the V1 checker found and handled them first).

(I don't think that this is the most likely explanation of the situation, but it's worth to mention and check.)

llvmbot · 2024-04-23T09:07:58Z

@llvm/issue-subscribers-clang-static-analyzer

Author: None (NagyDonat)

While investigating #89045 @steakhal noticed that the complexity of symbols seen by the ArrayBoundV2 checker increased significantly when that checker switched to using `check::PostStmt` callbacks instead of the previously used `check::Location`.

> [...] prior to this change, the maximum Sym->computeComplexity() within getTaintedSymbolsImpl was significantly lower [4-5] than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.

That slowdown was fixed by #89606 which restored the efficiency of getTaintedSymbolsImpl and ensured that it runs quickly even when the symbols are complex. However, it would be still good to investigate the cause of this complexity increase.

steakhal · 2024-04-23T09:09:07Z

If V1 was enabled in the bisection, then perhaps the observed increase in the complexity is just caused by the fact that previously the V2 checker didn't see those complex symbols (because the V1 checker found and handled them first).

I believe we only use/used OOBv2.

NagyDonat · 2024-04-23T09:09:34Z

Thanks, that's what I guessed.

NagyDonat · 2024-05-02T16:33:22Z

I spent some time investigating this issue and I would summarize my research as "I still don't know the answer, but I realized that the question is not interesting".

The increased symbol complexity appeared after my old change #72107 "[analyzer] Switch to PostStmt callbacks in ArrayBoundV2" moved the activation of ArrayBoundV2 to a different point within the process of the path-sensitive analysis -- so the natural conclusion is that those more complex symbols were always there at that different step of the process. (If we are measuring river pollution and notice that the pollution increased when we started to take samples from a different location, the natural conclusion is that the new spot is simply more polluted.)

I have several rough conjectures that could lead to this difference between the Location and the PostStmt callback:

The main goal of [analyzer] Switch to PostStmt callbacks in ArrayBoundV2 #72107 was that "after it the checker will check array[index].field while the previous implementation ignored this situation (because here the ElementRegion is wrapped in a FieldRegion object)." Note that sheervideo.c from FFMPEG mixes pointer arithmetic and field access, so it's plausible that the more complex symbols were simply not seen by the Location callback.
There was an old abandoned PR D159106 by @steakhal that tried to handle "Turns out check::Location is not always called when storing a value to a location." I don't remember its exact details, but effects like that could imply that some symbols are skipped by Location callbacks, but appear in the PostStmt callback.
The PostStmt callback triggers directly after constructing the suitable ElementRegion as the return value of the subscript expression; while the Location callback triggers significantly later, when the ElementRegion is accessed for reading or writing. In the meantime the analyzer parses several other statements and could perform some simplification / cleanup step that reduces the complexity of symbols and/or eliminates the cases with the more complex symbols.

I wasn't able to understand exactly what happens between the PostStmt and the Location (the code is very complex, and I didn't set up a debugger), but I strongly suspect that the more complex symbols were always there, but the Location callback didn't see them.

As the investigation is difficult, the complex symbols don't cause any concrete problems (after #89606) and there is probably an innocent explanation, I don't think that it's worth to spend more time on this issue. @steakhal What do you think about this? Woudl you agree with closing this ticket?

steakhal · 2024-05-02T16:46:06Z

As the investigation is difficult, the complex symbols don't cause any concrete problems (after #89606) and there is probably an innocent explanation, I don't think that it's worth to spend more time on this issue. @steakhal What do you think about this? Woudl you agree with closing this ticket?

Thanks for your time. I think what you say makes sense. I didn't see yet glaring drawbacks of having these "complicated symbols" other than sometimes their dump took 26 megabytes. Given that ATM I'm not focusing on this checker, I'm fine with closing this.
I'll reopen it if it turns out relevant and I have evidence backing this up.
FYI right now I'm focusing on bounding Z3 refutation times. I've seen cases when Z3 refutation took 90% or more of the total time of an analysis.

As discussed in https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10 Some `isTainted()` queries can blow up the analysis times, and effectively halt the analysis under specific workloads. We don't really have the time now to do a caching re-implementation of `isTainted()`, so we need to workaround the case. The workaround with the smallest blast radius was to limit what symbols `isTainted()` does the query (by walking the SymExpr). So far, the threshold 10 worked for us, but this value can be overridden using the "max-tainted-symbol-complexity" config value. This new option is "deprecated" from the getgo, as I expect this issue to be fixed within the next few months and I don't want users to override this value anyways. If they do, this message will let them know that they are on their own, and the next release may break them (as we no longer recognize this option if we drop it). Mitigates #89720 CPP-5414

…105493) As discussed in https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10 Some `isTainted()` queries can blow up the analysis times, and effectively halt the analysis under specific workloads. We don't really have the time now to do a caching re-implementation of `isTainted()`, so we need to workaround the case. The workaround with the smallest blast radius was to limit what symbols `isTainted()` does the query (by walking the SymExpr). So far, the threshold 10 worked for us, but this value can be overridden using the "max-tainted-symbol-complexity" config value. This new option is "deprecated" from the getgo, as I expect this issue to be fixed within the next few months and I don't want users to override this value anyways. If they do, this message will let them know that they are on their own, and the next release may break them (as we no longer recognize this option if we drop it). Mitigates llvm#89720 CPP-5414

…105493) As discussed in https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10 Some `isTainted()` queries can blow up the analysis times, and effectively halt the analysis under specific workloads. We don't really have the time now to do a caching re-implementation of `isTainted()`, so we need to workaround the case. The workaround with the smallest blast radius was to limit what symbols `isTainted()` does the query (by walking the SymExpr). So far, the threshold 10 worked for us, but this value can be overridden using the "max-tainted-symbol-complexity" config value. This new option is "deprecated" from the getgo, as I expect this issue to be fixed within the next few months and I don't want users to override this value anyways. If they do, this message will let them know that they are on their own, and the next release may break them (as we no longer recognize this option if we drop it). Mitigates llvm#89720 CPP-5414 (cherry picked from commit 8486589)

…105493) As discussed in https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10 Some `isTainted()` queries can blow up the analysis times, and effectively halt the analysis under specific workloads. We don't really have the time now to do a caching re-implementation of `isTainted()`, so we need to workaround the case. The workaround with the smallest blast radius was to limit what symbols `isTainted()` does the query (by walking the SymExpr). So far, the threshold 10 worked for us, but this value can be overridden using the "max-tainted-symbol-complexity" config value. This new option is "deprecated" from the getgo, as I expect this issue to be fixed within the next few months and I don't want users to override this value anyways. If they do, this message will let them know that they are on their own, and the next release may break them (as we no longer recognize this option if we drop it). Mitigates llvm#89720 CPP-5414

github-actions bot added the new issue label Apr 23, 2024

steakhal added clang:static analyzer and removed new issue labels Apr 23, 2024

steakhal closed this as not planned Won't fix, can't repro, duplicate, stale May 2, 2024

EugeneZelenko added the question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead! label May 2, 2024

steakhal mentioned this issue Aug 21, 2024

[analyzer] Limit isTainted() by skipping complicated symbols #105493

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changing ArrayBoundV2 callback type increased symbol complexity #89720

Changing ArrayBoundV2 callback type increased symbol complexity #89720

NagyDonat commented Apr 23, 2024

NagyDonat commented Apr 23, 2024 •

edited

Loading

llvmbot commented Apr 23, 2024

steakhal commented Apr 23, 2024

NagyDonat commented Apr 23, 2024

NagyDonat commented May 2, 2024

steakhal commented May 2, 2024

Changing ArrayBoundV2 callback type increased symbol complexity #89720

Changing ArrayBoundV2 callback type increased symbol complexity #89720

Comments

NagyDonat commented Apr 23, 2024

NagyDonat commented Apr 23, 2024 • edited Loading

llvmbot commented Apr 23, 2024

steakhal commented Apr 23, 2024

NagyDonat commented Apr 23, 2024

NagyDonat commented May 2, 2024

steakhal commented May 2, 2024

NagyDonat commented Apr 23, 2024 •

edited

Loading