Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing ArrayBoundV2 callback type increased symbol complexity #89720

Closed
NagyDonat opened this issue Apr 23, 2024 · 6 comments
Closed

Changing ArrayBoundV2 callback type increased symbol complexity #89720

NagyDonat opened this issue Apr 23, 2024 · 6 comments
Labels
clang:static analyzer question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead!

Comments

@NagyDonat
Copy link
Contributor

While investigating #89045 @steakhal noticed that the complexity of symbols seen by the ArrayBoundV2 checker increased significantly when that checker switched to using check::PostStmt callbacks instead of the previously used check::Location.

[...] prior to this change, the maximum Sym->computeComplexity() within getTaintedSymbolsImpl was significantly lower [4-5] than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.

That slowdown was fixed by #89606 which restored the efficiency of getTaintedSymbolsImpl and ensured that it runs quickly even when the symbols are complex. However, it would be still good to investigate the cause of this complexity increase.

@NagyDonat
Copy link
Contributor Author

NagyDonat commented Apr 23, 2024

@steakhal By the way, did you happen to have ArrayBound (V1) enabled during the bisection?

I'm asking this because one very significant effect of my commit #72107 Switch to PostStmt callbacks in ArrayBoundV2 was that it "transferred" the analysis of many out-of-bounds situations from the V1 checker to ArrayBoundV2. (When they were both Location checkers, V1 activated first AFAIK because it appears earlier in Checkers.td; after the change the PostStmt callbacks of ArrayBoundV2 happen before the Location callback of V1.)

If V1 was enabled in the bisection, then perhaps the observed increase in the complexity is just caused by the fact that previously the V2 checker didn't see those complex symbols (because the V1 checker found and handled them first).

(I don't think that this is the most likely explanation of the situation, but it's worth to mention and check.)

@llvmbot
Copy link
Collaborator

llvmbot commented Apr 23, 2024

@llvm/issue-subscribers-clang-static-analyzer

Author: None (NagyDonat)

While investigating #89045 @steakhal noticed that the complexity of symbols seen by the ArrayBoundV2 checker increased significantly when that checker switched to using `check::PostStmt` callbacks instead of the previously used `check::Location`.

> [...] prior to this change, the maximum Sym->computeComplexity() within getTaintedSymbolsImpl was significantly lower [4-5] than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.

That slowdown was fixed by #89606 which restored the efficiency of getTaintedSymbolsImpl and ensured that it runs quickly even when the symbols are complex. However, it would be still good to investigate the cause of this complexity increase.

@steakhal
Copy link
Contributor

If V1 was enabled in the bisection, then perhaps the observed increase in the complexity is just caused by the fact that previously the V2 checker didn't see those complex symbols (because the V1 checker found and handled them first).

I believe we only use/used OOBv2.

@NagyDonat
Copy link
Contributor Author

Thanks, that's what I guessed.

@NagyDonat
Copy link
Contributor Author

I spent some time investigating this issue and I would summarize my research as "I still don't know the answer, but I realized that the question is not interesting".

The increased symbol complexity appeared after my old change #72107 "[analyzer] Switch to PostStmt callbacks in ArrayBoundV2" moved the activation of ArrayBoundV2 to a different point within the process of the path-sensitive analysis -- so the natural conclusion is that those more complex symbols were always there at that different step of the process. (If we are measuring river pollution and notice that the pollution increased when we started to take samples from a different location, the natural conclusion is that the new spot is simply more polluted.)

I have several rough conjectures that could lead to this difference between the Location and the PostStmt callback:

  • The main goal of [analyzer] Switch to PostStmt callbacks in ArrayBoundV2 #72107 was that "after it the checker will check array[index].field while the previous implementation ignored this situation (because here the ElementRegion is wrapped in a FieldRegion object)." Note that sheervideo.c from FFMPEG mixes pointer arithmetic and field access, so it's plausible that the more complex symbols were simply not seen by the Location callback.
  • There was an old abandoned PR D159106 by @steakhal that tried to handle "Turns out check::Location is not always called when storing a value to a location." I don't remember its exact details, but effects like that could imply that some symbols are skipped by Location callbacks, but appear in the PostStmt callback.
  • The PostStmt callback triggers directly after constructing the suitable ElementRegion as the return value of the subscript expression; while the Location callback triggers significantly later, when the ElementRegion is accessed for reading or writing. In the meantime the analyzer parses several other statements and could perform some simplification / cleanup step that reduces the complexity of symbols and/or eliminates the cases with the more complex symbols.

I wasn't able to understand exactly what happens between the PostStmt and the Location (the code is very complex, and I didn't set up a debugger), but I strongly suspect that the more complex symbols were always there, but the Location callback didn't see them.

As the investigation is difficult, the complex symbols don't cause any concrete problems (after #89606) and there is probably an innocent explanation, I don't think that it's worth to spend more time on this issue. @steakhal What do you think about this? Woudl you agree with closing this ticket?

@steakhal
Copy link
Contributor

steakhal commented May 2, 2024

As the investigation is difficult, the complex symbols don't cause any concrete problems (after #89606) and there is probably an innocent explanation, I don't think that it's worth to spend more time on this issue. @steakhal What do you think about this? Woudl you agree with closing this ticket?

Thanks for your time. I think what you say makes sense. I didn't see yet glaring drawbacks of having these "complicated symbols" other than sometimes their dump took 26 megabytes. Given that ATM I'm not focusing on this checker, I'm fine with closing this.
I'll reopen it if it turns out relevant and I have evidence backing this up.
FYI right now I'm focusing on bounding Z3 refutation times. I've seen cases when Z3 refutation took 90% or more of the total time of an analysis.

@steakhal steakhal closed this as not planned Won't fix, can't repro, duplicate, stale May 2, 2024
@EugeneZelenko EugeneZelenko added the question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead! label May 2, 2024
steakhal added a commit that referenced this issue Aug 21, 2024
As discussed in

https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10

Some `isTainted()` queries can blow up the analysis times, and
effectively halt the analysis under specific workloads.

We don't really have the time now to do a caching re-implementation of
`isTainted()`, so we need to workaround the case.

The workaround with the smallest blast radius was to limit what symbols
`isTainted()` does the query (by walking the SymExpr). So far, the
threshold 10 worked for us, but this value can be overridden using the
"max-tainted-symbol-complexity" config value.

This new option is "deprecated" from the getgo, as I expect this issue
to be fixed within the next few months and I don't want users to
override this value anyways. If they do, this message will let them know
that they are on their own, and the next release may break them (as we
no longer recognize this option if we drop it).

Mitigates #89720

CPP-5414
cjdb pushed a commit to cjdb/llvm-project that referenced this issue Aug 23, 2024
…105493)

As discussed in

https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10

Some `isTainted()` queries can blow up the analysis times, and
effectively halt the analysis under specific workloads.

We don't really have the time now to do a caching re-implementation of
`isTainted()`, so we need to workaround the case.

The workaround with the smallest blast radius was to limit what symbols
`isTainted()` does the query (by walking the SymExpr). So far, the
threshold 10 worked for us, but this value can be overridden using the
"max-tainted-symbol-complexity" config value.

This new option is "deprecated" from the getgo, as I expect this issue
to be fixed within the next few months and I don't want users to
override this value anyways. If they do, this message will let them know
that they are on their own, and the next release may break them (as we
no longer recognize this option if we drop it).

Mitigates llvm#89720

CPP-5414
tru pushed a commit to steakhal/llvm-project that referenced this issue Sep 1, 2024
…105493)

As discussed in

https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10

Some `isTainted()` queries can blow up the analysis times, and
effectively halt the analysis under specific workloads.

We don't really have the time now to do a caching re-implementation of
`isTainted()`, so we need to workaround the case.

The workaround with the smallest blast radius was to limit what symbols
`isTainted()` does the query (by walking the SymExpr). So far, the
threshold 10 worked for us, but this value can be overridden using the
"max-tainted-symbol-complexity" config value.

This new option is "deprecated" from the getgo, as I expect this issue
to be fixed within the next few months and I don't want users to
override this value anyways. If they do, this message will let them know
that they are on their own, and the next release may break them (as we
no longer recognize this option if we drop it).

Mitigates llvm#89720

CPP-5414

(cherry picked from commit 8486589)
tru pushed a commit to steakhal/llvm-project that referenced this issue Sep 1, 2024
…105493)

As discussed in

https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10

Some `isTainted()` queries can blow up the analysis times, and
effectively halt the analysis under specific workloads.

We don't really have the time now to do a caching re-implementation of
`isTainted()`, so we need to workaround the case.

The workaround with the smallest blast radius was to limit what symbols
`isTainted()` does the query (by walking the SymExpr). So far, the
threshold 10 worked for us, but this value can be overridden using the
"max-tainted-symbol-complexity" config value.

This new option is "deprecated" from the getgo, as I expect this issue
to be fixed within the next few months and I don't want users to
override this value anyways. If they do, this message will let them know
that they are on their own, and the next release may break them (as we
no longer recognize this option if we drop it).

Mitigates llvm#89720

CPP-5414

(cherry picked from commit 8486589)
dmpolukhin pushed a commit to dmpolukhin/llvm-project that referenced this issue Sep 2, 2024
…105493)

As discussed in

https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570/10

Some `isTainted()` queries can blow up the analysis times, and
effectively halt the analysis under specific workloads.

We don't really have the time now to do a caching re-implementation of
`isTainted()`, so we need to workaround the case.

The workaround with the smallest blast radius was to limit what symbols
`isTainted()` does the query (by walking the SymExpr). So far, the
threshold 10 worked for us, but this value can be overridden using the
"max-tainted-symbol-complexity" config value.

This new option is "deprecated" from the getgo, as I expect this issue
to be fixed within the next few months and I don't want users to
override this value anyways. If they do, this message will let them know
that they are on their own, and the next release may break them (as we
no longer recognize this option if we drop it).

Mitigates llvm#89720

CPP-5414
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead!
Projects
None yet
Development

No branches or pull requests

4 participants