Segmentation fault in jl_collect_backedges #45444

Closed
vchuravy opened this issue May 24, 2022 · 16 comments · Fixed by #46171 or #46375

@vchuravy
Member

signal (11): Segmentation fault
in expression starting at none:0
jl_collect_backedges at /buildworker/worker/package_linux64/build/src/dump.c:1233 [inlined]
ijl_save_incremental at /buildworker/worker/package_linux64/build/src/dump.c:2682
jl_write_compiler_output at /buildworker/worker/package_linux64/build/src/precompile.c:65
ijl_atexit_hook at /buildworker/worker/package_linux64/build/src/init.c:204
jl_repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:711
main at /buildworker/worker/package_linux64/build/cli/loader_exe.c:59

Seen in PkgEval in #45195 on both sides of the comparison.

https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/39a24eb_vs_5554676/PANDA.against.log

@Keno
Member

Keno commented May 24, 2022

I've seen this one also, but haven't been able to reproduce.

@Keno Keno added the "rr trace wanted" label May 24, 2022
@vtjnash
Member

vtjnash commented May 27, 2022

Maybe an ASAN trace instead? https://buildkite.com/julialang/julia-master/builds/12051#5db78853-4ee6-4fe1-81e2-cbd26ff71465

=================================================================
==15345==ERROR: AddressSanitizer: heap-use-after-free on address 0x61100b1ab018 at pc 0x7f813c550210 bp 0x7ffffda9b770 sp 0x7ffffda9b768
WRITE of size 8 at 0x61100b1ab018 thread T0
    #0 0x7f813c55020f in validate_new_code_instances /cache/build/default-amdci5-5/julialang/julia-master/src/dump.c:2447:83
    #1 0x7f813c53a174 in _jl_restore_incremental /cache/build/default-amdci5-5/julialang/julia-master/src/dump.c:3121:5
    #2 0x7f813c53a56c in ijl_restore_incremental /cache/build/default-amdci5-5/julialang/julia-master/src/dump.c:3168:12
    #3 0x7f812a0078ad in julia__include_from_serialized_25063 loading.jl:825
    #4 0x7f8129e94311 in julia__require_search_from_serialized_24531 loading.jl:964
    #5 0x7f8129e97560 in _require_search_from_serialized loading.jl:930
    #6 0x7f8129e97560 in julia__require_26923 loading.jl:1248
    #7 0x7f8129e9ad6d in julia__require_prelocked_30464 loading.jl:1144
    #8 0x7f8129e75592 in macro expansion loading.jl:1124
    #9 0x7f8129e75592 in macro expansion lock.jl:267
    #10 0x7f8129e75592 in julia_require_25637 loading.jl:1088
    #11 0x7f8129e75d4c in jfptr_require_25638 text

@vchuravy vchuravy mentioned this issue Jul 6, 2022
3 tasks
@vchuravy vchuravy modified the milestones: 1.9, 1.8 Jul 14, 2022
@timholy
Member

timholy commented Jul 14, 2022

Possibly a simple fix would be to wait to turn on GC

julia/src/dump.c, line 3184 in b3b229e:

jl_gc_enable(en); // subtyping can allocate a lot, not valid before recache-other

until after the call to validate_new_code_instances (just a couple lines down).

@maleadt
Member

maleadt commented Jul 18, 2022

Looking at latest daily PkgEval, there's a bunch of packages triggering this:

julia> filter(df) do row
       row.log isa String && contains(row.log, "jl_collect_backedges")
       end
13×9 DataFrame
 Row │ julia                    compiled  name                         uuid                               version    status  reason          duration  log
     │ String                   Bool      String                       String                             String     String  String          Float64   String?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ v"1.9.0-DEV-7261c65d52"     false  MRIgeneralizedBloch          UUID("9689932d-8765-44d0-985b-2d…  v"0.4.0"   :fail   :segfault        580.717  ################################…
   2 │ v"1.9.0-DEV-7261c65d52"     false  SBMLToolkit                  UUID("86080e66-c8ac-44c2-a1a0-9a…  v"0.1.15"  :fail   :segfault        740.79   ################################…
   3 │ v"1.9.0-DEV-7261c65d52"     false  SignalDecomposition          UUID("11a47235-7b84-4c7c-b885-fc…  v"1.0.4"   :fail   :segfault        594.799  ################################…
   4 │ v"1.9.0-DEV-7261c65d52"     false  HighDimPDE                   UUID("57c578d5-59d4-4db8-a490-a9…  v"1.2.0"   :fail   :segfault        369.78   ################################…
   5 │ v"1.9.0-DEV-7261c65d52"     false  EquationsSolver              UUID("5d795e5a-a85f-4cbe-a274-66…  v"0.2.0"   :fail   :segfault        225.24   ################################…
   6 │ v"1.9.0-DEV-7261c65d52"     false  DiffEqDevTools               UUID("f3b72e0c-5b89-59e1-b016-84…  v"2.30.0"  :fail   :segfault        514.938  ################################…
   7 │ v"1.9.0-DEV-85b895bb69"     false  MathepiaModels               UUID("2bd2a319-f9c6-417c-a4f9-7b…  v"0.1.1"   :fail   :segfault        587.651  ################################…
   8 │ v"1.9.0-DEV-85b895bb69"     false  Ai4EComponentLib             UUID("9901752a-13b3-47c7-8fb4-ce…  v"0.2.0"   :fail   :segfault        375.352  ################################…
   9 │ v"1.9.0-DEV-85b895bb69"     false  FranklinUtils                UUID("dcd8a645-c81d-482f-af4b-56…  v"0.3.4"   :fail   :segfault         51.091  ################################…
  10 │ v"1.9.0-DEV-85b895bb69"     false  ShipMMG                      UUID("37f2b0bf-0c13-4883-8808-e7…  v"0.0.5"   :fail   :segfault        636.421  ################################…
  11 │ v"1.9.0-DEV-85b895bb69"     false  MonteCarloMeasurements       UUID("0987c9cc-fe09-11e8-30f0-b9…  v"1.0.9"   :fail   :test_failures  1031.94   ################################…
  12 │ v"1.9.0-DEV-85b895bb69"     false  ControlSystemIdentification  UUID("3abffc1c-5106-53b7-b354-a4…  v"2.4.0"   :fail   :segfault        559.171  ################################…
  13 │ v"1.9.0-DEV-85b895bb69"     false  MultiScaleArrays             UUID("f9640e96-87f6-5992-9c3b-07…  v"1.9.1"   :fail   :segfault        371.606  ################################…

The lightest package here seems to be FranklinUtils, so I'll run it repeatedly to try to get an rr recording.

@timholy timholy self-assigned this Jul 18, 2022
@timholy
Member

timholy commented Jul 18, 2022

I'm assigning myself to fix this, but I'm happy to accept an rr trace. By tomorrow I should be able to dive into this.

@maleadt
Member

maleadt commented Jul 18, 2022

I couldn't reproduce this locally, so I hacked rr support in PkgEval: https://github.com/JuliaCI/PkgEval.jl/compare/tb/rr

A link to an rr trace will be appended to the end of the log for packages that abort, segfault, reach an unreachable, or report GC corruption. Example here: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/e1739aa/report.html (although I let it unconditionally submit the log here). I do expect some false positives from running under rr, so I'll revert this soon after getting a (hopefully useful) report here.

e1739aa#commitcomment-78790001=

EDIT: well, that was underwhelming. There's some work to be done in BugReporting.jl before this will work, so I might not have an rr trace before tomorrow. In any case, this would be a useful addition to PkgEval, so I'll have a look anyway.

EDIT2: A new attempt successfully recorded all critical package failures, but of course there wasn't a jl_collect_backedges segfault among them. I'll try once more when I get the rr recording functionality merged.

@timholy
Member

timholy commented Jul 22, 2022

In a debug build I hit #46064, the InteractiveUtils case:

Expr not allowed in value position
Internal error: encountered unexpected error in runtime:
ErrorException("")
error at ./error.jl:35
check_op at ./compiler/ssair/verify.jl:52
verify_ir at ./compiler/ssair/verify.jl:243
verify_ir at ./compiler/ssair/verify.jl:79 [inlined]
run_passes at ./compiler/optimize.jl:590
...

@timholy
Member

timholy commented Jul 23, 2022

I've not had any luck capturing this locally either, even in a debug build with many more assertions turned on. Given the line number I'm nevertheless about 60% hopeful that #46148 will fix it.

vtjnash added a commit that referenced this issue Jul 25, 2022
The edge-restore algorithm here is pretty bad now, but this should hopefully fix #45444
vtjnash added a commit that referenced this issue Jul 30, 2022
The edge-restore algorithm here is pretty bad now, but this should hopefully fix #45444
@maleadt
Member

maleadt commented Aug 3, 2022

I don't think this was fixed, as it occurred on a recent PkgEval run: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2022-08/03/BasicBSpline.primary.log (this was on eedf3f1). Sadly, due to a bug in PkgEval.jl we didn't upload the rr recording...

@maleadt maleadt reopened this Aug 3, 2022
@maleadt
Member

maleadt commented Aug 5, 2022

Finally caught this in rr: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/rr/UrlDownload-1659617146.tar.zst. If you want to replay this using BugReporting.jl, this brings you right to the crash: replay("https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/rr/UrlDownload-1659617146.tar.zst"; rr_replay_flags=`--onprocess=65 --goto=1661953`)

@vtjnash
Member

vtjnash commented Aug 5, 2022

That run failed to include #46171

@maleadt
Member

maleadt commented Aug 6, 2022

Ugh, I assumed it had been back-ported already. Let's assume it's still fixed then.

@maleadt maleadt closed this as completed Aug 6, 2022
KristofferC pushed a commit that referenced this issue Aug 6, 2022
The edge-restore algorithm here is pretty bad now, but this should hopefully fix #45444

(cherry picked from commit 6b51780)
@maleadt
Member

maleadt commented Aug 10, 2022

This still happens, now spotted on 686afd3 which definitely includes the fix. PkgEval log: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2022-08/09/report.html, relevant package log: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2022-08/09/ComoniconGUI.primary.log

[21] signal (11): Segmentation fault
in expression starting at none:0
jl_collect_backedges at /workspace/srcdir/src/dump.c:1307 [inlined]
ijl_save_incremental at /workspace/srcdir/src/dump.c:2756
jl_write_compiler_output at /workspace/srcdir/src/precompile.c:65
ijl_atexit_hook at /workspace/srcdir/src/init.c:220
jl_repl_entrypoint at /workspace/srcdir/src/jlapi.c:712
main at /workspace/srcdir/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)

rr recording: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/rr/ComoniconGUI-1660060597.tar.zst

To get to the segfault:

using BugReporting
replay("https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/rr/ComoniconGUI-1660060597.tar.zst"; rr_replay_flags=`--onprocess 21 --goto 194177`)

@maleadt maleadt reopened this Aug 10, 2022
@vtjnash vtjnash removed the "rr trace wanted" label Aug 10, 2022
ffucci pushed a commit to ffucci/julia that referenced this issue Aug 11, 2022
The edge-restore algorithm here is pretty bad now, but this should hopefully fix JuliaLang#45444
@maleadt
Member

maleadt commented Aug 13, 2022

Failure caught with a debug build:

using BugReporting
replay("https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/rr/DifferentiableStateSpaceModels-1660285239.tar.zst"; rr_replay_flags=`--onprocess 43 --goto 900201`)

pcjentsch pushed a commit to pcjentsch/julia that referenced this issue Aug 18, 2022
The edge-restore algorithm here is pretty bad now, but this should hopefully fix JuliaLang#45444