Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some FFI functions incorrectly being unsafe. #19

Closed
wants to merge 1 commit into from

Conversation

nh2
Copy link

@nh2 nh2 commented Apr 19, 2022

Those functions can be reentrant or do IO.

This fixes a case I encountered on our CI machine where unsafe on PQisBusy() resulted in GHC RTS hangs when a ostgresql notice processor was set to call back into Haskell.

@@ -2373,6 +2373,19 @@ loUnlink connection oid
negError =<< c_lo_unlink c oid


-- Reminder on `unsafe` FFI:
--
-- - `unsafe` calls MUST NOT do any IO!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference to this claim?

Do you rather mean "must not block"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, I had "database IO" in mind. Will update to "blocking".

src/Database/PostgreSQL/LibPQ.hsc Outdated Show resolved Hide resolved
src/Database/PostgreSQL/LibPQ.hsc Show resolved Hide resolved
Those functions can call back or or do blocking IO,
which is not allowed for `unsafe` calls.

This fixes a case I encountered on our CI machine where `unsafe` on
`PQisBusy()` resulted in GHC RTS hangs when a postgresql notice processor
was set to call back into Haskell.
@nh2 nh2 force-pushed the fix-reentrant-unsafe-ffi branch from a119b79 to a16375f Compare April 20, 2022 08:36
@nh2 nh2 requested a review from phadej April 20, 2022 08:38
Copy link
Collaborator

@phadej phadej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate. If notice handler is not set (or is non-Haskell-calling function), most of these could remain unsafe, OTOH I do see a value of setting a handler too.

I also have uneasy feeling, as any of now-still-unsafe functions may start to notify in the future. Whether they call notifier is not in documentation, and would be hard to test.

How bad would be to mark all PGConn/PGResult related functions as safe?

@@ -2636,33 +2694,42 @@ foreign import ccall "libpq-fe.h PQunescapeBytea"
-> Ptr CSize
-> IO (Ptr Word8) -- Actually (IO (Ptr CUChar))

foreign import ccall unsafe "libpq-fe.h PQescapeIdentifier"
-- Must not be `unsafe` because `PQescapeIdentifier` may call `libpq_gettext()` that may do IO.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it actually block? Aren't translations loaded once into memory?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are loaded on demand:

https://github.com/postgres/postgres/blob/70f7da3e6ec865a120f3a3fed6173ce20f7d485a/src/interfaces/libpq/fe-misc.c#L1128-L1153

Here, bindtextdomain() sets the directory from which translation files are read, and dgettext() returns the translation string, so in between disk IO must be happening in practice.

Postgres puts an if (!already_bound) around it, but that is quite the implementation detail; to make PQescapeIdentifier() unsafe, we'd have to find some guarantee that it is imposible to call this function without some other safe function triggering the gettext code path first. This would be quite fragile.

@nh2
Copy link
Author

nh2 commented Apr 20, 2022

I also have uneasy feeling, as any of now-still-unsafe functions may start to notify in the future. Whether they call notifier is not in documentation, and would be hard to test.

Yes, implementation change is a risk. Even more likely than notifying that they may notify, they may call libpq_gettext().

I guess that struct accessor functions like PQuser are relatively certain to stay stuct accessor functions, but we don't have guarantees.

Personally I'd really like to have a flag to the GHC RTS that (with some overhead costs) tracks whether any unsafe call calls back into Haskell, and terminates the program. That flag could then be used in test suites.

How bad would be to mark all PGConn/PGResult related functions as safe?

safe calls are made on a separate OS thread so that they cannot block the runtime, and that scheduling is costly (compared to struct access or cached IO, but not to real IO like network or disk access).

On https://gitlab.haskell.org/ghc/ghc/-/issues/13296#note_132110 Simon Marlow estimates the safe overhead to 100 ns.

So the answer would depend on whether there are use cases that call one of these functions very quickly in a loop, e.g. to parse a multi-GB query by parts, or something like that. I have not yet investigated how likely that is to occur.

In any case, I propose that we first merge this fix to fix certain hangs, and then a separate investigation can be launched to see whether a general safer default would be sensible.

@nh2
Copy link
Author

nh2 commented Apr 20, 2022

For some googleability, and if others have similar problems, this is how I found the cause of the hang:

  • Compile my test suite with ghc-options: -debug to enable the debug RTS.

  • Run dist/build/tests/tests +RTS -N16 -v (-v is the flag that turns on event loggint, thus printing stuff like GC events).

  • This prints output like the following (only including the last lines):

    ``` 7fd3657fa000: cap 9: created thread 415 7fd3657fa000: cap 9: thread 415 has label weak finalizer thread 7fd3657fa000: cap 9: all caps stopped for GC 7fd3657fa000: cap 9: finished GC 7fd3867fc000: cap 0: running thread 229 (ThreadRunGHC) 7fd367fff000: cap 4: waking up thread 283 on cap 4 7fd367fff000: cap 4: waking up thread 283 on cap 4 7fd34affd000: cap 13: running thread 379 (ThreadRunGHC) 7fd3a77fe000: cap 8: running thread 315 (ThreadRunGHC) 7fd30bfff000: cap 5: running thread 80 (ThreadRunGHC) 7fd3227fc000: cap 7: running thread 321 (ThreadRunGHC) 7fd3bea04000: cap 2: running thread 397 (ThreadRunGHC) 7fd30bfff000: cap 5: thread 80 stopped (blocked on an MVar) 7fd366ffd000: cap 6: waking up thread 370 on cap 6 7fd366ffd000: cap 6: waking up thread 370 on cap 6 7fd2e5ffb000: cap 12: running thread 408 (ThreadRunGHC) 7fd2e5ffb000: cap 12: thread 408 stopped (suspended while making a foreign call) 7fd387fff000: cap 14: running thread 233 (ThreadRunGHC) 7fd387fff000: cap 14: thread 233 stopped (suspended while making a foreign call) 7fd387fff000: cap 14: running thread 233 (ThreadRunGHC) 7fd387fff000: cap 14: thread 233 stopped (suspended while making a foreign call) 7fd387fff000: cap 14: running thread 233 (ThreadRunGHC) 7fd387fff000: cap 14: thread 233 stopped (suspended while making a foreign call) 7fd387fff000: cap 14: running thread 233 (ThreadRunGHC) 7fd30bfff000: cap 5: running thread 7 (ThreadRunGHC) 7fd3a5ffb000: cap 11: waking up thread 387 on cap 11 7fd3a5ffb000: cap 11: waking up thread 387 on cap 11 7fd364ff9000: cap 10: running thread 333 (ThreadRunGHC) 7fd366ffd000: cap 6: running thread 370 (ThreadRunGHC) 7fd3657fa000: cap 9: running thread 235 (ThreadRunGHC) 7fd2e5ffb000: cap 12: running thread 408 (ThreadRunGHC) 7fd3657fa000: cap 9: thread 235 stopped (suspended while making a foreign call) 7fd3877fe000: cap 15: running thread 241 (ThreadRunGHC) 7fd3877fe000: cap 15: thread 241 stopped (suspended while making a foreign call) 7fd3a77fe000: cap 8: waking up thread 387 on cap 11 7fd3877fe000: cap 15: running thread 241 (ThreadRunGHC) 7fd3877fe000: cap 15: thread 241 stopped (suspended while making a foreign call) 7fd30bfff000: cap 5: thread 7 stopped (yielding) 7fd30bfff000: cap 5: running thread 7 (ThreadRunGHC) 7fd30bfff000: cap 5: thread 7 stopped (suspended while making a foreign call) 7fd2e5ffb000: cap 12: thread 408 stopped (suspended while making a foreign call) 7fd2e5ffb000: cap 12: running thread 408 (ThreadRunGHC) 7fd2e5ffb000: cap 12: thread 408 stopped (suspended while making a foreign call) 7fd2e5ffb000: cap 12: running thread 408 (ThreadRunGHC) 7fd3227fc000: cap 7: thread 321 stopped (heap overflow) 7fd3227fc000: cap 7: requesting sequential GC 7fd3867fc000: cap 0: thread 229 stopped (heap overflow) 7fd3be203000: cap 3: running thread 319 (ThreadRunGHC) 7fd3be203000: cap 3: thread 319 stopped (yielding) 7fd2e77fe000: cap 9: running thread 413 (ThreadRunGHC) 7fd2e77fe000: cap 9: thread 413 stopped (finished) 7fd3657fa000: cap 9: running thread 235 (ThreadRunGHC) 7fd3657fa000: cap 9: thread 235 stopped (blocked on an MVar) 7fd3657fa000: cap 9: thread 414 migrating to cap 5 7fd3a6ffd000: cap 9: running thread 11 (ThreadRunGHC) 7fd3a6ffd000: cap 9: waking up thread 235 on cap 9 7fd3a6ffd000: cap 9: thread 11 stopped (yielding) 7fd3237fe000: cap 1: running thread 3 (ThreadRunGHC) 7fd3237fe000: cap 1: thread 3 stopped (yielding) 7fd3877fe000: cap 15: running thread 241 (ThreadRunGHC) 7fd3877fe000: cap 15: thread 241 stopped (suspended while making a foreign call) 7fd3877fe000: cap 15: running thread 241 (ThreadRunGHC) 7fd364ff9000: cap 10: thread 333 stopped (yielding) 7fd3877fe000: cap 15: thread 241 stopped (yielding) 7fd367fff000: cap 4: running thread 283 (ThreadRunGHC) 7fd3a5ffb000: cap 11: running thread 387 (ThreadRunGHC) 7fd34affd000: cap 13: thread 379 stopped (suspended while making a foreign call) 7fd387fff000: cap 14: thread 233 stopped (yielding) 7fd34affd000: cap 13: running thread 379 (ThreadRunGHC) 7fd34affd000: cap 14: created thread 416 7fd3a77fe000: cap 8: waking up thread 283 on cap 4 7fd3a77fe000: cap 8: waking up thread 370 on cap 6 7fd2e5ffb000: cap 12: thread 408 stopped (yielding) 7fd3a5ffb000: cap 11: thread 387 stopped (suspended while making a foreign call) 7fd3bea04000: cap 2: thread 397 stopped (yielding) 7fd366ffd000: cap 6: thread 370 stopped (yielding) 7fd3a5ffb000: cap 11: running thread 387 (ThreadRunGHC) 7fd3a77fe000: cap 8: thread 315 stopped (yielding) 7fd366ffd000: cap 6: waking up thread 370 on cap 6 7fd367fff000: cap 4: thread 283 stopped (yielding) 7fd367fff000: cap 4: waking up thread 283 on cap 4 7fd3a5ffb000: cap 11: thread 387 stopped (blocked on an MVar) 7fd3a5ffb000: cap 11: waking up thread 387 on cap 11 7fd321ffb000: cap 4: running thread 136 (ThreadRunGHC) 7fd2e6ffd000: cap 11: running thread 13 (ThreadRunGHC) 7fd2e6ffd000: cap 11: waking up thread 387 on cap 11 7fd2e6ffd000: cap 11: thread 13 stopped (yielding) 7fd321ffb000: cap 4: thread 136 stopped (yielding)

    -- hangs here

    </details> 
    
  • Filtering and re-ordering the output so that for each capability I see what the last message for it was:

    cap  1: thread 3 stopped (yielding)
    cap  2: thread 397 stopped (yielding)
    cap  3: thread 319 stopped (yielding)
    cap  4: thread 136 stopped (yielding)
    cap  5: thread 7 stopped (suspended while making a foreign call)
    cap  6: waking up thread 370 on cap 6
    cap  7: requesting sequential GC
    cap  8: thread 315 stopped (yielding)
    cap  9: thread 11 stopped (yielding)
    cap 10: thread 333 stopped (yielding)
    cap 11: thread 13 stopped (yielding)
    cap 12: thread 408 stopped (yielding)
    cap 13: running thread 379 (ThreadRunGHC)
    cap 14: created thread 416
    cap 15: thread 241 stopped (yielding)
    

    This suggested that foreign calls are involved.

  • Attaching to the hanging process with gdb -p $PID, and running thread apply all backtrace to print all threads' backtraces, I found one with this trace:

    Thread 32 (Thread 0x7fd34affd000 (LWP 1714036) "benaco-tests:w"):
    #0  0x00007fd3fad11cf5 in __futex_abstimed_wait_common64 () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libpthread.so.0
    #1  0x00007fd3fad0bc22 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libpthread.so.0
    #2  0x00000000039cd0a9 in waitCondition ()
    #3  0x0000000003991234 in waitForWorkerCapability ()
    #4  0x0000000003991c10 in yieldCapability ()
    #5  0x000000000399da16 in scheduleYield ()
    #6  0x000000000399ce24 in schedule ()
    #7  0x000000000399fe72 in scheduleWaitThread ()
    #8  0x0000000003994675 in rts_evalIO ()
    #9  0x0000000001ba9e60 in zdbenacozm0zi1zi0zi0zmCSvtJjDyD2JEU6ocAylRawzdBenacoziStorezdbenacozzm0zzi1zzi0zzi0zzmCSvtJjDyD2JEU6ocAylRawzuBenacozziStorezuinlinezzuczzuffizzu6989586621681244124 ()   <- this is my notice processor, generated by inline-c
    #10 0x00007fd3fc69209c in pqGetErrorNotice3 () from /nix/store/ay1gs2am2ani1kyyfjpgbsvl5ynm2vpw-postgresql-9.6.24-lib/lib/libpq.so.5
    #11 0x00007fd3fc6923b2 in pqParseInput3 () from /nix/store/ay1gs2am2ani1kyyfjpgbsvl5ynm2vpw-postgresql-9.6.24-lib/lib/libpq.so.5
    #12 0x00007fd3fc689415 in PQisBusy () from /nix/store/ay1gs2am2ani1kyyfjpgbsvl5ynm2vpw-postgresql-9.6.24-lib/lib/libpq.so.5
    #13 0x0000000002af1a42 in postgresqlzmsimplezm0zi6zi4zmJcdIv2KqvAC7dE8DaaZZEkW_DatabaseziPostgreSQLziSimpleziInternal_zdwgetResult_info ()   <- Haskell function calling postgresql-simple
    #14 0x0000004200e14248 in ?? ()
    #15 0x0000000000000000 in ?? ()
    

    The key thing in here is that I saw that PQisBusy() called back into functions of the Haskell runtime. So I checked whether the wrapper for that function was safe as needed. But it was unsafe.

@phadej
Copy link
Collaborator

phadej commented Apr 20, 2022

i'd prefer to default to all PGConn and PGResult functions to be marked safe. E.g. PQgetlength is called on each table cell, so the price goes up anyway, (in FromRow instances of postgresql-simple e.g.)

So I'd rather error-out to the safe side, and then mark functions unsafe when there is motivation and evidence it's fine to do so.

Copy link
Contributor

@ulidtko ulidtko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surface-level LGTM, #9 related

phadej added a commit that referenced this pull request Jul 29, 2023
See #19

libpq can callback into Haskell, or do IO, so it's safer to just not use unsafe.
The overhead should be minimal.
@phadej
Copy link
Collaborator

phadej commented Jul 29, 2023

I made all functions safe in #48

@phadej phadej closed this Jul 29, 2023
@nh2
Copy link
Author

nh2 commented Jul 29, 2023

Very good!

@nh2
Copy link
Author

nh2 commented Oct 3, 2024

I started an initiative for GHC to facilitate finding long-running unsafe calls:

GHC #25333: Add RTS stats and alerts for long-running unsafe foreign function (FFI) calls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants