-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfaults when under pressure? #1986
Comments
Hi Peter, I previously helped one of the organisations experiencing the problems reported in gatsbyjs/gatsby#6291 and there's a summary of the findings in gatsbyjs/gatsby#6291 (comment) Thanks for bringing it to my attention again. By stressing my machine with other CPU-bound processes at the same time until it overheated and throttled the clock rate, I finally managed to reproduce a crash and get a backtrace using the latest sharp v0.23.3, its vendored copy of libvips v8.8.1 and the latest Node.js LTS v12.13.1.
From what I can tell, the Gatsby build process is queuing filesystem-based input images one-at-a-time then attempting to concurrently generate 6 output images to memory, resized in various dimensions. All of these images are JPEGs. sharp uses libuv-managed threads, which by default will limit this to 4 concurrently (and queue the other 2). It looks like Gatsby calls If I set the @jcupitt Can you see any possible thread-safety issues in this backtrace? Perhaps something in I'll attempt to get it to crash with a debug build of libvips to try to discover which functions the |
Hi. I'm really happy to hear you were able to get a repro AND trace, as I was unable to get the trace. I will try to go through the code in Gatsby tomorrow and see how we can improve that situation. I don't think it should be hammering the process like that in the first place as that has different perf implications as well.
That's interesting because on my machine, this happens all the time (whenever I trigger any computation) and have these messages in my logs every time. So perhaps the problem is related to clock throttling and stressing. Anyways. Happy to hear you have a repro and hoping it'll improve the situation :) |
(Perhaps my test with single cpu was botched, I can't say for sure right now, so take that with a grain of salt) |
Here's a more detailed backtrace using the latest libvips master branch compiled from source:
|
That's very interesting. I'll have a look today. |
I think I found what's making those message before the crash: https://github.com/libvips/libvips/blob/master/libvips/iofuncs/header.c#L311 If you change that to be: /*
if( G_VALUE_TYPE( value ) == G_TYPE_STRING )
g_value_init( &meta->value, VIPS_TYPE_REF_STRING );
else
g_value_init( &meta->value, G_VALUE_TYPE( value ) );
*/
g_value_init( &meta->value, 0 ); And try
The same set of errors, so probably the
This could be caused by an image being unreffed during the copy, but that seems unlikely to me, the chaos would be even worse. It's more likely that some other thread is changing the metadata on the input image during the copy. Could this be happening? libvips images have a list of arbitrary name/value pairs attached to them, and these are inherited as images are processed. They are used for things like ICC profiles and EXIF data, but user code can add extra metadata too if it's convenient. Image metadata is supposed to be immutable: after the image has been constructed, it's all frozen and must not be changed or you get bad behaviour like this. However, at the moment, there's no lock mechanism to prevent you changing image metadata. How about adding something to enforce this? It could be very simple, eg: void
vips_image_set( VipsImage *image, const char *name, GValue *value )
{
g_assert( name );
g_assert( value );
/* If this image is shared, block metadata changes.
*/
if( G_OBJECT( image )->ref_count > 1 ) {
g_info( "can't set metadata \"%s\" on shared image", name );
return;
}
meta_init( image );
(void) meta_new( image, name, value );
... etc. |
Thanks John, amazing debugging skills there as always. If I add your suggestion to libvips and run the sharp test suite against it I see things like:
which led me to a few possible sources within libvips: https://github.com/libvips/libvips/blob/master/libvips/foreign/jpeg2vips.c#L613 Next step (for me) is to try to reproduce the original crash with this extra debugging in place. |
Yes, bits of the test suite fail too, I'll fix them. |
Thanks John, I also added this debug to https://github.com/libvips/libvips/blob/master/libvips/colour/profile_load.c#L208 |
Commit bb15cd9 ensures sharp always takes a copy before updating/removing metadata. |
If images are shared (ref count > 1), block changes to the set of metadata items on the image. These can cause crashes in highly threaded programs. See lovell/sharp#1986
I made a PR to gather up any changes libvips needs for this. Please push any related patches there. I put your nice |
During write, we often call vips__exif_update(). This updates the exif block from the other image metadata prior to save. Always copy the image before calling this. See lovell/sharp#1986
I found a couple of places in libvips which were breaking the "metadata is frozen" rule. The big one was I'll backport libvips/libvips#1483 (but not the locks on metadata changes) to current stable 8.8 and it should be easy to test. |
During write, we often call vips__exif_update(). This updates the exif block from the other image metadata prior to save. Always copy the image before calling this. See lovell/sharp#1986
OK, the head of 8.8 (it's 8.8.4 at the moment) includes this change to metadata handling and (I hope!) should be easy to test. |
Brilliant, thanks John, I've been unable to reproduce a segfault using the |
That's great! @pvdz would you be able to test 8.8.4? It should be a drop-in replacement for your current libvips. |
(or are there related sharp changes that would be needed too, Lovell?) |
I found a couple of places in sharp that could potentially cause this problem, addressed in bb15cd9, which will be in sharp v0.23.4. |
@jcupitt Brilliant, thanks John. @pvdz In terms of semver, sharp is at major version zero, which means breaking changes such as removal of support for Node.js 8 arrive in minor increments and remain compliant with e.g. use of the ^ range selector. The removal of support for Node.js 8 unblocks #1282, which I believe is of interest to Gatsby amongst many others. The ABI stability that N-API will (hopefully) bring is pretty much the last remaining task before we organise the "sharp version 1" release party. |
I was thinking about this again.
Even if we find all of the modifications to metadata in potentially shared
images in sharp and libvips, other libvips users could still be in error
and get bitten by these races.
The problem is that the API allows potentially racey behaviour and has no
way of reliably checking or enforcing the conventions that user code is
supposed to follow to be safe.
How about adding some mutexes to make the current API safe, and designing a
new, non-racey API for the next version. I think we’d just need to add a
global lock for setting image metadata and for metadata copy.
It’s a bit ugly, but the performance hit should be minor.
What do you think?
…On Sat, 18 Jan 2020 at 10:18, Lovell Fuller ***@***.***> wrote:
@jcupitt <https://github.com/jcupitt> Brilliant, thanks John.
@pvdz <https://github.com/pvdz> In terms of semver, sharp is at major
version zero, which means breaking changes such as removal of support for
Node.js 8 arrive in minor increments and remain compliant with e.g. use of
the ^ range selector.
The removal of support for Node.js 8 unblocks #1282
<#1282>, which I believe is of
interest to Gatsby amongst many others. The ABI stability that N-API will
(hopefully) bring is pretty much the last remaining task before we organise
the "sharp version 1" release party.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1986?email_source=notifications&email_token=AAENZ26PUK2JBJBMO3RZJH3Q6LJQTA5CNFSM4JRY4XJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJJVCNI#issuecomment-575885621>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAENZ22RBBK7RIMDCAYARC3Q6LJQTANCNFSM4JRY4XJQ>
.
|
Does it have to be a global mutex though @jcupitt ? Can't it possibly be one mutex per object at which level metadata are abstracted in libvips? Unless the metadata lib is not thread-safe itself? |
It could be a lock on the image, but there isn’t a convenient one we can
use, so we’d need to add another member. I think I’d be nervous about doing
that in a minor release.
The libvips global lock is only used very lightly, so there should be
little contention. It ought to be a safe and simple fix for now.
We can try something better for 8.10.
|
I made a branch with locks for set / remove / copy. Hopefully this will resolve the crash, or at least change the stack trace. Let's remove this hack and add some new API for 8.10. |
Another attempt at fixing crashes on metadata chenage in highly threaded applications. Global lock around set, remove and copy metadata. This is crude, but simple, the performance impact should be small, and ought to resolve the problem. We'll do something better for the next version. see lovell/sharp#1986
I can no longer reproduce this problem when using libvips compiled from the |
That's great! @pvdz if you give a thumbs-up as well, let's push out 8.9.1 with this fix. |
I'm a little busy to dig deep I'm afraid. Would suggest to push this as is. Sounds some things have been fixed and that's an improvement regardless. If I can repro it later I'll report back. Thanks! |
OK, let's merge and make 8.9.1. |
@lovell with libvips v8.9.1 available now, can we get a sharp v0.24.1 containing the updated dependency? i'd love to see these segfaults gone ;) |
@oezi sharp will always look for and use a globally-installed libvips in preference to its prebuilt copy to allow you to use a newer libvips with an older sharp. |
If you could also release 8.9.1 in https://github.com/lovell/sharp-libvips/releases I can test it. Or another easy way to temporarily get a contained build (on linux, ubuntu) to test drive. I'm fine with inline replacing the cached 8.9.0 file with a new version as a workaround. Not sure what your release schedule is/was here? :) |
@lovell can you give us any update on this, please? |
It seems I can no longer trigger the problem on this branch. 30 or 40 runs on high system load (aka a Zoom call, lol) without signs of trouble. Thank you! |
Can confirm, no more segfaults with the |
This is great news, thank you very much for testing! |
sharp v0.25.0 with libvips v8.9.1 is now available; enjoy! |
Awesome. Thanks for the effort on your end(s)! :D We expect to upgrade soon and ship this as well. |
We've now bumped Sharp in Gatsby, affecting the following versions. Thanks again!
|
I'm trying to debug a segfault while using Gatsby that is potentially in the realm of using sharp.
(I'm a performance eng for Gatsby)
It seems that sites with lots of images can trigger a segfault and we're not sure when exactly this happens.
I get this fairly frequently with a simple site that just has many images (~1500 for my machine) on the example site posted (in http://github.com/gatsbyjs/gatsby/pull/6291) as a repro;
Clone https://github.com/jp887/gatsby-issue6291-repro
Update package.json to set all gatsby deps to *
Clear the npm cache (
~/.npm
)Clear node_modules, in case that's relevant
npm install
yarn gen-images 1500
(It triggers with 1500 images for me, this will try to process 9000 images in total)gatsby build
For me, often it segfaults roughly between 25% and 75%.
I'm on ubuntu (xfce) 19.10
Node v8.16.2 (through nvm)
I went through some of the older Gatsby issues relating to this bug. I am able to reproduce it with
.simd(true)
and.simd(false)
, and same for.cache(true)
and.cache(false)
.Reducing the available memory to node does not seem to make a difference (I set it to 500mb, half of the default, but saw no increase in crash rate). Was also able to repro it running single core.
Please let me know how I can help :)
The text was updated successfully, but these errors were encountered: