Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gap: 4.11.1 -> 4.12.1 #192548

Merged
merged 2 commits into from
Nov 4, 2022
Merged

gap: 4.11.1 -> 4.12.1 #192548

merged 2 commits into from
Nov 4, 2022

Conversation

collares
Copy link
Member

@collares collares commented Sep 23, 2022

Description of changes

aarch64 deadlocks on Sage tests no longer happen after this update and a few minor Sage tweaks (which will be done in a separate PR). Hooray!

This applies an unreviewed patch from https://trac.sagemath.org/ticket/34391 (https://git.sagemath.org/sage.git/patch?id2=9.8.beta2&id=eb8cd42feb58963adba67599bf6e311e03424328), but the test changes all look reasonable.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.11 Release Notes (or backporting 22.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@collares
Copy link
Member Author

collares commented Sep 23, 2022

@ofborg build sage

@collares
Copy link
Member Author

The deadlocks seem to have been replaced by crashes! This is very good news. I'll investigate in two weeks, hopefully gap-system/gap#5063 (which also affects x86_64) will be addressed by then too.

@ChrisJefferson
Copy link
Contributor

I'd be interested to know what these crashes are, is there a way to see them? -- people are successfully running GAP 4.12 on aarch64 as far as I know.

@collares
Copy link
Member Author

collares commented Sep 23, 2022

@ChrisJefferson The crashes happen during Sage doctesting. The build/test log is at https://logs.nix.ci/logfile/nixos/nixpkgs.192548/d56e9c1c-e4c5-4397-b15d-4135431c6c21, and you can grep for capture_stdout to see the backtraces. Note that this is with https://trac.sagemath.org/ticket/34391 applied, which changed how CloseOutput is called.

@collares
Copy link
Member Author

collares commented Sep 23, 2022

@ChrisJefferson I changed the Sage function slightly to avoid a possible GC problem on the Sage side. The capture_stdout function on the Sage side is now this:

cdef str capture_stdout(Obj func, Obj obj):
    cdef Obj s, stream, output_text_string
    cdef UInt res
    cdef TypOutputFile output

    try:
        GAP_Enter()
        s = NEW_STRING(0)
        output_text_string = GAP_ValueGlobalVariable("OutputTextString")
        stream = CALL_2ARGS(output_text_string, s, GAP_True)

        if not OpenOutputStream(&output, stream):
            raise GAPError("failed to open output capture stream for "
                           "representing GAP object")

        CALL_1ARGS(func, obj)
        CloseOutput(&output)
        return char_to_str(CSTR_STRING(s))
    finally:
        GAP_Leave()

Unfortunately, there are still some segfaults with backtraces which look like the following:

#0  0x0000fffff79740e8 in wait4 ()
#1  0x0000fffff5dc6b78 in print_enhanced_backtrace ()
#2  0x0000fffff5dc8190 in sigdie ()
#3  0x0000fffff5dcb1c0 in cysigs_signal_handler ()
#4  0x0000fffff7ffb7cc in __kernel_rt_sigreturn ()
#5  0x0000ffff99a0bf28 in ConvString ()
#6  0x0000000000000000 in ?? ()
#7  0x0000000000000000 in ?? ()
#8  0x0000000000000000 in ?? ()
#9  0x0000ffff99989930 in Pr ()
#10 0x0000ffff9998aa18 in CloseOutput ()
#11 0x0000ffff99884828 in capture_stdout () at /build/sage-src-9.7/pkgs/sagemath-standard/sage/libs/gap/element.pyx:154

For convenience, the Pr call, which is supposed to flush the buffer before closing, is made at https://github.com/gap-system/gap/blob/v4.12.0/src/io.c#L938, I think the ConvString call is made at https://github.com/gap-system/gap/blob/v4.12.0/src/io.c#L1126, although there are missing frames, and ConvString itself is at https://github.com/gap-system/gap/blob/v4.12.0/src/stringobj.c#L1249.

One funny thing I see over and over with our aarch64 builders is that they seem to hit a lot more GC-related problems then the x86_64 ones, even when the bug is not aarch64-specific. Is ConvString GC-unsafe somehow, is CloseOutput being called wrongly, or is the problem elsewhere?

@collares
Copy link
Member Author

collares commented Sep 23, 2022

I should also point out that there are only two callsites for the function capture_stdout function I mentioned above:

cdef str gap_element_repr(Obj obj):
    cdef Obj func = GAP_ValueGlobalVariable("ViewObj")
    return capture_stdout(func, obj)


cdef str gap_element_str(Obj obj):
    cdef Obj func = GAP_ValueGlobalVariable("Print")
    return capture_stdout(func, obj)

I wonder if calling GAP_ValueGlobalVariable without doing the GAP_Enter/GAP_Leave dance can cause GC problems.

Edit: Although it's hard to be 100% sure due to the non-deterministic nature of GC bugs, looks like that's really the problem! ofBorg ran the tests twice without crashing, while previously it crashed every time.

Copy link

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Some feedback, in case you are interested :-)

pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
@fingolfin
Copy link

I just saw the discussion on AArch64 crashes above. Some remarks:

  • "Is ConvString GC-unsafe somehow" -- depends on what you mean? It can allocate which can trigger a GC, which can cause problems; but that holds for many GAP kernel functions. It never should raise an error and is pretty safe in this regard at least
  • "is CloseOutput being called wrongly" -- it looks right to me. but a warning: if the CALL_1ARGS should raise a GAP error, then you'll miss the CloseOutput invocation, which could cause problems
  • "I wonder if calling GAP_ValueGlobalVariable without doing the GAP_Enter/GAP_Leave dance can cause GC problems" -- it is no worse than ConvString (actually, it is likely to cause an allocation -- indeed only if the string passed to it has not yet been used as name of global variable. But you call it on "OutputTextString" so it won't ever trigger a GC.

However, what could in theory happen is that GAP starts a garbage collection, but does not "see" your references to objects. You have these:

    cdef Obj s, stream, output_text_string
    cdef UInt res
    cdef TypOutputFile output

But I am not sure if the GC "sees" all of them this way in your code... nominally GAP_Enter exists to ensure that. Hence I am surprised it works "better" without it ?!?. I know very little about Cython, though, so I can't judge at all how this happens. Probably this is something to discuss with @embray ?

@collares
Copy link
Member Author

collares commented Oct 7, 2022

Thanks for the aarch64 comments! I should have written another comment to clarify the current status instead of just an edit. Currently sage tests pass on aarch64, but I needed to patch Sage like so: https://github.com/NixOS/nixpkgs/blob/3be405a031dcc259de0d53ff2adbb50c6247f3a6/pkgs/applications/science/math/sage/patches/copy-string-before-gap-leave.patch. I was a bit busy the last few days but I will upstream this patch to Sage soon. In particular, this changes two things:

  • Sage was previously getting the C-style string backing GAP's text stream (which held the output of ViewObj or Print) and holding onto it past the GAP_Leave call. I fixed that by copying it to a Python string inside a GAP_Enter/GAP_Leave scope, otherwise I saw crashes.
  • The above was not sufficient to fix all crashes. I also needed to move the GAP_ValueGlobalVariable("ViewObj") and GAP_ValueGlobalVariable("Print") calls inside the GAP_Enter/GAP_Leave block, otherwise I'd also see crashes.

I think I expressed myself badly before, but to clarify: You're absolutely right that narrowing the GAP_Enter/GAP_Leave blocks makes things worse. I had to make sure relevant things were "protected" by those blocks to avoid crashes.


echo "Copying files to target directory"
cp -ar . "$out/share/gap/build-dir"
# make install creates an empty pkg dir. make check requires a tst dir.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you install the tst files just so that you can run the test suite? Isn't that overkill? Or do you consider it desirable to install test suites? What others packages do that (e.g. for git itself, I don't think I've ever seen its test suite being installed alongside git itself).

Copy link
Member Author

@collares collares Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Nix packages often include wrappers to compensate for the non-FHS directory structure used by Nix, and we create a wrapper analogous to gap.sh (in bin/gap) to make sure -l gets passed in and GAP can find its files (to illustrate the non-FHS thing, the current wrapper for GAP 4.11 is located at /nix/store/n9gvcx1v1lwwcax8m8p6qzwmnk1fb67m-gap-4.11.1/bin/gap on my machine). Nix's GAP package insists on running tests with the installed version of GAP instead of the build dir version, I can only assume this caught a wrapper issue in the past.

I definitely think this is worth revisiting now that make install exists, but I'd like feedback from @timokau, who I believe was the maintainer who designed the current test structure in #53037, before changing this to make sure I'm not missing some key fact.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that instead of a wrapper script you could also add -DSYS_DEFAULT_PATHS="prefix" to CFLAGS (Fedora does it this way).

(See also gap-system/gap#5096 for a plan to do this "automatically " in the future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not remember why I decided to do things this way. If there is a way to run the tests with the installed version of gap without installing the tests, I agree that that would be better.

Copy link

@fingolfin fingolfin Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to let the installed GAP run the non-installed test suite (and I agree this is sensible and desirable).

one way would be to invoke it like this (untested; replace $srcdir suitably):

gap --quitonbreak -c "TestDirectory( \"$srcdir/tst/testinstall\", rec(exitGAP := true) );"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will try this! It will be a nice improvement.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally had a chance to test this myself and unfortunately it doesn't work that way, because some tests want to access additional files in the tst directory and for this assume it is available in a GAP root... but you could still do this then:

$prefix/bin/gap --quitonbreak -l ";$srcdir" $srcdir/tst/testinstall.g

The -l ";$srcdir" option appends $srcdir to the list of GAP root paths, so that code trying things like ReadGapRoot("tst/testinstall/MatrixObj/testmatobj.g") still works

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing and sorry for not doing it earlier. I appreciate the suggestion but I worry that -l ";$srcdir" might be equivalent to testing the build-dir version of GAP, in the sense that a GAP test run might succeed by finding things in $srcdir (other than tst/) that it wouldn't on a normal test run.

I also want to keep the package as lean as possible, but the default GAP 4.12.1 install takes 759MB, and just 6MB of those come from the tst directory. The current version of the package is heavier because it ships the full build dir (802MB), including the tst subdirectory. Therefore, I am inclined to postpone this improvement to have more time to experiment, leaving a TODO pointing to this discussion, mostly because I will still be pretty busy until next week. If further experimentation leads to the conclusion that testing the build-dir version of GAP is fine (which is likely!), I can just move the testrun from installCheckPhase to checkPhase, not ship the tst directory and avoid the problem altogether.

The -DSYS_DEFAULT_PATHS tip is interesting and I appreciate the information! Wrappers are very common in Nix land, though, and I want to investigate if this would break a content-addressed build of GAP. Therefore, I probably will investigate both improvements at once in the near future.

I hope postponing these improvements doesn't give the wrong impression. I really appreciate the feedback and will act on the feedback in the near future.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it's not urgent.

BTW regarding space, you could gzip some data files esp. in large packages, too reduce this a bit. E.g. for the termux package I suggested something like this to them:

	# To get them to be small enough, could gzip them in place
 	# (GAP transparently allows read access to those)
 	gzip -n pkg/primgrp/data/*.g
 	gzip -n pkg/transgrp/data/*.grp
 	gzip -n -f pkg/smallgrp/id*/*.*
 	gzip -n -f pkg/smallgrp/small*/*.*

This was only for the required packages primgrp, transgrp, smallgrp; it could be applied to more (e.g. irredsol/data/*.grp, ctbllib/data/*.tbl, sglppow/lib/3hoch8/rank*). I will also look into modifying at least those three packages to ship with pre-compressed files. (I've also used -9 / --best in the past but mostly the savings don't seem to justify the slow performance. YMMV)

@fingolfin
Copy link

FYI, I just released GAP 4.12.1

pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/sage/sagelib.nix Outdated Show resolved Hide resolved
pkgs/applications/science/math/gap/default.nix Outdated Show resolved Hide resolved
@collares
Copy link
Member Author

collares commented Oct 28, 2022

Thank you! Sorry, I was busy last week. Will push the changes now.

@collares collares marked this pull request as ready for review October 28, 2022 21:33
@collares
Copy link
Member Author

collares commented Oct 28, 2022

I've dropped the extra aarch64 patch needed for Sage tests to pass. I don't think they need to block the update, since Sage tests were already broken on aarch64 before. The Sage fixes have been submitted upstream (https://trac.sagemath.org/ticket/34701) and are now part of #198355, which depends on the present PR. The GAP update itself should now be in a good state thanks to invaluable help from @fingolfin, so I'll put this up for review from the other maintainers.

@collares collares requested review from timokau, omasanori and 7c6f434c and removed request for omasanori, 7c6f434c and timokau October 28, 2022 21:36
@collares collares changed the title gap: 4.11.1 -> 4.12.0 gap: 4.11.1 -> 4.12.1 Oct 28, 2022
@7c6f434c 7c6f434c merged commit 58f6230 into NixOS:master Nov 4, 2022
@collares collares deleted the gap-4.12 branch January 26, 2023 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants