Failing Windows tests in the master branch #2890

Closed
olexandr-konovalov opened this issue Oct 1, 2018 · 11 comments
Labels
kind: bug Issues describing general bugs, and PRs fixing them os: windows Issues and PRs that are (at least partially) specific to Windows

Comments

@olexandr-konovalov
Member

olexandr-konovalov commented Oct 1, 2018

The Jenkins job that tests the release candidate for the next major release on a Cygwin-free Windows machine (only accessible from St Andrews here) last passed on September 16th 2018, and fails today.

testinstall.g passes without and with default packages. However, teststandard has problems:

  1. This is a test file added in ENHANCE: Improvements to double coset calculations for hard cases #2741. I can confirm that the same problem occurs on other systems too, so it's not Windows specific (and that's why the number of remaining diffs in Test failures due to method re-ordering #2818 in teststandard is bigger by 4):
########> Diff in /proc/cygdrive/C/gap-4.11.0/tst/testextra/doublecoset.tst:12
# Input is:
m:=MaximalSubgroupClassReps(g);;
# Expected output:
# But found:
Error, reached the pre-set memory limit
(change it with the -o command line option)
########
########> Diff in /proc/cygdrive/C/gap-4.11.0/tst/testextra/doublecoset.tst:13
# Input is:
u:=First(m,x->Index(g,x)=17931375);;
# Expected output:
# But found:
Error, no method found! For debugging hints type ?Recovery from NoMethodFound
Error, no 1st choice method found for `FirstOp' on 2 arguments
########
########> Diff in /proc/cygdrive/C/gap-4.11.0/tst/testextra/doublecoset.tst:14
# Input is:
dc:=DoubleCosetRepsAndSizes(g,u,u);;
# Expected output:
# But found:
Error, no method found! For debugging hints type ?Recovery from NoMethodFound
Error, no 1st choice method found for `DoubleCosetRepsAndSizes' on 3 arguments
########
########> Diff in /proc/cygdrive/C/gap-4.11.0/tst/testextra/doublecoset.tst:15
# Input is:
Length(dc);Sum(dc,x->x[2])=Size(g);
# Expected output:
913
true

# But found:
Error, Variable: 'dc' must have a value
Error, Variable: 'dc' must have a value
########

If this test cannot be run with our default memory settings, I can move it, e.g. to benchmarks, to run weekly there in Jenkins.

  2. However, the next two diffs are Windows specific and I suspect that they are caused by Add C++ support to kernel; use it to reduce code duplication in permutat.c, objfgelm.c and others #2667:
testing: /proc/cygdrive/C/gap-4.11.0/tst/teststandard/processes/children.tst
   4837 [main] gap 3932 child_info::sync: wait failed, pid 3316, Win32 error 5
 321123 [main] gap 3932 fork: child -1 - forked process 3316 died unexpectedly, retry 0, exit code 0x24, errno 11
     10 [main] gap 3376 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
5328282 [main] gap 3376 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
21601391 [main] gap 3932 fork: child -1 - forked process 3376 died unexpectedly, retry 0, exit code 0x100, errno 11
     68 [main] gap 4136 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
2127988 [main] gap 4136 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
28052096 [main] gap 3932 fork: child -1 - forked process 4136 died unexpectedly, retry 0, exit code 0x100, errno 11
     60 [main] gap 3968 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1791374 [main] gap 3968 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
32379154 [main] gap 3932 fork: child -1 - forked process 3968 died unexpectedly, retry 0, exit code 0x100, errno 11
      8 [main] gap 5468 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1795553 [main] gap 5468 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
36775397 [main] gap 3932 fork: child -1 - forked process 5468 died unexpectedly, retry 0, exit code 0x100, errno 11
      8 [main] gap 2788 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1779028 [main] gap 2788 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
41125386 [main] gap 3932 fork: child -1 - forked process 2788 died unexpectedly, retry 0, exit code 0x100, errno 11
      7 [main] gap 3092 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1801796 [main] gap 3092 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
45572687 [main] gap 3932 fork: child -1 - forked process 3092 died unexpectedly, retry 0, exit code 0x100, errno 11
      8 [main] gap 4176 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1799540 [main] gap 4176 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
49847121 [main] gap 3932 fork: child -1 - forked process 4176 died unexpectedly, retry 0, exit code 0x100, errno 11
      7 [main] gap 1952 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
7270367 [main] gap 1952 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
59698525 [main] gap 3932 fork: child -1 - forked process 1952 died unexpectedly, retry 0, exit code 0x100, errno 11
      8 [main] gap 4316 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
17604072 [main] gap 4316 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
79808138 [main] gap 3932 fork: child -1 - forked process 4316 died unexpectedly, retry 0, exit code 0x100, errno 11
      7 [main] gap 3820 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1836918 [main] gap 3820 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
84182159 [main] gap 3932 fork: child -1 - forked process 3820 died unexpectedly, retry 0, exit code 0x100, errno 11
      7 [main] gap 6020 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1822207 [main] gap 6020 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
88512524 [main] gap 3932 fork: child -1 - forked process 6020 died unexpectedly, retry 0, exit code 0x100, errno 11
      7 [main] gap 6104 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1790691 [main] gap 6104 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
92874277 [main] gap 3932 fork: child -1 - forked process 6104 died unexpectedly, retry 0, exit code 0x100, errno 11
      8 [main] gap 4180 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
1807997 [main] gap 4180 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
97255873 [main] gap 3932 fork: child -1 - forked process 4180 died unexpectedly, retry 0, exit code 0x100, errno 11
      8 [main] gap 5968 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - MEM_COMMIT failed, Win32 error 1455
2138651 [main] gap 5968 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
101974554 [main] gap 3932 fork: child -1 - forked process 5968 died unexpectedly, retry 0, exit code 0x100, errno 11
########> Diff in /proc/cygdrive/C/gap-4.11.0/tst/teststandard/processes/child\
ren.tst:16
# Input is:
for i in [1..reps] do
children := List([1..20], x -> runChild(Random([1..2000]), Random([false,true]\
)));;
if ForAny(children, x -> x=fail) then Print("Failed producing child\n"); fi;
Perform(children, CloseStream);
od;
# Expected output:
# But found:
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Panic: cannot fork to subprocess (errno 11).
Failed producing child
Error, no method found! For debugging hints type ?Recovery from NoMethodFound
Error, no 1st choice method found for `CloseStream' on 1 arguments
The 1st argument is 'fail' which might point to an earlier problem

and

testing: /proc/cygdrive/C/gap-4.11.0/tst/testextra/small_groups2.tst
      6 [main] gap 3316 C:\gap-4.11.0\.libs\gap.exe: *** fatal error in forked process - WFSO timed out after longjmp
6301183 [main] gap 3316 cygwin_exception::open_stackdumpfile: Dumping stack trace to gap.exe.stackdump
   46281 ms (2015 ms GC) and 2.91GB allocated for small_groups2.tst
@olexandr-konovalov
Member Author

Also, note that when I tested #2667 on Windows, the directory for the binaries was named bin/i686-pc-cygwin-default32, but in the master branch that directory is now named bin/i686-pc-cygwin-default32-kv5. I had to adjust the Windows Jenkins script by adding -kv5 to ensure that GAP builds on Windows.

@ChrisJefferson
Contributor

These fork-related errors are always a possibility whenever you copy Cygwin to a different machine, which is why Cygwin officially doesn't support us doing it. The problem is more likely to trigger as you add more DLLs, so I guess we are just getting unlucky (there are probably other machines it will work on; it's a subtle bug).

@olexandr-konovalov
Member Author

Thanks for the explanation @ChrisJefferson, and also for telling me this morning where -kv5 comes from.

For -kvN in the path, we need to think about further automation, e.g. restoring make cygwin, which we had at some point before.

For the fork-related errors, maybe it is time to upgrade the Cygwin installation we use to build GAP...

@hulpke
Contributor

hulpke commented Oct 2, 2018

@alex-konovalov
This is a larger example that I added at your request. By all means move it away.

On the other hand, is there a fundamental reason for the 1GB memory limit? I've now hit it a couple of times with more elaborate examples.

@olexandr-konovalov
Member Author

@hulpke yes, I remember that I asked for it. It is still good to have it around. A quick fix would be to move it to benchmarks; a more elaborate one would be to split make teststandard into make teststandard and make testextra to match the directories, and allow more memory in the latter.

The 1GB limit helps to catch memory regressions, and there were cases in the past when its presence allowed us to detect some issues.

@fingolfin
Member

But a 1GB limit is a rather crude way to catch memory regressions. It would be much better to capture the output of TestDirectory, which tells us for each test how much memory it consumed, and then graph that.

@olexandr-konovalov
Member Author

olexandr-konovalov commented Oct 2, 2018

That output has a different meaning though:

testing: /circa/scratch/gap-jenkins/workspace/GAP-master-test/GAPCOPTS/64build\
/GAPTARGET/install/label/kovacs/GAP-master-snapshot/tst/testinstall/random.tst
    7729 ms (677 ms GC) and 1.41GB allocated for random.tst

or

total    365782 ms (96287 ms GC) and 18.5GB allocated

show how much memory was allocated in total over the whole run. That total is larger than 1GB, but at any given time the amount of memory used by GAP was smaller than 1GB.
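For anyone who wants to track these numbers over time, here is a minimal Python sketch that extracts the per-test timing and allocation figures from log text in the format shown in this thread. The function name `parse_test_log`, the KB/MB units, and the 1024-based conversion are assumptions for illustration (only GB appears in the logs here). Note that it records total allocation, not peak memory use.

```python
import re

# Matches per-test summary lines such as
#   "    7729 ms (677 ms GC) and 1.41GB allocated for random.tst"
# The trailing "total ..." line has no test name and is deliberately skipped.
LINE_RE = re.compile(
    r"^\s*(\d+) ms \((\d+) ms GC\) and ([\d.]+)(KB|MB|GB) allocated for (\S+)\s*$"
)

# Assumption: units are powers of 1024; only "GB" is seen in the logs above.
UNIT_BYTES = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_test_log(text):
    """Return a list of (test name, runtime ms, GC ms, bytes allocated)."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ms, gc_ms, amount, unit, name = m.groups()
            rows.append(
                (name, int(ms), int(gc_ms), float(amount) * UNIT_BYTES[unit])
            )
    return rows


sample = """\
testing: /proc/cygdrive/C/gap-4.11.0/tst/testextra/small_groups2.tst
   46281 ms (2015 ms GC) and 2.91GB allocated for small_groups2.tst
    7729 ms (677 ms GC) and 1.41GB allocated for random.tst
total    365782 ms (96287 ms GC) and 18.5GB allocated
"""

for name, ms, gc_ms, allocated in parse_test_log(sample):
    print(f"{name}: {ms} ms ({gc_ms} ms GC), "
          f"{allocated / 1024**3:.2f} GB allocated in total")
```

Rows like these could be appended to a CSV per Jenkins build and plotted, but catching a regression in peak usage would still need a different measurement.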

@fingolfin fingolfin added the kind: bug Issues describing general bugs, and PRs fixing them label Mar 21, 2019
@embray
Contributor

embray commented Aug 21, 2019

The problem is more likely to trigger as you add more DLLs

For reducing the chances of these kinds of fork problems it is good to rebase all executables and DLLs after building. For Sage I use this script (though easily adaptable): https://gitlab.com/sagemath/sage/blob/master/src/bin/sage-rebase.sh This finds all executables and DLLs in Sage and rebases them together. It is run automatically after each build.

What "rebase" means in this case is that it ensures each DLL is given its own non-overlapping range in the process's address space. This way, when a process is "forked", the new process will load the same DLLs to the same addresses. Otherwise, they are relocated at runtime by the NT kernel, and there is no way Cygwin can guarantee that they will be loaded again at the same address after a "fork".
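To make the idea concrete, here is a small Python sketch of what "non-overlapping ranges" means. It is only a conceptual model, not how the Cygwin rebase tool actually picks addresses. The DLL names, sizes, and start address are hypothetical; the 64 KiB alignment is the Windows allocation granularity.

```python
ALLOCATION_GRANULARITY = 0x10000  # 64 KiB, the Windows allocation granularity


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def assign_bases(image_sizes, start):
    """Give each image its own address range, packed upward from start.

    image_sizes maps a DLL name to its in-memory size in bytes.
    Returns a dict mapping each name to (base, end), with end exclusive.
    """
    layout = {}
    base = align_up(start, ALLOCATION_GRANULARITY)
    for name, size in sorted(image_sizes.items()):
        end = base + align_up(size, ALLOCATION_GRANULARITY)
        layout[name] = (base, end)
        base = end
    return layout


def overlaps(layout):
    """Return True if any two assigned ranges intersect."""
    ranges = sorted(layout.values())
    return any(prev_end > next_base
               for (_, prev_end), (next_base, _) in zip(ranges, ranges[1:]))


# Hypothetical DLL names and sizes, for illustration only.
sizes = {"libfoo.dll": 0x23000, "libbar.dll": 0x101000, "libbaz.dll": 0x8000}
layout = assign_bases(sizes, start=0x400000000)
for name, (base, end) in layout.items():
    print(f"{name}: {base:#x}-{end:#x}")
print("overlapping:", overlaps(layout))
```

With fixed, disjoint bases like these, a forked child can map every DLL at the same address the parent used, which is the property Cygwin's fork emulation depends on.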

@embray
Contributor

embray commented Aug 21, 2019

(This is a large part of why porting Sage in particular to 32-bit Cygwin is so difficult: you literally just run out of address space for all DLLs in Sage to occupy non-overlapping addresses, even though in typical usage you won't load every DLL.)

@fingolfin
Member

Is this issue still relevant?

BTW, note that on GAP master I added code a year ago to use posix_spawn instead of fork when available (see 6e34384), which should perhaps help with Cygwin.
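For readers unfamiliar with the call, here is a minimal Python sketch of starting a child with posix_spawn rather than fork plus exec. It is not the GAP kernel code from that commit. The helper name `run_child` is hypothetical, `os.posix_spawn` needs Python 3.8 or later on a POSIX-like platform, and whether it sidesteps the Cygwin fork failures depends on how the platform implements it.

```python
import os
import sys


def run_child(argv):
    """Start argv[0] with posix_spawn and return the child's exit code."""
    # posix_spawn starts a new program from an executable path in one call,
    # instead of a separate fork() followed by exec() in the child.
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: child was killed by a signal


# Use the running Python interpreter as a stand-in for a child process.
exit_code = run_child([sys.executable, "-c", "import sys; sys.exit(7)"])
print("child exited with", exit_code)
```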

@fingolfin fingolfin added the status: awaiting response Issues and PRs whose progress is stalled awaiting a response from (usually) the author label Oct 21, 2020
@olexandr-konovalov
Member Author

Fine to close, thanks.

@no-response no-response bot removed the status: awaiting response Issues and PRs whose progress is stalled awaiting a response from (usually) the author label Oct 24, 2020
@no-response no-response bot reopened this Oct 24, 2020
@olexandr-konovalov olexandr-konovalov added status: awaiting response Issues and PRs whose progress is stalled awaiting a response from (usually) the author and removed status: awaiting response Issues and PRs whose progress is stalled awaiting a response from (usually) the author labels Oct 24, 2020