Rewrite cucumber test caching (and support logic) #2795

TheMarex · 2016-08-22T08:36:21Z

Issue

This PR implements #2745. As a result a lot of the support logic has to be reworked. Logs are now handled differently, they are saved by scenario in test/logs/{feature_path}/{scenario}.log.

/cc @MoKob @daniel-j-h @danpat @lbud

Tasklist

make all tests pass
rebase + squash
review
adjust for for comments

Requirements / Relations

None.

MoKob · 2016-08-22T15:42:39Z

cucumber.js

-    default: '--require features --tags ~@stress --tags ~@todo',
-    verify: '--require features --tags ~@todo --tags ~@bug --tags ~@stress -f progress',
-    jenkins: '--require features --tags ~@todo --tags ~@bug --tags ~@stress --tags ~@options -f progress',
-    bugs: '--require features --tags @bug',


are we completely removing the bugs tag, or is this an accident to leave this out?

ok, looks like all bugs are removed below.

ah this was just a small cleanup. @todo and @bug seem to have served the same function, so it is probably easier to have just one.

There are actually quite a few test cases marked as todo that seem weird, we should maybe clean up at some point. (@stress also seems to be dysfunctional)

Fully agree on the clean-up.
I tried to use todo as missing feature and bug as broken feature. I don't think that this separation serves much of a purpose, though. I am missing a complete overview if there can be important distinctions in the code-base but I agree that it looks like a good candidate to clean-up.

TheMarex · 2016-08-22T22:20:24Z

Some tests are currently failing because osrm-routed just exists with error code 0 without any errors, in the middle of a scenario. Each test passes individually, just running them like cucumber -p verify triggers it.

@lbud does this sound familiar to any bugs you hit back in the day?

EDIT: Fixed this. maxBuffer for exec was too small.

TheMarex · 2016-08-30T18:25:44Z

Tests pass for me locally but there seem to be problems on travis.

/cc @springmeyer

springmeyer · 2016-08-30T22:08:36Z

I've been poking at this branch. All tests are passing locally, but on travis there are still 60 failures:

712 scenarios (60 failed, 652 passed)
3299 steps (60 failed, 1 skipped, 3238 passed)

And what looks like a bug in our JS code whereby we try to write to the log file after finishScenarioLogs has been called and the logStream has been nulled out.

/home/travis/build/Project-OSRM/osrm-backend/node_modules/cucumber/lib/cucumber/runtime/event_broadcaster.js:30
            throw error;
            ^
TypeError: node_modules/cucumber/lib/cucumber/support_code/library.js:15 node_modules/cucumber/lib/cucumber/support_code/library.js:15 Cannot read property 'write' of null
    at Object.writeToStream (/home/travis/build/Project-OSRM/osrm-backend/features/support/log.js:44:39)
    at start (/home/travis/build/Project-OSRM/osrm-backend/node_modules/d3-queue/build/d3-queue.js:68:13)
    at poke (/home/travis/build/Project-OSRM/osrm-backend/node_modules/d3-queue/build/d3-queue.js:56:26)
    at Queue.queue.defer (/home/travis/build/Project-OSRM/osrm-backend/node_modules/d3-queue/build/d3-queue.js:34:7)
    at Object.finishLogs.setupScenarioLogFile.callback.log.callback [as log] (/home/travis/build/Project-OSRM/osrm-backend/features/support/log.js:49:23)
    at emitOne (events.js:82:20)
    at Socket.emit (events.js:169:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at Socket.Readable.push (_stream_readable.js:111:10)
    at Pipe.onread (net.js:536:20)

springmeyer · 2016-08-31T00:57:42Z

I was able to replicate the above errors on linux in a docker image of Ubuntu Xenial. It looks like the logStream problem is independent of the cucumber test failures. The test failures can be replicated with ./node_modules/.bin/cucumber.js features/testbot/load.feature: https://gist.github.com/springmeyer/97f48cf7f64e9710e46b0b58655ec279. I notice that osrm-routed is left running.

I'm unsure the cause of these failures. But I wonder if the move from child_process.exec to child_process.spawn has caused the issue.

oxidase · 2016-08-31T04:34:09Z

@springmeyer it is nodejs/node#2098

exec https://github.com/Project-OSRM/osrm-backend/pull/2795/files#diff-80cc062f7a9c64a60e9ea8310db0675aR31 on linux runs command via sh

/bin/sh -c "/home/miha/foss/osrm-backend/build/osrm-routed /home/miha/foss/osrm-backend/test/cache/testbot/load.feature/dd1cffbaa9edee02f6605c9dd9f008c0/7dfbf4b6f2687a0dd069d80e893bb71b/23_load_data_directly_st.osrm -p 5000"

child.pid is a PID of /bin/sh
https://github.com/Project-OSRM/osrm-backend/pull/2795/files#diff-122ec2288966c104a49c369d3146f016R39 kills sh process and makes osrm-routed process orphaned
still running osrm-routed messes up following tests

It worked before on linux because osrm-routed was spawn-ed directly without sh.
I don't see any easy solution, but changing back to use spawn for osrm-routed. If it is ok i can update the branch.

TheMarex · 2016-08-31T07:49:16Z

I would totally be in favor of using spawn. Reasons why we use exec at the moment:

Passing environment variables over the shell (setting options.env does not seem to work on some platforms @springmeyer ?)
We need to capture a copy of the full stdout and stderr for some test cases (features/step_definitions/options.js)

Everything is now abstracted using runBin so it should be easy to switch, once the above problems are solved.

springmeyer · 2016-08-31T16:59:46Z

Passing environment variables over the shell (setting options.env does not seem to work on some platforms @springmeyer ?)

tldr; I also think spawn should be fine for us now.

@TheMarex you may be remembering the problem I hit with DYLD_LIBRARY_PATH on osx: it was that DYLD_LIBRARY_PATH is stripped by OS X when passed in options.env. You can workaround this with exec because it supports prepended environment variables like FOO=bar ./command whereas with spawn the only way to pass environment variables is by passing them to options.env. Anyway, that limitation of spawn does not apply to us anymore because my original need for prepending DYLD_LIBRARY_PATH was avoided (using install_name_tool) (however this is still only in WIP branches).

springmeyer · 2016-08-31T17:23:20Z

Thanks everyone for the help and comments. My next action: convert runBin to using child_process.spawn. This will require moving to a 'cmd',['arg','arg'] format. Will first ensure all tests are still passing on OS X, then will see if tests are passing on travis/linux. If travis/linux is still failing then I'll take a look at killing/tracking the child using the pid like we did before: however hopefully this will be unneeded.

oxidase · 2016-08-31T23:03:29Z

still one issue with the log stream that must be deleted in a callback at https://github.com/Project-OSRM/osrm-backend/pull/2795/files#diff-122ec2288966c104a49c369d3146f016R38 but not with a race condition before child exit

springmeyer · 2016-09-01T01:50:12Z

Thank you @oxidase for the great work on using execFile and for more flexible passing of options. I fixed one issue with the STXXLCFG value being passed through (that was leading to test failures because of slower running (or buggy?) stxxl disk fallback). And I found a place to shutdown osrm-routed to avoid the race condition. Looks like travis is now going green 🎉. Unless anyone has concerns I think this is ready to merge.

springmeyer · 2016-09-01T18:44:33Z

Testing locally and noticing what the # of files inside test/cache stays constant now, as we want, between runs of npm test:

$ find test/cache/ -type f | wc -l
16316

But after a run there are 128 files that are newly modified. This a minor problem but still seems unexpected to me: https://gist.github.com/springmeyer/c9ae99dd3f6ead94c1070365893c178e. @TheMarex can you think of a reason why the files in the cache would be overridden even if nothing changed in the osrm-backend code?

springmeyer · 2016-09-01T20:40:36Z

But after a run there are 128 files that are newly modified.

Looking like a race condition. I've found that when I run time ./node_modules/.bin/cucumber.js features/testbot/traffic_turn_penalties.feature sometimes osrm-contract and osrm-extract are run and sometimes not).

springmeyer · 2016-09-01T21:09:52Z

Found and fixed the problem in c1b504d. I noticed that 1 times in 10 the hash returned from

osrm-backend/features/lib/hash.js

Lines 8 to 21 in c1b504d

    
           hashOfFiles: (paths, cb) => { 
        
               let queue = d3.queue(); 
        
               for (let i = 0; i < paths.length; ++i) { 
        
                   queue.defer(fs.readFile, paths[i]); 
        
               } 
        
               queue.awaitAll((err, results) => { 
        
                   if (err) return cb(err); 
        
                   let checksum = crypto.createHash('md5'); 
        
                   for (let i = 0; i < results.length; ++i) { 
        
                       checksum.update(results[i]); 
        
                   } 
        
                   cb(null, checksum.digest('hex')); 
        
               }); 
        
           },

was different. This was happening because sometimes the order of the files passed in was different. Serializing the queue so that the files were read in a predictable order and therefore consistently sorted before hashing should solve this.

oxidase · 2016-09-05T06:34:52Z

@springmeyer looks good to me for rebase and merge. The last timeout issue was due to slow md5 sum computation with a cold FS cache (locally for me ~10 seconds).

oxidase · 2016-09-05T06:55:51Z

Just found one issue - https://github.com/Project-OSRM/osrm-backend/blob/tests/caching/features/support/hooks.js#L53 always makes osrm-routed shutdown. I will revert it and add osrm-routed logging output to a new stream.

oxidase · 2016-09-14T16:53:51Z

@TheMarex there is still a problem with a data race if timeout occurs during osrm-extract or osrm-contract. 1b4f087 is a partial "solution", because if timeout occurs during osrm-extract, the scheduled osrm-contract will be started with parameters for the next scenario, but timestamp will be created for the original one. I see two possibilities:

make all task parameters local at defer calls. It will ignore timeouts during extraction and contraction (this is similar to the previous behavior), so it will be possible to run multiple extract/contract processes simultaneously.
setup scenario queue in Before, make extract/contract tasks abortable, and abort execution of queued tasks if a timeout occurs.

springmeyer · 2016-09-16T17:37:15Z

@oxidase thanks for your further work on this. I propose we merge this and ticket the remaining issue you are seeing around behavior when a timeout occurs. Since, if I am understanding right, 1) timeouts should be rare, 2) you have ideas for solving handling them better, and 3) we have other PRs that would benefit from this being merged first (notably #2834). Does that sound right? Do you agree that merging this and handling timeout handling in a separate PR would be acceptable?

Data race occurs in this.processedCacheFile after timeouts in runBin callbacks

karenzshea · 2016-09-16T18:54:07Z

I rebased this onto master. Conflicts were just around retagging @ a few cucumber tests.

update: argh messed this up the first time.

oxidase · 2016-09-16T22:52:43Z

@springmeyer i have made the first option, because making a global queue with abortable jobs would require too much time. At the moment if preprocessing starts it will finish even if a timeout occurs.Most important is that stamp files correspond to the original scenario. The last change makes consistent only extraction and contraction. If timeout occurs during writeOSM or linkOSM then possible error will be

    Step Definition: features/step_definitions/routing.js:4
    Message:
      Error: EEXIST: file already exists, link '/home/miha/foss/osrm-backend/test/cache/guidance/end-of-road.feature/bdbf19af33facfd685cb902e571cb78e/8_end_of_road_with_through_street.osm' -> '/home/miha/foss/osrm-backend/test/cache/guidance/end-of-road.feature/bdbf19af33facfd685cb902e571cb78e/7d282afd3cf004904a81c1bb22d409e6/8_end_of_road_with_through_street.osm'

To fix it the corrupted OSM files must be removed, but for 5 seconds timeout it is very unlikely.

TL;DR: LGTM for merge.

oxidase

👍

springmeyer · 2016-09-19T23:41:29Z

Rebased and squashed and landed in 7d124ce. This landed a bit sooner than I intended - meant to consult with @MoKob beforehand. But unless there are concerns I'll keep it in master and it should have no harm on the code and just improves caching on travis and predictability and speed of test runs.

#2745

MoKob · 2016-09-20T10:47:52Z

@springmeyer seems fine. Since it is not touching any features, we are in a good state. I just would love to get 5.4 fixed before introducing major features into master, since porting patches could get painful otherwise. Hopefully we will have this done soon.

This reverts commit 7d124ce.

TheMarex · 2016-10-02T11:50:34Z

Great work here everyone. 🙇

TheMarex force-pushed the tests/caching branch from e85f386 to ff097c8 Compare August 22, 2016 08:52

MoKob reviewed Aug 22, 2016
View reviewed changes

TheMarex mentioned this pull request Aug 31, 2016

Exit code test failure on OSX #2823

Closed

karenzshea assigned springmeyer Aug 31, 2016

karenzshea mentioned this pull request Sep 1, 2016

Allow passing options to lua profiles #2834

Closed

4 tasks

MoKob added Refactor Work In Progress labels Sep 2, 2016

MoKob changed the title ~~[WIP] Rewrite cucumber test caching (and support logic)~~ Rewrite cucumber test caching (and support logic) Sep 2, 2016

oxidase force-pushed the tests/caching branch from 9575bc5 to f95f1f2 Compare September 5, 2016 06:46

oxidase force-pushed the tests/caching branch from f95f1f2 to 6fbfdf1 Compare September 5, 2016 21:09

TheMarex added 2 commits September 16, 2016 14:49

Change sha1 to md5 in cucumber because its faster

26de439

Move cucumber code around

f4d6f45

Dane Springmeyer and others added 7 commits September 16, 2016 14:53

fix the passing of STXXLCFG value

8736575

shut down osrm-routed in After hook

7d3bd69

avoid potential double callback call

e017529

fix unstable hashing of files

dbd605d

Make log stream local to children processes

b595084

Fix stamp file names data race

9f4c880

Data race occurs in this.processedCacheFile after timeouts in runBin callbacks

Set 30 seconds timeout for MD5 hashing in BeforeFeatures

dd08ab2

karenzshea force-pushed the tests/caching branch from 6fbfdf1 to b2d45b7 Compare September 16, 2016 18:53

karenzshea force-pushed the tests/caching branch from b2d45b7 to 7fdb8f6 Compare September 16, 2016 19:18

oxidase added 3 commits September 17, 2016 00:15

Print exit reason with signals information

8b654a2

Avoid osrm-routed shutdown in After hook

6abc132

Make a local copy of parameters to avoid inconsistency at timeouts

f6e2ba0

oxidase force-pushed the tests/caching branch from fbbfbfa to f6e2ba0 Compare September 16, 2016 22:39

oxidase approved these changes Sep 17, 2016

View reviewed changes

karenzshea added Ready To Merge and removed Work In Progress labels Sep 19, 2016

MoKob added this to the 5.5.0 milestone Sep 19, 2016

springmeyer closed this in 7d124ce Sep 19, 2016

springmeyer reopened this Sep 19, 2016

springmeyer closed this Sep 19, 2016

MoKob pushed a commit that referenced this pull request Sep 20, 2016

Close GH-2795: Rewrite cucumber test caching (and support logic). Fixes

c1041e5

#2745

MoKob deleted the tests/caching branch September 20, 2016 10:48

daniel-j-h added a commit that referenced this pull request Sep 20, 2016

Revert #2795: Rewrite cucumber test caching, Tests Failing At Random

e1a5d8a

This reverts commit 7d124ce.

daniel-j-h mentioned this pull request Sep 20, 2016

Revert #2795: Rewrite cucumber test caching, Tests Failing At Random #2934

Closed

oxidase mentioned this pull request Sep 20, 2016

Fix "Error: write after end" and removed @bug tags #2936

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite cucumber test caching (and support logic) #2795

Rewrite cucumber test caching (and support logic) #2795

TheMarex commented Aug 22, 2016 •

edited by springmeyer

Loading

MoKob Aug 22, 2016

MoKob Aug 22, 2016

TheMarex Aug 22, 2016 •

edited

Loading

MoKob Aug 22, 2016 •

edited

Loading

TheMarex commented Aug 22, 2016 •

edited

Loading

TheMarex commented Aug 30, 2016

springmeyer commented Aug 30, 2016 •

edited

Loading

springmeyer commented Aug 31, 2016

oxidase commented Aug 31, 2016 •

edited

Loading

TheMarex commented Aug 31, 2016

springmeyer commented Aug 31, 2016

springmeyer commented Aug 31, 2016

oxidase commented Aug 31, 2016

springmeyer commented Sep 1, 2016

springmeyer commented Sep 1, 2016

springmeyer commented Sep 1, 2016

springmeyer commented Sep 1, 2016

oxidase commented Sep 5, 2016

oxidase commented Sep 5, 2016

oxidase commented Sep 14, 2016

springmeyer commented Sep 16, 2016

karenzshea commented Sep 16, 2016 •

edited

Loading

oxidase commented Sep 16, 2016

oxidase left a comment

springmeyer commented Sep 19, 2016

MoKob commented Sep 20, 2016

TheMarex commented Oct 2, 2016

Rewrite cucumber test caching (and support logic) #2795

Rewrite cucumber test caching (and support logic) #2795

Conversation

TheMarex commented Aug 22, 2016 • edited by springmeyer Loading

Issue

Tasklist

Requirements / Relations

MoKob Aug 22, 2016

Choose a reason for hiding this comment

MoKob Aug 22, 2016

Choose a reason for hiding this comment

TheMarex Aug 22, 2016 • edited Loading

Choose a reason for hiding this comment

MoKob Aug 22, 2016 • edited Loading

Choose a reason for hiding this comment

TheMarex commented Aug 22, 2016 • edited Loading

TheMarex commented Aug 30, 2016

springmeyer commented Aug 30, 2016 • edited Loading

springmeyer commented Aug 31, 2016

oxidase commented Aug 31, 2016 • edited Loading

TheMarex commented Aug 31, 2016

springmeyer commented Aug 31, 2016

springmeyer commented Aug 31, 2016

oxidase commented Aug 31, 2016

springmeyer commented Sep 1, 2016

springmeyer commented Sep 1, 2016

springmeyer commented Sep 1, 2016

springmeyer commented Sep 1, 2016

oxidase commented Sep 5, 2016

oxidase commented Sep 5, 2016

oxidase commented Sep 14, 2016

springmeyer commented Sep 16, 2016

karenzshea commented Sep 16, 2016 • edited Loading

oxidase commented Sep 16, 2016

oxidase left a comment

Choose a reason for hiding this comment

springmeyer commented Sep 19, 2016

MoKob commented Sep 20, 2016

TheMarex commented Oct 2, 2016

TheMarex commented Aug 22, 2016 •

edited by springmeyer

Loading

TheMarex Aug 22, 2016 •

edited

Loading

MoKob Aug 22, 2016 •

edited

Loading

TheMarex commented Aug 22, 2016 •

edited

Loading

springmeyer commented Aug 30, 2016 •

edited

Loading

oxidase commented Aug 31, 2016 •

edited

Loading

karenzshea commented Sep 16, 2016 •

edited

Loading