Next generation Tile Reduce #46

Merged: 70 commits, Nov 21, 2015

Conversation

@mourner (Member) commented Oct 30, 2015

An early version of the next-generation Tile Reduce that is thousands of times faster than the current one. This is not a joke.

For example, generating the GeoJSON road difference between OSM and TIGER 2015 for US West at z12 (~26% of the US) takes 2 minutes 45 seconds on my MacBook Pro; previously it took a day on a much more powerful EC2 c3.4xlarge instance while eating all of the available RAM.

The new architecture also makes Tile Reduce much more flexible and loosely coupled, e.g. easy to extend for raster support, weird tile formats, lazy decoding, etc.

I'll write a devlog about how I did this, but the main idea behind the rewrite is removing almost all communication between the main thread and the workers and avoiding unnecessary work:

  • responsibility for fetching and parsing tiles moves fully to the workers
  • vector tiles are not parsed and converted to GeoJSON by default; instead, VT layer objects are passed to the workers for lazy decoding (parsing only the features that are relevant to the current task)
  • big GeoJSON output like the road difference is never passed back to the main thread; instead, it's streamed directly from the workers to stdout; the reduce event is only for smaller values like statistics

This somewhat breaks API compatibility, but that's something we have to do to support the new architecture. Upgrading existing processors should be easy once we finalize the work; a rough sketch of what a worker script could look like under this model follows.
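To make that concrete, here is a rough sketch of a worker-side map script under this model. Everything in it is illustrative: the (data, tile, writeData, done) signature, the per-source layer objects, and the coveredByTiger helper are assumptions made for the example, not the finalized API.

// map.js: hypothetical worker script (illustrative names, not the final API)
'use strict';

module.exports = function (data, tile, writeData, done) {
  // data.osm.roads is assumed to be a lazily decoded vector tile layer;
  // only the features we actually touch get parsed
  var roads = data.osm.roads;
  var missing = 0;

  for (var i = 0; i < roads.length; i++) {
    var road = roads.feature(i); // decode a single feature on demand
    if (!coveredByTiger(road, data.tiger)) {
      // big GeoJSON output goes straight from the worker to stdout,
      // never back through the main thread
      writeData(JSON.stringify(road.toGeoJSON(tile[0], tile[1], tile[2])) + '\n');
      missing++;
    }
  }

  // only small values (statistics here) travel back for the reduce step
  done(null, missing);
};

// placeholder for the actual OSM vs TIGER diff logic
function coveredByTiger(road, tigerLayers) {
  return false;
}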

It also sports a fancy progress bar so that you can feel the unreal performance:

[GIF: tilereduce progress bar]

While all the basic stuff works, this is still a work in progress. Rough list of remaining tasks:

  • clean up and push the new diff processor example (today)
  • get feedback on the API changes
  • code review by the team
  • re-add remote tiles support
  • re-add non-bbox covers
  • streaming tile list input
  • avoid overloading the message queue
  • re-add overzoom (slicing to higher zoom levels than the sources provide)
  • fix the interleaved output issue @tcql
  • re-add tests (in progress) @morganherlocker
  • re-add rate limiting
  • support streaming the tile list from mbtiles directly @mourner
  • documentation & upgrade guide

I would love feedback and help on this. We could win so much with a blazing fast Tile Reduce, so let's make it happen sooner rather than later.

cc @morganherlocker @MateoV @aaronlidman @tcql @tmcw @lxbarth @ericfischer @batpad @Rub21 @geohacker @arunasank @isiyu

@mourner (Member, Author) commented Oct 30, 2015

Here's that US West OSM vs TIGER diff I generated in 3 minutes, shown on a map (disregard the weird boundaries).

@aaronlidman (Contributor)

This sounds fantastic, Vlad. I'll set aside some time this weekend to look at it more in depth.

@tcql (Contributor) commented Oct 30, 2015

Just adding my 👍 / proof: this is way fast. I ran tests using mapbox/osm-sidewalker, and I like it. The API on the map side feels slightly less friendly to newbies (since data comes through as a vector tile instead of GeoJSON), but I think it's advantageous overall.

I'm definitely in to sink some time into helping move this along too.

@mourner (Member, Author) commented Oct 30, 2015

> The API on the map side feels slightly less friendly to newbies (since it comes through as a vector tile instead of GeoJSON), but I think it's advantageous overall.

With the new architecture, it's pretty easy to make an option to pass GeoJSON instead of VT to the map function, and make it the default. It will still be relatively fast.
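As an illustration (the option name and the shape of this helper are assumptions, not the actual implementation), the conversion could live in one small helper that either passes the VT layer through for lazy decoding or eagerly converts it with vector-tile-js's toGeoJSON:

// hypothetical sketch: default to GeoJSON, allow opting into raw VT layers
function prepareLayer(layer, x, y, z, raw) {
  if (raw) return layer; // hand over the VectorTileLayer for lazy decoding

  // default path: eagerly convert every feature for a newbie-friendly API
  var features = [];
  for (var i = 0; i < layer.length; i++) {
    features.push(layer.feature(i).toGeoJSON(x, y, z));
  }
  return {type: 'FeatureCollection', features: features};
}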

@mourner mentioned this pull request Oct 30, 2015
@tcql (Contributor) commented Nov 4, 2015

Added support for tile queue pausing so we don't pass too many messages to the child workers.
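Conceptually it boils down to pausing the tile stream while too many tiles are in flight and resuming it as workers report back. A minimal sketch of the idea (the stream, the worker list, and the limit are all stand-ins, not the code in this PR):

var fork = require('child_process').fork;
var Readable = require('stream').Readable;

// stand-ins: a couple of forked workers and an object-mode stream of [x, y, z] tiles
var workers = [fork('worker.js'), fork('worker.js')]; // 'worker.js' is a placeholder
var tileStream = new Readable({objectMode: true});
tileStream._read = function () {}; // tiles are pushed below
[[654, 1583, 12], [655, 1583, 12]].forEach(function (t) { tileStream.push(t); });
tileStream.push(null);

var PAUSE_LIMIT = 5000; // illustrative; the limit gets tuned later in this thread
var inFlight = 0;
var next = 0;

tileStream.on('data', function (tile) {
  workers[next++ % workers.length].send(tile);
  if (++inFlight >= PAUSE_LIMIT) tileStream.pause();
});

workers.forEach(function (worker) {
  worker.on('message', function () {
    inFlight--;
    if (inFlight < PAUSE_LIMIT) tileStream.resume(); // resuming a flowing stream is harmless
  });
});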

@mourner (Member, Author) commented Nov 5, 2015

@tcql manual limiting broke tile-reduce: if you run the count example now, the total tile count and the results are different each time.

@mourner (Member, Author) commented Nov 5, 2015

Fixed the tile list text stream somehow, but the JS array tile stream is still broken — for some reason it finishes when it hits the 50k limit.

@mourner (Member, Author) commented Nov 5, 2015

Fixed the JS stream as well. Now I wonder whether there would be a benefit in switching to spawn and piping between the processes instead of send, like in the comments here: nodejs/node-v0.x-archive#4374

edit: also this

@mourner (Member, Author) commented Nov 5, 2015

Running count on the z12 world on my MBP, it goes super smoothly and fast for a while (getting to maybe 30% of the world tiles), then suddenly the Node workers' memory starts to grow out of control and the run starts to stutter, eventually filling all available memory and making the terminal unresponsive:

[screenshot: memory usage graph]

The flat green area in the middle is the job at first; then you can see the spike.

I really wonder what might be the cause. We want the runs to have flat memory consumption regardless of the job size.

@mourner (Member, Author) commented Nov 5, 2015

Ha, when I switched the pause limit from 50000 to 5000, the world count went smoothly without memory spikes and finished successfully in under 7 minutes!

2303777 / 2303777 tiles (100%), 401.8s elapsed [===============...
Features total: 436992923

Now let's try OSM coverage...

@mourner (Member, Author) commented Nov 5, 2015

> if we handled process communication with streams, backpressure would stop the problem, but we'd kind of need to do a file readstream => some round-robin selection of which worker to pipe to

Now I see what you mean here @tcql (I'm still a streams noob); let's explore this cool idea. Another benefit, as far as I understand, is that in pre-Node 4 a forked process's send is blocking while spawned-process piping is not, so we could also get some perf gains by avoiding spamming the event loop.
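A sketch of that direction, assuming line-delimited tile input and the binary-split module mentioned later in this thread (file names and the worker count are placeholders; this is not what the PR implements):

var spawn = require('child_process').spawn;
var split = require('binary-split');
var fs = require('fs');

// spawn workers with writable stdin streams instead of using fork() + send()
var workers = [];
for (var i = 0; i < 4; i++) {
  workers.push(spawn('node', ['worker.js'], {stdio: ['pipe', 'inherit', 'inherit']}));
}

// round-robin each tile line onto a worker's stdin; when a worker's stdin buffer
// fills up, write() returns false and we pause the producer until it drains,
// which is plain stream backpressure
var next = 0;
var tiles = fs.createReadStream('tiles.txt').pipe(split());

tiles.on('data', function (line) {
  var stdin = workers[next++ % workers.length].stdin;
  if (!stdin.write(line + '\n')) {
    tiles.pause();
    stdin.once('drain', function () { tiles.resume(); });
  }
});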

@mourner (Member, Author) commented Nov 9, 2015

@tcql nope, replacing isPaused() with paused still breaks everything in Node 0.12.

@morganherlocker (Contributor)

I am working on testing now. I have generated a very small mbtiles fixture weighing in at 1 MB total. I think it will simplify our process if we just check this in for testing, rather than dealing with S3 and caching fixture assets in our CI. For benchmarking (rarer), people can use whatever they deem appropriate, such as the world OSM QA mbtiles file, which is just a curl away.

https://api.mapbox.com/styles/v1/morganherlocker/cigshqusv005xa6ks7iua9b2l.html?title=true&access_token=pk.eyJ1IjoibW9yZ2FuaGVybG9ja2VyIiwiYSI6Ii1zLU4xOWMifQ.FubD68OEerk74AYCLduMZQ#15.4/36.9777/-122.0222/-0.3

[screenshot: map preview of the mbtiles test fixture]

@morganherlocker (Contributor)

I processed an old TIGER road file from 2009 for the same area so we can test data with multiple sources and have the diffing example be runnable without downloading and processing external data. These are published at a stable map ID, so we can use that URL for testing HTTP tile sources.

[screenshot: map preview of the TIGER 2009 test data]

Inline review on the source-loading code:

function loadSource(source, done) {
  if (source.mbtiles) sources.push({name: source.name, getTile: require('./mbtiles')(source, done)});
@morganherlocker (Contributor)

Are we waiting for the ready callback to be called anywhere? For the remote parser it appears that the getTile parser is not actually returned by the time the ready callback fires. I am not sure if this is a bug, but the flow is a bit difficult to follow.

@mourner (Member, Author)

@morganherlocker for the remote parser, ready is called immediately because there is no initialization step: it can receive requests right away, unlike mbtiles, which needs to open a DB connection before it can serve any tiles.

@morganherlocker (Contributor)

I wonder if there is a way to make this an interface where any source can be used the same way, without knowing the implementation details (this could help us with additional plugins down the road). Perhaps instead of returning the getTile function synchronously, we should send it back as the result of the ready callback. I believe that would allow users of the parse interfaces to treat them the same.
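A sketch of what that could look like for the mbtiles source, with getTile handed back through the ready callback (names and paths are illustrative, not the code in this branch):

var MBTiles = require('mbtiles');

// every source module exposes the same shape: ready(err, getTile)
function createSource(source, ready) {
  new MBTiles(source.mbtiles, function (err, db) {
    if (err) return ready(err);
    ready(null, function getTile(tile, callback) {
      // tile is [x, y, z]; parsing the returned buffer into layers is elided here
      db.getTile(tile[2], tile[0], tile[1], callback);
    });
  });
}

// usage is identical for mbtiles, remote tiles, or any future source plugin
createSource({mbtiles: './fixtures/roads.mbtiles'}, function (err, getTile) {
  if (err) throw err;
  getTile([5276, 12757, 15], function (err, data) {
    // getTile only exists once the source is actually ready to serve tiles
  });
});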

@morganherlocker (Contributor)

OK, I am going to proceed with writing tests for this in its current form and may refactor in a separate PR (that requires some refactoring of the worker implementation). The remote interface diverts the execution flow in a confusing way IMO, for example:

var getTile = remote(source, function(err) {
  // I need to wait until I get here to see if there is an error on initialization
  if (err) throw err;
  // the next line fails because getTile has not actually been returned yet, despite the parser being "ready"
  getTile([5276, 12757, 15], function(err, layers) {
    // yay, results!
  });
});

@mourner (Member, Author)

I agree, we should improve the flow here to make things easier to test and extend.

@mourner (Member, Author)

Implemented your suggestion.

@morganherlocker (Contributor)

@mourner it appears we return all layers in this branch, as opposed to only the requested layers in the source object. Could preserving the layers array provide any significant parsing gains (keep in mind most people will want GeoJSON most of the time)? Is selective parsing worth the effort on the VT parsing side?

Current source example:

{
  name: 'streets',
  url: 'https://b.tiles.mapbox.com/v4/mapbox.mapbox-streets-v6/{z}/{x}/{y}.vector.pbf',
  layers: ['roads', 'tunnels']
}
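If the layers filter comes back, a minimal sketch of honoring it could look like this (illustrative only; with lazy decoding the win mostly shows up once layers are converted to GeoJSON):

// keep only the layers requested in the source definition; anything not
// listed is never converted or handed to the map script
function pickLayers(vt, wanted) {
  if (!wanted) return vt.layers; // default: expose every layer

  var picked = {};
  wanted.forEach(function (name) {
    if (vt.layers[name]) picked[name] = vt.layers[name];
  });
  return picked;
}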

@morganherlocker (Contributor)

The latest build is passing on Node 0.10 but failing on Node 0.12. I added 0.10 to the Travis config to track this behavior.

The issue is that the decoded vector tile under 0.10 looks like this with a raw data array:

{"buildings":{"version":1,"name":"buildings","extent":2048,"length":264,"_pbf":{"buf":[26,243,128,1,10,9,98,117, [...]

...while the 0.12 version returns a Buffer object with a wrapped data array:

{"buildings":{"version":1,"name":"buildings","extent":2048,"length":264,"_pbf":{"buf":{"type":"Buffer","data":[26,243,128,1,10,9,98,117,105,108,100,105,110,103,115,18,40,18,6,0,0,1,1,2,2,24,3 [...]

My hunch is that this issue is upstream in either the pbf lib or the vector-tile-js lib, but I will need to do some more digging.

@mourner (Member, Author) commented Nov 11, 2015

@morganherlocker regarding layers: I haven't re-added it yet because of lazy pbf decoding, but when we re-add automatic GeoJSON parsing by default, we should definitely also re-add the layers property at the same time for performance gains (but make it optional so that it parses all layers by default).

Regarding VT 0.10 vs 0.12, I'm looking into the issue now.

@mourner (Member, Author) commented Nov 11, 2015

@morganherlocker figured out the VT decoding issue: it happens simply because JSON.stringify behaves differently on Buffer objects:

console.log(JSON.stringify(new Buffer([0xc8,0xe8,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x01])));
[200,232,255,255,255,255,255,255,255,1] # node 0.10
{"type":"Buffer","data":[200,232,255,255,255,255,255,255,255,1]} # node 0.12

You just need to compare objects differently, e.g. using t.deepEqual on objects rather than comparing strings.
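For instance, a minimal tape sketch of the fix (the fixture object is made up; only the nested Buffer matters):

var test = require('tape');

test('decoded tile comparison works on 0.10 and 0.12', function (t) {
  var decoded = {version: 1, extent: 2048, _pbf: {buf: new Buffer([26, 243, 128, 1])}};
  var expected = {version: 1, extent: 2048, _pbf: {buf: new Buffer([26, 243, 128, 1])}};

  // fragile: the strings differ across node versions because of Buffer serialization
  // t.equal(JSON.stringify(decoded), JSON.stringify(expected));

  // robust: structural comparison handles the nested Buffer the same everywhere
  t.deepEqual(decoded, expected);
  t.end();
});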

@mourner (Member, Author) commented Nov 12, 2015

Added the ability to omit the area altogether if you use mbtiles sources: if the first source is mbtiles, tile numbers are streamed from it using node-mbtiles ZXYStream. We can update the API later to allow choosing which source to use.
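For reference, a bare-bones sketch of reading that tile list (the mbtiles path is a placeholder; node-mbtiles' createZXYStream emits newline-delimited z/x/y entries):

var MBTiles = require('mbtiles');

new MBTiles('./osm.mbtiles', function (err, src) {
  if (err) throw err;

  var leftover = '';
  src.createZXYStream()
    .on('data', function (chunk) {
      var lines = (leftover + chunk).split('\n');
      leftover = lines.pop(); // keep any partial trailing line for the next chunk
      lines.forEach(function (line) {
        if (line) console.log(line.split('/').map(Number)); // [z, x, y], ready for the tile queue
      });
    })
    .on('end', function () {
      if (leftover) console.log(leftover.split('/').map(Number));
    });
});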

@morganherlocker (Contributor)

I am noticing that binarysplit seems to crash when tile-reduce is run in the context of a test. I am pretty sure the main thread is grabbing TAP data somewhere, but I need to do some more digging.

@mourner (Member, Author) commented Nov 20, 2015

Warning: force-pushed a rebased branch. It merges cleanly now. Also squashed some tiny commits like "fix typo", and all merge commits were removed automatically.

@mourner (Member, Author) commented Nov 20, 2015

The diff script now lives at https://github.com/mourner/road-diff. I'll clean it up before blogging next week.

@tcql (Contributor) commented Nov 20, 2015

Working on adding a short doc section about events so we've got the bases covered.

Edit: 👍 done. @morganherlocker @mourner, review when you get a chance? I think we need to say more under the main Events header, but my brain doesn't want to do words right now.

@mourner (Member, Author) commented Nov 20, 2015

👍 let's merge and publish 3.0.0. We can improve as we go after that.

@morganherlocker merged commit 5fa9ef8 into master on Nov 21, 2015
@morganherlocker deleted the nextgen branch on Nov 21, 2015
@morganherlocker (Contributor)

:shipit:

@lxbarth commented Nov 21, 2015

🚂 !!
