
Use boost asio to run threadpool #197

Merged
merged 10 commits into systemed:master from kleunen:master on Feb 27, 2021

Conversation

kleunen
Contributor

@kleunen kleunen commented Feb 21, 2021

  • Small optimization to mmap storage
  • Use thread pool to run the tile generation

Can you check whether the shapes are rendered correctly? They always seem to output 0 on my machine:
Shape points: 0, lines: 0, polygons: 0

@kleunen
Contributor Author

kleunen commented Feb 21, 2021

The CI now fails due to the Boost version, but the CI runner also has a newer Boost version available. I think you can configure it with the BOOST_ROOT define; you don't actually have to install Boost using apt-get.
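For example, with a CMake-based build something like this should pick up a locally installed Boost (the path is illustrative):

cmake -DBOOST_ROOT=/opt/boost_1_75_0 -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)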

@kleunen
Contributor Author

kleunen commented Feb 21, 2021

@systemed
Owner

A very handy 20% improvement on the Great Britain extract:

real	43m58.630s
user	237m29.682s
sys	14m49.743s

Haven't tested the shapefile rendering yet but the output log looks good ("Shape points: 0, lines: 0, polygons: 439").

@kleunen kleunen force-pushed the master branch 3 times, most recently from d23f9d4 to 99d69e0 on February 22, 2021 at 06:50
@kleunen
Contributor Author

kleunen commented Feb 22, 2021

The CI now provides the compiled executable as an artifact to download. You could also compile Windows and macOS executables using Actions and offer them as downloads, completely statically linked. That could be convenient for users.

@kleunen
Contributor Author

kleunen commented Feb 22, 2021

I think it is also possible to make the reading of the .pbf somewhat parallel; the decompression could be done in parallel.

Hmm, on second thought: I measured it, and the decompression time is actually quite insignificant.

@kleunen
Contributor Author

kleunen commented Feb 22, 2021

Stored 112737562 nodes, 14982848 ways, 75012 relations
Shape points: 0, lines: 0, polygons: 24
Generated points: 9718722, lines: 4523991, polygons: 12568893
Zoom level 14, writing tile 38000 of 51383
Filled the tileset with good things at netherlands.mbtiles

real 123m31.701s
user 56m30.177s
sys 21m11.627s

For me the real time is twice the user (CPU) time, so the process is frequently waiting on the kernel, maybe for reading the file input, or more likely for the reading/writing of the mmap file.

tileData.SetZoom(zoom);
sharedData.zoom = zoom;
// Loop through tiles
uint tc = 0;
Owner

We should move this outside the loop to be consistent with total_tiles, I think?

Contributor Author

Ah yes, I noticed the counter was wrong; that's why.

Contributor Author

Zoom level 14, writing tile 38000 of 51383

Contributor Author

And I added a print when the final tile is generated.

cout << "Reading tile " << srcZ << ": " << srcX << "," << srcY << " (" << (run+1) << "/" << runs << ")" << endl;
vector<char> pbf = mapsplitFile.readTile(srcZ,srcX,tmsY);
// Write to a temp file and read back in - can we do this better via a stringstream or similar?
Contributor Author

I also removed this temporary file, but I have not tested it. Can you test it? It should be OK like this.

@systemed
Owner

Thought I'd stress-test it by running it on a France extract (lots of buildings)!

real	131m26.129s
user	842m48.936s
sys	28m25.527s

14GB memory used, 2.8GB output file, and the map looks great.

@kleunen
Contributor Author

kleunen commented Feb 22, 2021

But what is this mapsplitting that the temporary file was used for?
I removed the temporary file, but I do not know how to test it.

@kleunen
Contributor Author

kleunen commented Feb 22, 2021

You could also try the complete Europe extract with zoom levels up to 12 or so?

@systemed
Owner

mapsplit is a Java tool to split a single .osm.pbf into a database of tiled .osm.pbfs: https://wiki.openstreetmap.org/wiki/Mapsplit , https://github.com/simonpoole/mapsplit

It's in tilemaker as an option to reduce memory usage - tilemaker reads in a square of source tiles from the .msf file at once, creates the MVT output from them, then loops. Here's a sample .msf file: https://www.systemed.net/temp/centre-latest.msf

It's less crucial than it once was now that we have the mmap store, but I think it's probably worth keeping as it offers a further reduction in memory usage. I've just run this branch over it and it worked fine.

@kleunen
Contributor Author

kleunen commented Feb 22, 2021

I am trying out one more optimization.

Nope... it seems to make no difference, so it's not worthwhile to commit.

@kleunen kleunen force-pushed the master branch 3 times, most recently from 2ae31e5 to d14f26c on February 22, 2021 at 16:34
@kleunen
Contributor Author

kleunen commented Feb 22, 2021

I tried compiling against LuaJIT, but that actually seems to be broken.

@systemed
Owner

Luajit is famously pernickety (e.g. Project-OSRM/osrm-backend#2955) - I'll give that some attention in due course.

@systemed
Owner

htop is showing RES 92.2G so I guess it's mapping a fair amount into memory. I wonder whether the WayStore has grown to such a size that it can't keep the full NodeStore in memory any more.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

htop is showing RES 92.2G so I guess it's mapping a fair amount into memory. I wonder whether the WayStore has grown to such a size that it can't keep the full NodeStore in memory any more.

Yes, the node store is too big. When you are processing a way, its nodes need to be looked up in the node map, and this causes pages to be swapped in and out.

Ideally you want related nodes to be stored in the same page on the drive. I think they are: the hash function actually keeps the IDs sequential and does not completely randomize them, so they should end up in buckets close to each other.

The generated ways get swapped out to disk quickly. Only the nodes are read very often while loading the .pbf, because of the ways; and if the node store does not fit in memory, it has to be loaded from disk every time, and the way building stalls.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

An HDD is simply not suited to this access pattern: the process performs random reads and the HDD needs to seek to each block. An SSD has much better random-read performance.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

Ah wait, the vector does not need to store the key while the hashmap does, so this will indeed save memory.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

Ah wait, the vector does not need to store the key while the hashmap does, so this will indeed save memory.

Does osmium renumber say how many nodes and ways there are in Europe?

@systemed
Owner

2710803032 nodes, 325100853 ways. (From osmium fileinfo. osmium is awesome.)

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

struct LatpLon {
    int32_t latp;
    int32_t lon;
};

So this is 8 bytes per node, and a NodeID is 4 bytes. But I think unordered_map aligns the entries to 8 bytes anyway, because unaligned reads are slow. So if you preallocate the hashmap for 3000M nodes, you would need 3000M * 16 bytes = 48 GB for the node store, so this should fit in memory. If you preallocate a vector of 3000M entries, that is 3000M * 8 bytes = 24 GB. It is worth a try; 24 GB should be easy to keep in memory.

But also consider that an HDD is simply not a good choice :) An Amazon EBS volume has 1000 MB/s, and an SSD can also achieve 600 MB/s, so this can help keep the CPU busy.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

I also thought about using ZSTD compression on the ways with a dictionary:
https://facebook.github.io/zstd/

So if you have a linestring or polygon with a sequence of node IDs, you first subtract every node ID from the previous node ID:
id1 id2 id3 id4
becomes:
id1 (id2 - id1) (id3 - id2) (id4 - id3)

This will usually result in a sequence like:
id1 1 1 1
And then you perform compression with a preset dictionary.
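A rough sketch of that subtraction step (the names and the use of plain std::vector are illustrative, not tilemaker's actual code):

#include <cstdint>
#include <vector>

// Delta-encode a way's node IDs: consecutive IDs in a way are usually close
// together, so the resulting deltas are small and compress well against a
// preset dictionary.
std::vector<int64_t> deltaEncode(const std::vector<uint64_t> &nodeIds) {
    std::vector<int64_t> deltas;
    deltas.reserve(nodeIds.size());
    uint64_t prev = 0;
    for (uint64_t id : nodeIds) {
        deltas.push_back(static_cast<int64_t>(id) - static_cast<int64_t>(prev));
        prev = id;
    }
    return deltas;
}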

But I think that with a fast enough SSD it will already be easy to keep one CPU busy, and you may actually be able to load in parallel, using multiple CPUs when loading the PBF.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

What I see in iotop now: my tilemaker is using 32 GB on the NL+BE+DE conversion. First it reads from disk at 120 MB/s, does some processing, and then you see a write burst at 80 MB/s.

@kleunen
Contributor Author

kleunen commented Feb 26, 2021

These algorithms actually cause the ways to be loaded into memory again:
https://github.com/systemed/tilemaker/blob/master/include/osm_store.h#L422-L526

So if memory is limited, the ways are loaded and the nodes get pushed out of memory; later the nodes are loaded into memory again and the ways get pushed out, etc. ... :)

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

I have now converted the Netherlands including the edges around the borders with Germany and Belgium. I first did a bounding-box export from Europe and then converted it using tilemaker:

Stored 176982865 nodes, 25353999 ways, 94603 relations
Shape points: 0, lines: 0, polygons: 24
Generated points: 14316834, lines: 8192207, polygons: 20968705
Zoom level 14, writing tile 66000 of 66000
Filled the tileset with good things at netherlands.mbtiles

real 193m54.834s
user 83m2.147s
sys 30m3.398s

I have 8 GB of main memory and 4 GB of swap. This works quite well; it is possible now. The operation is I/O-bound, but the conversion time is still quite acceptable. What I do notice is that the final file is not completely clipped to the bounding box: some parts extend beyond it, but that is no problem, since panning can be restricted in Mapbox GL JS.

https://mqtt.kleunen.nl/map/

I would be interested in a carto-like style :)

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

Here I seem to be missing a street:
Reigershof, Schoonhoven, Netherlands

[image]

vs OSM:
[image]

@systemed
Owner

systemed commented Feb 27, 2021

I've experimented with using a sequential node store.

In osm_store.h the quick-and-dirty changes are:

template <typename T> using scoped_alloc_t = boost::container::scoped_allocator_adaptor<T>;
using ll_allocator_t = boost::interprocess::node_allocator<LatpLon, mmap_file_t::segment_manager>;
using vector_t = boost::container::vector<LatpLon, scoped_alloc_t<ll_allocator_t>>;
vector_t *mLatpLons;

void reopen(mmap_file_t &mmap_file)
{
    mLatpLons = mmap_file.find_or_construct<vector_t>("node_store")(mmap_file.get_segment_manager());
}

void reserve(uint nodes) {
    mLatpLons->reserve(nodes);
    mLatpLons->resize(nodes);
}

and then in osm_store.cpp:

size_t NodeStore::count(NodeID i) const {
    return mLatpLons->size() > i ? 0 : 1;
}

void NodeStore::insert_back(NodeID i, LatpLon coord) {
    mLatpLons->at(i) = coord;
}

void NodeStore::clear() {
// commented out because otherwise it resizes back to 0
//	mLatpLons->clear();
}

And... it appears to work! Slowly. I set it running last night and it has literally just finished reading in the .pbf. There was a lot of HDD churn in the latter stages, but it got much further into the process before that started. It's now just starting to write tiles:

Zoom level 5, writing tile 100 of 22946608

Memo to self: I need to specify 4200 as the size for the nodestore, not 4300 as that overflows the uint. ;)

I'm not sure how best to integrate this as an option - I'm guessing some template magick would be the right approach, but that starts to stretch my C++ knowledge...


Netherlands map looks great! I think the missing road is a result of the highway=construction tag - we do handle this in the Lua profile but possibly not in the way that the OMT stylesheet expects.

ZSTD-like delta encoding for nodelists would be an interesting idea and could save a lot of memory. That's exactly how the .pbf format works (https://wiki.openstreetmap.org/wiki/PBF_Format#Ways_and_Relations). I briefly looked for an existing C++/C implementation we could drop in but haven't turned anything up yet.

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

Yes, well, actually, I was thinking about Fibonacci coding to store the linestrings and rings:
https://www.geeksforgeeks.org/fibonacci-coding/

I think we can win a lot there, and Fibonacci coding is easier than importing a complete compression library, although compression would also be possible. You are really generating a stream of linestrings/rings: you never remove a linestring from the mmap file, so basically you are just appending them to the back of the file. But still, coding only a vector might be easier.

Also, if you use COMPACT_NODES, you can use a vector indexed by node ID, like you did, and also compress the linestrings/polygons with Fibonacci coding. Because the node IDs are smaller, the difference between two node IDs will also be smaller, so they will be compressed efficiently with Fibonacci coding.
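A minimal sketch of such an encoder, just to illustrate (it handles positive integers only, so node-ID deltas would first need an offset or zigzag step; this is not tilemaker code):

#include <cstdint>
#include <string>
#include <vector>

// Fibonacci-code a positive integer n (n >= 1). Bit i of the codeword stands
// for the Fibonacci number F(i+2) = 1, 2, 3, 5, 8, ...; a terminating '1' is
// appended so every codeword ends in "11" and codewords can be concatenated
// without a length prefix. Small values (such as deltas of nearby node IDs)
// get very short codes.
std::string fibonacciEncode(uint64_t n) {
    std::vector<uint64_t> fibs = {1};
    for (uint64_t a = 1, b = 2; b <= n; ) { fibs.push_back(b); uint64_t c = a + b; a = b; b = c; }
    std::string code(fibs.size() + 1, '0');
    for (size_t i = fibs.size(); i-- > 0; ) {        // greedy Zeckendorf decomposition
        if (fibs[i] <= n) { code[i] = '1'; n -= fibs[i]; }
    }
    code.back() = '1';                               // terminating bit -> "...11"
    return code;
}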

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

You won't need any template magic. You can just implement two node stores: one using the unordered_map and the other using the vector, and switch between the implementations with the COMPACT_NODES macro.

But you could also switch from the command line. OSMStore actually only uses reopen/reserve/at and insert_back, so you can make a base class with those as virtual methods and provide two implementations. Using a command-line option you can then switch between normal and compact mode: create the normal or compact node store in main.cpp and hand a pointer to OSMStore. That way you can actually configure it with a command-line option; a rough sketch of the idea follows.
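A sketch of that interface, using plain standard containers rather than tilemaker's mmap-backed types (the exact signatures and flag name are illustrative):

#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct LatpLon { int32_t latp; int32_t lon; };
using NodeID = uint64_t;

// Abstract interface covering the operations OSMStore needs.
class NodeStoreBase {
public:
    virtual ~NodeStoreBase() = default;
    virtual void reserve(size_t nodes) = 0;
    virtual void insert_back(NodeID id, LatpLon coord) = 0;
    virtual LatpLon at(NodeID id) const = 0;
    virtual size_t count(NodeID id) const = 0;
};

// "Normal" mode: hash map keyed by the full (sparse) node ID.
class NormalNodeStore : public NodeStoreBase {
    std::unordered_map<NodeID, LatpLon> nodes;
public:
    void reserve(size_t n) override { nodes.reserve(n); }
    void insert_back(NodeID id, LatpLon coord) override { nodes.emplace(id, coord); }
    LatpLon at(NodeID id) const override { return nodes.at(id); }
    size_t count(NodeID id) const override { return nodes.count(id); }
};

// "Compact" mode: flat vector indexed by a dense (renumbered) node ID.
// main.cpp would construct one of the two, based on a hypothetical
// command-line option, and pass a NodeStoreBase pointer to OSMStore.
class CompactNodeStore : public NodeStoreBase {
    std::vector<LatpLon> nodes;
public:
    void reserve(size_t n) override { nodes.resize(n); }
    void insert_back(NodeID id, LatpLon coord) override { nodes.at(id) = coord; }
    LatpLon at(NodeID id) const override { return nodes.at(id); }
    size_t count(NodeID id) const override { return id < nodes.size() ? 1 : 0; }
};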

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

How big is your swapfile on the system btw?

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

ZSTD-like delta encoding for nodelists would be an interesting idea and could save a lot of memory. That's exactly how the .pbf format works

Yes. If you keep it as it was in the original file, you don't even need to recode the node list; you would just need to unpack it with an iterator when you use it.
https://www.boost.org/doc/libs/master/boost/beast/core/detail/varint.hpp
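As a rough illustration of that on-the-fly unpacking (a .pbf-style node list stores zigzag-encoded varint deltas; this sketch is not based on the Boost header above):

#include <cstdint>
#include <vector>

// Read one protobuf-style varint: 7 payload bits per byte, high bit set on
// every byte except the last.
inline uint64_t readVarint(const uint8_t *&p) {
    uint64_t value = 0;
    for (int shift = 0; ; shift += 7) {
        uint8_t byte = *p++;
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if (!(byte & 0x80)) return value;
    }
}

// Zigzag maps small signed deltas to small unsigned values and back.
inline int64_t zigzagDecode(uint64_t v) {
    return static_cast<int64_t>(v >> 1) ^ -static_cast<int64_t>(v & 1);
}

// Unpack a delta-coded node list by keeping a running sum while iterating.
std::vector<uint64_t> decodeNodeRefs(const uint8_t *p, const uint8_t *end) {
    std::vector<uint64_t> ids;
    int64_t current = 0;
    while (p < end) {
        current += zigzagDecode(readVarint(p));
        ids.push_back(static_cast<uint64_t>(current));
    }
    return ids;
}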

@systemed
Owner

osm_store.dat is currently 256GB. It's on z9 now and the HDD is spinning so much it's about to go into orbit. I've just ordered an SSD.

Do you think this PR is ready to merge as is or would you like to do some more to it?

@kleunen
Contributor Author

kleunen commented Feb 27, 2021

No, you can merge it like this. :)

Is it using 24 threads now, or not? Was the SSD expensive? Because you need a SAS SSD, right? :)

@systemed
Owner

SAS would have been nice but way too expensive! I've gone for SATA for now - I figure it'll still be a big improvement over what I have currently. I'll rerun Europe when it arrives; for now I've killed it because most of the threads were pretty much just waiting for IO.

@systemed systemed merged commit b8c7a78 into systemed:master Feb 27, 2021
@kleunen
Contributor Author

kleunen commented Feb 27, 2021

Yes, SATA is a much better deal. I thought the HP only took SAS. It will make a difference, yes.

@systemed
Owner

It'll do SATA but at 3Gbps rather than 6Gbps. Still much better than what I'm getting currently!

Now merged - thanks for another awesome PR.

@leonardehrenfried
Contributor

I tried this on a box with 64GB of memory and an SSD. The whole processing of Germany was actually a little slower than before, presumably because it is not keeping everything in memory anymore.

Is it possible to tell tilemaker to keep everything in memory, as it did before? Some comments mention a RAM disk or a tmpfs but I cannot find any information about specifying where the mmap file should be stored.

@kleunen
Contributor Author

kleunen commented Feb 28, 2021

  --store arg (=osm_store.dat) temporary storage for node/ways/relations data
  --init-store arg (=20:5)     initial number of millions of entries for the
                               nodes (20M) and ways (5M)
  --verbose                    verbose error output

You can also initialize the store with the number of nodes/ways that you expect. You can get this number from:
osmium fileinfo -e
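For example, using the Europe figures quoted earlier (the values are in millions, rounded up; the file name is illustrative):

osmium fileinfo -e europe-latest.osm.pbf    # reports ~2711M nodes, ~325M ways
tilemaker --init-store 2711:326 ...         # preallocate the store accordingly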

Did you actually time the conversion? Because we have not seen a slowdown.

Maybe the default for osm_store.dat should be /tmp/osm_store.dat?

@leonardehrenfried
Contributor

I'm comparing against the code before you introduced the mmap store (not against the pre-asio code), and I noticed perhaps a 5% slowdown.

Thanks for the command line params. I will try them and report back.

@leonardehrenfried
Contributor

So I tried putting the mmap file into /dev/shm, which on Debian systems is a tmpfs. The full command is this:

tilemaker \
  --output=/var/tilemaker/output.mbtiles \
  --config=/var/tilemaker/config-openmaptiles.json \
  --process=/var/tilemaker/process-openmaptiles.lua \
  --store=/dev/shm/tilemaker-osm.dat \
  --verbose

However, the process exits with code 135 after opening the pbf file without having processed anything.

The log is the following:

Layer place (z4-14)
Layer poi (z12-14)
Layer poi_detail (z14-14) -> poi
Layer housenumber (z14-14)
Layer waterway (z6-14)
Layer waterway_detail (z12-14) -> waterway
Layer transportation (z8-14)
Layer transportation_main (z9-14) -> transportation
Layer transportation_mid (z11-14) -> transportation
Layer transportation_detail (z13-14) -> transportation
Layer transportation_name (z8-14)
Layer transportation_name_mid (z12-14) -> transportation_name
Layer transportation_name_detail (z14-14) -> transportation_name
Layer building (z13-14)
Layer water (z6-14)
Layer ocean (z6-14) -> water
Layer water_name (z13-14)
Layer water_name_detail (z13-14) -> water_name
Layer aeroway (z11-14)
Layer aerodrome_label (z10-14)
Layer park (z11-14)
Layer landuse (z11-14)
Layer landcover (z6-14)
Layer mountain_peak (z11-14)
Initializing storage to 20M nodes and 5M ways
Reading .shp ocean
Reading .pbf /srv/tilemaker-input.osm.pbf
Unable to open coastline/water_polygons.shp or coastline/water_polygons.SHP.

Do you know why this happens?

@leonardehrenfried
Contributor

Hmm, if I configure the mmap file to be placed in /tmp, which on Debian is not a tmpfs, it seems to work. Do you have an idea why it could be problematic with tmpfs?

@leonardehrenfried
Contributor

Ah, hang on. I'm running this in a Docker container, so perhaps /dev/shm doesn't work in there...

@kleunen
Contributor Author

kleunen commented Apr 10, 2021

I have a new laptop with an Intel Core i7-10850 and 32 GB of RAM. I am now converting a quarter of the planet (South America). What I see is that when I get to the higher zoom levels, CPU usage drops to 33% while tilemaker is generating the tiles very fast. I think we have to start a work item on the thread pool for every 100 tiles instead of for every single tile: the conversion is simply going too fast to keep all 12 of my CPUs busy, and lock contention occurs when starting the work items on the thread pool.
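A sketch of what that batching could look like with the Boost.Asio pool (TileCoord and generateTile are placeholder names, not tilemaker's actual types):

#include <algorithm>
#include <cstddef>
#include <vector>
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>

struct TileCoord { int zoom, x, y; };

void generateTile(const TileCoord &tile) { /* placeholder for the real tile generation */ }

// Post one work item per batch of ~100 tiles rather than per tile, so the
// pool's internal queue (and its lock) is touched far less often at the
// high zoom levels where individual tiles are very cheap to generate.
void runBatched(const std::vector<TileCoord> &tiles, unsigned threadCount) {
    boost::asio::thread_pool pool(threadCount);
    const size_t batchSize = 100;
    for (size_t start = 0; start < tiles.size(); start += batchSize) {
        size_t end = std::min(start + batchSize, tiles.size());
        boost::asio::post(pool, [&tiles, start, end]() {
            for (size_t i = start; i < end; ++i) generateTile(tiles[i]);
        });
    }
    pool.join();
}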
