C++ | next #22
base: cpp
Conversation
Inline review comment excerpts:
- "…having to re-read the whole file"
- "Put this in a const"
- "Just get rid of this limit, we can handle it without"
Adds miscellaneous optimizations to the Rust version.
MIT license in mapped_file.hpp and mapped_file.cpp.
- Supports 64-bit file seeking (+4 GiB files).
- Can memory map portions of the opened file or the entire file.
- Can flush modified read-write mappings back to disk.
- Read-write regions grow the backing file in multiples of 4096-byte blocks.
- mapped::file class for accessing a file on disk.
- mapped::region class for memory mapping a raw area of a file.
- mapped::struct_region<T> template for accessing an on-disk structure.
- mapped::array_region<T> template for accessing an on-disk array of T.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
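A hypothetical usage sketch of the classes listed above. The class and template names come from the commit message; the constructor and method signatures are assumptions for illustration, not the actual mapped_file interface.

```cpp
// Hypothetical usage of the mapped_file classes named in the commit above.
// Signatures are assumed; consult mapped_file.hpp for the real interface.
#include "mapped_file.hpp"
#include <cstdint>

struct Header {
    uint32_t magic;
    uint64_t count;
};

int main() {
    mapped::file f;
    // 64-bit seeking means offsets beyond 4 GiB are fine.
    if (f.open("cubes.bin") < 0)
        return 1;

    // Access an on-disk structure at offset 0. Read-write mappings grow
    // the backing file in multiples of 4096-byte blocks as needed.
    mapped::struct_region<Header> hdr(f, 0);
    hdr->count = 42;

    // Access an on-disk array of T placed after the header.
    mapped::array_region<uint64_t> items(f, sizeof(Header), hdr->count);
    for (uint64_t i = 0; i < hdr->count; ++i)
        items[i] = i;

    // Flush modified read-write mappings back to disk.
    hdr.flush();
    items.flush();
}
```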
- Silence a few std::printf calls, since opening a non-existent file is handled by returning -1.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com> Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The memory map now supports mapping an oversized "window" into the file:
- flush() and sync() only flush the user area.
- jump() and flushJump() have a fast path when the new user area fits into the oversized window.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Provide region::writeAt() and region::readAt(), which enable copying data into/from the backing file even if the target area of the backing file is not memory-mapped.
- Fix up the flushed length in flush() and sync().
- Run clang-format.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Provide an FSTUNE flag that attempts to speed up file access when a new file is created with CREATE|RESIZE. It effectively sets the chattr +X and +A flags on the file.
- Make readAt() const qualified.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
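For reference, `chattr +A` (no access-time updates) and `chattr +X` (DAX/direct access) correspond to inode flags that can be set from code with the Linux FS_IOC_SETFLAGS ioctl. A minimal sketch of that mechanism follows; whether FSTUNE sets exactly these flags this way is an assumption.

```cpp
// Linux sketch: programmatically set the chattr +A (FS_NOATIME_FL) and
// +X (FS_DAX_FL) inode flags the commit mentions. This shows the generic
// ioctl mechanism; the commit's actual FSTUNE code may differ.
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

static bool tune_file(int fd) {
    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0)
        return false;           // filesystem may not support inode flags
    attr |= FS_NOATIME_FL;      // chattr +A: no access-time updates
#ifdef FS_DAX_FL
    attr |= FS_DAX_FL;          // chattr +X: direct access (kernel >= 5.8)
#endif
    // Best effort: the filesystem is free to reject either flag.
    return ioctl(fd, FS_IOC_SETFLAGS, &attr) == 0;
}

int main() {
    int fd = open("cache.bin", O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return 1;
    if (!tune_file(fd))
        std::perror("FS_IOC_SETFLAGS");
    return close(fd);
}
```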
- Provide a proper move-aware object. region objects are now safe to use in STL containers like vector/deque.
- Implement region::resident() (not tested).
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- region::window() allows over-extending the memory mapping. The "user mapped" portion stays the same but regionSize() changes.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- For resident() it is better to mark the entire mapped region rather than just the user area.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com> Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Implement more fine-grained locking for region.
- Implement region::discard(). This effectively zero-fills the memory area within the mapping and punches a hole into the backing file.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
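On Linux, "zero-fill and punch a hole" maps onto fallocate(2) with FALLOC_FL_PUNCH_HOLE. A sketch of that primitive, which region::discard() could plausibly sit on top of (an assumption, not confirmed by the commit):

```cpp
// Sketch of hole punching with fallocate(2): the file size is preserved,
// the range reads back as zeros, and the filesystem frees the underlying
// blocks. Whether region::discard() uses exactly this call is an assumption.
#include <fcntl.h>
#include <cerrno>

static int punch_hole(int fd, off_t offset, off_t length) {
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, length) == 0)
        return 0;
    return -errno;  // e.g. EOPNOTSUPP if the filesystem cannot punch holes
}
```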
This branch is intended to be an integration branch where I record the combined work of any contributors and my own for the C++ code base. The branch is a fork of mikepound/main. All contributors are welcome. :-)

My intended work process for all code in this branch:
- Pull requests are accepted: I will check them out and merge them locally, then publish the updated branch.
- Merges will not be squashed: code is merged with `--no-ff -S --signoff` options to record the entire commit history of the merged branch as verified. If the source branch is deleted, the merged commits will not vanish and the branch can be restored.
- Only signed-off commits are accepted from contributors. Contributors must at least make their commits with --signoff to distinguish them from others.
- All branches should be rebased onto opencubes/next to keep history linear.

As a starting point: Merge branch 'feature/libmappedfile' into next
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- The filePointer points into read-only memory from mmap(), so apply const in a few places to ensure nothing writes into it.
- getCubesByShape() may return pointers past the end of the mmap() area if the shape table entry size is zero. ShapeEntry::offset can be wrong if the size is also zero.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- I can actually read how the progress is calculated.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com> Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com> Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
DEBUG_LEVEL selects the level of debug prints that are compiled in:
- 0 => same as not compiling with DEBUG at all
- 1 => only DEBUG_PRINT()
- 2 => DEBUG1_PRINT() and lower levels are enabled
- 3 => DEBUG2_PRINT() and lower levels are enabled
Change a few of the noisiest prints to be silent with DEBUG_LEVEL == 1.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
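A sketch of how such a leveled scheme is typically wired up with the preprocessor; the commit's actual macro bodies may differ.

```cpp
// Sketch of a leveled debug-print scheme like the one described above.
#include <cstdio>

#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 0   // 0: all debug prints compiled out
#endif

#if DEBUG_LEVEL >= 1
#define DEBUG_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG_PRINT(...) ((void)0)
#endif

#if DEBUG_LEVEL >= 2
#define DEBUG1_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG1_PRINT(...) ((void)0)
#endif

#if DEBUG_LEVEL >= 3
#define DEBUG2_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG2_PRINT(...) ((void)0)
#endif

int main() {
    DEBUG_PRINT("enabled when DEBUG_LEVEL >= 1\n");
    DEBUG1_PRINT("noisier: only with DEBUG_LEVEL >= 2\n");
    DEBUG2_PRINT("noisiest: only with DEBUG_LEVEL >= 3\n");
}
```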
This is the v3 revision of this hack: previously the uint8_t bit-field actually caused Cube to be 16 bytes due to padding. Bit-pack the size, the is_shared flag, and the memory address into the private struct bits_t. This halves the Cube struct size. Note: if we get any segfaults from dereferencing the pointer returned by the get() helper, this hack must be reverted.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
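A sketch of this kind of packing: on common 64-bit targets, user-space addresses fit in the low 48 bits, leaving room for a size and a flag in a single 8-byte word. The field widths below are assumptions, not the commit's exact layout, and the comment in get() marks exactly the hazard the note above warns about.

```cpp
// Sketch of packing a pointer, a size, and an is_shared flag into 8 bytes.
// Field widths are illustrative assumptions, not the commit's exact layout.
#include <cassert>
#include <cstdint>

struct bits_t {
    uint64_t addr : 48;      // XYZ data pointer (canonical low 48 bits)
    uint64_t size : 15;      // number of XYZ triplets
    uint64_t is_shared : 1;  // shared-ownership flag
};
static_assert(sizeof(bits_t) == 8, "packing must stay 8 bytes");

inline uint8_t* get(const bits_t& b) {
    // Re-materialize the pointer. If the platform uses more than 48
    // address bits, this is the segfault hazard the commit mentions.
    return reinterpret_cast<uint8_t*>(static_cast<uintptr_t>(b.addr));
}

int main() {
    static uint8_t data[12] = {};
    bits_t b{};
    b.addr = reinterpret_cast<uintptr_t>(&data[0]);
    b.size = 4;
    b.is_shared = 0;
    assert(get(b) == &data[0]);
}
```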
- Small changes diffed.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Launching new threads is expensive. Refactor the cubes.cpp threading code so that the started threads are kept running until the main process is complete.
- Allow the main thread to do its preparation work in parallel with the running Workset. (The next cache file can be loaded while the old one is being processed.)
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
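A minimal sketch of the "start the threads once, keep them running" pattern (not the actual cubes.cpp code): jobs go into a queue, workers block on a condition variable, and the main thread stays free to prepare the next cache file while a Workset runs.

```cpp
// Minimal persistent worker pool sketch; illustrative, not the PR's code.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lk(m_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lk(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // meanwhile the main thread can load the next cache file
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> threads_;
    bool stop_ = false;
};
```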
Implement a replacement for Cache::save(). CacheWriter should produce files identical to the old code, but is slightly faster as it doesn't wait for the file finalization. The old code still exists as a reference, but nothing uses it except tests.
- libmappedfile would allow the serialization process to be parallelized. (WIP, not implemented yet.)
- Move Header and ShapeEntry into the cacheformat namespace.
- Implement CacheWriter.
- Update cubes.cpp to use the new CacheWriter.
- Add a Cube::copyout() helper. The idea behind this helper is to keep working even if the cube representation is something other than a plain XYZ array.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- CacheWriter now uses a thread pool and copies the Hashy using worker threads. This would not be possible without libmapped_file. (N=13 now completes in less than 310 seconds, depending on disk.)
- Add a nice progress bar.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The old cache code has been deprecated since CacheWriter arrived. The only user was in tests/src/test_cache.cpp, so drop the test case, because it no longer has any impact on the main cubes program.
- Delete the include/cache.hpp and src/cache.cpp source files. Hopefully they will not be missed. :-)
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
CacheWriter didn't properly wait for queued job(s) to complete. Fix with a counter that is incremented on enqueue and decremented *after* the task has run.
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
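A sketch of that fix: the counter treats a job as pending from enqueue until after its body has returned, so a waiter can never observe "queue empty" while a task is still executing.

```cpp
// Sketch of a pending-job counter; illustrative, not the PR's exact code.
#include <condition_variable>
#include <mutex>

class PendingCounter {
public:
    void onEnqueue() {
        std::lock_guard<std::mutex> lk(m_);
        ++pending_;
    }
    void onDone() {
        {
            std::lock_guard<std::mutex> lk(m_);
            --pending_;  // decremented only after the task body returned
        }
        cv_.notify_all();
    }
    void waitIdle() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return pending_ == 0; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int pending_ = 0;
};
// A worker calls onDone() strictly after job() returns, so waitIdle()
// cannot wake up while any queued work is still in flight.
```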
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The C++ implementation has gained split cache files and their associated command-line parameters since Readme.md was last updated. Document the `./cubes` program usage and how to use the split cache files.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Imported commit v2 for the next branch.
- The current `git rev-list -n1 HEAD`, the compiler used, the build type, and critical settings are embedded into the cubes binary.
- `cubes -v` now prints how it was built.
- The CUBES_MAX_N constant is now available from "config.hpp".
- CONFIG_PACK_CUBE_ADDR is now available from "config.hpp".
- New options can be added into "config.hpp.in".
- Add an anti-goof measure for the read-only config.hpp.
The config defines can be changed at cmake configure time.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Add a CUBES_PACK_CUBE_XYZ_ADDR CMake option. By default, still enable compaction of the Cube struct into 8 bytes. If the hack does not work on some system, this can be set to OFF to revert the hack at configure time.
- Add an assert into Cube::copyout().
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The Hashy code is somewhat tangled and there is a known possible data race in `Hashy::insert()`. This issue cannot be permanently fixed without hiding `Hashy::byshape` under protected/private and preventing direct access to the member. Replacements for the direct member access will come in later changes.
- Move Subhashy and Subsubhashy out from the Hashy class.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Make Subsubhashy a class to signal that its members aren't directly accessible.
- Hide members under protected.
- Discover class users and fix them, mainly code iterating the SubsubHashy.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Make Subhashy a class to signal that its members aren't directly accessible.
- Hide members under protected.
- Discover class users and fix them.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Finally fix the potential data race in Hashy::insert(): insert() now uses at() to look up or create the shape, and at() is thread-safe.
- Make Hashy a class to signal that its members aren't directly accessible.
- Hide members under protected.
- Discover class users and fix them.
- Add begin(), end(), numShapes() and at(), replacing direct member access.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
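A sketch of the thread-safe lookup-or-create pattern described here: a shared lock on the hot "already exists" path, an exclusive lock only to create. The byshape name follows the commits above; everything else is illustrative.

```cpp
// Sketch of a thread-safe lookup-or-create at(); not the project's exact code.
#include <map>
#include <mutex>
#include <shared_mutex>

template <typename Key, typename Value>
class ShapeTable {
public:
    Value& at(const Key& k) {
        {
            // Fast path: shared lock for the common "already exists" case.
            std::shared_lock<std::shared_mutex> rlk(m_);
            auto it = byshape_.find(k);
            if (it != byshape_.end()) return it->second;
        }
        // Slow path: exclusive lock. try_emplace is a no-op if another
        // thread created the entry between the two locks, so concurrent
        // insert() calls can no longer race on creation.
        std::unique_lock<std::shared_mutex> wlk(m_);
        return byshape_.try_emplace(k).first->second;
    }

private:
    std::shared_mutex m_;
    std::map<Key, Value> byshape_;  // node-based: references stay stable
};
```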
Implement a few basic operations in mapped::file so that mapped::region is not needed for them:
- readAt() and writeAt().
- copyAt() is the most interesting, because the data copy is done by the operating system.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
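On Linux, the natural way to have the copy "done by the operating system" is copy_file_range(2), which moves data between two file descriptors without it passing through user-space buffers. A sketch of a copyAt()-style helper built on it; whether the PR's copyAt() uses exactly this call is an assumption.

```cpp
// Sketch: kernel-side file-to-file copy via copy_file_range(2) (glibc >= 2.27).
#include <unistd.h>
#include <cerrno>
#include <cstdint>

// Copy `len` bytes from src_fd@src_off to dst_fd@dst_off.
// Returns bytes copied, or -errno on failure.
static int64_t copy_at(int src_fd, int64_t src_off,
                       int dst_fd, int64_t dst_off, uint64_t len) {
    uint64_t copied = 0;
    while (copied < len) {
        off64_t in  = static_cast<off64_t>(src_off) + static_cast<off64_t>(copied);
        off64_t out = static_cast<off64_t>(dst_off) + static_cast<off64_t>(copied);
        ssize_t n = copy_file_range(src_fd, &in, dst_fd, &out, len - copied, 0);
        if (n < 0)
            return -errno;  // e.g. EXDEV across filesystems on older kernels
        if (n == 0)
            break;          // hit source EOF
        copied += static_cast<uint64_t>(n);
    }
    return static_cast<int64_t>(copied);
}
```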
Implement a way to temporarily dump the cube data into disk storage in order to save system memory. For a `./cubes -n 13 -w -s -u` run, the heaptrack tool reports:
- total runtime: 26 min 18 s
- peak RSS: 2.4 GB
- peak heap memory: 978 MB

This confirms that only the std::unordered_set<> internal nodes (and the lookup array) are kept in memory. A slowdown is expected, as accessing an element reads it from disk. The swap files are named `storage_<number>.bin` in the cache folder. These files are normally deleted as soon as they are no longer needed.

Important!! The process can open so many files simultaneously that the system NOFILE limit is reached. This limit should be raised with `ulimit -n 128000` to avoid terminating the program. The minimum number of open file handles is at least: <maximum number of shapes for N> * 32

- CubeSwapSet is a specialized std::unordered_set<> that stores the cube data in a file.
- CubeStorage acts as a pseudo-allocator for the cube data.
- CubePtr is the key type inserted into CubeSwapSet. It is only a 64-bit offset into the backing file, and each CubePtr is owned by the CubeStorage that created it.
- CubePtr::get(const CubeStorage&) reads out the Cube from the storage. Hashy users are adapted to use it where needed.
- Clearing Hashy is now quite fast, because there is no memory to be freed for CubePtrs. SubsubHashy::clear() simply deletes the data and the backing file.
- Compiling in C++20 mode enables a speedup by allowing SubsubHashy::contains() to work with both Cube and CubePtr types.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
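A compact sketch of the CubeSwapSet idea using C++20 heterogeneous lookup, which is what lets contains() probe with a full Cube before anything is written to storage. Names follow the commit; the in-memory std::string standing in for the storage file, and the size field carried in CubePtr (the commit stores only the 64-bit offset), are simplifications.

```cpp
// Sketch of a swap-to-disk set keyed by file offsets; illustrative only.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

class CubeStorage {  // in-memory stand-in for a storage_<n>.bin file
public:
    uint64_t alloc(const std::string& xyz) {
        uint64_t off = bytes_.size();
        bytes_ += xyz;  // cube data is written contiguously
        return off;
    }
    std::string read(uint64_t off, std::size_t n) const { return bytes_.substr(off, n); }

private:
    std::string bytes_;
};

struct Cube { std::string xyz; };                       // packed XYZ triplets
struct CubePtr { uint64_t offset; std::size_t size; };  // commit: offset only

struct Hash {
    using is_transparent = void;  // C++20: enables heterogeneous lookup
    const CubeStorage* s;
    std::size_t operator()(const Cube& c) const { return std::hash<std::string>{}(c.xyz); }
    std::size_t operator()(const CubePtr& p) const {
        return std::hash<std::string>{}(s->read(p.offset, p.size));
    }
};
struct Eq {
    using is_transparent = void;
    const CubeStorage* s;
    bool operator()(const CubePtr& a, const CubePtr& b) const {
        return s->read(a.offset, a.size) == s->read(b.offset, b.size);
    }
    bool operator()(const CubePtr& a, const Cube& b) const {
        return s->read(a.offset, a.size) == b.xyz;
    }
    bool operator()(const Cube& a, const CubePtr& b) const { return (*this)(b, a); }
};

using CubeSwapSet = std::unordered_set<CubePtr, Hash, Eq>;

int main() {
    CubeStorage storage;
    CubeSwapSet set(16, Hash{&storage}, Eq{&storage});
    Cube c{"\x01\x02\x03"};
    if (!set.contains(c))  // probe with a Cube: no storage write needed
        set.insert(CubePtr{storage.alloc(c.xyz), c.xyz.size()});
    return set.contains(c) ? 0 : 1;
}
```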
- Add a thread-local read-cache for CubeStorage. The read-cache is private to each thread that calls CubeStorage::read(), and is shared by all CubeStorage instances per thread. Entries are evicted from the cache with an LRU (least-recently-used) policy.
- Massive CacheWriter optimizations: the written CubeStorage file is extremely useful for CacheWriter. CacheWriter now uses mapped::file::copyAt() to merge the CubeStorage file into the saved cache file as-is. This completely bypasses iterating the CubeSwapSet Cube-by-Cube and makes CacheWriter::save() return without waiting for the data copy to actually complete. Once the copy job is complete, the source CubeStorage file is deleted. CubeStorage::discard() now simply drops the reference to the old file.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
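A sketch of a per-thread LRU read-cache of the kind described: a list ordered by recency plus an index from file offset to list node, with thread_local giving each thread its own private instance shared across CubeStorage objects on that thread. Illustrative only; the capacity matches the 1024-entry figure quoted in the summary below.

```cpp
// Per-thread LRU read-cache sketch; not the PR's exact code.
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

class LruCache {
public:
    explicit LruCache(std::size_t cap) : cap_(cap) {}

    // Returns the cached bytes or nullptr; a hit becomes most-recently-used.
    const std::string* find(uint64_t key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Caller is expected to put() only after a find() miss.
    void put(uint64_t key, std::string value) {
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
        if (index_.size() > cap_) {  // evict the least-recently-used entry
            index_.erase(order_.back().first);
            order_.pop_back();
        }
    }

private:
    using Node = std::pair<uint64_t, std::string>;
    std::size_t cap_;
    std::list<Node> order_;                                    // MRU at front
    std::unordered_map<uint64_t, std::list<Node>::iterator> index_;
};

// One cache per thread, shared by every CubeStorage instance on that thread.
thread_local LruCache g_readCache(1024);
```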
- Memory map a 2 MiB region at the end of the backing file. This consumes an additional 2 MiB of RAM per CubeStorage instance, but reduces the number of file::truncate() calls and system calls issued by a large factor. The mapped region also speeds up CubeStorage::read() if the CubePtr falls into the mapped area, as mapped::region::readAt() can simply memcpy the data.
- Reduce the Subsubhashy::insert() write-lock scope. If the entry is dropped (because another thread inserted it first), unlock immediately before CubeStorage::drop() is called.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Overall features:
- `CubeSwapSet` provides large memory savings for Hashy by writing the Cube data contiguously into temporary files. SubsubHashy inserts CubePtrs that refer to the storage file managed by `CubeStorage`.
- Any Cubes accessed from CubeSwapSet are cached in a per-thread read-cache. The cache has an LRU eviction policy with 1024 Cubes per thread. This nearly eliminates any slowdown caused by reading the storage file.
- CacheWriter takes advantage of CubeStorage's contiguous data layout, enabling a near-instantaneous save(). The data is merged into the cache file via mapped::file::copyAt(), followed by deletion of the temporary storage file. (At most -t N simultaneous copies can be issued before any waiting happens.)
Sorry for the silence, I still haven't found time to look at your work.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Surprisingly, N=14 is not possible with 16 GiB of memory, because at a certain point of progress the OS begins to swap *something* and the process grinds to a halt. This happens even if *there is free memory available*, so something is going haywire. I found out that the culprit may be that large (+3 GiB) CacheReader memory mappings are being swapped out of memory. The OS is trying to keep the previously accessed memory in system memory, to our detriment. For -t K threads we only need to have K Cubes from the cache file in memory at once. The only way out of this problem is to not memory map the entire cache file at once and instead read it Cube-by-Cube. I think @nsch0e would have wanted to implement reading this way from the beginning, but he was missing the `mapped::file::readAt()` that works with absolute file offsets and can read the file in parallel.

Currently FlatCache and CacheReader use the same CubeIterator and ShapeRange types. This is a problem for implementing a better CubeIterator that reads the Cubes one-by-one from a file, because any changes to these would break FlatCache, which doesn't use cache files. Start by adding abstract interfaces for CubeIterator and ShapeRange:
- ICubeIterator: base class interface for Cube iterators.
- CubeIterator: the current implementation of ICubeIterator.
- CacheIterator: type-erased proxy. This is needed to avoid disrupting the CubeIterator class users too much and to make the type-erased iterator work in practice.
- IShapeRange: base class interface.
- Make ICache::getCubesByShape() return a reference to the IShapeRange.
- Adapt CubeIterator users to use CacheIterator instead.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
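A sketch of what that interface split can look like: an abstract iterator, a type-erased proxy with value semantics so existing CubeIterator call sites barely change, and an abstract range. The exact member functions are assumptions based on the list above.

```cpp
// Sketch of the abstract iterator/range interfaces described above.
#include <cstdint>
#include <memory>

struct Cube;  // defined elsewhere in the project

class ICubeIterator {
public:
    virtual ~ICubeIterator() = default;
    virtual const Cube& operator*() const = 0;
    virtual ICubeIterator& operator++() = 0;
    virtual bool operator==(const ICubeIterator&) const = 0;
    virtual std::unique_ptr<ICubeIterator> clone() const = 0;
};

// Type-erased proxy: owns any ICubeIterator and forwards to it, so call
// sites keep value semantics instead of juggling pointers to the base.
class CacheIterator {
public:
    explicit CacheIterator(std::unique_ptr<ICubeIterator> it) : it_(std::move(it)) {}
    CacheIterator(const CacheIterator& o) : it_(o.it_->clone()) {}
    const Cube& operator*() const { return **it_; }
    CacheIterator& operator++() { ++*it_; return *this; }
    bool operator!=(const CacheIterator& o) const { return !(*it_ == *o.it_); }

private:
    std::unique_ptr<ICubeIterator> it_;
};

class IShapeRange {
public:
    virtual ~IShapeRange() = default;
    virtual CacheIterator begin() const = 0;
    virtual CacheIterator end() const = 0;
    virtual uint64_t size() const = 0;
};
```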
- Remove the CacheReader XYZ mapping.
- Add CubeReadIterator that reads Cubes one at a time.
- FileShapeRange takes the cache file and offsets into the file.
- Update CacheReader::loadFile() to initialize an array of FileShapeRange from the cache file.

The result is a celebratory hooray for computing N=14 for the first time with less than 9 GiB of RSS:

```
process output shape 99/101 [ 3 5 5]
shape 2 5 5
shape 3 4 5
num: 588828
saved ./cache/cubes_14_3-5-5.bin, took 0.01 s
process output shape 100/101 [ 4 4 4]
shape 3 4 4
shape 4 4 4
num: 3341560
saved ./cache/cubes_14_4-4-4.bin, took 0.11 s
process output shape 101/101 [ 4 4 5]
shape 3 4 5
shape 4 4 4
num: 752858
saved ./cache/cubes_14_4-4-5.bin, took 0.02 s
took 7231.83 s
num total cubes: 1039496297
```

My nvme disk was not particularly happy with `output shape 80/101 [ 2 3 4]`, which produced an +8 GiB file at the end. The disk throttled badly after reaching 60 °C... But it did complete eventually at a reasonable pace, and memory usage dropped below 7 GiB for the rest of the run. N=15 will require more tuning of the CubeStorage read-cache and a more parallel file system. btrfs looks to be not very good for this job, as writing the storage files in parallel reduces the program to near single-threaded speed.
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Solve the problem of the system thrashing the CacheReader memory map: CubeReadIterator now reads the Cubes one-by-one from the cache file. The Cube XYZ data is not memory mapped at all by CacheReader.
- N=14 is possible with 9 GiB of memory and a very fast disk. :-)
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
@nsch0e no need to hurry. 👍 I merged a "milestone" change: 37d51e5. This was the first version of C++ cubes that was able to compute N=14 with 16 GiB of system memory (approx. 9 GiB peak). N=15 will require some tinkering with the CubeSwapper to reduce the I/O load, and a dedicated file system to store the output...
How long did N=14 take to compute? What do you mean by a dedicated file system?
The computation took 7231.83 seconds; see the commit I linked in my previous message. I had to pause the process several times to let the nvme disk cool down a bit. What I meant is that when running N=14 and beyond, the cache folder should be put onto the fastest possible disk storage with a parallel/ext4 filesystem. I have two nvme disks, so I can do raid0 and put ext4 on that for the cache folder.
I finally managed to publish a workable version of the CubeSwapper. 😄
@nsch0e can you take a look?
The most interesting bits are introduced at commit: 64278c8
If you can review the changes so far I would be glad.
I mostly intend this PR as a discussion thread for this branch.