
C++ | next #22

Open
wants to merge 48 commits into base: cpp

Conversation

JATothrim

I finally managed to publish a workable version of the CubeSwapper. 😄

@nsch0e can you take a look?

The most interesting bits are introduced in commit 64278c8.
I would be glad if you could review the changes so far.

I mostly intend this PR as a discussion thread for this branch.

datdenkikniet and others added 30 commits August 5, 2023 19:05
having to re-read the whole file

Put this in a const

Just get rid of this limit, we can handle it without
Adds miscellaneous optimizations to the Rust version.
MIT license in mapped_file.hpp and mapped_file.cpp

- Supports 64-bit file seeking (+4 GiB files).
- Can memory map portions of the opened file or the entire file.
- Can flush modified read-write mappings back to disk.
- Read-write regions grow the backing file in multiples of 4096-byte blocks.
- mapped::file class for accessing a file on disk.
- mapped::region class for memory mapping a raw area of a file.
- mapped::struct_region<T> template for accessing an on-disk structure.
- mapped::array_region<T> template for accessing an on-disk array of T.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
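
For context, the commit above only describes the API surface; the sketch below shows the underlying POSIX mechanism (open/ftruncate/mmap/msync) that a wrapper like mapped::region builds on. It is illustrative only and is not the libmappedfile API itself.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

int main() {
    int fd = ::open("demo.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;

    // Grow the backing file in whole 4096-byte blocks before mapping it.
    const size_t len = 4096;
    if (ftruncate(fd, static_cast<off_t>(len)) != 0) return 1;

    // Map a read-write window of the file into memory.
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    std::memcpy(p, "hello", 5);   // modify the mapping in place
    msync(p, len, MS_SYNC);       // flush modified pages back to disk

    munmap(p, len);
    ::close(fd);
    return 0;
}
```
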
- Silence a few std::printf's, since opening a non-existent file
  is handled by returning -1.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The memory map now supports mapping an oversized "window" into the file:
- flush() and sync() only flush the user area.
- jump() and flushJump() have a fast path when the new user area
  fits into the oversized window.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Provide region::writeAt() and region::readAt(), which
  enable copying data into/from the backing file even if the
  target area of the backing file is not memory-mapped.
- Fix up the flushed length in flush()/sync().
- Run clang-format

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Provide an FSTUNE flag that attempts to speed up file access
  when a new file is created with CREATE|RESIZE.
  It effectively sets the chattr +X and +A flags on the file.
- Make readAt() const qualified.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Provide a proper move-aware object.
  region objects are now safe to use in STL containers like vector/deque.
- Implement region::resident() (not tested).

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- region::window() allows over-extending the memory mapping.
  The "user mapped" portion stays the same but regionSize() changes.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- For resident() it is better to mark the entire mapped region
  rather than just the user area.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Implement more fine-grained locking for region.
- Implement region::discard().
  This effectively zero-fills the memory area within
  the mapping and punches a hole into the backing file.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
This branch is intended to be an integration branch where I record
the combined work of other contributors and my own for the C++ code base.
The branch is a fork of mikepound/main.

All contributors are welcome. :-)

My intended work process for all code in this branch:

- Pull requests are accepted: I will check them out and merge them locally
  and then publish the updated branch.
- Merges will not be squashed: Code is merged with `--no-ff -S --signoff`
  options to record the entire commit history of the merged branch
  as verified. If the source branch is deleted the merged branch
  commits will not vanish and the branch can be restored.
- Only signed off commits are accepted from contributors.
  Contributors must at least make their commits with --signoff
  to distinguish them from others.
- All branches should be rebased onto opencubes/next to keep history linear.

As a starting point:
Merge branch 'feature/libmappedfile' into next

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- The filePointer points into read-only memory from mmap(),
  so apply const in a few places to ensure nothing writes into it.
- getCubesByShape() may return pointers past the end of the mmap() area
  if the shape table entry size is zero.
  ShapeEntry::offset can be wrong if the size is also zero.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- I can actually read how the progress is calculated.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
DEBUG_LEVEL selects the level of debug prints that are compiled in:
0 => same as not compiling with DEBUG at all
1 => only DEBUG_PRINT()
2 => DEBUG1_PRINT() and lower levels are enabled
3 => DEBUG2_PRINT() and lower levels are enabled

Change a few of the noisiest prints to be silent with DEBUG_LEVEL == 1.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
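
As a reference, this is one common way to wire up such leveled print macros; the actual definitions in the repository may differ.

```cpp
#include <cstdio>

#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 0
#endif

// Higher DEBUG_LEVEL values enable more (and noisier) prints.
#if DEBUG_LEVEL >= 1
#define DEBUG_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG_PRINT(...) ((void)0)
#endif

#if DEBUG_LEVEL >= 2
#define DEBUG1_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG1_PRINT(...) ((void)0)
#endif

#if DEBUG_LEVEL >= 3
#define DEBUG2_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG2_PRINT(...) ((void)0)
#endif
```
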
This is the v3 revision of this hack:
previously the uint8_t bit-field actually caused Cube to be 16 bytes
due to padding.
Bit-pack/hack the size, the is_shared flag and the memory address
into a private struct bits_t. This halves the Cube struct size.

Note: if we get any segfaults from dereferencing the pointer
returned by the get() helper, this hack must be reverted.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
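
The pointer-packing idea can be sketched roughly as below; the field widths and names are assumptions, not the actual bits_t layout in Cube.

```cpp
#include <cstdint>

// On x86-64, user-space addresses fit in 48 bits, so the remaining bits of a
// single 64-bit word can hold the size and the is_shared flag.
struct bits_t {
    uint64_t addr      : 48;  // heap address of the XYZ data
    uint64_t size      : 15;  // number of coordinates
    uint64_t is_shared : 1;   // ownership flag
};
static_assert(sizeof(bits_t) == 8, "Cube payload packed into one 64-bit word");

inline uint8_t* get(const bits_t& b) {
    // Reconstruct the pointer; if dereferencing this ever faults,
    // the hack must be reverted (see the commit note above).
    return reinterpret_cast<uint8_t*>(static_cast<uintptr_t>(b.addr));
}
```
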
- Small changes diffed.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Launching new threads is expensive.
  Refactor the cubes.cpp threading code so that
  the started threads are kept running until the main process is complete.
- Allow the main thread to do its preparation work
  in parallel with the running Workset.
  (The next cache file can be loaded while the old one is being processed.)

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
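
The "keep the threads running" refactor boils down to a pool where workers block on a job queue instead of being re-launched per Workset. A minimal sketch of that pattern (names are illustrative, not the actual cubes.cpp code):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
    std::vector<std::thread> threads_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;

public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            threads_.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock lk(m_);
                        cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                        if (stop_ && jobs_.empty()) return;
                        job = std::move(jobs_.front());
                        jobs_.pop();
                    }
                    job();  // run the work item outside the lock
                }
            });
    }
    void enqueue(std::function<void()> job) {
        { std::lock_guard lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
    ~WorkerPool() {
        { std::lock_guard lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
};
```
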
Implement a replacement for Cache::save().
CacheWriter should produce files identical to the old code,
but is slightly faster as it doesn't wait for file finalization.
The old code still exists as a reference, but nothing uses it except tests.

- libmappedfile would allow the serialization process to be parallelized.
  (WIP, not implemented yet.)
- Move Header and ShapeEntry into the cacheformat namespace.
- Implement CacheWriter.
- Update cubes.cpp to use the new CacheWriter.
- Add a Cube::copyout() helper. The idea is that it keeps working
  even if the cube representation is something other than a plain XYZ array.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- CacheWriter now uses a thread pool and copies the Hashy using
  worker threads. This would not be possible without libmapped_file.
  (N=13 now completes in less than 310 seconds, depending on the disk.)
- Add a nice progress bar.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The old cache code has been deprecated since CacheWriter arrived:
the only user was in tests/src/test_cache.cpp, so drop the test case
because it doesn't have any impact on the main cubes anymore.

- Delete the include/cache.hpp and src/cache.cpp source files.
  Hopefully they will not be missed. :-)

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
CacheWriter didn't properly wait for queued job(s) to complete.
Fix with a counter that is incremented on queue and
decremented *after* the task has run.

Signed-off-by: Jarmo Tiitto <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
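
The fix described above amounts to a pending-job counter that waiters can block on; a minimal sketch of that pattern (not the actual CacheWriter code):

```cpp
#include <condition_variable>
#include <mutex>

class PendingJobs {
    std::mutex m_;
    std::condition_variable cv_;
    unsigned pending_ = 0;

public:
    void onQueued() {              // increment when the job is queued
        std::lock_guard lk(m_);
        ++pending_;
    }
    void onFinished() {            // decrement *after* the task has run
        { std::lock_guard lk(m_); --pending_; }
        cv_.notify_all();
    }
    void waitAll() {               // block until every queued job has completed
        std::unique_lock lk(m_);
        cv_.wait(lk, [this] { return pending_ == 0; });
    }
};
```
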
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The C++ implementation has gained the split cache files and their
associated command line parameters since Readme.md was last updated.
Document the `./cubes` program usage and how to use the split cache files.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Imported commit v2 for the next branch.
- The current `git rev-list -n1 HEAD`, the compiler used, the build type
  and critical settings are embedded into the cubes binary.
- `cubes -v` now prints how it was built.
- The CUBES_MAX_N constant is now available from "config.hpp".
- CONFIG_PACK_CUBE_ADDR is now available from "config.hpp".
- New options can be added in "config.hpp.in".
- Add an anti-goof measure for the read-only config.hpp.
  The config defines can be changed at cmake configure time.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
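
For illustration, a hypothetical excerpt of what a `config.hpp.in` processed by CMake's configure_file() could look like; apart from CUBES_MAX_N and CONFIG_PACK_CUBE_ADDR, the variable names here are assumptions.

```cpp
// config.hpp.in -- values substituted at cmake configure time (hypothetical)
#pragma once

// Build provenance embedded into the cubes binary.
#define CUBES_GIT_REVISION "@CUBES_GIT_REVISION@"   // e.g. from `git rev-list -n1 HEAD`
#define CUBES_BUILD_TYPE   "@CMAKE_BUILD_TYPE@"

// Compile-time limits and feature toggles.
#define CUBES_MAX_N @CUBES_MAX_N@
#cmakedefine01 CONFIG_PACK_CUBE_ADDR
```
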
- Add a CUBES_PACK_CUBE_XYZ_ADDR CMake option.
  By default, compaction of the Cube struct into 8 bytes is still enabled.
  If the hack does not work on some system, this can be set to OFF
  to revert the hack at configure time.
- Add an assert to Cube::copyout().

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
The Hashy code is somewhat tangled and there is a known possible
data race in `Hashy::insert()`.
This issue cannot be permanently fixed without hiding `Hashy::byshape`
under protected/private and preventing direct access to the member.

Replacements for the direct member access will come in later changes.

- Move Subhashy and Subsubhashy out of the Hashy class.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Make Subsubhashy a class to note its members aren't directly accessible.
- Hide members under protected
- Discover class users and fix them.
  Mainly iterating the SubsubHashy.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Make Subhashy a class to note its members aren't directly accessible.
- Hide members under protected
- Discover class users and fix them.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Finally fix the potential data race in Hashy::insert():
  insert() uses at() to look up/create the shape, and at() is thread-safe.
- Make Hashy a class to note its members aren't directly accessible.
- Hide members under protected.
- Discover class users and fix them.
- Add begin(), end(), numShapes() and at(), replacing direct member access.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
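
The thread-safe lookup-or-create in at() is the key part of the fix; the pattern looks roughly like this (container and names are simplified stand-ins, not the actual Hashy internals):

```cpp
#include <map>
#include <shared_mutex>

template <typename Key, typename Sub>
class ShapeTable {
    std::map<Key, Sub> byshape_;
    mutable std::shared_mutex m_;

public:
    Sub& at(const Key& shape) {
        {
            std::shared_lock lk(m_);            // fast path: shape already exists
            auto it = byshape_.find(shape);
            if (it != byshape_.end()) return it->second;
        }
        std::unique_lock lk(m_);                // slow path: create under exclusive lock
        return byshape_[shape];                 // a no-op if another thread won the race
    }
};
```
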
Implement a few basic operations in mapped::file so that mapped::region
is not needed for them:
- readAt() and writeAt()
- copyAt() is the most interesting, because the data copy is
  done by the operating system.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
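
On Linux, a copy that is "done by the operating system" maps naturally onto copy_file_range(2); whether copyAt() actually uses this call is an assumption, but the mechanism looks like this:

```cpp
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

// Copy len bytes from srcFd@srcOff to dstFd@dstOff without bouncing the
// data through user-space buffers. Requires a recent glibc/kernel.
bool copyRange(int srcFd, loff_t srcOff, int dstFd, loff_t dstOff, size_t len) {
    while (len > 0) {
        ssize_t n = copy_file_range(srcFd, &srcOff, dstFd, &dstOff, len, 0);
        if (n <= 0) {
            std::perror("copy_file_range");
            return false;   // a caller could fall back to a read()/write() loop
        }
        len -= static_cast<size_t>(n);
    }
    return true;
}
```
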
Implement a way to temporarily dump the cube data into disk storage
in order to save system memory.

For a `./cubes -n 13 -w -s -u` run,

the heaptrack tool reports:
- total runtime: 26 min 18 s
- peak RSS: 2.4 GB
- peak heap memory: 978 MB

This confirms that only the std::unordered_set<> internal
nodes (and the lookup array) are kept in memory.
A slow-down is expected, as accessing an element reads it from disk.

The swap files are named as `storage_<number>.bin` in the cache folder.
These files are normally deleted as soon as they are no longer needed.

Important!
The process can open so many files simultaneously
that the system NOFILE limit is reached.
This limit should be raised with `ulimit -n 128000` to avoid the program
being terminated. The minimum number of open file handles needed is at least:
<maximum number of shapes for N> * 32

- CubeSwapSet is a specialized std::unordered_set<> that stores the cube data in a file.
- CubeStorage acts as a pseudo allocator for the cube data.
- CubePtr is the key type inserted into CubeSwapSet.
  It is only a 64-bit offset into the backing file, and a
  CubePtr is owned by the CubeStorage that created it.
- CubePtr::get(const CubeStorage&) reads the Cube out of the storage.
  Hashy users are adapted to use it where needed.
- Clearing Hashy is now quite fast because there is no memory to be
  freed for CubePtrs. SubsubHashy::clear() simply deletes the data
  and the backing file.
- Compiling in C++20 mode enables a speed-up by allowing
  SubsubHashy::contains() to work with both Cube and CubePtr types.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
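
The core trick, an unordered_set whose keys are just file offsets while hashing and equality go through CubeStorage, can be sketched like this (all types here are simplified stand-ins; e.g. the storage is an in-memory vector instead of a file):

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

struct CubePtr { uint64_t offset; };

struct CubeStorage {
    // Stand-in for the real file-backed storage.
    std::vector<std::vector<uint8_t>> data;
    const std::vector<uint8_t>& read(CubePtr p) const { return data[p.offset]; }
};

struct CubeHash {
    const CubeStorage* s;
    size_t operator()(CubePtr p) const {
        size_t h = 14695981039346656037ull;          // FNV-1a over the cube bytes
        for (uint8_t b : s->read(p)) { h ^= b; h *= 1099511628211ull; }
        return h;
    }
};

struct CubeEqual {
    const CubeStorage* s;
    bool operator()(CubePtr a, CubePtr b) const { return s->read(a) == s->read(b); }
};

using CubeSwapSet = std::unordered_set<CubePtr, CubeHash, CubeEqual>;

// Constructed with stateful functors, e.g.:
//   CubeStorage storage;
//   CubeSwapSet set(0, CubeHash{&storage}, CubeEqual{&storage});
// With C++20 transparent functors, contains() could also accept a raw Cube.
```
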
- Thread-local read-cache for CubeStorage:
  the read-cache is private to each thread that calls CubeStorage::read().
  The cache is shared by all CubeStorage instances per thread.
  Entries are evicted from the cache with an LRU (least-recently-used) policy.

- Massive CacheWriter optimizations:
  the written CubeStorage file is extremely useful for CacheWriter.
  CacheWriter now uses mapped::file::copyAt() to merge the
  CubeStorage file into the saved cache file as-is.
  This completely bypasses iterating the CubeSwapSet Cube-by-Cube
  and makes CacheWriter::save() return without waiting for the data copy
  to actually complete.
  Once the copy job is complete, the source CubeStorage file is deleted.
  CubeStorage::discard() now simply drops the reference to the old file.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
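
A per-thread LRU read-cache of the kind described above can be sketched as follows (capacity and container choices are illustrative, not the actual CubeStorage code):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

class ReadCache {
    using Key = uint64_t;                       // CubePtr offset
    using Value = std::vector<uint8_t>;         // cube bytes
    using List = std::list<std::pair<Key, Value>>;

    size_t capacity_;
    List lru_;                                  // front = most recently used
    std::unordered_map<Key, List::iterator> index_;

public:
    explicit ReadCache(size_t capacity = 1024) : capacity_(capacity) {}

    const Value* find(Key k) {
        auto it = index_.find(k);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);   // refresh recency
        return &lru_.front().second;
    }
    void insert(Key k, Value v) {
        if (index_.count(k)) return;                   // already cached
        lru_.emplace_front(k, std::move(v));
        index_[k] = lru_.begin();
        if (lru_.size() > capacity_) {                 // evict least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }
};

// One instance per thread, shared by all CubeStorage instances on that thread.
thread_local ReadCache g_readCache;
```
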
- Memory map a 2 MiB region at the end of the backing file.
  This consumes an additional 2 MiB of RAM per CubeStorage instance but
  reduces the number of file::truncate() and system calls issued
  by a large factor.
  The mapped region also speeds up CubeStorage::read(): if the CubePtr
  falls into the mapped area, mapped::region::readAt() can simply
  memcpy the data.
- Reduce the Subsubhashy::insert() write-lock scope.
  If the entry is dropped (because another thread inserted it first),
  unlock immediately before CubeStorage::drop() is called.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Overall features:
- `CubeSwapSet` provides large memory savings for Hashy
  by writing the Cube data contiguously into temporary files.
  SubsubHashy inserts CubePtrs that refer to the storage file, which
  is managed by `CubeStorage`.
- Any Cubes accessed from CubeSwapSet are cached in a per-thread
  read-cache. The cache has an LRU eviction policy with 1024 Cubes per thread.
  This nearly eliminates any slow-down caused by reading the storage file.
- CacheWriter takes advantage of CubeStorage's contiguous data
  layout, enabling a near-instantaneous save().
  The data is merged into the cache file via mapped::file::copyAt(),
  followed by deletion of the temporary storage file.
  (At most -t N simultaneous copies can be issued before any waiting happens.)
@nsch0e
Owner

nsch0e commented Aug 25, 2023

Sorry for the silence, I still haven't found time to look at your work.
Will try next weekend. 😉

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
Surprisingly, N=14 is not possible with 16 GiB of memory,
because at a certain point of progress the OS begins to swap
*something* and the process grinds to a halt.
This happens even if *there is free memory available*, so something is
going haywire.

I found out that the culprit may be that the large (+3 GiB)
CacheReader memory mappings are being swapped out of memory.
The OS is trying to keep the previously accessed memory in system memory,
to our detriment.
For -t K threads we only need to have K Cubes from the cache file
in memory at once.

The only way out of this problem is to not memory map the entire cache file
at once and instead read it Cube-by-Cube.
I think @nsch0e would have wanted to implement reading this way from
the beginning, but he was missing the `mapped::file::readAt()`
that works with absolute file offsets and can read the file in parallel.

Currently FlatCache and CacheReader use the same CubeIterator
and ShapeRange types.
This is a problem for implementing a better CubeIterator that reads
the Cubes one-by-one from a file, because any changes to
these would break FlatCache, which doesn't use cache files.

Start by adding abstract interfaces for CubeIterator and ShapeRange.

- ICubeIterator: base class interface for Cube iterators.
- CubeIterator: the current implementation of ICubeIterator.
- CacheIterator: a type-erased proxy.
  This is needed to avoid disrupting the CubeIterator class users
  too much and to make the type-erased iterator work in practice.
- IShapeRange: base class interface.
- Make ICache::getCubesByShape() return a reference to the IShapeRange.
- Adapt CubeIterator users to use CacheIterator instead.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
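
The type-erasure pattern here is: callers hold a concrete CacheIterator, which owns the real iterator only through the ICubeIterator interface. A simplified sketch (the actual virtual methods and signatures in the code base may differ):

```cpp
#include <memory>

struct Cube;  // defined elsewhere in the code base

class ICubeIterator {
public:
    virtual ~ICubeIterator() = default;
    virtual const Cube& operator*() const = 0;
    virtual void next() = 0;
    virtual bool equals(const ICubeIterator& other) const = 0;
    virtual std::unique_ptr<ICubeIterator> clone() const = 0;
};

// Type-erased proxy: existing CubeIterator users keep value semantics while
// the concrete iterator (in-memory or file-backed) hides behind the interface.
class CacheIterator {
    std::unique_ptr<ICubeIterator> it_;

public:
    explicit CacheIterator(std::unique_ptr<ICubeIterator> it) : it_(std::move(it)) {}
    CacheIterator(const CacheIterator& o) : it_(o.it_ ? o.it_->clone() : nullptr) {}
    CacheIterator(CacheIterator&&) = default;

    const Cube& operator*() const { return **it_; }
    CacheIterator& operator++() { it_->next(); return *this; }
    bool operator!=(const CacheIterator& o) const { return !it_->equals(*o.it_); }
};
```
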
- Remove the CacheReader XYZ mapping.
- Add CubeReadIterator that reads Cubes one at a time.
- FileShapeRange takes the cache file and offsets into the file.
- Update CacheReader::loadFile() to initialize an array of
  FileShapeRange from the cache file.

The result is a celebratory hooray for computing N=14 for the first time
with less than 9 GiB of RSS:

```
process output shape  99/101 [ 3  5  5]
  shape 2 5 5
  shape 3 4 5
  num: 588828
saved ./cache/cubes_14_3-5-5.bin, took 0.01 s
process output shape 100/101 [ 4  4  4]
  shape 3 4 4
  shape 4 4 4
  num: 3341560
saved ./cache/cubes_14_4-4-4.bin, took 0.11 s
process output shape 101/101 [ 4  4  5]
  shape 3 4 5
  shape 4 4 4
  num: 752858
saved ./cache/cubes_14_4-4-5.bin, took 0.02 s
took 7231.83 s
num total cubes: 1039496297
```

My NVMe disk was not particularly happy with
`output shape  80/101 [ 2  3  4]`, which produced a +8 GiB file at the end.
The disk throttled badly after reaching 60°C...
But it did complete eventually at a reasonable pace, and
memory usage dropped below 7 GiB for the rest of the run.

N=15 will require more tuning of the CubeStorage read-cache and
a more parallel file system.
btrfs looks to be not very good at this job,
as writing the storage files in parallel reduces the program to
near single-threaded speed.

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
- Solve the problem of the system thrashing the CacheReader memory map.
  CubeReadIterator now reads the Cubes one-by-one from the cache file;
  the Cube XYZ data is not memory mapped at all by CacheReader.
- N=14 is possible with 9 GiB of memory and a very fast disk. :-)

Signed-off-by: JATothrim <jarmo.tiitto@gmail.com>
@JATothrim
Author

JATothrim commented Aug 26, 2023

@nsch0e no need to hurry. 👍

I merged a "milestone" change: 37d51e5

This was the first version of the C++ cubes that was able to compute N=14 with 16 GiB of system memory (approx. 9 GiB peak).
The cache folder is at 50 GiB (including previous runs) and my disk was not happy heating to over 60°C while doing it. 🥲

N=15 will require some tinkering with the CubeSwapper to reduce the I/O load, and a dedicated file system to store the output...

@nsch0e
Owner

nsch0e commented Aug 26, 2023

How long did N=14 compute? What do you mean with dedicated file system?

@JATothrim
Author

JATothrim commented Aug 26, 2023

> How long did N=14 compute? What do you mean with dedicated file system?

The computation took 7231.83 seconds; see the commit I linked in my previous message. I had to pause the process several times to let the NVMe disk cool down a bit. What I meant is that when running N=14 and beyond, the cache folder should be put on the fastest possible disk storage with a parallel/ext4 filesystem. I have two NVMe disks, so I can do RAID 0 and put ext4 on that for the cache folder.
