
Per Dataset Caching #38

Closed
flippmoke wants to merge 49 commits

Conversation

@flippmoke (Author)

Updated the way caching works so that caches can be set per dataset or globally; this converts the cache size limit to a per-dataset limit when the per-dataset option is selected.
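
As a rough illustration of the intended usage (not code from this pull request), an application might opt in through configuration options; GDAL_DATASET_CACHING is the new option introduced here, GDAL_CACHEMAX is the existing one, and the file name is hypothetical:

    #include "cpl_conv.h"
    #include "gdal_priv.h"

    int main()
    {
        GDALAllRegister();

        // With per-dataset caching enabled, GDAL_CACHEMAX applies to each
        // dataset individually instead of to a single global cache.
        CPLSetConfigOption("GDAL_DATASET_CACHING", "YES");
        CPLSetConfigOption("GDAL_CACHEMAX", "64");   // in MB, per dataset

        GDALDataset *poDS = static_cast<GDALDataset *>(
            GDALOpen("example.tif", GA_ReadOnly));   // hypothetical file
        // ... raster reads now fill this dataset's own block cache ...
        GDALClose(poDS);
        return 0;
    }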

@flippmoke (Author)

I realized that GDALGetCacheMax() and GDALGetCacheMax64() are used in several places, and these call sites need to be corrected to use the dataset cache max as necessary. I will submit a change for this soon.

Additionally, new tests will likely need to be written that run with "GDAL_DATASET_CACHING=YES". I am not quite sure yet how to do this in the test framework; I will look into that soon.

…heMax in various spots in the code. In many situations this should be pulling the size of the non global cache size in the situation where a dataset specific cache size was specified. Added methods to access the RBM from the Dataset and RasterBand as well.
@rouault (Member) commented Jul 9, 2014

Happy to see that you managed to do the implementation in accordance with our discussion.
A few remarks:

  • in GDALRasterBlockManager, we probably only need the 64-bit variants (and probably without the 64 suffix). The xxxxx() and xxxxx64() pairs are only necessary in the C functions for backward compatibility purposes
  • for a GDALRasterBlockManager created by a GDALDataset, the hRBMutex should not be taken
  • GDALDestroyRasterBlockManager should be called in GDALDestroyDriverManager
  • GDAL_DATASET_CACHING should be mentioned in the generated documentation. It should be made explicit that GDAL_CACHEMAX is then a per-dataset value and not a global value.
  • in GDALRasterBand, you probably need a GetBlockManager() (private/protected) method to avoid repeating the logic if ( poDS == NULL ) poRBM = GetGDALRasterBlockManager(); else poRBM = poDS->poRasterBlockManager; (see the sketch after this list)
  • in GDALDataset, we probably need a public accessor to the dataset raster block manager's SetCacheMax() method, to be able to set it more easily than by globally altering GDAL_CACHEMAX
  • and I'm also wondering whether we would want a method to turn on dataset caching on a case-by-case basis rather than with GDAL_DATASET_CACHING. But that might be inconvenient to do after dataset instantiation.
  • yes, we need new tests; likely a new .py script in autotest/gcore
  • and also probably an RFC to advertise this non-trivial change in the core and receive feedback from other PSC members
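
A minimal sketch of the GetBlockManager() helper suggested above, assuming the GDALRasterBlockManager class, the GetGDALRasterBlockManager() global accessor and the poRasterBlockManager member introduced in this pull request (exact names may differ in the final code):

    GDALRasterBlockManager *GDALRasterBand::GetBlockManager()
    {
        // Standalone bands (no owning dataset) fall back to the global manager.
        if( poDS == NULL )
            return GetGDALRasterBlockManager();

        // Otherwise use the per-dataset manager owned by the dataset.
        return poDS->poRasterBlockManager;
    }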

flippmoke added 3 commits July 9, 2014 15:20
…heMax in various spots in the code. In many situations this should be pulling the size of the non global cache size in the situation where a dataset specific cache size was specified. Added methods to access the RBM from the Dataset and RasterBand as well.
@flippmoke (Author)

I am slightly concerned that, without a global cache size, the logic in gcore/rasterio.cpp might need to be revisited. Would two differently sized caches be an issue?

@rouault (Member) commented Jul 9, 2014

For rasterio.cpp we might perhaps use the min of the source and target dataset cache size for the default value of nTargetSwathSize. And for the if (bDstIsCompressed && bInterleave && nTargetSwathSize > poRBM->GetCacheMax64()) test, we should compare to the target dataset cache size.
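
A hedged sketch of that swath sizing, where poSrcDS/poDstDS stand for the source and target datasets and the GetBlockManager()/GetCacheMax64() accessors are assumptions based on names used in this thread, not the final API:

    // Default swath size: bounded by the smaller of the two dataset caches,
    // so that neither cache is blown out by a single swath.
    GIntBig nSrcCacheMax = poSrcDS->GetBlockManager()->GetCacheMax64();
    GIntBig nDstCacheMax = poDstDS->GetBlockManager()->GetCacheMax64();
    GIntBig nTargetSwathSize = MIN(nSrcCacheMax, nDstCacheMax);

    // For compressed, interleaved output, clamp against the *target*
    // dataset cache, as suggested above.
    if( bDstIsCompressed && bInterleave && nTargetSwathSize > nDstCacheMax )
        nTargetSwathSize = nDstCacheMax;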

@flippmoke (Author)

Removed the xxxx64 suffixes from the GDALRasterBlockManager class.

flippmoke added 12 commits July 9, 2014 18:16
… a global way, this converts the cache size limit to per dataset
…with RasterBands that do not have a dataset.
…heMax in various spots in the code. In many situations this should be pulling the size of the non global cache size in the situation where a dataset specific cache size was specified. Added methods to access the RBM from the Dataset and RasterBand as well.
…heMax in various spots in the code. In many situations this should be pulling the size of the non global cache size in the situation where a dataset specific cache size was specified. Added methods to access the RBM from the Dataset and RasterBand as well.
@flippmoke (Author)

Sorry for the confusion; I attempted to rebase as I should have done the first time, and it seems to have caused a lot of extra noise in the pull request...

…eventing a segfault when destroying a mutex that was already NULL.
@flippmoke (Author)

Where in the automated documentation should I put the information? It seems like some of this information is currently only available via the wiki.

@flippmoke (Author)

While doing some testing on multithreading with the cache system, I found an interesting issue. I keep getting this failure due to multi-threading:

ERROR 7: Assertion `nLockCount == 0' failed
in file `gdalrasterblock.cpp', line 314

Basically this is caused when, in GDALRasterBand, a GDALRasterBlock is deleted like this:

poBlock->DropLock();
delete poBlock;

Should there be a method in the GDALRasterBlockManager that handles the deletion of GDALRasterBlocks such that it will not delete them if the lock count is higher than 0? Additionally, this would mean that a mutex would need to be wrapped around GDALRasterBlock::AddLock() and GDALRasterBlock::DropLock().
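
A minimal sketch of that idea: a hypothetical SafeDeleteBlock() helper on GDALRasterBlockManager, using the hRBMutex member from this pull request. GetLockCount() and DropLock() are existing GDALRasterBlock methods; MarkForDeletion() is only the rename floated later in this thread.

    #include "cpl_multiproc.h"

    void GDALRasterBlockManager::SafeDeleteBlock( GDALRasterBlock *poBlock )
    {
        // Hypothetical helper: guard the lock-count check with the same mutex
        // that would also protect AddLock()/DropLock().
        CPLMutexHolderD( &hRBMutex );

        poBlock->DropLock();                   // release the caller's reference
        if( poBlock->GetLockCount() == 0 )
            delete poBlock;                    // nobody else holds it: safe to free
        else
            poBlock->MarkForDeletion();        // defer destruction to the last unlocker
    }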

flippmoke added 16 commits July 21, 2014 16:43
…with RasterBands that do not have a dataset.
…heMax in various spots in the code. In many situations this should be pulling the size of the non global cache size in the situation where a dataset specific cache size was specified. Added methods to access the RBM from the Dataset and RasterBand as well.
… a global way, this converts the cache size limit to per dataset
…with RasterBands that do not have a dataset.
…heMax in various spots in the code. In many situations this should be pulling the size of the non global cache size in the situation where a dataset specific cache size was specified. Added methods to access the RBM from the Dataset and RasterBand as well.
…eventing a segfault when destroying a mutex that was already NULL.
Conflicts:
	gdal/gcore/gdalrasterblock.cpp
@flippmoke (Author)

Started the RFC; I know it probably needs more details.

http://trac.osgeo.org/gdal/wiki/rfc47_dataset_caching

I still haven't had time to learn how the testing suite works in order to add tests for this code properly, as I have mostly been using my own code to test it.

@rouault (Member) commented Jul 22, 2014

Regarding "The current scope of this mutex makes it lock for the entire read and writing process of the underlying driver", unless I'm seriously mistaken, this is wrong. At least for the read part that occurs outside of the rasterblock mutex. The writing may occur under the mutex from GDALRasterBlock::Internalize() when the block cache is saturated and they are dirty writable blocks to flush.

I'd also like a clear explanation in the RFC of what is thread-safe and what is not. I'm afraid people will think that using a GDAL dataset handle is now thread-safe, whereas it generally is not, except in the rare cases where the underlying driver is thread-safe, which is pretty much only the case for the MEM driver. And even that is not completely the case: for example, GDALRasterBand::GetMaskBand() is not thread-safe, and most Set methods in MEMRasterBand are not thread-safe either.

I'm still worried that we can have deadlock issues with the per-dataset rasterblock mutex and the per-band mutex. Imagine a 2-band dataset whose IReadBlock() calls GetLockedBlockRef() on the other band (similar to what is done in the GTiff driver, for example). Now imagine that thread T1 calls GetLockedBlockRef() on block (1,1) of band 1 at the same time that thread T2 calls GetLockedBlockRef() on block (1,1) of band 2. In T1: as block (1,1) of band 1 is not cached yet, it goes on to call IReadBlock() under the per-band-1 mutex. Then IReadBlock() calls GetLockedBlockRef() on block (1,1) of band 2, but the per-band-2 mutex is already taken by thread T2. The same happens in T2 with respect to T1: deadlock.
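
A purely illustrative reproduction of that lock ordering, using std::mutex stand-ins for the per-band mutexes rather than GDAL's actual types:

    #include <mutex>
    #include <thread>

    std::mutex band1Mutex, band2Mutex;   // stand-ins for the per-band mutexes

    void ReadBand1Block()                // models GetLockedBlockRef() on band 1
    {
        std::lock_guard<std::mutex> lock1(band1Mutex);
        // IReadBlock() on band 1 needs a block of band 2 (as in the GTiff case)...
        std::lock_guard<std::mutex> lock2(band2Mutex);
    }

    void ReadBand2Block()                // models GetLockedBlockRef() on band 2
    {
        std::lock_guard<std::mutex> lock2(band2Mutex);
        // ...while band 2's IReadBlock() needs a block of band 1.
        std::lock_guard<std::mutex> lock1(band1Mutex);
    }

    int main()
    {
        // T1 and T2 each grab one mutex and then wait for the other:
        // this program can hang, demonstrating the deadlock window.
        std::thread t1(ReadBand1Block), t2(ReadBand2Block);
        t1.join();
        t2.join();
        return 0;
    }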


Regarding the code:

The comment on GDALRasterBlockManager::SafeLockBlock() about "touching" the block no longer applies, since the call to Touch() has been removed

In GDALRasterBand::Fill(), I think all changes are whitespace changes

GDALRasterBlock::AddLock() and DropLock() could likely be implemented with CPLAtomicInc() and CPLAtomicDec() instead of full mutex locking.
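
A sketch of that alternative, assuming nLockCount is the block's lock counter (the one checked by the assertion above); CPLAtomicInc()/CPLAtomicDec() come from cpl_atomic_ops.h:

    #include "cpl_atomic_ops.h"

    void GDALRasterBlock::AddLock()
    {
        CPLAtomicInc( &nLockCount );    // atomically increment, no mutex needed
    }

    void GDALRasterBlock::DropLock()
    {
        CPLAtomicDec( &nLockCount );    // atomically decrement
    }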

It is inconsistent to take the mutex in GDALRasterBlock::MarkDirty() and not in MarkClean()

GDALRasterBlock::Delete() should perhaps be renamed MarkForDeletion() or something like that

GDALRasterBand::FlushBlock(): why are the last 2 operations on the block not under the mutex? (Not saying this is wrong, but it probably deserves a comment)

GDALRasterBlock::Internalize(): the comment about flushing other blocks is no longer up to date


Regarding testing: as this heavily involves multi-threading, which Python is of no help in testing, some parts of the tests will have to be written as C/C++ code in autotest/cpp.
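
As a rough sketch of what such a test in autotest/cpp might look like (the MEM-driver usage, thread count and buffer sizes are illustrative assumptions, not part of this pull request), several threads could hammer the block cache of one dataset:

    #include "cpl_conv.h"
    #include "cpl_multiproc.h"
    #include "gdal_priv.h"

    static void WorkerFunc( void *pArg )
    {
        GDALRasterBand *poBand = static_cast<GDALRasterBand *>(pArg);
        float afBuf[16 * 16];
        // Repeated reads from several threads exercise the per-dataset block
        // cache and its locking.
        for( int i = 0; i < 1000; i++ )
            (void)poBand->RasterIO( GF_Read, 0, 0, 16, 16, afBuf, 16, 16,
                                    GDT_Float32, 0, 0 );
    }

    int main()
    {
        GDALAllRegister();
        CPLSetConfigOption("GDAL_DATASET_CACHING", "YES");

        GDALDriver *poDrv = GetGDALDriverManager()->GetDriverByName("MEM");
        GDALDataset *poDS = poDrv->Create("", 256, 256, 1, GDT_Float32, NULL);
        GDALRasterBand *poBand = poDS->GetRasterBand(1);

        CPLJoinableThread *ahThreads[4];
        for( int i = 0; i < 4; i++ )
            ahThreads[i] = CPLCreateJoinableThread(WorkerFunc, poBand);
        for( int i = 0; i < 4; i++ )
            CPLJoinThread(ahThreads[i]);

        GDALClose(poDS);
        return 0;
    }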

@flippmoke (Author)

It is wrong; I was mistaken in my thinking about the previous scope of the mutex. I will correct it soon. It does lock during writing within GDALRasterBlock::Internalize(), as you stated; this is typically where it remained locked the longest during my testing.

I am working on addressing the other items as quickly as possible.

rouault added a commit that referenced this pull request Mar 8, 2018
This should perhaps help fix, or at least work around, the following
issue seen randomly on the gcc52_stdcpp14_sanitize Travis-CI target.

e.g on https://api.travis-ci.org/v3/job/351015753/log.txt

Running gdrivers/gdalhttp.py...
  TEST: http_1 ... success
ERROR 1: Range downloading not supported by this server!
ERROR 1: Request for 8-407 failed
  TEST: http_2 ... success
  TEST: http_3 ... success
ERROR 4: `/vsicurl/ftp://download.osgeo.org/gdal/data/gtiff/utm.tif' not recognized as a supported file format.
  TEST: http_4 ... HTTP service for ftp://download.osgeo.org/gdal/data/gtiff/utm.tif is down (HTTP Error: ftp error: 425 Security: Bad IP connecting.)
cannot open URL
skip
  TEST: http_5 ... success
ERROR 1: JSON parsing error: unexpected character (at offset 0)
  TEST: http_6 ... success
==31274==WARNING: AddressSanitizer failed to allocate 0x62d00019040f bytes
==31274==AddressSanitizer's allocator is terminating the process instead of returning 0
==31274==If you don't like this behavior set allocator_may_return_null=1
==31274==AddressSanitizer CHECK failed: ../../.././libsanitizer/sanitizer_common/sanitizer_allocator.cc:147 "((0)) != (0)" (0x0, 0x0)
    #0 0x7f3f527259f4  (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0x9e9f4)
    #1 0x7f3f5272a453 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0xa3453)
    #2 0x7f3f526a7461  (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0x20461)
    #3 0x7f3f527286d5  (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0xa16d5)
    #4 0x7f3f526acb3d  (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0x25b3d)
    #5 0x7f3f526adbd5  (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0x26bd5)
    #6 0x7f3f5271e32e in realloc (/home/travis/build/OSGeo/gdal/install-gcc-5.2.0/lib64/libasan.so.2.0.0+0x9732e)
    #7 0x7f3f3ea2a940 in VSIRealloc /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsisimple.cpp:814
    #8 0x7f3f3e8e3958 in VSICurlHandleWriteFunc /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsil_curl.cpp:839
    #9 0x7f3f2ff61442 in Curl_client_write (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x16442)
    #10 0x7f3f2ff8bc46 in Curl_pp_readresp (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x40c46)
    #11 0x7f3f2ff621ab  (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x171ab)
    #12 0x7f3f2ff64546  (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x19546)
    #13 0x7f3f2ff61bbf  (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x16bbf)
    #14 0x7f3f2ff61cd1  (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x16cd1)
    #15 0x7f3f2ff6776b in Curl_disconnect (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x1c76b)
    #16 0x7f3f2ff7c170 in curl_multi_cleanup (/usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4+0x31170)
    #17 0x7f3f3e902c6b in ClearCache /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsil_curl.cpp:2913
    #18 0x7f3f3e8fc9f7 in ~VSICurlFilesystemHandler /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsil_curl.cpp:2564
    #19 0x7f3f3e8fcf23 in ~VSICurlFilesystemHandler /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsil_curl.cpp:2569
    #20 0x7f3f3e9e4a02 in VSIFileManager::~VSIFileManager() /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsil.cpp:1805
    #21 0x7f3f3e9e5baa in VSICleanupFileManager /home/travis/build/OSGeo/gdal/gdal/port/cpl_vsil.cpp:1966
    #22 0x7f3f3e533e42 in GDALDriverManager::~GDALDriverManager() /home/travis/build/OSGeo/gdal/gdal/gcore/gdaldrivermanager.cpp:262
    #23 0x7f3f3e53418d in GDALDriverManager::~GDALDriverManager() /home/travis/build/OSGeo/gdal/gdal/gcore/gdaldrivermanager.cpp:329
    #24 0x7f3f3e53a14e in GDALDestroyDriverManager /home/travis/build/OSGeo/gdal/gdal/gcore/gdaldrivermanager.cpp:898
    #25 0x7f3f27794ea3 in ffi_call_unix64 (/usr/lib/python2.7/lib-dynload/_ctypes.so+0x1aea3)
    #26 0x7f3f277948c4 in ffi_call (/usr/lib/python2.7/lib-dynload/_ctypes.so+0x1a8c4)
    #27 0x7f3f277852c1 in _ctypes_callproc (/usr/lib/python2.7/lib-dynload/_ctypes.so+0xb2c1)
    #28 0x7f3f27785aa1  (/usr/lib/python2.7/lib-dynload/_ctypes.so+0xbaa1)
    #29 0x4c2645 in PyObject_Call (/usr/bin/python2.7+0x4c2645)
    #30 0x537589 in PyEval_EvalFrameEx (/usr/bin/python2.7+0x537589)
    #31 0x5376f1 in PyEval_EvalFrameEx (/usr/bin/python2.7+0x5376f1)
    #32 0x5376f1 in PyEval_EvalFrameEx (/usr/bin/python2.7+0x5376f1)
    #33 0x53e2af in PyEval_EvalCodeEx (/usr/bin/python2.7+0x53e2af)
    #34 0x536c45 in PyRun_StringFlags (/usr/bin/python2.7+0x536c45)
    #35 0x53ecb4 in PyRun_SimpleStringFlags (/usr/bin/python2.7+0x53ecb4)
    #36 0x51e62d in Py_Main (/usr/bin/python2.7+0x51e62d)
    #37 0x7f3f504657ec in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x217ec)
    #38 0x41bad0  (/usr/bin/python2.7+0x41bad0)




git-svn-id: https://svn.osgeo.org/gdal/trunk@41669 f0d54148-0727-0410-94bb-9a71ac55c965
@rouault closed this Mar 23, 2018