Memory Overhead is Too High (10x or more) #1613
Comments
This is highly dependent upon string length. If all your strings are 1 character, then overhead is typically 20-30x. If your strings are 1000 characters, then overhead is more like 2%. Most std::string implementations have an allocation-avoidance path below a certain length (depending upon the size of the std::string itself, which is usually about 24-32 bytes on 64-bit systems), whereby they store the string in the std::string structure. So the optimal string length is around 20-24 bytes.
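A quick way to see where that threshold sits on a given platform is to print sizeof(std::string) and the capacity of an empty string (a minimal sketch; the exact numbers vary by standard library):

#include <iostream>
#include <string>

int main() {
    // Typically 24-32 bytes on 64-bit systems.
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
    // The capacity of an empty string usually reveals the small-string
    // threshold (e.g. 15 with libstdc++/MSVC, 22 with libc++); anything
    // longer triggers a separate heap allocation.
    std::cout << "SSO capacity: " << std::string().capacity() << '\n';
    return 0;
}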
Yes, that does seem like too much. Worth investigating. My experience with std::map is that its overhead per entry is roughly 3 pointers' worth, plus the memory allocator overhead and padding. On 64-bit systems, that is again 24-32 bytes. That plus the string for the key means the nodes should be roughly 64 bytes plus the size of the basic_json. Is the basic_json around 70-80 bytes? That seems like more than it should be as, IIRC, it is a union of the possible types -- string, map, vector being the large ones -- plus a few bits for the type information.

Best case scenario (on 64-bit systems) is going to be roughly 96 bytes, though, so only about 40-45% smaller. Doing better than the STL containers is actually quite a challenge unless you are willing to sacrifice generality and functionality fairly substantially. On 64-bit systems some cunning memory management could let you use smaller offsets in place of pointers, but this is a lot of effort (and bugs) for what is likely going to be no more than a 2-3x improvement. The win on 32-bit systems will be less. That's a pretty steep maintenance and generality cost for relatively modest gains. Doesn't mean it's not worth doing, but not to be undertaken lightly.

For large piles of JSON data you really should consider whether a DOM representation is even the right choice. It may simply be better to parse it SAX-style and keep it around in its text form to reparse as needed. Or keep pointers/offsets into the image and partially re-parse when needed.
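One way to measure the per-entry overhead directly, rather than estimating it, is to plug a reporting allocator into the map. A minimal sketch (assuming json.hpp is available, as in the example further down); the printed sizes are the actual tree-node allocations:

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include "json.hpp"

// Minimal allocator that reports the size of every allocation, so the
// real size of each std::map tree node becomes visible.
template <class T>
struct ReportingAllocator {
    using value_type = T;
    ReportingAllocator() = default;
    template <class U> ReportingAllocator(const ReportingAllocator<U>&) {}
    T* allocate(std::size_t n) {
        std::cout << "allocated " << n * sizeof(T) << " bytes\n";
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const ReportingAllocator<T>&, const ReportingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const ReportingAllocator<T>&, const ReportingAllocator<U>&) { return false; }

int main() {
    using Value = std::pair<const std::string, nlohmann::json>;
    std::map<std::string, nlohmann::json, std::less<std::string>,
             ReportingAllocator<Value>> m;
    // Each printed allocation is one tree node (some implementations also
    // allocate a sentinel/head node): 3 pointers + color flags + key + value + padding.
    m["key"] = 42;
    return 0;
}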
For what it's worth, I think it's a universal truth that keys in a JSON document will be shorter than 100 characters. The data could be very long, but optimizing the storage of the keys in the key-value pairs could be a fairly substantial gain across all use cases. In terms of the size of basic_json, the documentation claims:
I can confirm that it looks like the storage for the single JSON value is indeed just a pointer, defined here: https://github.com/nlohmann/json/blob/develop/single_include/nlohmann/json.hpp#L13671 That should be only 9 total bytes for basic_json (plus all the memory that it points to, of course). It's possible that Visual Studio is including the dereferenced string in the allocation calculation for the TreeNode, but that wouldn't explain why the value was a constant 140 instead of variable (I have some string data that is 1 KB, though most is small).
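Checking that claim on a given compiler is a one-liner (the union-plus-type layout usually pads up to 16 bytes on 64-bit systems, even if only ~9 bytes are actually used):

#include <iostream>
#include "json.hpp"

int main() {
    std::cout << "sizeof(nlohmann::json): " << sizeof(nlohmann::json) << '\n';
    return 0;
}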
BTW, how are you measuring the size of each TreeNode?
I'm using the Memory Profiler in Visual Studio, and running through the allocation output. That gives me the function call stack, and the size of the allocated object.
Relevant: #1516. Have you tried building in Release mode? That sometimes can make a huge difference.
I would expect Release mode to improve speed, but not memory. At least not by 10x memory. That said, I switched over to RapidJson, so I don't have the code to run this again at this moment. The file I was using was this large file, if you'd like to reproduce the issue for analysis.
Just as a reference, the memory consumption was ~2.79GB for the following code built by GCC 7.4 x86/64:

#include <string>
#include <iostream>
#include <fstream>
#include "json.hpp"

using namespace std;

int main() {
    ifstream fin("scryfall-all-cards.json");
    if (!fin)
    {
        cout << "Failed to open file." << endl;
        return 1;
    }
    auto json = nlohmann::json::parse(fin);
    cout << "json parsed. Press any key to exit" << endl;
    string dummy;
    cin >> dummy;
    return 0;
}
Release mode will definitely improve memory on Visual C++, as the debug versions of the containers include extra information to aid with iterator debugging.
Already posted in another issue, but reposting here... I just ran into this same issue: a 1.6GB JSON file wouldn't load (even with 10GB of memory).
Sounds great! Is it something that is compatible with this library? Would be great to have that feature.
Hi @nlohmann, thanks for the kind feedback. I think it's very possible and fundamental to have such a feature in this project, as we live in a "Big Data World" now. One brief example of how compatible it is now:
What I did was a very fast workaround that has some weaknesses (but it worked for the purpose I needed). I needed to work with a 1.6GB JSON file that has around 6000 top-level entries (around 200KB each, all in the same format). Every time the user asks for some specific key, only that entry is parsed. One nice feature there is that it allows creating json from the cached entries. I added some auxiliary methods as well. Now the drawbacks: as discussed below, entries that get visited stay expanded in memory.

Regarding nlohmann::json, as I understood from some comments in this thread, I think that there could be some alternative ready-to-use parsers that the user just passes as some option (maybe some template option on nlohmann::json? I haven't looked at any of it yet...), and I would suggest at least two strategies: …
There is still another problem: as users keep visiting "nodes", these nodes expand, so it would be nice to have some strategy to move them back to the string cache (as my workaround does for unvisited entries). Some tentative implementation of this on …
Sorry for the long post. If you think some of these ideas are reasonable within the scope of nlohmann::json, I could take some more time to understand how it works and maybe help "solve this issue". Best regards!
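For illustration, here is a stripped-down sketch of the lazy top-level cache idea described above. The class name LazyTopLevel is invented for this example, and the step that splits the big file into (key, raw text) pairs (by brace counting or a SAX pass) is assumed to happen elsewhere:

#include <map>
#include <string>
#include <utility>
#include "json.hpp"

// Keeps the top-level entries as raw JSON text and parses an entry only
// the first time it is requested.
class LazyTopLevel {
public:
    explicit LazyTopLevel(std::map<std::string, std::string> raw)
        : raw_(std::move(raw)) {}

    const nlohmann::json& at(const std::string& key) {
        auto it = parsed_.find(key);
        if (it == parsed_.end()) {
            // Parse on first access and keep the DOM for this entry only.
            it = parsed_.emplace(key, nlohmann::json::parse(raw_.at(key))).first;
        }
        return it->second;
    }

private:
    std::map<std::string, std::string> raw_;       // unparsed entries (cheap)
    std::map<std::string, nlohmann::json> parsed_; // expanded entries (expensive)
};

A real implementation would also want a way to shrink the parsed cache again, which is exactly the drawback mentioned above.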
Some updates on this, @nlohmann. After all, I had to study a little bit more of the nlohmann::json library, and learned a few interesting things.
The simplest parenthesis counting that I did could not handle any sort of arrays (OK, my file did not have any lists...). So, in order to provide greater generality, I used nlohmann::json to parse fields, and I had to do it twice, as I couldn't find a way to escape exception handling over partial parsings (maybe it's something that exists already, but I just did it twice and it worked).
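On the point about escaping exception handling: parse() can be asked not to throw by passing allow_exceptions = false, in which case it returns a discarded value on failure, and accept() validates without building a DOM at all. A small sketch:

#include <iostream>
#include <string>
#include "json.hpp"

int main() {
    std::string fragment = R"({"A": 1, "B": )";  // deliberately incomplete

    // With allow_exceptions = false, a failed parse yields a value for
    // which is_discarded() is true instead of throwing.
    nlohmann::json j = nlohmann::json::parse(fragment, /*callback*/ nullptr,
                                             /*allow_exceptions*/ false);
    if (j.is_discarded()) {
        std::cout << "fragment is not (yet) valid JSON\n";
    }

    // accept() only validates, without building any DOM.
    std::cout << std::boolalpha << nlohmann::json::accept(fragment) << '\n';
    return 0;
}

Whether this avoids the double parse in the partial-parsing workflow above depends on the details, but it does remove the need for try/catch.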
But for even greater generality, maybe what users want is to somehow indicate which fields should be explored lazily, as they are supposed to be huge (huge objects or huge lists). So, maybe a more informed outline of an approach could be: …
Just to conclude this: I'm quite bothered that a huge JSON file that only contains dictionaries (no lists) allows such a simple strategy that parses it in 20 seconds, but as soon as I try to achieve greater generality, it costs 244 seconds (10x more), though at least with memory under control... so it really gets me to think that perhaps some …
Shouldn't you just use the nlohmann library's SAX parser for this? Instead of building the object representation in memory like normal parsing does, it could instead build a map or hash table that contains keys -> offsets into the original data. This could do things like record only the base offset of arrays, use a hashing strategy for the keys at each level (rather than a hash table with string keys), limit the indexing to n levels, etc. This "index" data structure could then be used to extract JSON objects on demand by looking up their location rapidly. Properly abstracted, it could also index the original file on disk so that none of the data is in memory. I work with some really huge JSON files where even the binary image of the file cannot fit in memory, so this sort of approach would be quite useful.
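A rough sketch of that direction using the library's SAX interface (assuming a reasonably recent release; as noted further down in the thread, the interface does not expose byte offsets, so this version only collects the top-level keys without their positions):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "json.hpp"

// Collects the top-level object keys without building a DOM. If byte
// offsets were exposed by the SAX interface, the same idea could build a
// key -> offset index for on-demand extraction.
struct TopLevelKeyIndexer : nlohmann::json::json_sax_t {
    std::vector<std::string> top_level_keys;
    int depth = 0;

    bool key(string_t& val) override {
        if (depth == 1) top_level_keys.push_back(val);
        return true;
    }
    bool start_object(std::size_t) override { ++depth; return true; }
    bool end_object() override { --depth; return true; }
    bool start_array(std::size_t) override { ++depth; return true; }
    bool end_array() override { --depth; return true; }

    // Scalar values are skipped; only the structure matters here.
    bool null() override { return true; }
    bool boolean(bool) override { return true; }
    bool number_integer(number_integer_t) override { return true; }
    bool number_unsigned(number_unsigned_t) override { return true; }
    bool number_float(number_float_t, const string_t&) override { return true; }
    bool string(string_t&) override { return true; }
    bool binary(binary_t&) override { return true; }
    bool parse_error(std::size_t position, const std::string&,
                     const nlohmann::json::exception& ex) override {
        std::cerr << "parse error at byte " << position << ": " << ex.what() << '\n';
        return false;
    }
};

int main() {
    std::ifstream fin("big.json");  // assumes the root is a JSON object
    TopLevelKeyIndexer indexer;
    nlohmann::json::sax_parse(fin, &indexer);
    std::cout << indexer.top_level_keys.size() << " top-level keys\n";
    return 0;
}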
@abrownsword I'm speaking as a regular user who just tried to open a 1.6GB JSON file and realized that it wouldn't work on a 16GB RAM computer... and maybe that's acceptable somewhere worldwide, but here in Brazil we usually don't have access to such machinery. On the other hand, I quickly coded a solution that worked for me by counting brackets, but I feel that more and more people will arrive here and want some straightforward solution.
There is an abstraction to respond to parsing events in a JSON SAX stream. This is fairly easy to use, and it wouldn't be too hard to use it to build an index... except for one problem: I don't believe it provides the current offset. Perhaps the API could be enhanced to do so? I could certainly use that in my use of the SAX parser for error reporting.
I don't know much about the JSON SAX API, but from the few lines of the code I've read, I get the feeling that many things can be done; flexibility seems to be quite high. Maybe some internal things need some adjustments; I'm not capable of giving any suggestions in that direction at the moment.
I think that's a good suggestion @igormcoelho. My (limited) experience with the existing SAX interface is that it doesn't provide the byte offset at which each parsing event happens. If it did provide that, it would be a lot more powerful. @nlohmann -- perhaps there is a non-breaking way to make the current byte offset available?
This is interesting to know, @Meerkov, because as soon as I hit this issue, I tried RapidJson and had exactly the same problem: document loading consumed all my memory. And it was because of your message up there that I decided to look at their project hahaha 😂 But as soon as I had issues again, I decided to code my own solution. I'll take another look at their project.
@Meerkov I just re-did my test; I realized I was using RapidJson 1.1.0, which is quite old (2016), so now I tried a newer one.
It consumed 4.37GB for my 1.6GB file. That's better than 10GB, but still unusable for me, as our machines are mostly limited to 8GB (some have even less memory...). But anyway, thanks for the advice; it may indeed work for many people and it's good to know.
@igormcoelho Can you share the file?
Sorry for the delay in the response, @nlohmann... I think that, for the moment, I cannot share the file due to some privacy issues. But I can later mimic the file structure and then come back here... maybe with a random generator of a few lines (it's a much better approach for testing, I guess). In any case, as a follow-up, I continue to work with the workarounds in VastJSON (which are now multiple, due to many distinct "big file" situations), but I've already noticed that some of the workarounds could be ported here. My ugly solution via exception catching and injecting a manual stream over nlohmann::json could perhaps provide the offset indexer we need, @abrownsword, without needing to change the library. However, it's ugly. It's still on my stack to study how to do that better; for the moment I'm just successfully surviving those big JSON files. I'll come back here.
In fact, @nlohmann, I notice that the worst behavior seems to happen in a file with quite a simple structure.
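A hypothetical reconstruction of that structure (all field names and values invented for illustration; see the description in the following paragraph) is the kind of file this snippet would generate:

#include <iostream>
#include <string>
#include "json.hpp"

int main() {
    // Top-level entries Entry1..EntryN, each holding a "matrix" object of
    // scalar fields "A", "B", "C", ... with no arrays anywhere.
    nlohmann::json doc;
    for (int i = 1; i <= 3; ++i) {
        nlohmann::json matrix;
        matrix["A"] = 0.1;
        matrix["B"] = 0.2;
        matrix["C"] = 0.3;
        doc["Entry" + std::to_string(i)]["matrix"] = matrix;
    }
    std::cout << doc.dump(2) << '\n';
    return 0;
}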
If you repeat this set into around 5000 top-level entries (Entry1, Entry2, ..., Entry5000), where each "matrix" has around 100 to 200 elements ("A", "B", "C", ..., "Z", "AA", "AB", ...), you will likely end up with a file of ~1.5GB (I didn't compute things precisely here). The good side is that this case is quite tractable by providing a top-level parser that takes advantage of the fact that no lists exist inside the JSON. Like I said before, maybe the best solution for the nlohmann::json project is to provide some "big parser examples", where this is one of the possible scenarios... in general, if a file is big, I think it is very likely that it has too many top-level entries, whether those are dictionaries or some sort of huge list (I only took care of the first case for the moment, as it's the one that mattered to me until now, but other cases are also tractable).
Context:
I was parsing a 700MB JSON file in Visual Studio 2019. The process took almost 9GB of memory! This at first looks like a 10x overhead.
We all know that std::strings have on the order of 2x overhead, so this doesn't seem to explain it.
All std::TreeNode<basic_string, basic_json> allocations show up as taking exactly 140 bytes, which seems excessive. Since they are all the same size, it seems likely the keys are all being allocated much larger than necessary. Keys in maps tend to be much shorter than the data being mapped to.
Possible solution: move away from std::map?
For reference, allocating strings looks like it accounts for about 50% of memory, and tree nodes about 25%, in my instance.
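On the std::map question: basic_json is templated on its object container, and recent releases of the library ship nlohmann::ordered_json, whose objects are stored as a vector of key/value pairs (insertion order preserved, linear lookup) rather than a red-black tree. A sketch of trying it as a quick comparison, not a recommendation:

#include <fstream>
#include <iostream>
#include "json.hpp"

int main() {
    std::ifstream fin("scryfall-all-cards.json");
    if (!fin) {
        std::cerr << "Failed to open file.\n";
        return 1;
    }
    // ordered_json avoids the per-node overhead of std::map at the cost
    // of O(n) key lookup inside each object.
    auto doc = nlohmann::ordered_json::parse(fin);
    std::cout << "parsed " << doc.size() << " top-level elements\n";
    return 0;
}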