Memory Overhead is Too High (10x or more) #1613

Closed
Meerkov opened this issue May 29, 2019 · 23 comments
Labels
kind: enhancement/improvement · state: stale

Comments

@Meerkov

Meerkov commented May 29, 2019

Context:
I was parsing a 700MB json file, in Visual Studio 2019. The process took almost 9GB of memory! This at first looks like a 10x overhead.

We all know that std::strings have on the order of 2x overhead, so this doesn't seem to explain it.

All std::TreeNode<basic_string, basic_json> are showing up at taking exactly 140 bytes, which seems excessive. Since they are all the same size, it seems likely the keys are all being allocated much larger than necessary, perhaps. Keys in maps tend to be much shorter than the data that is being mapped to.

Possible solution is to move away from std::map ?

For reference, allocating strings looks like it's about 50% of memory, and treenodes are about 25% of memory in my instance.

@abrownsword

We all know that std::strings have on the order of 2x overhead, so this doesn't seem to explain it.

This is highly dependent upon string length. If all your strings are 1 character, then overhead is typically 20-30x. If your strings are 1000 characters, then overhead is more like 2%. Most std::string implementations have an allocation-avoidance path below a certain length (depending upon the size of the std::string itself, which is usually about 24-32 bytes on 64-bit systems), whereby they store the string in the std::string structure. So the optimal string length is around 20-24 bytes.
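
A quick way to see where a given toolchain lands (a minimal sketch of my own, not from the original comment): sizeof(std::string) gives the fixed per-string footprint, and the capacity() of a default-constructed string usually reports the small-string buffer, i.e. the longest string that avoids a heap allocation.

#include <iostream>
#include <string>

int main() {
    // fixed footprint of every std::string object, payload excluded
    std::cout << "sizeof(std::string): " << sizeof(std::string) << " bytes\n";
    // capacity of an empty string usually equals the SSO buffer size
    std::cout << "SSO capacity: " << std::string().capacity() << " chars\n";
}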

All std::TreeNode<basic_string, basic_json> are showing up at taking exactly 140 bytes, which seems excessive

Yes, that does seem like too much. Worth investigating. My experience with std::map is that its overhead per entry is roughly three pointers' worth, plus the memory allocator overhead and padding. On 64-bit systems, that is again 24-32 bytes. That, plus the string for the key, means the nodes should be roughly 64 bytes plus the size of the basic_json. Is the basic_json around 70-80 bytes? That seems like more than it should be since, IIRC, it is a union of the possible types -- string, map, vector being the large ones -- plus a few bits for the type information. The best-case scenario (on 64-bit systems) is going to be roughly 96 bytes though, so only about 40-45% smaller.
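
As a sanity check on these estimates, a small sketch along these lines (my own illustration, not part of the original comment) prints the building blocks; the real tree node adds roughly three pointers, a color flag, and padding on top of the map's value_type, and the allocator may round the block up further.

#include <iostream>
#include <map>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    using value_t = std::map<std::string, nlohmann::json>::value_type;
    std::cout << "sizeof(nlohmann::json) = " << sizeof(nlohmann::json) << '\n';
    std::cout << "sizeof(std::string)    = " << sizeof(std::string)    << '\n';
    std::cout << "sizeof(map value_type) = " << sizeof(value_t)        << '\n';
}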

Doing better than the STL containers is actually quite a challenge unless you are willing to sacrifice generality and functionality fairly substantially. On 64-bit systems some cunning memory management could let you use smaller offsets in place of pointers, but this is a lot of effort (and bugs) for what is likely to be no more than a 2-3x improvement. The win on 32-bit systems will be less. That's a pretty steep maintenance and generality cost for relatively modest gains. That doesn't mean it's not worth doing, but it's not to be undertaken lightly.

For large piles of JSON data you really should consider whether a DOM representation is even the right choice. It may simply be better to parse it SAX style and keep it around in its text form to reparse as needed. Or keep pointers/offsets into the image and partially re-parse when needed.

@Meerkov
Author

Meerkov commented May 30, 2019

For what it's worth, I think it's practically a universal truth that JSON keys will be shorter than 100 characters. The data can be very long, but optimizing the storage of the keys in the key-value pairs could be a fairly substantial gain across all use cases.

In terms of the size of basic_json, the documentation claims:

Each JSON object has an overhead of one pointer (the maximal size of a union) and one enumeration element (1 byte).

I can confirm that it looks like the storage for the single json value is indeed just a pointer, defined here: https://github.com/nlohmann/json/blob/develop/single_include/nlohmann/json.hpp#L13671

That should be only 9 total bytes for basic_json (plus all the memory that it points to, of course). It's possible that Visual Studio is including the dereferenced string in the allocation calculation for the TreeNode, but that wouldn't explain why the value was a constant 140 instead of variable (I have some string data that is 1KB, though most of it is small).

@abrownsword

BTW, how are you measuring the size of each TreeNode?

@Meerkov
Author

Meerkov commented May 30, 2019

I'm using the Memory Profiler in Visual Studio, and running through the allocation output. That gives me the function call stack, and the size of the allocated object.

@nickaein
Contributor

Relevant: #1516

Have you tried building in Release mode? That sometimes can make a huge difference.

@Meerkov
Author

Meerkov commented May 30, 2019

I would expect Release mode to improve speed, but not memory. At least not by 10x. That said, I switched over to RapidJson, so I don't have the code to run this again at the moment. The file I was using is the large file linked below, if you'd like to reproduce the issue for analysis.

https://archive.scryfall.com/json/scryfall-all-cards.json

@Meerkov Meerkov changed the title from "Memory Overhead of is Too High (10x or more)" to "Memory Overhead is Too High (10x or more)" May 30, 2019
@nickaein
Contributor

nickaein commented Jun 2, 2019

Just as a reference, the memory consumption was ~2.79GB for the following code built with GCC 7.4 (x86-64) and the -O3 flag. I currently don't have access to an MSVC compiler to test it, but I imagine it would be around the same number with a suitable configuration and compiler/build flags.

#include <string>
#include <iostream>
#include <fstream>
#include "json.hpp"

using namespace std;

int main() {

    ifstream fin("scryfall-all-cards.json");
    if(!fin)
    {
        cout << "Failed to open file." << endl;
        return 1;
    }
    
    // parse the entire file into an in-memory DOM (this is what consumes the memory)
    auto json = nlohmann::json::parse(fin);

    cout << "json parsed. Press any key to exit" << endl;
    string dummy;
    cin >> dummy;
    
    return 0;
}

@gregmarr
Contributor

gregmarr commented Jun 3, 2019

Release mode will definitely improve memory on Visual C++, as the debug versions of the containers include extra information to aid with iterator debugging issues. FYI, TreeNode is an internal implementation detail of std::map on VC++. You'd want to look at the size of basic_json, not the size of TreeNode<basic_string, basic_json> to see what memory size comes from the library itself. As a benchmark, you could compare it to the size of TreeNode<char, char> which should be the smallest possible TreeNode.
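
If you want to see the real node allocation directly, one way (a sketch of my own, not something from the original comment) is to hand std::map a logging allocator: the container rebinds it to its internal node type, so the byte count passed to allocate() is the true per-node size, including the tree links, the key, and the basic_json value.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <nlohmann/json.hpp>

template <class T>
struct LoggingAllocator {
    using value_type = T;
    LoggingAllocator() = default;
    template <class U> LoggingAllocator(const LoggingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        std::cout << "allocate " << n * sizeof(T) << " bytes\n";
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const LoggingAllocator<T>&, const LoggingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const LoggingAllocator<T>&, const LoggingAllocator<U>&) { return false; }

int main() {
    using value_t = std::pair<const std::string, nlohmann::json>;
    std::map<std::string, nlohmann::json, std::less<std::string>,
             LoggingAllocator<value_t>> m;
    // prints the node-sized allocation(s); some implementations also
    // allocate a sentinel node when the map is constructed
    m["key"] = "value";
}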

@stale

stale bot commented Jul 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "state: stale" label Jul 3, 2019
@stale stale bot closed this as completed Jul 10, 2019
@igormcoelho

Already posted in another issue, but reposting here...

I just ran into this same issue: a 1.6GB JSON file wouldn't load (even with 10GB of memory).
I tried other libraries but couldn't find a solution, so I coded my own here: https://github.com/igormcoelho/vastjson/
In case it's useful to someone in the future: it just indexes top-level entries as strings (which consumes less space) and puts them into nlohmann::json when necessary. It worked, and I could finally process my big JSON (with around 2.3GB of RAM)!

@nlohmann
Owner

nlohmann commented Apr 9, 2021

Sounds great! Is it something that is compatible with this library? It would be great to have that feature.

@igormcoelho

igormcoelho commented Apr 10, 2021

Hi @nlohmann, thanks for the kind feedback. I think it's very possible, and fundamental, to have such a feature in this project, as we live in a "Big Data" world now and nlohmann::json is currently the de facto standard C++ JSON library, as far as I know. I've never looked inside the code, so I'll just put some thoughts here that may or may not fit.

One brief example of how much compatible it is now:

vastjson::VastJSON bigj( new std::ifstream("my_big_file.json")  );
nlohmann::json& entryA = bigj["A"]; // file is parsed as string until entry A is found
nlohmann::json& entryB1 = bigj["B"]["B1"];  // operators can work normally as compound

What I did was a very quick workaround that has some weaknesses (but it worked for the purpose I needed). I needed to work with a 1.6GB JSON file that has around 6000 top-level entries (around 200KB each, all entries in the format "key" : { /* other substructures here */ }). To solve this quickly, I counted the nesting levels of '{' and '}' and stored all top-level structures as plain std::string in a std::map<string, string> that works side by side with another std::map<string, nlohmann::json>.

Every time the user asks for a specific key in operator[], I check whether the json structure already exists and return a reference to it. If it doesn't exist yet, I parse the corresponding string with nlohmann::json::parse(str) before returning, and clear that string from the cache to avoid wasting precious memory. This way, the global storage is roughly the same size as on disk, plus the previously loaded json structures.
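
As a rough sketch of that scheme (a simplification of my own, not the actual VastJSON code; class and method names are made up):

#include <map>
#include <string>
#include <nlohmann/json.hpp>

// top-level entries live as raw strings and are parsed on first access
class LazyTopLevel {
    std::map<std::string, std::string>    raw_;     // unparsed entries
    std::map<std::string, nlohmann::json> parsed_;  // promoted entries
public:
    void addRaw(std::string key, std::string body) {
        raw_.emplace(std::move(key), std::move(body));
    }
    nlohmann::json& operator[](const std::string& key) {
        auto it = parsed_.find(key);
        if (it == parsed_.end()) {
            // parse on demand, then drop the cached string to save memory
            it = parsed_.emplace(key, nlohmann::json::parse(raw_.at(key))).first;
            raw_.erase(key);
        }
        return it->second;
    }
};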

One nice feature is that it allows creating the json from a std::ifstream* (or std::unique_ptr<std::ifstream>), meaning the JSON is consumed from disk only when the user requests a key not yet known in the cache (this makes object creation instant, regardless of the size on disk, which is good when experimenting repeatedly with large files).

I added some auxiliary methods, like cacheSize(), unload(key), etc., that are specific to large-file applications (for example, when a user wants to return some json back to the string cache, for future use and less memory consumption).

Now the drawbacks: currently the symbols ", { and } are not allowed inside key strings, and all top-level entries must be objects (not plain numbers, strings, etc.). This means that, as it is now, it is still quite limited (with regard to all possible JSON formats), but I think it's possible to fix. (This fix was later done, allowing general JSON to be loaded, but it cost A LOT of processing...) Another drawback: only top-level entries are cached, so if the JSON has a single top-level entry, the library is useless.

Regarding nlohmann::json, as I understood from some comments in this thread, there could be some alternative ready-to-use parsers that the user just passes as an option (maybe a template option on nlohmann::json? I haven't looked at any of it yet...), and I would suggest at least two strategies:

  • top-level caching: all entries at the top level are stored as strings and parsed only when necessary (similar to what I did there, but fixing the character-parsing drawbacks)
  • global caching: all json structures are kept as strings unless the user "visits" them (may be computationally expensive in practice)

There is still another problem: as users keep visiting "nodes", these nodes expand, so it would be nice to have some strategy to move them back to the string cache (like my unload(key) strategy), to allow "unvisiting" nodes (something like "I'm done working with you, for now...").

Some "temptative" implementation on nlohmann::json (using constructor to transparently store "strategy" on json):

nlohmann::json bigj( new std::ifstream("my_big_file.json"), nlohmann::json_big_top_level_strategy);
nlohmann::json& entryA = bigj["A"]; // file is parsed as string until entry A is found
nlohmann::json& entryB1 = bigj["B"]["B1"];  // operators can work normally as compound
entryA.unvisit(); // becomes string cache of "A" -> structure_as_string

Sorry for the long post. If you think some of these ideas are reasonable within the scope of nlohmann::json, I could take some more time to understand how the library works and maybe help solve this issue. Best regards!

@igormcoelho

Some updates on this, @nlohmann. After all, I had to study the nlohmann::json library a little more, and I learned a few interesting things.
One thing I learned is that my first approach broke easily as soon as someone had a list that included objects inside... something like:

{ "A" : [ { "B" : 1 } ]  } 

The simple brace counting I did could not handle arrays at all (OK, my file did not have any lists...). So, in order to provide greater generality, I used nlohmann::json to parse the fields, and I had to do it twice, as I couldn't find a way to avoid exception handling over partial parses (maybe something for this already exists, but I just parsed twice and it worked).
Another thing I realized is that two "bigger" strategies may exist:

  • huge objects (and someone wants to lazily explore fields) - this was my original inspiration
  • huge lists (and someone wants to lazily explore list items) - I saw an interesting question regarding this

But for even greater generality, maybe what users want is to somehow indicate which fields should be explored lazily, because they are expected to be huge (huge objects or huge lists).

So, maybe a more informed outline of an approach could be:

  • create a basic_json type, named basic_json_lazy, that holds a string internally but can actually represent a json string, json object, json list, anything... it stays a string, to keep memory low, and as soon as the user tries to "explore" its content (via conversion operators), it gets "parsed" into the real object/list/etc. (a rough sketch follows this list)
  • the user can manually perform the reverse operation, for example transforming some json object field into a json_lazy field (holding the string-serialized content of the json object) - this removes the need for the "unvisit" operation mentioned earlier
  • during the nlohmann::json::parse operation, the user can pass some sort of lazy_map that indicates which fields are supposed to be parsed "lazily". Example: "B" -> "B1" is lazy, but the rest is parsed immediately, meaning that object "B1" will be of "basic_json_lazy" type, containing only the string, not the real constructed type.
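
A rough sketch of that lazy-value idea (a hypothetical standalone type of my own, not part of nlohmann::json):

#include <string>
#include <nlohmann/json.hpp>

// hypothetical lazy value: holds serialized JSON text, parses on first access
class json_lazy {
    std::string    text_;    // serialized JSON (object, list, number, ...)
    nlohmann::json value_;   // parsed representation, filled on demand
    bool           parsed_ = false;
public:
    explicit json_lazy(std::string text) : text_(std::move(text)) {}

    // "exploring" the value triggers the parse exactly once
    nlohmann::json& get() {
        if (!parsed_) {
            value_ = nlohmann::json::parse(text_);
            text_.clear();
            text_.shrink_to_fit();   // reclaim the string's memory
            parsed_ = true;
        }
        return value_;
    }

    // reverse operation ("unvisit"): go back to the compact string form
    void unload() {
        if (parsed_) {
            text_   = value_.dump();
            value_  = nullptr;
            parsed_ = false;
        }
    }
};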

Just to conclude: I'm quite bothered that a huge JSON that only contains dictionaries (no lists) allows such a simple strategy that parses it in 20 seconds, but as soon as I try to achieve greater generality it costs 244 seconds (10x more), although at least memory stays under control... It really makes me think that some constexpr/template flags indicating "no lists allowed" or "no dictionaries/objects allowed" could actually help a lot when files get big. Or maybe my implementation is just that bad. 😂 Anyway, it's working; thanks for all the effort on this amazing library.

@abrownsword

abrownsword commented Apr 12, 2021

Shouldn't you just use the nlohmann library's SAX parser for this? Instead of building the object-representation in memory like the normal parsing does, it could instead build a map or hashtable that contains keys -> offsets into the original data. This could do things like record only the base offset of arrays, use a hashing strategy for the keys at each level (rather than a hashtable with string keys), limit the indexing to n levels, etc. This "index" data structure could then be used to extract json objects on demand by looking up their location rapidly. Properly abstracted it could also index the original file on disk so that none of the data is in memory. I work on some really huge json files where even the binary image of the file cannot fit in memory, so this sort of an approach would be quite useful.
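
To make the idea concrete, here is a sketch of my own (assuming a recent version of the library; the handler below only collects top-level keys, not a full index): derive from nlohmann::json::json_sax_t, track the nesting depth, and record keys seen at depth 1. Note that the value/key callbacks receive no byte offsets (only parse_error gets a position), which is exactly the limitation discussed below.

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

struct TopLevelKeyIndexer : nlohmann::json::json_sax_t {
    std::size_t depth = 0;               // current object/array nesting depth
    std::vector<std::string> top_keys;   // keys of the outermost object

    bool key(string_t& val) override {
        if (depth == 1) top_keys.push_back(val);
        return true;
    }
    bool start_object(std::size_t) override { ++depth; return true; }
    bool end_object() override              { --depth; return true; }
    bool start_array(std::size_t) override  { ++depth; return true; }
    bool end_array() override               { --depth; return true; }

    // remaining events: accept and discard the values
    bool null() override                                         { return true; }
    bool boolean(bool) override                                  { return true; }
    bool number_integer(number_integer_t) override               { return true; }
    bool number_unsigned(number_unsigned_t) override             { return true; }
    bool number_float(number_float_t, const string_t&) override  { return true; }
    bool string(string_t&) override                              { return true; }
    bool binary(binary_t&) override                              { return true; }
    bool parse_error(std::size_t pos, const std::string&,
                     const nlohmann::detail::exception& ex) override {
        std::cerr << "parse error at byte " << pos << ": " << ex.what() << '\n';
        return false;
    }
};

int main() {
    std::ifstream fin("scryfall-all-cards.json");
    TopLevelKeyIndexer idx;
    nlohmann::json::sax_parse(fin, &idx);   // no DOM is built
    std::cout << idx.top_keys.size() << " top-level keys\n";
}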

@igormcoelho

@abrownsword I'm speaking as a regular user who just tried to open a 1.6GB JSON file and realized that it wouldn't work on a 16GB RAM computer... and maybe that's acceptable somewhere in the world, but here in Brazil we usually don't have access to such machinery. On the other hand, I quickly coded a solution that worked for me, by counting brackets, but I feel that more and more people will arrive here and want some straightforward solution.
That said, maybe the solution is to write some sax_parser, but personally I don't have a clue how to do that (OK, if I read the documentation, etc., I'll be able to), but maybe there is some fundamental abstraction missing in the library, something that represents this caching behavior. Or maybe not.

@abrownsword

There is an abstraction to respond to parsing events in a JSON SAX stream. This is fairly easy to use, and it wouldn't be too hard to use it to build an index... except for one problem: I don't believe it provides the current offset. Perhaps the API could be enhanced to do so? I could certainly use that in my use of the SAX parser for error reporting.

@igormcoelho

I don't know the JSON SAX API, but from the few lines of code I've read, I get the feeling that many things can be done; flexibility seems to be quite high. Maybe some internal things need adjustments; I'm not capable of giving any suggestions in that direction at the moment.
But one piece of advice I can give is: if it's indeed possible to implement such strategies as external/auxiliary SAX parsers, then I think some of them should be offered to users as "standard" implementations for "standard big-data problems". That's the spirit of the contribution I'm trying to pass on to this amazing library: maybe there should be some answers for users who come here and couldn't parse large files. Maybe an answer is: "Do you have many fields in the top-level object/dictionary? Try this sax_parser_big_objects. Or do you have long lists? Try this sax_parse_big_lists."
It's a complicated compromise for a general library like this, being efficient for tiny JSON as well, but I feel that the need for large-file parsing will grow more and more every day, and there are simply no ready-to-use alternatives out there at the moment.
Thanks for the directions, @abrownsword. From the little I understood, maybe I could at least avoid the double parsing I'm doing, which would save me 50% of the computational time, which is good. But as long as memory is contained, I'm currently OK with 10x higher computational cost, even if I can decrease it in the next weeks/months.

@abrownsword

I think that's a good suggestion @igormcoelho. My (limited) experience with the existing SAX interface is that it doesn't provide the byte offset at which each parsing event happens. If it did provide that, it would be a lot more powerful. @nlohmann -- perhaps there is a non-breaking way to make the current byte offset available?

@igormcoelho

For what it's worth, RapidJson by Tencent resolved my issues. You can examine what they did to keep memory low.

This is interesting to know, @Meerkov, because as soon as I ran into this issue, I tried RapidJson and had exactly the same problem: document loading consumed all my memory. It was because of your message up there that I decided to look at their project hahaha 😂 but as soon as I had issues again, I decided to code my own solution. I'll take another look at their project.

@igormcoelho

igormcoelho commented Apr 15, 2021

@Meerkov I just re-did my test. I realized I was using rapidjson 1.1.0, which is quite old (2016), so now I tried the master branch version.

#include <cstdio>
#include "rapidjson/document.h"
#include "rapidjson/filereadstream.h"

int main() {
    FILE* fp = fopen("giant.json", "rb");
    char readBuffer[65536];
    rapidjson::FileReadStream is(fp, readBuffer, sizeof(readBuffer));
    rapidjson::Document d;        // DOM parse of the whole file
    d.ParseStream(is);
    fclose(fp);
}

It consumed 4.37GB for my 1.6GB file. That's better than 10GB, but still unusable for me, as our machines are mostly limited to 8GB (some have even less memory...). But anyway, thanks for the advice; it may indeed work for many people, and it's good to know.

@nlohmann
Owner

nlohmann commented May 1, 2021

@igormcoelho Can you share the file?

@igormcoelho

@igormcoelho Can you share the file?

Sorry for the delay in the response, @nlohmann... I think that, for the moment, I cannot share the file due to some privacy issues. But I can later mimic the file structure and come back here... maybe with a random generator of a few lines (it's a much better approach for testing, I guess).

In any case, as a follow-up, I continue to work with the workarounds in VastJSON (which are now multiple, due to many distinct big-file situations), but I've already noticed that some of them could be ported here. My ugly solution of catching exceptions and injecting a manual stream over nlohmann::json could perhaps provide the offset indexer we need, @abrownsword, without needing to change the library. However, it's ugly. It's still on my stack to study how to do that better; for the moment I'm just successfully surviving those big JSONs. I'll come back here.

@igormcoelho

igormcoelho commented May 19, 2021

In fact, @nlohmann, I notice that the worst behavior seems to happen with a file that has quite a simple structure.
It's a set of square matrices, but organized as dictionaries.
Example:

{
   "Entry1": {
          "A": {"A": 0.0, "B": 0.0, "C": 0.0},
          "B": {"A": 0.0, "B": 0.0, "C": 0.0},
          "C": {"A": 0.0, "B": 0.0, "C": 0.0}
   },
   "Entry2": {
          "A": {"A": 0.0, "B": 0.0, "C": 0.0},
          "B": {"A": 0.0, "B": 0.0, "C": 0.0},
          "C": {"A": 0.0, "B": 0.0, "C": 0.0}
   },
   ...
}

If you repeat this pattern for around 5000 top-level entries (Entry1, Entry2, ..., Entry5000), where each "matrix" has around 100 to 200 elements ("A", "B", "C", ..., "Z", "AA", "AB", ...), you will likely get a file of ~1.5GB (I didn't compute it precisely).
And this is the file that seems to expand fastest in memory, in a prohibitive manner.

The good side is that this case is quite tractable, by providing a top-level parser that takes advantage of there being no lists inside the JSON. Like I said before, maybe the best solution for the nlohmann::json project is to provide some "big parser examples", where this is one of the possible scenarios... In general, if the file is big, it is very likely that one has too many top-level entries, whether those are dictionaries or some sort of huge list (I have only handled the first case so far, as it's the one that mattered to me, but the other cases are also tractable).
Another interesting point is that when some of those numbers are NaN, as happened to me, things get even worse; I needed to find another solution for that, in a context where the file wasn't tractable even with plain numbers. So that's maybe another example to provide.
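
Since I can't share the real file, a generator along these lines (my own sketch; the file name and sizes are made up, only the shape follows the description above) should produce a similarly shaped test file:

#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

int main() {
    // row/column labels: "A", "B", ..., "Z", "AA", "AB", ... (~150 of them)
    std::vector<std::string> labels;
    for (int i = 0; labels.size() < 150; ++i) {
        std::string name;
        for (int n = i; ; n = n / 26 - 1) {
            name.insert(name.begin(), static_cast<char>('A' + n % 26));
            if (n < 26) break;
        }
        labels.push_back(name);
    }

    // one "matrix": { "A": {"A": 0.0, ...}, "B": {...}, ... }
    nlohmann::json row, matrix;
    for (const auto& c : labels) row[c] = 0.0;
    for (const auto& r : labels) matrix[r] = row;
    const std::string block = matrix.dump();

    // write the top level incrementally so the generator itself stays small
    std::ofstream out("giant.json");
    out << "{";
    for (int e = 1; e <= 5000; ++e) {
        if (e > 1) out << ",";
        out << "\"Entry" << e << "\":" << block;
    }
    out << "}";
}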
