Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements a JSON.parse that transforms unsafe numbers to bigints #3133

Closed
wants to merge 1 commit into from

Conversation

lorisleiva
Copy link
Contributor

@lorisleiva lorisleiva commented Aug 21, 2024

This PR implements and uses a custom JSON.parse implementation that supports large unsafe JavaScript integers by wrapping them in bigints.

This means we can start accepting RPC values as bigint right now without loss of precision.

It does so by first transforming the JSON string such that any integer outside of quoted strings will be wrapped in a BitIngValueObject if and only if Number.isSafeInteger returns false for that number.

It then uses a JSON.parse reviver to identify that special object wrapper and casts that to a bigint without loss of information.

There are still a lot more tests I'd like to add to this PR but wanted to gather early feedback before continuing my work.

Copy link

changeset-bot bot commented Aug 21, 2024

⚠️ No Changeset found

Latest commit: 845a4a3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @lorisleiva and the rest of your teammates on Graphite Graphite

@lorisleiva lorisleiva force-pushed the loris/json-parse-for-bigints branch from 6102a4d to addd8e9 Compare August 21, 2024 17:41
Copy link
Collaborator

steveluscher commented Aug 21, 2024

Interesting. I would have presumed that serde_json would not produce invalid JSON strings, but it does.

Sandbox link.

use serde_json::json;

fn main() {
    println!("{}", json!({"thing": 9_223_372_036_854_775_807 as i64}));
}
{"thing":9223372036854775807}

I convinced myself that this is the case by looking at the rentEpoch output of the production RPC.

{"rentEpoch":18446744073709551615}

I think I see what your suggestion is here. Being this the case, create this fixer-upper for the current RPC, but then make RPCv2 actually output a value object. web3.js 2.0 will work with either, from the beginning.

Copy link
Collaborator

steveluscher commented Aug 21, 2024

JSON.parse(
    "{\"hi\":9223372036854775807,\"bye\":\"content\"}",
    (_, value, { source }) =>
        source?.match(/^\d+$/)
            ? BigInt(source)
            : value,
);

Sorry. :)

Edit: Ah shit, fuck you, Safari (and Firefox). https://caniuse.com/mdn-javascript_builtins_json_parse_reviver_parameter_context_argument

Comment on lines 29 to 57
function transformUnquotedSegments(json: string, transform: (value: string) => string): string {
/**
* This regex matches any part of a JSON string that isn't wrapped in double quotes.
*
* For instance, in the string `{"age":42,"name":"Alice \"The\" 2nd"}`, it would the
* following parts: `{`, `:42,`, `:`, `}`. Notice the whole "Alice \"The\" 2nd" string
* is not matched as it is wrapped in double quotes and contains escaped double quotes.
*
* The regex is composed of two parts:
*
* 1. The first part `^([^"]+)` matches any character until we reach the first double quote.
* 2. The second part `("(?:\\"|[^"])+")([^"]+)` matches any double quoted string that may
* and any unquoted segment that follows it. To match a double quoted string, we use the
* `(?:\\"|[^"])` regex to match any character that isn't a double quote whilst allowing
* escaped double quotes.
*/
const unquotedSegmentsRegex = /^([^"]+)|("(?:\\"|[^"])+")([^"]+)/g;

return json.replaceAll(unquotedSegmentsRegex, (_, firstGroup, secondGroup, thirdGroup) => {
// If the first group is matched, it means we are at the
// beginning of the JSON string and we have an unquoted segment.
if (firstGroup) return transform(firstGroup);

// Otherwise, we have a double quoted string followed by an unquoted segment.
return `${secondGroup}${transform(thirdGroup)}`;
});
}
Copy link
Collaborator

@steveluscher steveluscher Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be really tempted to implement this as a parser instead of a regex. Walking every character:

  1. Toggle inQuote on and off when you encounter a non-escaped ".
  2. If inQuote is false and you encounter a contiguous run of numeric digits without a . in the middle, then wrap 'em.
const str = '{"hi":9223372036854775807,"there":123.4,"now":2E32,"then":1.2E32,"bye":"content is \\"1337\\""}';
let out = [];
let inQuote = false;
for (let ii = 0; ii < str.length; ii++) {
    let isEscaped = false;
    if (str[ii] === '\\') {
        out.push(str[ii++]);
        isEscaped = !isEscaped;
    }
    if (str[ii] === '"') {
        out.push(str[ii++]);
        if (!isEscaped) {
            inQuote = !inQuote;
        }
    }
    if (!inQuote) {
        let consumedNumber = '';
        while (str[ii]?.match(/[\d\.Ee]/)) {
            consumedNumber += str[ii++];
        }
        if (consumedNumber.length) {
            if (consumedNumber.includes('.')) {
                out.push(consumedNumber);
            } else {
                out.push(`{"$n":"${consumedNumber}"}`)
            }
        }
    }
    out.push(str[ii]);
}
console.log(out.join(''));
// {"hi":{"$n":"9223372036854775807"},"there":123.4,"now":{"$n":"2E32"},"then":1.2E32,"bye":"content is \"1337\""}

Still need to do some work to convert 2E32 to a bigint; it's not as straightforward as BigInt('2E32').

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree. Regexes always feel a bit hacky but at the same time I was struggling to come up with examples that would not work with this implementation so I though it might be the least intrusive way to implement this. That being said, we still need a shit loads more test to make sure we're not messing up with the integrity of the response data.

I'm up for giving the parser a go though. I don't know if you've seen this library but it could be useful to get some inspiration from. Tbh, I started looking at this parsing logic first and though: hmm maybe a parser is more complicated than it looks like and moved to regexes instead haha. But I think this library does a bit too much for our need anyway.

Anyway, to summary my chaotic thoughts, we can either:

  1. Continue to explore regexes.
  2. Move towards a parsing solution that only transforms the string (what you're suggesting).
  3. Move towards a full parsing solution that outputs parsed data directly by forking/editing this library.

In any case, we need lots more tests.

I lean towards solution 2, it seems like a clean and robust approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String-to-string parsing will be way more resilient, I think. It's already resilient to annoying shit like newlines, spaces, and infinite \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ escape sequences, without added complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented solution 2 in two different ways:

  • One that uses a simple regex to consume numbers.
  • One that uses plain loops to consume numbers.

The while (str[ii]?.match(/[\d\.Ee]/)) wasn't enough to correctly match numbers — e.g. it was matching true because of the e and returning {"result":tru{"$n":"e"}}.

I've also left the old regex-only implementation to compare in the benchmarks but I'm happy to remove it because the parsing is much more robust.

@lorisleiva
Copy link
Contributor Author

Damn that's a shame about the source attribute. It would have been perfect.

Copy link
Contributor

@buffalojoec buffalojoec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @lorisleiva - nice work! I'm stoked to get rid of all the UnsafeBeyond2Pow53Minus1 now. 🚀

I think we should strongly consider benchmarking this parser, though, before cutting a release with it in the response transformer. Something like taking some huge JSON payloads that could come back from the RPC (think getProgramAccounts) and comparing native JSON.parse (or .json()) to this modded version with the reviver. Maybe @steveluscher knows some slick ways to do these benches?

packages/rpc-transport-http/src/json-parse-with-bigint.ts Outdated Show resolved Hide resolved
@lorisleiva lorisleiva force-pushed the loris/json-parse-for-bigints branch 15 times, most recently from bd6725e to 86ed7d4 Compare August 22, 2024 14:38
@lorisleiva lorisleiva force-pushed the loris/json-parse-for-bigints branch from 86ed7d4 to 845a4a3 Compare August 22, 2024 20:36
Copy link

@lorisleiva
Copy link
Contributor Author

As discussed offline, let's revisit this after a significant RPC refactoring that returns a Response object instead of the parsed response data.

@lorisleiva lorisleiva closed this Aug 22, 2024
Copy link
Contributor

github-actions bot commented Sep 9, 2024

Because there has been no activity on this PR for 14 days since it was merged, it has been automatically locked. Please open a new issue if it requires a follow up.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants