
Memory usage on large inputs #56

Open

melisgl opened this issue Jun 9, 2023 · 2 comments

@melisgl (Contributor) commented Jun 9, 2023

I'm using the per-block implementation in parse-doc, but it's still fairly easy to run out of memory on large %blocks with something like this:

CL-USER> (time
          (let ((input (with-output-to-string (out)
                         (loop repeat 100000
                               do (format out "- ~A ~A ~A ~A~%"
                                          (random 1000000) (random 1000000)
                                          (random 1000000) (random 1000000))))))
            (3bmd-grammar::parse-doc input)
            (length input)))
Evaluation took:
  12.364 seconds of real time
  12.371129 seconds of total run time (11.750773 user, 0.620356 system)
  [ Run times consist of 5.771 seconds GC time, and 6.601 seconds non-GC time. ]
  100.06% CPU
  37,030,481,562 processor cycles
  15,570,202,368 bytes consed
  
2955662
CL-USER> (/ 15570202368 2955662.0)
5267.924

This example uses a bulleted list because it is probably the worst offender, but a large paragraph behaves similarly.

According to time, consing scales linearly with the number of repeats, which is good. Perhaps 5267 bytes consed per input character is too high, but I suspect that the main problem is that the maximum size of the working set also scales linearly.

@3b (Owner) commented Jun 11, 2023

Yeah, memory use can be a problem for packrat parsers, since they memoize a lot in return for linear runtime.
From the packrat thesis:

The primary disadvantage of packrat parsing is its storage cost, which is a constant multiple of the total input size rather than being proportional to the nesting depth of the syntactic constructs appearing in the input.
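
To make that concrete, here is a generic sketch of the packrat memoization technique (not esrap's actual internals): every (rule, position) pair gets a cache entry that is retained for the whole parse, so the table grows with rules × input length rather than with nesting depth.

;; Generic packrat memoization sketch, not esrap's implementation:
;; one cached result per (rule, position) pair, kept for the whole
;; parse, so memory is O(number-of-rules * input-length).
(defvar *memo* (make-hash-table :test #'equal))

(defun apply-rule (rule input position)
  (let ((key (cons rule position)))
    (multiple-value-bind (result presentp) (gethash key *memo*)
      (if presentp
          result                        ; reuse the cached result
          (setf (gethash key *memo*)    ; cache it for the rest of the parse
                (funcall rule input position))))))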

The best fix would probably be to replace it with a CommonMark parser, since that spec is designed so documents can be parsed line by line. As mentioned in #53 though, that would require rewriting things, and at best it would probably be only partially compatible with existing 3bmd code.

You can see what it is doing with

(esrap:trace-rule '3bmd-grammar::doc :recursive t)

before running a parse (try it with small inputs; one line of the list from your test produces about 1500 lines of output). Call (esrap:untrace-all-rules) to turn it off again.
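
For example (the one-line input here is just an illustrative assumption):

;; Trace a tiny parse; even a single list item is very verbose.
(esrap:trace-rule '3bmd-grammar::doc :recursive t)
(3bmd-grammar::parse-doc (format nil "- 1 2 3 4~%"))
(esrap:untrace-all-rules)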

If I remember correctly how packrat works, each of the | or -> entries in the trace is memoized, which is why there is so much overhead.

And now that I look at the trace, I remember that list items (and probably a few other things) do recursive parses, so some work is repeated, but a lot of that consing is actually thrown away at the end of the recursive parse.

Possibly adding (:text t) to the raw-line rule would help with the working set, if you have some way to test that? (It doesn't seem to affect consing, but I think the rule would hold on to much less data that way without adding much processing overhead.)
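
To show what (:text t) does, here is a standalone esrap example (not 3bmd's actual raw-line rule):

;; Without :text, this production is a nested list of (nil #\c)
;; pairs, one per character; with (:text t), esrap flattens the
;; match to a single string, so the memo table retains far less
;; structure per cached result.
(esrap:defrule demo-line (* (and (! #\Newline) character))
  (:text t))

(esrap:parse 'demo-line "hello world") ; => "hello world"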

There might be some other similar places where we can reduce the size of the cache by throwing away things that are only there for the grammar and not needed in the output, will have to look into that at some point. Might also be some places where we can reorder or add rules to avoid testing things that could be proven impossible. For example it seems to be doing extra work to determine there isn't a block at eof, which affects consing buts gets thrown away at the end of the recursive parse so doesn't affect working set much.

(defrule %block (and (! eof) (* blank-line)
                     #.(cons 'or %block-rules%))
  (:destructure (eof blank block)
    (declare (ignore eof blank))
    block))
;; define-extension-block needs to be updated to match as well, I think?

would avoid that if you want to try it, but I need to think/test a bit more to be sure it doesn't have any side effects. Maybe something similar for inlines?

@melisgl (Contributor, Author) commented Feb 18, 2025

That %BLOCK definition doesn't seem to speed things up for me. I ended up doing what PARSE-DOC does, but conditionally on the input size:
https://github.com/melisgl/mgl-pax/blob/4847e82/src/document/markdown.lisp#L20
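
In outline, the dispatch looks something like this (a minimal sketch of the idea only; the threshold value and the function name are assumptions, the real code is at the link above):

;; Size-conditional parsing sketch, not the actual mgl-pax code.
(defparameter *block-parse-threshold* 100000)

(defun parse-markdown (string)
  (if (< (length string) *block-parse-threshold*)
      ;; Small input: one whole-document parse.
      (esrap:parse '3bmd-grammar::doc string)
      ;; Large input: block-at-a-time parsing, like
      ;; 3BMD-GRAMMAR::PARSE-DOC, to bound the packrat cache.
      (3bmd-grammar::parse-doc string)))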
