Debug performance differences #5
Comments
I'm running the viztracer analysis steps documented here: https://sdk.meltano.com/en/latest/dev_guide.html#testing-performance
If the record volume is large, my guess is that non-batch is spending a lot of time serializing data. We did ship meltano/sdk#1196 a while ago, so if serialization is the reason, we may need to start looking into faster methods.
What is so unexpected to me is that RecordMessage dicts would add anything significant above and beyond what the jsonl batch files need. I'd expect most of the dict transformation cost to also hit the jsonl implementation. This implementation uses pandas to create the initial dict, which might be faster than other methods of creating or modifying dicts, but even then, the dict still needs to be converted to a JSON string in both cases. The batch implementation has almost zero cost, which is really surprising to me in contrast.
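To make that comparison concrete, here is a minimal sketch of the two output paths in plain Python. It does not use the SDK: the function names are hypothetical, and the only thing assumed about the wire format is the standard Singer RECORD message shape (`type`, `stream`, `record`). Both paths call `json.dumps` once per record, which is why serialization alone would not be expected to explain the gap.

```python
import gzip
import io
import json


def write_record_messages(records, stream_name, out):
    """Non-batch path (hypothetical helper): one RECORD message per line."""
    for record in records:
        message = {"type": "RECORD", "stream": stream_name, "record": record}
        out.write(json.dumps(message) + "\n")


def write_jsonl_gz_batch(records, out):
    """Batch path (hypothetical helper): gzip-compressed JSON Lines."""
    # GzipFile leaves the underlying file object open when it is closed.
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for record in records:
            gz.write(json.dumps(record).encode("utf-8") + b"\n")


if __name__ == "__main__":
    sample = [{"id": i, "name": f"customer-{i}"} for i in range(3)]

    text_out = io.StringIO()
    write_record_messages(sample, "customers", text_out)

    batch_out = io.BytesIO()
    write_jsonl_gz_batch(sample, batch_out)

    print(text_out.getvalue())
    print(gzip.decompress(batch_out.getvalue()).decode("utf-8"))
```

Timing each function separately on the same records is one way to check whether the extra message envelope is a meaningful part of the cost.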
Update: On looking more into the code, I can see there are several methods that get called from within I'll enumerate and analyze these as a next step. Update (2): It appears that the slowdown is in Specifically: |
Okay, I think I have a plan.
@aaronsteers that exists in the form of |
Thanks! Yeah, I was looking at that. Much appreciated.
Actually - Looks like the above two methods were a red herring. After making Then I compared the performance of this:
With this:
And the performance hit was identical. It turns out my PandasStream class was calculating the schema dynamically each time. I had assumed it was called only during discovery or at the beginning of the stream, which is why I didn't optimize it or cache it internally. Very sorry for the false alarm. I'll post back shortly with the performance once the schema is cached internally. I expect I'll see a huge speedup after that.
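For anyone hitting the same thing, this is roughly the shape of the problem and the fix. It is a minimal sketch, not the actual PandasStream code: the class names, the `emit` loop, and the `infer_schema` helper are all hypothetical, and the per-record schema access stands in for whatever the sync loop reads on each record. The fix assumed here is `functools.cached_property`, which computes the value once per instance.

```python
from functools import cached_property


def infer_schema(rows):
    """Hypothetical stand-in for an expensive schema calculation."""
    props = {}
    for row in rows:
        for key, value in row.items():
            props.setdefault(key, {"type": type(value).__name__})
    return {"type": "object", "properties": props}


class UncachedStream:
    """Recomputes the schema on every access (the slow behaviour)."""

    def __init__(self, rows):
        self.rows = rows
        self.schema_builds = 0  # counts how often the schema is rebuilt

    @property
    def schema(self):
        self.schema_builds += 1
        return infer_schema(self.rows)

    def emit(self):
        for row in self.rows:
            _ = self.schema  # stands in for a per-record schema lookup
            yield row


class CachedStream:
    """Computes the schema once and reuses it afterwards."""

    def __init__(self, rows):
        self.rows = rows
        self.schema_builds = 0

    @cached_property
    def schema(self):
        self.schema_builds += 1
        return infer_schema(self.rows)

    def emit(self):
        for row in self.rows:
            _ = self.schema  # served from the instance cache after the first call
            yield row
```

With the uncached version the schema work grows with the record count, so a cost that looks negligible during discovery can dominate a large sync.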
ceacde8#diff-361f0bf4b027dd5af43140c9c304f69595460090c289b5b610b9ed00e199b81b
This seems to resolve the performance issues. Runtimes are now consistently <20 seconds. Resolving.
As noted in #4, there's a large difference between the speed of output to a .jsonl.gz batch file and the speed of output to a text file via STDOUT piping. This may be unavoidable, but it is a large enough difference that it would be worth analysis with performance monitoring to see if anything seems obvious.
time pipx run jafgen
meltano.yml
Batch-based execution log (15.5 seconds)
Non-batch execution log (2min and 12sec)
cc @edgarrmondragon, @kgpayne