Debug performance differences #5

aaronsteers · 2023-03-31T19:22:59Z

As noted in #4, there's a large difference in speed of output to .jsonl.gz batch file, versus speed of output to a text file via STDOUT piping. This may be unavoidable, but it it a large enough difference that it would be worth analysis with performance monitoring to see if anything seems obvious.

Invoking the base library directly (about 20 seconds):
- Replicate with time pipx run jafgen
Invoking with batch (about 20 seconds)
- Log below, produced by invoking via Meltano with batch config provided
Invoking without batch (about 2 minutes)
- Log below, produced by commenting out the batch config in meltano.yml

Batch-based execution log (15.5 seconds)

meltanolabs-tap-jaffle-shop-py3.10aj@ajs-macbook-pro tap-jaffle-shop % time poetry run meltano invoke tap-jaffle-shop > outfile.singer.jsonl
2023-03-31T18:22:32.793376Z [info     ] Environment 'test' is active
[SIM]: 100%|██████████████████████████████████████████████████████████████████████████████| 365/365 [00:03<00:00, 94.93it/s]
2023-03-31 11:22:44,611 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_customers'...
2023-03-31 11:22:44,611 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 11:22:44,612 | INFO     | tap-jaffle-shop      | Starting conversion.
2023-03-31 11:22:44,614 | INFO     | tap-jaffle-shop      | Completed conversion.
2023-03-31 11:22:44,619 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.006695985794067383, "tags": {"stream": "tap_jaffle_shop-raw_customers", "context": {}, "status": "succeeded"}}
2023-03-31 11:22:44,619 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 931, "tags": {"stream": "tap_jaffle_shop-raw_customers", "context": {}}}
2023-03-31 11:22:44,641 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "batch_count", "value": 1, "tags": {"context": null, "stream": "tap_jaffle_shop-raw_customers"}}
2023-03-31 11:22:44,641 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_items'...
2023-03-31 11:22:44,641 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 11:22:44,642 | INFO     | tap-jaffle-shop      | Starting conversion.
2023-03-31 11:22:44,900 | INFO     | tap-jaffle-shop      | Completed conversion.
2023-03-31 11:22:45,246 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.6041810512542725, "tags": {"stream": "tap_jaffle_shop-raw_items", "context": {}, "status": "succeeded"}}
2023-03-31 11:22:45,246 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 96829, "tags": {"stream": "tap_jaffle_shop-raw_items", "context": {}}}
2023-03-31 11:22:46,637 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "batch_count", "value": 1, "tags": {"context": null, "stream": "tap_jaffle_shop-raw_items"}}
2023-03-31 11:22:46,637 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_orders'...
2023-03-31 11:22:46,637 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 11:22:46,637 | INFO     | tap-jaffle-shop      | Starting conversion.
2023-03-31 11:22:46,908 | INFO     | tap-jaffle-shop      | Completed conversion.
2023-03-31 11:22:47,122 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.4840810298919678, "tags": {"stream": "tap_jaffle_shop-raw_orders", "context": {}, "status": "succeeded"}}
2023-03-31 11:22:47,122 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 59097, "tags": {"stream": "tap_jaffle_shop-raw_orders", "context": {}}}
2023-03-31 11:22:48,284 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "batch_count", "value": 1, "tags": {"context": null, "stream": "tap_jaffle_shop-raw_orders"}}
2023-03-31 11:22:48,284 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_products'...
2023-03-31 11:22:48,284 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 11:22:48,285 | INFO     | tap-jaffle-shop      | Starting conversion.
2023-03-31 11:22:48,285 | INFO     | tap-jaffle-shop      | Completed conversion.
2023-03-31 11:22:48,285 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.0007927417755126953, "tags": {"stream": "tap_jaffle_shop-raw_products", "context": {}, "status": "succeeded"}}
2023-03-31 11:22:48,285 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 10, "tags": {"stream": "tap_jaffle_shop-raw_products", "context": {}}}
2023-03-31 11:22:48,287 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "batch_count", "value": 1, "tags": {"context": null, "stream": "tap_jaffle_shop-raw_products"}}
2023-03-31 11:22:48,287 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_stores'...
2023-03-31 11:22:48,287 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 11:22:48,288 | INFO     | tap-jaffle-shop      | Starting conversion.
2023-03-31 11:22:48,288 | INFO     | tap-jaffle-shop      | Completed conversion.
2023-03-31 11:22:48,288 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.0006358623504638672, "tags": {"stream": "tap_jaffle_shop-raw_stores", "context": {}, "status": "succeeded"}}
2023-03-31 11:22:48,289 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 5, "tags": {"stream": "tap_jaffle_shop-raw_stores", "context": {}}}
2023-03-31 11:22:48,290 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "batch_count", "value": 1, "tags": {"context": null, "stream": "tap_jaffle_shop-raw_stores"}}
2023-03-31 11:22:48,290 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_supplies'...
2023-03-31 11:22:48,290 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 11:22:48,290 | INFO     | tap-jaffle-shop      | Starting conversion.
2023-03-31 11:22:48,291 | INFO     | tap-jaffle-shop      | Completed conversion.
2023-03-31 11:22:48,291 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.0009648799896240234, "tags": {"stream": "tap_jaffle_shop-raw_supplies", "context": {}, "status": "succeeded"}}
2023-03-31 11:22:48,292 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 65, "tags": {"stream": "tap_jaffle_shop-raw_supplies", "context": {}}}
2023-03-31 11:22:48,293 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "batch_count", "value": 1, "tags": {"context": null, "stream": "tap_jaffle_shop-raw_supplies"}}
poetry run meltano invoke tap-jaffle-shop > outfile.singer.jsonl  16.12s user 2.18s system 104% cpu 17.570 total
meltanolabs-tap-jaffle-shop-py3.10aj@ajs-macbook-pro tap-jaffle-shop %

Non-batch execution log (2min and 12sec)

meltanolabs-tap-jaffle-shop-py3.10aj@ajs-macbook-pro tap-jaffle-shop % time poetry run meltano invoke tap-jaffle-shop > outfile.singer.jsonl
2023-03-31T19:33:19.423783Z [info     ] Environment 'test' is active
[SIM]: 100%|███████████████████████████████████████████████████████████████████████████████| 365/365 [00:03<00:00, 98.90it/s]
2023-03-31 12:33:33,292 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_customers'...
2023-03-31 12:33:33,293 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 12:33:33,671 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.3783752918243408, "tags": {"stream": "tap_jaffle_shop-raw_customers", "context": {}, "status": "succeeded"}}
2023-03-31 12:33:33,672 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 939, "tags": {"stream": "tap_jaffle_shop-raw_customers", "context": {}}}
2023-03-31 12:33:33,672 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_items'...
2023-03-31 12:33:33,672 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 12:34:26,443 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 52.77098894119263, "tags": {"stream": "tap_jaffle_shop-raw_items", "context": {}, "status": "succeeded"}}
2023-03-31 12:34:26,444 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 96110, "tags": {"stream": "tap_jaffle_shop-raw_items", "context": {}}}
2023-03-31 12:34:26,444 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_orders'...
2023-03-31 12:34:26,444 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 12:35:26,445 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 57722, "tags": {"stream": "tap_jaffle_shop-raw_orders", "context": {}}}
2023-03-31 12:35:28,365 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 61.9208881855011, "tags": {"stream": "tap_jaffle_shop-raw_orders", "context": {}, "status": "succeeded"}}
2023-03-31 12:35:28,365 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 1794, "tags": {"stream": "tap_jaffle_shop-raw_orders", "context": {}}}
2023-03-31 12:35:28,366 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_products'...
2023-03-31 12:35:28,366 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 12:35:28,375 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.009012937545776367, "tags": {"stream": "tap_jaffle_shop-raw_products", "context": {}, "status": "succeeded"}}
2023-03-31 12:35:28,375 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 10, "tags": {"stream": "tap_jaffle_shop-raw_products", "context": {}}}
2023-03-31 12:35:28,375 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_stores'...
2023-03-31 12:35:28,375 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 12:35:28,379 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.0038499832153320312, "tags": {"stream": "tap_jaffle_shop-raw_stores", "context": {}, "status": "succeeded"}}
2023-03-31 12:35:28,380 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 5, "tags": {"stream": "tap_jaffle_shop-raw_stores", "context": {}}}
2023-03-31 12:35:28,380 | INFO     | tap-jaffle-shop      | Beginning full_table sync of 'tap_jaffle_shop-raw_supplies'...
2023-03-31 12:35:28,380 | INFO     | tap-jaffle-shop      | Tap has custom mapper. Using 1 provided map(s).
2023-03-31 12:35:28,445 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "timer", "metric": "sync_duration", "value": 0.06451582908630371, "tags": {"stream": "tap_jaffle_shop-raw_supplies", "context": {}, "status": "succeeded"}}
2023-03-31 12:35:28,445 | INFO     | singer_sdk.metrics   | INFO METRIC: {"metric_type": "counter", "metric": "record_count", "value": 65, "tags": {"stream": "tap_jaffle_shop-raw_supplies", "context": {}}}
poetry run meltano invoke tap-jaffle-shop > outfile.singer.jsonl  125.34s user 4.13s system 98% cpu 2:12.02 total

cc @edgarrmondragon, @kgpayne

The text was updated successfully, but these errors were encountered:

aaronsteers · 2023-03-31T19:42:51Z

I'm running the viztracer analysis steps documented here: https://sdk.meltano.com/en/latest/dev_guide.html#testing-performance

time poetry run viztracer tap-jaffle-shop > outfile.singer.jsonl

edgarrmondragon · 2023-03-31T19:46:21Z

If the record volume is large, my guess is that non-batch is spending a lot of time serializing data. We did ship meltano/sdk#1196 a while ago, so if serialization is the reason, we may need to start looking into faster methods.

aaronsteers · 2023-03-31T23:52:16Z

If the record volume is large, my guess is that non-batch is spending a lot of time serializing data. We did ship meltano/sdk#1196 a while ago, so if serialization is the reason, we may need to start looking into faster methods.

What is so unexpected to me is why RecordMessage dicts would add anything significant above-and-beyond what the jsonl batch files need...

I'd expect most of the dict transformation cost to also hit the jsonl implementation. This implementation uses pandas to create the initial dict, which might be faster than other methods of creating/modifying dicts... but even then, the dict still needs to be converted to a json string in both cases. The batch implementation has almost zero cost, which is just really surprising to me in contrast.

aaronsteers · 2023-03-31T23:59:23Z

Update: On looking more into the code, I can see there are several methods that get called from within _write_record_message() but not from within _sync_batches().

I'll enumerate and analyze these as a next step.

Update (2): It appears that the slowdown is in _generate_record_messages().

Specifically: conform_record_data_types() and pop_deselected_record_properties() are contributing about one minute each to the runtime. If I comment out both, I get approx. 20 seconds. If I comment out either of them, I get approx 1:15. If I leave both in, I get approx. 2:20.

aaronsteers · 2023-04-01T00:51:30Z

Okay, I think I have a plan.

conform_record_data_types() is a safeguard which can be skipped by the developer or the user if they are sure that conformance is pre-handled.
- Proposal: Add a dev-level option to skip type conformance. Optionally devs can pass this to users as config option.
conform_record_data_types() also contains a check for unmapped properties, which likewise can be skipped if we're sure the dataset is already compliant with the schema.
- Proposal: Same as (1) above.
pop_deselected_record_properties() can be skipped if we've not deselected anything. But otherwise, it needs to be applied.
- Proposal: No change. See below.
pop_deselected_record_properties() is necessary if we're not selective everything. However, the current logic has branching and selection mask check logic happening for each record. If we pre-compile a handling function and store it at the beginning of the sync, this could significantly improve performance.
- Proposal: Pre-compile a transformation function for each stream which will pop deselected properties. Similar to stream map functions, this function would use branching logic upfront but then after compilation it would simply apply the stored transforms. For example, if 5 properties need to be dropped, the function would contain 5 "pop" operations, but zero comparison or branching code.
- POC: Before implementing, I want to confirm this will make a difference. I'll probably create an arbitrary transformation function to apply, with one or two pop operations, to confirm the intended improvements.

edgarrmondragon · 2023-04-01T00:56:04Z

Proposal: Add a dev-level option to skip type conformance. Optionally devs can pass this to users as config option.

@aaronsteers that exists in the form of Stream.TYPE_CONFORMANCE_LEVEL.

aaronsteers · 2023-04-01T01:10:31Z

Thanks! Yeah, I was looking at that. Much appreciated.

aaronsteers · 2023-04-01T01:27:00Z

Actually - Looks like the above two methods were a red herring.

After making pop_deselected_record_properties() a no-op, I was still seeing the performance hit.

Then I compared the performance of this:

    pop_deselected_record_properties(record, self.schema, self.mask, self.logger)

With this:

    _ = (record, self.schema, self.mask, self.logger)

And the performance hit was identical.

Turns out, my PandasStream class was calculating schema dynamically each time. I may have thought it was called only during discovery or at the beginning of the stream, which is why I didn't optimize it or cache internally.

Very sorry for the false alarm. I'll post back shortly the performance with schema internally cached. I expect I'll see a huge speedup after that.

aaronsteers · 2023-04-01T01:41:08Z

ceacde8#diff-361f0bf4b027dd5af43140c9c304f69595460090c289b5b610b9ed00e199b81b

This seems to resolve the performance issues. Runtimes are now consistently <20 seconds.

Resolving.

aaronsteers closed this as completed Apr 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Debug performance differences #5

Debug performance differences #5

aaronsteers commented Mar 31, 2023 •

edited

Loading

aaronsteers commented Mar 31, 2023 •

edited

Loading

edgarrmondragon commented Mar 31, 2023

aaronsteers commented Mar 31, 2023

aaronsteers commented Mar 31, 2023 •

edited

Loading

aaronsteers commented Apr 1, 2023

edgarrmondragon commented Apr 1, 2023

aaronsteers commented Apr 1, 2023

aaronsteers commented Apr 1, 2023 •

edited

Loading

aaronsteers commented Apr 1, 2023 •

edited

Loading

Debug performance differences #5

Debug performance differences #5

Comments

aaronsteers commented Mar 31, 2023 • edited Loading

aaronsteers commented Mar 31, 2023 • edited Loading

edgarrmondragon commented Mar 31, 2023

aaronsteers commented Mar 31, 2023

aaronsteers commented Mar 31, 2023 • edited Loading

aaronsteers commented Apr 1, 2023

edgarrmondragon commented Apr 1, 2023

aaronsteers commented Apr 1, 2023

aaronsteers commented Apr 1, 2023 • edited Loading

aaronsteers commented Apr 1, 2023 • edited Loading

aaronsteers commented Mar 31, 2023 •

edited

Loading

aaronsteers commented Mar 31, 2023 •

edited

Loading

aaronsteers commented Mar 31, 2023 •

edited

Loading

aaronsteers commented Apr 1, 2023 •

edited

Loading

aaronsteers commented Apr 1, 2023 •

edited

Loading