Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change(taps): Don't resend SCHEMA messages if they match the most recent schema sent #1061

Open
aaronsteers opened this issue Oct 11, 2022 · 8 comments

Comments

@aaronsteers
Copy link
Contributor

aaronsteers commented Oct 11, 2022

Feature scope

Taps (catalog, state, stream maps, etc.)

Problem Description

For taps with parent-child relationships, I believe we currently send a new SCHEMA message at the start of each child stream from a given context.

For streams that have a relatively low number of child items per parent ('low' meaning anything less than 1K or 10K), the additional SCHEMA messages may be unnecessarily hinting to targets that they should flush/drain their caches.

Proposal Description

The proposal here would be to maintain some type of cache of last-sent SCHEMA message, per stream_name or stream_alias (if stream maps are applied) and then to skip sending the SCHEMA message if the new schema matches exactly to the last one that was sent.

Target-side mitigation

While the sending of extra SCHEMA messages in the tap is problematic in many cases because it triggers unwanted behaviors in the targets, there's a target-side mitigation as well: which is simply to perform the same checks on the target side and only flush batches if the new SCHEMA message is different from the prior STATE message received for that stream name.

@aaronsteers aaronsteers added kind/Feature New feature or request valuestream/SDK labels Oct 11, 2022
@aaronsteers
Copy link
Contributor Author

aaronsteers commented Oct 11, 2022

@edgarrmondragon - Can you confirm that the current behavior is accurately described above?

@aaronsteers aaronsteers changed the title change: Don't resend SCHEMA messages if they match the most recent schema sent change(taps): Don't resend SCHEMA messages if they match the most recent schema sent Oct 11, 2022
@edgarrmondragon
Copy link
Collaborator

edgarrmondragon commented Oct 11, 2022

@aaronsteers Maybe we could make it clear that it's the child's SCHEMA message that's emitted for every parent record.

@aaronsteers
Copy link
Contributor Author

@edgarrmondragon - That could probably work... I think generally the methods to send SCHEMA messages are intentionally private - so I worry slightly about the parent stream reaching in and calling those. (Stream maps make this more complex also, since it is possible more than one SCHEMA message is needed per child stream iteration.

I was leaning towards a class-level cache or a global cache for deduping, since that keeps the implementation mostly unchanged otherwise.

@edgarrmondragon
Copy link
Collaborator

@aaronsteers a class-level cache makes sense, maybe by hashing the schema in JSON string form with sorted keys? Otherwise hashing a dictionary may have bad performance and not be reliable.

@aaronsteers
Copy link
Contributor Author

aaronsteers commented Oct 11, 2022

For a higher-frequency message, I would be more wary of performance. But because SCHEMA messages are only sent infrequently, I think the perf cost will be less of an issue.

In terms of reliability, there may be better means, but in the past I've had good experience with memoization.cached, which accepts dict values as cache index keys.

https://pypi.org/project/memoization/

@alberto-of
Copy link

alberto-of commented Jan 29, 2023

Monkey-patching solution

class BaseStream(Stream):
    def _write_state_message(self) -> None:
        pass

My problem is with STATE messages. SCHEMA messages are 1:1 to streams.

@laurentS
Copy link
Contributor

laurentS commented Feb 9, 2023

Just chiming in here after hitting this problem with tap-github: For about 5500 records, I have ~450 SCHEMA messages, which have a fairly high impact on the target side (postgres), as the db will generate a DDL SQL query for each such message. After dumping the messages to a text file, I ran that file through target-postgres before and after removing redundant SCHEMA messages. It cut processing time in half on the db side.
I agree that the target could filter them, but I feel that there is no need to generate them on the tap side in the first place, given the performance impact this can have.

@stale
Copy link

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@stale stale bot removed the stale label Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants