Feature: Add end_date support in generic tap config
#922
Comments
If the end_date feature allowed an arbitrary Python or SQL expression, that would vastly expand its usefulness and would solve your use-case item 4, "Intentionally skipping over records which have not reached a minimal cool-off period," which I am working on solving in my Meltano setup right now. If the end_date statement were Python, it would be really, really helpful to have access to the datetime library in the execution environment or some equivalent (a datetime object would probably suffice too). Of course, SQL comes with some built-in datetime handling; Python does not... or rather, it doesn't in the simpleeval execution environment that's used in stream-maps expressions.

To expand on the use case I'm hoping this feature could help solve: I'm using the transferwise postgres tap to incrementally load from a read replica of a production DB. (One implication of this is that I can't alter the source data at all.) The table I'm having trouble with represents "daily visits", so a row is created for each user each day upon their first visit to the site, and the values in it are updated regularly throughout the day if they return. The table is guaranteed to be finalized by EOD UTC.

One way this could be solved in theory is if the source table had an "updated_at" field which could be used for incremental loading, but this table doesn't have that field. (Since Meltano can't handle this use case right now, the solution we're pursuing is adding that field to this table.) It seems like it should be straightforward to solve: I just need to be able to write, somewhere in Meltano, a Python or SQL expression that selects only rows whose day has already ended in UTC.

There was a discussion of this problem in Slack as well.
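As an illustration of the cool-off check described here (a sketch only, not an existing Meltano or SDK hook), assuming a hypothetical `visit_date` field on each record:

```python
from datetime import datetime, timezone

def is_finalized(record: dict) -> bool:
    """Keep a "daily visits" row only once its day has fully ended in UTC.

    `visit_date` is a hypothetical field name; any date column on the
    source table would work the same way.
    """
    visit_date = datetime.fromisoformat(record["visit_date"]).date()
    # Rows dated before today's UTC date can no longer change.
    return visit_date < datetime.now(timezone.utc).date()
```

An end_date computed as midnight UTC of the current day would express the same rule declaratively.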
@qbatten - Thanks for adding this detail. The question of if/how to accept dynamic expressions into the config is something we may need to think more about; I'd want to be especially careful with that. The idea of adding dynamic datemath expression extensions into the Stream Maps capability is an easy "yes" in terms of deciding if we should add it. You can search
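Purely to illustrate the kind of date-math extension being discussed (what helpers would be available is exactly the open question), a stream-maps filter might look something like this; `utc_today` is an imagined variable that would have to be injected into the simpleeval environment, not something the SDK exposes today:

```python
# Hypothetical tap settings showing a stream-maps filter with date math.
# The __filter__ expression is evaluated with simpleeval; `utc_today`
# is an assumed helper, not part of the current expression environment.
tap_config = {
    "stream_maps": {
        "daily_visits": {
            "__filter__": "visit_date < utc_today",
        },
    },
}
```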
On its own, the
Requested by:
@aaronsteers thoughts on this as part of a 1.0?
Yes, I think it would be great to get in by 1.0. It opens up new options for us and our users.
Related: due to cool-off requirements on the source API, there's a WIP PR here to add this to
This specific comment includes a proposal for relative
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
There are many use cases which could benefit from a generic and SDK-handled `end_date` config option, to match the `start_date` config option that is specified in the Singer Spec, and which most taps already support.

A few use cases:

- Backfilling or re-running historic data in distinct time chunks, potentially in parallel (see the caveats below).
- Intentionally skipping over records which have not reached a minimal cool-off period.
- (Correcting an earlier run which used a `start_date` value such as `Jan 1 2022`, which was not inclusive of all time periods.)

Spec
- Add an `end_date` config input. When provided, records received from the source would be ignored and/or filtered out if they were later than the provided `end_date`.
- (Possibly defaulting to a `utcnow()` calculation at the start of execution.)
- Taps whose APIs accept `start` and `end` filters can get exactly the records needed.
- Taps whose APIs accept an `end` filter can cancel their paginated calls early if the output is known to be sorted and we've already passed the `end_date` limit.
- Taps whose APIs do not accept an `end` filter and also do not generate a sorted output may be forced to continue paginating through all records until the end of the stream. This means that extraction will be significantly slower; even though the tap will only send along matching records, the API still has to paginate through all records to find all records that match.
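To make the filter-and-cancel behavior above concrete, here is a minimal, SDK-agnostic sketch. The `updated_at` field name and the `sorted_output` flag are assumptions for illustration, not part of any existing SDK interface:

```python
from datetime import datetime
from typing import Iterable, Iterator

def apply_end_date(
    records: Iterable[dict],
    end_date: datetime,
    sorted_output: bool = False,
) -> Iterator[dict]:
    """Yield only records strictly earlier than end_date (exclusive)."""
    for record in records:
        record_ts = datetime.fromisoformat(record["updated_at"])
        if record_ts >= end_date:
            if sorted_output:
                return  # sorted stream: safe to cancel remaining pagination
            continue  # unsorted stream: keep scanning to the end
        yield record
```

The exclusive comparison (dropping anything at or after `end_date`) matches the boundary question discussed in the next section.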
Spec Questions (TBD)

Still TBD is whether the `end_date` should be inclusive or exclusive. Probably exclusive is the correct behavior, so that `Jan 1 2022 0:00:00.000` (for instance) could be used as the `end_date` on one pipeline and the `start_date` on another. If we ask users to provide an inclusive date, they are likely to provide something like `Dec 31 2021 11:59:59`, which (depending on the precision of the source system) is subject to gaps, and therefore subject to unintentional data loss.

If we go with exclusive logic, and given that `start_date` is inclusive, then the logic would be:

`start_date` <= `record_date` < `end_date`
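A small illustration of why the half-open convention (inclusive start, exclusive end) leaves neither gaps nor overlaps between chained windows; the dates are arbitrary examples:

```python
from datetime import datetime

# Two back-to-back pipelines: the first window's end_date is exactly
# the second window's start_date.
window_2021 = (datetime(2021, 1, 1), datetime(2022, 1, 1))
window_2022 = (datetime(2022, 1, 1), datetime(2023, 1, 1))

def in_window(record_ts: datetime, start_date: datetime, end_date: datetime) -> bool:
    # start_date <= record_ts < end_date
    return start_date <= record_ts < end_date

midnight = datetime(2022, 1, 1)
assert in_window(midnight, *window_2022)      # counted in the later window...
assert not in_window(midnight, *window_2021)  # ...and excluded from the earlier one
```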
Caveats/warnings
For merge-insert operations, especially when operations are run in parallel, it is important to note that the latest time block should always be loaded last. This is because, over the course of a parallel execution, the same record may appear in a historic period and also in the latest time window. In order to not lose the most recent changes, the final/current time period should be loaded last or (safer yet) the final/current time block should be rerun after the prior periods have been extracted and loaded.
A theoretical example: loads for "2021" and "2022 YTD" are running in parallel. A `customers` table record `ABC Customer` has not been updated since Dec 2021. It is updated at the same time as the load is running, and "moves" into the 2022 YTD bucket after already being picked up in the 2021 bucket. If 2021 loads to the warehouse after the 2022 YTD bucket is loaded, the older change will overwrite the newer one, causing data loss.

The way to resolve this is to either wait until backfills have run before running the most recent period, or to rerun the latest period so that the newer version of the record once again becomes the primary/correct version of the record in the warehouse.
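A rough orchestration sketch of the "latest window last" rule; `run_pipeline` is a hypothetical stand-in for whatever invokes the tap/target pair with a bounded window, not an existing Meltano or SDK helper:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

def run_pipeline(start_date: str, end_date: Optional[str]) -> None:
    """Hypothetical: run one extract/load for the given window."""
    print(f"loading {start_date} .. {end_date or 'now'}")

historic_windows = [
    ("2020-01-01", "2021-01-01"),
    ("2021-01-01", "2022-01-01"),
]

# Historic windows are assumed immutable, so they can safely run in parallel.
with ThreadPoolExecutor() as pool:
    for start, end in historic_windows:
        pool.submit(run_pipeline, start, end)
# Exiting the with-block waits for every backfill to finish.

# Load (or re-run) the current, still-changing window last so the newest
# record versions overwrite anything already picked up by the backfills.
run_pipeline("2022-01-01", None)
```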
Since these challenges are reasonably addressed with some best practice documentation and/or additional intelligence added to the orchestrator that wants to run these in parallel (such as Meltano), there doesn't seem to be a strong reason not to build this feature.
See also