-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add time range parameters to sync script #483
Conversation
from src.logger import set_log | ||
|
||
log = set_log(__name__) | ||
|
||
|
||
def compute_time_range( | ||
start_time: datetime, end_time: datetime | ||
) -> list[tuple[datetime, datetime]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't there used to be a type AccountingPeriod
? If that class has been abolished, it might not hurt to add a basic declaration:
AccountingPeriod = tuple[datetime, datetime]
that can be passed around the project for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I am still not sure how to reasonably merge the two code bases of solver-rewards and dune-sync (v1).
Accounting period seems to be tailored to rewards with defaulting to weeks and having pretty printing for dune tables. The accounting period also has some implicit constraints on having a time of 00:00:00 in some parts of the code.
Making the concept of accounting periods consistent throughout the code would definitely help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, I block number ranges would be the most appropriate thing. This was the benefit of the old AccountingPeriod
class; it provided a bijection between timestamps and block numbers so computations could be performed on the more natural type.
Anyway, I suppose this is just a side note. Feel free to ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an ongoing internal debate on what the most natural concept for accounting periods is. Block ranges sound nice when considering how backend data is structured. But they become a bit unnatural if multiple chains are considered. For simplicity, we want to use time based periods everywhere, across chains and projects.
src/data_sync/sync_data.py
Outdated
|
||
async def sync_data_to_db( # pylint: disable=too-many-arguments | ||
async def sync_data_to_db( # pylint: disable=too-many-arguments, too-many-locals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Looks like this method has become quite overloaded...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the code (but not the responsibilities of the function) a bit. At least I do not need to exclude the pylint check anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
||
start_block = find_block_with_timestamp(node, start_time.timestamp()) | ||
if latest_block_time < end_time: | ||
end_block = int(latest_block["number"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a log here to let us know if this happens that the script is not actually using the end time because it's still in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an opinion on where this processing for time ranges should take place?
On the one hand, most of the logging and checking can also happen during initialization in ScriptArgs
. That might also remove the implicit changing of end times due to the restriction to finalized blocks.
On the other hand, a node would be required to do the checking for latest block. And it would be mixing abstractions. The accounting, on a high level, should be time based. That this is internally split into monthly ranges and blocks is a bit of an implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to be honest. Since this requires actual implementation details and not checking whether or not the args are valid it feels cleaner to leave this here. For the sake of serializing the args any end time > start time would be valid, so it seems to me like this is the right place. But if moving this to the initialization makes it more maintainable then that could be a good argument to do this there instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I just added a log message.
New and removed dependencies detected. Learn more about Socket for GitHub ↗︎
🚮 Removed packages: pypi/async-timeout@4.0.3, pypi/exceptiongroup@1.2.2 |
I recompiled dependencies after adding |
src/data_sync/common.py
Outdated
|
||
log = set_log(__name__) | ||
|
||
|
||
def compute_time_range( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name of function is a bit confusing as its input is actually the time range, but tbh i don't have any good suggestion for the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed to partition_time_range
and added a comment at the call site to make it a bit clearer.
Overall, this splitting of time ranges is a bit confusing since it is leaking implementation details of chunking into the rest of the code. The current main has that problem too, though.
to make it easier to understand what is going on with time ranges
This PR adds start and end times to the sync data script.
The command line arguments
start_time
andend_time
are added to thesync_data
script. The script is called as, e.g.,Only data between
start_time
andend_time
is computed. This data is then upserted into the corresponding table of the month. I.e. if a row was not in the table already, it is inserted into the table. Rows from the new data replace rows of the old data if it exists. Old data which was not recomputed stays as is.The code is structured as follows:
The convention for the stat time to be inclusive and for the end time to be exclusive is used. This way the two ranges
(2024-12-30, 2025-01-02), (2025-01-02, 2025-01-07)
would give the same result as the range(2024-12-30, 2025-01-07)
. Though some overlap is required in casesend_time
is beyond the last finalized block.If no argument is supplied, the start of the month and the start of the next month are used as default for
start_time
andend_time
, respectively, to compute data for the full month, until the last finalized block.The previous month is not automatically recomputed on the first of the next month. Instead, one can use a time range which contains whatever time window for which data needs to be recomputed.
Potential TODOs