-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add boolean sync argument to get_path() #20
Conversation
Codecov ReportBase: 78.59% // Head: 78.49% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #20 +/- ##
==========================================
- Coverage 78.59% 78.49% -0.10%
==========================================
Files 5 5
Lines 369 372 +3
Branches 51 52 +1
==========================================
+ Hits 290 292 +2
Misses 65 65
- Partials 14 15 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Shouldn't a complete system synchronization operation be performed in the background and not on each "get" call? I don't think surfacing this in the API is correct - nor is requiring users to call (I also see that For now, would it make sense to expose a "list-based" API (e.g., A mirrored If this approach was adopted, it might make more sense to return a |
@kevin-bates Thank you for your input; I actually had considered exactly that same paths_by_id = file_id_manager.get_paths([record.id for record in records])
for record in records:
model = Model.from_orm(record)
model.path = paths_by_id[model.id] file_id_manager.sync_all()
for record in records:
model = Model.from_orm(record)
model.path = file_id_manager.get_path(model.id, sync=False) Which is more readable is debatable in my opinion. While At the end of the day, distinguishing identical procedures with different performance characteristics inherently rely on knowledge of implementation details. We believe it's better to be explicit and have developers be aware that |
Well, if you're willing to sacrifice consistency between the filesystem and the Files DB table, then sure. We could call Essentially this is a tradeoff between consistency and performance. My objection to adding this feature as a configurable trait is that I don't see a demonstrable usecase where you want a file path given a file ID and are somehow OK with it being out-of-date. If there is a usecase for that, I still believe that it should not be the default. We should prioritize our methods being as correct as possible out-of-the-box, while allowing users to opt-in to any features that would sacrifice correctness. |
Yes. Basically we will delete a record if we find a record with the same inode number but different timestamps, since we are assuming that the file at the current path is a completely different file from the one recorded. Otherwise if both the inode number and timestamps match, we update the record with the current path. I believe this is an implementation detail that just helps maintain consistency between the filesystem and files table. Don't think this should change the HTTP verb we use. |
The proposal that I had was to add a The big picture though is that the file id APIs will likely need to update state on a regular interval and we need to make sure we do that in a manner that allows multiple clients to call these APIs without calling |
I should note that the sync flag wouldn't be able to optimize the case of multiple clients calling these APIs in a manner that doesn't lead to calling |
Yes, and I believe you're optimizing for the wrong thing. You've decided to penalize the most common API calls ( (I apologize. I will no longer comment on implementation details. I will, however, continue to comment on API decisions since that is the contract by which other FileID Services implementations must adhere and they should be able to hide these kinds of decisions from their users.) |
Thinking a bit more overnight. If there are N RTC clients, those clients would need to poll
This way, each client won't have to poll the server to get all these updates. I think this aligns with how @kevin-bates is thinking in his last comment as well. |
Hm, even since this project was started, our north star was as follows:
If we do add a sync interval and set it to be the default (i.e.
No, I think your comments are super valuable! To be honest, you've been one of the only other contributors critiquing our design. We value your feedback a lot. |
Yes, this would offload the server quite a bit, since the current implementation is not |
Allows for
get_path()
to be called multiple times in serial without affecting performance.Whenever
get_path()
is called, the Files table must be synced with the local filesystem beforehand viasync_all()
. This scans over the entire file tree at the server root and essentially ensures each path is mapped to the correct file ID. This can be an expensive operation, and thus we want to minimize calls tosync_all()
.This PR allows users to provide
sync=False
and simply callsync_all()
manually beforehand. This is useful for scenarios where developers are callingget_path()
multiple times in serial, andsync_all()
does not need to be called between invocations. For example:Given that
sync_all()
represents almost all of the performance cost ofget_path()
, this method is N times faster than callingfile_id_manager.get_path(id)
in serial, where N is the length ofids
.