[Question] How to detect files that were created before app starts #612
Comments
You could pass a DirectorySnapshot of the directory and an empty DirectorySnapshot to DirectorySnapshotDiff. That way, all the files in the directory would be reported as created. I did this in my own application, but if @BoboTiG allows me I'll create a PR so he can review the code and apply it to the library if he deems it necessary. Also, to avoid processing the same files the next time you start your application, you should pickle the DirectorySnapshot with the last processed content and recover it on the next application start. The resulting code should look something like this:
if file_with_pickled_snapshot_exists():
    previous_snapshot = recover_pickled_snapshot()
else:
    previous_snapshot = EmptyDirectorySnapshot()
current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)
pickle_snapshot(current_snapshot)
Also you should probably use the parameter |
@Ajordat yeah, open a PR. And even if it may not be merged, it will help others :) |
@Ajordat could you show a demo that we could run and try out? Thanks. The current approach uses a database to store and retrieve state, so I'm not sure how this approach would differ. I'm also not sure what “pickle the DirectorySnapshot” means. |
I've just created the PR #613. If @BoboTiG believes that code might be useful for anybody else, you will be able to use the new class. Regarding pickling, it is the serialization of an object into bytes so that it can be recovered later. To address your concern: just pickle the DirectorySnapshot with the processed changes and recover it later to avoid processing the same content. It should be something like this:
try:
    with open('directory_snapshot.pickle', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()
current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)
with open('directory_snapshot.pickle', 'wb') as file:
    pickle.dump(current_snapshot, file)
As you can see, if the file doesn't exist you make the diff using the EmptyDirectorySnapshot; whereas if it exists, you recover the pickled DirectorySnapshot to avoid processing the files that were present on the previous execution. |
Can I use this now as it is?
|
Well, you should do a few things before:
import pickle
from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
from watchdog.utils.dirsnapshot import EmptyDirectorySnapshot
def handle_diff(diff: DirectorySnapshotDiff) -> None:
    pass
Also, since you are asking this question, are you sure you understand what that piece of code does? |
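For readers wondering what a non-empty handle_diff might look like, here is a minimal sketch. It relies only on the files_created and files_modified attributes that watchdog's DirectorySnapshotDiff exposes; the CSV filter and the SimpleNamespace stand-in are assumptions made for illustration, so the snippet runs even without watchdog installed.

```python
from types import SimpleNamespace
from typing import List


def handle_diff(diff) -> List[str]:
    """Return the file paths that need processing, in sorted order.

    `diff` is expected to expose `files_created` and `files_modified`
    lists, as watchdog's DirectorySnapshotDiff does.
    """
    pending = set(diff.files_created) | set(diff.files_modified)
    # Hypothetical filter: only CSV files are of interest in this thread.
    return sorted(p for p in pending if p.lower().endswith(".csv"))


# Stand-in object (an assumption) so the sketch runs without watchdog.
fake_diff = SimpleNamespace(
    files_created=["/data/new.csv", "/data/notes.txt"],
    files_modified=["/data/old.csv"],
)
print(handle_diff(fake_diff))  # ['/data/new.csv', '/data/old.csv']
```

With a real DirectorySnapshotDiff the call would be the same; both created and modified files are included, so files changed while the application was stopped are picked up as well.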
This was closed automatically when the PR was merged. I'm reopening it and will let @scheung38 handle the state. |
Not exactly sure, could you demonstrate? Say if
|
Yes, that's exactly what would happen. I thought it was what you were asking for, wasn't it? |
Yes, but I need to understand your logic first before trying. I appreciate it.
This applies to files that are either created or modified before the app starts, correct?
|
Why can't EmptyDirectorySnapshot be imported? I can import the other classes, DirectorySnapshot and DirectorySnapshotDiff, though. EDIT: EmptyDirectorySnapshot is in master from what I can see, but not in dirsnapshot.py? |
It is part of a version that has not been released yet. You have to install the version from the master branch instead of the one from PyPI. |
So “pip install watchdog” does not install from master? Then the company firewall might prevent the pip install, since:
python -m pip install git+https://github.com/gorakhargosh/watchdog --user
Looking in indexes: http://CLIENT_URL/artifactory/api/pypi/pypi-repos/simple
Collecting git+https://github.com/gorakhargosh/watchdog
Error RPC failed; HTTP 403 |
I copied and pasted only the new EmptyDirectorySnapshot class:
It returns:
Traceback: line 37, in previous_snapshot = pickle.load(file)
_pickle.UnpicklingError: A load persistent id instruction was encountered, but no persistent load function was specified. |
It seems to work now, but does it work with CSV files? It seems the CSV files are now corrupted the next time they are opened in Excel. EDIT: sometimes I still get the above error and need to restart PyCharm. My fault: I should be opening a file.pkl with rb and wb instead. |
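The UnpicklingError above is consistent with pickle.load being pointed at a file that is not a pickle (for example a CSV), although the thread does not confirm the exact cause. As a hedged sketch, the snapshot can be kept in its own binary file and loaded defensively. The helper names load_previous and save_current are hypothetical, not part of watchdog.

```python
import pickle


def load_previous(path, default):
    """Load a pickled object from `path`, or return `default` if unusable."""
    try:
        # Pickle data is binary, so the file must be opened with "rb".
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return default  # first run: nothing has been stored yet
    except (pickle.UnpicklingError, EOFError):
        return default  # the file exists but is not a valid pickle


def save_current(path, obj):
    """Store `obj` in a dedicated binary file (never in the data files)."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
```

Falling back to the default on a corrupt file means everything is treated as new again; whether that is acceptable depends on the application, and raising instead may be the safer choice.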
Sorry for the late reply: I work full time, I have other duties, and it took me quite some time to write this response.
First, moving to the code you've shown: it seems you are not escaping the backslash. Could that be the reason for your error? It also seems like you are trying to use a CSV file and open it. That is a strong indicator of a misunderstanding, so I'll try to explain in more depth.
The file we use only stores the DirectorySnapshot object for the next execution of your application. It is written in binary and is not meant to be human-readable. What we store there is just the DirectorySnapshot, which records which files and folders were inside a directory at a certain point in time (when the object was created). We need to store that information to avoid processing it again the next time we start the application. But why? Here's the flow:
└── folder
    ├── file_a.txt
    └── file_b.txt
Now, what will happen if a file (
I hope you now have a better understanding of how the code works. |
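The flow described above can be illustrated without watchdog at all. The following is a standard-library stand-in, not watchdog's implementation: a plain dict plays the role of DirectorySnapshot, an empty dict plays the role of EmptyDirectorySnapshot, and the function names take_snapshot, diff_snapshots and run_once are hypothetical.

```python
import os
import pickle


def take_snapshot(root):
    """Map every file path under `root` to its (mtime_ns, size)."""
    snap = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            snap[full] = (st.st_mtime_ns, st.st_size)
    return snap


def diff_snapshots(previous, current):
    """Return sorted (created, modified, deleted) path lists."""
    created = sorted(p for p in current if p not in previous)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    deleted = sorted(p for p in previous if p not in current)
    return created, modified, deleted


def run_once(root, state_file):
    """One application start: diff against the stored snapshot, then store."""
    try:
        with open(state_file, "rb") as fh:
            previous = pickle.load(fh)
    except FileNotFoundError:
        previous = {}  # first start: every existing file counts as created
    current = take_snapshot(root)
    result = diff_snapshots(previous, current)
    with open(state_file, "wb") as fh:
        pickle.dump(current, fh)
    return result
```

On the first start, both file_a.txt and file_b.txt come back as created; on the next start nothing is reported unless a file was added, changed or removed in between. The state file should live outside the watched folder, otherwise it shows up in the snapshot itself.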
Fully appreciated, thanks. Hence my last comment, where I was using a file.pkl and not the actual CSV file. |
* Added DirectorySnapshotEmpty (gorakhargosh#612).
* Added test to show (and test) the usage of DirectorySnapshotEmpty (gorakhargosh#612).
* Added sphinx class for DirectorySnapshotEmpty.
* Changed class name from DirectorySnapshotEmpty to EmptyDirectorySnapshot.
* Added documentation.
* Small doc fix.
* Updated changelog.rst.
Win 10 or Linux: if CSV files are already present in the directory before the app starts, the app does not detect them. How can this be resolved?
To be precise: how can I detect files that were created previously, with or without a time window (say, within the last 30 days), so that they are processed when the app is launched for the first time, but are ignored as already processed when the app is stopped and restarted in a later session?
Is this even possible? If not, would I need something like Kafka to process files only once?
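The time window part of the question can be sketched with the standard library alone. This is an illustrative assumption rather than anything watchdog provides: the helper recent_files is hypothetical, and it uses the modification time because a true creation time is not available portably across Windows and Linux.

```python
import os
import time


def recent_files(paths, max_age_days=30, now=None):
    """Keep only the paths modified within the last `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return [p for p in paths if os.stat(p).st_mtime >= cutoff]
```

Such a filter could be applied to the list of created files on the first launch, so that only files from the last 30 days are processed; later launches would rely on the stored snapshot instead.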