[Question] How to detect files that were created before app starts #612

Closed
scheung38 opened this issue Jan 15, 2020 · 17 comments

Comments

scheung38 commented Jan 15, 2020

Win 10 or Linux: if CSV files are already present in the directory before the app starts, the app does not detect them. How can I resolve this?

To be precise: how can I detect files created previously, with or without a time window (say, within the last 30 days), so that they are processed when the app is first launched, but are ignored as already processed when the app is stopped and restarted in a later session?

Is this even possible? If not, would I need something like Kafka to process files only once?

Contributor

Ajordat commented Jan 18, 2020

You could provide a DirectorySnapshot of the directory and an empty DirectorySnapshot to DirectorySnapshotDiff. This way, all the files in the directory would be reported as created.

I did this with my own application, but if @BoboTiG allows me I'll create a PR so he can review the code and apply it to the library if he deems necessary.

Also, to avoid processing the files the next time you start your application, you should pickle the DirectorySnapshot with the last processed content and recover it on the next application start.

The resulting code should look something like this:

if file_with_pickled_snapshot_exists():
    previous_snapshot = recover_pickled_snapshot()
else:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)
pickle_snapshot(current_snapshot)

Also, you should probably use the ignore_device parameter if you plan to diff snapshots taken across different boots (PR #597).
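
To illustrate why ignore_device can matter across boots, here is a stdlib-only sketch of the idea. It is not watchdog's code. It assumes that files are matched between snapshots by their (inode, device) pair and that the device number of a mount may change after a reboot. The helper created_paths and the sample numbers are hypothetical.

```python
def created_paths(old, new, ignore_device=False):
    """Return paths in `new` whose identity does not appear in `old`.

    `old` and `new` map path -> (inode, device). This only mimics the
    idea behind ignore_device; it is not watchdog's implementation.
    """
    def identity(inode_dev):
        # With ignore_device, only the inode number identifies a file.
        return inode_dev[0] if ignore_device else inode_dev

    known = {identity(value) for value in old.values()}
    return sorted(path for path, value in new.items() if identity(value) not in known)


# The same two files before and after a reboot; only the device number changed.
before = {"/path/a.csv": (101, 2049), "/path/b.csv": (102, 2049)}
after = {"/path/a.csv": (101, 2065), "/path/b.csv": (102, 2065)}

print(created_paths(before, after))                      # both files look new
print(created_paths(before, after, ignore_device=True))  # nothing looks new
```

Under these assumptions, comparing by inode alone keeps already-processed files from showing up as created again after a reboot.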

Collaborator

BoboTiG commented Jan 18, 2020

@Ajordat yeah, open a PR. And even if it may not be merged, it will help others :)

Author

scheung38 commented Jan 19, 2020

@Ajordat could you show a demo that we could run and try out? Thanks. Our current approach uses a database to store and retrieve state, so I'm not sure how this approach would differ.

Also, I'm not sure what “pickle the DirectorySnapshot” means.

Ajordat pushed a commit to Ajordat/watchdog that referenced this issue Jan 19, 2020
Contributor

Ajordat commented Jan 19, 2020

I've just created PR #613. If @BoboTiG believes the code might be useful to others, you will be able to use the new class EmptyDirectorySnapshot (I'm already using it in my own project).

Regarding pickling: it is the serialization of an object into bytes so that it can be recovered later. For your use case, just pickle the DirectorySnapshot once its changes have been processed, and recover it later to avoid processing the same content again. It should be something like this:

try:
    with open('directory_snapshot.pickle', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('/path')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)

with open('directory_snapshot.pickle', 'wb') as file:
    pickle.dump(current_snapshot, file)

As you can see, if the file doesn't exist you make the diff using the EmptyDirectorySnapshot; whereas if it exists, you recover the pickled DirectorySnapshot to avoid processing the files present on the previous execution.
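
To make the pickling step concrete, here is a minimal round trip using only the standard library. A plain dict stands in for the DirectorySnapshot, and the paths and timestamps are made-up sample data.

```python
import pickle

# A plain dict stands in for the DirectorySnapshot (illustrative data only).
state = {"/path/file_a.txt": 1579000000.0, "/path/file_b.txt": 1579000500.0}

blob = pickle.dumps(state)     # object -> bytes (what ends up in the .pickle file)
restored = pickle.loads(blob)  # bytes -> an equal, independent object

print(type(blob).__name__, restored == state)
```

pickle.dump and pickle.load in the snippet above do the same thing, but write to and read from an open binary file instead of a bytes object.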

Author

scheung38 commented Jan 19, 2020 via email

Contributor

Ajordat commented Jan 19, 2020

Well, you need to do a few things first:

  • Add the imports.
import pickle
from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
from watchdog.utils.dirsnapshot import EmptyDirectorySnapshot
  • Implement the function handle_diff to do whatever you want.
def handle_diff(diff: DirectorySnapshotDiff) -> None:
    pass
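
As one possible shape for handle_diff, the sketch below processes only CSV files that the diff reports as created or modified, and applies the 30-day window from the original question. It assumes the diff object exposes files_created and files_modified as lists of paths. It uses the file modification time as a proxy for age. The helper recent_csv_files and the constant MAX_AGE_DAYS are hypothetical names, and unlike the stub above this version returns the list of processed paths.

```python
import os
import time

MAX_AGE_DAYS = 30  # time window taken from the original question


def recent_csv_files(paths, max_age_days=MAX_AGE_DAYS, now=None):
    """Keep only .csv paths modified within the last `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    return [
        path for path in paths
        if path.lower().endswith(".csv") and os.path.getmtime(path) >= cutoff
    ]


def handle_diff(diff):
    """Process new or changed CSV files reported by the snapshot diff."""
    candidates = list(diff.files_created) + list(diff.files_modified)
    to_process = recent_csv_files(candidates)
    for path in to_process:
        print("processing", path)  # replace with the real CSV handling
    return to_process
```

Because the function only reads two attributes, any object with files_created and files_modified can be passed in, which makes it easy to try out without a real directory diff.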

Also, since you are asking this question, are you sure you understand what that piece of code does?

Collaborator

BoboTiG commented Jan 19, 2020

This was closed automatically when the PR was merged. I am reopening it and will let @scheung38 handle the state.

BoboTiG reopened this Jan 19, 2020
Author

scheung38 commented Jan 19, 2020

Not exactly sure. Could you demonstrate? Say:

  1. fileA is created or modified before the app starts.
  2. The app starts, so it should pick up fileA.
  3. The app stops and restarts; now it should ignore fileA since it has already been processed.
  4. fileB is created and modified; now the app starts and it should process only fileB, and so on?

Sent with GitHawk

Contributor

Ajordat commented Jan 19, 2020

Yes, that's exactly what would happen. I thought it was what you were asking for, wasn't it?

Author

scheung38 commented Jan 19, 2020 via email

Author

scheung38 commented Jan 20, 2020

Why can't EmptyDirectorySnapshot be imported?

I can import the other classes, DirectorySnapshot and DirectorySnapshotDiff, though.

EDIT: from what I can see, EmptyDirectorySnapshot is in master but not in dirsnapshot.py?

Sent with GitHawk

Collaborator

BoboTiG commented Jan 20, 2020

Why can't EmptyDirectorySnapshot be imported?

It is part of a version that has not been released yet. You have to install the version from the master branch instead of the one from PyPI.

Author

scheung38 commented Jan 20, 2020

So “pip install watchdog” does not install from master?

Then the company firewall might be blocking the install from GitHub, since:

python -m pip install git+https://github.com/gorakhargosh/watchdog --user

Looking in indexes: http://CLIENT_URL/artifactory/api/pypi/pypi-repos/simple

Collecting git+https://github.com/gorakhargosh/watchdog

Error RPC failed; HTTP 403

Author

scheung38 commented Jan 20, 2020

I copied and pasted only the new EmptyDirectorySnapshot class:

from watchdog.utils.dirsnapshot import DirectorySnapshot
from watchdog.utils.dirsnapshot import DirectorySnapshotDiff
import pickle


class EmptyDirectorySnapshot(object):
    """Class to implement an empty snapshot. This is used together with
    DirectorySnapshot and DirectorySnapshotDiff in order to get all the files/folders
    in the directory as created.
    """

    @staticmethod
    def path(_):
        """Mock up method to return the path of the received inode. As the snapshot
        is intended to be empty, it always returns None.
        :returns:
            None.
        """
        return None

    @property
    def paths(self):
        """Mock up method to return a set of file/directory paths in the snapshot. As
        the snapshot is intended to be empty, it always returns an empty set.
        :returns:
            An empty set.
        """
        return set()


def handle_diff(diffs):
    print(diffs)


try:
    with open('Y:\\data\sample.csv', 'rb') as file:
        previous_snapshot = pickle.load(file)
except FileNotFoundError:
    previous_snapshot = EmptyDirectorySnapshot()

current_snapshot = DirectorySnapshot('Y:\\data')
diff = DirectorySnapshotDiff(previous_snapshot, current_snapshot)
handle_diff(diff)

with open('Y:\\data\sample.csv', 'wb') as file:
    pickle.dump(current_snapshot, file)

Returns:

Traceback: line 37, in
    previous_snapshot = pickle.load(file)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent load function was specified.

Author

scheung38 commented Jan 20, 2020

It seems to work now, but does it work with CSV files? The CSV file seems to be corrupted the next time I open it in Excel.

EDIT: sometimes I still get the above error, and I need to restart PyCharm?

My fault: I should be opening a file.pkl with rb and wb instead.

Contributor

Ajordat commented Jan 20, 2020

Sorry for the late reply: I work full time, I have other duties, and it took me quite a while to write this response.

First, pip installs the latest release uploaded to PyPI, so even if the master branch is updated with a commit, the PyPI package isn't automatically updated.

Moving on to the code you've shown: it seems you forgot to escape a backslash (in 'Y:\\data\sample.csv'). Could that be the reason for your error?

Also, it seems you are using a CSV file as the pickle file. That's a strong indicator of a misunderstanding. I'll try to explain it in more depth:

The file we use only stores the DirectorySnapshot object for the next execution of your application. It is written in binary and is not meant to be human-readable. What we store there is just the DirectorySnapshot, which records which files and folders were inside a directory at a certain point in time (when the object was created).

We need to store that information to avoid processing the same files again the next time we start the application. Why? Here's the flow:
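
Since the state file is binary pickle data, pointing pickle.load at a data file such as a CSV fails. The sketch below shows one defensive way to load the state: it falls back to a default when the file is missing or is not a pickle. The function load_state and its default argument are hypothetical. Falling back silently means everything gets reprocessed, which may or may not be what you want. Other exception types can also be raised for arbitrary input, so the except clause is a starting point rather than a complete list.

```python
import pickle
from pathlib import Path


def load_state(state_file, default=None):
    """Return the unpickled object stored in `state_file`.

    Falls back to `default` when the file is missing or does not hold
    pickle data (for example, when it is really a CSV file).
    """
    state_file = Path(state_file)
    try:
        with state_file.open("rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return default
    except (pickle.UnpicklingError, EOFError) as exc:
        print(f"{state_file.name} is not a usable pickle: {exc}")
        return default
```

Keeping the state in its own file, separate from the CSV files being processed, also avoids overwriting a data file with pickle bytes.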

└── folder
    ├── file_a.txt
    └── file_b.txt
  1. Start the application that will look for changes in the directory /folder/. That's your application.
  2. Since there is no record of a previous execution (the file directory_snapshot.pickle is missing), we must process all files, so we take the EmptyDirectorySnapshot as the reference.
  3. We take the DirectorySnapshot of the directory /folder/.
  4. We make the diff between both snapshots. Since the first one is empty, all the files in the second snapshot will be detected as CREATED. This means both files inside the directory: /folder/file_a.txt and /folder/file_b.txt.
  5. We call the function handle_diff with the results of the operation.
  6. Since we have processed both files as created and we don't want to do it again the next time the application starts, we store (pickle) the DirectorySnapshot that we previously took in a file (directory_snapshot.pickle). This way we avoid processing the same files again next time.

Now, what will happen if a file (/folder/file_c.txt) gets added and that piece of code is executed again?

  1. The application checks whether there is any record of a previous execution. It finds the file directory_snapshot.pickle and takes (unpickles) its contents. It uses the DirectorySnapshot created on the previous execution as the reference.
  2. We take the DirectorySnapshot of the directory /folder/.
  3. We make the diff between both snapshots. Since the files /folder/file_a.txt and /folder/file_b.txt exist in both snapshots and haven't been updated, they are ignored in the diff. That is not the case for /folder/file_c.txt: because it is new, it is detected as CREATED.
  4. We call the function handle_diff with the results of the operation.
  5. Since we have processed the new file and we don't want to do it again next time, we store (pickle) the second DirectorySnapshot in a file (directory_snapshot.pickle). This way we avoid processing the same file again next time.
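
The two runs described above can be simulated with the standard library alone. This is a sketch of the flow, not watchdog's implementation: a set of file paths stands in for the DirectorySnapshot, an empty set plays the role of the EmptyDirectorySnapshot, and only created files are detected (not modified or deleted ones). The helpers take_snapshot and run_once are hypothetical names.

```python
import os
import pickle
import tempfile
from pathlib import Path


def take_snapshot(folder):
    """Return the set of file paths currently under `folder`."""
    return {
        os.path.join(root, name)
        for root, _dirs, files in os.walk(folder)
        for name in files
    }


def run_once(folder, state_file):
    """One application start: diff against the stored snapshot, then store the new one."""
    try:
        with open(state_file, "rb") as fh:
            previous = pickle.load(fh)
    except FileNotFoundError:
        previous = set()  # plays the role of EmptyDirectorySnapshot

    current = take_snapshot(folder)
    created = sorted(os.path.basename(path) for path in current - previous)

    with open(state_file, "wb") as fh:
        pickle.dump(current, fh)
    return created


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp, "folder")
    folder.mkdir()
    # The state file lives outside the watched folder so it is never seen as new.
    state_file = Path(tmp, "directory_snapshot.pickle")

    (folder / "file_a.txt").write_text("a")
    (folder / "file_b.txt").write_text("b")
    first_run = run_once(folder, state_file)

    (folder / "file_c.txt").write_text("c")
    second_run = run_once(folder, state_file)

print(first_run)   # ['file_a.txt', 'file_b.txt']
print(second_run)  # ['file_c.txt']
```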

I hope you now have a better understanding of how the code works.

Author

scheung38 commented Jan 21, 2020

Much appreciated, thanks. Hence my last comment about using a file.pkl and not the actual CSV file.

Sent with GitHawk

CCP-Aporia pushed a commit to CCP-Aporia/watchdog that referenced this issue Aug 13, 2020
* Added DirectorySnapshotEmpty (gorakhargosh#612).

* Added test to show (and test) the usage of DirectorySnapshotEmpty (gorakhargosh#612).

* Added sphinx class for DirectorySnapshotEmpty.

* Changed class name from DirectorySnapshotEmpty to EmptyDirectorySnapshot.

* Added documentation.

* Small doc fix.

* Updated changelog.rst.