Caching certain function return values of Commit object #295

Open
cedric-audy opened this issue Jul 4, 2024 · 1 comment
Labels: enhancement (New feature or request); PR welcome (Issue is confirmed, but not fixed yet)

Comments

cedric-audy commented Jul 4, 2024

Hi all,

I was concerned about PyDriller's performance when extracting data from a large number of commits, so I took a look at the code and realized that Commit._stats() is called more often than necessary in some cases.

Example: I extract all the commits from this repo and want, for each commit, the number of inserted lines and the number of deleted lines. Both properties are getter functions that call self._stats() and then extract the needed value from the (larger) returned object, so the same stats are computed twice per commit.

The solution I found is to add the @lru_cache(maxsize=None) decorator to the _stats() function in Commit.py. When extracting only insertions and deletions, this cuts the total time in half. If other properties that use _stats() are also extracted, the gains are even larger.

The decorator works as follows: when a function is called with given arguments, it caches the returned value. If the function is called again with the same arguments, it returns the cached result instead of doing the work again. For a method, self counts as one of those arguments.
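To illustrate the caching behaviour described above, here is a minimal sketch. CachedStatsCommit is a hypothetical stand-in, not PyDriller's Commit class, and the stats values are made up; the calls counter only exists to show how often the expensive work actually runs.

```python
from functools import lru_cache


class CachedStatsCommit:
    """Hypothetical stand-in for a commit object (not PyDriller's Commit)."""

    def __init__(self, sha):
        self.sha = sha
        self.calls = 0  # how many times the expensive work really ran

    @lru_cache(maxsize=None)
    def _stats(self):
        # Pretend this is the expensive git call; the values are made up.
        self.calls += 1
        return {"insertions": 10, "deletions": 3}

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]


commit = CachedStatsCommit("abc123")
total = commit.insertions + commit.deletions  # two property reads
print(total, commit.calls)  # the stats were only computed once
```

One thing to keep in mind with this approach: the cache lives on the function, is keyed on self, and with maxsize=None it never evicts, so every instance whose _stats() has been called stays referenced by the cache for the life of the process. Whether that matters when iterating over a very large history is something I have not measured.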

So I think there are quite a few places in the code where this strategy could be used. Although I like the tool, it is quite slow at extracting all the commit data from a repository such as this one, depending on which commit properties are requested.

Here's the change I made in Commit.py:

from functools import lru_cache
# ...

    @lru_cache(maxsize=None)
    def _stats(self):
        if len(self.parents) == 0:
            text = self._conf.get('git').repo.git.diff_tree(self.hash, "--", numstat=True, root=True)
            text2 = ""
            for line in text.splitlines()[1:]:
                (insertions, deletions, filename) = line.split("\t")
                text2 += "%s\t%s\t%s\n" % (insertions, deletions, filename)
            text = text2
        else:
            text = self._conf.get('git').repo.git.diff(self._c_object.parents[0].hexsha, self._c_object.hexsha, "--", numstat=True, root=True)

        return self._list_from_string(text)

ishepard (Owner) commented

Sorry, missed this!
Good catch indeed! The result of _stats() is never cached.
We could use a simple cache (that is, save the function's return value in an attribute on the object, like I do in other places).
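A minimal sketch of what such a simple cache could look like, assuming the return value is stored in an instance attribute. The class AttrCachedCommit, the attribute _stats_cache, and the helper _compute_stats are hypothetical names for illustration, not PyDriller's actual code, and the stats values are made up.

```python
class AttrCachedCommit:
    """Hypothetical stand-in for a commit object (not PyDriller's Commit)."""

    def __init__(self, sha):
        self.sha = sha
        self._stats_cache = None  # filled in on first use
        self.calls = 0  # how many times the expensive work really ran

    def _compute_stats(self):
        # Pretend this is the expensive git call; the values are made up.
        self.calls += 1
        return {"insertions": 10, "deletions": 3}

    def _stats(self):
        # Compute once, then reuse the stored result on later calls.
        if self._stats_cache is None:
            self._stats_cache = self._compute_stats()
        return self._stats_cache

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]

    @property
    def lines(self):
        return self.insertions + self.deletions


commit = AttrCachedCommit("abc123")
summary = (commit.insertions, commit.deletions, commit.lines)
print(summary, commit.calls)
```

Because the cached value is stored on the instance, it is released together with the commit object, which is the main practical difference from a function-level cache keyed on self.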

ishepard added the enhancement (New feature or request) and PR welcome (Issue is confirmed, but not fixed yet) labels on Aug 27, 2024