Caching certain function return values of Commit object #295

cedric-audy · 2024-07-04T23:24:56Z

Hi all,

I was concerned about PyDriller's performance when extracting data from a large number of commits, so I took a look at the code and realized that Commit._stats() is called more often than necessary in some cases.

Example: I extract all the commits from this repo and want, for each commit, the number of inserted lines and the number of deleted lines. Both properties are actually getters that call self._stats() and then extract the needed value from the (larger) returned object, so the same work is done once per property.

The solution I have found is this: when extracting deletions and insertions, adding the @lru_cache(maxsize=None) decorator to _stats() in Commit.py cuts the total time in half. If other properties that use _stats() are also extracted, the gains are even larger.

The decorator does the following: when a function is called with certain arguments, it caches the returned value. If the function is called again with the same arguments, it returns the cached result instead of doing the work again. For a method, self is one of those arguments, so each commit object gets its own cache entry.
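To make that concrete, here is a minimal sketch of the behaviour. FakeCommit and its stats_calls counter are hypothetical stand-ins invented for illustration, not PyDriller's actual Commit class, and the returned numbers are placeholders for the real git call.

```python
from functools import lru_cache


class FakeCommit:
    """Hypothetical stand-in for a commit object (not PyDriller's Commit)."""

    def __init__(self, commit_hash):
        self.hash = commit_hash
        self.stats_calls = 0  # counts how often the expensive work really runs

    @lru_cache(maxsize=None)
    def _stats(self):
        # Placeholder for the expensive git invocation.
        self.stats_calls += 1
        return {"insertions": 10, "deletions": 3}

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]


commit = FakeCommit("abc123")
total = commit.insertions + commit.deletions  # two property reads
print(commit.stats_calls)  # the body of _stats() ran only once
```

One caveat with this approach: the cache created by lru_cache lives on the function, not on the instance, so it keeps a reference to every object it has seen until the cache is cleared. With maxsize=None that can keep a lot of commit objects alive when iterating over a large repository.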

So I was thinking that there are quite a few places in the code where this strategy could be used. Although I like the tool, it is quite slow at extracting all the commits' data from a repository such as this one, depending on which commit properties we are looking for.

Here's the change I made in Commit.py:

from functools import lru_cache
# ...

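    @lru_cache(maxsize=None)  # cache the result so repeated calls skip the git invocation
    def _stats(self):
        if len(self.parents) == 0:
            # Initial commit: there is no parent to diff against, so use diff-tree with --root
            text = self._conf.get('git').repo.git.diff_tree(self.hash, "--", numstat=True, root=True)
            text2 = ""
            # Skip the first line of the output (the commit hash) and keep the numstat rows
            for line in text.splitlines()[1:]:
                (insertions, deletions, filename) = line.split("\t")
                text2 += "%s\t%s\t%s\n" % (insertions, deletions, filename)
            text = text2
        else:
            # Regular commit: diff against the first parent
            text = self._conf.get('git').repo.git.diff(self._c_object.parents[0].hexsha, self._c_object.hexsha, "--", numstat=True, root=True)

        return self._list_from_string(text)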

ishepard · 2024-08-27T10:43:14Z

Sorry, missed this!
Good catch indeed! Stats is never cached.
We could use a simple cache (that is, saving the function's return value on the object itself, like I do in other places).
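A rough sketch of what such a simple cache could look like follows. The class CachedCommit, the attribute _stats_cache, and the helper _compute_stats are hypothetical names chosen for illustration; they are assumptions and not taken from PyDriller's code.

```python
class CachedCommit:
    """Hypothetical stand-in for a commit object (not PyDriller's Commit)."""

    def __init__(self, commit_hash):
        self.hash = commit_hash
        self._stats_cache = None  # filled in on first use
        self.stats_calls = 0      # counts how often the expensive work really runs

    def _compute_stats(self):
        # Placeholder for the expensive git invocation.
        self.stats_calls += 1
        return {"insertions": 10, "deletions": 3, "files": 2}

    def _stats(self):
        # Compute once, then reuse the stored value on later calls.
        if self._stats_cache is None:
            self._stats_cache = self._compute_stats()
        return self._stats_cache

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]

    @property
    def files(self):
        return self._stats()["files"]


commit = CachedCommit("abc123")
summary = (commit.insertions, commit.deletions, commit.files)
print(commit.stats_calls)  # the expensive helper ran only once
```

Compared with lru_cache, the stored value here belongs to the instance, so it is released together with the commit object instead of staying in a function-level cache.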

ishepard added the labels "enhancement" (New feature or request) and "PR welcome" (Issue is confirmed, but not fixed yet) on Aug 27, 2024
