Concerned about PyDriller's performance when extracting data from a large number of commits, I looked at the code and found that Commit._stats is called more often than necessary in some cases.
Example: I extract all the commits from this repo and want, for each commit, the number of inserted lines and the number of deleted lines. Both properties are getters that call self._stats() and then pull the needed value out of the (larger) returned object, so the same work is done twice per commit.
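To make the redundancy concrete, here is a minimal stand-in sketch. It is not PyDriller's actual code: `FakeCommit`, its `numstat` argument, and the `stats_calls` counter are hypothetical names used only to show that reading two properties triggers the expensive call twice.

```python
class FakeCommit:
    """Hypothetical stand-in for a commit object; not PyDriller's real class."""

    def __init__(self, numstat):
        self._numstat = numstat   # pretend output of `git diff --numstat`
        self.stats_calls = 0      # counts how often the expensive work runs

    def _stats(self):
        # Stands in for the costly git invocation plus parsing.
        self.stats_calls += 1
        insertions = deletions = 0
        for line in self._numstat.splitlines():
            added, removed, _filename = line.split("\t")
            insertions += int(added)
            deletions += int(removed)
        return {"insertions": insertions, "deletions": deletions}

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]


commit = FakeCommit("3\t1\ta.py\n10\t4\tb.py")
totals = (commit.insertions, commit.deletions)
# Each property access recomputed the stats, so stats_calls is now 2.
```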
The solution I found is this: when extracting deletions and insertions, adding the @lru_cache(maxsize=None) decorator to the _stats() function in Commit.py cuts the total time in half. If other properties that use _stats() are also extracted, the gains are even larger.
The decorator works as follows: when a function is called with given arguments, it caches the returned value. If the function is called again with the same arguments, it returns the cached result instead of doing the work again.
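A small sketch of that behaviour on a method, again using a hypothetical stand-in class (`CachedCommit`, `numstat`, and `stats_calls` are illustrative names, not PyDriller's). For a method, the only argument is `self`, so the cache is keyed on the instance.

```python
from functools import lru_cache


class CachedCommit:
    """Hypothetical stand-in for a commit object; not PyDriller's real class."""

    def __init__(self, numstat):
        self._numstat = numstat
        self.stats_calls = 0

    @lru_cache(maxsize=None)
    def _stats(self):
        # Runs once per instance; later calls with the same `self`
        # are answered from the cache.
        self.stats_calls += 1
        insertions = deletions = 0
        for line in self._numstat.splitlines():
            added, removed, _filename = line.split("\t")
            insertions += int(added)
            deletions += int(removed)
        return {"insertions": insertions, "deletions": deletions}

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]


commit = CachedCommit("3\t1\ta.py\n10\t4\tb.py")
totals = (commit.insertions, commit.deletions)
info = CachedCommit._stats.cache_info()  # one miss, then one hit
```

One thing to keep in mind with this approach: the instance must be hashable, and because the cache lives on the function rather than on the object, an unbounded cache (`maxsize=None`) keeps a reference to every instance it has seen, which may matter when walking a very large history.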
I think there are quite a few places in the code where this strategy could be used. Although I like the tool, it is quite slow at extracting all the commit data of a repository such as this one, depending on which commit properties are requested.
Here's the change I made in Commit.py:

    from functools import lru_cache

    # ...

        @lru_cache(maxsize=None)
        def _stats(self):
            if len(self.parents) == 0:
                text = self._conf.get('git').repo.git.diff_tree(self.hash, "--", numstat=True, root=True)
                text2 = ""
                for line in text.splitlines()[1:]:
                    (insertions, deletions, filename) = line.split("\t")
                    text2 += "%s\t%s\t%s\n" % (insertions, deletions, filename)
                text = text2
            else:
                text = self._conf.get('git').repo.git.diff(self._c_object.parents[0].hexsha, self._c_object.hexsha, "--", numstat=True, root=True)
            return self._list_from_string(text)
Sorry, missed this!
Good catch indeed! Stats is never cached.
We could use a simple cache attribute instead (that is, store the function's return value on the object itself, as I do in other places).
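A rough sketch of what such a per-object cache might look like. The class and attribute names (`AttrCachedCommit`, `_stats_cache`, `_compute_stats`, `stats_calls`) are hypothetical and only illustrate the pattern; the real change would live in PyDriller's Commit class.

```python
class AttrCachedCommit:
    """Hypothetical stand-in for a commit object; not PyDriller's real class."""

    def __init__(self, numstat):
        self._numstat = numstat
        self._stats_cache = None  # filled on first use
        self.stats_calls = 0

    def _compute_stats(self):
        # Stands in for the costly git invocation plus parsing.
        self.stats_calls += 1
        insertions = deletions = 0
        for line in self._numstat.splitlines():
            added, removed, _filename = line.split("\t")
            insertions += int(added)
            deletions += int(removed)
        return {"insertions": insertions, "deletions": deletions}

    def _stats(self):
        # Compute once, then reuse the value stored on the instance.
        if self._stats_cache is None:
            self._stats_cache = self._compute_stats()
        return self._stats_cache

    @property
    def insertions(self):
        return self._stats()["insertions"]

    @property
    def deletions(self):
        return self._stats()["deletions"]


commit = AttrCachedCommit("3\t1\ta.py\n10\t4\tb.py")
totals = (commit.insertions, commit.deletions)
```

Because the cached value is stored on the instance, it is released together with the object and does not require the object to be hashable.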