Slow repository browsing in 1.14.x #15707
Comments
This is because the algorithm was changed in 1.14 due to a problem with go-git causing significant memory issues. Thank you for the test cases though because they will provide tests to improve the current algorithm. If you are suffering significant slow downs here you can switch back to the gogit build by adding We would otherwise appreciate help in improving the performance of the algorithm for the pure git version. |
Thanks for the hint with TAGS. I don't have time to make more tests now but I found something interesting. When browsing my repository with gitea I see the following git processes in htop:
These processes were running for about one minute so I have run the first git process by hand:
and it gave me 1087346 rows. I suppose those million rows are then passed to the second git process. I have piped the output from the first git to the other:
it takes about 15 seconds and shows that file swinka.txt is larger than 1 GB
so there is a lot of data to pass between gitea and git. So the question is: is it really needed for the first git process to return one million rows? |
@tsowa unfortunately yes but it should be relatively fast - the issue will be that the structure of some repos will actually require those million rows to be checked more than a few times. Determining which commit a file is related to is not a simple task in git - and although there's a commit graph we don't have a good way of querying it. (It shouldn't take 15s to pipe those two commands together - you're slowing things down by allocating file space - you should pipe the output to null btw.) There are a few more improvements to that function that can be made - for a start the function is not optimised for our collapsing of directories containing a single document - and writing a commit graph reader would be part of that. The gogit backend does have a commitgraph reader but it is not frugal with memory at all. I need to spend some time making a reader that is much more frugal and stream-like but I haven't had the time. (See the technical docs https://github.com/git/git/blob/master/Documentation/technical/commit-graph.txt) In the end though we need to move rendering of last commit info out of repo browsing and into an AJAX call. Again something I haven't had time to do. |
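The advice above about discarding output when timing the two piped git processes can be sketched in Go. This is only an illustration, not Gitea's code: the exact commands from the thread are not reproduced here, so `git rev-list HEAD` piped into `git cat-file --batch-check` is an assumed stand-in, and `pipeAndCount` and `byteCounter` are hypothetical names. The second process's output is counted and thrown away, so no file space is allocated.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// byteCounter discards everything written to it and only remembers the size.
type byteCounter struct{ n int64 }

func (c *byteCounter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// pipeAndCount runs first | second and returns how many bytes the second
// command wrote to stdout. Nothing is written to disk.
func pipeAndCount(first, second []string) (int64, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return 0, err
	}
	producer := exec.Command(first[0], first[1:]...)
	producer.Stdout = w
	consumer := exec.Command(second[0], second[1:]...)
	consumer.Stdin = r
	counter := &byteCounter{}
	consumer.Stdout = counter

	if err := producer.Start(); err != nil {
		r.Close()
		w.Close()
		return 0, err
	}
	w.Close() // the producer holds its own copy of the write end
	if err := consumer.Start(); err != nil {
		r.Close() // lets the producer fail with a broken pipe and exit
		producer.Wait()
		return 0, err
	}
	r.Close() // the consumer holds its own copy of the read end

	consumerErr := consumer.Wait() // also waits until stdout is fully copied
	producerErr := producer.Wait()
	if producerErr != nil {
		return counter.n, producerErr
	}
	return counter.n, consumerErr
}

func main() {
	// Assumed commands for illustration; run inside a git repository.
	n, err := pipeAndCount(
		[]string{"git", "rev-list", "HEAD"},
		[]string{"git", "cat-file", "--batch-check"},
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, "pipeline failed:", err)
		os.Exit(1)
	}
	fmt.Printf("second process produced %d bytes\n", n)
}
```

Wrapping the call in a timer gives a measurement that reflects the pipe traffic rather than disk writes.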
One question - have you disabled the commit cache? If so please re-enable it. |
It was enabled by default but the 'adapter' option was set to 'memory'. Now I have installed memcached and changed adapter to 'memcache' and a difference is visible. Opening https://gitea.ttmath.org/FreeBSD/ports for the first time took 79766ms and for the second time only 3063ms. Opening https://gitea.ttmath.org/FreeBSD/ports/src/branch/main/audio for the first time 141221ms and later 37205ms. But I see that you are calling a lot of git processes, I have created a small git wrapper in such a way:
I have moved the original /usr/local/bin/git to /usr/local/bin/git.org and compiled the above program as /usr/local/bin/git. It gives me git.log with all git operations, and I see that sometimes you are calling the git binary 300 times in one request:
So it cannot be fast; this reminds me of the old days when we were using CGI scripts. Is there a reason you are calling git directly instead of using a git library such as libgit2? |
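A logging wrapper of the kind described above could look roughly like this in Go. It is a sketch, not the program used in the thread: the real binary path `/usr/local/bin/git.org` comes from the comment, while the log location `/tmp/git.log`, the line format and the `logLine` helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

const (
	realGit = "/usr/local/bin/git.org" // renamed original binary, as in the comment
	logPath = "/tmp/git.log"           // assumed location of the log file
)

// logLine formats one log entry for a single git invocation.
func logLine(stamp string, pid int, args []string) string {
	return fmt.Sprintf("%s pid=%d git %s\n", stamp, pid, strings.Join(args, " "))
}

func main() {
	args := os.Args[1:]

	// Append the invocation to the log; logging failures are ignored so
	// that the wrapper never breaks the caller.
	if f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644); err == nil {
		f.WriteString(logLine(time.Now().Format(time.RFC3339), os.Getpid(), args))
		f.Close()
	}

	// Run the real git with the same arguments and standard streams.
	cmd := exec.Command(realGit, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			os.Exit(exitErr.ExitCode()) // propagate git's own exit status
		}
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Counting the lines written during one page load then gives the number of git invocations per request.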
Could you also count which git commands Gitea invoked among these 335 commands? This is because when browsing, Gitea will get the last commit message for every dir/file on the UI. For v1.13.0, we use go-git, which is a pure Go git library; for v1.14.x, we have two versions for Windows because the library has some memory problems. And maybe you could try to compile the go-git version yourself to check what the difference between them is. |
Hello there, looking at this from Codeberg's perspective (issue) As you can see, we're also suffering from the slow repository browsing which affects the overall performance of our machine. While setting up a Redis cache works well for us, we would like to improve the initial generation on cache misses, too. We suspect especially this command where each folder is checked for the latest commit, executing The idea makes sense to us: getting all commits, checking if the folder or file was touched. But the logic without gogit doesn't appear to stop after all necessary information was loaded, but rather continues serving up the entire history, even if all files or subfolders in a folder have already been hit by a recent commit. We assume that the process should be stopped before it ends if all information was provided. It looks like there's a lot of stuff to be improved with the native git backend, and some actions will probably always be slower because they cannot directly interface with git operations (e.g. directly working on git results while they are fetched instead of the piped input). It might be a good idea to turn back to gogit for all systems when the memory issues are resolved, or look for another git library (that is more native and maybe faster than gogit) but offers a better interface (Go bindings for libgit2?). Please let us know if we can provide further assistance in improving this performance issue. Some other random observations that might be interesting to you:
|
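The early-stop idea suggested in this comment can be sketched as follows. This is a deliberately simplified model, not either backend's algorithm: history is assumed to be a flat newest-first list with the touched paths already known, merges and entries that return to an earlier SHA are ignored, and the `commit` type and `lastCommits` function are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// commit is a simplified history entry: an ID plus the paths it changed.
type commit struct {
	ID      string
	Touched []string
}

// topLevel returns the first path component, i.e. the directory entry that
// a changed path belongs to when viewing the repository root.
func topLevel(path string) string {
	if i := strings.Index(path, "/"); i >= 0 {
		return path[:i]
	}
	return path
}

// lastCommits walks history from newest to oldest and records the newest
// commit touching each requested entry. It stops as soon as every entry is
// resolved and reports how many commits it had to scan.
func lastCommits(history []commit, entries []string) (map[string]string, int) {
	pending := make(map[string]bool, len(entries))
	for _, e := range entries {
		pending[e] = true
	}
	result := make(map[string]string, len(entries))
	scanned := 0
	for _, c := range history {
		if len(pending) == 0 {
			break // everything resolved: no need to read older history
		}
		scanned++
		for _, p := range c.Touched {
			entry := topLevel(p)
			if pending[entry] {
				result[entry] = c.ID
				delete(pending, entry)
			}
		}
	}
	return result, scanned
}

func main() {
	history := []commit{ // newest first
		{ID: "c4", Touched: []string{"audio/Makefile"}},
		{ID: "c3", Touched: []string{"devel/foo/Makefile", "audio/bar"}},
		{ID: "c2", Touched: []string{"README"}},
		{ID: "c1", Touched: []string{"audio/Makefile", "devel/Makefile", "README"}},
	}
	found, scanned := lastCommits(history, []string{"audio", "devel"})
	fmt.Println(found, "commits scanned:", scanned)
}
```

In this toy history only the two newest commits are read. The cost of such a scheme is the bookkeeping of pending entries, and it does not cover the merge and repeated-SHA cases discussed later in the thread.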
Take a look at #15891 |
@fnetX thanks for your long comment. It's worth remembering that the issue precipitating the pure git backend was memory use. I've submitted a patch to go-git which should cause much lower memory load. Until that is in and working correctly, go-git will happily load in huge blobs in to memory - storing them in caches even when you want to check the size of the object. It's really worth being clear that that is an absolutely intolerable situation. Further the issues you are highlighting in the last section are not new to the native git backend. They're present in a different way in the go-git backend just in a way you can't track. I've long advocated for changing to a more gitlab approach for this and/or for passing in request contexts to terminate things - I'm really happy to work on this - but I haven't had a chance to do this - and to be honest none of you are paying me. Now going on to the get last commit algorithm. All algorithms have a balance between memory and time. The current algorithm is highly optimised against memory use. If we are happy to use more memory that can be improved.
Looking at the length of time the rev-list process is running is a bit of a distraction. Yes, the go-git process can stop once it's finished looking at all the parents and the paths, but it's a question of memory and time spent tracking the parents. The greatest speed up in #15891 is actually preemptively passing the next tree ID to the git cat-file process as soon as we know what it's going to be. A large proportion of time appears to be spent waiting for Go to fill the read buffer from the other process. This is where the go-git algorithm can be quicker, as it avoids that by reading files in directly.
These are all longstanding issues and I am aware of them. I would love to spend time fixing these but I am limited in my time and availability. Honestly I wish you'd just talked to me directly. I'm always on Discord and could have told you and kept you abreast of what was going on and my progress in trying to speed this up. |
Hey, thank you very much for the explanation.
I somehow thought this was already in and just needed some further improvements, my bad.
Yes, we can mainly offer being thankful, as long as we aren't paid for anything either. ❤️
Yeah, the others told me that, too. But since Discord is a proprietary app that kept crashing my computer back when I last used it, I decided against this and went for dumping our findings somewhere, hoping they are of any use. Chose this issue over the thread on Codeberg as it seemed to better fit in this topic here.
Sounds like good news. Please tell us if there's anything we can do. |
Unfortunately this doesn't work. The idea was to use If I could figure out a way to not list the contents of the trees this would be definitely faster than the go-git version. |
What about adding |
Oh, you probably still want to have the full list of the folder you're looking at, just not of all the subfolders? |
yeah - I mean if we could just do that n times then it would be easy and fine but it's not like that. Also it's not quite -n1 consider the following tree:
If the wibble becomes the object with SHA So -n1 is still not right. |
Could you test #15891? In my limited testing this is faster for the root directory than the go-git native version. There is still a slowdown problem in the subdirectories. |
Thank you. We haven't yet been able to properly backport it to our fork // rebase our patches to this pull. We'll look into it and test then. |
@zeripath Thanks, now testing bd1455a from your repo (cache is disabled): The speed up is visible, browsing directories is about 5 times faster than 1.14.x. Not as fast as cgit but much better than before. Good job. |
@tsowa does cgit even attempt to provide last commit information? |
I have a backport of the latest get-lastcommit-cache performance improvements on to 1.14 if people would like them. |
We have tested the backport you provided on codeberg-test.org and it has roughly 3x the load time of go-git (15 to 17 seconds for your pull vs. ~5 seconds for go-git). We're using git version 2.29.2 - do you know if a more recent version might have better performance or if there are other constraints that might decrease performance? |
well that's interesting - as my timings appear to be similar to those on go-git. are you sure you've built from the backport-improve-get-lastcommit branch? The version should be 1.14.2+33-g57d45e1c2 as in SHA 57d45e1c247eaafb3a3a92ab593c31356b472d6f Do you have commit graphs enabled for your repos? |
We deployed this branch which has your commits on top of our 1.14 patches cleanly: https://codeberg.org/Codeberg/gitea/src/branch/codeberg-try-puregit-improvements (just confirmed once more that the commit matches: 1.14.2+49-g7e9e3f364) Yes, you can browse commit graphs on Codeberg. |
Hmm... I am very confused as this is now just as fast as gogit for me and possibly faster in places. Tell me there's at least some improvement here for you? I'm almost at my limit for what I can do to speed this up any further. The main slowdowns in my testing were in filling the buffers between the pipes & adjusting when the subsequent reads occurred seemed to fix this for me - perhaps my processor is just fast enough that the earlier writes provide me just enough time to prevent the fill lock whereas on your processor that's not quite enough time. I just don't think there's any way to avoid it - I mean we could try an os.pipe instead of an io.pipe? I tried an nio.Pipe but it was just as slow. Certainly we can't switch to the same algorithm as the go-git variant as that would require even more communication and waiting for the cat-file-batch pipes to respond and fill. By commit graph I meant the git core.commitGraph functionality. It should be enabled by default but ... I've certainly seen repos on my system that don't have a commit graph even though they would clearly benefit. Ok I guess we're at a point of diminishing returns - and I might be better off looking at solving the problems in go-git and making last commit info stop slowing down rendering. |
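Whether a given repository actually has a commit-graph file on disk can be checked with a small Go sketch like the one below. The two paths are the locations git uses for a single commit-graph file and for a split commit-graph chain; the `hasCommitGraph` helper itself is hypothetical and not part of Gitea.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasCommitGraph reports whether gitDir (a bare repository, or the .git
// directory of a working copy) contains a commit-graph file or a split
// commit-graph chain.
func hasCommitGraph(gitDir string) bool {
	candidates := []string{
		filepath.Join(gitDir, "objects", "info", "commit-graph"),
		filepath.Join(gitDir, "objects", "info", "commit-graphs", "commit-graph-chain"),
	}
	for _, p := range candidates {
		if info, err := os.Stat(p); err == nil && !info.IsDir() {
			return true
		}
	}
	return false
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: hascg <git-dir>")
		os.Exit(2)
	}
	fmt.Println("commit graph present:", hasCommitGraph(os.Args[1]))
}
```

If the file is missing, running `git commit-graph write --reachable` inside the repository should generate one.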
I did tests with the linux kernel and nixpkgs repos after a gitea restart with cold cache. It seems to be factor 3 on both in favor of gogit. I have no idea why the optimizations do nothing on those repos. git(new) here is the backport of your optimizations. linux 1.14 gogit 0:21 nixpkgs 1.14 gogit 0:06 I also backported your optimization to 1.14 before with same results, but I blamed my lack of understanding of the code and a bad backport. |
@ashimokawa you'd need to backport a few other PRs to see the improvement - it's not just #15891 that is needed. I'm happy to give you a link to that backport. |
@ashimokawa are you able to retest #16059 and tell me which repo it fails on. No rush. I'll push for #16042 to be merged in the meantime |
It did not fail, it was just extremely slow, slower than anything I have ever tested in this context. |
@ashimokawa did you actually test from 67a1aa9fc55a36364b6c2e5bd2e230410a63e3fe ? or was it from 2cfd269688c56839ddc53b0563ad7aefd3a4da2a ? What was the repo that was so slow btw? |
The repo I was testing with was https://codeberg-test.org/bigrepos/nixpkgs with cold cache. The branch I was testing was https://github.com/zeripath/gitea/tree/backport-use-git-log-raw and it took 55 seconds (!) compared to plain 1.14 with pure git, which only took ~17s. go-git took 7.7s; your (previous) backport was the fastest so far with 6.7s, but it led to wrong data in some subdirs, as @fnetX pointed out. |
@zeripath some test repos, cache is disabled. zeripath:improve-get-lastcommit (cc6d732): https://giteanew.ttmath.org/FreeBSD/ports https://giteanew.ttmath.org/FreeBSD/ports/src/branch/main/audio https://giteanew.ttmath.org/FreeBSD/ports/src/branch/main/devel https://giteanew.ttmath.org/FreeBSD/ports/src/branch/main/polish zeripath:use-git-log-raw (67a1aa9f): https://giteanew2.ttmath.org/FreeBSD/ports https://giteanew2.ttmath.org/FreeBSD/ports/src/branch/main/audio https://giteanew2.ttmath.org/FreeBSD/ports/src/branch/main/devel https://giteanew2.ttmath.org/FreeBSD/ports/src/branch/main/polish gitea 1.13.7: https://giteaold.ttmath.org/FreeBSD/ports https://giteaold.ttmath.org/FreeBSD/ports/src/branch/main/audio https://giteaold.ttmath.org/FreeBSD/ports/src/branch/main/devel https://giteaold.ttmath.org/FreeBSD/ports/src/branch/main/polish |
Thanks @tsowa So I've discovered something terrible with git log --raw and --name-only and --name-status that seems insurmountable. If there is a merge commit it appears that the files created in that merge don't appear. This might simply be a bug with git but it's kinda annoying. From within the git repository:
This might be the cause of slowdowns people are seeing. Same thing happens on 1.7.2! |
OK I've figured that out! |
OK I've pushed up another version of #16059 and its backport on to 1.14 to backport-use-git-log-raw. These are radically quicker for me on most of these repositories and examples.
I guess the next step is examining why git git/Documentation and nixpkgs/pkgs are pathological for 16059 and how that can be ameliorated. |
@zeripath I have updated https://giteanew2.ttmath.org to 5a90343b and can confirm: ports ~14 s The times probably depend on the kind of disks on the server, I have got old hdds 7200RPMs. |
Yeah my development laptop is a bit of a beast. The slow down appears to be with dealing with huge numbers of unseen parents. I'll take another look though. |
OK I've adjusted the heuristic slightly which appears to improve those three pathological cases. |
For me the performance is sufficient, thank you for your work. Any plan to put it to 1.14.3? |
well if you could comment on #16059 and say that you think this is now ready. If the codeberg peeps think it's better that would be good too. |
I tested nixpkgs and linux with https://github.com/zeripath/gitea/tree/backport-use-git-log-raw nixpkg#16059 3.4s linux#16059 65s Would need more testing to say anything, sometimes it is amazingly fast, sometimes almost 3x slower than go-git. But memory usage is way down :) |
Found another really bad one: nixpkgs/src/branch/master/pkgs/data/icons/beauty-line-icon-theme #16059 83s(!) So the PR is factor 40 slower vs go-git in this case, there is ONE file in the directory above |
That's just so weird - I've written a slight adjustment to only restart the log --name-status when there is genuinely no parent - hopefully that solves the problem above. It's still slower than go-git here which is weird as I can't really see any good reason for that but so it goes. |
Ah! I think the issue is: That should be <= not < ! |
use-git-log-raw (283959931) takes 6-7 second to render beauty-line-icon-theme: |
@zeripath It is good to improve pure git but I see that for this one it is even a regression from current pure git code (not to mention go-git which only takes less than 2 seconds) |
I think the next direction is Or just don't show last commit info if the number of tree entries is more than some value. |
I was missing out here a bit, didn't yet try to get the exact changes in the different algorithms, but I'm just thinking about the first one: With caching enabled (or at least a persistent cache), the information could already be generated for the full file tree. So the first access on a repo shows some last commits and slowly loads all other information, but in the background the cache is already filled for the other files so that the whole commit log doesn't need to be scanned for each new subfolder access. I'm still sceptical about too much async JS rendering; I mainly don't use GitLab because I know it sometimes hangs my browser when loading all the frontend stuff. Not sure if it's related to the commit info, but using the GitLab frontend is a pain for me. |
The other problem with all those pure git implementation also is that if a user cancels a request the git process runs till completion and does not get killed. So if a user is impatient and presses reload, he can easily OOM an instance. |
@zeripath didn't you do some work to pass the context down to git exec commands ☝️ ? |
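Passing a context down to an external command, as asked about here, can be sketched in Go with `exec.CommandContext` from the standard library. This is not Gitea's code: `runWithTimeout` is a hypothetical helper, and a timeout stands in for the request context (in an HTTP handler one would presumably pass `r.Context()` so that the process is killed when the user aborts the request).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a command that is killed once the context expires.
// It reports whether the deadline was hit, plus the command's error.
func runWithTimeout(ms int, name string, args ...string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()

	// CommandContext kills the child process when ctx is done.
	err := exec.CommandContext(ctx, name, args...).Run()
	return errors.Is(ctx.Err(), context.DeadlineExceeded), err
}

func main() {
	start := time.Now()
	timedOut, err := runWithTimeout(200, "sleep", "30")
	fmt.Printf("timed out: %v, error: %v, elapsed: %s\n",
		timedOut, err, time.Since(start).Round(time.Millisecond))
}
```

The `sleep 30` child is terminated shortly after the 200 ms deadline instead of running to completion, which is the behaviour wanted for abandoned page loads.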
You're wrong that this is unique to the native git backend - it's present in the gogit backend where it is absolutely uncancelable and will happily eat all of your memory until it's done. The native git backend at least has cancellable "processes" on the process page. #16032, which is already in 1.15 and is the PR @6543 is talking about, passes down the request context into GetLastCommitPaths for both backends making them absolutely cancellable. However, if I could convince other maintainers to review #16063 - this PR pushes the generation of commit info out of the render loop, deferring it on to a unique queue structure, which is clearly the best option. Two further additions would be good:
There are two PRs up for discussion here #16042 & #16059.
Assume our file of interest changed in C and B, A < B in terms of time, but had the same SHA in F too.
One of the problems with 1.14 git and #16042 is that
#16063 could easily be changed to pre cache the full repository but gitea will already attempt to precache large repos on pushes.
#16063 as currently constituted simply does |
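The unique-queue idea mentioned at the start of this comment can be illustrated with a minimal Go sketch. It is not the queue from #16063 or Gitea's queue package: `uniqueQueue`, its methods and the key format are hypothetical. The point is only that a request for commit info that is already pending is not queued a second time, so repeated reloads of the same directory do not multiply the work.

```go
package main

import (
	"fmt"
	"sync"
)

// uniqueQueue is a FIFO queue that refuses keys which are already waiting.
type uniqueQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	order   []string
}

func newUniqueQueue() *uniqueQueue {
	return &uniqueQueue{pending: make(map[string]bool)}
}

// Push enqueues key and returns true, or returns false if key is already queued.
func (q *uniqueQueue) Push(key string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return false
	}
	q.pending[key] = true
	q.order = append(q.order, key)
	return true
}

// Pop removes and returns the oldest key; ok is false when the queue is empty.
func (q *uniqueQueue) Pop() (key string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newUniqueQueue()
	// Hypothetical key format: repository, commit and directory.
	fmt.Println(q.Push("FreeBSD/ports@main:audio")) // first request is queued
	fmt.Println(q.Push("FreeBSD/ports@main:audio")) // duplicate request is dropped
	key, _ := q.Pop()
	fmt.Println("processing", key)
}
```

A background worker would pop keys, compute the last-commit data and store it in the cache, while the page render returns immediately.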
I came across this GH issue while looking for information on the same problem I've been having. After briefly looking over the comments here I understand that the slowness is due to Gitea trying to load the last commit message on a per-folder basis so it can display it on the rendered page. It looks like even though there is an option to cache the last commit, that doesn't really help in all cases. However I have noticed some improvements by setting these options: Probably could see even more improvement by increasing the TTL further, at the trade-off of possibly having wrong info.
I have one suggestion, especially for large repositories such as the FreeBSD ports repo. Maybe offer an option to disable the display of the commit message on a per-repo basis. Although it's a nice feature, it's kinda pointless on large repos IMHO. I don't think I even pay attention to it. Or you could have those commit messages indexed in the background to the database or Elasticsearch instead of generating them on the fly with git log Edit: just saw #16063 which is similar to what I just described. |
I'm going to close this as I believe these problems have been considerably improved on 1.15 and main. If specific problems remain please ask for a reopen but please provide some logs - or consider opening another issue with more details. |
I will say there is an improvement after upgrading to 1.15.0. Though it doesn't look like additional page loads after the initial are any 'faster'. Taking the ongoing example of the official FreeBSD ports git repository as a benchmark. When loaded into Gitea it takes about 2 seconds to render the sub directory Reloading the same page subsequent times take only slightly less time: It's a much better experience than 1.14.x. Thanks for the work towards improving this. Hopefully the ideas mentioned in #16063 and in my previous comment are still on the table. |
Description
I saw a similar thread but there is "windows" in the title so I created a new issue. Gitea 1.14.x is much slower in repository browsing than Gitea 1.13.
Sample repo running with 1.14.1:
https://gitea.ttmath.org/FreeBSD/ports
Try to open any directory, for example:
https://gitea.ttmath.org/FreeBSD/ports/src/branch/main/audio
It takes between 50-150 seconds to open a page.
The same repo running with 1.13.7:
https://giteaold.ttmath.org/FreeBSD/ports
Try to open similar directory, for example:
https://giteaold.ttmath.org/FreeBSD/ports/src/branch/main/audio
It takes about 5 seconds.
You can see the same problem on try.gitea.io:
https://try.gitea.io/tsowa/FreeBSD_ports
But you have a cache so you have to find a directory which was not open before. Opening such a page takes 100-300 seconds.
Let me know if more info is needed.