radicle tuning #1726

Merged: 3 commits merged into main from radicle-tuning on Dec 17, 2024
Conversation

@Byron (Member) commented Dec 17, 2024

Make the adjustments needed to make the final integration into heartwood smoother.

Tasks

  • make all necessary changes
  • bypass flaky CI

Notes

  • gix-negotiate is pulled in even though it's not used at all. Some types make this crate necessary though and I don't think it's worth trying to factor it out.

It's only used for validation of the ref-map's own state, which
is something that can be separate.
This reverts commit 65788e8.

It's happening again - CI is flaky once more.
@@ -109,7 +109,7 @@ title "gix (with repository)"

# for some reason, on CI the daemon always shuts down before we can connect,
# or isn't actually ready despite having accepted the first connection already.
-(with "git:// protocol"
+(not_on_ci with "git:// protocol"
@EliahKagan (Member) commented Dec 17, 2024

Could the journey test CI flakiness observed here be related to the journey test CI problems discussed in #1634 and worked around in #1725? Since, here, it is connecting, and just not seeing the same refs as are expected, my first thought is that this is separate. But if somehow a previous run of git-daemon could still be listening, then maybe it would be possible to connect to it and not see the expected refs.

In my experiments on the earlier breakage, I was unable to observe any process with an image name like git-daemon, and in your experiments described in #1634, while it seemed like the port was not yet available to be bound by a new process, it was not actually still open for connections (which agrees with what I observed). So I suspect that the problem here, with connecting but seeing different refs, may be unrelated. But I figured I'd bring it up just in case.

@Byron (Member, Author):

Thanks for chiming in!

I think the successful connection is nc connecting to a server that is on its way down.

And then gix even managed to connect, but it was doing so while the server shut down, so the connection unwound while it was in use.

Error: An IO error occurred when talking to the server

Caused by:
    Broken pipe (os error 32)

This probably means that gix should offer gix serve soon to be able to deal with sockets properly :D.

@EliahKagan (Member):

I think the successful connection is nc connecting to a server that is on its way down.

In that case, I think running git daemon in such a way that we can detect if it is immediately failing may make it possible to avoid this problem entirely. There are a few ways to do that:

  1. Keep having the shell create the asynchronous process, but in a different way that incorporates reporting what happened.
  2. Keep having the shell create the asynchronous process with & in the same way as that is done now, but check the status immediately (see the sketch after this list).
  3. Run git daemon with --detach. I think this might be sufficient to ensure it synchronously fails if it cannot bind to the port, though I'm not sure. This would then require different logic to stop the daemon.
  4. Run git daemon with --inetd. Something to look at if way 3 seems like it's on the right track but ends up being unsuitable.
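For concreteness, here is a minimal sketch of way 2, modeled on the existing launch-git-daemon helper; the 0.2s grace period is a guess, not a measured value:

function launch-git-daemon() {
  git -c uploadpack.allowRefInWant=true daemon --verbose --base-path=. \
    --export-all --user-path &>/dev/null &
  daemon_pid=$!
  # Give the daemon a moment to fail, e.g. if the port cannot be bound.
  sleep 0.2
  if ! kill -0 "$daemon_pid" 2>/dev/null; then
    echo "git-daemon exited immediately after launch" >&2
    return 1
  fi
  trap 'kill $daemon_pid' EXIT
}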

I'll look into this further to see if it can be done in a simple and reliable way.

In addition, we may be able to choose a different port based on what is available. But I am unsure whether that can easily be done in a portable (no pun intended) way. It would have the benefit of allowing the journey tests to be run in parallel, such as with each other on different builds, and of allowing them to be run at the same time as other tests or processes that use git-daemon with a different port. The latter seems potentially valuable.
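A hypothetical way to pick a free port with the tools the tests already use; the probe is racy (another process could take the port between the check and the launch), so this is a sketch rather than a robust solution:

port=9418
# nc -z succeeds while something is still listening, so probe upward
# until a connection attempt fails.
while nc -z localhost "$port" 2>/dev/null; do
  port=$((port + 1))
done
git daemon --port="$port" --verbose --base-path=. --export-all \
  --user-path &>/dev/null &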

And then gix even managed to connect, but it was doing so while the server shut down, so the connection unwound while it was in use.

Ah, I see. What's interesting is that, in my testing, git-daemon was never running (according to pgrep) when a new git -c ... daemon ... was about to be run, and it was not accepting any connections either. However, it may be that the race condition that happened here could in principle have happened before but (at least in my testing) never did.

This probably means that gix should offer gix serve soon to be able to deal with sockets properly :D.

That sounds like a great idea. I had the thought that it should not be specific to the git:// protocol, since there would be a benefit in testing to being able to use it to test multiple protocols. But if I understand the description in #307, that seems already to be the plan (or maybe the plan, at that time, even excluded git://--should #307 be updated?).

@Byron (Member, Author):

3. Run git daemon with --detach. I think this might be sufficient to ensure it synchronously fails if it cannot bind to the port, though I'm not sure. This would then require different logic to stop the daemon.

It can write a --pid-file which should make it trivial to take it down.

I tried this:

diff --git a/tests/helpers.sh b/tests/helpers.sh
index 2f435a553..d6233ccb6 100644
--- a/tests/helpers.sh
+++ b/tests/helpers.sh
@@ -60,10 +60,13 @@ function small-repo-in-sandbox() {
 }
 
 function launch-git-daemon() {
-    git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --user-path &>/dev/null &
-    daemon_pid=$!
+    if ! git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --pid-file=git-daemon.pid --user-path --detach; then
+      echo "failed to start daemon"
+      return 1
+    fi
+    local daemon_pid=$(cat git-daemon.pid)
     while ! nc -z localhost 9418; do
       sleep 0.1
     done
-    trap 'kill $daemon_pid' EXIT
+    trap "kill $daemon_pid" EXIT
 }

However, it never failed to start up, and still ran into the same issue repeatedly. The first time it usually worked, but the second launch now sees "Connection reset by peer" repeatedly.

                 [it] generates the correct output in JSON format
Connection to localhost port 9418 [tcp/git] succeeded!
        [with] version 1
           [with] NO output directory
              [with] no wanted refs
                 [it] generates the correct output
              [with] wanted refs
                 [it] generates the correct output
1c1,5
< Error: None of the refspec(s) =refs/heads/main matched any of the 5 refs on the remote
\ No newline at end of file
---
> Error: Could not decode server reply
>
> Caused by:
>     0: Failed to read from line reader
>     1: Connection reset by peer (os error 54)
\ No newline at end of file
 - FAIL
$ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/debug/gix --no-verbose free pack receive -p 1 git://localhost/ -r =refs/heads/main
Output snapshot did not match snapshot at '/Users/byron/dev/github.com/GitoxideLabs/gitoxide/tests/snapshots/plumbing/no-repo/pack/receive/file-v-any-no-output-wanted-ref-p1'
Error: Could not decode server reply

Caused by:
    0: Failed to read from line reader
    1: Connection reset by peer (os error 54)

and of allowing them to be run at the same time as other tests or processes that use git-daemon with a different port. The latter seems potentially valuable.

That would be useful, but would also make it necessary to configure the port used for the git protocol. It's no big deal, just something I thought I'd mention.

at that time, even excluded git://--should #307 be updated?).

Right, I forgot about that issue 😅. But I have now updated it - it's a server, and it can reasonably handle multiple transports by abstracting over them, similar to how the client already does it. It's still far enough away that git-daemon will have to be used for the foreseeable future though.

@EliahKagan (Member):

However, it never failed to start up, and still ran into the same issue repeatedly. The first time it usually worked, but the second launch now sees "Connection reset by peer" repeatedly.

Hmm.

Could the nc -z connection(s) actually be triggering a bug in git-daemon where, if it receives a connection with no data, it malfunctions later, such as by confusing later connections with the trivial nc connection and sending them RST packets?
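If that hypothesis holds, one way to avoid feeding git-daemon empty connections might be to make the readiness probe perform a real git:// request instead of a bare TCP connect. A sketch, where repo stands in for an actual exported repository path:

# Wait until git-daemon answers a real request rather than merely
# accepting a TCP connection; "repo" is a placeholder path.
while ! git ls-remote git://localhost/repo >/dev/null 2>&1; do
  sleep 0.1
done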

@EliahKagan (Member) commented Dec 18, 2024

I never understood why that worked in the first place (after all I had to deactivate the tests again on CI)

Yes. Although I ran it many times, whatever effect it had was probably nondeterministic and definitely brittle.

Since the only reason for disabling caching was to avoid this problem, and the relevant tests are turned back off on CI now, I'll open a PR shortly to reenable caching in the test-journey job.

Edit: I've opened #1728 for that.

as it's not really about the time between *.sh runs

It seems to me that the time between *.sh runs may be very relevant here.

The time between two launches of git-daemon is what might matter, but that's not affected by Rust-built binaries as these have already been built.

I do not mean immediately preceding test cases in the same run of the journey test script, but rather earlier runs of the journey tests in builds with different features.

On CI, within a single run of the same test-journey job, nontrivial cargo build commands that compile artifacts do run between separate invocations of journey.sh. The helper function that launches git-daemon is called at least once in each of the journey.sh runs that test at least one thing related to git://.

So the timing between two launches of git-daemon is sometimes affected by how long those build commands take. This will happen when one of the launches is the last one before a nontrivial build command and the other launch is the first one after that same nontrivial build command. (I don't know if it happens at other times, but I don't think so.)

Details follow.

The per-feature recipes

On CI, several different feature sets are tested by running the journey tests, in the same CI job, and artifacts are compiled in between the runs. There are four feature sets for which journey.sh is run separately and, ahead of each one, there are explicit cargo build commands:

gitoxide/justfile, lines 200 to 222 in 69ee6a3:

# run journey tests (max)
journey-tests:
    cargo build --features http-client-curl-rustls
    cargo build -p gix-testtools --bin jtt
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} max

# run journey tests (max-pure)
journey-tests-pure:
    cargo build --no-default-features --features max-pure
    cargo build -p gix-testtools --bin jtt
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} max-pure

# run journey tests (small)
journey-tests-small:
    cargo build --no-default-features --features small
    cargo build -p gix-testtools
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} small

# run journey tests (lean-async)
journey-tests-async:
    cargo build --no-default-features --features lean-async
    cargo build -p gix-testtools
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} async

All four of those build + journey.sh recipes are run in sequence (though in a different order) from the ci-journey-tests recipe, which the test-journey CI job runs. The build commands are non-trivial; they build artifacts, and sometimes download source code. Without caching, I believe they take longer.

Verification in logs

That git:// protocol journey tests are run, then later artifacts are compiled, then later more git:// protocol journey tests run, can be verified in the logs of CI runs that have journey tests enabled, such as this run on the last commit on main that ran journey tests.

Specifically, in this log, produced by viewing and saving the full log from that CI run, git:// protocol tests are logged at lines 1236, 1295, 1783, 1815, 2128, and 2187, while builds that report compiling at least one artifact are logged at lines 717, 1055, 1431, 1651, and 1919. (Searching for git:// and Compiling may be easier than checking those, though.)

A more useful view of the logs

The progression of the most relevant events is:

  1. A max-pure build.
  2. A ./tests/journey.sh run for max-pure, in which git-daemon is used in a gix remote refs test and again in a gix free pack receive test.
  3. A small build.
  4. A ./tests/journey.sh run for small, in which git-daemon is not used.
  5. A lean-async build.
  6. A ./tests/journey.sh run for lean-async, in which git-daemon is used in a gix remote refs test and a gix free pack receive test.
  7. An all default features plus http-client-curl-rustls build.
  8. A ./tests/journey.sh run for all default features plus http-client-curl-rustls, in which git-daemon is used in a gix remote refs test and a gix free pack receive test.

Within the individual steps listed above that run ./tests/journey.sh, nothing is built. But in between them, builds take place, and I suspect that the variation in how long those take is a factor that contributes to whatever strange behavior of git-daemon we are seeing here.

This progression also applies in failing runs

The log examined above is from a run that succeeded, so a bit more is needed to make the case that build times are plausibly relevant to failures, in the way that failures have actually occurred.

Consider the failing run from #1634, linked from #1634 (comment). This runs journey tests for different feature sets in the same order (no changes to justfile have been made since somewhat before then), with the test of all default features plus http-client-curl-rustls coming last. GitHub Actions logs are eventually deleted, so just in case, here's a gist of that run's log. (It should not be confused with the above gist.)

The same progression is happening there: journey tests on builds with some feature sets run, including journey tests that run and use git-daemon, then other builds run, then the journey test of default features plus http-client-curl-rustls runs git-daemon and fails to communicate with it.

Interleaved builds cover some cache-eligible artifacts

I counted build steps as non-trivial above when they compiled at least one artifact. I did this to make clear that artifacts really do get built between runs of git-daemon in the journey tests, due to the way the journey tests are run in four sub-recipes defined in the justfile, each of which starts with cargo build commands.

But rust-cache avoids caching workspace crates (Swatinem/rust-cache#37 (comment), #1668 (comment)). So the kind of caching rust-cache provides is unlikely to have a significant effect on build times unless the artifacts that have to be rebuilt include artifacts for crates that are obtained from the registry rather than being part of this workspace.

It turns out they do. Probably most significantly, because it precedes the ./tests/journey.sh command in which failures occur in runs that have failures, the cargo build --features http-client-curl-rustls command builds 29 crates eligible for caching, as well as a few more interleaved with gix-* crates that are ineligible for caching. Of the crates eligible for caching that it builds, it even downloads 6 of them.

(I think there may be ways other than timing that this affects networking on the test machine. They would also be very strange and unintuitive. But the problem usually or always happens on GNU/Linux, at least on CI, and does not usually happen on macOS, at least in CI tests of macOS, as mentioned in #1634 (comment). So the cause, whatever it is, is something unintuitive.)

Synthesis (not really a conclusion)

I do very much agree that some bug in git-daemon, or some problem related to it that is not specific to gitoxide, is probably at the root of this.

But the order in which cargo build and ./tests/journey.sh commands run, as well as their observed effects, shows that some delays between git-daemon runs are affected by how much work cargo build steps do. Furthermore, a cargo build step between git daemon using tests that always succeed, and those that have been observed to fail, has to download some cache-eligible crates and to build artifacts for them and a number of other cache-eligible crates, when rust-cache is omitted.

Both due to those builds and more generally, timing could, as far as I know, still be a significant trigger.

I should be able to test how plausible this is by splitting the tests into one job per feature-set sub-recipe, or by changing the order in which the ci-journey-tests recipe lists those recipes.

Addendum 1: the 127.0.0.1 vs. ::1 connection (pun intended)

Although I didn't know until this PR that the workaround in #1725 would prove insufficient, I had hoped to find a better way, and while looking for one I tried forcing git-daemon to listen on the IPv6 ::1 rather than the IPv4 127.0.0.1. The most relevant of the changed code was this (where $port just expands to 9418):

git -c uploadpack.allowRefInWant=true \
  daemon --verbose --base-path=. --export-all --user-path --listen=::1 &>/dev/null &
daemon_pid=$!
while ! nc -z ::1 "$port"; do
  sleep 0.1
done

In #1634 (comment), I had made this partially inaccurate claim:

The reason I think--or, I should say, thought--gix is treating localhost to mean ::1, even when nc has treated it to mean 127.0.0.1, is that it looks like the problem went away when I added --listen=::1 to the git daemon command and modified the nc -z loop that follows it to connect to ::1. But I am not sure that was a correct observation, because when I do this now, it is taking forever, even on tests that had worked before with ::1: somehow, even where nc had reported connecting to ::1, it cannot do so when passed ::1.

I now believe that what was really happening is that whenever the problem connecting to git-daemon is going to occur, the change of forcing ::1 causes nc to block forever. Runs that succeeded before still appear to succeed, while the journey tests after cargo build --features http-client-curl-rustls have it block forever.

This is shown in the logs from the commit with the ::1 experiment (not to be confused with any other logs cited above). The "The operation was canceled" message documents automatic cancellation of the job after it blocked for 360 minutes.

One way to describe this is that forcing IPv6 causes the nc command, which is intended to ensure gix is able to connect to git-daemon, to fail when a gix command would fail. That is, when gix would not be able to use git-daemon, nc under the ::1 modifications seems, at least as tested, to be doing its intended job of blocking; it's just that the condition it is waiting for never happens. Whether this ::1 experiment really illuminated anything meaningful about the cause of the problem, I am not sure.

Addendum 2: rust-cache also affects code generation

This seems even weirder and less plausible, but maybe code in gitoxide crates or their dependencies really is the issue and the problem varies depending on whether intermediate artifacts are created.

The rust_cache action sets CARGO_INCREMENTAL=0. If unsafe code somewhere has undefined behavior and optimizations are turned on, then I think that could cause a bug due to UB to manifest. This seems unlikely for debug builds--such as those exercised in the journey tests. But some crates are configured to use opt-level = 3 even in debug builds.

(This is also something I should be able to test, by creating a situation where the tests run but the failures do not occur and then setting CARGO_INCREMENTAL=0 to see if it has any effect.)
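A sketch of that check, assuming journey.sh is invoked with the target/debug binaries the justfile variables normally point to:

# Build with incremental compilation disabled, as rust-cache does,
# then run the journey recipe that has been failing.
cargo clean
CARGO_INCREMENTAL=0 cargo build --features http-client-curl-rustls
CARGO_INCREMENTAL=0 cargo build -p gix-testtools --bin jtt
./tests/journey.sh target/debug/ein target/debug/gix target/debug/jtt max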

@Byron (Member, Author):

Thanks a lot for the detailed analysis - stunning work!

Within the individual steps listed above that run ./tests/journey.sh, nothing is built. But in between them, builds take place, and I suspect that the variation in how long those take is a factor that contributes to whatever strange behavior of git-daemon we are seeing here.

As the removal of rust_cache has seemingly solved the problem, it very much felt like a timing issue, possibly also related to network use when downloading crates.

The rust_cache action sets CARGO_INCREMENTAL=0. If unsafe code somewhere has undefined behavior and optimizations are turned on, then I think that could cause a bug due to UB to manifest. This seems unlikely for debug builds--such as those exercised in the journey tests. But some crates are configured to use opt-level = 3 even in debug builds.

I definitely like this explanation more - that artefacts differ in meaningful ways when a cache is involved.
It still doesn't make much sense to me though, especially since most failures seem to stem from a trivial connection error, something that I'd trust to be 'correct'.

But the problem usually or always happens on GNU/Linux, at least on CI, and does not usually happen on macOS, at least in CI tests of macOS, [..]

These days it's easy for me to reproduce the issue, it's basically not working for me anymore at all, most of the time.

Something that changed is that it now errors due to connection reset, not to connection refused.

Error: Could not decode server reply

Caused by:
    0: Failed to read from line reader
    1: Connection reset by peer (os error 54)

It's also always the second time the git-daemon starts up when it fails.

Just for fun I added this patch:

diff --git a/tests/journey/gix.sh b/tests/journey/gix.sh
index a7acfc8f7..b79a7a7c0 100644
--- a/tests/journey/gix.sh
+++ b/tests/journey/gix.sh
@@ -280,6 +280,7 @@ title "gix commit-graph"
         fi
         (not_on_ci with "git:// protocol"
           launch-git-daemon
+          sleep 10
           (with "version 1"
             (with "NO output directory"
               (with "no wanted refs"

It still failed, so timing shouldn't be an issue. I could also see, while it was waiting, that git-daemon was up and running.

Conversely, putting the sleep 10 before launch-git-daemon allowed me to validate that the previous git-daemon wasn't running anymore.

Both invocations should be completely independent.

To see if it's related to remote refs or free pack receive, I replaced the free pack receive portion with a copy of remote refs. And that works.

A clear indication that the git-daemon is crashing or terminating the connection forcefully.

When replacing the remote refs part with a copy of the seemingly failing free pack receive part, it seems to work but hangs at this line:

Connection to localhost port 9418 [tcp/git] succeeded!
        [with] version 1
           [with] NO output directory
              [with] no wanted refs
                 [it] generates the correct output
              [with] wanted refs
                 [it] generates the correct output
           [with] output directory
              [it] generates the correct output
        [with] version 2
           [with] NO output directory
              [with] NO wanted refs
                 [it] generates the correct output
              [with] wanted refs
                 [it] generates the correct output
                 [when] ref does not exist
                    [it] fails with a detailed error message including what the server said

After cranking up debug-logging in the daemon I get this:

                    [it] fails with a detailed error message including what the server said
07:29:41.100591 run-command.c:655       trace: run_command: REMOTE_ADDR='[::1]' REMOTE_PORT=57324 /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon --serve --verbose --base-path=. --export-all --user-path
07:29:41.102904 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon
07:29:41.103190 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[16726] Connection from [::1]:57324
07:29:41.103327 pkt-line.c:80           packet:          git< git-upload-pack /\0host=localhost\0\0version=2\0agent=git/oxide-0.68.0\0
[16726] Extended attribute "host": localhost
[16726] Extended attribute "protocol": version=2:agent=git/oxide-0.68.0
[16726] Request upload-pack for '/'
07:29:41.103487 run-command.c:655       trace: run_command: GIT_PROTOCOL=version=2:agent=git/oxide-0.68.0 git upload-pack --strict --timeout=0 .
[16726] 07:29:41.106433 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git
[16726] 07:29:41.106793 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[16726] 07:29:41.106960 git.c:460               trace: built-in: git upload-pack --strict --timeout=0 .
[16726] 07:29:41.107047 pkt-line.c:80           packet:  upload-pack> version 2
[16726] 07:29:41.107065 pkt-line.c:80           packet:  upload-pack> agent=git/2.39.5.(Apple.Git-154)
[16726] 07:29:41.107167 pkt-line.c:80           packet:  upload-pack> ls-refs=unborn
[16726] 07:29:41.107189 pkt-line.c:80           packet:  upload-pack> fetch=shallow wait-for-done ref-in-want
[16726] 07:29:41.107192 pkt-line.c:80           packet:  upload-pack> server-option
[16726] 07:29:41.107194 pkt-line.c:80           packet:  upload-pack> object-format=sha1
[16726] 07:29:41.107196 pkt-line.c:80           packet:  upload-pack> object-info
[16726] 07:29:41.107198 pkt-line.c:80           packet:  upload-pack> 0000
[16726] 07:29:41.107926 pkt-line.c:80           packet:  upload-pack< command=ls-refs
[16726] 07:29:41.107929 pkt-line.c:80           packet:  upload-pack< agent=git/oxide-0.68.0
[16726] 07:29:41.107931 pkt-line.c:80           packet:  upload-pack< 0001
[16726] 07:29:41.107933 pkt-line.c:80           packet:  upload-pack< symrefs
[16726] 07:29:41.107941 pkt-line.c:80           packet:  upload-pack< peel
[16726] 07:29:41.107943 pkt-line.c:80           packet:  upload-pack< unborn
[16726] 07:29:41.107945 pkt-line.c:80           packet:  upload-pack< ref-prefix refs/heads/
[16726] 07:29:41.107947 pkt-line.c:80           packet:  upload-pack< 0000
[16726] 07:29:41.108354 pkt-line.c:80           packet:  upload-pack> ee3c97678e89db4eab7420b04aef51758359f152 refs/heads/dev
[16726] 07:29:41.108388 pkt-line.c:80           packet:  upload-pack> 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6 refs/heads/main
[16726] 07:29:41.108391 pkt-line.c:80           packet:  upload-pack> 0000
[16726] 07:29:41.108609 pkt-line.c:80           packet:  upload-pack< command=fetch
[16726] 07:29:41.108612 pkt-line.c:80           packet:  upload-pack< agent=git/oxide-0.68.0
[16726] 07:29:41.108613 pkt-line.c:80           packet:  upload-pack< 0001
[16726] 07:29:41.108692 pkt-line.c:80           packet:  upload-pack< thin-pack
[16726] 07:29:41.108694 pkt-line.c:80           packet:  upload-pack< ofs-delta
[16726] 07:29:41.108696 pkt-line.c:80           packet:  upload-pack< include-tag
[16726] 07:29:41.108698 pkt-line.c:80           packet:  upload-pack< done
[16726] 07:29:41.108699 pkt-line.c:80           packet:  upload-pack< 0000

The last invocation is a fetch request without any want <object-id> line, but with a done line (as sent by gix) that should indicate: the negotiation has ended, send me the pack. But because nothing is wanted, it seems to just hang - likely a deadlock: gix would now try to read the response, but the upload-pack program isn't sending anything, and it is apparently reading as well, as it also hangs there.
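For reference, the request shown in the trace could plausibly be replayed against upload-pack directly with hand-built pkt-lines (each length prefix is four hex digits covering itself plus the payload); whether this hangs the same way will depend on the git version:

# "command=fetch\n" is 0x12 bytes including the 4-byte length prefix,
# 0001 is the section delimiter, "done\n" is 0x0009 bytes, 0000 flushes.
printf '0012command=fetch\n00010009done\n0000' |
  GIT_PROTOCOL=version=2 git upload-pack --strict --timeout=0 .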

When kill -9 is applied to the upload-pack process, the test just passes and it continues to the second round. There it failed - in this case - because the previous git-daemon didn't shut down.

I will investigate this further, as gix should probably not submit fetch requests without a single want.
Besides that, I don't think this will fix the actual issue, as thus far the first invocation of git-daemon still seems to work much better.
It remains a very elusive issue.

@EliahKagan (Member) commented Dec 19, 2024

These days it's easy for me to reproduce the issue, it's basically not working for me anymore at all, most of the time.

Something that changed is that it now errors due to connection reset, not to connection refused.

I wonder if a recently updated dependency is a factor, and what happens if --locked is used on a system where the problem recently began to occur more often.

I wonder also if a system (or Homebrew-provided) package is a factor. curl was recently updated on many systems, to fix a CVE (though I don't know how those specific changes could affect this).

@Byron (Member, Author):

After my recent trial session I don't even feel it's worth trying those fringes of possible causes as it left me with something very clear to pursue: a hang when a fetch is performed without a single 'want'.

I also learned that gix remote refs works twice, but once gix free pack receive is involved, it starts failing and resetting the connection. If it is run first, it hangs, and if it runs second, the first connection to the daemon is reset. It also became clear that the daemon is up and operational and that it deliberately resets the connection.

This would make sense as free pack receive changed the most, it now uses an entirely new and much more faithful protocol implementation, and one that never got tested with the pure git://. It's probably a legitimate bug.

Now that I am writing this I also realise that I can turn on packetline tracing on the client so it would be clear what it says to the server before it shuts down.

I hope I get to it today.

@Byron (Member, Author):

Let's see if this works.

The culprit seems to have been that free pack receive, albeit relying on gix-protocol, could still mess it up by forgetting to properly shut down the connection (by sending an EOF/flush packet), and it was able to request a pack without ever specifying an object to receive, causing a deadlock (upload-pack kept reading even though the client indicated it was done). Especially the latter sounds like an attack vector for a server, as it feels like a client could cause any number of connections to be opened and held open, which is much more expensive for the server than it would be for the client.

If it was absolutely no effort, I'd definitely feel curious enough to try whether it really does work like I think it would… and… I tried it, but via HTTPS I was only able to go into a negotiation loop. However, thinking about it, it's still trivial to hang the server forever by implementing the protocol incorrectly, so I suppose every server must protect against that. The bug I ran into locally I couldn't reproduce using HTTP though.

Ultimately, it was quite useful to have an inferior implementation of fetch as implemented by gix free pack receive as it helped to make the underlying gix-protocol implementation more sturdy. It now incorporates more of what the gix::fetch implementation already checks.

@Byron marked this pull request as ready for review December 17, 2024 18:35
@Byron merged commit a542775 into main Dec 17, 2024
20 checks passed
@Byron deleted the radicle-tuning branch December 17, 2024 18:35
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Dec 18, 2024
This reverts commit bf2abdb, the
supporting change in GitoxideLabs#1725 that refrained from caching build
artifacts from the `test-journey` job, since that no longer needs
to be avoided now that `git-daemon` journey tests aren't run on CI.

In 9566488 (GitoxideLabs#1634), journey tests that use `git-daemon` were
disabled on CI. GitoxideLabs#1725 tried reenabling them (65788e8) and fixing
them by no longer caching build artifacts with the `rust-cache`
action in the `test-journey` job (bf2abdb). This worked at the
time, but the exact reason it worked or how reliable it would be
were unknown. It stopped working shortly thereafter in 25b8480
(GitoxideLabs#1726), and those journey tests were re-disabled on CI in d1d3f7c,
which reverted 65788e8. However, bf2abdb was not reverted at that
time.

(The change in d1d3f7c from GitoxideLabs#1726, taken together with this
commit, effectively constitute a revert of PR GitoxideLabs#1725.)