radicle tuning #1726
Conversation
It's only used for validation of the ref-map's own state, which is something that can be separate.
This reverts commit 65788e8. It's happening again - CI is flaky once more.
@@ -109,7 +109,7 @@ title "gix (with repository)"
 # for some reason, on CI the daemon always shuts down before we can connect,
 # or isn't actually ready despite having accepted the first connection already.
-(with "git:// protocol"
+(not_on_ci with "git:// protocol"
Could the journey test CI flakiness observed here be related to the journey test CI problems discussed in #1634 and worked around in #1725? Since, here, it is connecting, and just not seeing the same refs as are expected, my first thought is that this is separate. But if somehow a previous run of git-daemon could still be listening, then maybe it would be possible to connect to it and not see the expected refs.
In my experiments on the earlier breakage, I was unable to observe any process with an image name like git-daemon, and in your experiments commented in #1634, while it seemed like the port was not yet available to be bound by a new process, it was not actually still open for connections (which agrees with what I observed). So I suspect that the problem here, with connecting but seeing different refs, may be unrelated. But I figured I'd bring it up just in case.
Thanks for chiming in!
I think the successful connection is nc that connected to a server that is on the way down.
And then the gix even managed to connect, but it was doing so while the server shut down, so it unwound while the connection was used.
Error: An IO error occurred when talking to the server
Caused by:
Broken pipe (os error 32)
This probably means that gix should offer gix serve soon, to be able to deal with sockets properly :D.
I think the successful connection is nc that connected to a server that is on the way down.
In that case, I think running git daemon in such a way that we can detect if it is immediately failing may make it possible to avoid this problem entirely. There are a few ways to do that:
1. Keep having the shell create the asynchronous process, but in a different way that incorporates reporting what happened.
2. Keep having the shell create the asynchronous process with & in the same way as that is done now, but check the status immediately (a rough sketch of this check follows below).
3. Run git daemon with --detach. I think this might be sufficient to ensure it synchronously fails if it cannot bind to the port, I'm not sure. This would then require different logic to stop the daemon.
4. Run git daemon with --inetd. Something to look at if way 3 seems like it's on the right track but ends up being unsuitable.
I'll look into this further to see if it can be done in a simple and reliable way.
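For illustration, a minimal sketch of way 2, under the assumption that a daemon which cannot bind its port exits almost immediately; the daemon flags mirror the current helper, while the short sleep, the log capture, and the kill -0 liveness check are the new, hypothetical parts:

function launch-git-daemon() {
  git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --user-path &>daemon.log &
  daemon_pid=$!
  # Give the daemon a moment to fail if it is going to (e.g. the port is already bound),
  # then verify the background process is still alive before waiting for the port.
  sleep 0.2
  if ! kill -0 "$daemon_pid" 2>/dev/null; then
    echo "git daemon exited immediately:" >&2
    cat daemon.log >&2
    return 1
  fi
  while ! nc -z localhost 9418; do
    sleep 0.1
  done
  trap 'kill $daemon_pid' EXIT
}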
In addition, we may be able to choose a different port based on what is available. But I am unsure whether that can easily be done in a portable (no pun intended) way. It would have the benefit of allowing the journey tests to be run in parallel, such as with each other on different builds, and of allowing them to be run at the same time as other tests or processes that use git-daemon with a different port. The latter seems potentially valuable.
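As for picking an available port, one rough and possibly not fully portable sketch (it assumes python3 is present on the test machines) would be to let the OS hand out an ephemeral port and pass it along:

# Ask the kernel for a currently free TCP port by binding port 0 and reading back
# the assigned number, then pass it to git daemon via --port and to the nc probe.
port=$(python3 -c 'import socket; s = socket.socket(); s.bind(("127.0.0.1", 0)); print(s.getsockname()[1])')
git daemon --verbose --base-path=. --export-all --user-path --port="$port" &>/dev/null &
daemon_pid=$!
while ! nc -z localhost "$port"; do
  sleep 0.1
done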
And then the gix even managed to connect, but it was doing so while the server shut down, so it unwound while the connection was used.
Ah, I see. What's interesting is that, in my testing, git-daemon was never running (according to pgrep) when a new git -c ... daemon ... was about to be run, as well as not accepting any connections. However, it may be that the race condition that happened here could in principle have happened before but (at least in my testing) never did.
This probably means that gix should offer gix serve soon, to be able to deal with sockets properly :D.
That sounds like a great idea. I had the thought that it should not be specific to the git:// protocol, since there would be a benefit in testing to being able to use it to test multiple protocols. But if I understand the description in #307, that seems already to be the plan (or maybe the plan, at that time, even excluded git:// --should #307 be updated?).
3. Run git daemon with --detach. I think this might be sufficient to ensure it synchronously fails if it cannot bind to the port, I'm not sure. This would then require different logic to stop the daemon.

It can write a --pid-file which should make it trivial to take it down.
I tried this:
diff --git a/tests/helpers.sh b/tests/helpers.sh
index 2f435a553..d6233ccb6 100644
--- a/tests/helpers.sh
+++ b/tests/helpers.sh
@@ -60,10 +60,13 @@ function small-repo-in-sandbox() {
}
function launch-git-daemon() {
- git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --user-path &>/dev/null &
- daemon_pid=$!
+ if ! git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --pid-file=git-daemon.pid --user-path --detach; then
+ echo "failed to start daemon"
+ return 1
+ fi
+ local daemon_pid=$(cat git-daemon.pid)
while ! nc -z localhost 9418; do
sleep 0.1
done
- trap 'kill $daemon_pid' EXIT
+ trap "kill $daemon_pid" EXIT
}
However, it never failed to start up, and still ran into the same issue repeatedly. The first time it usually worked, but the second launch now sees "Connection reset by peer" repeatedly.
[it] generates the correct output in JSON format
Connection to localhost port 9418 [tcp/git] succeeded!
[with] version 1
[with] NO output directory
[with] no wanted refs
[it] generates the correct output
[with] wanted refs
[it] generates the correct output
1c1,5
< Error: None of the refspec(s) =refs/heads/main matched any of the 5 refs on the remote
\ No newline at end of file
---
> Error: Could not decode server reply
>
> Caused by:
> 0: Failed to read from line reader
> 1: Connection reset by peer (os error 54)
\ No newline at end of file
- FAIL
$ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/debug/gix --no-verbose free pack receive -p 1 git://localhost/ -r =refs/heads/main
Output snapshot did not match snapshot at '/Users/byron/dev/github.com/GitoxideLabs/gitoxide/tests/snapshots/plumbing/no-repo/pack/receive/file-v-any-no-output-wanted-ref-p1'
Error: Could not decode server reply
Caused by:
0: Failed to read from line reader
1: Connection reset by peer (os error 54)
and of allowing them to be run at the same time as other tests or processes that use git-daemon with a different port. The latter seems potentially valuable.

That would be useful, but would also make it necessary to configure the port used for the git protocol. It's no big deal, just something I thought I'd mention.
at that time, even excluded git:// --should #307 be updated?).

Right, I forgot about that issue 😅. But at least I have updated it - it's a server and it can reasonably handle multiple transports by abstracting over them, similar to how the client already does it. It's still far enough away that we'll have to use git-daemon for the foreseeable future though.
However, it never failed to start up, and still ran into the same issue repeatedly. The first time it usually worked, but the second launch now sees "Connection reset by peer" repeatedly.
Hmm.
Could the nc -z connection(s) actually be triggering a bug in git-daemon where, if it receives a connection with no data, it malfunctions later, such as by confusing later connections with the trivial nc connection and sending them RST packets?
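One crude way to probe that hypothesis, not something tried in this thread, would be to flood an already-running daemon with data-less probe connections and then issue a real request (the gix invocation is the one from the failing snapshot above; use whichever gix build is under test):

# Hammer the daemon with empty nc probes, then issue a real request; if the
# hypothesis held, the request below would start seeing resets even though the
# daemon is otherwise healthy.
for _ in $(seq 1 50); do nc -z localhost 9418; done
gix --no-verbose free pack receive -p 1 git://localhost/ -r =refs/heads/main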
I never understood why that worked in the first place (after all I had to deactivate the tests again on CI)
Yes. Although I ran it many times, the way it affected things was probably nondeterministic and definitely brittle.
Since the only reason for disabling caching was to avoid this problem, and the relevant tests are turned back off on CI now, I'll open a PR shortly to reenable caching in the test-journey job.
Edit: I've opened #1728 for that.
as it's not really about the time between *.sh runs

It seems to me that the time between *.sh runs may be very relevant here.
The time between two launches of git-daemon is what might matter, but that's not affected by Rust-built binaries as these have already been built.
I do not mean immediately preceding test cases in the same run of the journey test script, but rather earlier runs of the journey tests in builds with different features.
On CI, within a single run of the same test-journey job, nontrivial cargo build commands that compile artifacts do run between separate invocations of journey.sh. The helper function that launches git-daemon is called at least once in each of the journey.sh runs that test at least one thing related to git://.
So the timing between two launches of git-daemon is sometimes affected by how long those build commands take. This will happen when one of the launches is the last one before a nontrivial build command and the other launch is the first one after that same nontrivial build command. (I don't know if it happens at other times, but I don't think so.)
Details follow.
The per-feature recipes
On CI, several different feature sets are tested by running the journey tests, in the same CI job, and artifacts are compiled in between the runs. There are four feature sets for which journey.sh is run separately and, ahead of each one, there are explicit cargo build commands:
justfile, lines 200 to 222 in 69ee6a3

# run journey tests (max)
journey-tests:
    cargo build --features http-client-curl-rustls
    cargo build -p gix-testtools --bin jtt
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} max

# run journey tests (max-pure)
journey-tests-pure:
    cargo build --no-default-features --features max-pure
    cargo build -p gix-testtools --bin jtt
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} max-pure

# run journey tests (small)
journey-tests-small:
    cargo build --no-default-features --features small
    cargo build -p gix-testtools
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} small

# run journey tests (lean-async)
journey-tests-async:
    cargo build --no-default-features --features lean-async
    cargo build -p gix-testtools
    ./tests/journey.sh {{ ein }} {{ gix }} {{ jtt }} async
All four of those build + journey.sh recipes are run in sequence (though in a different order) from the ci-journey-tests recipe, which the test-journey CI job runs. The build commands are non-trivial; they build artifacts, and sometimes download source code. Without caching, I believe they take longer.
Verification in logs
That git:// protocol journey tests are run, then later artifacts are compiled, then later more git:// protocol journey tests run, can be verified in the logs of CI runs that have journey tests enabled, such as this run in the last commit on main that ran journey tests.
Specifically, in this log, produced by viewing and saving the full log from that CI run, git:// protocol tests are logged at lines 1236, 1295, 1783, 1815, 2128, and 2187, while builds that report compiling at least one artifact are logged at lines 717, 1055, 1431, 1651, and 1919. (Searching for git:// and Compiling may be easier than checking those, though.)
A more useful view of the logs
The progression of the most relevant events is:
- A max-pure build.
- A ./tests/journey.sh run for max-pure, in which git-daemon is used in a gix remote refs test and again in a gix free pack receive test.
- A small build.
- A ./tests/journey.sh run for small, in which git-daemon is not used.
- A lean-async build.
- A ./tests/journey.sh run for lean-async, in which git-daemon is used in a gix remote refs test and a gix free pack receive test.
- An all default features plus http-client-curl-rustls build.
- A ./tests/journey.sh run for all default features plus http-client-curl-rustls, in which git-daemon is used in a gix remote refs test and a gix free pack receive test.
Within the individual steps listed above that run ./tests/journey.sh, nothing is built. But in between them, builds take place, and I suspect that the variation in how long those take is a factor that contributes to whatever strange behavior of git-daemon we are seeing here.
This progression also applies in failing runs
The log examined above is from a run that succeeded, so a bit more is needed to make the case that build times are plausibly relevant to failures, in the way that failures have actually occurred.
Consider the failing run from #1634, linked from #1634 (comment). This runs journey tests for the different feature sets in the same order (no changes to justfile have been made since somewhat before then), with the test of all default features plus http-client-curl-rustls coming last. GitHub Actions logs are eventually deleted, so just in case, here's a gist of that run's log. (It should not be confused with the above gist.)
The same progression is happening there: journey tests on builds with some feature sets run, including journey tests that run and use git-daemon, then other builds run, then the journey test of default features plus http-client-curl-rustls runs git-daemon and fails to communicate with it.
Interleaved builds cover some cache-eligible artifacts
I counted build steps as non-trivial above when they compiled at least one artifact. I did this to make clear that artifacts really do get built between runs of git-daemon in the journey tests, due to the way the journey tests are run in four sub-recipes defined in the justfile, each of which starts with cargo build commands.
But rust-cache avoids caching workspace crates (Swatinem/rust-cache#37 (comment), #1668 (comment)). So the kind of caching rust-cache provides is unlikely to have a significant effect on build times unless the artifacts that have to be rebuilt include artifacts for crates that are obtained from the registry rather than being part of this workspace.
It turns out they do. Probably most significantly, because it precedes the ./tests/journey.sh command in which failures occur in runs that have failures, the cargo build --features http-client-curl-rustls command builds 29 crates eligible for caching, as well as a few more interleaved with gix-* crates that are ineligible for caching. Of the crates eligible for caching that it builds, it even downloads 6 of them.
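For anyone wanting to re-check those counts, a sketch like the following should work; the throwaway target directory simulates the uncached case without disturbing the normal build output:

# Build into a temporary target dir and list what actually gets compiled or
# downloaded; the registry crates (non gix-*) in this list are the cache-eligible ones.
CARGO_TARGET_DIR=$(mktemp -d) cargo build --features http-client-curl-rustls 2>&1 |
  grep -E 'Compiling|Downloaded'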
(I think there may be ways other than timing that this affects networking on the test machine. They would also be very strange and unintuitive. But the problem usually or always happens on GNU/Linux, at least on CI, and does not usually happen on macOS, at least in CI tests of macOS, as mentioned in #1634 (comment). So the cause, whatever it is, is something unintuitive.)
Synthesis (not really a conclusion)
I do very much agree that some bug in git-daemon, or some problem related to it that is not specific to gitoxide, is probably at the root of this.
But the order in which cargo build and ./tests/journey.sh commands run, as well as their observed effects, shows that some delays between git-daemon runs are affected by how much work cargo build steps do. Furthermore, a cargo build step between the git daemon-using tests that always succeed, and those that have been observed to fail, has to download some cache-eligible crates and to build artifacts for them and a number of other cache-eligible crates, when rust-cache is omitted.
Both due to those builds and more generally, timing could, as far as I know, still be a significant trigger.
I should be able to test how plausible this is by splitting the tests into one job per feature-set sub-recipe, or by changing the order in which the ci-journey-tests recipe lists those recipes.
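A quick local version of the reordering experiment might look like this (just accepts several recipes on one command line; the recipe names are the ones from the justfile excerpt above):

# Run the four journey sub-recipes in a different order than ci-journey-tests uses,
# to see whether the failure follows the ordering or stays with a feature set.
just journey-tests-async journey-tests-small journey-tests-pure journey-tests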
Addendum 1: the 127.0.0.1 vs. ::1 connection (pun intended)
Although I didn't know until this PR that the workaround in #1725 would prove insufficient, I had hoped to find a better way, and while looking for one I tried forcing git-daemon to listen on the IPv6 ::1 rather than the IPv4 127.0.0.1. The most relevant of the changed code was this (where $port just expands to 9418):
Lines 77 to 82 in 387a106

git -c uploadpack.allowRefInWant=true \
  daemon --verbose --base-path=. --export-all --user-path --listen=::1 &>/dev/null &
daemon_pid=$!
while ! nc -z ::1 "$port"; do
  sleep 0.1
done
In #1634 (comment), I had made this partially inaccurate claim:

The reason I think--or, I should say, thought--gix is treating localhost to mean ::1, even when nc has treated it to mean 127.0.0.1, is that it looks like the problem went away when I added --listen=::1 to the git daemon command and modified the nc -z loop that follows it to connect to ::1. But I am not sure that was a correct observation, because when I do this now, it is taking forever, even on tests that had worked before with ::1: somehow, even where nc had reported connecting to ::1, it cannot do so when passed ::1.
I now believe that what was really happening is that whenever the problem connecting to git-daemon is going to occur, the change of forcing ::1 causes nc to block forever. Runs that succeeded before appear still to have succeeded, while the journey tests after cargo build --features http-client-curl-rustls have it block forever.
This is shown in the logs from the commit with the ::1 experiment (not to be confused with any other logs cited above). The "The operation was canceled" message documents automatic cancellation of the job after it blocked for 360 minutes.
One way to describe this is that forcing IPv6 causes the nc command, which is intended to ensure gix is able to connect to git-daemon, to fail when a gix command would fail. That is, if gix would not be able to use git-daemon, at least as tested, nc under the ::1 modifications seems to be doing its intended job of blocking--just, the condition it is waiting for never happens. Whether this ::1 experiment really illuminated anything meaningful about the cause of the problem, I am not sure.
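If one wanted to check directly which address family a localhost connection ends up using while the daemon listens only on ::1, a probe along these lines could help (assuming an nc that supports the -4/-6 flags, as the BSD/macOS one does):

# Force each address family in turn; with --listen=::1 the IPv4 probe should fail
# and the IPv6 probe should succeed, which tells us what "localhost" resolves to
# for the tool that ends up hanging.
nc -4 -z localhost 9418; echo "IPv4 probe exit code: $?"
nc -6 -z localhost 9418; echo "IPv6 probe exit code: $?"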
Addendum 2: rust-cache also affects code generation
This seems even weirder and less plausible, but maybe code in gitoxide crates or their dependencies really is the issue and the problem varies depending on whether intermediate artifacts are created.
The rust_cache action sets CARGO_INCREMENTAL=0. If unsafe code somewhere has undefined behavior and optimizations are turned on, then I think that could cause a bug due to UB to manifest. This seems unlikely for debug builds--such as those exercised in the journey tests. But some crates are configured to use opt-level = 3 even in debug builds.
(This is also something I should be able to test, by creating a situation where the tests run but the failures do not occur and then setting CARGO_INCREMENTAL=0 to see if it has any effect.)
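A minimal version of that experiment could be as simple as the following, which only mimics the CARGO_INCREMENTAL=0 part of what rust-cache sets and nothing else:

# Run the journey-tests recipe with incremental compilation disabled, as
# rust-cache's CARGO_INCREMENTAL=0 setting does, and compare the outcome.
CARGO_INCREMENTAL=0 just journey-tests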
Thanks a lot for the detailed analysis - stunning work!
Within the individual steps listed above that run ./tests/journey.sh, nothing is built. But in between them, builds take place, and I suspect that the variation in how long those take is a factor that contributes to whatever strange behavior of git-daemon we are seeing here.
As the removal of rust_cache has seemingly solved the problem, it very much felt like a timing issue, possibly also related to network use when downloading crates.
The rust_cache action sets CARGO_INCREMENTAL=0. If unsafe code somewhere has undefined behavior and optimizations are turned on, then I think that could cause a bug due to UB to manifest. This seems unlikely for debug builds--such as those exercised in the journey tests. But some crates are configured to use opt-level = 3 even in debug builds.
I definitely like this explanation more - that artefacts differ in meaningful ways when a cache is involved.
It still doesn't make much sense to me though, especially since most failures seem to stem from a trivial connection error, something that I'd trust to be 'correct'.
But the problem usually or always happens on GNU/Linux, at least on CI, and does not usually happen on macOS, at least in CI tests of macOS, [..]
These days it's easy for me to reproduce the issue, it's basically not working for me anymore at all, most of the time.
Something that changed is that it now errors due to connection reset, not to connection refused.
Error: Could not decode server reply
Caused by:
0: Failed to read from line reader
1: Connection reset by peer (os error 54)
It's also always the second time the git-daemon starts up when it fails.
Just for fun I added this patch:
diff --git a/tests/journey/gix.sh b/tests/journey/gix.sh
index a7acfc8f7..b79a7a7c0 100644
--- a/tests/journey/gix.sh
+++ b/tests/journey/gix.sh
@@ -280,6 +280,7 @@ title "gix commit-graph"
fi
(not_on_ci with "git:// protocol"
launch-git-daemon
+ sleep 10
(with "version 1"
(with "NO output directory"
(with "no wanted refs"
It still failed, so timing shouldn't be an issue. I could also see, while it was waiting, that git-daemon was up and running.
Inversely, putting the sleep 10 before launch-git-daemon allowed me to validate that the previous git-daemon wasn't running anymore.
Both invocations should be completely independent.
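If that independence ever needs to be enforced rather than just observed, a small guard along these lines could be called right before launch-git-daemon (a sketch using pgrep as in the earlier experiments; the helper name is made up):

function assert-no-stale-daemon() {
  # Refuse to proceed while a git-daemon from a previous test round is still around.
  if pgrep -f 'git[- ]daemon' >/dev/null; then
    echo "a stale git-daemon is still running" >&2
    return 1
  fi
}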
To see if it's related to remote refs or free pack receive, I replaced the free pack receive portion with a copy of remote refs. And that works.
A clear indication that the git-daemon is crashing or terminating the connection forcefully.
When replacing the remote refs part with a copy of the seemingly failing free pack receive part, it seems to work but hangs at this line:
Connection to localhost port 9418 [tcp/git] succeeded!
[with] version 1
[with] NO output directory
[with] no wanted refs
[it] generates the correct output
[with] wanted refs
[it] generates the correct output
[with] output directory
[it] generates the correct output
[with] version 2
[with] NO output directory
[with] NO wanted refs
[it] generates the correct output
[with] wanted refs
[it] generates the correct output
[when] ref does not exist
[it] fails with a detailed error message including what the server said
After cranking up debug-logging in the daemon I get this:
[it] fails with a detailed error message including what the server said
07:29:41.100591 run-command.c:655 trace: run_command: REMOTE_ADDR='[::1]' REMOTE_PORT=57324 /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon --serve --verbose --base-path=. --export-all --user-path
07:29:41.102904 exec-cmd.c:139 trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon
07:29:41.103190 exec-cmd.c:238 trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[16726] Connection from [::1]:57324
07:29:41.103327 pkt-line.c:80 packet: git< git-upload-pack /\0host=localhost\0\0version=2\0agent=git/oxide-0.68.0\0
[16726] Extended attribute "host": localhost
[16726] Extended attribute "protocol": version=2:agent=git/oxide-0.68.0
[16726] Request upload-pack for '/'
07:29:41.103487 run-command.c:655 trace: run_command: GIT_PROTOCOL=version=2:agent=git/oxide-0.68.0 git upload-pack --strict --timeout=0 .
[16726] 07:29:41.106433 exec-cmd.c:139 trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git
[16726] 07:29:41.106793 exec-cmd.c:238 trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[16726] 07:29:41.106960 git.c:460 trace: built-in: git upload-pack --strict --timeout=0 .
[16726] 07:29:41.107047 pkt-line.c:80 packet: upload-pack> version 2
[16726] 07:29:41.107065 pkt-line.c:80 packet: upload-pack> agent=git/2.39.5.(Apple.Git-154)
[16726] 07:29:41.107167 pkt-line.c:80 packet: upload-pack> ls-refs=unborn
[16726] 07:29:41.107189 pkt-line.c:80 packet: upload-pack> fetch=shallow wait-for-done ref-in-want
[16726] 07:29:41.107192 pkt-line.c:80 packet: upload-pack> server-option
[16726] 07:29:41.107194 pkt-line.c:80 packet: upload-pack> object-format=sha1
[16726] 07:29:41.107196 pkt-line.c:80 packet: upload-pack> object-info
[16726] 07:29:41.107198 pkt-line.c:80 packet: upload-pack> 0000
[16726] 07:29:41.107926 pkt-line.c:80 packet: upload-pack< command=ls-refs
[16726] 07:29:41.107929 pkt-line.c:80 packet: upload-pack< agent=git/oxide-0.68.0
[16726] 07:29:41.107931 pkt-line.c:80 packet: upload-pack< 0001
[16726] 07:29:41.107933 pkt-line.c:80 packet: upload-pack< symrefs
[16726] 07:29:41.107941 pkt-line.c:80 packet: upload-pack< peel
[16726] 07:29:41.107943 pkt-line.c:80 packet: upload-pack< unborn
[16726] 07:29:41.107945 pkt-line.c:80 packet: upload-pack< ref-prefix refs/heads/
[16726] 07:29:41.107947 pkt-line.c:80 packet: upload-pack< 0000
[16726] 07:29:41.108354 pkt-line.c:80 packet: upload-pack> ee3c97678e89db4eab7420b04aef51758359f152 refs/heads/dev
[16726] 07:29:41.108388 pkt-line.c:80 packet: upload-pack> 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6 refs/heads/main
[16726] 07:29:41.108391 pkt-line.c:80 packet: upload-pack> 0000
[16726] 07:29:41.108609 pkt-line.c:80 packet: upload-pack< command=fetch
[16726] 07:29:41.108612 pkt-line.c:80 packet: upload-pack< agent=git/oxide-0.68.0
[16726] 07:29:41.108613 pkt-line.c:80 packet: upload-pack< 0001
[16726] 07:29:41.108692 pkt-line.c:80 packet: upload-pack< thin-pack
[16726] 07:29:41.108694 pkt-line.c:80 packet: upload-pack< ofs-delta
[16726] 07:29:41.108696 pkt-line.c:80 packet: upload-pack< include-tag
[16726] 07:29:41.108698 pkt-line.c:80 packet: upload-pack< done
[16726] 07:29:41.108699 pkt-line.c:80 packet: upload-pack< 0000
The last invocation is a fetch request without any want <object-id> line, but the done line (as sent by gix) should indicate that the negotiation ended: send me the pack. But because nothing is wanted, it seems to just hang - it's likely a deadlock - gix would now try to read the response, but the upload-pack program isn't sending anything; apparently it is reading as well, as it's also hanging there.
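Purely as an illustration of the difference (not a working client): per the trace above, the request gix sent closed the arguments section without a single want, whereas a minimal well-formed protocol-v2 fetch body would contain at least one, e.g. using the refs/heads/main id from the ls-refs reply:

# Pkt-lines: 4 hex digits of length (including the 4 digits themselves), then the
# payload; 0001 is the section delimiter, 0000 the flush packet ending the request.
printf '0012command=fetch\n'                                      # command=fetch
printf '0001'                                                     # delimiter
printf '0032want 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6\n'      # at least one want
printf '0009done\n'                                               # end of negotiation
printf '0000'                                                     # flush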
When kill -9 is applied to the upload-pack process, the test just passes and it continues to the second round. There it failed - in this case - because the previous git-daemon didn't shut down.
I will investigate this further, as gix should probably not submit fetch requests without a single want.
Besides that, I don't think this will fix the actual issue, as thus far the first invocation of git-daemon still seems to work much better.
It remains a very elusive issue.
These days it's easy for me to reproduce the issue, it's basically not working for me anymore at all, most of the time.
Something that changed is that it now errors due to connection reset, not to connection refused.
I wonder if a recently updated dependency is a factor, and what happens if --locked is used on a system where the problem recently began to occur more often.
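A quick way to test that might be (a sketch; --locked makes cargo fail rather than silently re-resolve if Cargo.lock would need to change):

# Build strictly from the committed Cargo.lock, then rerun the failing journey
# recipe; if the failures disappear, a recently updated dependency is suspect.
cargo build --locked --features http-client-curl-rustls
just journey-tests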
I wonder also if a system (or Homebrew-provided) package is a factor. curl was recently updated on many systems, to fix a CVE (though I don't know how those specific changes could affect this).
After my recent trial session I don't even feel it's worth trying those fringes of possible causes as it left me with something very clear to pursue: a hang when a fetch is performed without a single 'want'.
I also learned that gix remote refs works twice, but once gix free pack receive is involved, it starts failing and resetting the connection. If it is run first, it hangs, and if it runs second, the first connection to the daemon is reset. It also became clear that the daemon is up and operational and that it deliberately resets the connection.
This would make sense as free pack receive changed the most; it now uses an entirely new and much more faithful protocol implementation, and one that never got tested with the pure git:// protocol. It's probably a legitimate bug.
Now that I am writing this I also realise that I can turn on packetline tracing on the client so it would be clear what it says to the server before it shuts down.
I hope I get to it today.
Let's see if this works.
The culprit seems to have been that free pack receive, albeit relying on gix-protocol, could still mess it up by forgetting to properly shut down the connection (by sending an EOF/flush packet), and it was able to request a pack without ever specifying an object to receive, causing a deadlock (upload-pack kept reading even though the client indicated it was done). Especially the latter sounds like an attack vector for a server, as it feels like a client could cause any amount of connections to be opened and held open, which is much more expensive for the server than it would be for the client.
If it was absolutely no effort, I'd definitely feel curious enough to try if it really does work like I think it would… and… I tried it, but via HTTPS I was only able to go into a negotiation loop. However, thinking about it, it's still trivial to hang the server forever by implementing the protocol incorrectly, so I suppose every server must protect against that. The bug I ran into locally I couldn't reproduce using HTTP though.
Ultimately, it was quite useful to have an inferior implementation of fetch as implemented by gix free pack receive, as it helped to make the underlying gix-protocol implementation more sturdy. It now incorporates more of what the gix::fetch implementation already checks.
This reverts commit bf2abdb, the supporting change in GitoxideLabs#1725 that refrained from caching build artifacts from the `test-journey` job, since that no longer needs to be avoided now that `git-daemon` journey tests aren't run on CI. In 9566488 (GitoxideLabs#1634), journey tests that use `git-daemon` were disabled on CI. GitoxideLabs#1725 tried reenabling them (65788e8) and fixing them by no longer caching build artifacts with the `rust-cache` action in the `test-journey` job (bf2abdb). This worked at the time, but the exact reason it worked or how reliable it would be were unknown. It stopped working shortly thereafter in 25b8480 (GitoxideLabs#1726), and those journey tests were re-disabled on CI in d1d3f7c, which reverted 65788e8. However, bf2abdb was not reverted at that time. (The change in d1d3f7c from GitoxideLabs#1726, taken together with this commit, effectively constitute a revert of PR GitoxideLabs#1725.)
Make adjustments needed to make the final integration into heartwood smoother.

Tasks

Notes

gix-negotiate is pulled in even though it's not used at all. Some types make this crate necessary though and I don't think it's worth trying to factor it out.