Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: create new name-hash algorithm #1785

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Sep 9, 2024

I've been focused recently on understanding and mitigating the growth of a few internal repositories. Some of these are growing much larger than expected for the number of contributors, and there are multiple aspects to why this growth is so large.

This is part of the RFC I submitted [1] [2] involving the path-walk API, though this doesn't use the path-walk API directly. In full repack cases, it seems that the --full-name-hash option gets nearly as good compression as the --path-walk option introduced in that series. I continue to work on that feature as well, so we can review it after this series is complete.

[1] #1786

[2] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

The main issue plaguing these repositories is that deltas are not being computed against objects that appear at the same path. While the size of these files at tip is one aspect of growth that would prevent this issue, the changes to these files are reasonable and should result in good delta compression. However, Git is not discovering the connections across different versions of the same file.

One way to find some improvement in these repositories is to increase the window size, which was an initial indicator that the delta compression could be improved, but was not a clear indicator. After some digging (and prototyping some analysis tools) the main discovery was that the current name-hash algorithm only considers the last 16 characters in the path name and has some naturally-occurring collisions within that scope.

This series introduces a new name-hash algorithm, but does not replace the existing one. There are cases, such as packing a single snapshot of a repository, where the existing algorithm outperforms the new one.

However, my findings show that when a repository has many versions of files at the same path (and especially when there are many name-hash collisions) then there are significant gains to be made using the new algorithm.

(This table is updated in v2 with even more private examples that were found while communicating findings internally.)

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |
| Repo E   |     100,957 MB  |            22,979 MB  |
| Repo F   |       8,308 MB  |               746 MB  |
| Repo G   |       4,329 MB  |             3,643 MB  |

I include Repo G here as an example where the improvement is less drastic, since this repo does not demonstrate a very high rate of name-hash collisions; the collisions that exist seem to be in paths that are not changed very often. Thus, the standard name-hash algorithm is nearly as effective in these full repacks.

The main change in this series is in patch 1, which adds the algorithm and the option to 'git pack-objects' and 'git repack'. The remaining patches are focused on creating more evidence around the value of the new name-hash algorithm and its effects on the packfiles created with it.

I will also try to make clear that I've been focused on client-side performance and size concerns. Based on discussions in v1, it appears that the following is true:

  • This feature is completely orthogonal to delta islands.
  • Changing the name-hash function can lead to compatibility issues with .bitmap files, as they store a name-hash value. Without augmenting the data structure to indicate which name-hash value was used at write time, the full-name-hash values should not be stored in the .bitmap files or used when reading .bitmap files and other objects. Thus, the full-name-hash is marked as incompatible with bitmaps for now.
  • The --path-walk option from the RFC ([1] and [2]) is likely incompatible with the delta-islands feature without significant work, so this suggests that the --full-name-hash is a better long-term solution for servers. This would still require the .bitmap modifications to make it work, but it's a smaller leap to get there.

Thanks, -Stolee

UPDATES IN V2

Thank you for all of the discussion on v1. Here are the things I've learned and how they have changed this patch series:

  • The test-tool name-hash helper change collides with the removal of test-tool oid-array, so I have rebased this series onto the latest master branch.
  • I learned (thanks, Taylor and Peff) that the .bitmap files store name-hash values. This means that the --full-name-hash option risks having different name-hash functions across bitmap reads and dynamic computations from the object walk. For this reason, the option is now made explicit to not work with bitmap walks. This could be corrected in the future with a modification to the .bitmap data structure to store a "name-hash version" value. This behavior is confirmed with a test.
  • A new test explicitly tests that git repack --full-name-hash passes the option to git pack-objects. This uses a new helper method, test_subcommand_flex that is more flexible than the existing test_subcommand.
  • To get more testing across different cases, add a new GIT_TEST_FULL_NAME_HASH environment variable. I point out which tests need modification when this is enabled.
  • The size-comparison tests were not portable with their use of du. Instead, use wc -c for the single pack-file that remains after a git repack -adf ... command.
  • The final patch still updates the test-tool name-hash helper, which was previously only used by a performance test. Now, use that test in a regular test to help guarantee that the functions do not change over time. This is directly related to the fact that these values can be stored in the .bitmap files and we need stable hash functions to keep compatibility with files written by previous versions of Git.

Other things that have happened include investigations into ways to adapt the full-name hash to improve upon the name-hash. I did some experimenting with increasing the size of 'struct object_entry' by using a 64-bit hash value (name-hash, then full-name-hash) for a single-pass compression or two 32-bit hash values for a two-pass compression process. I include my WIP branch at [3] to show what I tried, though the single-pass option did not present any improvements and the two-pass option seems to be broken to the point that the compression is substantially worse. (I'll try to elaborate on this in a reply to this cover letter.)

[3] derrickstolee/git@full-name...derrickstolee:git:full-name-wip

cc: gitster@pobox.com
cc: johannes.schindelin@gmx.de
cc: peff@peff.net
cc: ps@pks.im
cc: me@ttaylorr.com
cc: johncai86@gmail.com
cc: newren@gmail.com

@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Sep 9, 2024

Submitted as pull.1785.git.1725890210.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1785/derrickstolee/full-name-v1

To fetch this version to local tag pr-1785/derrickstolee/full-name-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1785/derrickstolee/full-name-v1

Copy link

gitgitgadget bot commented Sep 9, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/9/24 9:56 AM, Derrick Stolee via GitGitGadget wrote:

> However, my findings show that when a repository has many versions of files
> at the same path (and especially when there are many name-hash collisions)
> then there are significant gains to be made using the new algorithm.

Of course this table didn't render correctly. Here's a readable version:

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui |         438 MB  |               168 MB  |
| Repo B   |       6,255 MB  |               829 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |

Thanks,
-Stolee

@@ -15,7 +15,8 @@ SYNOPSIS
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is not meant to be cryptographic at all, but uniformly distributed
> across the possible hash values. This creates a hash that appears
> pseudorandom. There is no ability to consider similar file types as
> being close to each other.

Another consideration we had when designing the current mechanism,
which is more important than "compare .c files with each other", is
to handle the case where a file is moved across directory boundary
without changing its name.  These "hash collissions" are meant to be
a part of obtaining _good_ paring of blobs that ought to be similar
to each other.  In other words, we wanted them to collide so that we
do not have to be negatively affected by moves.

I am not saying that we should not update the pack name hash; I am
just saying that "consider similar file types" as if that is the
most important aspect of the current hash, is misleading.

Thanks.


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/9/24 1:45 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >> This is not meant to be cryptographic at all, but uniformly distributed
>> across the possible hash values. This creates a hash that appears
>> pseudorandom. There is no ability to consider similar file types as
>> being close to each other.
> > Another consideration we had when designing the current mechanism,
> which is more important than "compare .c files with each other", is
> to handle the case where a file is moved across directory boundary
> without changing its name.  These "hash collissions" are meant to be
> a part of obtaining _good_ paring of blobs that ought to be similar
> to each other.  In other words, we wanted them to collide so that we
> do not have to be negatively affected by moves.
> > I am not saying that we should not update the pack name hash; I am
> just saying that "consider similar file types" as if that is the
> most important aspect of the current hash, is misleading.

Thank you for this extra aspect, which has clarified some of my
thinking and I will use in future versions.

Thanks,
-Stolee

@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
SYNOPSIS
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
> index 28003f18c92..c4a15fd0cb8 100644
> --- a/t/t0450/txt-help-mismatches
> +++ b/t/t0450/txt-help-mismatches
> @@ -45,7 +45,6 @@ rebase
>  remote
>  remote-ext
>  remote-fd
> -repack
>  reset
>  restore
>  rev-parse

This is very much appreciated ;-)

Copy link

gitgitgadget bot commented Sep 9, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> One way to find some improvement in these repositories is to increase the
> window size, which was an initial indicator that the delta compression could
> be improved, but was not a clear indicator. After some digging (and
> prototyping some analysis tools) the main discovery was that the current
> name-hash algorithm only considers the last 16 characters in the path name
> and has some naturally-occurring collisions within that scope.

Yes, as I explained in the other message, this "collision" is an
integral part of the design to allow us gather candidates together
that may yield good deltas among them.  In addition, header files
whose names end with ".h" tend to share a bit comment at the
beginning of them in many projects, and the proximity (not
"collision") of the hash value is used to make them delta candidates
with each other.

I do agree that considering files at the same path from different
(but close-by) revisions as the prime candidates is very important,
but if you spread the "collissions" very thin by using "uniform
distribution", I am afraid that you'd end up discarding anything but
the blobs at the same path, which may go too far.  Having name hash
value that are close by no longer has any meaning in such a system.

I hope you can find a solution that strikes a good balance at the
end of the series (I saw only the first step), but I suspect an easy
way to avoid the downsides you observed is to use both.  Compare
with a handful of blobs taken from nearby commits (the original
object order is roughly in traversal order, and you can take
advantage of that fact) from exactly the same path (using your
"uniform distribution") before comparing with the blobs with close
value (of the current function) like the current implementation
does, may go a long way.

Copy link

gitgitgadget bot commented Sep 9, 2024

This patch series was integrated into seen via git@abd4999.

@gitgitgadget gitgitgadget bot added the seen label Sep 9, 2024
Copy link

gitgitgadget bot commented Sep 10, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/9/24 2:06 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >> One way to find some improvement in these repositories is to increase the
>> window size, which was an initial indicator that the delta compression could
>> be improved, but was not a clear indicator. After some digging (and
>> prototyping some analysis tools) the main discovery was that the current
>> name-hash algorithm only considers the last 16 characters in the path name
>> and has some naturally-occurring collisions within that scope.
> > Yes, as I explained in the other message, this "collision" is an
> integral part of the design to allow us gather candidates together
> that may yield good deltas among them.  In addition, header files
> whose names end with ".h" tend to share a bit comment at the
> beginning of them in many projects, and the proximity (not
> "collision") of the hash value is used to make them delta candidates
> with each other.
> > I do agree that considering files at the same path from different
> (but close-by) revisions as the prime candidates is very important,
> but if you spread the "collissions" very thin by using "uniform
> distribution", I am afraid that you'd end up discarding anything but
> the blobs at the same path, which may go too far.  Having name hash
> value that are close by no longer has any meaning in such a system.

You are right that some "nearby" paths are lost in this change, and
this can be measured by trying to use this option in the pack-objects
process underneath a small 'git push'.

The thing that surprised me is just how effective this is for the
creation of large pack-files that include many versions of most
files. The cross-path deltas have less of an effect here, and the
benefits of avoiding name-hash collisions can be overwhelming in
many cases.

> I hope you can find a solution that strikes a good balance at the
> end of the series (I saw only the first step), but I suspect an easy
> way to avoid the downsides you observed is to use both.  Compare
> with a handful of blobs taken from nearby commits (the original
> object order is roughly in traversal order, and you can take
> advantage of that fact) from exactly the same path (using your
> "uniform distribution") before comparing with the blobs with close
> value (of the current function) like the current implementation
> does, may go a long way.

Funny you should say that, since the RFC I finally submitted [1]
actually does just that. The --path-walk option changes the object
walk to consider batches of objects based on their path, computes
deltas among that batch, and then does the normal name-hash pass
later. This seems to really strike the balance that you are
looking for and solves the issues where small pushes need to stay
small. It also fixes some problematic cases even when pushing a
single commit.

[1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

However, the --path-walk option requires significant implementation
of a "path walk API" and my RFC doesn't even do threading right.
The --path-walk version (probably) doesn't work with delta islands
or other features the same way as the drop-in change to the name-
hash heuristic can. For that reason, I think there is likely some
long-term value to the --full-name-hash option even though the
--path-walk option will be better in many cases.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Sep 10, 2024

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
> > I do agree that considering files at the same path from different
> > (but close-by) revisions as the prime candidates is very important,
> > but if you spread the "collissions" very thin by using "uniform
> > distribution", I am afraid that you'd end up discarding anything but
> > the blobs at the same path, which may go too far.  Having name hash
> > value that are close by no longer has any meaning in such a system.
>
> You are right that some "nearby" paths are lost in this change, and
> this can be measured by trying to use this option in the pack-objects
> process underneath a small 'git push'.
>
> The thing that surprised me is just how effective this is for the
> creation of large pack-files that include many versions of most
> files. The cross-path deltas have less of an effect here, and the
> benefits of avoiding name-hash collisions can be overwhelming in
> many cases.

I think that Junio's suggestion is pretty interesting (though please
take my comments here with a grain of salt, since I haven't read the
other series yet, and am not sure how much of this is redundant).

Imagine computing both the full and existing name-hash values for each
blob/tree in the pack. Then objects would be sorted in the delta
selection window by similar full-name hash and similar regular name hash
values.

I'm not sure which value you'd actually record in the pack, though.
Ideally there is a hash function which captures some information about
the full path as well as the final path component, so we could use a
single value here, though I suspect the implementation would be more
complicated than what is presented here.

> > I hope you can find a solution that strikes a good balance at the
> > end of the series (I saw only the first step), but I suspect an easy
> > way to avoid the downsides you observed is to use both.  Compare
> > with a handful of blobs taken from nearby commits (the original
> > object order is roughly in traversal order, and you can take
> > advantage of that fact) from exactly the same path (using your
> > "uniform distribution") before comparing with the blobs with close
> > value (of the current function) like the current implementation
> > does, may go a long way.
>
> Funny you should say that, since the RFC I finally submitted [1]
> actually does just that. The --path-walk option changes the object
> walk to consider batches of objects based on their path, computes
> deltas among that batch, and then does the normal name-hash pass
> later. This seems to really strike the balance that you are
> looking for and solves the issues where small pushes need to stay
> small. It also fixes some problematic cases even when pushing a
> single commit.

Interesting.

> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

> However, the --path-walk option requires significant implementation
> of a "path walk API" and my RFC doesn't even do threading right.
> The --path-walk version (probably) doesn't work with delta islands
> or other features the same way as the drop-in change to the name-
> hash heuristic can. For that reason, I think there is likely some
> long-term value to the --full-name-hash option even though the
> --path-walk option will be better in many cases.

I suspect that this is going to be a significant sticking point. Not
supporting multi-threading is work-able for GitHub (since we set
pack.threads=1 today), but lacking support for delta-islands makes this
a non-starter to run at GitHub.

Do you imagine that the --path-walk option could be made to work with
delta islands? I'm not all that worried about who does that work, but
more interested at this point in whether or not it's even possible to
implement.

Thanks,
Taylor

Copy link

gitgitgadget bot commented Sep 10, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> The thing that surprised me is just how effective this is for the
> creation of large pack-files that include many versions of most
> files. The cross-path deltas have less of an effect here, and the
> benefits of avoiding name-hash collisions can be overwhelming in
> many cases.

Yes, "make sure we notice a file F moving from directory A to B" is
inherently optimized for short span of history, i.e. a smallish push
rather than a whole history clone, where the definition of
"smallish" is that even if you create optimal delta chains, the
length of these delta chains will not exceed the "--depth" option.

If the history you are pushing modified A/F twice, renamed it to B/F
(with or without modification at the same time), then modified B/F
twice more, you'd want to pack the 5-commit segment and having to
artificially cut the delta chain that can contain all of these 5
blobs into two at the renaming commit is a huge loss.

Compared to that, a whole history clone is a very different story,
as we will have to chomp delta chains at some depth anyway.  Before
the rename, it is reasonable to assume that A/F have evolved
incrementally for number of revisions, and after that rename it is
expected B/F will evolve incrementally for number of revisions
before it gets renamed again.  It is just the matter of choosing
where in that long stretch of content evolution we would cut the
delta chain, and the commit that renamed the path may just be a
good, if not absolute optimal, point.

So I do agree that this is an important case to optimize for.  At
some point, even when taking only the blobs at the same path as
delta base candidates, your true best base may be outside of the
--window in the list of candidates (sorted by size in decreasing
order), but at that point you would be increasing window to find
better delta base, not to skip unrelated blobs that happened to have
thrown into the same hash bucket due to the design that optimizes
for different case, so we can say that it is worth spending the
extra cycle and memory, if you need a larger window to gain even
better packing result.

> Funny you should say that, since the RFC I finally submitted [1]
> actually does just that. The --path-walk option changes the object
> walk to consider batches of objects based on their path, computes
> deltas among that batch, and then does the normal name-hash pass
> later. This seems to really strike the balance that you are
> looking for and solves the issues where small pushes need to stay
> small. It also fixes some problematic cases even when pushing a
> single commit.

;-).

Copy link

gitgitgadget bot commented Sep 10, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
>
> Yes, "make sure we notice a file F moving from directory A to B" is
> inherently optimized for short span of history, i.e. a smallish push
> rather than a whole history clone, where the definition of
> "smallish" is that even if you create optimal delta chains, the
> length of these delta chains will not exceed the "--depth" option.
>
> If the history you are pushing modified A/F twice, renamed it to B/F
> (with or without modification at the same time), then modified B/F
> twice more, you'd want to pack the 5-commit segment and having to
> artificially cut the delta chain that can contain all of these 5
> blobs into two at the renaming commit is a huge loss.

Which actually leads me to suspect that we probably do not even have
to expose the --full-name-hash option to the end users in "git repack".

If we are doing incremental that would fit within the depth setting,
it is likely that we would be better off without the full-name-hash
optimization, and if we are doing "repack -a" for the whole
repository, especially with "-f", it would make sense to do the
full-name-hash optimization.

If we can tell how large a chunk of history we are packing before we
actually start calling builtin/pack-objects.c:add_object_entry(), we
probably should be able to even select between with and without
full-name-hash automatically, but I do not think we know the object
count before we finish calling add_object_entry(), so unless we are
willing to compute and keep both while reading and pick between the
two after we finish reading the list of objects, or something, it
will require a major surgery to do so, I am afraid.

Copy link

gitgitgadget bot commented Sep 10, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/10/24 10:56 AM, Taylor Blau wrote:
> On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
>>> I do agree that considering files at the same path from different
>>> (but close-by) revisions as the prime candidates is very important,
>>> but if you spread the "collissions" very thin by using "uniform
>>> distribution", I am afraid that you'd end up discarding anything but
>>> the blobs at the same path, which may go too far.  Having name hash
>>> value that are close by no longer has any meaning in such a system.
>>
>> You are right that some "nearby" paths are lost in this change, and
>> this can be measured by trying to use this option in the pack-objects
>> process underneath a small 'git push'.
>>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
> > I think that Junio's suggestion is pretty interesting (though please
> take my comments here with a grain of salt, since I haven't read the
> other series yet, and am not sure how much of this is redundant).
> > Imagine computing both the full and existing name-hash values for each
> blob/tree in the pack. Then objects would be sorted in the delta
> selection window by similar full-name hash and similar regular name hash
> values.
> > I'm not sure which value you'd actually record in the pack, though.
> Ideally there is a hash function which captures some information about
> the full path as well as the final path component, so we could use a
> single value here, though I suspect the implementation would be more
> complicated than what is presented here.

Is the name hash stored in the pack itself? I know that it is stored
in the 'struct object_entry' data in the packing data. While we could
add another uint32_t into that struct to store both hash values, this
would increase the memory requirements of repacking by four bytes per
object. The struct seemed to be very clear about trying as hard as
possible to avoid doing that.

But maybe an alternative could be replacing that 32-bit number with
an index into an array of paths that have their hash values stored
there.

This would still involve two passes, but might still be possible. I'll
think on this.

>>> I hope you can find a solution that strikes a good balance at the
>>> end of the series (I saw only the first step), but I suspect an easy
>>> way to avoid the downsides you observed is to use both.  Compare
>>> with a handful of blobs taken from nearby commits (the original
>>> object order is roughly in traversal order, and you can take
>>> advantage of that fact) from exactly the same path (using your
>>> "uniform distribution") before comparing with the blobs with close
>>> value (of the current function) like the current implementation
>>> does, may go a long way.
>>
>> Funny you should say that, since the RFC I finally submitted [1]
>> actually does just that. The --path-walk option changes the object
>> walk to consider batches of objects based on their path, computes
>> deltas among that batch, and then does the normal name-hash pass
>> later. This seems to really strike the balance that you are
>> looking for and solves the issues where small pushes need to stay
>> small. It also fixes some problematic cases even when pushing a
>> single commit.
> > Interesting.
> >> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
> >> However, the --path-walk option requires significant implementation
>> of a "path walk API" and my RFC doesn't even do threading right.
>> The --path-walk version (probably) doesn't work with delta islands
>> or other features the same way as the drop-in change to the name-
>> hash heuristic can. For that reason, I think there is likely some
>> long-term value to the --full-name-hash option even though the
>> --path-walk option will be better in many cases.
> > I suspect that this is going to be a significant sticking point. Not
> supporting multi-threading is work-able for GitHub (since we set
> pack.threads=1 today), but lacking support for delta-islands makes this
> a non-starter to run at GitHub.
> > Do you imagine that the --path-walk option could be made to work with
> delta islands? I'm not all that worried about who does that work, but
> more interested at this point in whether or not it's even possible to
> implement.
This is part of the reason why I think the --full-name-hash option is
an interesting consideration. It doesn't have any obvious reason why
it couldn't work with features like delta islands, so it may provide
some quick wins in "large enough" repositories, or at least "large in
the right way".

I unfortunately don't know enough about how the delta islands feature
works to be confident in the possibility of integrating it with the
--path-walk option. At minimum, it would require two object walks:
the first would mark the objects and the second would do the delta
compression with those markings in mind.

But if there is a way to combine both approaches with a two-pass
delta compression technique, then this may be all avoided. I'll give
it a try.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Sep 10, 2024

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/10/24 4:36 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> >> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> The thing that surprised me is just how effective this is for the
>>> creation of large pack-files that include many versions of most
>>> files. The cross-path deltas have less of an effect here, and the
>>> benefits of avoiding name-hash collisions can be overwhelming in
>>> many cases.
>>
>> Yes, "make sure we notice a file F moving from directory A to B" is
>> inherently optimized for short span of history, i.e. a smallish push
>> rather than a whole history clone, where the definition of
>> "smallish" is that even if you create optimal delta chains, the
>> length of these delta chains will not exceed the "--depth" option.
>>
>> If the history you are pushing modified A/F twice, renamed it to B/F
>> (with or without modification at the same time), then modified B/F
>> twice more, you'd want to pack the 5-commit segment and having to
>> artificially cut the delta chain that can contain all of these 5
>> blobs into two at the renaming commit is a huge loss.
> > Which actually leads me to suspect that we probably do not even have
> to expose the --full-name-hash option to the end users in "git repack".
> > If we are doing incremental that would fit within the depth setting,
> it is likely that we would be better off without the full-name-hash
> optimization, and if we are doing "repack -a" for the whole
> repository, especially with "-f", it would make sense to do the
> full-name-hash optimization.

Depending on how much we learn from others testing the --full-name-hash
option, I could see the potential that -a could imply --full-name-hash.
I hesitate to introduce that in the first release with this option,
though.

> If we can tell how large a chunk of history we are packing before we
> actually start calling builtin/pack-objects.c:add_object_entry(), we
> probably should be able to even select between with and without
> full-name-hash automatically, but I do not think we know the object
> count before we finish calling add_object_entry(), so unless we are
> willing to compute and keep both while reading and pick between the
> two after we finish reading the list of objects, or something, it
> will require a major surgery to do so, I am afraid.

It's also possible that we could check the list of paths at HEAD to
see how many collisions the default name-hash gives. In cases like
the Git repository, there are very few collisions and thus we don't
need to use --full-name-hash. Restricting to just HEAD (or the
default ref) is not a complete analysis, but might be a good
heuristic.

Thanks,
-Stolee

Copy link

gitgitgadget bot commented Sep 10, 2024

This patch series was integrated into seen via git@0e183fd.

Copy link

gitgitgadget bot commented Sep 11, 2024

There are issues in commit ab5a3e5:
pack-objects: use 64-bit name hash
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

Copy link

gitgitgadget bot commented Sep 11, 2024

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Sep 10, 2024 at 05:05:09PM -0400, Derrick Stolee wrote:

> > I'm not sure which value you'd actually record in the pack, though.
> > Ideally there is a hash function which captures some information about
> > the full path as well as the final path component, so we could use a
> > single value here, though I suspect the implementation would be more
> > complicated than what is presented here.
> 
> Is the name hash stored in the pack itself? I know that it is stored
> in the 'struct object_entry' data in the packing data. While we could
> add another uint32_t into that struct to store both hash values, this
> would increase the memory requirements of repacking by four bytes per
> object. The struct seemed to be very clear about trying as hard as
> possible to avoid doing that.

It's stored in the .bitmap files, since otherwise a pack-objects which
uses bitmaps to serve a fetch would have no clue of their path names.
See the "HASH_CACHE" bitmap extension.

You generally don't want to make deltas out of two entries in the bitmap
(they're already in the same pack, so we'd usually skip them), but you
do want to consider making on-the-fly deltas against other objects.

I guess we may also consider deltas between objects in two packs that
are both covered by the same midx bitmap.

> But maybe an alternative could be replacing that 32-bit number with
> an index into an array of paths that have their hash values stored
> there.

Yes, that would work, though how big is that path array going to be?
Uncompressed linux.git is probably 3-4MB, which actually doesn't sound
_too_ bad. You could obviously go a long way with prefix compression,
too.

But if I understand the proposal, it is just replacing one 32-bit hash
with another. You could just store that in the bitmap instead (or if the
direction is to use both, introduce a new extension to store both).
Obviously you'll get lousy results if the bitmap reader does not use the
same algorithm for its non-bitmap objects, but I don't think this is
something you'd be flipping back and forth on.

> This is part of the reason why I think the --full-name-hash option is
> an interesting consideration. It doesn't have any obvious reason why
> it couldn't work with features like delta islands, so it may provide
> some quick wins in "large enough" repositories, or at least "large in
> the right way".
> 
> I unfortunately don't know enough about how the delta islands feature
> works to be confident in the possibility of integrating it with the
> --path-walk option. At minimum, it would require two object walks:
> the first would mark the objects and the second would do the delta
> compression with those markings in mind.

The delta islands code already does its own tree walk to propagate the
bits down (it does rely on the base walk's show_commit() to propagate
through the commits).

Once each object has its island bitmaps, I think however you choose to
come up with delta candidates (whether the current type/size/namehash
sorted list, or some path walking), you should be able to use it. It's
fundamentally just answering the question of "am I allowed to delta
between these two objects".

Of course the devil may be in the details. ;)

-Peff

Copy link

gitgitgadget bot commented Sep 11, 2024

This patch series was integrated into seen via git@1205ee7.

Copy link

gitgitgadget bot commented Sep 11, 2024

This patch series was integrated into seen via git@e8c1936.

Copy link

gitgitgadget bot commented Sep 11, 2024

This patch series was integrated into seen via git@3c327e2.

Copy link

gitgitgadget bot commented Sep 12, 2024

There are issues in commit ab5a3e5:
pack-objects: use 64-bit name hash
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

Copy link

gitgitgadget bot commented Sep 12, 2024

This patch series was integrated into seen via git@9ec06b9.

Copy link

gitgitgadget bot commented Sep 13, 2024

This patch series was integrated into seen via git@7369282.

Copy link

gitgitgadget bot commented Sep 13, 2024

This patch series was integrated into seen via git@0afba8c.

Copy link

gitgitgadget bot commented Sep 14, 2024

This patch series was integrated into seen via git@93a2055.

Copy link

gitgitgadget bot commented Sep 16, 2024

This patch series was integrated into seen via git@d3805ab.

Copy link

gitgitgadget bot commented Sep 17, 2024

This patch series was integrated into seen via git@6aca863.

Copy link

gitgitgadget bot commented Sep 26, 2024

This patch series was integrated into seen via git@bbedfd0.

Copy link

gitgitgadget bot commented Sep 26, 2024

There was a status update in the "Cooking" section about the branch ds/pack-name-hash-tweak on the Git mailing list:

In a repository with too many (more than --window size) similarly
named files, "git repack" would not find good delta-base candidate
and worse, it may not use a blob from exactly the same path as a
good delta-base candidate.  Optionally replace the name hash so
that only blobs at the same path and nothing else are used as
delta-base candidate.

On hold.
cf. <34346998-deac-4e1f-9d5f-218f664e9e08@gmail.com>
source: <pull.1785.v2.git.1726692381.gitgitgadget@gmail.com>

git-for-windows-ci pushed a commit to git-for-windows/git that referenced this pull request Sep 26, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit to git-for-windows/git that referenced this pull request Sep 26, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit to git-for-windows/git that referenced this pull request Sep 26, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit to git-for-windows/git that referenced this pull request Sep 26, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit to git-for-windows/git that referenced this pull request Sep 26, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit to git-for-windows/git that referenced this pull request Sep 26, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
dscho added a commit to git-for-windows/git that referenced this pull request Sep 27, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
Copy link

gitgitgadget bot commented Sep 27, 2024

This patch series was integrated into seen via git@135921b.

Copy link

gitgitgadget bot commented Sep 27, 2024

This patch series was integrated into seen via git@9ebb56e.

Copy link

gitgitgadget bot commented Sep 30, 2024

There was a status update in the "Cooking" section about the branch ds/pack-name-hash-tweak on the Git mailing list:

In a repository with too many (more than --window size) similarly
named files, "git repack" would not find good delta-base candidate
and worse, it may not use a blob from exactly the same path as a
good delta-base candidate.  Optionally replace the name hash so
that only blobs at the same path and nothing else are used as
delta-base candidate.

On hold.
cf. <34346998-deac-4e1f-9d5f-218f664e9e08@gmail.com>
source: <pull.1785.v2.git.1726692381.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Oct 1, 2024

This patch series was integrated into seen via git@1b03dde.

Copy link

gitgitgadget bot commented Oct 1, 2024

There was a status update in the "Cooking" section about the branch ds/pack-name-hash-tweak on the Git mailing list:

In a repository with too many (more than --window size) similarly
named files, "git repack" would not find good delta-base candidate
and worse, it may not use a blob from exactly the same path as a
good delta-base candidate.  Optionally replace the name hash so
that only blobs at the same path and nothing else are used as
delta-base candidate.

On hold.
cf. <34346998-deac-4e1f-9d5f-218f664e9e08@gmail.com>
source: <pull.1785.v2.git.1726692381.gitgitgadget@gmail.com>

Copy link

gitgitgadget bot commented Oct 1, 2024

This patch series was integrated into seen via git@bb33886.

Copy link

gitgitgadget bot commented Oct 2, 2024

This patch series was integrated into seen via git@637f115.

Copy link

gitgitgadget bot commented Oct 2, 2024

There was a status update in the "Cooking" section about the branch ds/pack-name-hash-tweak on the Git mailing list:

In a repository with too many (more than --window size) similarly
named files, "git repack" would not find good delta-base candidate
and worse, it may not use a blob from exactly the same path as a
good delta-base candidate.  Optionally replace the name hash so
that only blobs at the same path and nothing else are used as
delta-base candidate.

On hold.
cf. <34346998-deac-4e1f-9d5f-218f664e9e08@gmail.com>
source: <pull.1785.v2.git.1726692381.gitgitgadget@gmail.com>

dscho added a commit to dscho/git that referenced this pull request Oct 3, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
Copy link

gitgitgadget bot commented Oct 3, 2024

This patch series was integrated into seen via git@e4c3466.

Copy link

gitgitgadget bot commented Oct 3, 2024

This patch series was integrated into seen via git@b65a280.

dscho added a commit to dscho/git that referenced this pull request Oct 4, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
git-for-windows-ci pushed a commit to git-for-windows/git that referenced this pull request Oct 4, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
git-for-windows-ci pushed a commit to git-for-windows/git that referenced this pull request Oct 4, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
git-for-windows-ci pushed a commit to git-for-windows/git that referenced this pull request Oct 4, 2024
This is an updated version of gitgitgadget#1785, intended for early
consumption into Git for Windows.

The idea here is to add a new `--full-name-hash` option to `git
pack-objects` and `git repack`. This adjusts the name-hash value used
for finding delta bases in such a way that uses the full path name with
a lower likelihood of collisions than the default name-hash algorithm.
In many repositories with name-hash collisions and many versions of
those paths, this can significantly reduce the size of a full repack. It
can also help in certain cases of `git push`, but only if the pack is
already artificially inflated by name-hash collisions; cases that find
"sibling" deltas as better choices become worse with `--full-name-hash`.

Thus, this option is currently recommended for full repacks of large
repos, and on client machines without reachability bitmaps.

Some care is taken to ignore this option when using bitmaps, either
writing bitmaps or using a bitmap walk during reads. The bitmap file
format contains name-hash values, but no way to indicate which function
is used, so compatibility is a concern for bitmaps. Future work could
explore this idea.

After this PR is merged, then the more-involved `--path-walk` option may
be considered.
Copy link

gitgitgadget bot commented Oct 4, 2024

This patch series was integrated into seen via git@7826487.

Copy link

gitgitgadget bot commented Oct 4, 2024

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> I've been focused recently on understanding and mitigating the growth of a
> few internal repositories. Some of these are growing much larger than
> expected for the number of contributors, and there are multiple aspects to
> why this growth is so large.
> ...
> Derrick Stolee (6):
>   pack-objects: add --full-name-hash option
>   repack: test --full-name-hash option
>   pack-objects: add GIT_TEST_FULL_NAME_HASH
>   git-repack: update usage to match docs
>   p5313: add size comparison test
>   test-tool: add helper for name-hash values

Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g.

  https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886

shows that t5310 and t5334 are having problems.

I randomly picked this topic and ejected it out of 'seen', and the
result is fairing a bit better. t5310 that failed 228/228 subtests
in the above run is now passing.  I didn't run this topic alone
under the leak checker, so it is entirely possible that there are
some unfortunate interactions with other topics in flight.

t5334 still fails 8/16 subtests just like the above run, exonerating
this topic for its breakage.

  https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880

Copy link

gitgitgadget bot commented Oct 4, 2024

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote:
> t5334 still fails 8/16 subtests just like the above run, exonerating
> this topic for its breakage.
>
>   https://github.com/git/git/actions/runs/11186737635/job/31102378478#step:4:1880

This is not all too surprising, since part two of the incremental MIDX
topic introduces some new leaks which were not seen by Patrick's recent
leakfixes. So I expect that this broke with 9d4855eef3 (midx-write: fix
leaking buffer, 2024-09-30), which marked t5534 as leak-free.

And that's true, without the patches from tb/incremental-midx-part-2.
The leak stems from the fact that the 'bitmap_index' struct now has a
new optional "base" pointer, which points to another 'bitmap_index'
corresponding to the previous MIDX layer.

The fix is fairly straightforward, and should be limited to:

--- 8< ---
Subject: [PATCH] fixup! pack-bitmap.c: open and store incremental bitmap
 layers

The incremental MIDX bitmap work was done prior to 9d4855eef3
(midx-write: fix leaking buffer, 2024-09-30), and causes test failures
in t5334 in a post-9d4855eef3 world.

The leak looks like:

    Direct leak of 264 byte(s) in 1 object(s) allocated from:
        #0 0x7f6bcd87eaca in calloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:90
        #1 0x55ad1428e8a4 in xcalloc wrapper.c:151
        #2 0x55ad14199e16 in prepare_midx_bitmap_git pack-bitmap.c:742
        #3 0x55ad14199447 in open_midx_bitmap_1 pack-bitmap.c:507
        #4 0x55ad14199cca in open_midx_bitmap pack-bitmap.c:704
        #5 0x55ad14199d44 in open_bitmap pack-bitmap.c:717
        #6 0x55ad14199dc2 in prepare_bitmap_git pack-bitmap.c:733
        #7 0x55ad1419e496 in test_bitmap_walk pack-bitmap.c:2698
        #8 0x55ad14047b0b in cmd_rev_list builtin/rev-list.c:629
        #9 0x55ad13f71cd6 in run_builtin git.c:487
        #10 0x55ad13f72132 in handle_builtin git.c:756
        #11 0x55ad13f72380 in run_argv git.c:826
        #12 0x55ad13f728f4 in cmd_main git.c:961
        #13 0x55ad1407d3ae in main common-main.c:64
        #14 0x7f6bcd5f0c89 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
        #15 0x7f6bcd5f0d44 in __libc_start_main_impl ../csu/libc-start.c:360
        #16 0x55ad13f6ff90 in _start (git+0x1ef90) (BuildId: 3e63cdd415f1d185b21da3035cb48332510dddce)

, and is a result of us not freeing the resources corresponding to the
bitmap's base layer, if one was present.

Rectify that leak by calling the newly-introduced free_bitmap_index()
function on the base layer to ensure that its resources are also freed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 2b4c3972e5..fe0df2b6c3 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -3045,6 +3045,7 @@ void free_bitmap_index(struct bitmap_index *b)
 		close_midx_revindex(b->midx);
 	}
 	free_pseudo_merge_map(&b->pseudo_merges);
+	free_bitmap_index(b->base);
 	free(b);
 }

--
2.47.0.rc1.154.ge6e38f19f35.dirty
--- >8 ---

That should apply on top of 'seen' easily, and will at least fix the
leak failures you pointed out here.

When the integration branches are rewound after 2.47.0 is tagged, I'll
send a new version of this topic based on Patrick's work here.

Thanks,
Taylor

Copy link

gitgitgadget bot commented Oct 4, 2024

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Oct 04, 2024 at 02:17:30PM -0700, Junio C Hamano wrote:

> > Derrick Stolee (6):
> >   pack-objects: add --full-name-hash option
> >   repack: test --full-name-hash option
> >   pack-objects: add GIT_TEST_FULL_NAME_HASH
> >   git-repack: update usage to match docs
> >   p5313: add size comparison test
> >   test-tool: add helper for name-hash values
> 
> Recent CI jobs on 'seen' has been failing the leak-check jobs, e.g.
> 
>   https://github.com/git/git/actions/runs/11184876759/job/31096601111#step:4:1886
> 
> shows that t5310 and t5334 are having problems.
> 
> I randomly picked this topic and ejected it out of 'seen', and the
> result is fairing a bit better. t5310 that failed 228/228 subtests
> in the above run is now passing.  I didn't run this topic alone
> under the leak checker, so it is entirely possible that there are
> some unfortunate interactions with other topics in flight.

Maybe squash this into the final patch of that series?

diff --git a/t/helper/test-name-hash.c b/t/helper/test-name-hash.c
index 15fb8f853c..e4ecd159b7 100644
--- a/t/helper/test-name-hash.c
+++ b/t/helper/test-name-hash.c
@@ -19,5 +19,6 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
 		printf("%10"PRIu32"\t%10"PRIu32"\t%s\n", name_hash, full_hash, line.buf);
 	}
 
+	strbuf_release(&line);
 	return 0;
 }

That seems to be enough to clear t5310 on "seen". It was not noticed in
the original topic because t5310 was not yet marked as leak-free in its
base. That happened in fa016423c7 (revision: fix leaking parents when
simplifying commits, 2024-09-26)

-Peff

Copy link

gitgitgadget bot commented Oct 5, 2024

This patch series was integrated into seen via git@3ffe012.

Copy link

gitgitgadget bot commented Oct 5, 2024

There was a status update in the "Cooking" section about the branch ds/pack-name-hash-tweak on the Git mailing list:

In a repository with too many (more than --window size) similarly
named files, "git repack" would not find good delta-base candidate
and worse, it may not use a blob from exactly the same path as a
good delta-base candidate.  Optionally replace the name hash so
that only blobs at the same path and nothing else are used as
delta-base candidate.

On hold.
cf. <34346998-deac-4e1f-9d5f-218f664e9e08@gmail.com>
source: <pull.1785.v2.git.1726692381.gitgitgadget@gmail.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant