pack-objects filtering #2

jeffhostetler · 2017-09-07T16:40:10Z

No description provided.

Teach pack-objects to use the filtering provided by the traverse_commit_list_filtered() interface to omit unwanted objects from the resulting packfile. This feature is intended for partial clone/fetch. Filtering requires the use of the "--stdout" option. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>

Add help text for new object filtering options to pack-objects documentation. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>

This is to address concerns raised by ThreadSanitizer on the mailing list about threaded unprotected R/W access to map.size with my previous "disallow rehash" change (0607e10). See: https://public-inbox.org/git/adb37b70139fd1e2bac18bfd22c8b96683ae18eb.1502780344.git.martin.agren@gmail.com/ Add API to hashmap to disable item counting and thus automatic rehashing. Also include API to later re-enable them. When item counting is disabled, the map.size field is invalid. So to prevent accidents, the field has been renamed and an accessor function hashmap_get_size() has been added. All direct references to this field have been been updated. And the name of the field changed to map.private_size to communicate this. Here is the relevant output from ThreadSanitizer showing the problem: WARNING: ThreadSanitizer: data race (pid=10554) Read of size 4 at 0x00000082d488 by thread T2 (mutexes: write M16): #0 hashmap_add hashmap.c:209 #1 hash_dir_entry_with_parent_and_prefix name-hash.c:302 #2 handle_range_dir name-hash.c:347 #3 handle_range_1 name-hash.c:415 #4 lazy_dir_thread_proc name-hash.c:471 #5 <null> <null> Previous write of size 4 at 0x00000082d488 by thread T1 (mutexes: write M31): #0 hashmap_add hashmap.c:209 #1 hash_dir_entry_with_parent_and_prefix name-hash.c:302 #2 handle_range_dir name-hash.c:347 #3 handle_range_1 name-hash.c:415 #4 handle_range_dir name-hash.c:380 #5 handle_range_1 name-hash.c:415 #6 lazy_dir_thread_proc name-hash.c:471 #7 <null> <null> Martin gives instructions for running TSan on test t3008 in this post: https://public-inbox.org/git/CAN0heSoJDL9pWELD6ciLTmWf-a=oyxe4EXXOmCKvsG5MSuzxsA@mail.gmail.com/ Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The function ce_write_entry() uses a 'self-initialised' variable construct, for the symbol 'saved_namelen', to suppress a gcc '-Wmaybe-uninitialized' warning, given that the warning is a false positive. For the purposes of this discussion, the ce_write_entry() function has three code blocks of interest, that look like so: /* block #1 */ if (ce->ce_flags & CE_STRIP_NAME) { saved_namelen = ce_namelen(ce); ce->ce_namelen = 0; } /* block #2 */ /* * several code blocks that contain, among others, calls * to copy_cache_entry_to_ondisk(ondisk, ce); */ /* block #3 */ if (ce->ce_flags & CE_STRIP_NAME) { ce->ce_namelen = saved_namelen; ce->ce_flags &= ~CE_STRIP_NAME; } The warning implies that gcc thinks it is possible that the first block is not entered, the calls to copy_cache_entry_to_ondisk() could toggle the CE_STRIP_NAME flag on, thereby entering block #3 with saved_namelen unset. However, the copy_cache_entry_to_ondisk() function does not write to ce->ce_flags (it only reads). gcc could easily determine this, since that function is local to this file, but it obviously doesn't. In order to suppress this warning, we make it clear to the reader (human and compiler), that block #3 will only be entered when the first block has been entered, by introducing a new 'stripped_name' boolean variable. We also take the opportunity to change the type of 'saved_namelen' to 'unsigned int' to match ce->ce_namelen. Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

…ev() In 'builtin/name-rev.c' in the name_rev() function there is a loop iterating over all parents of the given commit, and the loop body looks like this: if (parent_number > 1) { if (generation > 0) // branch #1 new_name = ... else // branch #2 new_name = ... name_rev(parent, new_name, ...); } else { // branch #3 name_rev(...); } These conditions are not covered properly in the test suite. As far as purely test coverage goes, they are all executed several times over in 't6120-describe.sh'. However, they don't directly influence the command's output, because the repository used in that test script contains several branches and tags pointing somewhere into the middle of the commit DAG, and thus result in a better name for the to-be-named commit. This can hide bugs: e.g. by replacing the 'new_name' parameter of the first recursive name_rev() call with 'tip_name' (effectively making both branch #1 and #2 a noop) 'git name-rev --all' shows thousands of bogus names in the Git repository, but the whole test suite still passes successfully. In an early version of a later patch in this series I managed to mess up all three branches (at once!), but the test suite still passed. So add a new test case that operates on the following history: A--------------master \ / \----------M2 \ / \---M1-C \ / B and names the commit 'B' to make sure that all three branches are crucial to determine 'B's name: - There is only a single ref, so all names are based on 'master', without any undesired interference from other refs. - Each time name_rev() follows the second parent of a merge commit, it appends "^2" to the name. Following 'master's second parent right at the start ensures that all commits on the ancestry path from 'master' to 'B' have a different base name from the original 'tip_name' of the very first name_rev() invocation. Currently, while name_rev() is recursive, it doesn't matter, but it will be necessary to properly cover all three branches after the recursion is eliminated later in this series. - Following 'M2's second parent makes sure that branch #2 (i.e. when 'generation = 0') affects 'B's name. - Following the only parent of the non-merge commit 'C' ensures that branch #3 affects 'B's name, and that it increments 'generation'. - Coming from 'C' 'generation' is 1, thus following 'M1's second parent makes sure that branch #1 affects 'B's name. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Recent versions of the gcc and clang Address Sanitizer produce test failures related to regexec(). This triggers with gcc-10 and clang-8 (but not gcc-9 nor clang-7). Running: make CC=gcc-10 SANITIZE=address test results in failures in t4018, t3206, and t4062. The cause seems to be that when built with ASan, we use a different version of regexec() than normal. And this version doesn't understand the REG_STARTEND flag. Here's my evidence supporting that. The failure in t4062 is an ASan warning: expecting success of 4062.2 '-G matches': git diff --name-only -G "^(0{64}){64}$" HEAD^ >out && test 4096-zeroes.txt = "$(cat out)" ================================================================= ==672994==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7fa76f672000 at pc 0x7fa7726f75b6 bp 0x7ffe41bdda70 sp 0x7ffe41bdd220 READ of size 4097 at 0x7fa76f672000 thread T0 #0 0x7fa7726f75b5 (/lib/x86_64-linux-gnu/libasan.so.6+0x4f5b5) #1 0x562ae0c9c40e in regexec_buf /home/peff/compile/git/git-compat-util.h:1117 #2 0x562ae0c9c40e in diff_grep /home/peff/compile/git/diffcore-pickaxe.c:52 #3 0x562ae0c9cc28 in pickaxe_match /home/peff/compile/git/diffcore-pickaxe.c:166 [...] In this case we're looking in a buffer which was mmap'd via reuse_worktree_file(), and whose size is 4096 bytes. But libasan's regex tries to look at byte 4097 anyway! If we tweak Git like this: diff --git a/diff.c b/diff.c index 8e2914c031..cfae60c120 100644 --- a/diff.c +++ b/diff.c @@ -3880,7 +3880,7 @@ static int reuse_worktree_file(struct index_state *istate, */ if (ce_uptodate(ce) || (!lstat(name, &st) && !ie_match_stat(istate, ce, &st, 0))) - return 1; + return 0; return 0; } to use a regular buffer (with a trailing NUL) instead of an mmap, then the complaint goes away. The other failures are actually diff output with an incorrect funcname header. If I instrument xdiff to show the funcname matching like so: diff --git a/xdiff-interface.c b/xdiff-interface.c index 8509f9ea22..f6c3dc1986 100644 --- a/xdiff-interface.c +++ b/xdiff-interface.c @@ -197,6 +197,7 @@ struct ff_regs { struct ff_reg { regex_t re; int negate; + char *printable; } *array; }; @@ -218,7 +219,12 @@ static long ff_regexp(const char *line, long len, for (i = 0; i < regs->nr; i++) { struct ff_reg *reg = regs->array + i; - if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) { + int ret = regexec_buf(&reg->re, line, len, 2, pmatch, 0); + warning("regexec %s:\n regex: %s\n buf: %.*s", + ret == 0 ? "matched" : "did not match", + reg->printable, + (int)len, line); + if (!ret) { if (reg->negate) return -1; break; @@ -264,6 +270,7 @@ void xdiff_set_find_func(xdemitconf_t *xecfg, const char *value, int cflags) expression = value; if (regcomp(&reg->re, expression, cflags)) die("Invalid regexp to look for hunk header: %s", expression); + reg->printable = xstrdup(expression); free(buffer); value = ep + 1; } then when compiling with ASan and gcc-10, running the diff from t4018.66 produces this: $ git diff -U1 cpp-skip-access-specifiers warning: regexec did not match: regex: ^[ ]*[A-Za-z_][A-Za-z_0-9]*:[[:space:]]*($|/[/*]) buf: private: warning: regexec matched: regex: ^((::[[:space:]]*)?[A-Za-z_].*)$ buf: private: diff --git a/cpp-skip-access-specifiers b/cpp-skip-access-specifiers index 4d4a9db..ebd6f42 100644 --- a/cpp-skip-access-specifiers +++ b/cpp-skip-access-specifiers @@ -6,3 +6,3 @@ private: void DoSomething(); int ChangeMe; }; void DoSomething(); - int ChangeMe; + int IWasChanged; }; That first regex should match (and is negated, so it should be telling us _not_ to match "private:"). But it wouldn't if regexec() is looking at the whole buffer, and not just the length-limited line we've fed to regexec_buf(). So this is consistent again with REG_STARTEND being ignored. The correct output (compiling without ASan, or gcc-9 with Asan) looks like this: warning: regexec matched: regex: ^[ ]*[A-Za-z_][A-Za-z_0-9]*:[[:space:]]*($|/[/*]) buf: private: [...more lines that we end up not using...] warning: regexec matched: regex: ^((::[[:space:]]*)?[A-Za-z_].*)$ buf: class RIGHT : public Baseclass diff --git a/cpp-skip-access-specifiers b/cpp-skip-access-specifiers index 4d4a9db..ebd6f42 100644 --- a/cpp-skip-access-specifiers +++ b/cpp-skip-access-specifiers @@ -6,3 +6,3 @@ class RIGHT : public Baseclass void DoSomething(); - int ChangeMe; + int IWasChanged; }; So it really does seem like libasan's regex engine is ignoring REG_STARTEND. We should be able to work around it by compiling with NO_REGEX, which would use our local regexec(). But to make matters even more interesting, this isn't enough by itself. Because ASan has support from the compiler, it doesn't seem to intercept our call to regexec() at the dynamic library level. It actually recognizes when we are compiling a call to regexec() and replaces it with ASan-specific code at that point. And unlike most of our other compat code, where we might have git_mmap() or similar, the actual symbol name in the compiled compat/regex code is regexec(). So just compiling with NO_REGEX isn't enough; we still end up in libasan! We can work around that by having the preprocessor replace regexec with git_regexec (both in the callers and in the actual implementation), and we truly end up with a call to our custom regex code, even when compiling with ASan. That's probably a good thing to do anyway, as it means anybody looking at the symbols later (e.g., in a debugger) would have a better indication of which function is which. So we'll do the same for the other common regex functions (even though just regexec() is enough to fix this ASan problem). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Signed-off-by: Alessandro Menti <alessandro.menti@alessandromenti.it>

* 'update-italian-translation' of github.com:AlessandroMenti/git-po: l10n: it.po: update for Git 2.27.0 round #2

l10n-2.27.0-rnd2 * tag 'l10n-2.27.0-rnd2' of git://github.com/git-l10n/git-po: (23 commits) l10n: zh_TW.po: v2.27.0 round 2 (0 untranslated) l10n: zh_TW.po: v2.27.0 round 1 (0 untranslated) l10n: de.po: Fix typo in the German translation of octopus l10n: de.po: Update German translation for Git 2.27.0 l10n: it.po: update for Git 2.27.0 round #2 l10n: tr: v2.27.0 round 2 l10n: fr.po v2.27.0 rnd 2 l10n: bg.po: Updated Bulgarian translation (4875t) l10n: Update Catalan translation l10n: sv.po: Update Swedish translation (4875t0f0u) l10n: vi(4875t): Updated Vietnamses translation for 2.27.0rd2 l10n: zh_CN: for git v2.27.0 l10n round 1~2 l10n: git.pot: v2.27.0 round 2 (+1) l10n: Update Catalan translation l10n: vi(4874t): Updated Vietnamses translation for 2.27.0 l10n: es: 2.27.0 round 1 l10n: bg.po: Updated Bulgarian translation (4868t) l10n: fr v2.27.0 rnd 1 l10n: sv.po: Update Swedish translation (4839t0f0u) l10n: tr: v2.27.0 round 1 ...

The Chunk Lookup table stores the chunks' starting offset in the commit-graph file, not their sizes. Consequently, the size of a chunk can only be calculated by subtracting its offset from the offset of the subsequent chunk (or that of the terminating label). This is currenly implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the id of each chunk and store its starting offset, then we check the id of the last seen chunk and calculate its size using its previously saved offset. At the moment there is only one chunk for which we calculate its size, but this patch series will add more, and the repeated chunk id checks are not that pretty. Instead let's read ahead the offset of the next chunk on each iteration, so we can calculate the size of each chunk right away, right where we store its starting offset. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Unify the 'chunk_ids' and 'chunk_sizes' arrays into an array of 'struct chunk_info'. This will allow more cleanups in the following patches. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The early part of t9100 creates an unusual "doubled" history in the "git-svn" ref. When we get to t9100.17, it looks like this: $ git log --oneline --graph git-svn [...] * efd0303 detect node change from file to directory #2 |\ * | 3e727c0 detect node change from file to directory #2 |/ * 3b00468 try a deep --rmdir with a commit |\ * | b4832d8 try a deep --rmdir with a commit |/ * f0d7bd5 import for git svn Each commit we make with "git commit" is paired with one from "git svn set-tree", with the latter as a merge of the first and its grandparent. Later, t9100.17 wants to check that "git svn fetch" gets the same trees. And it does, but just one copy of each. So it uses rev-list to get the tree of each commit and pipes it to "uniq" to drop the duplicates. Our input isn't sorted, but it will find adjacent duplicates. This works reliably because the order of commits from rev-list always shows the duplicates next to each other. For any one of those merges, we could choose to show its duplicate or the grandparent first. But barring clocks running backwards, the duplicate will always have a time equal to or greater than the grandparent. Even if equal, we break ties by showing the first-parent first, so the duplicates remain adjacent. But this would break if the timestamps stopped moving in chronological order. Normally we would rely on test_tick for this, but we have _two_ sources of time here: - "git commit" creates one commit based on GIT_COMMITTER_DATE (which respects test_tick) - the "svn set-tree" one is based on subversion, which does not have an easy way to specify a timestamp So using test_tick actually breaks the test, because now the duplicates are far in the past, and we'll show the grandparent before the duplicate. And likewise, a proposed change to set GIT_COMMITTER_DATE in all scripts will break it. We _could_ fix this by sorting before removing duplicates, but presumably it's a useful part of the test to make sure the trees appear in the same order in both spots. Likewise, we could use something like: perl -ne 'print unless $seen{$_}++' to remove duplicates without impacting the order. But that doesn't work either, because there are actually multiple (non-duplicate) commits with the same trees (we change a file mode and then change it back). So we'd actually have to de-duplicate the combination of subject and tree. Which then further throws off t9100.18, which compares the tree hashes exactly; we'd have to strip the result back down. Since this test _isn't_ buggy, the simplest thing is to just work around the proposed change by documenting our expectation that git-created commits are correctly interleaved using the current time. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The changed-path Bloom filter is improved using ideas from an independent implementation. * sg/commit-graph-cleanups: commit-graph: simplify write_commit_graph_file() #2 commit-graph: simplify write_commit_graph_file() #1 commit-graph: simplify parse_commit_graph() #2 commit-graph: simplify parse_commit_graph() #1 commit-graph: clean up #includes diff.h: drop diff_tree_oid() & friends' return value commit-slab: add a function to deep free entries on the slab commit-graph-format.txt: all multi-byte numbers are in network byte order commit-graph: fix parsing the Chunk Lookup table tree-walk.c: don't match submodule entries for 'submod/anything'

In demultiplex_sideband(), there are two oddities when we check an incoming packet: - if it has zero length, then we assume it's a flush packet. This means we fail to notice the difference between a real flush and a true zero-length packet that's missing its sideband designator. It's not a huge problem in practice because we'd never send a zero-length data packet (even our keepalives are otherwise-empty sideband-1 packets). But it would be nice to detect and report the error, since it's likely to cause other confusion (we think the other side flushed, but they do not). - we try to detect packets missing their designator by checking for "if (len < 1)". But this will never trigger for "len == 0"; we've already detected that and left the function before then. It _could_ detect a negative "len" parameter. But in that case, the error message is wrong. The issue is not "no sideband" but rather "eof while reading the packet". However, this can't actually be triggered in practice, because neither of the two callers uses pkt_read's GENTLE_ON_EOF flag. Which means they'd die with "the remote end hung up unexpectedly" before we even get here. So this truly is dead code. We can improve these cases by passing in a pkt-line status to the demultiplexer, and by having recv_sideband() use GENTLE_ON_EOF. This gives us two improvements: - we can now reliably detect flush packets, and will report a normal packet missing its sideband designator as an error - we'll report an eof with a more detailed "protocol error: eof while reading sideband packet", rather than the generic "the remote end hung up unexpectedly" - when we see an eof, we'll flush the sideband scratch buffer, which may provide some hints from the remote about why they hung up (though note we already flush on newlines, so it's likely that most such messages already made it through) In some sense this patch goes against fbd76cd (sideband: reverse its dependency on pkt-line, 2019-01-16), which caused the sideband code not to depend on the pkt-line code. But that commit was really just trying to deal with the circular header dependency. The two modules are conceptually interlinked, and it was just trying to keep things compiling. And indeed, there's a sticking point in this patch: because pkt-line.h includes sideband.h, we can't add the reverse include we need for the sideband code to have an "enum packet_read_status" parameter. Nor can we forward declare it, because you can't forward declare an enum in C. However, C does guarantee that enums fit in an int, so we can just use that type. One alternative would be for the callers to check themselves that they got something sane from the pkt-line code. But besides duplicating logic, this gets quite tricky. Any error condition requires flushing the sideband #2 scratch buffer, which only demultiplex_sideband() knows how to do. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Test 5572.63 ("branch has no merge base with remote-tracking counterpart") was introduced in 4d36f88 (submodule: do not pass null OID to setup_revisions, 2018-05-24), as a regression test for the bug this commit was fixing (preventing a 'fatal: bad object' error when the current branch and the remote-tracking branch we are pulling have no merge-base). However, the commit message for 4d36f88 does not describe in which real-life situation this bug was encountered. The brief discussion on the mailing list [1] does not either. The regression test is not really representative of a real-life scenario: both the local repository and its upstream have only a single commit, and the "no merge-base" scenario is simulated by recreating this root commit in the local repository using 'git commit-tree' before calling 'git pull --rebase --recurse-submodules'. The rebase succeeds and results in the local branch being reset to the same root commit as the upstream branch. The fix in 4d36f88 modifies 'submodule.c::submodule_touches_in_range' so that if 'excl_oid' is null, which is the case when the 'git merge-base --fork-point' invocation in 'builtin/pull.c::get_rebase_fork_point' errors (no fork-point), then instead of 'incl_oid --not excl_oid' being passed to setup_revisions, only 'incl_oid' is passed, and 'submodule_touches_in_range' examines 'incl_oid' and all its ancestors to verify that they do not touch the submodule. In test 5572.63, the recreated lone root commit in the local repository is thus the only commit being examined by 'submodule_touches_in_range', and this commit *adds* the submodule. However, 'submodule_touches_in_range' *succeeds* because 'combine-diff.c::diff_tree_combined' (see the backtrace below) returns early since this commit is the root commit and has no parents. #0 diff_tree_combined at combine-diff.c:1494 #1 0x0000000100150cbe in diff_tree_combined_merge at combine-diff.c:1649 #2 0x00000001002c7147 in collect_changed_submodules at submodule.c:869 #3 0x00000001002c7d6f in submodule_touches_in_range at submodule.c:1268 #4 0x00000001000ad58b in cmd_pull at builtin/pull.c:1040 In light of all this, add a note in t5572 documenting this peculiar test. [1] https://lore.kernel.org/git/20180524204729.19896-1-jonathantanmy@google.com/t/#u Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The previous change reduced time spent in strlen() while comparing consecutive paths in verify_cache(), but we can do better. The conditional checks the existence of a directory separator at the correct location, but only after doing a string comparison. Swap the order to be logically equivalent but perform fewer string comparisons. To test the effect on performance, I used a repository with over three million paths in the index. I then ran the following command on repeat: git -c index.threads=1 commit --amend --allow-empty --no-edit Here are the measurements over 10 runs after a 5-run warmup: Benchmark #1: v2.30.0 Time (mean ± σ): 854.5 ms ± 18.2 ms Range (min … max): 825.0 ms … 892.8 ms Benchmark #2: Previous change Time (mean ± σ): 833.2 ms ± 10.3 ms Range (min … max): 815.8 ms … 849.7 ms Benchmark #3: This change Time (mean ± σ): 815.5 ms ± 18.1 ms Range (min … max): 795.4 ms … 849.5 ms This change is 2% faster than the previous change and 5% faster than v2.30.0. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

This is modelled on the version of handle_directory_level_conflicts() from merge-recursive.c, but is massively simplified due to the following factors: * strmap API provides simplifications over using direct hashmap * we have a dirs_removed field in struct rename_info that we have an easy way to populate from collect_merge_info(); this was already used in compute_rename_counts() and thus we do not need to check for condition #2. * The removal of condition #2 by handling it earlier in the code also obviates the need to check for condition #3 -- if both sides renamed a directory, meaning that the directory no longer exists on either side, then neither side could have added any new files to that directory, and thus there are no files whose locations we need to move due to such a directory rename. In fact, the same logic that makes condition #3 irrelevant means condition #1 is also irrelevant so we could drop this function. However, it is cheap to check if both sides rename the same directory, and doing so can save future computation. So, simply remove any directories that both sides renamed from the list of directory renames. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Add some timing instrumentation for both merge-ort and diffcore-rename; I used these to measure and optimize performance in both, and several future patch series will build on these to reduce the timings of some select testcases. === Setup === The primary testcase I used involved rebasing a random topic in the linux kernel (consisting of 35 patches) against an older version. I added two variants, one where I rename a toplevel directory, and another where I only rebase one patch instead of the whole topic. The setup is as follows: $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git $ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e $ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34 $ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e $ git switch -c 5.4-renames v5.4 $ git mv drivers pilots # Introduce over 26,000 renames $ git commit -m "Rename drivers/ to pilots/" $ git config merge.renameLimit 30000 $ git config merge.directoryRenames true === Testcases === Now with REBASE standing for either "git rebase [--merge]" (using merge-recursive) or "test-tool fast-rebase" (using merge-ort), the testcases are: Testcase #1: no-renames $ git checkout v5.4^0 $ REBASE --onto HEAD base hwmon-updates Note: technically the name is misleading; there are some renames, but very few. Rename detection only takes about half the overall time. Testcase #2: mega-renames $ git checkout 5.4-renames^0 $ REBASE --onto HEAD base hwmon-updates Testcase #3: just-one-mega $ git checkout 5.4-renames^0 $ REBASE --onto HEAD base hwmon-just-one === Timing results === Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames, 10 runs for the other two cases): merge-recursive merge-ort no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s A single re-run of each with some breakdowns: --- no-renames --- merge-recursive merge-ort overall runtime: 19.302 s 14.257 s inexact rename detection: 7.603 s 7.906 s everything else: 11.699 s 6.351 s --- mega-renames --- merge-recursive merge-ort overall runtime: 5950.195 s 5499.672 s inexact rename detection: 5746.309 s 5487.120 s everything else: 203.886 s 17.552 s --- just-one-mega --- merge-recursive merge-ort overall runtime: 151.001 s 158.582 s inexact rename detection: 143.448 s 157.835 s everything else: 7.553 s 0.747 s === Timing observations === 0) Maximum speedup The "everything else" row represents the maximum speedup we could achieve if we were to somehow infinitely parallelize inexact rename detection, but leave everything else alone. The fact that this is so much smaller than the real runtime (even in the case with virtually no renames) makes it clear just how overwhelmingly large the time spent on rename detection can be. 1) no-renames 1a) merge-ort is faster than merge-recursive, which is nice. However, this still should not be considered good enough. Although the "merge" backend to rebase (merge-recursive) is sometimes faster than the "apply" backend, this is one of those cases where it is not. In fact, even merge-ort is slower. The "apply" backend can complete this testcase in 6.940 s ± 0.485 s which is about 2x faster than merge-ort and 3x faster than merge-recursive. One goal of the merge-ort performance work will be to make it faster than git-am on this (and similar) testcases. 2) mega-renames 2a) Obviously rename detection is a huge cost; it's where most the time is spent. We need to cut that down. If we could somehow infinitely parallelize it and drive its time to 0, the merge-recursive time would drop to about 204s, and the merge-ort time would drop to about 17s. I think this particular stat shows I've subtly baked a couple performance improvements into merge-ort and into fast-rebase already. 3) just-one-mega 3a) not much to say here, it just gives some flavor for how rebasing only one patch compares to rebasing 35. === Goals === This patch is obviously just the beginning. Here are some of my goals that this measurement will help us achieve: * Drive the cost of rename detection down considerably for merges * After the above has been achieved, see if there are other slowness factors (which would have previously been overshadowed by rename detection costs) which we can then focus on and also optimize. * Ensure our rebase testcase that requires little rename detection is noticeably faster with merge-ort than with apply-based rebase. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Taylor Blau <ttaylorr@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Specify the format of the on-disk reverse index 'pack-*.rev' file, as well as prepare the code for the existence of such files. The reverse index maps from pack relative positions (i.e., an index into the array of object which is sorted by their offsets within the packfile) to their position within the 'pack-*.idx' file. Today, this is done by building up a list of (off_t, uint32_t) tuples for each object (the off_t corresponding to that object's offset, and the uint32_t corresponding to its position in the index). To convert between pack and index position quickly, this array of tuples is radix sorted based on its offset. This has two major drawbacks: First, the in-memory cost scales linearly with the number of objects in a pack. Each 'struct revindex_entry' is sizeof(off_t) + sizeof(uint32_t) + padding bytes for a total of 16. To observe this, force Git to load the reverse index by, for e.g., running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking for a single object in a fresh clone of the kernel, Git needs to allocate 120+ MB of memory in order to hold the reverse index in memory. Second, the cost to sort also scales with the size of the pack. Luckily, this is a linear function since 'load_pack_revindex()' uses a radix sort, but this cost still must be paid once per pack per process. As an example, it takes ~60x longer to print the _size_ of an object as it does to print that entire object's _contents_: Benchmark #1: git.compile cat-file --batch <obj Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms] Range (min … max): 3.2 ms … 3.7 ms 726 runs Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms] Range (min … max): 193.7 ms … 224.4 ms 13 runs Instead, avoid computing and sorting the revindex once per process by writing it to a file when the pack itself is generated. The format is relatively straightforward. It contains an array of uint32_t's, the length of which is equal to the number of objects in the pack. The ith entry in this table contains the index position of the ith object in the pack, where "ith object in the pack" is determined by pack offset. One thing that the on-disk format does _not_ contain is the full (up to) eight-byte offset corresponding to each object. This is something that the in-memory revindex contains (it stores an off_t in 'struct revindex_entry' along with the same uint32_t that the on-disk format has). Omit it in the on-disk format, since knowing the index position for some object is sufficient to get a constant-time lookup in the pack-*.idx file to ask for an object's offset within the pack. This trades off between the on-disk size of the 'pack-*.rev' file for runtime to chase down the offset for some object. Even though the lookup is constant time, the constant is heavier, since it can potentially involve two pointer walks in v2 indexes (one to access the 4-byte offset table, and potentially a second to access the double wide offset table). Consider trying to map an object's pack offset to a relative position within that pack. In a cold-cache scenario, more page faults occur while switching between binary searching through the reverse index and searching through the *.idx file for an object's offset. Sure enough, with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after 'sync'ing), printing out the entire object's contents is still marginally faster than printing its size: Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms] Range (min … max): 21.4 ms … 23.5 ms 41 runs Benchmark #2: git.compile cat-file --batch <obj >/dev/null Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms] Range (min … max): 15.6 ms … 18.2 ms 45 runs (Numbers taken in the kernel after cheating and using the next patch to generate a reverse index). There are a couple of approaches to improve cold cache performance not pursued here: - We could include the object offsets in the reverse index format. Predictably, this does result in fewer page faults, but it triples the size of the file, while simultaneously duplicating a ton of data already available in the .idx file. (This was the original way I implemented the format, and it did show `--batch-check='%(objectsize:disk)'` winning out against `--batch`.) On the other hand, this increase in size also results in a large block-cache footprint, which could potentially hurt other workloads. - We could store the mapping from pack to index position in more cache-friendly way, like constructing a binary search tree from the table and writing the values in breadth-first order. This would result in much better locality, but the price you pay is trading O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you can no longer directly index the table). So, neither of these approaches are taken here. (Thankfully, the format is versioned, so we are free to pursue these in the future.) But, cold cache performance likely isn't interesting outside of one-off cases like asking for the size of an object directly. In real-world usage, Git is often performing many operations in the revindex (i.e., asking about many objects rather than a single one). The trade-off is worth it, since we will avoid the vast majority of the cost of generating the revindex that the extra pointer chase will look like noise in the following patch's benchmarks. This patch describes the format and prepares callers (like in pack-revindex.c) to be able to read *.rev files once they exist. An implementation of the writer will appear in the next patch, and callers will gradually begin to start using the writer in the patches that follow after that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

…sponse query_result can be be an empty strbuf (STRBUF_INIT) - in that case trying to read 3 bytes triggers a buffer overflow read (as query_result.buf = '\0'). Therefore we need to check query_result's length before trying to read 3 bytes. This overflow was introduced in: 940b94f (fsmonitor: log invocation of FSMonitor hook to trace2, 2021-02-03) It was found when running the test-suite against ASAN, and can be most easily reproduced with the following command: make GIT_TEST_OPTS="-v" DEFAULT_TEST_TARGET="t7519-status-fsmonitor.sh" \ SANITIZE=address DEVELOPER=1 test ==2235==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0000019e6e5e at pc 0x00000043745c bp 0x7fffd382c520 sp 0x7fffd382bcc8 READ of size 3 at 0x0000019e6e5e thread T0 #0 0x43745b in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:842:7 #1 0x43786d in bcmp /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:887:10 #2 0x80b146 in fsmonitor_is_trivial_response /home/ahunt/oss-fuzz/git/fsmonitor.c:192:10 #3 0x80b146 in query_fsmonitor /home/ahunt/oss-fuzz/git/fsmonitor.c:175:7 #4 0x80a749 in refresh_fsmonitor /home/ahunt/oss-fuzz/git/fsmonitor.c:267:21 #5 0x80bad1 in tweak_fsmonitor /home/ahunt/oss-fuzz/git/fsmonitor.c:429:4 #6 0x90f040 in read_index_from /home/ahunt/oss-fuzz/git/read-cache.c:2321:3 #7 0x8e5d08 in repo_read_index_preload /home/ahunt/oss-fuzz/git/preload-index.c:164:15 #8 0x52dd45 in prepare_index /home/ahunt/oss-fuzz/git/builtin/commit.c:363:6 #9 0x52a188 in cmd_commit /home/ahunt/oss-fuzz/git/builtin/commit.c:1588:15 #10 0x4ce77e in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #11 0x4ccb18 in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #12 0x4cb01c in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #13 0x4cb01c in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 git-for-windows#14 0x6aca8d in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 git-for-windows#15 0x7fb027bf5349 in __libc_start_main (/lib64/libc.so.6+0x24349) git-for-windows#16 0x4206b9 in _start /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:120 0x0000019e6e5e is located 2 bytes to the left of global variable 'strbuf_slopbuf' defined in 'strbuf.c:51:6' (0x19e6e60) of size 1 'strbuf_slopbuf' is ascii string '' 0x0000019e6e5e is located 126 bytes to the right of global variable 'signals' defined in 'sigchain.c:11:31' (0x19e6be0) of size 512 SUMMARY: AddressSanitizer: global-buffer-overflow /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:842:7 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) Shadow bytes around the buggy address: 0x000080334d70: f9 f9 f9 f9 00 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 0x000080334d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x000080334d90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x000080334da0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x000080334db0: 00 00 00 00 00 00 00 00 00 00 00 00 f9 f9 f9 f9 =>0x000080334dc0: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9[f9]01 f9 f9 f9 0x000080334dd0: f9 f9 f9 f9 03 f9 f9 f9 f9 f9 f9 f9 02 f9 f9 f9 0x000080334de0: f9 f9 f9 f9 00 f9 f9 f9 f9 f9 f9 f9 04 f9 f9 f9 0x000080334df0: f9 f9 f9 f9 01 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 0x000080334e00: f9 f9 f9 f9 00 00 00 00 f9 f9 f9 f9 01 f9 f9 f9 0x000080334e10: f9 f9 f9 f9 04 f9 f9 f9 f9 f9 f9 f9 00 f9 f9 f9 Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb Shadow gap: cc Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Acked-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

shorten_unambiguous_ref() returns an allocated string. We have to track it separately from the const refname. This leak has existed since: 9ab55da (git symbolic-ref --delete $symref, 2012-10-21) This leak was found when running t0001 with LSAN, see also LSAN output below: Direct leak of 19 byte(s) in 1 object(s) allocated from: #0 0x486514 in strdup /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0x9ab048 in xstrdup /home/ahunt/oss-fuzz/git/wrapper.c:29:14 #2 0x8b452f in refs_shorten_unambiguous_ref /home/ahunt/oss-fuzz/git/refs.c #3 0x8b47e8 in shorten_unambiguous_ref /home/ahunt/oss-fuzz/git/refs.c:1287:9 #4 0x679fce in check_symref /home/ahunt/oss-fuzz/git/builtin/symbolic-ref.c:28:14 #5 0x679ad8 in cmd_symbolic_ref /home/ahunt/oss-fuzz/git/builtin/symbolic-ref.c:70:9 #6 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #7 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #8 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #9 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #10 0x69cc6e in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #11 0x7f98388a4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

dwim_ref() allocs a new string into ref. Instead of setting to NULL to discard it, we can FREE_AND_NULL. This leak appears to have been introduced in: 4cf76f6 (builtin/reset: compute checkout metadata for reset, 2020-03-16) This leak was found when running t0001 with LSAN, see also LSAN output below: Direct leak of 5 byte(s) in 1 object(s) allocated from: #0 0x486514 in strdup /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0x9a7108 in xstrdup /home/ahunt/oss-fuzz/git/wrapper.c:29:14 #2 0x8add6b in expand_ref /home/ahunt/oss-fuzz/git/refs.c:670:12 #3 0x8ad777 in repo_dwim_ref /home/ahunt/oss-fuzz/git/refs.c:644:22 #4 0x6394af in dwim_ref /home/ahunt/oss-fuzz/git/./refs.h:162:9 #5 0x637e5c in cmd_reset /home/ahunt/oss-fuzz/git/builtin/reset.c:426:4 #6 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #7 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #8 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #9 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #10 0x69c5ce in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #11 0x7f57ebb9d349 in __libc_start_main (/lib64/libc.so.6+0x24349) Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Most of these pointers can safely be freed when cmd_clone() completes, therefore we make sure to free them. The one exception is that we have to UNLEAK(repo) because it can point either to argv[0], or a malloc'd string returned by absolute_pathdup(). We also have to free(path) in the middle of cmd_clone(): later during cmd_clone(), path is unconditionally overwritten with a different path, triggering a leak. Freeing the first path immediately after use (but only in the case where it contains data) seems like the cleanest solution, as opposed to freeing it unconditionally before path is reused for another path. This leak appears to have been introduced in: f38aa83 (use local cloning if insteadOf makes a local URL, 2014-07-17) These leaks were found when running t0001 with LSAN, see also an excerpt of the LSAN output below (the full list is omitted because it's far too long, and mostly consists of indirect leakage of members of the refs we are freeing). Direct leak of 178 byte(s) in 1 object(s) allocated from: #0 0x49a53d in malloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0x9a6ff4 in do_xmalloc /home/ahunt/oss-fuzz/git/wrapper.c:41:8 #2 0x9a6fca in xmalloc /home/ahunt/oss-fuzz/git/wrapper.c:62:9 #3 0x8ce296 in copy_ref /home/ahunt/oss-fuzz/git/remote.c:885:8 #4 0x8d2ebd in guess_remote_head /home/ahunt/oss-fuzz/git/remote.c:2215:10 #5 0x51d0c5 in cmd_clone /home/ahunt/oss-fuzz/git/builtin/clone.c:1308:4 #6 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #7 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #8 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #9 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #10 0x69c45e in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #11 0x7f6a459d5349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 165 byte(s) in 1 object(s) allocated from: #0 0x49a53d in malloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0x9a6fc4 in do_xmalloc /home/ahunt/oss-fuzz/git/wrapper.c:41:8 #2 0x9a6f9a in xmalloc /home/ahunt/oss-fuzz/git/wrapper.c:62:9 #3 0x8ce266 in copy_ref /home/ahunt/oss-fuzz/git/remote.c:885:8 #4 0x51e9bd in wanted_peer_refs /home/ahunt/oss-fuzz/git/builtin/clone.c:574:21 #5 0x51cfe1 in cmd_clone /home/ahunt/oss-fuzz/git/builtin/clone.c:1284:17 #6 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #7 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #8 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #9 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #10 0x69c42e in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #11 0x7f8fef0c2349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 178 byte(s) in 1 object(s) allocated from: #0 0x49a53d in malloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0x9a6ff4 in do_xmalloc /home/ahunt/oss-fuzz/git/wrapper.c:41:8 #2 0x9a6fca in xmalloc /home/ahunt/oss-fuzz/git/wrapper.c:62:9 #3 0x8ce296 in copy_ref /home/ahunt/oss-fuzz/git/remote.c:885:8 #4 0x8d2ebd in guess_remote_head /home/ahunt/oss-fuzz/git/remote.c:2215:10 #5 0x51d0c5 in cmd_clone /home/ahunt/oss-fuzz/git/builtin/clone.c:1308:4 #6 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #7 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #8 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #9 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #10 0x69c45e in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #11 0x7f6a459d5349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 165 byte(s) in 1 object(s) allocated from: #0 0x49a6b2 in calloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0x9a72f2 in xcalloc /home/ahunt/oss-fuzz/git/wrapper.c:140:8 #2 0x8ce203 in alloc_ref_with_prefix /home/ahunt/oss-fuzz/git/remote.c:867:20 #3 0x8ce1a2 in alloc_ref /home/ahunt/oss-fuzz/git/remote.c:875:9 #4 0x72f63e in process_ref_v2 /home/ahunt/oss-fuzz/git/connect.c:426:8 #5 0x72f21a in get_remote_refs /home/ahunt/oss-fuzz/git/connect.c:525:8 #6 0x979ab7 in handshake /home/ahunt/oss-fuzz/git/transport.c:305:4 #7 0x97872d in get_refs_via_connect /home/ahunt/oss-fuzz/git/transport.c:339:9 #8 0x9774b5 in transport_get_remote_refs /home/ahunt/oss-fuzz/git/transport.c:1388:4 #9 0x51cf80 in cmd_clone /home/ahunt/oss-fuzz/git/builtin/clone.c:1271:9 #10 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #11 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #12 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #13 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 git-for-windows#14 0x69c45e in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 git-for-windows#15 0x7f6a459d5349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 105 byte(s) in 1 object(s) allocated from: #0 0x49a859 in realloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0x9a71f6 in xrealloc /home/ahunt/oss-fuzz/git/wrapper.c:126:8 #2 0x93622d in strbuf_grow /home/ahunt/oss-fuzz/git/strbuf.c:98:2 #3 0x937a73 in strbuf_addch /home/ahunt/oss-fuzz/git/./strbuf.h:231:3 #4 0x939fcd in strbuf_add_absolute_path /home/ahunt/oss-fuzz/git/strbuf.c:911:4 #5 0x69d3ce in absolute_pathdup /home/ahunt/oss-fuzz/git/abspath.c:261:2 #6 0x51c688 in cmd_clone /home/ahunt/oss-fuzz/git/builtin/clone.c:1021:10 #7 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #8 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #9 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #10 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #11 0x69c45e in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #12 0x7f6a459d5349 in __libc_start_main (/lib64/libc.so.6+0x24349) Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Make sure that we release the temporary strbuf during dwim_branch() for all codepaths (and not just for the early return). This leak appears to have been introduced in: f60a7b7 (worktree: teach "add" to check out existing branches, 2018-04-24) Note that UNLEAK(branchname) is still needed: the returned result is used in add(), and is stored in a pointer which is used to point at one of: - a string literal ("HEAD") - member of argv (whatever the user specified in their invocation) - or our newly allocated string returned from dwim_branch() Fixing the branchname leak isn't impossible, but does not seem worthwhile given that add() is called directly from cmd_main(), and cmd_main() returns immediately thereafter - UNLEAK is good enough. This leak was found when running t0001 with LSAN, see also LSAN output below: Direct leak of 60 byte(s) in 1 object(s) allocated from: #0 0x49a859 in realloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0x9ab076 in xrealloc /home/ahunt/oss-fuzz/git/wrapper.c:126:8 #2 0x939fcd in strbuf_grow /home/ahunt/oss-fuzz/git/strbuf.c:98:2 #3 0x93af53 in strbuf_splice /home/ahunt/oss-fuzz/git/strbuf.c:239:3 #4 0x83559a in strbuf_check_branch_ref /home/ahunt/oss-fuzz/git/object-name.c:1593:2 #5 0x6988b9 in dwim_branch /home/ahunt/oss-fuzz/git/builtin/worktree.c:454:20 #6 0x695f8f in add /home/ahunt/oss-fuzz/git/builtin/worktree.c:525:19 #7 0x694a04 in cmd_worktree /home/ahunt/oss-fuzz/git/builtin/worktree.c:1036:10 #8 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #9 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #10 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #11 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #12 0x69caee in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #13 0x7f7b7dd10349 in __libc_start_main (/lib64/libc.so.6+0x24349) Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The primary goal of this change is to stop leaking init_db_template_dir. This leak can happen because: 1. git_init_db_config() allocates new memory into init_db_template_dir without first freeing the existing value. 2. init_db_template_dir might already contain data, either because: 2.1 git_config() can be invoked twice with this callback in a single process - at least 2 allocations are likely. 2.2 A single git_config() allocation can invoke the callback multiple times for a given key (see further explanation in the function docs) - each of those calls will trigger another leak. The simplest fix for the leak would be to free(init_db_template_dir) before overwriting it. Instead we choose to convert to fetching init.templatedir via git_config_get_value() as that is more explicit, more efficient, and avoids allocations (the returned result is owned by the config cache, so we aren't responsible for freeing it). If we remove init_db_template_dir, git_init_db_config() ends up being responsible only for forwarding core.* config values to platform_core_config(). However platform_core_config() already ignores non-core.* config values, so we can safely remove git_init_db_config() and invoke git_config() directly with platform_core_config() as the callback. The platform_core_config forwarding was originally added in: 2878533 (mingw: respect core.hidedotfiles = false in git-init again, 2019-03-11 And I suspect the potential for a leak existed since the original implementation of git_init_db_config in: 90b4518 (Add `init.templatedir` configuration variable., 2010-02-17) LSAN output from t0001: Direct leak of 73 byte(s) in 1 object(s) allocated from: #0 0x49a859 in realloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0x9a7276 in xrealloc /home/ahunt/oss-fuzz/git/wrapper.c:126:8 #2 0x9362ad in strbuf_grow /home/ahunt/oss-fuzz/git/strbuf.c:98:2 #3 0x936eaa in strbuf_add /home/ahunt/oss-fuzz/git/strbuf.c:295:2 #4 0x868112 in strbuf_addstr /home/ahunt/oss-fuzz/git/./strbuf.h:304:2 #5 0x86a8ad in expand_user_path /home/ahunt/oss-fuzz/git/path.c:758:2 #6 0x720bb1 in git_config_pathname /home/ahunt/oss-fuzz/git/config.c:1287:10 #7 0x5960e2 in git_init_db_config /home/ahunt/oss-fuzz/git/builtin/init-db.c:161:11 #8 0x7255b8 in configset_iter /home/ahunt/oss-fuzz/git/config.c:1982:7 #9 0x7253fc in repo_config /home/ahunt/oss-fuzz/git/config.c:2311:2 #10 0x725ca7 in git_config /home/ahunt/oss-fuzz/git/config.c:2399:2 #11 0x593e8d in create_default_files /home/ahunt/oss-fuzz/git/builtin/init-db.c:225:2 #12 0x5935c6 in init_db /home/ahunt/oss-fuzz/git/builtin/init-db.c:449:11 #13 0x59588e in cmd_init_db /home/ahunt/oss-fuzz/git/builtin/init-db.c:714:9 git-for-windows#14 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 git-for-windows#15 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 git-for-windows#16 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 git-for-windows#17 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 git-for-windows#18 0x69c4de in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 git-for-windows#19 0x7f23552d6349 in __libc_start_main (/lib64/libc.so.6+0x24349) Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

preprocess_options() allocates new strings for help messages for OPTION_ALIAS. Therefore we also need to clean those help messages up when freeing the returned options. First introduced in: 7c28058 (parse-options: teach "git cmd -h" to show alias as alias, 2020-03-16) The preprocessed options themselves no longer contain any indication that a given option is/was an alias - therefore we add a new flag to indicate former aliases. (An alternative approach would be to look back at the original options to determine which options are aliases - but that seems like a fragile approach. Or we could even look at the alias_groups list - which might be less fragile, but would be slower as it requires nested looping.) As far as I can tell, parse_options() is only ever used once per command, and the help messages are small - hence this leak has very little impact. This leak was found while running t0001. LSAN output can be found below: Direct leak of 65 byte(s) in 1 object(s) allocated from: #0 0x49a859 in realloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0x9aae36 in xrealloc /home/ahunt/oss-fuzz/git/wrapper.c:126:8 #2 0x939d8d in strbuf_grow /home/ahunt/oss-fuzz/git/strbuf.c:98:2 #3 0x93b936 in strbuf_vaddf /home/ahunt/oss-fuzz/git/strbuf.c:392:3 #4 0x93b7ff in strbuf_addf /home/ahunt/oss-fuzz/git/strbuf.c:333:2 #5 0x86747e in preprocess_options /home/ahunt/oss-fuzz/git/parse-options.c:666:3 #6 0x866ed2 in parse_options /home/ahunt/oss-fuzz/git/parse-options.c:847:17 #7 0x51c4a7 in cmd_clone /home/ahunt/oss-fuzz/git/builtin/clone.c:989:9 #8 0x4cd60d in run_builtin /home/ahunt/oss-fuzz/git/git.c:453:11 #9 0x4cb2da in handle_builtin /home/ahunt/oss-fuzz/git/git.c:704:3 #10 0x4ccc37 in run_argv /home/ahunt/oss-fuzz/git/git.c:771:4 #11 0x4cac29 in cmd_main /home/ahunt/oss-fuzz/git/git.c:902:19 #12 0x69c9fe in main /home/ahunt/oss-fuzz/git/common-main.c:52:11 #13 0x7fdac42d4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When queueing references in git-rev-list(1), we try to optimize parsing of commits via the commit-graph. To do so, we first look up the object's type, and if it is a commit we call `repo_parse_commit()` instead of `parse_object()`. This is quite inefficient though given that we're always uncompressing the object header in order to determine the type. Instead, we can opportunistically search the commit-graph for the object ID: in case it's found, we know it's a commit and can directly fill in the commit object without having to uncompress the object header. Expose a new function `lookup_commit_in_graph()`, which tries to find a commit in the commit-graph by ID, and convert `get_reference()` to use this function. This provides a big performance win in cases where we load references in a repository with lots of references pointing to commits. The following has been executed in a real-world repository with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s] Range (min … max): 4.409 s … 4.534 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s] Range (min … max): 3.061 s … 3.105 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When fetching, Git will by default print a list of all updated refs in a nicely formatted table. In order to come up with this table, Git needs to iterate refs twice: first to determine the maximum column width, and a second time to actually format these changed refs. While this table will not be printed in case the user passes `--quiet`, we still go out of our way and do all these steps. In fact, we even do more work compared to not passing `--quiet`: without the flag, we will skip all references in the column width computation which have not been updated, but if it is set we will now compute widths for all refs. Fix this issue by completely skipping both preparation of the format and formatting data for display in case the user passes `--quiet`, improving performance especially with many refs. The following benchmark shows a nice speedup for a quiet mirror-fetch in a repository with 2.3M refs: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 26.929 s ± 0.145 s [User: 24.194 s, System: 4.656 s] Range (min … max): 26.692 s … 27.068 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 25.189 s ± 0.094 s [User: 22.556 s, System: 4.606 s] Range (min … max): 25.070 s … 25.314 s 5 runs Summary 'HEAD: git-fetch' ran 1.07 ± 0.01 times faster than 'HEAD~: git-fetch' While at it, this patch also fixes `adjust_refcol_width()` such that it skips unchanged refs in case the user passed `--quiet`, where verbosity will be negative. While this function won't be called anymore if so, this brings the comment in line with actual code. Furthermore, needless `verbosity >= 0` checks are now removed in `store_updated_refs()`: we never print to the `note` buffer anymore in case `verbosity < 0`, so we won't end up in that code block anyway. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

…the `sparse-checkout` builtin This integrates the `sparse-checkout` builtin with the sparse index. The tricky part here is that we need to partially expand the index when we are modifying the sparse-checkout definition. Note that we modify the pattern list in a careful way: we create a `struct pattern_list` in-memory in `builtin/sparse-checkout.c` then apply those patterns to the index before writing the patterns to the sparse-checkout file. The `update_sparsity()` method does the work to assign the `SKIP_WORKTREE` bit appropriately, but this doesn't work if the files that are within the new sparse-checkout cone are still hidden behind a sparse directory. The new `expand_to_pattern_list()` method does the hard work of expanding the sparse directories that are now within the new patterns. This expands only as far as needed, possibly creating new sparse directory entries. This method does not contract existing files to sparse directories, and a big reason why is because of the check for ignored files as we delete those directories. The `clean_tracked_sparse_directories()` method is called after `update_sparsity()`, but we need to read the `A/B/.gitignore` file (or lack thereof) before we can delete `A/B/`. If we convert to sparse too quickly, then we lose this information and cause a full expansion. Most of the correctness is handled by existing tests in `t1092`, but I add checks for `ensure_not_expanded` in some hopefully interesting cases. As for performance, `git sparse-checkout set` can be slow if it needs to move a lot of files. However, no-op `git sparse-checkout set` (i.e. set the sparse-checkout cone to only include files at root, and do this on repeat) has these performance results on Linux in a monorepo with 2+ million files at `HEAD`: ``` Benchmark #1: baseline Time (mean ± σ): 10.465 s ± 0.018 s [User: 9.885 s, System: 0.573 s] Range (min … max): 10.450 s … 10.497 s 5 runs Benchmark #2: new code Time (mean ± σ): 68.9 ms ± 2.9 ms [User: 45.8 ms, System: 17.1 ms] Range (min … max): 63.4 ms … 74.0 ms 41 runs Summary 'new code' ran 151.89 ± 6.30 times faster than 'baseline' ```

During reflog expiry, the cmd_reflog_expire() function first iterates over all reflogs in logs/*, and then one-by-one acquires the lock for each one and expires it. This behavior has been with us since this command was implemented in 4264dc1 ("git reflog expire", 2006-12-19). Change this to stop calling lock_ref_oid_basic() with the OID we saw when we looped over the logs, instead have it pass the OID it managed to lock. This mostly mitigates a race condition where e.g. "git gc" will fail in a concurrently updated repository because the branch moved since "git reflog expire --all" was started. I.e. with: error: cannot lock ref '<refname>': ref '<refname>' is at <OID-A> but expected <OID-B> This behavior of passing in an "oid" was needed for an edge-case that I've untangled in this and preceding commits though, namely that we needed this OID because we'd: 1. Lookup the reflog name/OID via dwim_log() 2. With that OID, lock the reflog 3. Later in builtin/reflog.c we use the OID we looked as input to lookup_commit_reference_gently(), assured that it's equal to the OID we got from dwim_log(). We can be sure that this change is safe to make because between dwim_log (step #1) and lock_ref_oid_basic (step #2) there was no other logic relevant to the OID or expiry run in the cmd_reflog_expire() caller. We can thus treat that code as a black box, before and after this change it would get an OID that's been locked, the only difference is that now we mostly won't be failing to get the lock due to the TOCTOU race[0]. That failure was purely an implementation detail in how the "current OID" was looked up, it was divorced from the locking mechanism. What do we mean with "mostly"? It mostly mitigates it because we'll still run into cases where the ref is locked and being updated as we want to expire it, and other git processes wanting to update the refs will in turn race with us as we expire the reflog. That remaining race can in turn be mitigated with the core.filesRefLockTimeout setting, see 4ff0f01 ("refs: retry acquiring reference locks for 100ms", 2017-08-21). In practice if that value is high enough we'll probably never have ref updates or reflog expiry failing, since the clients involved will retry for far longer than the time any of those operations could take. See [1] for an initial report of how this impacted "git gc" and a large discussion about this change in early 2019. In particular patch looked good to Michael Haggerty, see his[2]. That message seems to not have made it to the ML archive, its content is quoted in full in my [3]. I'm leaving behind now-unused code the refs API etc. that takes the now-NULL "unused_oid" argument, and other code that can be simplified now that we never have on OID in that context, that'll be cleaned up in subsequent commits, but for now let's narrowly focus on fixing the "git gc" issue. As the modified assert() shows we always pass a NULL oid to reflog_expire() now. Unfortunately this sort of probabilistic contention is hard to turn into a test. I've tested this by running the following three subshells in concurrent terminals: ( rm -rf /tmp/git && git init /tmp/git && while true do head -c 10 /dev/urandom | hexdump >/tmp/git/out && git -C /tmp/git add out && git -C /tmp/git commit -m"out" done ) ( rm -rf /tmp/git-clone && git clone file:///tmp/git /tmp/git-clone && while git -C /tmp/git-clone pull do date done ) ( while git -C /tmp/git-clone reflog expire --all do date done ) Before this change the "reflog expire" would fail really quickly with the "but expected" error noted above. After this change both the "pull" and "reflog expire" will run for a while, but eventually fail because I get unlucky with core.filesRefLockTimeout (the "reflog expire" is in a really tight loop). As noted above that can in turn be mitigated with higher values of core.filesRefLockTimeout than the 100ms default. As noted in the commentary added in the preceding commit there's also the case of branches being racily deleted, that can be tested by adding this to the above: ( while git -C /tmp/git-clone branch topic master && git -C /tmp/git-clone branch -D topic do date done ) With core.filesRefLockTimeout set to 10 seconds (it can probably be a lot lower) I managed to run all four of these concurrently for about an hour, and accumulated ~125k commits, auto-gc's and all, and didn't have a single failure. The loops visibly stall while waiting for the lock, but that's expected and desired behavior. 0. https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use 1. https://lore.kernel.org/git/87tvg7brlm.fsf@evledraar.gmail.com/ 2. http://lore.kernel.org/git/b870a17d-2103-41b8-3cbc-7389d5fff33a@alum.mit.edu 3. https://lore.kernel.org/git/87pnqkco8v.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When updating our local refs based on the refs fetched from the remote, we need to iterate through all requested refs and load their respective commits such that we can determine whether they need to be appended to FETCH_HEAD or not. In cases where we're fetching from a remote with exceedingly many refs, resolving these refs can be quite expensive given that we repeatedly need to unpack object headers for each of the referenced objects. Speed this up by opportunistically trying to resolve object IDs via the commit graph. We only do so for any refs which are not in "refs/tags": more likely than not, these are going to be a commit anyway, and this lets us avoid having to unpack object headers completely in case the object is a commit that is part of the commit-graph. This significantly speeds up mirror-fetches in a real-world repository with 2.3M refs: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 56.482 s ± 0.384 s [User: 53.340 s, System: 5.365 s] Range (min … max): 56.050 s … 57.045 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 33.727 s ± 0.170 s [User: 30.252 s, System: 5.194 s] Range (min … max): 33.452 s … 33.871 s 5 runs Summary 'HEAD: git-fetch' ran 1.67 ± 0.01 times faster than 'HEAD~: git-fetch' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When updating local refs after the fetch has transferred all objects, we do an object existence test as a safety guard to avoid updating a ref to an object which we don't have. We do so via `oid_object_info()`: if it returns an error, then we know the object does not exist. One side effect of `oid_object_info()` is that it parses the object's type, and to do so it must unpack the object header. This is completely pointless: we don't care for the type, but only want to assert that the object exists. Refactor the code to use `repo_has_object_file()`, which both makes the code's intent clearer and is also faster because it does not unpack object headers. In a real-world repo with 2.3M refs, this results in a small speedup when doing a mirror-fetch: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 33.686 s ± 0.176 s [User: 30.119 s, System: 5.262 s] Range (min … max): 33.512 s … 33.944 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 31.247 s ± 0.195 s [User: 28.135 s, System: 5.066 s] Range (min … max): 30.948 s … 31.472 s 5 runs Summary 'HEAD: git-fetch' ran 1.08 ± 0.01 times faster than 'HEAD~: git-fetch' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The object ID iterator used by the connectivity checks returns the next object ID via an out-parameter and then uses a return code to indicate whether an item was found. This is a bit roundabout: instead of a separate error code, we can just return the next object ID directly and use `NULL` pointers as indicator that the iterator got no items left. Furthermore, this avoids a copy of the object ID. Refactor the iterator and all its implementations to return object IDs directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs: Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch Time (mean ± σ): 30.110 s ± 0.148 s [User: 27.161 s, System: 5.075 s] Range (min … max): 29.934 s … 30.406 s 10 runs Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch Time (mean ± σ): 29.899 s ± 0.109 s [User: 26.916 s, System: 5.104 s] Range (min … max): 29.696 s … 29.996 s 10 runs Summary '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran 1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch' While this 1% speedup could be labelled as statistically insignificant, the speedup is consistent on my machine. Furthermore, this is an end to end test, so it is expected that the improvement in the connectivity check itself is more significant. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In order to negotiate a packfile, we need to dereference refs to see which commits we have in common with the remote. To do so, we first look up the object's type -- if it's a tag, we peel until we hit a non-tag object. If we hit a commit eventually, then we return that commit. In case the object ID points to a commit directly, we can avoid the initial lookup of the object type by opportunistically looking up the commit via the commit-graph, if available, which gives us a slight speed bump of about 2% in a huge repository with about 2.3M refs: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 31.634 s ± 0.258 s [User: 28.400 s, System: 5.090 s] Range (min … max): 31.280 s … 31.896 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 31.129 s ± 0.543 s [User: 27.976 s, System: 5.056 s] Range (min … max): 30.172 s … 31.479 s 5 runs Summary 'HEAD: git-fetch' ran 1.02 ± 0.02 times faster than 'HEAD~: git-fetch' In case this fails, we fall back to the old code which peels the objects to a commit. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When fetching refs, we are doing two connectivity checks: - The first one is done such that we can skip fetching refs in the case where we already have all objects referenced by the updated set of refs. - The second one verifies that we have all objects after we have fetched objects. We always execute both connectivity checks, but this is wasteful in case the first connectivity check already notices that we have all objects locally available. Skip the second connectivity check in case we already had all objects available. This gives us a nice speedup when doing a mirror-fetch in a repository with about 2.3M refs where the fetching repo already has all objects: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 30.025 s ± 0.081 s [User: 27.070 s, System: 4.933 s] Range (min … max): 29.900 s … 30.111 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 25.574 s ± 0.177 s [User: 22.855 s, System: 4.683 s] Range (min … max): 25.399 s … 25.765 s 5 runs Summary 'HEAD: git-fetch' ran 1.17 ± 0.01 times faster than 'HEAD~: git-fetch' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In a sparse index it is possible for the tree that is being verified to be freed while it is being verified. This happens when the index is sparse but the cache tree is not and index_name_pos() looks up a path from the cache tree that is a descendant of a sparse index entry. That triggers a call to ensure_full_index() which frees the cache tree that is being verified. Carrying on trying to verify the tree after this results in a use-after-free bug. Instead restart the verification if a sparse index is converted to a full index. This bug is triggered by a call to reset_head() in "git rebase --apply". Thanks to RenÃ© Scharfe and Derrick Stolee for their help analyzing the problem. ==74345==ERROR: AddressSanitizer: heap-use-after-free on address 0x606000001b20 at pc 0x557cbe82d3a2 bp 0x7ffdfee08090 sp 0x7ffdfee08080 READ of size 4 at 0x606000001b20 thread T0 #0 0x557cbe82d3a1 in verify_one /home/phil/src/git/cache-tree.c:863 #1 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #2 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #3 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #4 0x557cbe830a2b in cache_tree_verify /home/phil/src/git/cache-tree.c:910 #5 0x557cbea53741 in write_locked_index /home/phil/src/git/read-cache.c:3250 #6 0x557cbeab7fdd in reset_head /home/phil/src/git/reset.c:87 #7 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 #8 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 #9 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 #10 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 #11 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 #12 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 #13 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) git-for-windows#14 0x557cbe5bcb8d in _start (/home/phil/src/git/git+0x1b9b8d) 0x606000001b20 is located 0 bytes inside of 56-byte region [0x606000001b20,0x606000001b58) freed by thread T0 here: #0 0x7fdd4bacff19 in __interceptor_free /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:127 #1 0x557cbe82af60 in cache_tree_free /home/phil/src/git/cache-tree.c:35 #2 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #3 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #4 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #5 0x557cbeb2557a in ensure_full_index /home/phil/src/git/sparse-index.c:310 #6 0x557cbea45c4a in index_name_stage_pos /home/phil/src/git/read-cache.c:588 #7 0x557cbe82ce37 in verify_one /home/phil/src/git/cache-tree.c:850 #8 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #9 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #10 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #11 0x557cbe830a2b in cache_tree_verify /home/phil/src/git/cache-tree.c:910 #12 0x557cbea53741 in write_locked_index /home/phil/src/git/read-cache.c:3250 #13 0x557cbeab7fdd in reset_head /home/phil/src/git/reset.c:87 git-for-windows#14 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 git-for-windows#15 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 git-for-windows#16 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 git-for-windows#17 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 git-for-windows#18 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 git-for-windows#19 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git-for-windows#20 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) previously allocated by thread T0 here: #0 0x7fdd4bad0459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x557cbebc1807 in xcalloc /home/phil/src/git/wrapper.c:140 #2 0x557cbe82b7d8 in cache_tree /home/phil/src/git/cache-tree.c:17 #3 0x557cbe82b7d8 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:763 #4 0x557cbe82b837 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:764 #5 0x557cbe82b837 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:764 #6 0x557cbe8304e1 in prime_cache_tree /home/phil/src/git/cache-tree.c:779 #7 0x557cbeab7fa7 in reset_head /home/phil/src/git/reset.c:85 #8 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 #9 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 #10 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 #11 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 #12 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 #13 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git-for-windows#14 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

When "cat-file --batch-all-objects" iterates over each object, it knows where to find each one. But when we look up details of the object, we don't use that information at all. This patch teaches it to use the pack/offset pair when we're iterating over objects in a pack. This yields a measurable speed improvement (timings on a fully packed clone of linux.git): Benchmark #1: ./git.old cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)" Time (mean ± σ): 8.128 s ± 0.118 s [User: 7.968 s, System: 0.156 s] Range (min … max): 8.007 s … 8.301 s 10 runs Benchmark #2: ./git.new cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)" Time (mean ± σ): 4.294 s ± 0.064 s [User: 4.167 s, System: 0.125 s] Range (min … max): 4.227 s … 4.457 s 10 runs Summary './git.new cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)"' ran 1.89 ± 0.04 times faster than './git.old cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)" The implementation is pretty simple: we just call packed_object_info() instead of oid_object_info_extended() when we can. Most of the changes are just plumbing the pack/offset pair through the callstack. There is one subtlety: replace lookups are not handled by packed_object_info(). But since those are disabled for --batch-all-objects, and since we'll only have pack info when that option is in effect, we don't have to worry about that. There are a few limitations to this optimization which we could address with further work: - I didn't bother recording when we found an object loose. Technically this could save us doing a fruitless lookup in the pack index. But opening and mmap-ing a loose object is so expensive in the first place that this doesn't matter much. And if your repository is large enough to care about per-object performance, most objects are going to be packed anyway. - This works only in --unordered mode. For the sorted mode, we'd have to record the pack/offset pair as part of our oid-collection. That's more code, plus at least 16 extra bytes of heap per object. It would probably still be a net win in runtime, but we'd need to measure. - For --batch, this still helps us with getting the object metadata, but we still do a from-scratch lookup for the object contents. This probably doesn't matter that much, because the lookup cost will be much smaller relative to the cost of actually unpacking and printing the objects. For small objects, we could probably swap out read_object_file() for using packed_object_info() with a "object_info.contentp" to get the contents. But we'd still need to deal with streaming for larger objects. A better path forward here is to teach the initial oid_object_info_extended() / packed_object_info() calls to retrieve the contents of smaller objects while they are already being accessed. That would save the extra lookup entirely. But it's a non-trivial feature to add to the object_info code, so I left it for now. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In a sparse index it is possible for the tree that is being verified to be freed while it is being verified. This happens when the index is sparse but the cache tree is not and index_name_pos() looks up a path from the cache tree that is a descendant of a sparse index entry. That triggers a call to ensure_full_index() which frees the cache tree that is being verified. Carrying on trying to verify the tree after this results in a use-after-free bug. Instead restart the verification if a sparse index is converted to a full index. This bug is triggered by a call to reset_head() in "git rebase --apply". Thanks to René Scharfe and Derrick Stolee for their help analyzing the problem. ==74345==ERROR: AddressSanitizer: heap-use-after-free on address 0x606000001b20 at pc 0x557cbe82d3a2 bp 0x7ffdfee08090 sp 0x7ffdfee08080 READ of size 4 at 0x606000001b20 thread T0 #0 0x557cbe82d3a1 in verify_one /home/phil/src/git/cache-tree.c:863 #1 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #2 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #3 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #4 0x557cbe830a2b in cache_tree_verify /home/phil/src/git/cache-tree.c:910 #5 0x557cbea53741 in write_locked_index /home/phil/src/git/read-cache.c:3250 #6 0x557cbeab7fdd in reset_head /home/phil/src/git/reset.c:87 #7 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 #8 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 #9 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 #10 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 #11 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 #12 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 #13 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) git-for-windows#14 0x557cbe5bcb8d in _start (/home/phil/src/git/git+0x1b9b8d) 0x606000001b20 is located 0 bytes inside of 56-byte region [0x606000001b20,0x606000001b58) freed by thread T0 here: #0 0x7fdd4bacff19 in __interceptor_free /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:127 #1 0x557cbe82af60 in cache_tree_free /home/phil/src/git/cache-tree.c:35 #2 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #3 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #4 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #5 0x557cbeb2557a in ensure_full_index /home/phil/src/git/sparse-index.c:310 #6 0x557cbea45c4a in index_name_stage_pos /home/phil/src/git/read-cache.c:588 #7 0x557cbe82ce37 in verify_one /home/phil/src/git/cache-tree.c:850 #8 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #9 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #10 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #11 0x557cbe830a2b in cache_tree_verify /home/phil/src/git/cache-tree.c:910 #12 0x557cbea53741 in write_locked_index /home/phil/src/git/read-cache.c:3250 #13 0x557cbeab7fdd in reset_head /home/phil/src/git/reset.c:87 git-for-windows#14 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 git-for-windows#15 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 git-for-windows#16 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 git-for-windows#17 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 git-for-windows#18 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 git-for-windows#19 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git-for-windows#20 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) previously allocated by thread T0 here: #0 0x7fdd4bad0459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x557cbebc1807 in xcalloc /home/phil/src/git/wrapper.c:140 #2 0x557cbe82b7d8 in cache_tree /home/phil/src/git/cache-tree.c:17 #3 0x557cbe82b7d8 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:763 #4 0x557cbe82b837 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:764 #5 0x557cbe82b837 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:764 #6 0x557cbe8304e1 in prime_cache_tree /home/phil/src/git/cache-tree.c:779 #7 0x557cbeab7fa7 in reset_head /home/phil/src/git/reset.c:85 #8 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 #9 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 #10 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 #11 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 #12 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 #13 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git-for-windows#14 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Extend the trick we use to speed up the "clean" target to also extend to the "lint-docs" target. See 54df875 (Documentation/Makefile: conditionally include doc.dep, 2020-12-08) for the "clean" implementation. The "doc-lint" target only depends on *.txt files, so we don't need to generate GIT-VERSION-FILE etc. if that's all we're doing. This makes the "make lint-docs" target more than 2x as fast: $ git show HEAD~:Documentation/Makefile >Makefile.old $ hyperfine -L f ",.old" 'make -f Makefile{f} lint-docs' Benchmark #1: make -f Makefile lint-docs Time (mean ± σ): 100.2 ms ± 1.3 ms [User: 93.7 ms, System: 6.7 ms] Range (min … max): 98.4 ms … 103.1 ms 29 runs Benchmark #2: make -f Makefile.old lint-docs Time (mean ± σ): 220.0 ms ± 20.0 ms [User: 206.0 ms, System: 18.0 ms] Range (min … max): 206.6 ms … 267.5 ms 11 runs Summary 'make -f Makefile lint-docs' ran 2.19 ± 0.20 times faster than 'make -f Makefile.old lint-docs' Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Speed up the "lint-docs" target by making it non-.PHONY. Similar to my c234e8a (Makefile: make the "sparse" target non-.PHONY, 2021-09-23). We'll now create empty files corresponding to a dependency graph for each of these lint scripts. This speeds things up a bit[1], and makes the output correspond to any in-tree changes we have: $ touch git-add.txt; make lint-docs; make lint-docs GEN cmd-list.made GEN doc.dep LINT GITLINK git-add.txt LINT MAN END git-add.txt LINT MAN SEC git-add.txt make: Nothing to be done for 'lint-docs'. As with the "sparse" target changes this has a hard dependency on the use of ".DELETE_ON_ERROR" in the Makefile, added here in db10fc6 (doc: simplify Makefile using .DELETE_ON_ERROR, 2021-05-21). This method also depends on the output for us emitting any errors on STDERR (fixed in a preceding commit), as well us these scripts exiting with non-zero on any errors (which they were already doing). 1. $ git show HEAD~:Documentation/Makefile >Makefile.old $ hyperfine --warmup 2 -L f ",.old" 'make -j1 -f Makefile{f} lint-docs' Benchmark #1: make -j1 -f Makefile lint-docs Time (mean ± σ): 60.8 ms ± 1.4 ms [User: 58.7 ms, System: 2.5 ms] Range (min … max): 58.9 ms … 64.0 ms 48 runs Benchmark #2: make -j1 -f Makefile.old lint-docs Time (mean ± σ): 84.0 ms ± 1.5 ms [User: 78.6 ms, System: 5.7 ms] Range (min … max): 81.8 ms … 87.8 ms 35 runs Summary 'make -j1 -f Makefile lint-docs' ran 1.38 ± 0.04 times faster than 'make -j1 -f Makefile.old lint-docs' Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

* cleaning duplicate incorrect translations part 2 * update translation table * unfuzzy all entries * typos check in git commands and git flags * random line translations * updating all phrases #1 Signed-off-by: Daniel Santos <daniel@brilhante.top>

The function get_categories() is invoked in a loop over all commands. As it runs several processes, this takes an awful lot of time on Windows. To reduce the number of processes, move the process that filters empty lines to the other invoker of the function, where it is needed. The invocation of get_categories() in the loop does not need the empty line filtered away because the result is word-split by the shell, which eliminates the empty line automatically. Furthermore, use sort -u instead of sort | uniq to remove yet another process. [Ævar: on Linux this seems to speed things up a bit, although with hyperfine(1) the results are fuzzy enough to land within the confidence interval]: $ git show HEAD~:generate-cmdlist.sh >generate-cmdlist.sh.old $ hyperfine --warmup 1 -L s ,.old -p 'make clean' 'sh generate-cmdlist.sh{s} command-list.txt' Benchmark #1: sh generate-cmdlist.sh command-list.txt Time (mean ± σ): 371.3 ms ± 64.2 ms [User: 430.4 ms, System: 72.5 ms] Range (min … max): 320.5 ms … 517.7 ms 10 runs Benchmark #2: sh generate-cmdlist.sh.old command-list.txt Time (mean ± σ): 489.9 ms ± 185.4 ms [User: 724.7 ms, System: 141.3 ms] Range (min … max): 346.0 ms … 885.3 ms 10 runs Summary 'sh generate-cmdlist.sh command-list.txt' ran 1.32 ± 0.55 times faster than 'sh generate-cmdlist.sh.old command-list.txt' Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Change the "define_categories()" and "define_category_names()" functions to take the already-parsed output of "category_list()" as an argument, which brings our number of passes over "command-list.txt" from three to two. Then have "category_list()" itself take the output of "command_list()" as an argument, bringing the number of times we parse the file to one. Compared to the pre-image this speeds us up quite a bit: $ git show HEAD~:generate-cmdlist.sh >generate-cmdlist.sh.old $ hyperfine --warmup 10 -L v ,.old 'sh generate-cmdlist.sh{v} command-list.txt' Benchmark #1: sh generate-cmdlist.sh command-list.txt Time (mean ± σ): 22.9 ms ± 0.3 ms [User: 15.8 ms, System: 9.6 ms] Range (min … max): 22.5 ms … 24.0 ms 125 runs Benchmark #2: sh generate-cmdlist.sh.old command-list.txt Time (mean ± σ): 30.1 ms ± 0.4 ms [User: 24.4 ms, System: 17.5 ms] Range (min … max): 29.5 ms … 32.3 ms 96 runs Summary 'sh generate-cmdlist.sh command-list.txt' ran 1.32 ± 0.02 times faster than 'sh generate-cmdlist.sh.old command-list.txt' Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

* 'pt-PT' of github.com:git-l10n-pt-PT/git-po: l10n: pt_PT: cleaning duplicate translations (#2)

When fetching packfiles, we write a bunch of lockfiles for the packfiles we're writing into the repository. In order to not leave behind any cruft in case we exit or receive a signal, we register both an exit handler as well as signal handlers for common signals like SIGINT. These handlers will then unlink the locks and free the data structure tracking them. We have observed a deadlock in this logic though: (gdb) bt #0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 #1 0x00007f4932bea2cd in _int_free (av=0x7f4932f2eb20 <main_arena>, p=0x3e3e4200, have_lock=0) at malloc.c:3969 #2 0x00007f4932bee58c in __GI___libc_free (mem=<optimized out>) at malloc.c:2975 #3 0x0000000000662ab1 in string_list_clear () #4 0x000000000044f5bc in unlock_pack_on_signal () #5 <signal handler called> #6 _int_free (av=0x7f4932f2eb20 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4024 #7 0x00007f4932bee58c in __GI___libc_free (mem=<optimized out>) at malloc.c:2975 #8 0x000000000065afd5 in strbuf_release () #9 0x000000000066ddb9 in delete_tempfile () #10 0x0000000000610d0b in files_transaction_cleanup.isra () #11 0x0000000000611718 in files_transaction_abort () #12 0x000000000060d2ef in ref_transaction_abort () #13 0x000000000060d441 in ref_transaction_prepare () git-for-windows#14 0x000000000060e0b5 in ref_transaction_commit () git-for-windows#15 0x00000000004511c2 in fetch_and_consume_refs () git-for-windows#16 0x000000000045279a in cmd_fetch () git-for-windows#17 0x0000000000407c48 in handle_builtin () git-for-windows#18 0x0000000000408df2 in cmd_main () git-for-windows#19 0x00000000004078b5 in main () The process was killed with a signal, which caused the signal handler to kick in and try free the data structures after we have unlinked the locks. It then deadlocks while calling free(3P). The root cause of this is that it is not allowed to call certain functions in async-signal handlers, as specified by signal-safety(7). Next to most I/O functions, this list of disallowed functions also includes memory-handling functions like malloc(3P) and free(3P) because they may not be reentrant. As a result, if we execute such functions in the signal handler, then they may operate on inconistent state and fail in unexpected ways. Fix this bug by not calling non-async-signal-safe functions when running in the signal handler. We're about to re-raise the signal anyway and will thus exit, so it's not much of a problem to keep the string list of lockfiles untouched. Note that it's fine though to call unlink(2), so we'll still clean up the lockfiles correctly. Signed-off-by: Patrick Steinhardt <ps@pks.im> Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In the tmp-objdir api, tmp_objdir_create will create a temporary directory but also register signal handlers responsible for removing the directory's contents and the directory itself. However, the function responsible for recursively removing the contents and directory, remove_dir_recurse() calls opendir(3) and closedir(3). This can be problematic because these functions allocate and free memory, which are not async-signal-safe functions. This can lead to deadlocks. One place we call tmp_objdir_create() is in git-receive-pack, where we create a temporary quarantine directory "incoming". Incoming objects will be written to this directory before they get moved to the object directory. We have observed this code leading to a deadlock: Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)): #0 __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80 <main_arena>) at ./lowlevellock.c:35 #1 0x00007f621baa635b in __GI___libc_malloc (bytes=bytes@entry=32816) at malloc.c:3064 #2 0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60, flags=0, close_fd=true, fd=5) at ../sysdeps/posix/opendir.c:118 #3 opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69 #4 __opendir (name=<optimized out>) at ../sysdeps/posix/opendir.c:92 #5 0x0000557c19c77de1 in remove_dir_recurse () git#6 0x0000557c19d81a4f in remove_tmp_objdir_on_signal () #7 <signal handler called> git#8 _int_malloc (av=av@entry=0x7f621bbf8b80 <main_arena>, bytes=bytes@entry=7160) at malloc.c:4116 git#9 0x00007f621baa62c9 in __GI___libc_malloc (bytes=7160) at malloc.c:3066 git#10 0x00007f621bd1e987 in inflateInit2_ () from /opt/gitlab/embedded/lib/libz.so.1 git#11 0x0000557c19dbe5f4 in git_inflate_init () git#12 0x0000557c19cee02a in unpack_compressed_entry () git#13 0x0000557c19cf08cb in unpack_entry () git#14 0x0000557c19cf0f32 in packed_object_info () git#15 0x0000557c19cd68cd in do_oid_object_info_extended () git#16 0x0000557c19cd6e2b in read_object_file_extended () git#17 0x0000557c19cdec2f in parse_object () git#18 0x0000557c19c34977 in lookup_commit_reference_gently () git#19 0x0000557c19d69309 in mark_uninteresting () git#20 0x0000557c19d2d180 in do_for_each_repo_ref_iterator () git#21 0x0000557c19d21678 in for_each_ref () git#22 0x0000557c19d6a94f in assign_shallow_commits_to_refs () git#23 0x0000557c19bc02b2 in cmd_receive_pack () git#24 0x0000557c19b29fdd in handle_builtin () git#25 0x0000557c19b2a526 in cmd_main () git#26 0x0000557c19b28ea2 in main () Since we can't do the cleanup in a portable and signal-safe way, skip the cleanup when we're handling a signal. This means that when signal handling, the temporary directory may not get cleaned up properly. This is mitigated by b3cecf4 (tmp-objdir: new API for creating temporary writable databases, 2021-12-06) which changed the default name and allows gc to clean up these temporary directories. In the event of a normal exit, we should still be cleaning up via the atexit() handler. Helped-by: Jeff King <peff@peff.net> Signed-off-by: John Cai <johncai86@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When "read_strategy_opts()" is called we may have populated the "opts->strategy" before, so we'll need to free() it to avoid leaking memory. We populate it before because we cal get_replay_opts() from within "rebase.c" with an already populated "opts", which we then copy. Then if we're doing a "rebase -i" the sequencer API itself will promptly clobber our alloc'd version of it with its own. If this code is changed to do, instead of the added free() here a: if (opts->strategy) opts->strategy = xstrdup("another leak"); We get a couple of stacktraces from -fsanitize=leak showing how we ended up clobbering the already allocated value, i.e.: Direct leak of 6 byte(s) in 1 object(s) allocated from: #0 0x7f2e8cd45545 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75 #1 0x7f2e8cb0fcaa in __GI___strdup string/strdup.c:42 #2 0x6c4778 in xstrdup wrapper.c:39 #3 0x66bcb8 in read_strategy_opts sequencer.c:2902 #4 0x66bf7b in read_populate_opts sequencer.c:2969 #5 0x6723f9 in sequencer_continue sequencer.c:5063 #6 0x4a4f74 in run_sequencer_rebase builtin/rebase.c:348 #7 0x4a64c8 in run_specific_rebase builtin/rebase.c:753 #8 0x4a9b8b in cmd_rebase builtin/rebase.c:1824 #9 0x407a32 in run_builtin git.c:466 #10 0x407e0a in handle_builtin git.c:721 #11 0x40803d in run_argv git.c:788 #12 0x40850f in cmd_main git.c:923 #13 0x4eee79 in main common-main.c:57 git-for-windows#14 0x7f2e8ca9f209 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 git-for-windows#15 0x7f2e8ca9f2bb in __libc_start_main_impl ../csu/libc-start.c:389 git-for-windows#16 0x405fd0 in _start (git+0x405fd0) Direct leak of 4 byte(s) in 1 object(s) allocated from: #0 0x7f2e8cd45545 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75 #1 0x7f2e8cb0fcaa in __GI___strdup string/strdup.c:42 #2 0x6c4778 in xstrdup wrapper.c:39 #3 0x4a3c31 in xstrdup_or_null git-compat-util.h:1169 #4 0x4a447a in get_replay_opts builtin/rebase.c:163 #5 0x4a4f5b in run_sequencer_rebase builtin/rebase.c:346 #6 0x4a64c8 in run_specific_rebase builtin/rebase.c:753 #7 0x4a9b8b in cmd_rebase builtin/rebase.c:1824 #8 0x407a32 in run_builtin git.c:466 #9 0x407e0a in handle_builtin git.c:721 #10 0x40803d in run_argv git.c:788 #11 0x40850f in cmd_main git.c:923 #12 0x4eee79 in main common-main.c:57 #13 0x7f2e8ca9f209 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 git-for-windows#14 0x7f2e8ca9f2bb in __libc_start_main_impl ../csu/libc-start.c:389 git-for-windows#15 0x405fd0 in _start (git+0x405fd0) This can be seen in e.g. the 4th test of "t3404-rebase-interactive.sh". In the larger picture the ownership of the "struct replay_opts" is quite a mess, e.g. in this case rebase.c's static "get_replay_opts()" function partially creates it, but nothing in rebase.c will free() it. The structure is "mostly owned" by the sequencer API, but it also expects to get these partially populated versions of it. It would be better to have rebase keep track of what it allocated, and free() that, and to pass that as a "const" to the sequencer API, which would copy what it needs to its own version, and to free() that. But doing so is a much larger change, and however messy the ownership boundary is here is consistent with what we're doing already, so let's just free() this to fix the leak. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com>

There is an out-of-bounds read possible when parsing gitattributes that have an attribute that is 2^31+1 bytes long. This is caused due to an integer overflow when we assign the result of strlen(3P) to an `int`, where we use the wrapped-around value in a subsequent call to memcpy(3P). The following code reproduces the issue: blob=$(perl -e 'print "a" x 2147483649 . " attr"' | git hash-object -w --stdin) git update-index --add --cacheinfo 100644,$blob,.gitattributes git check-attr --all file AddressSanitizer:DEADLYSIGNAL ================================================================= ==8451==ERROR: AddressSanitizer: SEGV on unknown address 0x7f93efa00800 (pc 0x7f94f1f8f082 bp 0x7ffddb59b3a0 sp 0x7ffddb59ab28 T0) ==8451==The signal is caused by a READ memory access. #0 0x7f94f1f8f082 (/usr/lib/libc.so.6+0x176082) #1 0x7f94f2047d9c in __interceptor_strspn /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:752 #2 0x560e190f7f26 in parse_attr_line attr.c:375 #3 0x560e190f9663 in handle_attr_line attr.c:660 #4 0x560e190f9ddd in read_attr_from_index attr.c:769 #5 0x560e190f9f14 in read_attr attr.c:797 #6 0x560e190fa24e in bootstrap_attr_stack attr.c:867 #7 0x560e190fa4a5 in prepare_attr_stack attr.c:902 #8 0x560e190fb5dc in collect_some_attrs attr.c:1097 #9 0x560e190fb93f in git_all_attrs attr.c:1128 #10 0x560e18e6136e in check_attr builtin/check-attr.c:67 #11 0x560e18e61c12 in cmd_check_attr builtin/check-attr.c:183 #12 0x560e18e15993 in run_builtin git.c:466 #13 0x560e18e16397 in handle_builtin git.c:721 git-for-windows#14 0x560e18e16b2b in run_argv git.c:788 git-for-windows#15 0x560e18e17991 in cmd_main git.c:926 git-for-windows#16 0x560e190ae2bd in main common-main.c:57 git-for-windows#17 0x7f94f1e3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#18 0x7f94f1e3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#19 0x560e18e110e4 in _start ../sysdeps/x86_64/start.S:115 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (/usr/lib/libc.so.6+0x176082) ==8451==ABORTING Fix this bug by converting the variable to a `size_t` instead. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

It is possible to trigger an integer overflow when parsing attribute names when there are more than 2^31 of them for a single pattern. This can either lead to us dying due to trying to request too many bytes: blob=$(perl -e 'print "f" . " a=" x 2147483649' | git hash-object -w --stdin) git update-index --add --cacheinfo 100644,$blob,.gitattributes git attr-check --all file ================================================================= ==1022==ERROR: AddressSanitizer: requested allocation size 0xfffffff800000032 (0xfffffff800001038 after adjustments for alignment, red zones etc.) exceeds maximum supported size of 0x10000000000 (thread T0) #0 0x7fd3efabf411 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77 #1 0x5563a0a1e3d3 in xcalloc wrapper.c:150 #2 0x5563a058d005 in parse_attr_line attr.c:384 #3 0x5563a058e661 in handle_attr_line attr.c:660 #4 0x5563a058eddb in read_attr_from_index attr.c:769 #5 0x5563a058ef12 in read_attr attr.c:797 #6 0x5563a058f24c in bootstrap_attr_stack attr.c:867 #7 0x5563a058f4a3 in prepare_attr_stack attr.c:902 #8 0x5563a05905da in collect_some_attrs attr.c:1097 #9 0x5563a059093d in git_all_attrs attr.c:1128 #10 0x5563a02f636e in check_attr builtin/check-attr.c:67 #11 0x5563a02f6c12 in cmd_check_attr builtin/check-attr.c:183 #12 0x5563a02aa993 in run_builtin git.c:466 #13 0x5563a02ab397 in handle_builtin git.c:721 git-for-windows#14 0x5563a02abb2b in run_argv git.c:788 git-for-windows#15 0x5563a02ac991 in cmd_main git.c:926 git-for-windows#16 0x5563a05432bd in main common-main.c:57 git-for-windows#17 0x7fd3ef82228f (/usr/lib/libc.so.6+0x2328f) ==1022==HINT: if you don't care about these errors you may set allocator_may_return_null=1 SUMMARY: AddressSanitizer: allocation-size-too-big /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77 in __interceptor_calloc ==1022==ABORTING Or, much worse, it can lead to an out-of-bounds write because we underallocate and then memcpy(3P) into an array: perl -e ' print "A " . "\rh="x2000000000; print "\rh="x2000000000; print "\rh="x294967294 . "\n" ' >.gitattributes git add .gitattributes git commit -am "evil attributes" $ git clone --quiet /path/to/repo ================================================================= ==15062==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000002550 at pc 0x5555559884d5 bp 0x7fffffffbc60 sp 0x7fffffffbc58 WRITE of size 8 at 0x602000002550 thread T0 #0 0x5555559884d4 in parse_attr_line attr.c:393 #1 0x5555559884d4 in handle_attr_line attr.c:660 #2 0x555555988902 in read_attr_from_index attr.c:784 #3 0x555555988902 in read_attr_from_index attr.c:747 #4 0x555555988a1d in read_attr attr.c:800 #5 0x555555989b0c in bootstrap_attr_stack attr.c:882 #6 0x555555989b0c in prepare_attr_stack attr.c:917 #7 0x555555989b0c in collect_some_attrs attr.c:1112 #8 0x55555598b141 in git_check_attr attr.c:1126 #9 0x555555a13004 in convert_attrs convert.c:1311 #10 0x555555a95e04 in checkout_entry_ca entry.c:553 #11 0x555555d58bf6 in checkout_entry entry.h:42 #12 0x555555d58bf6 in check_updates unpack-trees.c:480 #13 0x555555d5eb55 in unpack_trees unpack-trees.c:2040 git-for-windows#14 0x555555785ab7 in checkout builtin/clone.c:724 git-for-windows#15 0x555555785ab7 in cmd_clone builtin/clone.c:1384 git-for-windows#16 0x55555572443c in run_builtin git.c:466 git-for-windows#17 0x55555572443c in handle_builtin git.c:721 git-for-windows#18 0x555555727872 in run_argv git.c:788 git-for-windows#19 0x555555727872 in cmd_main git.c:926 git-for-windows#20 0x555555721fa0 in main common-main.c:57 git-for-windows#21 0x7ffff73f1d09 in __libc_start_main ../csu/libc-start.c:308 git-for-windows#22 0x555555723f39 in _start (git+0x1cff39) 0x602000002552 is located 0 bytes to the right of 2-byte region [0x602000002550,0x602000002552) allocated by thread T0 here: #0 0x7ffff768c037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x555555d7fff7 in xcalloc wrapper.c:150 #2 0x55555598815f in parse_attr_line attr.c:384 #3 0x55555598815f in handle_attr_line attr.c:660 #4 0x555555988902 in read_attr_from_index attr.c:784 #5 0x555555988902 in read_attr_from_index attr.c:747 #6 0x555555988a1d in read_attr attr.c:800 #7 0x555555989b0c in bootstrap_attr_stack attr.c:882 #8 0x555555989b0c in prepare_attr_stack attr.c:917 #9 0x555555989b0c in collect_some_attrs attr.c:1112 #10 0x55555598b141 in git_check_attr attr.c:1126 #11 0x555555a13004 in convert_attrs convert.c:1311 #12 0x555555a95e04 in checkout_entry_ca entry.c:553 #13 0x555555d58bf6 in checkout_entry entry.h:42 git-for-windows#14 0x555555d58bf6 in check_updates unpack-trees.c:480 git-for-windows#15 0x555555d5eb55 in unpack_trees unpack-trees.c:2040 git-for-windows#16 0x555555785ab7 in checkout builtin/clone.c:724 git-for-windows#17 0x555555785ab7 in cmd_clone builtin/clone.c:1384 git-for-windows#18 0x55555572443c in run_builtin git.c:466 git-for-windows#19 0x55555572443c in handle_builtin git.c:721 git-for-windows#20 0x555555727872 in run_argv git.c:788 git-for-windows#21 0x555555727872 in cmd_main git.c:926 git-for-windows#22 0x555555721fa0 in main common-main.c:57 git-for-windows#23 0x7ffff73f1d09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: heap-buffer-overflow attr.c:393 in parse_attr_line Shadow bytes around the buggy address: 0x0c047fff8450: fa fa 00 02 fa fa 00 07 fa fa fd fd fa fa 00 00 0x0c047fff8460: fa fa 02 fa fa fa fd fd fa fa 00 06 fa fa 05 fa 0x0c047fff8470: fa fa fd fd fa fa 00 02 fa fa 06 fa fa fa 05 fa 0x0c047fff8480: fa fa 07 fa fa fa fd fd fa fa 00 01 fa fa 00 02 0x0c047fff8490: fa fa 00 03 fa fa 00 fa fa fa 00 01 fa fa 00 03 =>0x0c047fff84a0: fa fa 00 01 fa fa 00 02 fa fa[02]fa fa fa fa fa 0x0c047fff84b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c047fff84c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c047fff84d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c047fff84e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c047fff84f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb Shadow gap: cc ==15062==ABORTING Fix this bug by using `size_t` instead to count the number of attributes so that this value cannot reasonably overflow without running out of memory before already. Reported-by: Markus Vervier <markus.vervier@x41-dsec.de> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When using a padding specifier in the pretty format passed to git-log(1) we need to calculate the string length in several places. These string lengths are stored in `int`s though, which means that these can easily overflow when the input lengths exceeds 2GB. This can ultimately lead to an out-of-bounds write when these are used in a call to memcpy(3P): ==8340==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f1ec62f97fe at pc 0x7f2127e5f427 bp 0x7ffd3bd63de0 sp 0x7ffd3bd63588 WRITE of size 1 at 0x7f1ec62f97fe thread T0 #0 0x7f2127e5f426 in __interceptor_memcpy /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 #1 0x5628e96aa605 in format_and_pad_commit pretty.c:1762 #2 0x5628e96aa7f4 in format_commit_item pretty.c:1801 #3 0x5628e97cdb24 in strbuf_expand strbuf.c:429 #4 0x5628e96ab060 in repo_format_commit_message pretty.c:1869 #5 0x5628e96acd0f in pretty_print_commit pretty.c:2161 #6 0x5628e95a44c8 in show_log log-tree.c:781 #7 0x5628e95a76ba in log_tree_commit log-tree.c:1117 #8 0x5628e922bed5 in cmd_log_walk_no_free builtin/log.c:508 #9 0x5628e922c35b in cmd_log_walk builtin/log.c:549 #10 0x5628e922f1a2 in cmd_log builtin/log.c:883 #11 0x5628e9106993 in run_builtin git.c:466 #12 0x5628e9107397 in handle_builtin git.c:721 #13 0x5628e9107b07 in run_argv git.c:788 git-for-windows#14 0x5628e91088a7 in cmd_main git.c:923 git-for-windows#15 0x5628e939d682 in main common-main.c:57 git-for-windows#16 0x7f2127c3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#17 0x7f2127c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#18 0x5628e91020e4 in _start ../sysdeps/x86_64/start.S:115 0x7f1ec62f97fe is located 2 bytes to the left of 4831838265-byte region [0x7f1ec62f9800,0x7f1fe62f9839) allocated by thread T0 here: #0 0x7f2127ebe7ea in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:85 #1 0x5628e98774d4 in xrealloc wrapper.c:136 #2 0x5628e97cb01c in strbuf_grow strbuf.c:99 #3 0x5628e97ccd42 in strbuf_addchars strbuf.c:327 #4 0x5628e96aa55c in format_and_pad_commit pretty.c:1761 #5 0x5628e96aa7f4 in format_commit_item pretty.c:1801 #6 0x5628e97cdb24 in strbuf_expand strbuf.c:429 #7 0x5628e96ab060 in repo_format_commit_message pretty.c:1869 #8 0x5628e96acd0f in pretty_print_commit pretty.c:2161 #9 0x5628e95a44c8 in show_log log-tree.c:781 #10 0x5628e95a76ba in log_tree_commit log-tree.c:1117 #11 0x5628e922bed5 in cmd_log_walk_no_free builtin/log.c:508 #12 0x5628e922c35b in cmd_log_walk builtin/log.c:549 #13 0x5628e922f1a2 in cmd_log builtin/log.c:883 git-for-windows#14 0x5628e9106993 in run_builtin git.c:466 git-for-windows#15 0x5628e9107397 in handle_builtin git.c:721 git-for-windows#16 0x5628e9107b07 in run_argv git.c:788 git-for-windows#17 0x5628e91088a7 in cmd_main git.c:923 git-for-windows#18 0x5628e939d682 in main common-main.c:57 git-for-windows#19 0x7f2127c3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#20 0x7f2127c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#21 0x5628e91020e4 in _start ../sysdeps/x86_64/start.S:115 SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 in __interceptor_memcpy Shadow bytes around the buggy address: 0x0fe458c572a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0fe458c572b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0fe458c572c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0fe458c572d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0fe458c572e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa =>0x0fe458c572f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa] 0x0fe458c57300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0fe458c57310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0fe458c57320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0fe458c57330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x0fe458c57340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==8340==ABORTING The pretty format can also be used in `git archive` operations via the `export-subst` attribute. So this is what in our opinion makes this a critical issue in the context of Git forges which allow to download an archive of user supplied Git repositories. Fix this vulnerability by using `size_t` instead of `int` to track the string lengths. Add tests which detect this vulnerability when Git is compiled with the address sanitizer. Reported-by: Joern Schneeweisz <jschneeweisz@gitlab.com> Original-patch-by: Joern Schneeweisz <jschneeweisz@gitlab.com> Modified-by: Taylor Blau <me@ttalorr.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

With the `%>>(<N>)` pretty formatter, you can ask git-log(1) et al to steal spaces. To do so we need to look ahead of the next token to see whether there are spaces there. This loop takes into account ANSI sequences that end with an `m`, and if it finds any it will skip them until it finds the first space. While doing so it does not take into account the buffer's limits though and easily does an out-of-bounds read. Add a test that hits this behaviour. While we don't have an easy way to verify this, the test causes the following failure when run with `SANITIZE=address`: ==37941==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000000baf at pc 0x55ba6f88e0d0 bp 0x7ffc84c50d20 sp 0x7ffc84c50d10 READ of size 1 at 0x603000000baf thread T0 #0 0x55ba6f88e0cf in format_and_pad_commit pretty.c:1712 #1 0x55ba6f88e7b4 in format_commit_item pretty.c:1801 #2 0x55ba6f9b1ae4 in strbuf_expand strbuf.c:429 #3 0x55ba6f88f020 in repo_format_commit_message pretty.c:1869 #4 0x55ba6f890ccf in pretty_print_commit pretty.c:2161 #5 0x55ba6f7884c8 in show_log log-tree.c:781 #6 0x55ba6f78b6ba in log_tree_commit log-tree.c:1117 #7 0x55ba6f40fed5 in cmd_log_walk_no_free builtin/log.c:508 #8 0x55ba6f41035b in cmd_log_walk builtin/log.c:549 #9 0x55ba6f4131a2 in cmd_log builtin/log.c:883 #10 0x55ba6f2ea993 in run_builtin git.c:466 #11 0x55ba6f2eb397 in handle_builtin git.c:721 #12 0x55ba6f2ebb07 in run_argv git.c:788 #13 0x55ba6f2ec8a7 in cmd_main git.c:923 git-for-windows#14 0x55ba6f581682 in main common-main.c:57 git-for-windows#15 0x7f2d08c3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#16 0x7f2d08c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#17 0x55ba6f2e60e4 in _start ../sysdeps/x86_64/start.S:115 0x603000000baf is located 1 bytes to the left of 24-byte region [0x603000000bb0,0x603000000bc8) allocated by thread T0 here: #0 0x7f2d08ebe7ea in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:85 #1 0x55ba6fa5b494 in xrealloc wrapper.c:136 #2 0x55ba6f9aefdc in strbuf_grow strbuf.c:99 #3 0x55ba6f9b0a06 in strbuf_add strbuf.c:298 #4 0x55ba6f9b1a25 in strbuf_expand strbuf.c:418 #5 0x55ba6f88f020 in repo_format_commit_message pretty.c:1869 #6 0x55ba6f890ccf in pretty_print_commit pretty.c:2161 #7 0x55ba6f7884c8 in show_log log-tree.c:781 #8 0x55ba6f78b6ba in log_tree_commit log-tree.c:1117 #9 0x55ba6f40fed5 in cmd_log_walk_no_free builtin/log.c:508 #10 0x55ba6f41035b in cmd_log_walk builtin/log.c:549 #11 0x55ba6f4131a2 in cmd_log builtin/log.c:883 #12 0x55ba6f2ea993 in run_builtin git.c:466 #13 0x55ba6f2eb397 in handle_builtin git.c:721 git-for-windows#14 0x55ba6f2ebb07 in run_argv git.c:788 git-for-windows#15 0x55ba6f2ec8a7 in cmd_main git.c:923 git-for-windows#16 0x55ba6f581682 in main common-main.c:57 git-for-windows#17 0x7f2d08c3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#18 0x7f2d08c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#19 0x55ba6f2e60e4 in _start ../sysdeps/x86_64/start.S:115 SUMMARY: AddressSanitizer: heap-buffer-overflow pretty.c:1712 in format_and_pad_commit Shadow bytes around the buggy address: 0x0c067fff8120: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd 0x0c067fff8130: fd fd fa fa fd fd fd fd fa fa fd fd fd fa fa fa 0x0c067fff8140: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x0c067fff8150: fa fa fd fd fd fd fa fa 00 00 00 fa fa fa fd fd 0x0c067fff8160: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa =>0x0c067fff8170: fd fd fd fa fa[fa]00 00 00 fa fa fa 00 00 00 fa 0x0c067fff8180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8190: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff81a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff81b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff81c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb Luckily enough, this would only cause us to copy the out-of-bounds data into the formatted commit in case we really had an ANSI sequence preceding our buffer. So this bug likely has no security consequences. Fix it regardless by not traversing past the buffer's start. Reported-by: Patrick Steinhardt <ps@pks.im> Reported-by: Eric Sesterhenn <eric.sesterhenn@x41-dsec.de> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

An out-of-bounds read can be triggered when parsing an incomplete padding format string passed via `--pretty=format` or in Git archives when files are marked with the `export-subst` gitattribute. This bug exists since we have introduced support for truncating output via the `trunc` keyword a7f01c6 (pretty: support truncating in %>, %< and %><, 2013-04-19). Before this commit, we used to find the end of the formatting string by using strchr(3P). This function returns a `NULL` pointer in case the character in question wasn't found. The subsequent check whether any character was found thus simply checked the returned pointer. After the commit we switched to strcspn(3P) though, which only returns the offset to the first found character or to the trailing NUL byte. As the end pointer is now computed by adding the offset to the start pointer it won't be `NULL` anymore, and as a consequence the check doesn't do anything anymore. The out-of-bounds data that is being read can in fact end up in the formatted string. As a consequence, it is possible to leak memory contents either by calling git-log(1) or via git-archive(1) when any of the archived files is marked with the `export-subst` gitattribute. ==10888==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000398 at pc 0x7f0356047cb2 bp 0x7fff3ffb95d0 sp 0x7fff3ffb8d78 READ of size 1 at 0x602000000398 thread T0 #0 0x7f0356047cb1 in __interceptor_strchrnul /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:725 #1 0x563b7cec9a43 in strbuf_expand strbuf.c:417 #2 0x563b7cda7060 in repo_format_commit_message pretty.c:1869 #3 0x563b7cda8d0f in pretty_print_commit pretty.c:2161 #4 0x563b7cca04c8 in show_log log-tree.c:781 #5 0x563b7cca36ba in log_tree_commit log-tree.c:1117 #6 0x563b7c927ed5 in cmd_log_walk_no_free builtin/log.c:508 #7 0x563b7c92835b in cmd_log_walk builtin/log.c:549 #8 0x563b7c92b1a2 in cmd_log builtin/log.c:883 #9 0x563b7c802993 in run_builtin git.c:466 #10 0x563b7c803397 in handle_builtin git.c:721 #11 0x563b7c803b07 in run_argv git.c:788 #12 0x563b7c8048a7 in cmd_main git.c:923 #13 0x563b7ca99682 in main common-main.c:57 git-for-windows#14 0x7f0355e3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#15 0x7f0355e3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#16 0x563b7c7fe0e4 in _start ../sysdeps/x86_64/start.S:115 0x602000000398 is located 0 bytes to the right of 8-byte region [0x602000000390,0x602000000398) allocated by thread T0 here: #0 0x7f0356072faa in __interceptor_strdup /usr/src/debug/gcc/libsanitizer/asan/asan_interceptors.cpp:439 #1 0x563b7cf7317c in xstrdup wrapper.c:39 #2 0x563b7cd9a06a in save_user_format pretty.c:40 #3 0x563b7cd9b3e5 in get_commit_format pretty.c:173 #4 0x563b7ce54ea0 in handle_revision_opt revision.c:2456 #5 0x563b7ce597c9 in setup_revisions revision.c:2850 #6 0x563b7c9269e0 in cmd_log_init_finish builtin/log.c:269 #7 0x563b7c927362 in cmd_log_init builtin/log.c:348 #8 0x563b7c92b193 in cmd_log builtin/log.c:882 #9 0x563b7c802993 in run_builtin git.c:466 #10 0x563b7c803397 in handle_builtin git.c:721 #11 0x563b7c803b07 in run_argv git.c:788 #12 0x563b7c8048a7 in cmd_main git.c:923 #13 0x563b7ca99682 in main common-main.c:57 git-for-windows#14 0x7f0355e3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#15 0x7f0355e3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#16 0x563b7c7fe0e4 in _start ../sysdeps/x86_64/start.S:115 SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:725 in __interceptor_strchrnul Shadow bytes around the buggy address: 0x0c047fff8020: fa fa fd fd fa fa 00 06 fa fa 05 fa fa fa fd fd 0x0c047fff8030: fa fa 00 02 fa fa 06 fa fa fa 05 fa fa fa fd fd 0x0c047fff8040: fa fa 00 07 fa fa 03 fa fa fa fd fd fa fa 00 00 0x0c047fff8050: fa fa 00 01 fa fa fd fd fa fa 00 00 fa fa 00 01 0x0c047fff8060: fa fa 00 06 fa fa 00 06 fa fa 05 fa fa fa 05 fa =>0x0c047fff8070: fa fa 00[fa]fa fa fd fa fa fa fd fd fa fa fd fd 0x0c047fff8080: fa fa fd fd fa fa 00 00 fa fa 00 fa fa fa fd fa 0x0c047fff8090: fa fa fd fd fa fa 00 00 fa fa fa fa fa fa fa fa 0x0c047fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c047fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c047fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==10888==ABORTING Fix this bug by checking whether `end` points at the trailing NUL byte. Add a test which catches this out-of-bounds read and which demonstrates that we used to write out-of-bounds data into the formatted message. Reported-by: Markus Vervier <markus.vervier@x41-dsec.de> Original-patch-by: Markus Vervier <markus.vervier@x41-dsec.de> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The return type of both `utf8_strwidth()` and `utf8_strnwidth()` is `int`, but we operate on string lengths which are typically of type `size_t`. This means that when the string is longer than `INT_MAX`, we will overflow and thus return a negative result. This can lead to an out-of-bounds write with `--pretty=format:%<1)%B` and a commit message that is 2^31+1 bytes long: ================================================================= ==26009==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000001168 at pc 0x7f95c4e5f427 bp 0x7ffd8541c900 sp 0x7ffd8541c0a8 WRITE of size 2147483649 at 0x603000001168 thread T0 #0 0x7f95c4e5f426 in __interceptor_memcpy /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 #1 0x5612bbb1068c in format_and_pad_commit pretty.c:1763 #2 0x5612bbb1087a in format_commit_item pretty.c:1801 #3 0x5612bbc33bab in strbuf_expand strbuf.c:429 #4 0x5612bbb110e7 in repo_format_commit_message pretty.c:1869 #5 0x5612bbb12d96 in pretty_print_commit pretty.c:2161 #6 0x5612bba0a4d5 in show_log log-tree.c:781 #7 0x5612bba0d6c7 in log_tree_commit log-tree.c:1117 #8 0x5612bb691ed5 in cmd_log_walk_no_free builtin/log.c:508 #9 0x5612bb69235b in cmd_log_walk builtin/log.c:549 #10 0x5612bb6951a2 in cmd_log builtin/log.c:883 #11 0x5612bb56c993 in run_builtin git.c:466 #12 0x5612bb56d397 in handle_builtin git.c:721 #13 0x5612bb56db07 in run_argv git.c:788 git-for-windows#14 0x5612bb56e8a7 in cmd_main git.c:923 git-for-windows#15 0x5612bb803682 in main common-main.c:57 git-for-windows#16 0x7f95c4c3c28f (/usr/lib/libc.so.6+0x2328f) git-for-windows#17 0x7f95c4c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) git-for-windows#18 0x5612bb5680e4 in _start ../sysdeps/x86_64/start.S:115 0x603000001168 is located 0 bytes to the right of 24-byte region [0x603000001150,0x603000001168) allocated by thread T0 here: #0 0x7f95c4ebe7ea in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:85 #1 0x5612bbcdd556 in xrealloc wrapper.c:136 #2 0x5612bbc310a3 in strbuf_grow strbuf.c:99 #3 0x5612bbc32acd in strbuf_add strbuf.c:298 #4 0x5612bbc33aec in strbuf_expand strbuf.c:418 #5 0x5612bbb110e7 in repo_format_commit_message pretty.c:1869 #6 0x5612bbb12d96 in pretty_print_commit pretty.c:2161 #7 0x5612bba0a4d5 in show_log log-tree.c:781 #8 0x5612bba0d6c7 in log_tree_commit log-tree.c:1117 #9 0x5612bb691ed5 in cmd_log_walk_no_free builtin/log.c:508 #10 0x5612bb69235b in cmd_log_walk builtin/log.c:549 #11 0x5612bb6951a2 in cmd_log builtin/log.c:883 #12 0x5612bb56c993 in run_builtin git.c:466 #13 0x5612bb56d397 in handle_builtin git.c:721 git-for-windows#14 0x5612bb56db07 in run_argv git.c:788 git-for-windows#15 0x5612bb56e8a7 in cmd_main git.c:923 git-for-windows#16 0x5612bb803682 in main common-main.c:57 git-for-windows#17 0x7f95c4c3c28f (/usr/lib/libc.so.6+0x2328f) SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 in __interceptor_memcpy Shadow bytes around the buggy address: 0x0c067fff81d0: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x0c067fff81e0: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd 0x0c067fff81f0: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa 0x0c067fff8200: fd fd fd fa fa fa fd fd fd fd fa fa 00 00 00 fa 0x0c067fff8210: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd =>0x0c067fff8220: fd fa fa fa fd fd fd fa fa fa 00 00 00[fa]fa fa 0x0c067fff8230: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8240: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8250: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8260: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8270: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==26009==ABORTING Now the proper fix for this would be to convert both functions to return an `size_t` instead of an `int`. But given that this commit may be part of a security release, let's instead do the minimal viable fix and die in case we see an overflow. Add a test that would have previously caused us to crash. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The t5309 script triggers a racy false positive with SANITIZE=leak on a multi-core system. Running with "--stress --run=6" usually fails within 10 seconds or so for me, complaining with something like: + git index-pack --fix-thin --stdin fatal: REF_DELTA at offset 46 already resolved (duplicate base 01d7713666f4de822776c7622c10f1b07de280dc?) ================================================================= ==3904583==ERROR: LeakSanitizer: detected memory leaks Direct leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fa790d01986 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98 #1 0x7fa790add769 in __pthread_getattr_np nptl/pthread_getattr_np.c:180 #2 0x7fa790d117c5 in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150 #3 0x7fa790d11957 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:598 #4 0x7fa790d03fe8 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:51 #5 0x7fa790d013fd in __lsan_thread_start_func ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:440 #6 0x7fa790adc3eb in start_thread nptl/pthread_create.c:444 #7 0x7fa790b5ca5b in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s). Aborted What happens is this: 0. We construct a bogus pack with a duplicate object in it and trigger index-pack. 1. We spawn a bunch of worker threads to resolve deltas (on my system it is 16 threads). 2. One of the threads sees the duplicate object and bails by calling exit(), taking down all of the threads. This is expected and is the point of the test. 3. At the time exit() is called, we may still be spawning threads from the main process via pthread_create(). LSan hooks thread creation to update its book-keeping; it has to know where each thread's stack is (so it can find entry points for reachable memory). So it calls pthread_getattr_np() to get information about the new thread. That may allocate memory that must be freed with a matching call to pthread_attr_destroy(). Probably LSan does that immediately, but if you're unlucky enough, the exit() will happen while it's between those two calls, and the allocated pthread_attr_t appears as a leak. This isn't a real leak. It's not even in our code, but rather in the LSan instrumentation code. So we could just ignore it. But the false positive can cause people to waste time tracking it down. It's possibly something that LSan could protect against (e.g., cover the getattr/destroy pair with a mutex, and then in the final post-exit() check for leaks try to take the same mutex). But I don't know enough about LSan to say if that's a reasonable approach or not (or if my analysis is even completely correct). In the meantime, it's pretty easy to avoid the race by making creation of the worker threads "atomic". That is, we'll spawn all of them before letting any of them start to work. That's easy to do because we already have a work_lock() mutex for handing out that work. If the main process takes it, then all of the threads will immediately block until we've finished spawning and released it. This shouldn't make any practical difference for non-LSan runs. The thread spawning is quick, and could happen before any worker thread gets scheduled anyway. Probably other spots that use threads are subject to the same issues. But since we have to manually insert locking (and since this really is kind of a hack), let's not bother with them unless somebody experiences a similar racy false-positive in practice. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

It is tempting to think of "files and directories" of the current directory as valid inputs to the add and set subcommands of git sparse-checkout. However, in non-cone mode, they often aren't and using them as potential completions leads to *many* forms of confusion: Issue #1. It provides the *wrong* files and directories. For git sparse-checkout add we always want to add files and directories not currently in our sparse checkout, which means we want file and directories not currently present in the current working tree. Providing the files and directories currently present is thus always wrong. For git sparse-checkout set we have a similar problem except in the subset of cases where we are trying to narrow our checkout to a strict subset of what we already have. That is not a very common scenario, especially since it often does not even happen to be true for the first use of the command; for years we required users to create a sparse-checkout via git sparse-checkout init git sparse-checkout set <args...> (or use a clone option that did the init step for you at clone time). The init command creates a minimal sparse-checkout with just the top-level directory present, meaning the set command has to be used to expand the checkout. Thus, only in a special and perhaps unusual cases would any of the suggestions from normal file and directory completion be appropriate. Issue #2: Suggesting patterns that lead to warnings is unfriendly. If the user specifies any regular file and omits the leading '/', then the sparse-checkout command will warn the user that their command is problematic and suggest they use a leading slash instead. Issue #3: Completion gets confused by leading '/', and provides wrong paths. Users often want to anchor their patterns to the toplevel of the repository, especially when listing individual files. There are a number of reasons for this, but notably even sparse-checkout encourages them to do so (as noted above). However, if users do so (via adding a leading '/' to their pattern), then bash completion will interpret the leading slash not as a request for a path at the toplevel of the repository, but as a request for a path at the root of the filesytem. That means at best that completion cannot help with such paths, and if it does find any completions, they are almost guaranteed to be wrong. Issue #4: Suggesting invalid patterns from subdirectories is unfriendly. There is no per-directory equivalent to .gitignore with sparse-checkouts. There is only a single worktree-global $GIT_DIR/info/sparse-checkout file. As such, paths to files must be specified relative to the toplevel of a repository. Providing suggestions of paths that are relative to the current working directory, as bash completion defaults to, is wrong when the current working directory is not the worktree toplevel directory. Issue #5: Paths with special characters will be interpreted incorrectly The entries in the sparse-checkout file are patterns, not paths. While most paths also qualify as patterns (though even in such cases it would be better for users to not use them directly but prefix them with a leading '/'), there are a variety of special characters that would need special escaping beyond the normal shell escaping: '*', '?', '\', '[', ']', and any leading '#' or '!'. If completion suggests any such paths, users will likely expect them to be treated as an exact path rather than as a pattern that might match some number of files other than 1. However, despite the first four issues, we can note that _if_ users are using tab completion, then they are probably trying to specify a path in the index. As such, we transform their argument into a top-level-rooted pattern that matches such a file. For example, if they type: git sparse-checkout add Make<TAB> we could "complete" to git sparse-checkout add /Makefile or, if they ran from the Documentation/technical/ subdirectory: git sparse-checkout add m<TAB> we could "complete" it to: git sparse-checkout add /Documentation/technical/multi-pack-index.txt Note in both cases I use "complete" in quotes, because we actually add characters both before and after the argument in question, so we are kind of abusing "bash completions" to be "bash completions AND beginnings". The fifth issue is a bit stickier, especially when you consider that we not only need to deal with escaping issues because of special meanings of patterns in sparse-checkout & gitignore files, but also that we need to consider escaping issues due to ls-files needing to sometimes quote or escape characters, and because the shell needs to escape some characters. The multiple interacting forms of escaping could get ugly; this patch makes no attempt to do so and simply documents that we decided to not deal with those corner cases for now but at least get the common cases right. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When reusing objects from a pack, we keep track of a set of one or more `reused_chunk`s, corresponding to sections of one or more object(s) from a source pack that we are reusing. Each chunk contains two pieces of information: - the offset of the first object in the source pack (relative to the beginning of the source pack) - the difference between that offset, and the corresponding offset in the pack we're generating The purpose of keeping track of these is so that we can patch an OFS_DELTAs that cross over a section of the reuse pack that we didn't take. For instance, consider a hypothetical pack as shown below: (chunk #2) __________... / / +--------+---------+-------------------+---------+ ... | <base> | <other> | (unused) | <delta> | ... +--------+---------+-------------------+---------+ \ / \______________/ (chunk #1) Suppose that we are sending objects "base", "other", and "delta", and that the "delta" object is stored as an OFS_DELTA, and that its base is "base". If we don't send any objects in the "(unused)" range, we can't copy the delta'd object directly, since its delta offset includes a range of the pack that we didn't copy, so we have to account for that difference when patching and reassembling the delta. In order to compute this value correctly, we need to know not only where we are in the packfile we're assembling (with `hashfile_total(f)`) but also the position of the first byte of the packfile that we are currently reusing. Currently, this works just fine, since when reusing only a single pack those two values are always identical (because verbatim reuse is the first thing pack-objects does when enabled after writing the pack header). But when reusing multiple packs which have one or more gaps, we'll need to account for these two values diverging. Together, these two allow us to compute the reused chunk's offset difference relative to the start of the reused pack, as desired. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

jeffhostetler added 2 commits September 7, 2017 16:36

pack-objects: add filtering help text

36d8558

Add help text for new object filtering options to pack-objects documentation. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>

jeffhostetler changed the title ~~Core/pack objects filtering~~ pack-objects filtering Sep 7, 2017

jeffhostetler closed this Sep 22, 2017

jeffhostetler deleted the core/pack_objects_filtering branch September 22, 2017 19:13

jeffhostetler pushed a commit that referenced this pull request Jun 3, 2020

l10n: it.po: update for Git 2.27.0 round #2

a09c79f

Signed-off-by: Alessandro Menti <alessandro.menti@alessandromenti.it>

jeffhostetler pushed a commit that referenced this pull request Jun 3, 2020

Merge branch of github.com:AlessandroMenti/git-po

e228b2c

* 'update-italian-translation' of github.com:AlessandroMenti/git-po: l10n: it.po: update for Git 2.27.0 round #2

jeffhostetler pushed a commit that referenced this pull request Dec 13, 2021

Merge branch 'pt-PT' of github.com:git-l10n-pt-PT/git-po

1b372d1

* 'pt-PT' of github.com:git-l10n-pt-PT/git-po: l10n: pt_PT: cleaning duplicate translations (#2)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pack-objects filtering #2

pack-objects filtering #2

jeffhostetler commented Sep 7, 2017

pack-objects filtering #2

pack-objects filtering #2

Conversation

jeffhostetler commented Sep 7, 2017