Commit Graph

86 Commits (master)

Author SHA1 Message Date
Elijah Newren df6e874496 diff.h: remove unnecessary include of oidset.h
This also made it clear that several .c files depended upon various
things that oidset included, but had omitted the direct #include for
those headers.  Add those now.

Signed-off-by: Elijah Newren <>
Signed-off-by: Junio C Hamano <>
2023-06-21 13:39:53 -07:00
Elijah Newren b6c09c03eb pretty.h: move has_non_ascii() declaration from commit.h
The function is defined in pretty.c, so this moves the declaration to
a more logical place.

Signed-off-by: Elijah Newren <>
Signed-off-by: Junio C Hamano <>
2023-02-23 17:25:29 -08:00
Jeff King 0ada4b9bfe diffcore-pickaxe: mark unused parameters in pickaxe functions
We have a virtual pickaxe_fn for handling -G versus -S pickaxe options.
They need to take the same set of parameters, but of course they care
about different ones (e.g., a regex -G will never use a kwset).

Mark the unused ones to appease -Wunused-parameter.

Signed-off-by: Jeff King <>
Signed-off-by: Junio C Hamano <>
2022-10-17 21:24:04 -07:00
Ævar Arnfjörð Bjarmason 5d93460024 xdiff-interface: replace discard_hunk_line() with a flag
Remove the dummy discard_hunk_line() function added in
3b40a090fd (diff: avoid generating unused hunk header lines,
2018-11-02) in favor of having a new XDL_EMIT_NO_HUNK_HDR flag, for
use along with the two existing and similar XDL_EMIT_* flags.

Unlike the recently amended xdiff_emit_line_fn interface which'll be
called in a loop in xdl_emit_diff(), the hunk header is only emitted

It makes more sense to pass this as a flag than provide a dummy
callback because that function may be able to skip doing certain work
if it knows the caller is doing nothing with the hunk header.

It would be possible to do so in the case of -U0 now, but the benefit
of doing so is so small that I haven't bothered. But this leaves the
door open to that, and more importantly makes the API use more

The reason we're putting a flag in the gap between 1<<0 and 1<<2 is
that the old 1<<1 flag was removed in 907681e940 (xdiff: drop
XDL_EMIT_COMMON, 2016-02-23) without re-ordering the remaining flags.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason f97fe35857 pickaxe -G: don't special-case create/delete
Instead of special-casing creations and deletions let's just generate
a diff for them.

This logic of not running a diff under -G if we don't have both sides
dates back to the original implementation of -S in
52e9578985 ([PATCH] Introducing software archaeologist's tool
"pickaxe"., 2005-05-21).

In the case of -S we were not working with the xdiff interface and
needed to do this, but when -G was implemented in f506b8e8b5 (git
log/diff: add -G<regexp> that greps in the patch text, 2010-08-23)
this logic was diligently copied over.

But as the performance test added earlier in this series shows, this
does not make much of a difference. With:


With the HEAD~ commit being the preceding "pickaxe -G: terminate early
on matching lines" we get these results. Note that it's only the -G
codepaths that are relevant to this change:

    Test                                                                      origin/next       HEAD~                   HEAD
    4209.1: git log -S'int main' <limit-rev>..                                0.35(0.32+0.03)   0.35(0.33+0.02) +0.0%   0.35(0.30+0.05) +0.0%
    4209.2: git log -S'æ' <limit-rev>..                                       0.46(0.42+0.04)   0.46(0.41+0.05) +0.0%   0.46(0.42+0.04) +0.0%
    4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>..         0.65(0.62+0.02)   0.64(0.61+0.02) -1.5%   0.64(0.60+0.04) -1.5%
    4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..          0.52(0.45+0.06)   0.52(0.50+0.01) +0.0%   0.54(0.47+0.04) +3.8%
    4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..       0.39(0.34+0.05)   0.39(0.34+0.04) +0.0%   0.39(0.36+0.03) +0.0%
    4209.6: git log -G'(int|void|null)' <limit-rev>..                         0.60(0.55+0.04)   0.58(0.54+0.03) -3.3%   0.58(0.49+0.08) -3.3%
    4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>..                          0.61(0.52+0.06)   0.59(0.53+0.05) -3.3%   0.59(0.54+0.05) -3.3%
    4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                       0.61(0.51+0.07)   0.58(0.54+0.04) -4.9%   0.57(0.51+0.06) -6.6%
    4209.9: git log -i -S'int main' <limit-rev>..                             0.36(0.31+0.04)   0.36(0.34+0.02) +0.0%   0.35(0.32+0.03) -2.8%
    4209.10: git log -i -S'æ' <limit-rev>..                                   0.36(0.33+0.03)   0.39(0.34+0.01) +8.3%   0.36(0.32+0.03) +0.0%
    4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>..     0.83(0.77+0.05)   0.82(0.77+0.05) -1.2%   0.80(0.75+0.04) -3.6%
    4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..      0.67(0.61+0.03)   0.64(0.61+0.03) -4.5%   0.63(0.61+0.02) -6.0%
    4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..   0.40(0.37+0.02)   0.40(0.37+0.03) +0.0%   0.40(0.36+0.04) +0.0%
    4209.14: git log -i -G'(int|void|null)' <limit-rev>..                     0.58(0.51+0.07)   0.59(0.52+0.06) +1.7%   0.58(0.52+0.05) +0.0%
    4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>..                      0.60(0.54+0.05)   0.60(0.54+0.06) +0.0%   0.60(0.56+0.03) +0.0%
    4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                   0.58(0.51+0.06)   0.57(0.52+0.05) -1.7%   0.60(0.48+0.09) +3.4%

This small simplification really doesn't buy us much now, but I've got
plans to both convert the pickaxe code to using a PCREv2 backend[1]
and to implement additional pickaxe modes to do custom searches
through the diff[2]. Always having the diff available under -G is
going to help to simplify both of those changes.


Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason fa59e7beb2 pickaxe -G: terminate early on matching lines
Solve a long-standing item for "git log -Grx" of us e.g. finding "+
str" in the diff context and noting that we had a "hit", but xdiff
diligently continuing to generate and spew the rest of the diff at
us. This makes use of a new "early return" xdiff interface added by
preceding commits.

The TODO item (or, the NEEDSWORK comment) has been there since "git
log -G" was implemented. See f506b8e8b5 (git log/diff: add -G<regexp>
that greps in the patch text, 2010-08-23).

But now with the support added in the preceding changes to the
xdiff-interface we can return early. Let's assert the behavior of that
new early-return xdiff-interface by having a BUG() call here to die if
it ever starts handing us needless work again.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason a8d5eb6dc0 xdiff-interface: prepare for allowing early return
Change the function prototype of xdiff_emit_line_fn to return an "int"
instead of "void". Change all of those functions to "return 0",
nothing checks those return values yet, and no behavior is being

In subsequent commits the interface will be changed to allow early
return via this new return value.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason 5b0672a26e pickaxe -S: slightly optimize contains()
When the "log -S<pat>" switch counts occurrences of <pat> on the
pre-image and post-image of a change. As soon as we know we had e.g. 1
before and 2 now we can stop, we don't need to keep counting past 2.

With this change a diff between A and B may have different performance
characteristics than between B and A. That's OK in this case, since
we'll emit the same output, and the effect is to make one of them

I'm picking a check of "one" first on the assumption that it's a more
common case to have files grow over time than not.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason 5d35a9531c pickaxe: rename variables in has_changes() for brevity
Rename the {one,two}_contains variables to c{1,2}. This will make a
follow-up change easier to read.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason 52e011cd2b pickaxe -S: support content with NULs under --pickaxe-regex
Fix a bug in the matching routine powering -S<rx> --pickaxe-regex so
that we won't abort early on content that has NULs in it.

We've had a hard requirement on REG_STARTEND since 2f8952250a (regex:
add regexec_buf() that can work on a non NUL-terminated string,
2016-09-21), but this sanity check dates back to d01d8c6782 (Support
for pickaxe matching regular expressions, 2006-03-29).

It wasn't needed anymore, and as the now-passing test shows, actively
getting in our way. Since we always require REG_STARTEND support we do
not need to stop at NULs. If we are dealing with a haystack with NUL
in it. The needle may be behind that NUL.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason 2e197a7592 pickaxe: assert that we must have a needle under -G or -S
Assert early in diffcore_pickaxe() that we've got a needle to work
with under -G and -S.

This code is redundant to the check -G and -S get from
parse-options.c's get_arg(), which I'm adding a test for.

This check dates back to e1b161161d (diffcore-pickaxe: fix infinite
loop on zero-length needle, 2007-01-25) when "git log -S" could send
this code into an infinite loop.

It was then later refactored in 8fa4b09fb1 (pickaxe: hoist empty
needle check, 2012-10-28) into its current form, but it seemingly
wasn't noticed that in the meantime a move to the parse-options.c API
in dea007fb4c (diff: parse separate options like -S foo, 2010-08-05)
had made it redundant.

Let's retain some of the paranoia here with a BUG(), but there's no
need to be checking this in the pickaxe_match() inner loop.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason 03c1f14acf pickaxe: refactor function selection in diffcore-pickaxe()
It's hard to read this codepath at a glance and reason about exactly
what combination of -G and -S will compile either regexes or kwset,
and whether we'll then dispatch to "diff_grep" or "has_changes".

Then in the "--find-object" case we aren't using the callback
function, but were previously passing down "has_changes".

Refactor this code to exhaustively check "opts", it's now more obvious
what callback function (or none) we want under what mode.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Ævar Arnfjörð Bjarmason 102fdd2e07 pickaxe/style: consolidate declarations and assignments
Refactor contains() to do its assignments at the same time that it
does its declarations.

This code could have been refactored in ef90ab66e8 (pickaxe: use
textconv for -S counting, 2012-10-28) when a function call between the
declarations and assignments was removed.

Signed-off-by: Ævar Arnfjörð Bjarmason <>
Signed-off-by: Junio C Hamano <>
2021-05-11 12:47:31 +09:00
Junio C Hamano ecdc7cbbac Merge branch 'tb/log-G-binary'
"git log -G<regex>" looked for a hunk in the "git log -p" patch
output that contained a string that matches the given pattern.
Optimize this code to ignore binary files, which by default will
not show any hunk that would match any pattern (unless textconv or
the --text option is in effect, that is).

* tb/log-G-binary:
  log -G: ignore binary files
2019-01-14 15:29:31 -08:00
Junio C Hamano cde555480b Merge branch 'nd/the-index'
More codepaths become aware of working with in-core repository
instance other than the default "the_repository".

* nd/the-index: (22 commits)
  rebase-interactive.c: remove the_repository references
  rerere.c: remove the_repository references
  pack-*.c: remove the_repository references
  pack-check.c: remove the_repository references
  notes-cache.c: remove the_repository references
  line-log.c: remove the_repository reference
  diff-lib.c: remove the_repository references
  delta-islands.c: remove the_repository references
  cache-tree.c: remove the_repository references
  bundle.c: remove the_repository references
  branch.c: remove the_repository reference
  bisect.c: remove the_repository reference
  blame.c: remove implicit dependency the_repository
  sequencer.c: remove implicit dependency on the_repository
  sequencer.c: remove implicit dependency on the_index
  transport.c: remove implicit dependency on the_index
  notes-merge.c: remove implicit dependency the_repository
  notes-merge.c: remove implicit dependency on the_index
  list-objects.c: reduce the_repository references
  list-objects-filter.c: remove implicit dependency on the_index
2019-01-04 13:33:33 -08:00
Thomas Braun e0e7cb8080 log -G: ignore binary files
The -G<regex> option of log looks for the differences whose patch text
contains added/removed lines that match regex.

Currently -G looks also into patches of binary files (which
according to [1]) is binary as well.

This has a couple of issues:

- It makes the pickaxe search slow. In a proprietary repository of the
  author with only ~5500 commits and a total .git size of ~300MB
  searching takes ~13 seconds

    $time git log -Gwave > /dev/null

    real    0m13,241s
    user    0m12,596s
    sys     0m0,644s

  whereas when we ignore binary files with this patch it takes ~4s

    $time ~/devel/git/git log -Gwave > /dev/null

    real    0m3,713s
    user    0m3,608s
    sys     0m0,105s

  which is a speedup of more than fourfold.

- The internally used algorithm for generating patch text is based on
  xdiff and its states in [1]

  > The output format of the binary patch file is proprietary
  > (and binary) and it is basically a collection of copy and insert
  > commands [..]

  which means that the current format could change once the internal
  algorithm is changed as the format is not standardized. In addition
  the git binary patch format used for preparing patches for git apply
  is *different* from the xdiff format as can be seen by comparing

  git log -p -a

    commit 6e95bf4bafccf14650d02ab57f3affe669be10cf
    Author: A U Thor <>
    Date:   Thu Apr 7 15:14:13 2005 -0700

        modify binary file

    diff --git a/data.bin b/data.bin
    index f414c84..edfeb6f 100644
    --- a/data.bin
    +++ b/data.bin
    @@ -1,2 +1,4 @@

  with git log --binary

    commit 6e95bf4bafccf14650d02ab57f3affe669be10cf
    Author: A U Thor <>
    Date:   Thu Apr 7 15:14:13 2005 -0700

        modify binary file

    diff --git a/data.bin b/data.bin
    index f414c84bd3aa25fa07836bb1fb73db784635e24b..edfeb6f501[..]
    GIT binary patch
    literal 12

    literal 6

  which seems unexpected.

To resolve these issues this patch makes -G<regex> ignore binary files
by default. Textconv filters are supported and also -a/--text for
getting the old and broken behaviour back.

The -S<block of text> option of log looks for differences that changes
the number of occurrences of the specified block of text (i.e.
addition/deletion) in a file. As we want to keep the current behaviour,
add a test to ensure it stays that way.


Signed-off-by: Thomas Braun <>
Signed-off-by: Junio C Hamano <>
2018-12-26 14:59:37 -08:00
Junio C Hamano 39d23dfa40 Merge branch 'jk/xdiff-interface'
The interface into "xdiff" library used to discover the offset and
size of a generated patch hunk by first formatting it into the
textual hunk header "@@ -n,m +k,l @@" and then parsing the numbers
out.  A new interface has been introduced to allow callers a more
direct access to them.

* jk/xdiff-interface:
  xdiff-interface: drop parse_hunk_header()
  range-diff: use a hunk callback
  diff: convert --check to use a hunk callback
  combine-diff: use an xdiff hunk callback
  diff: use hunk callback for word-diff
  diff: discard hunk headers for patch-ids earlier
  diff: avoid generating unused hunk header lines
  xdiff-interface: provide a separate consume callback for hunks
  xdiff: provide a separate emit callback for hunks
2018-11-13 22:37:27 +09:00
Nguyễn Thái Ngọc Duy bd7ad45b64 notes-cache.c: remove the_repository references
Signed-off-by: Nguyễn Thái Ngọc Duy <>
Signed-off-by: Junio C Hamano <>
2018-11-12 14:50:06 +09:00
Jeff King 3b40a090fd diff: avoid generating unused hunk header lines
Some callers of xdi_diff_outf() do not look at the generated hunk header
lines at all. By plugging in a no-op hunk callback, this tells xdiff not
to even bother formatting them.

This patch introduces a stock no-op callback and uses it with a few
callers whose line callbacks explicitly ignore hunk headers (because
they look only for +/- lines).

Signed-off-by: Jeff King <>
Signed-off-by: Junio C Hamano <>
2018-11-05 13:14:35 +09:00
Jeff King 9346d6d14d xdiff-interface: provide a separate consume callback for hunks
The previous commit taught xdiff to optionally provide the hunk header
data to a specialized callback. But most users of xdiff actually use our
more convenient xdi_diff_outf() helper, which ensures that our callbacks
are always fed whole lines.

Let's plumb the special hunk-callback through this interface, too. It
will follow the same rule as xdiff when the hunk callback is NULL (i.e.,
continue to pass a stringified hunk header to the line callback). Since
we add NULL to each caller, there should be no behavior change yet.

Signed-off-by: Jeff King <>
Signed-off-by: Junio C Hamano <>
2018-11-02 20:43:02 +09:00
Nguyễn Thái Ngọc Duy acd00ea049 userdiff.c: remove implicit dependency on the_index
[jc: squashed in missing forward decl in userdiff.h found by Ramsay]

Helped-by: Ramsay Jones <>
Signed-off-by: Nguyễn Thái Ngọc Duy <>
Signed-off-by: Junio C Hamano <>
2018-09-21 09:50:58 -07:00
Nguyễn Thái Ngọc Duy 6afaf80785 diff.c: remove the_index dependency in textconv() functions
Signed-off-by: Nguyễn Thái Ngọc Duy <>
Signed-off-by: Junio C Hamano <>
2018-09-21 09:48:10 -07:00
Martin Ågren 17154b1576 regex: do not call `regfree()` if compilation fails
It is apparently undefined behavior to call `regfree()` on a regex where
`regcomp()` failed. The language in [1] is a bit muddy, at least to me,
but the clearest hint is this (`preg` is the `regex_t *`):

    Upon successful completion, the regcomp() function shall return 0.
    Otherwise, it shall return an integer value indicating an error as
    described in <regex.h>, and the content of preg is undefined.

Funnily enough, there is also the `regerror()` function which should be
given a pointer to such a "failed" `regex_t` -- the content of which
would supposedly be undefined -- and which may investigate it to come up
with a detailed error message.

In any case, the example in that document shows how `regfree()` is not
called after `regcomp()` fails.

We have quite a few users of this API and most get this right. These
three users do not.

Several implementations can handle this just fine [2] and these code paths
supposedly have not wreaked havoc or we'd have heard about it. (These
are all in code paths where git got bad input and is just about to die
anyway.) But let's just avoid the issue altogether.



Researched-by: Eric Sunshine <>
Signed-off-byi Martin Ågren <>

Signed-off-by: Junio C Hamano <>
2018-05-21 13:58:32 +09:00
Stefan Beller 5e505257f2 diff: properly error out when combining multiple pickaxe options
In f506b8e8b5 (git log/diff: add -G<regexp> that greps in the patch text,
2010-08-23) we were hesitant to check if the user requests both -S and
-G at the same time. Now that the pickaxe family also offers --find-object,
which looks slightly more different than the former two, let's add a check
that those are not used at the same time.

Signed-off-by: Stefan Beller <>
Signed-off-by: Junio C Hamano <>
2018-01-04 15:02:40 -08:00
Stefan Beller 15af58c1ad diffcore: add a pickaxe option to find a specific blob
Sometimes users are given a hash of an object and they want to
identify it further (ex.: Use verify-pack to find the largest blobs,
but what are these? or [1])

One might be tempted to extend git-describe to also work with blobs,
such that `git describe <blob-id>` gives a description as
'<commit-ish>:<path>'.  This was implemented at [2]; as seen by the sheer
number of responses (>110), it turns out this is tricky to get right.
The hard part to get right is picking the correct 'commit-ish' as that
could be the commit that (re-)introduced the blob or the blob that
removed the blob; the blob could exist in different branches.

Junio hinted at a different approach of solving this problem, which this
patch implements. Teach the diff machinery another flag for restricting
the information to what is shown. For example:

    $ ./git log --oneline --find-object=v2.0.0:Makefile
    b2feb64309 Revert the whole "ask curl-config" topic for now
    47fbfded53 i18n: only extract comments marked with "TRANSLATORS:"

we observe that the Makefile as shipped with 2.0 was appeared in
v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b.  The
reason why these commits both occur prior to v2.0.0 are evil
merges that are not found using this new mechanism.


Signed-off-by: Stefan Beller <>
Signed-off-by: Junio C Hamano <>
2018-01-04 15:02:40 -08:00
Stefan Beller c1ddc4610c diff: migrate diff_flags.pickaxe_ignore_case to a pickaxe_opts bit
Currently flags for pickaxing are found in different places. Unify the
flags into the `pickaxe_opts` field, which will contain any pickaxe related

Signed-off-by: Stefan Beller <>
Signed-off-by: Junio C Hamano <>
2018-01-04 15:02:40 -08:00
Brandon Williams 0d1e0e7801 diff: make struct diff_flags members lowercase
Now that the flags stored in struct diff_flags are being accessed
directly and not through macros, change all struct members from being
uppercase to lowercase.
This conversion is done using the following semantic patch:

	expression E;
	+ E.recursive

	expression E;
	+ E.tree_in_recursive

	expression E;
	+ E.binary

	expression E;
	+ E.text

	expression E;
	+ E.full_index

	expression E;
	+ E.silent_on_remove

	expression E;
	+ E.find_copies_harder

	expression E;
	+ E.follow_renames

	expression E;
	+ E.rename_empty

	expression E;
	+ E.has_changes

	expression E;
	+ E.quick

	expression E;
	+ E.no_index

	expression E;
	+ E.allow_external

	expression E;
	+ E.exit_with_status

	expression E;
	+ E.reverse_diff

	expression E;
	+ E.check_failed

	expression E;
	+ E.relative_name

	expression E;
	+ E.ignore_submodules

	expression E;
	+ E.dirstat_cumulative

	expression E;
	+ E.dirstat_by_file

	expression E;
	+ E.allow_textconv

	expression E;
	+ E.textconv_set_via_cmdline

	expression E;
	+ E.diff_from_contents

	expression E;
	+ E.dirty_submodules

	expression E;
	+ E.ignore_untracked_in_submodules

	expression E;
	+ E.ignore_dirty_submodules

	expression E;
	+ E.override_submodule_config

	expression E;
	+ E.dirstat_by_line

	expression E;
	+ E.funccontext

	expression E;
	+ E.pickaxe_ignore_case

	expression E;
	+ E.default_follow_renames

Signed-off-by: Brandon Williams <>
Signed-off-by: Junio C Hamano <>
2017-11-01 11:51:40 +09:00
Brandon Williams 3b69daed86 diff: remove DIFF_OPT_TST macro
Remove the `DIFF_OPT_TST` macro and instead access the flags directly.
This conversion is done using the following semantic patch:

	expression E;
	identifier fld;
	- DIFF_OPT_TST(&E, fld)
	+ E.flags.fld

	type T;
	T *ptr;
	identifier fld;
	- DIFF_OPT_TST(ptr, fld)
	+ ptr->flags.fld

Signed-off-by: Brandon Williams <>
Signed-off-by: Junio C Hamano <>
2017-11-01 11:50:03 +09:00
Junio C Hamano 0efeb5ca12 Merge branch 'js/regexec-buf'
Fix for potential segv introduced in v2.11.0 and later (also

* js/regexec-buf:
  pickaxe: fix segfault with '-S<...> --pickaxe-regex'
2017-03-24 13:07:35 -07:00
SZEDER Gábor f53c5de29c pickaxe: fix segfault with '-S<...> --pickaxe-regex'
'git {log,diff,...} -S<...> --pickaxe-regex' can segfault as a result
of out-of-bounds memory reads.

diffcore-pickaxe.c:contains() looks for all matches of the given regex
in a buffer in a loop, advancing the buffer pointer to the end of the
last match in each iteration.  When we switched to REG_STARTEND in
b7d36ffca (regex: use regexec_buf(), 2016-09-21), we started passing
the size of that buffer to the regexp engine, too.  Unfortunately,
this buffer size is never updated on subsequent iterations, and as the
buffer pointer advances on each iteration, this "bufptr+bufsize"
points past the end of the buffer.  This results in segmentation
fault, if that memory can't be accessed.  In case of 'git log' it can
also result in erroneously listed commits, if the memory past the end
of buffer is accessible and happens to contain data matching the

Reduce the buffer size on each iteration as the buffer pointer is
advanced, thus maintaining the correct end of buffer location.
Furthermore, make sure that the buffer pointer is not dereferenced in
the control flow statements when we already reached the end of the

The new test is flaky, I've never seen it fail on my Linux box even
without the fix, but this is expected according to db5dfa3 (regex:
-G<pattern> feeds a non NUL-terminated string to regexec() and fails,
2016-09-21).  However, it did fail on Travis CI with the first (and
incomplete) version of the fix, and based on that commit message I
would expect the new test without the fix to fail most of the time on

Signed-off-by: SZEDER Gábor <>
Signed-off-by: Junio C Hamano <>
2017-03-18 12:22:33 -07:00
Junio C Hamano 6a67695268 Merge branch 'js/regexec-buf'
Some codepaths in "git diff" used regexec(3) on a buffer that was
mmap(2)ed, which may not have a terminating NUL, leading to a read
beyond the end of the mapped region.  This was fixed by introducing
a regexec_buf() helper that takes a <ptr,len> pair with REG_STARTEND

* js/regexec-buf:
  regex: use regexec_buf()
  regex: add regexec_buf() that can work on a non NUL-terminated string
  regex: -G<pattern> feeds a non NUL-terminated string to regexec() and fails
2016-09-26 16:09:19 -07:00
Johannes Schindelin b7d36ffca0 regex: use regexec_buf()
The new regexec_buf() function operates on buffers with an explicitly
specified length, rather than NUL-terminated strings.

We need to use this function whenever the buffer we want to pass to
regexec(3) may have been mmap(2)ed (and is hence not NUL-terminated).

Note: the original motivation for this patch was to fix a bug where
`git diff -G <regex>` would crash. This patch converts more callers,
though, some of which allocated to construct NUL-terminated strings,
or worse, modified buffers to temporarily insert NULs while calling
regexec(3).  By converting them to use regexec_buf(), the code has
become much cleaner.

Signed-off-by: Johannes Schindelin <>
Signed-off-by: Junio C Hamano <>
2016-09-21 13:56:15 -07:00
Nguyễn Thái Ngọc Duy b51a9c1479 diffcore-pickaxe: support case insensitive match on non-ascii
Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, so we quote the string and pass it to regcomp as
a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because might cause problems elsewhere, probably.

Noticed-by: Plamen Totev <>
Signed-off-by: Nguyễn Thái Ngọc Duy <>
Signed-off-by: Junio C Hamano <>
2016-07-01 12:44:57 -07:00
Nguyễn Thái Ngọc Duy 3d5b23a362 diffcore-pickaxe: Add regcomp_or_die()
There's another regcomp code block coming in this function that needs
the same error handling. This function can help avoid duplicating
error handling code.

Helped-by: Jeff King <>
Signed-off-by: Nguyễn Thái Ngọc Duy <>
Signed-off-by: Junio C Hamano <>
2016-07-01 12:44:57 -07:00
Jeff King 3efb988098 react to errors in xdi_diff
When we call into xdiff to perform a diff, we generally lose
the return code completely. Typically by ignoring the return
of our xdi_diff wrapper, but sometimes we even propagate
that return value up and then ignore it later.  This can
lead to us silently producing incorrect diffs (e.g., "git
log" might produce no output at all, not even a diff header,
for a content-level diff).

In practice this does not happen very often, because the
typical reason for xdiff to report failure is that it
malloc() failed (it uses straight malloc, and not our
xmalloc wrapper).  But it could also happen when xdiff
triggers one our callbacks, which returns an error (e.g.,
outf() in builtin/rerere.c tries to report a write failure
in this way). And the next patch also plans to add more
failure modes.

Let's notice an error return from xdiff and react
appropriately. In most of the diff.c code, we can simply
die(), which matches the surrounding code (e.g., that is
what we do if we fail to load a file for diffing in the
first place). This is not that elegant, but we are probably
better off dying to let the user know there was a problem,
rather than simply generating bogus output.

We could also just die() directly in xdi_diff, but the
callers typically have a bit more context, and can provide a
better message (and if we do later decide to pass errors up,
we're one step closer to doing so).

There is one interesting case, which is in diff_grep(). Here
if we cannot generate the diff, there is nothing to match,
and we silently return "no hits". This is actually what the
existing code does already, but we make it a little more

Signed-off-by: Jeff King <>
Signed-off-by: Junio C Hamano <>
2015-09-28 14:57:10 -07:00
René Scharfe e4aab50475 pickaxe: simplify kwset loop in contains()
Inlining the variable "found" actually makes the code shorter and
easier to read.

Signed-off-by: Rene Scharfe <>
Signed-off-by: Junio C Hamano <>
2014-03-24 15:13:17 -07:00
René Scharfe 542b2aa2c9 pickaxe: call strlen only when necessary in diffcore_pickaxe_count()
We need to determine the search term's length only when fixed-string
matching is used; regular expression compilation takes a NUL-terminated
string directly.  Only call strlen() in the former case.

Signed-off-by: Rene Scharfe <>
Signed-off-by: Junio C Hamano <>
2014-03-24 15:13:17 -07:00
René Scharfe 3753bd1f69 pickaxe: move pickaxe() after pickaxe_match()
pickaxe() calls pickaxe_match(); moving the definition of the former
after the latter allows us to do without an explicit function

Signed-off-by: Rene Scharfe <>
Signed-off-by: Junio C Hamano <>
2014-03-24 15:13:10 -07:00
René Scharfe 63b52afaa8 pickaxe: merge diffcore_pickaxe_grep() and diffcore_pickaxe_count() into diffcore_pickaxe()
diffcore_pickaxe_count() initializes the regular expression or kwset for
the search term, calls pickaxe() with the callback has_changes() and
cleans up afterwards.  diffcore_pickaxe_grep() does the same, only it
doesn't support kwset and uses the callback diff_grep() instead.  Merge
the two functions to form the new diffcore_pickaxe() and thus get rid of
the duplicate regex setup and cleanup code.

Signed-off-by: Rene Scharfe <>
Signed-off-by: Junio C Hamano <>
2014-03-24 15:12:45 -07:00
René Scharfe 218c45a45c pickaxe: honor -i when used with -S and --pickaxe-regex
accccde4 (pickaxe: allow -i to search in patch case-insensitively)
allowed case-insenitive matching for -G and -S, but for the latter
only if fixed string matching is used.  Allow it for -S and regular
expression matching as well to make the support complete.

Signed-off-by: Rene Scharfe <>
Signed-off-by: Junio C Hamano <>
2014-03-24 15:12:45 -07:00
Junio C Hamano d5a3897f94 Merge branch 'rs/pickaxe-simplify'
* rs/pickaxe-simplify:
  diffcore-pickaxe: simplify has_changes and contains
2013-07-12 12:04:17 -07:00
René Scharfe 3bdb5b9f1f diffcore-pickaxe: simplify has_changes and contains
Halve the number of callsites of contains() to two using temporary
variables, simplifying the code.  While at it, get rid of the
diff_options parameter, which became unused with 8fa4b09f.

Signed-off-by: René Scharfe <>
Acked-by: Jeff King <>
Signed-off-by: Junio C Hamano <>
2013-07-07 10:24:11 -07:00
Ramkumar Ramachandra 276b22d333 diffcore-pickaxe: make error messages more consistent
Currently, diffcore-pickaxe reports two distinct errors for the same
user error:

    $ git log --pickaxe-regex -S'\1'
    fatal: invalid pickaxe regex: Invalid back reference

    $ git log -G'\1'
    fatal: invalid log-grep regex: Invalid back reference

This "log-grep" was only an internal name for the -G feature during
development, and invite confusion with "git log --grep=<pattern>".

Change the error messages to say "invalid regex".

Signed-off-by: Ramkumar Ramachandra <>
Signed-off-by: Junio C Hamano <>
2013-06-03 10:50:22 -07:00
Jeff King 61690bf4a1 diffcore-pickaxe: unify code for log -S/-G
The logic flow of has_changes() used for "log -S" and diff_grep()
used for "log -G" are essentially the same.  See if we have both
sides that could be different in any interesting way, slurp the
contents in core, possibly after applying textconv, inspect the
contents, clean-up and report the result.  The only difference
between the two is how "inspect" step works.

Unify this codeflow in a helper, pickaxe_match(), which takes a
callback function that implements the specific "inspect" step.

After removing the common scaffolding code from the existing
has_changes() and diff_grep(), they each becomes such a callback
function suitable for passing to pickaxe_match().

Signed-off-by: Jeff King <>
Signed-off-by: Junio C Hamano <>
2013-04-05 10:31:09 -07:00
Junio C Hamano 88ff684dd5 diffcore-pickaxe: fix leaks in "log -S<block>" and "log -G<pattern>"
The diff_grep() and has_changes() functions had early return
codepaths for unmerged filepairs, which simply returned 0.  When we
taught textconv filter to them, one was ignored and continued to
return early without freeing the result filtered by textconv, and
the other had a failed attempt to fix, which allowed the planned
return value 0 to be overwritten by a bogus call to contains().

Signed-off-by: Junio C Hamano <>
2013-04-05 10:31:09 -07:00
Junio C Hamano ebb7226258 diffcore-pickaxe: port optimization from has_changes() to diff_grep()
These two functions are called in the same codeflow to implement
"log -S<block>" and "log -G<pattern>", respectively, but the latter
lacked two obvious optimizations the former implemented, namely:

 - When a pickaxe limit is not given at all, they should return
   without wasting any cycle;

 - When both sides of the filepair are the same, and the same
   textconv conversion apply to them, return early, as there will be
   no interesting differences between the two anyway.

Also release the filespec data once the processing is done (this is
not about leaking memory--it is about releasing data we finished
looking at as early as possible).

Signed-off-by: Junio C Hamano <>
2013-04-05 10:31:09 -07:00
Simon Ruderich a8f6109428 diffcore-pickaxe: respect --no-textconv
git log -S doesn't respect --no-textconv:

    $ echo '*.txt diff=wrong' > .gitattributes
    $ git -c diff.wrong.textconv='xxx' log --no-textconv -Sfoo
    error: cannot run xxx: No such file or directory
    fatal: unable to read files to diff

Reported-by: Matthieu Moy <>
Signed-off-by: Simon Ruderich <>
Signed-off-by: Junio C Hamano <>
2013-04-05 10:30:44 -07:00
Jeff King 7cdb9b42c3 diffcore-pickaxe: remove fill_one()
fill_one is _almost_ identical to just calling fill_textconv; the
exception is that for the !DIFF_FILE_VALID case, fill_textconv gives us
an empty buffer rather than a NULL one. Since we currently use the NULL
pointer as a signal that the file is not present on one side of the
diff, we must now switch to using DIFF_FILE_VALID to make the same

Signed-off-by: Jeff King <>
Signed-off-by: Simon Ruderich <>
Signed-off-by: Junio C Hamano <>
2013-04-04 20:33:19 -07:00
Simon Ruderich bc6158981b diffcore-pickaxe: remove unnecessary call to get_textconv()
The fill_one() function is responsible for finding and filling the
textconv filter as necessary, and is called by diff_grep() function
that implements "git log -G<pattern>".

The has_changes() function that implements "git log -S<block>" calls
get_textconv() for two sides being compared, before it checks to see
if it was asked to perform the pickaxe limiting.  Move the code
around to avoid this wastage.

After has_changes() calls get_textconv() to obtain textconv for both
sides, fill_one() is called to use them.

By adding get_textconv() to diff_grep() and relieving fill_one() of
responsibility to find the textconv filter, we can avoid calling
get_textconv() twice in has_changes().

With this change it's also no longer necessary for fill_one() to
modify the textconv argument, therefore pass a pointer instead of a
pointer to a pointer.

Signed-off-by: Simon Ruderich <>
Signed-off-by: Junio C Hamano <>
2013-04-04 20:33:19 -07:00
Jeff King ef90ab66e8 pickaxe: use textconv for -S counting
We currently just look at raw blob data when using "-S" to
pickaxe. This is mostly historical, as pickaxe predates the
textconv feature. If the user has bothered to define a
textconv filter, it is more likely that their search string will be
on the textconv output, as that is what they will see in the
diff (and we do not even provide a mechanism for them to
search for binary needles that contain NUL characters).

This patch teaches "-S" to use textconv, just as we
already do for "-G".

Signed-off-by: Jeff King <>
2012-10-28 08:48:17 -04:00