Git Source Code Mirror - This is a publish-only repository and all pull requests are ignored. Please follow Documentation/SubmittingPatches procedure for any of your improvements. https://git-scm.com/
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
git/diff.h

700 lines
22 KiB

/*
* Copyright (C) 2005 Junio C Hamano
*/
#ifndef DIFF_H
#define DIFF_H
#include "tree-walk.h"
#include "pathspec.h"
#include "object.h"
diffcore: add a pickaxe option to find a specific blob Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or [1]) One might be tempted to extend git-describe to also work with blobs, such that `git describe <blob-id>` gives a description as '<commit-ish>:<path>'. This was implemented at [2]; as seen by the sheer number of responses (>110), it turns out this is tricky to get right. The hard part to get right is picking the correct 'commit-ish' as that could be the commit that (re-)introduced the blob or the blob that removed the blob; the blob could exist in different branches. Junio hinted at a different approach of solving this problem, which this patch implements. Teach the diff machinery another flag for restricting the information to what is shown. For example: $ ./git log --oneline --find-object=v2.0.0:Makefile b2feb64309 Revert the whole "ask curl-config" topic for now 47fbfded53 i18n: only extract comments marked with "TRANSLATORS:" we observe that the Makefile as shipped with 2.0 was appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b. The reason why these commits both occur prior to v2.0.0 are evil merges that are not found using this new mechanism. [1] https://stackoverflow.com/questions/223678/which-commit-has-this-blob [2] https://public-inbox.org/git/20171028004419.10139-1-sbeller@google.com/ Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
#include "oidset.h"
/**
* The diff API is for programs that compare two sets of files (e.g. two trees,
* one tree and the index) and present the found difference in various ways.
* The calling program is responsible for feeding the API pairs of files, one
* from the "old" set and the corresponding one from "new" set, that are
* different.
* The library called through this API is called diffcore, and is responsible
* for two things.
*
* - finding total rewrites (`-B`), renames (`-M`) and copies (`-C`), and
* changes that touch a string (`-S`), as specified by the caller.
*
* - outputting the differences in various formats, as specified by the caller.
*
* Calling sequence
* ----------------
*
* - Prepare `struct diff_options` to record the set of diff options, and then
* call `repo_diff_setup()` to initialize this structure. This sets up the
* vanilla default.
*
* - Fill in the options structure to specify desired output format, rename
* detection, etc. `diff_opt_parse()` can be used to parse options given
* from the command line in a way consistent with existing git-diff family
* of programs.
*
* - Call `diff_setup_done()`; this inspects the options set up so far for
* internal consistency and make necessary tweaking to it (e.g. if textual
* patch output was asked, recursive behaviour is turned on); the callback
* set_default in diff_options can be used to tweak this more.
*
* - As you find different pairs of files, call `diff_change()` to feed
* modified files, `diff_addremove()` to feed created or deleted files, or
* `diff_unmerge()` to feed a file whose state is 'unmerged' to the API.
* These are thin wrappers to a lower-level `diff_queue()` function that is
* flexible enough to record any of these kinds of changes.
*
* - Once you finish feeding the pairs of files, call `diffcore_std()`.
* This will tell the diffcore library to go ahead and do its work.
*
diff: add an API for deferred freeing Add a diff_free() function to free anything we may have allocated in the "diff_options" struct, and the ability to make calling it a noop by setting "no_free" in "diff_options". This is required because when e.g. "git diff" is run we'll allocate things in that struct, use the diff machinery once, and then exit. But if we run e.g. "git log -p" we're going to re-use what we allocated across multiple diff_flush() calls, and only want to free things at the end. We've thus ended up with features like the recently added "diff -I"[1] where we'll leak memory. As it turns out it could have simply used the pattern established in 6ea57703f6 (log: prepare log/log-tree to reuse the diffopt.close_file attribute, 2016-06-22). Manually adding more such flags to things log_tree_commit() every time we need to allocate something would be tedious. Let's instead move that fclose() code it to a new diff_free(), in anticipation of freeing more things in that function in follow-up commits. Some functions such as log_tree_commit() need an idiom of optionally retaining a previous "no_free", as they may either free the memory themselves, or their caller may do so. I'm keeping that idiom in log_show_early() for good measure, even though I don't think it's currently called in this manner. It also gets passed an existing "struct rev_info", so future callers may want to set the "no_free" flag. This change is a bit hard to read because while the freeing pattern we're introducing isn't unusual, the "file" member is a special snowflake. We usually don't want to fclose() it. This is because "file" is usually stdout, in which case we don't want to fclose() it. We only want to opt-in to closing it when we e.g. open a file on the filesystem. Thus the opt-in "close_file" flag. So the API in general just needs a "no_free" flag to defer freeing, but the "file" member still needs its "close_file" flag. This is made more confusing because while refactoring this code we could replace some "close_file=0" with "no_free=1", whereas others need to set both flags. This is because there were some cases where an existing "close_file=0" meant "let's defer deallocation", and others where it meant "we don't want to close this file handle at all". 1. 296d4a94e7 (diff: add -I<regex> that ignores matching changes, 2020-10-20) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2 years ago
* - Calling `diff_flush()` will produce the output, it will call
* `diff_free()` to free any resources, e.g. those allocated in
* `diff_opt_parse()`.
*
* - Set `.no_free = 1` before calling `diff_flush()` to defer the
* freeing of allocated memory in diff_options. This is useful when
* `diff_flush()` is being called in a loop, rather than as a
* one-off. When setting `.no_free = 1` you must ensure that
* `diff_free()` is called at the end, either by flipping the flag
* before the last `diff_flush()` call, or by flipping it before
* calling `diff_free()` yourself.
*/
struct combine_diff_path;
struct commit;
struct diff_filespec;
struct diff_options;
struct diff_queue_struct;
struct oid_array;
struct option;
struct repository;
struct rev_info;
struct strbuf;
struct userdiff_driver;
tree-diff: rework diff_tree() to generate diffs for multiparent cases as well Previously diff_tree(), which is now named ll_diff_tree_sha1(), was generating diff_filepair(s) for two trees t1 and t2, and that was usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes a commit introduces. In Git, however, we have fundamentally built flexibility in that a commit can have many parents - 1 for a plain commit, 2 for a simple merge, but also more than 2 for merging several heads at once. For merges there is a so called combine-diff, which shows diff, a merge introduces by itself, omitting changes done by any parent. That works through first finding paths, that are different to all parents, and then showing generalized diff, with separate columns for +/- for each parent. The code lives in combine-diff.c . There is an impedance mismatch, however, in that a commit could generally have any number of parents, and that while diffing trees, we divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there is no special casing for multiple parents commits in e.g. revision-walker . That impedance mismatch *hurts* *performance* *badly* for generating combined diffs - in "combine-diff: optimize combine_diff_path sets intersection" I've already removed some slowness from it, but from the timings provided there, it could be seen, that combined diffs still cost more than an order of magnitude more cpu time, compared to diff for usual commits, and that would only be an optimistic estimate, if we take into account that for e.g. linux.git there is only one merge for several dozens of plain commits. That slowness comes from the fact that currently, while generating combined diff, a lot of time is spent computing diff(commit,commit^2) just to only then intersect that huge diff to almost small set of files from diff(commit,commit^1). That's because at present, to compute combine-diff, for first finding paths, that "every parent touches", we use the following combine-diff property/definition: D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn) (w.r.t. paths) where D(A,P1...Pn) is combined diff between commit A, and parents Pi and D(A,Pi) is usual two-tree diff Pi..A So if any of that D(A,Pi) is huge, tracting 1 n-parent combine-diff as n 1-parent diffs and intersecting results will be slow. And usually, for linux.git and other topic-based workflows, that D(A,P2) is huge, because, if merge-base of A and P2, is several dozens of merges (from A, via first parent) below, that D(A,P2) will be diffing sum of merges from several subsystems to 1 subsystem. The solution is to avoid computing n 1-parent diffs, and to find changed-to-all-parents paths via scanning A's and all Pi's trees simultaneously, at each step comparing their entries, and based on that comparison, populate paths result, and deduce we could *skip* *recursing* into subdirectories, if at least for 1 parent, sha1 of that dir tree is the same as in A. That would save us from doing significant amount of needless work. Such approach is very similar to what diff_tree() does, only there we deal with scanning only 2 trees simultaneously, and for n+1 tree, the logic is a bit more complex: D(T,P1...Pn) calculation scheme ------------------------------- D(T,P1...Pn) = D(T,P1) ^ ... ^ D(T,Pn) (regarding resulting paths set) D(T,Pj) - diff between T..Pj D(T,P1...Pn) - combined diff from T to parents P1,...,Pn We start from all trees, which are sorted, and compare their entries in lock-step: T P1 Pn - - - |t| |p1| |pn| |-| |--| ... |--| imin = argmin(p1...pn) | | | | | | |-| |--| |--| |.| |. | |. | . . . . . . at any time there could be 3 cases: 1) t < p[imin]; 2) t > p[imin]; 3) t = p[imin]. Schematic deduction of what every case means, and what to do, follows: 1) t < p[imin] -> ∀j t ∉ Pj -> "+t" ∈ D(T,Pj) -> D += "+t"; t↓ 2) t > p[imin] 2.1) ∃j: pj > p[imin] -> "-p[imin]" ∉ D(T,Pj) -> D += ø; ∀ pi=p[imin] pi↓ 2.2) ∀i pi = p[imin] -> pi ∉ T -> "-pi" ∈ D(T,Pi) -> D += "-p[imin]"; ∀i pi↓ 3) t = p[imin] 3.1) ∃j: pj > p[imin] -> "+t" ∈ D(T,Pj) -> only pi=p[imin] remains to investigate 3.2) pi = p[imin] -> investigate δ(t,pi) | | v 3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø -> ⎧δ(t,pi) - if pi=p[imin] -> D += ⎨ ⎩"+t" - if pi>p[imin] in any case t↓ ∀ pi=p[imin] pi↓ ~ For comparison, here is how diff_tree() works: D(A,B) calculation scheme ------------------------- A B - - |a| |b| a < b -> a ∉ B -> D(A,B) += +a a↓ |-| |-| a > b -> b ∉ A -> D(A,B) += -b b↓ | | | | a = b -> investigate δ(a,b) a↓ b↓ |-| |-| |.| |.| . . . . ~~~~~~~~ This patch generalizes diff tree-walker to work with arbitrary number of parents as described above - i.e. now there is a resulting tree t, and some parents trees tp[i] i=[0..nparent). The generalization builds on the fact that usual diff D(A,B) is by definition the same as combined diff D(A,[B]), so if we could rework the code for common case and make it be not slower for nparent=1 case, usual diff(t1,t2) generation will not be slower, and multiparent diff tree-walker would greatly benefit generating combine-diff. What we do is as follows: 1) diff tree-walker ll_diff_tree_sha1() is internally reworked to be a paths generator (new name diff_tree_paths()), with each generated path being `struct combine_diff_path` with info for path, new sha1,mode and for every parent which sha1,mode it was in it. 2) From that info, we can still generate usual diff queue with struct diff_filepairs, via "exporting" generated combine_diff_path, if we know we run for nparent=1 case. (see emit_diff() which is now named emit_diff_first_parent_only()) 3) In order for diff_can_quit_early(), which checks DIFF_OPT_TST(opt, HAS_CHANGES)) to work, that exporting have to be happening not in bulk, but incrementally, one diff path at a time. For such consumers, there is a new callback in diff_options introduced: ->pathchange(opt, struct combine_diff_path *) which, if set to !NULL, is called for every generated path. (see new compat ll_diff_tree_sha1() wrapper around new paths generator for setup) 4) The paths generation itself, is reworked from previous ll_diff_tree_sha1() code according to "D(A,P1...Pn) calculation scheme" provided above: On the start we allocate [nparent] arrays in place what was earlier just for one parent tree. then we just generalize loops, and comparison according to the algorithm. Some notes(*): 1) alloca(), for small arrays, is used for "runs not slower for nparent=1 case than before" goal - if we change it to xmalloc()/free() the timings get ~1% worse. For alloca() we use just-introduced xalloca/xalloca_free compatibility wrappers, so it should not be a portability problem. 2) For every parent tree, we need to keep a tag, whether entry from that parent equals to entry from minimal parent. For performance reasons I'm keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ. Not doing so, we'd need to alloca another [nparent] array, which hurts performance. 3) For emitted paths, memory could be reused, if we know the path was processed via callback and will not be needed later. We use efficient hand-made realloc-style path_appendnew(), that saves us from ~1-1.5% of potential additional slowdown. 4) goto(s) are used in several places, as the code executes a little bit faster with lowered register pressure. Also - we should now check for FIND_COPIES_HARDER not only when two entries names are the same, and their hashes are equal, but also for a case, when a path was removed from some of all parents having it. The reason is, if we don't, that path won't be emitted at all (see "a > xi" case), and we'll just skip it, and FIND_COPIES_HARDER wants all paths - with diff or without - to be emitted, to be later analyzed for being copies sources. The new check is only necessary for nparent >1, as for nparent=1 case xmin_eqtotal always =1 =nparent, and a path is always added to diff as removal. ~~~~~~~~ Timings for # without -c, i.e. testing only nparent=1 case `git log --raw --no-abbrev --no-renames` before and after the patch are as follows: navy.git linux.git v3.10..v3.11 before 0.611s 1.889s after 0.619s 1.907s slowdown 1.3% 0.9% This timings show we did no harm to usual diff(tree1,tree2) generation. From the table we can see that we actually did ~1% slowdown, but I think I've "earned" that 1% in the previous patch ("tree-diff: reuse base str(buf) memory on sub-tree recursion", HEAD~~) so for nparent=1 case, net timings stays approximately the same. The output also stayed the same. (*) If we revert 1)-4) to more usual techniques, for nparent=1 case, we'll get ~2-2.5% of additional slowdown, which I've tried to avoid, as "do no harm for nparent=1 case" rule. For linux.git, combined diff will run an order of magnitude faster and appropriate timings will be provided in the next commit, as we'll be taking advantage of the new diff tree-walker for combined-diff generation there. P.S. and combined diff is not some exotic/for-play-only stuff - for example for a program I write to represent Git archives as readonly filesystem, there is initial scan with `git log --reverse --raw --no-abbrev --no-renames -c` to extract log of what was created/changed when, as a result building a map {} sha1 -> in which commit (and date) a content was added that `-c` means also show combined diff for merges, and without them, if a merge is non-trivial (merges changes from two parents with both having separate changes to a file), or an evil one, the map will not be full, i.e. some valid sha1 would be absent from it. That case was my initial motivation for combined diffs speedup. Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>
9 years ago
typedef int (*pathchange_fn_t)(struct diff_options *options,
struct combine_diff_path *path);
typedef void (*change_fn_t)(struct diff_options *options,
unsigned old_mode, unsigned new_mode,
const struct object_id *old_oid,
const struct object_id *new_oid,
int old_oid_valid, int new_oid_valid,
const char *fullpath,
unsigned old_dirty_submodule, unsigned new_dirty_submodule);
typedef void (*add_remove_fn_t)(struct diff_options *options,
int addremove, unsigned mode,
const struct object_id *oid,
int oid_valid,
const char *fullpath, unsigned dirty_submodule);
typedef void (*diff_format_fn_t)(struct diff_queue_struct *q,
struct diff_options *options, void *data);
typedef struct strbuf *(*diff_prefix_fn_t)(struct diff_options *opt, void *data);
#define DIFF_FORMAT_RAW 0x0001
#define DIFF_FORMAT_DIFFSTAT 0x0002
#define DIFF_FORMAT_NUMSTAT 0x0004
#define DIFF_FORMAT_SUMMARY 0x0008
#define DIFF_FORMAT_PATCH 0x0010
#define DIFF_FORMAT_SHORTSTAT 0x0020
Add "--dirstat" for some directory statistics This adds a new form of overview diffstat output, doing something that I have occasionally ended up doing manually (and badly, because it's actually pretty nasty to do), and that I think is very useful for an project like the kernel that has a fairly deep and well-separated directory structure with semantic meaning. What I mean by that is that it's often interesting to see exactly which sub-directories are impacted by a patch, and to what degree - even if you don't perhaps care so much about the individual files themselves. What makes the concept more interesting is that the "impact" is often hierarchical: in the kernel, for example, something could either have a very localized impact to "fs/ext3/" and then it's interesting to see that such a patch changes mostly that subdirectory, but you could have another patch that changes some generic VFS-layer issue which affects _many_ subdirectories that are all under "fs/", but none - or perhaps just a couple of them - of the individual filesystems are interesting in themselves. So what commonly happens is that you may have big changes in a specific sub-subdirectory, but still also significant separate changes to the subdirectory leading up to that - maybe you have significant VFS-level changes, but *also* changes under that VFS layer in the NFS-specific directories, for example. In that case, you do want the low-level parts that are significant to show up, but then the insignificant ones should show up as under the more generic top-level directory. This patch shows all of that with "--dirstat". The output can be either something simple like commit 81772fe... Author: Thomas Gleixner <tglx@linutronix.de> Date: Sun Feb 10 23:57:36 2008 +0100 x86: remove over noisy debug printk pageattr-test.c contains a noisy debug printk that people reported. The condition under which it prints (randomly tapping into a mem_map[] hole and not being able to c_p_a() there) is valid behavior and not interesting to report. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> 100.0% arch/x86/mm/ or something much more complex like commit e231c2e... Author: David Howells <dhowells@redhat.com> Date: Thu Feb 7 00:15:26 2008 -0800 Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p) 20.5% crypto/ 7.6% fs/afs/ 7.6% fs/fuse/ 7.6% fs/gfs2/ 5.1% fs/jffs2/ 5.1% fs/nfs/ 5.1% fs/nfsd/ 7.6% fs/reiserfs/ 15.3% fs/ 7.6% net/rxrpc/ 10.2% security/keys/ where that latter example is an example of significant work in some individual fs/*/ subdirectories (like the patches to reiserfs accounting for 7.6% of the whole), but then discounting those individual filesystems, there's also 15.3% other "random" things that weren't worth reporting on their oen left over under fs/ in general (either in that directory itself, or in subdirectories of fs/ that didn't have enough changes to be reported individually). I'd like to stress that the "15.3% fs/" mentioned above is the stuff that is under fs/ but that was _not_ significant enough to report on its own. So the above does _not_ mean that 15.3% of the work was under fs/ per se, because that 15.3% does *not* include the already-reported 7.6% of afs, 7.6% of fuse etc. If you want to enable "cumulative" directory statistics, you can use the "--cumulative" flag, which adds up percentages recursively even when they have been already reported for a sub-directory. That cumulative output is disabled if *all* of the changes in one subdirectory come from a deeper subdirectory, to avoid repeating subdirectories all the way to the root. For an example of the cumulative reporting, the above commit becomes commit e231c2e... Author: David Howells <dhowells@redhat.com> Date: Thu Feb 7 00:15:26 2008 -0800 Convert ERR_PTR(PTR_ERR(p)) instances to ERR_CAST(p) 20.5% crypto/ 7.6% fs/afs/ 7.6% fs/fuse/ 7.6% fs/gfs2/ 5.1% fs/jffs2/ 5.1% fs/nfs/ 5.1% fs/nfsd/ 7.6% fs/reiserfs/ 61.5% fs/ 7.6% net/rxrpc/ 10.2% security/keys/ in which the commit percentages now obviously add up to much more than 100%: now the changes that were already reported for the sub-directories under fs/ are then cumulatively included in the whole percentage of fs/ (ie now shows 61.5% as opposed to the 15.3% without the cumulative reporting). The default reporting limit has been arbitrarily set at 3%, which seems to be a pretty good cut-off, but you can specify the cut-off manually by giving it as an option parameter (eg "--dirstat=5" makes the cut-off be at 5% instead) NOTE! The percentages are purely about the total lines added and removed, not anything smarter (or dumber) than that. Also note that you should not generally expect things to add up to 100%: not only does it round down, we don't report leftover scraps (they add up to the top-level change count, but we don't even bother reporting that, it only reports subdirectories). Quite frankly, as a top-level manager this is really convenient for me, but it's going to be very boring for git itself since there are few subdirectories. Also, don't expect things to make tons of sense if you combine this with "-M" and there are cross-directory renames etc. But even for git itself, you can get some fun statistics. Try out git log --dirstat and see the occasional mentions of things like Documentation/, git-gui/, gitweb/ and gitk-git/. Or try out something like git diff --dirstat v1.5.0..v1.5.4 which does kind of git an overview that shows *something*. But in general, the output is more exciting for big projects with deeper structure, and doing a git diff --dirstat v2.6.24..v2.6.25-rc1 on the kernel is what I actually wrote this for! Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
15 years ago
#define DIFF_FORMAT_DIRSTAT 0x0040
/* These override all above */
#define DIFF_FORMAT_NAME 0x0100
#define DIFF_FORMAT_NAME_STATUS 0x0200
#define DIFF_FORMAT_CHECKDIFF 0x0400
/* Same as output_format = 0 but we know that -s flag was given
* and we should not give default value to output_format.
*/
#define DIFF_FORMAT_NO_OUTPUT 0x0800
#define DIFF_FORMAT_CALLBACK 0x1000
#define DIFF_FLAGS_INIT { 0 }
struct diff_flags {
/**
* Tells if tree traversal done by tree-diff should recursively descend
* into a tree object pair that are different in preimage and postimage set.
*/
unsigned recursive;
unsigned tree_in_recursive;
/* Affects the way how a file that is seemingly binary is treated. */
unsigned binary;
unsigned text;
/**
* Tells the patch output format not to use abbreviated object names on the
* "index" lines.
*/
unsigned full_index;
/* Affects if diff-files shows removed files. */
unsigned silent_on_remove;
/**
* Tells the diffcore library that the caller is feeding unchanged
* filepairs to allow copies from unmodified files be detected.
*/
unsigned find_copies_harder;
unsigned follow_renames;
unsigned rename_empty;
/* Internal; used for optimization to see if there is any change. */
unsigned has_changes;
unsigned quick;
/**
* Tells diff-files that the input is not tracked files but files in random
* locations on the filesystem.
*/
unsigned no_index;
/**
* Tells output routine that it is Ok to call user specified patch output
* routine. Plumbing disables this to ensure stable output.
*/
unsigned allow_external;
/**
* For communication between the calling program and the options parser;
* tell the calling program to signal the presence of difference using
* program exit code.
*/
unsigned exit_with_status;
/**
* Tells the library that the calling program is feeding the filepairs
* reversed; `one` is two, and `two` is one.
*/
unsigned reverse_diff;
unsigned check_failed;
unsigned relative_name;
unsigned ignore_submodules;
unsigned dirstat_cumulative;
unsigned dirstat_by_file;
unsigned allow_textconv;
unsigned textconv_set_via_cmdline;
unsigned diff_from_contents;
unsigned dirty_submodules;
unsigned ignore_untracked_in_submodules;
unsigned ignore_submodule_set;
unsigned ignore_dirty_submodules;
unsigned override_submodule_config;
unsigned dirstat_by_line;
unsigned funccontext;
unsigned default_follow_renames;
unsigned stat_with_summary;
unsigned suppress_diff_headers;
unsigned dual_color_diffed_diffs;
unsigned suppress_hunk_header_line_count;
};
static inline void diff_flags_or(struct diff_flags *a,
const struct diff_flags *b)
{
char *tmp_a = (char *)a;
const char *tmp_b = (const char *)b;
int i;
for (i = 0; i < sizeof(struct diff_flags); i++)
tmp_a[i] |= tmp_b[i];
}
#define DIFF_XDL_TST(opts, flag) ((opts)->xdl_opts & XDF_##flag)
#define DIFF_XDL_SET(opts, flag) ((opts)->xdl_opts |= XDF_##flag)
#define DIFF_XDL_CLR(opts, flag) ((opts)->xdl_opts &= ~XDF_##flag)
#define DIFF_WITH_ALG(opts, flag) (((opts)->xdl_opts & ~XDF_DIFF_ALGORITHM_MASK) | XDF_##flag)
enum diff_words_type {
DIFF_WORDS_NONE = 0,
DIFF_WORDS_PORCELAIN,
DIFF_WORDS_PLAIN,
DIFF_WORDS_COLOR
};
enum diff_submodule_format {
DIFF_SUBMODULE_SHORT = 0,
DIFF_SUBMODULE_LOG,
DIFF_SUBMODULE_INLINE_DIFF
};
/**
* the set of options the calling program wants to affect the operation of
* diffcore library with.
*/
struct diff_options {
const char *orderfile;
/*
* "--rotate-to=<file>" would start showing at <file> and when
* the output reaches the end, wrap around by default.
* Setting skip_instead_of_rotate to true stops the output at the
* end, effectively discarding the earlier part of the output
* before <file>'s diff (this is used to implement the
* "--skip-to=<file>" option).
*
* When rotate_to_strict is set, it is an error if there is no
* <file> in the diff. Otherwise, the output starts at the
* path that is the same as, or first path that sorts after,
* <file>. Because it is unreasonable to require the exact
* match for "git log -p --rotate-to=<file>" (i.e. not all
* commit would touch that single <file>), "git log" sets it
* to false. "git diff" sets it to true to detect an error
* in the command line option.
*/
const char *rotate_to;
int skip_instead_of_rotate;
int rotate_to_strict;
/**
* A constant string (can and typically does contain newlines to look for
* a block of text, not just a single line) to filter out the filepairs
* that do not change the number of strings contained in its preimage and
* postimage of the diff_queue.
*/
const char *pickaxe;
unsigned pickaxe_opts;
/* -I<regex> */
regex_t **ignore_regex;
size_t ignore_regex_nr, ignore_regex_alloc;
const char *single_follow;
const char *a_prefix, *b_prefix;
const char *line_prefix;
size_t line_prefix_length;
/**
* collection of boolean options that affects the operation, but some do
* not have anything to do with the diffcore library.
*/
struct diff_flags flags;
/* diff-filter bits */
unsigned int filter, filter_not;
int use_color;
/* Number of context lines to generate in patch output. */
int context;
int interhunkcontext;
/* Affects the way detection logic for complete rewrites, renames and
* copies.
*/
int break_opt;
int detect_rename;
int irreversible_delete;
int skip_stat_unmatch;
int line_termination;
/* The output format used when `diff_flush()` is run. */
int output_format;
/* Affects the way detection logic for complete rewrites, renames and
* copies.
*/
int rename_score;
int rename_limit;
int needed_rename_limit;
int degraded_cc_to_c;
int show_rename_progress;
int dirstat_permille;
int setup;
/* Number of hexdigits to abbreviate raw format output to. */
int abbrev;
/* If non-zero, then stop computing after this many changes. */
int max_changes;
int ita_invisible_in_index;
/* white-space error highlighting */
#define WSEH_NEW (1<<12)
#define WSEH_CONTEXT (1<<13)
#define WSEH_OLD (1<<14)
unsigned ws_error_highlight;
diff --relative: output paths as relative to the current subdirectory This adds --relative option to the diff family. When you start from a subdirectory: $ git diff --relative shows only the diff that is inside your current subdirectory, and without $prefix part. People who usually live in subdirectories may like it. There are a few things I should also mention about the change: - This works not just with diff but also works with the log family of commands, but the history pruning is not affected. In other words, if you go to a subdirectory, you can say: $ git log --relative -p but it will show the log message even for commits that do not touch the current directory. You can limit it by giving pathspec yourself: $ git log --relative -p . This originally was not a conscious design choice, but we have a way to affect diff pathspec and pruning pathspec independently. IOW "git log --full-diff -p ." tells it to prune history to commits that affect the current subdirectory but show the changes with full context. I think it makes more sense to leave pruning independent from --relative than the obvious alternative of always pruning with the current subdirectory, which would break the symmetry. - Because this works also with the log family, you could format-patch a single change, limiting the effect to your subdirectory, like so: $ cd gitk-git $ git format-patch -1 --relative 911f1eb But because that is a special purpose usage, this option will never become the default, with or without repository or user preference configuration. The risk of producing a partial patch and sending it out by mistake is too great if we did so. - This is inherently incompatible with --no-index, which is a bolted-on hack that does not have much to do with git itself. I didn't bother checking and erroring out on the combined use of the options, but probably I should. Signed-off-by: Junio C Hamano <gitster@pobox.com>
15 years ago
const char *prefix;
int prefix_length;
const char *stat_sep;
int xdl_opts;
/* see Documentation/diff-options.txt */
char **anchors;
size_t anchors_nr, anchors_alloc;
int stat_width;
int stat_name_width;
int stat_graph_width;
int stat_count;
const char *word_regex;
enum diff_words_type word_diff;
enum diff_submodule_format submodule_format;
diffcore: add a pickaxe option to find a specific blob Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or [1]) One might be tempted to extend git-describe to also work with blobs, such that `git describe <blob-id>` gives a description as '<commit-ish>:<path>'. This was implemented at [2]; as seen by the sheer number of responses (>110), it turns out this is tricky to get right. The hard part to get right is picking the correct 'commit-ish' as that could be the commit that (re-)introduced the blob or the blob that removed the blob; the blob could exist in different branches. Junio hinted at a different approach of solving this problem, which this patch implements. Teach the diff machinery another flag for restricting the information to what is shown. For example: $ ./git log --oneline --find-object=v2.0.0:Makefile b2feb64309 Revert the whole "ask curl-config" topic for now 47fbfded53 i18n: only extract comments marked with "TRANSLATORS:" we observe that the Makefile as shipped with 2.0 was appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b. The reason why these commits both occur prior to v2.0.0 are evil merges that are not found using this new mechanism. [1] https://stackoverflow.com/questions/223678/which-commit-has-this-blob [2] https://public-inbox.org/git/20171028004419.10139-1-sbeller@google.com/ Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
struct oidset *objfind;
/* this is set by diffcore for DIFF_FORMAT_PATCH */
int found_changes;
/* to support internal diff recursion by --follow hack*/
int found_follow;
/* Callback which allows tweaking the options in diff_setup_done(). */
void (*set_default)(struct diff_options *);
FILE *file;
int close_file;
#define OUTPUT_INDICATOR_NEW 0
#define OUTPUT_INDICATOR_OLD 1
#define OUTPUT_INDICATOR_CONTEXT 2
char output_indicators[3];
struct pathspec pathspec;
tree-diff: rework diff_tree() to generate diffs for multiparent cases as well Previously diff_tree(), which is now named ll_diff_tree_sha1(), was generating diff_filepair(s) for two trees t1 and t2, and that was usually used for a commit as t1=HEAD~, and t2=HEAD - i.e. to see changes a commit introduces. In Git, however, we have fundamentally built flexibility in that a commit can have many parents - 1 for a plain commit, 2 for a simple merge, but also more than 2 for merging several heads at once. For merges there is a so called combine-diff, which shows diff, a merge introduces by itself, omitting changes done by any parent. That works through first finding paths, that are different to all parents, and then showing generalized diff, with separate columns for +/- for each parent. The code lives in combine-diff.c . There is an impedance mismatch, however, in that a commit could generally have any number of parents, and that while diffing trees, we divide cases for 2-tree diffs and more-than-2-tree diffs. I mean there is no special casing for multiple parents commits in e.g. revision-walker . That impedance mismatch *hurts* *performance* *badly* for generating combined diffs - in "combine-diff: optimize combine_diff_path sets intersection" I've already removed some slowness from it, but from the timings provided there, it could be seen, that combined diffs still cost more than an order of magnitude more cpu time, compared to diff for usual commits, and that would only be an optimistic estimate, if we take into account that for e.g. linux.git there is only one merge for several dozens of plain commits. That slowness comes from the fact that currently, while generating combined diff, a lot of time is spent computing diff(commit,commit^2) just to only then intersect that huge diff to almost small set of files from diff(commit,commit^1). That's because at present, to compute combine-diff, for first finding paths, that "every parent touches", we use the following combine-diff property/definition: D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn) (w.r.t. paths) where D(A,P1...Pn) is combined diff between commit A, and parents Pi and D(A,Pi) is usual two-tree diff Pi..A So if any of that D(A,Pi) is huge, tracting 1 n-parent combine-diff as n 1-parent diffs and intersecting results will be slow. And usually, for linux.git and other topic-based workflows, that D(A,P2) is huge, because, if merge-base of A and P2, is several dozens of merges (from A, via first parent) below, that D(A,P2) will be diffing sum of merges from several subsystems to 1 subsystem. The solution is to avoid computing n 1-parent diffs, and to find changed-to-all-parents paths via scanning A's and all Pi's trees simultaneously, at each step comparing their entries, and based on that comparison, populate paths result, and deduce we could *skip* *recursing* into subdirectories, if at least for 1 parent, sha1 of that dir tree is the same as in A. That would save us from doing significant amount of needless work. Such approach is very similar to what diff_tree() does, only there we deal with scanning only 2 trees simultaneously, and for n+1 tree, the logic is a bit more complex: D(T,P1...Pn) calculation scheme ------------------------------- D(T,P1...Pn) = D(T,P1) ^ ... ^ D(T,Pn) (regarding resulting paths set) D(T,Pj) - diff between T..Pj D(T,P1...Pn) - combined diff from T to parents P1,...,Pn We start from all trees, which are sorted, and compare their entries in lock-step: T P1 Pn - - - |t| |p1| |pn| |-| |--| ... |--| imin = argmin(p1...pn) | | | | | | |-| |--| |--| |.| |. | |. | . . . . . . at any time there could be 3 cases: 1) t < p[imin]; 2) t > p[imin]; 3) t = p[imin]. Schematic deduction of what every case means, and what to do, follows: 1) t < p[imin] -> ∀j t ∉ Pj -> "+t" ∈ D(T,Pj) -> D += "+t"; t↓ 2) t > p[imin] 2.1) ∃j: pj > p[imin] -> "-p[imin]" ∉ D(T,Pj) -> D += ø; ∀ pi=p[imin] pi↓ 2.2) ∀i pi = p[imin] -> pi ∉ T -> "-pi" ∈ D(T,Pi) -> D += "-p[imin]"; ∀i pi↓ 3) t = p[imin] 3.1) ∃j: pj > p[imin] -> "+t" ∈ D(T,Pj) -> only pi=p[imin] remains to investigate 3.2) pi = p[imin] -> investigate δ(t,pi) | | v 3.1+3.2) looking at δ(t,pi) ∀i: pi=p[imin] - if all != ø -> ⎧δ(t,pi) - if pi=p[imin] -> D += ⎨ ⎩"+t" - if pi>p[imin] in any case t↓ ∀ pi=p[imin] pi↓ ~ For comparison, here is how diff_tree() works: D(A,B) calculation scheme ------------------------- A B - - |a| |b| a < b -> a ∉ B -> D(A,B) += +a a↓ |-| |-| a > b -> b ∉ A -> D(A,B) += -b b↓ | | | | a = b -> investigate δ(a,b) a↓ b↓ |-| |-| |.| |.| . . . . ~~~~~~~~ This patch generalizes diff tree-walker to work with arbitrary number of parents as described above - i.e. now there is a resulting tree t, and some parents trees tp[i] i=[0..nparent). The generalization builds on the fact that usual diff D(A,B) is by definition the same as combined diff D(A,[B]), so if we could rework the code for common case and make it be not slower for nparent=1 case, usual diff(t1,t2) generation will not be slower, and multiparent diff tree-walker would greatly benefit generating combine-diff. What we do is as follows: 1) diff tree-walker ll_diff_tree_sha1() is internally reworked to be a paths generator (new name diff_tree_paths()), with each generated path being `struct combine_diff_path` with info for path, new sha1,mode and for every parent which sha1,mode it was in it. 2) From that info, we can still generate usual diff queue with struct diff_filepairs, via "exporting" generated combine_diff_path, if we know we run for nparent=1 case. (see emit_diff() which is now named emit_diff_first_parent_only()) 3) In order for diff_can_quit_early(), which checks DIFF_OPT_TST(opt, HAS_CHANGES)) to work, that exporting have to be happening not in bulk, but incrementally, one diff path at a time. For such consumers, there is a new callback in diff_options introduced: ->pathchange(opt, struct combine_diff_path *) which, if set to !NULL, is called for every generated path. (see new compat ll_diff_tree_sha1() wrapper around new paths generator for setup) 4) The paths generation itself, is reworked from previous ll_diff_tree_sha1() code according to "D(A,P1...Pn) calculation scheme" provided above: On the start we allocate [nparent] arrays in place what was earlier just for one parent tree. then we just generalize loops, and comparison according to the algorithm. Some notes(*): 1) alloca(), for small arrays, is used for "runs not slower for nparent=1 case than before" goal - if we change it to xmalloc()/free() the timings get ~1% worse. For alloca() we use just-introduced xalloca/xalloca_free compatibility wrappers, so it should not be a portability problem. 2) For every parent tree, we need to keep a tag, whether entry from that parent equals to entry from minimal parent. For performance reasons I'm keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ. Not doing so, we'd need to alloca another [nparent] array, which hurts performance. 3) For emitted paths, memory could be reused, if we know the path was processed via callback and will not be needed later. We use efficient hand-made realloc-style path_appendnew(), that saves us from ~1-1.5% of potential additional slowdown. 4) goto(s) are used in several places, as the code executes a little bit faster with lowered register pressure. Also - we should now check for FIND_COPIES_HARDER not only when two entries names are the same, and their hashes are equal, but also for a case, when a path was removed from some of all parents having it. The reason is, if we don't, that path won't be emitted at all (see "a > xi" case), and we'll just skip it, and FIND_COPIES_HARDER wants all paths - with diff or without - to be emitted, to be later analyzed for being copies sources. The new check is only necessary for nparent >1, as for nparent=1 case xmin_eqtotal always =1 =nparent, and a path is always added to diff as removal. ~~~~~~~~ Timings for # without -c, i.e. testing only nparent=1 case `git log --raw --no-abbrev --no-renames` before and after the patch are as follows: navy.git linux.git v3.10..v3.11 before 0.611s 1.889s after 0.619s 1.907s slowdown 1.3% 0.9% This timings show we did no harm to usual diff(tree1,tree2) generation. From the table we can see that we actually did ~1% slowdown, but I think I've "earned" that 1% in the previous patch ("tree-diff: reuse base str(buf) memory on sub-tree recursion", HEAD~~) so for nparent=1 case, net timings stays approximately the same. The output also stayed the same. (*) If we revert 1)-4) to more usual techniques, for nparent=1 case, we'll get ~2-2.5% of additional slowdown, which I've tried to avoid, as "do no harm for nparent=1 case" rule. For linux.git, combined diff will run an order of magnitude faster and appropriate timings will be provided in the next commit, as we'll be taking advantage of the new diff tree-walker for combined-diff generation there. P.S. and combined diff is not some exotic/for-play-only stuff - for example for a program I write to represent Git archives as readonly filesystem, there is initial scan with `git log --reverse --raw --no-abbrev --no-renames -c` to extract log of what was created/changed when, as a result building a map {} sha1 -> in which commit (and date) a content was added that `-c` means also show combined diff for merges, and without them, if a merge is non-trivial (merges changes from two parents with both having separate changes to a file), or an evil one, the map will not be full, i.e. some valid sha1 would be absent from it. That case was my initial motivation for combined diffs speedup. Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>
9 years ago
pathchange_fn_t pathchange;
change_fn_t change;
add_remove_fn_t add_remove;
revision: quit pruning diff more quickly when possible When the revision traversal machinery is given a pathspec, we must compute the parent-diff for each commit to determine which ones are TREESAME. We set the QUICK diff flag to avoid looking at more entries than we need; we really just care whether there are any changes at all. But there is one case where we want to know a bit more: if --remove-empty is set, we care about finding cases where the change consists only of added entries (in which case we may prune the parent in try_to_simplify_commit()). To cover that case, our file_add_remove() callback does not quit the diff upon seeing an added entry; it keeps looking for other types of entries. But this means when --remove-empty is not set (and it is not by default), we compute more of the diff than is necessary. You can see this in a pathological case where a commit adds a very large number of entries, and we limit based on a broad pathspec. E.g.: perl -e ' chomp(my $blob = `git hash-object -w --stdin </dev/null`); for my $a (1..1000) { for my $b (1..1000) { print "100644 $blob\t$a/$b\n"; } } ' | git update-index --index-info git commit -qm add git rev-list HEAD -- . This case takes about 100ms now, but after this patch only needs 6ms. That's not a huge improvement, but it's easy to get and it protects us against even more pathological cases (e.g., going from 1 million to 10 million files would take ten times as long with the current code, but not increase at all after this patch). This is reported to minorly speed-up pathspec limiting in real world repositories (like the 100-million-file Windows repository), but probably won't make a noticeable difference outside of pathological setups. This patch actually covers the case without --remove-empty, and the case where we see only deletions. See the in-code comment for details. Note that we have to add a new member to the diff_options struct so that our callback can see the value of revs->remove_empty_trees. This callback parameter could be passed to the "add_remove" and "change" callbacks, but there's not much point. They already receive the diff_options struct, and doing it this way avoids having to update the function signature of the other callbacks (arguably the format_callback and output_prefix functions could benefit from the same simplification). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
void *change_fn_data;
diff_format_fn_t format_callback;
void *format_callback_data;
diff_prefix_fn_t output_prefix;
void *output_prefix_data;
int diff_path_counter;
diff.c: buffer all output if asked to Introduce a new option 'emitted_symbols' in the struct diff_options which controls whether all output is buffered up until all output is available. It is set internally in diff.c when necessary. We'll have a new struct 'emitted_string' in diff.c which will be used to buffer each line. The emitted_string will duplicate the memory of the line to buffer as that is easiest to reason about for now. In a future patch we may want to decrease the memory usage by not duplicating all output for buffering but rather we may want to store offsets into the file or in case of hunk descriptions such as the similarity score, we could just store the relevant number and reproduce the text later on. This approach was chosen as a first step because it is quite simple compared to the alternative with less memory footprint. emit_diff_symbol factors out the emission part and depending on the diff_options->emitted_symbols the emission will be performed directly when calling emit_diff_symbol or after the whole process is done, i.e. by buffering we have add the possibility for a second pass over the whole output before doing the actual output. In 6440d34 (2012-03-14, diff: tweak a _copy_ of diff_options with word-diff) we introduced a duplicate diff options struct for word emissions as we may have different regex settings in there. When buffering the output, we need to operate on just one buffer, so we have to copy back the emissions of the word buffer into the main buffer. Unconditionally enable output via buffer in this patch as it yields a great opportunity for testing, i.e. all the diff tests from the test suite pass without having reordering issues (i.e. only parts of the output got buffered, and we forgot to buffer other parts). The test suite passes, which gives confidence that we converted all functions to use emit_string for output. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
struct emitted_diff_symbols *emitted_symbols;
diff.c: color moved lines differently When a patch consists mostly of moving blocks of code around, it can be quite tedious to ensure that the blocks are moved verbatim, and not undesirably modified in the move. To that end, color blocks that are moved within the same patch differently. For example (OM, del, add, and NM are different colors): [OM] -void sensitive_stuff(void) [OM] -{ [OM] - if (!is_authorized_user()) [OM] - die("unauthorized"); [OM] - sensitive_stuff(spanning, [OM] - multiple, [OM] - lines); [OM] -} void another_function() { [del] - printf("foo"); [add] + printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NM] + if (!is_authorized_user()) [NM] + die("unauthorized"); [NM] + sensitive_stuff(spanning, [NM] + multiple, [NM] + lines); [NM] +} However adjacent blocks may be problematic. For example, in this potentially malicious patch, the swapping of blocks can be spotted: [OM] -void sensitive_stuff(void) [OM] -{ [OMA] - if (!is_authorized_user()) [OMA] - die("unauthorized"); [OM] - sensitive_stuff(spanning, [OM] - multiple, [OM] - lines); [OMA] -} void another_function() { [del] - printf("foo"); [add] + printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NMA] + sensitive_stuff(spanning, [NMA] + multiple, [NMA] + lines); [NM] + if (!is_authorized_user()) [NM] + die("unauthorized"); [NMA] +} If the moved code is larger, it is easier to hide some permutation in the code, which is why some alternative coloring is needed. This patch implements the first mode: * basic alternating 'Zebra' mode This conveys all information needed to the user. Defer customization to later patches. First I implemented an alternative design, which would try to fingerprint a line by its neighbors to detect if we are in a block or at the boundary. This idea iss error prone as it inspected each line and its neighboring lines to determine if the line was (a) moved and (b) if was deep inside a hunk by having matching neighboring lines. This is unreliable as the we can construct hunks which have equal neighbors that just exceed the number of lines inspected. (Think of 'AXYZBXYZCXYZD..' with each letter as a line, that is permutated to AXYZCXYZBXYZD..'). Instead this provides a dynamic programming greedy algorithm that finds the largest moved hunk and then has several modes on highlighting bounds. A note on the options '--submodule=diff' and '--color-words/--word-diff': In the conversion to use emit_line in the prior patches both submodules as well as word diff output carefully chose to call emit_line with sign=0. All output with sign=0 is ignored for move detection purposes in this patch, such that no weird looking output will be generated for these cases. This leads to another thought: We could pass on '--color-moved' to submodules such that they color up moved lines for themselves. If we'd do so only line moves within a repository boundary are marked up. It is useful to have moved lines colored, but there are annoying corner cases, such as a single line moved, that is very common. For example in a typical patch of C code, we have closing braces that end statement blocks or functions. While it is technically true that these lines are moved as they show up elsewhere, it is harmful for the review as the reviewers attention is drawn to such a minor side annoyance. For now let's have a simple solution of hardcoding the number of moved lines to be at least 3 before coloring them. Note, that the length is applied across all blocks to find the 'lonely' blocks that pollute new code, but do not interfere with a permutated block where each permutation has less lines than 3. Helped-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
enum {
COLOR_MOVED_NO = 0,
COLOR_MOVED_PLAIN = 1,
COLOR_MOVED_BLOCKS = 2,
COLOR_MOVED_ZEBRA = 3,
COLOR_MOVED_ZEBRA_DIM = 4,
diff.c: color moved lines differently When a patch consists mostly of moving blocks of code around, it can be quite tedious to ensure that the blocks are moved verbatim, and not undesirably modified in the move. To that end, color blocks that are moved within the same patch differently. For example (OM, del, add, and NM are different colors): [OM] -void sensitive_stuff(void) [OM] -{ [OM] - if (!is_authorized_user()) [OM] - die("unauthorized"); [OM] - sensitive_stuff(spanning, [OM] - multiple, [OM] - lines); [OM] -} void another_function() { [del] - printf("foo"); [add] + printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NM] + if (!is_authorized_user()) [NM] + die("unauthorized"); [NM] + sensitive_stuff(spanning, [NM] + multiple, [NM] + lines); [NM] +} However adjacent blocks may be problematic. For example, in this potentially malicious patch, the swapping of blocks can be spotted: [OM] -void sensitive_stuff(void) [OM] -{ [OMA] - if (!is_authorized_user()) [OMA] - die("unauthorized"); [OM] - sensitive_stuff(spanning, [OM] - multiple, [OM] - lines); [OMA] -} void another_function() { [del] - printf("foo"); [add] + printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NMA] + sensitive_stuff(spanning, [NMA] + multiple, [NMA] + lines); [NM] + if (!is_authorized_user()) [NM] + die("unauthorized"); [NMA] +} If the moved code is larger, it is easier to hide some permutation in the code, which is why some alternative coloring is needed. This patch implements the first mode: * basic alternating 'Zebra' mode This conveys all information needed to the user. Defer customization to later patches. First I implemented an alternative design, which would try to fingerprint a line by its neighbors to detect if we are in a block or at the boundary. This idea iss error prone as it inspected each line and its neighboring lines to determine if the line was (a) moved and (b) if was deep inside a hunk by having matching neighboring lines. This is unreliable as the we can construct hunks which have equal neighbors that just exceed the number of lines inspected. (Think of 'AXYZBXYZCXYZD..' with each letter as a line, that is permutated to AXYZCXYZBXYZD..'). Instead this provides a dynamic programming greedy algorithm that finds the largest moved hunk and then has several modes on highlighting bounds. A note on the options '--submodule=diff' and '--color-words/--word-diff': In the conversion to use emit_line in the prior patches both submodules as well as word diff output carefully chose to call emit_line with sign=0. All output with sign=0 is ignored for move detection purposes in this patch, such that no weird looking output will be generated for these cases. This leads to another thought: We could pass on '--color-moved' to submodules such that they color up moved lines for themselves. If we'd do so only line moves within a repository boundary are marked up. It is useful to have moved lines colored, but there are annoying corner cases, such as a single line moved, that is very common. For example in a typical patch of C code, we have closing braces that end statement blocks or functions. While it is technically true that these lines are moved as they show up elsewhere, it is harmful for the review as the reviewers attention is drawn to such a minor side annoyance. For now let's have a simple solution of hardcoding the number of moved lines to be at least 3 before coloring them. Note, that the length is applied across all blocks to find the 'lonely' blocks that pollute new code, but do not interfere with a permutated block where each permutation has less lines than 3. Helped-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
} color_moved;
#define COLOR_MOVED_DEFAULT COLOR_MOVED_ZEBRA
#define COLOR_MOVED_MIN_ALNUM_COUNT 20
diff.c: add white space mode to move detection that allows indent changes The option of --color-moved has proven to be useful as observed on the mailing list. However when refactoring sometimes the indentation changes, for example when partitioning a functions into smaller helper functions the code usually mostly moved around except for a decrease in indentation. To just review the moved code ignoring the change in indentation, a mode to ignore spaces in the move detection as implemented in a previous patch would be enough. However the whole move coloring as motivated in commit 2e2d5ac (diff.c: color moved lines differently, 2017-06-30), brought up the notion of the reviewer being able to trust the move of a "block". As there are languages such as python, which depend on proper relative indentation for the control flow of the program, ignoring any white space change in a block would not uphold the promises of 2e2d5ac that allows reviewers to pay less attention to the inside of a block, as inside the reviewer wants to assume the same program flow. This new mode of white space ignorance will take this into account and will only allow the same white space changes per line in each block. This patch even allows only for the same change at the beginning of the lines. As this is a white space mode, it is made exclusive to other white space modes in the move detection. This patch brings some challenges, related to the detection of blocks. We need a wide net to catch the possible moved lines, but then need to narrow down to check if the blocks are still intact. Consider this example (ignoring block sizes): - A - B - C + A + B + C At the beginning of a block when checking if there is a counterpart for A, we have to ignore all space changes. However at the following lines we have to check if the indent change stayed the same. Checking if the indentation change did stay the same, is done by computing the indentation change by the difference in line length, and then assume the change is only in the beginning of the longer line, the common tail is the same. That is why the test contains lines like: - <TAB> A ... + A <TAB> ... As the first line starting a block is caught using a compare function that ignores white spaces unlike the rest of the block, where the white space delta is taken into account for the comparison, we also have to think about the following situation: - A - B - A - B + A + B + A + B When checking if the first A (both in the + and - lines) is a start of a block, we have to check all 'A' and record all the white space deltas such that we can find the example above to be just one block that is indented. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
4 years ago
/* XDF_WHITESPACE_FLAGS regarding block detection are set at 2, 3, 4 */
#define COLOR_MOVED_WS_ALLOW_INDENTATION_CHANGE (1<<5)
#define COLOR_MOVED_WS_ERROR (1<<0)
unsigned color_moved_ws_handling;
struct repository *repo;
struct option *parseopts;
struct strmap *additional_path_headers;
diff: add an API for deferred freeing Add a diff_free() function to free anything we may have allocated in the "diff_options" struct, and the ability to make calling it a noop by setting "no_free" in "diff_options". This is required because when e.g. "git diff" is run we'll allocate things in that struct, use the diff machinery once, and then exit. But if we run e.g. "git log -p" we're going to re-use what we allocated across multiple diff_flush() calls, and only want to free things at the end. We've thus ended up with features like the recently added "diff -I"[1] where we'll leak memory. As it turns out it could have simply used the pattern established in 6ea57703f6 (log: prepare log/log-tree to reuse the diffopt.close_file attribute, 2016-06-22). Manually adding more such flags to things log_tree_commit() every time we need to allocate something would be tedious. Let's instead move that fclose() code it to a new diff_free(), in anticipation of freeing more things in that function in follow-up commits. Some functions such as log_tree_commit() need an idiom of optionally retaining a previous "no_free", as they may either free the memory themselves, or their caller may do so. I'm keeping that idiom in log_show_early() for good measure, even though I don't think it's currently called in this manner. It also gets passed an existing "struct rev_info", so future callers may want to set the "no_free" flag. This change is a bit hard to read because while the freeing pattern we're introducing isn't unusual, the "file" member is a special snowflake. We usually don't want to fclose() it. This is because "file" is usually stdout, in which case we don't want to fclose() it. We only want to opt-in to closing it when we e.g. open a file on the filesystem. Thus the opt-in "close_file" flag. So the API in general just needs a "no_free" flag to defer freeing, but the "file" member still needs its "close_file" flag. This is made more confusing because while refactoring this code we could replace some "close_file=0" with "no_free=1", whereas others need to set both flags. This is because there were some cases where an existing "close_file=0" meant "let's defer deallocation", and others where it meant "we don't want to close this file handle at all". 1. 296d4a94e7 (diff: add -I<regex> that ignores matching changes, 2020-10-20) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2 years ago
int no_free;
};
unsigned diff_filter_bit(char status);
void diff_emit_submodule_del(struct diff_options *o, const char *line);
void diff_emit_submodule_add(struct diff_options *o, const char *line);
void diff_emit_submodule_untracked(struct diff_options *o, const char *path);
void diff_emit_submodule_modified(struct diff_options *o, const char *path);
void diff_emit_submodule_header(struct diff_options *o, const char *header);
void diff_emit_submodule_error(struct diff_options *o, const char *err);
void diff_emit_submodule_pipethrough(struct diff_options *o,
const char *line, int len);
struct diffstat_t {
int nr;
int alloc;
struct diffstat_file {
char *from_name;
char *name;
char *print_name;
const char *comments;
unsigned is_unmerged:1;
unsigned is_binary:1;
unsigned is_renamed:1;
unsigned is_interesting:1;
uintmax_t added, deleted;
} **files;
};
enum color_diff {
DIFF_RESET = 0,
DIFF_CONTEXT = 1,
DIFF_METAINFO = 2,
DIFF_FRAGINFO = 3,
DIFF_FILE_OLD = 4,
DIFF_FILE_NEW = 5,
DIFF_COMMIT = 6,
DIFF_WHITESPACE = 7,
diff.c: color moved lines differently When a patch consists mostly of moving blocks of code around, it can be quite tedious to ensure that the blocks are moved verbatim, and not undesirably modified in the move. To that end, color blocks that are moved within the same patch differently. For example (OM, del, add, and NM are different colors): [OM] -void sensitive_stuff(void) [OM] -{ [OM] - if (!is_authorized_user()) [OM] - die("unauthorized"); [OM] - sensitive_stuff(spanning, [OM] - multiple, [OM] - lines); [OM] -} void another_function() { [del] - printf("foo"); [add] + printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NM] + if (!is_authorized_user()) [NM] + die("unauthorized"); [NM] + sensitive_stuff(spanning, [NM] + multiple, [NM] + lines); [NM] +} However adjacent blocks may be problematic. For example, in this potentially malicious patch, the swapping of blocks can be spotted: [OM] -void sensitive_stuff(void) [OM] -{ [OMA] - if (!is_authorized_user()) [OMA] - die("unauthorized"); [OM] - sensitive_stuff(spanning, [OM] - multiple, [OM] - lines); [OMA] -} void another_function() { [del] - printf("foo"); [add] + printf("bar"); } [NM] +void sensitive_stuff(void) [NM] +{ [NMA] + sensitive_stuff(spanning, [NMA] + multiple, [NMA] + lines); [NM] + if (!is_authorized_user()) [NM] + die("unauthorized"); [NMA] +} If the moved code is larger, it is easier to hide some permutation in the code, which is why some alternative coloring is needed. This patch implements the first mode: * basic alternating 'Zebra' mode This conveys all information needed to the user. Defer customization to later patches. First I implemented an alternative design, which would try to fingerprint a line by its neighbors to detect if we are in a block or at the boundary. This idea iss error prone as it inspected each line and its neighboring lines to determine if the line was (a) moved and (b) if was deep inside a hunk by having matching neighboring lines. This is unreliable as the we can construct hunks which have equal neighbors that just exceed the number of lines inspected. (Think of 'AXYZBXYZCXYZD..' with each letter as a line, that is permutated to AXYZCXYZBXYZD..'). Instead this provides a dynamic programming greedy algorithm that finds the largest moved hunk and then has several modes on highlighting bounds. A note on the options '--submodule=diff' and '--color-words/--word-diff': In the conversion to use emit_line in the prior patches both submodules as well as word diff output carefully chose to call emit_line with sign=0. All output with sign=0 is ignored for move detection purposes in this patch, such that no weird looking output will be generated for these cases. This leads to another thought: We could pass on '--color-moved' to submodules such that they color up moved lines for themselves. If we'd do so only line moves within a repository boundary are marked up. It is useful to have moved lines colored, but there are annoying corner cases, such as a single line moved, that is very common. For example in a typical patch of C code, we have closing braces that end statement blocks or functions. While it is technically true that these lines are moved as they show up elsewhere, it is harmful for the review as the reviewers attention is drawn to such a minor side annoyance. For now let's have a simple solution of hardcoding the number of moved lines to be at least 3 before coloring them. Note, that the length is applied across all blocks to find the 'lonely' blocks that pollute new code, but do not interfere with a permutated block where each permutation has less lines than 3. Helped-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
5 years ago
DIFF_FUNCINFO = 8,
DIFF_FILE_OLD_MOVED = 9,
DIFF_FILE_OLD_MOVED_ALT = 10,
DIFF_FILE_OLD_MOVED_DIM = 11,
DIFF_FILE_OLD_MOVED_ALT_DIM = 12,
DIFF_FILE_NEW_MOVED = 13,
DIFF_FILE_NEW_MOVED_ALT = 14,
DIFF_FILE_NEW_MOVED_DIM = 15,
range-diff: use dim/bold cues to improve dual color mode It *is* a confusing thing to look at a diff of diffs. All too easy is it to mix up whether the -/+ markers refer to the "inner" or the "outer" diff, i.e. whether a `+` indicates that a line was added by either the old or the new diff (or both), or whether the new diff does something different than the old diff. To make things easier to process for normal developers, we introduced the dual color mode which colors the lines according to the commit diff, i.e. lines that are added by a commit (whether old, new, or both) are colored in green. In non-dual color mode, the lines would be colored according to the outer diff: if the old commit added a line, it would be colored red (because that line addition is only present in the first commit range that was specified on the command-line, i.e. the "old" commit, but not in the second commit range, i.e. the "new" commit). However, this dual color mode is still not making things clear enough, as we are looking at two levels of diffs, and we still only pick a color according to *one* of them (the outer diff marker is colored differently, of course, but in particular with deep indentation, it is easy to lose track of that outer diff marker's background color). Therefore, let's add another dimension to the mix. Still use green/red/normal according to the commit diffs, but now also dim the lines that were only in the old commit, and use bold face for the lines that are only in the new commit. That way, it is much easier not to lose track of, say, when we are looking at a line that was added in the previous iteration of a patch series but the new iteration adds a slightly different version: the obsolete change will be dimmed, the current version of the patch will be bold. At least this developer has a much easier time reading the range-diffs that way. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
4 years ago
DIFF_FILE_NEW_MOVED_ALT_DIM = 16,
DIFF_CONTEXT_DIM = 17,
DIFF_FILE_OLD_DIM = 18,
DIFF_FILE_NEW_DIM = 19,
DIFF_CONTEXT_BOLD = 20,
DIFF_FILE_OLD_BOLD = 21,
DIFF_FILE_NEW_BOLD = 22,
};
const char *diff_get_color(int diff_use_color, enum color_diff ix);
#define diff_get_color_opt(o, ix) \
diff_get_color((o)->use_color, ix)