2020-10-27 03:08:07 +01:00
/*
 * "Ostensibly Recursive's Twin" merge strategy, or "ort" for short. Meant
 * as a drop-in replacement for the "recursive" merge strategy, allowing one
 * to replace
 *
 *   git merge [-s recursive]
 *
 * with
 *
 *   git merge -s ort
 *
 * Note: git's parser allows the space between '-s' and its argument to be
 * missing. (Should I have backronymed "ham", "alsa", "kip", "nap", "alvo",
 * "cale", "peedy", or "ins" instead of "ort"?)
 */

#include "cache.h"
#include "merge-ort.h"

2020-12-16 23:28:00 +01:00
#include "alloc.h"
2021-03-20 01:03:45 +01:00
#include "attr.h"
2020-12-03 16:59:40 +01:00
#include "blob.h"
2020-12-13 09:04:26 +01:00
#include "cache-tree.h"
2020-12-16 23:28:00 +01:00
#include "commit.h"
2020-12-03 16:59:40 +01:00
#include "commit-reach.h"
merge-ort: port merge_start() from merge-recursive

merge_start() basically does a bunch of sanity checks, then allocates
and initializes opt->priv -- a struct merge_options_internal.

Most of the sanity checks are usable as-is. The
allocation/initialization is a bit different since merge-ort has a very
different merge_options_internal than merge-recursive, but the idea is
the same.

The weirdest part here is that merge-ort and merge-recursive use the
same struct merge_options, even though merge_options has a number of
fields that are oddly specific to merge-recursive's internal
implementation and don't even make sense with merge-ort's high-level
design (e.g. buffer_output, which merge-ort has to always do). I reused
the same data structure because:
  * most of the fields made sense to both merge algorithms
  * making a new struct would have required making new enums or somehow
    externalizing them, and that was getting messy.
  * it simplifies converting the existing callers by not having to
    have different code paths for merge_options setup.

I also marked detect_renames as ignored. We can revisit that later, but
in short: merge-recursive allowed turning off rename detection because
it was sometimes glacially slow. When you speed something up by a few
orders of magnitude, it's worth revisiting whether that justification is
still relevant. Besides, if folks find it's still too slow, perhaps
they have a better scaling case than I could find and maybe it turns up
some more optimizations we can add. If it still is needed as an option,
it is easy to add later.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-13 09:04:10 +01:00
#include "diff.h"
#include "diffcore.h"
2020-12-13 09:04:24 +01:00
#include "dir.h"
2021-04-16 22:53:34 +02:00
#include "entry.h"
2021-01-01 03:34:44 +01:00
#include "ll-merge.h"
2020-12-13 09:04:21 +01:00
#include "object-store.h"
merge-ort: add prefetching for content merges

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)
introduced batching of fetching missing blobs, so that the diff
machinery would have one fetch subprocess grab N blobs instead of N
processes each grabbing 1.

However, the diff machinery is not the only thing in a merge that needs
to work on blobs. The 3-way content merges need them as well. Rather
than download all the blobs 1 at a time, prefetch all the blobs needed
for regular content merges.

This does not cover all possible paths in merge-ort that might need to
download blobs. Others include:
  - The blob_unchanged() calls to avoid modify/delete conflicts (when
    blob renormalization results in an "unchanged" file)
  - Preliminary content merges needed for rename/add and
    rename/rename(2to1) style conflicts. (Both of these types of
    conflicts can result in nested conflict markers from the need to do
    two levels of content merging; the first happens before our new
    prefetch_for_content_merges() function.)

The first of these wouldn't be an extreme amount of work to support, and
even the second could be theoretically supported in batching, but all of
these cases seem unusual to me, and this is a minor performance
optimization anyway; in the worst case we only get some of the fetches
batched and have a few additional one-off fetches. So for now, just
handle the regular 3-way content merges in our prefetching.

For the testcase from the previous commit, the number of downloaded
objects remains at 63, but this drops the number of fetches needed from
32 down to 20, a sizeable reduction.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-22 10:04:41 +02:00
#include "promisor-remote.h"
2021-01-01 03:34:47 +01:00
#include "revision.h"
2020-12-13 09:04:08 +01:00
#include "strmap.h"
2021-09-09 20:47:29 +02:00
#include "submodule-config.h"
2021-01-01 03:34:45 +01:00
#include "submodule.h"
2020-12-13 09:04:09 +01:00
#include "tree.h"
2020-12-13 09:04:24 +01:00
#include "unpack-trees.h"
merge-ort: use histogram diff

In my cursory investigation, histogram diffs are about 2% slower than
Myers diffs. Others have probably done more detailed benchmarks. But,
in short, histogram diffs have been around for years and in a number of
cases provide obviously better looking diffs where Myers diffs are
unintelligible, but the performance hit has kept them from becoming the
default.

However, there are real merge bugs we know about that have triggered on
git.git and linux.git, which I don't have a clue how to address without
the additional information that I believe is provided by histogram
diffs. See the following:

  https://lore.kernel.org/git/20190816184051.GB13894@sigill.intra.peff.net/
  https://lore.kernel.org/git/CABPp-BHvJHpSJT7sdFwfNcPn_sOXwJi3=o14qjZS3M8Rzcxe2A@mail.gmail.com/
  https://lore.kernel.org/git/CABPp-BGtez4qjbtFT1hQoREfcJPmk9MzjhY5eEq1QhXT23tFOw@mail.gmail.com/

I don't like mismerges. I really don't like silent mismerges. While I
am sometimes willing to make performance and correctness tradeoffs, I'm
much more interested in correctness in general. I want to fix the above
bugs. I have not yet started doing so, but I believe histogram diff at
least gives me an angle. Unfortunately, I can't rely on using the
information from histogram diff unless it's in use. And it hasn't been
used because of a performance hit of a few percent.

In testcases I have looked at, merge-ort is _much_ faster than
merge-recursive for non-trivial merges/rebases/cherry-picks. As such,
this is a golden opportunity to switch out the underlying diff algorithm
(at least the one used by the merge machinery; git-diff and git-log are
separate questions); doing so will allow me to get additional data and
improved diffs, and I believe it will help me fix the above bugs at some
point in the future.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-13 09:04:11 +01:00
#include "xdiff-interface.h"
2020-12-13 09:04:08 +01:00

2020-12-13 09:04:13 +01:00
/*
 * We have many arrays of size 3. Whenever we have such an array, the
 * indices refer to one of the sides of the three-way merge. This is so
 * pervasive that the constants 0, 1, and 2 are used in many places in the
 * code (especially in arithmetic operations to find the other side's index
 * or to compute a relevant mask), but sometimes these enum names are used
 * to aid code clarity.
 *
 * See also 'filemask' and 'dirmask' in struct conflict_info; the "ith side"
 * referred to there is one of these three sides.
 */
enum merge_side {
	MERGE_BASE = 0,
	MERGE_SIDE1 = 1,
	MERGE_SIDE2 = 2
};

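The index arithmetic mentioned in the comment on enum merge_side can be sketched roughly as follows. This is only an illustration: the helpers `other_side()` and `side_bit()` are hypothetical names that do not appear in merge-ort.c, and the sketch assumes that bit i of a mask stands for the i-th side.

```c
#include <assert.h>

enum merge_side {
	MERGE_BASE = 0,
	MERGE_SIDE1 = 1,
	MERGE_SIDE2 = 2
};

/* Hypothetical helper: map MERGE_SIDE1 <-> MERGE_SIDE2 via 3 - side. */
static int other_side(int side)
{
	assert(side == MERGE_SIDE1 || side == MERGE_SIDE2);
	return 3 - side;
}

/* Hypothetical helper: bit i of a mask stands for side i (assumption). */
static unsigned side_bit(int side)
{
	assert(side >= MERGE_BASE && side <= MERGE_SIDE2);
	return 1u << side;
}
```

Under these assumptions, `side_bit(MERGE_SIDE1) | side_bit(MERGE_SIDE2)` would describe a path present on both sides but absent from the merge base.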
merge-ort: avoid accidental API mis-use

Previously, callers of the merge-ort API could have passed an
uninitialized value for struct merge_result *result. However, we want
to check result to see if it has cached renames from a previous merge
that we can reuse; such values would be found behind result->priv.
However, if result->priv is uninitialized, attempting to access behind
it will give a segfault. So, we need result->priv to be NULL (which
will be the case if the caller does a memset(&result, 0)), or be written
by a previous call to the merge-ort machinery. Documenting this
requirement may help, but despite being the person who introduced this
requirement, I still missed it once and it did not fail in a very clear
way and led to a long debugging session.

Add a _properly_initialized field to merge_result; that value will be
0 if the caller zero'ed the merge_result, it will be set to a very
specific value by a previous run by the merge-ort machinery, and if it's
uninitialized it will most likely either be 0 or some value that does
not match the specific one we'd expect, allowing us to throw a much more
meaningful error.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20 08:09:37 +02:00
static unsigned RESULT_INITIALIZED = 0x1abe11ed; /* unlikely accidental value */

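A minimal sketch of how such a sentinel might be checked is shown below. It is not the code merge-ort uses: `struct merge_result_sketch` and `result_is_usable()` are hypothetical stand-ins, and only the two fields named in the commit message (priv and _properly_initialized) are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RESULT_INITIALIZED 0x1abe11edu /* unlikely accidental value */

/* Simplified stand-in for struct merge_result; only two fields shown. */
struct merge_result_sketch {
	void *priv;
	unsigned _properly_initialized;
};

/*
 * Return 1 if the struct was zeroed by the caller or stamped by a
 * previous run, 0 if the field holds any other (garbage) value.
 */
static int result_is_usable(const struct merge_result_sketch *result)
{
	assert(result != NULL);
	return result->_properly_initialized == 0 ||
	       result->_properly_initialized == RESULT_INITIALIZED;
}
```

A caller that zeroes the struct with memset() passes the check, as does one reusing a struct stamped by an earlier run; anything else can be reported with a clear error instead of a segfault.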
2021-03-11 01:38:26 +01:00
struct traversal_callback_data {
	unsigned long mask;
	unsigned long dirmask;
	struct name_entry names[3];
};

merge-ort: add data structures for allowable trivial directory resolves

As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match. We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.

If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early. Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.

In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present. Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources. However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.

Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources. Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.

Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.

Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-16 07:22:33 +02:00
struct deferred_traversal_data {
	/*
	 * possible_trivial_merges: directories to be explored only when needed
	 *
	 * possible_trivial_merges is a map of directory names to
	 * dir_rename_mask. When we detect that a directory is unchanged on
	 * one side, we can sometimes resolve the directory without recursing
	 * into it. Renames are the only things that can prevent such an
	 * optimization. However, for rename sources:
	 *   - If no parent directory needed directory rename detection, then
	 *     no path under such a directory can be a relevant_source.
	 * and for rename destinations:
	 *   - If no cached rename has a target path under the directory AND
	 *   - If there are no unpaired relevant_sources elsewhere in the
	 *     repository
	 * then we don't need any path under this directory for a rename
	 * destination. The only way to know the last item above is to defer
	 * handling such directories until the end of collect_merge_info(),
	 * in handle_deferred_entries().
	 *
	 * For each we store dir_rename_mask, since that's the only bit of
	 * information we need, other than the path, to resume the recursive
	 * traversal.
	 */
	struct strintmap possible_trivial_merges;

	/*
	 * trivial_merges_okay: if trivial directory merges are okay
	 *
	 * See possible_trivial_merges above. The "no unpaired
	 * relevant_sources elsewhere in the repository" is a single boolean
	 * per merge side, which we store here. Note that while 0 means no,
	 * 1 only means "maybe" rather than "yes"; we optimistically set it
	 * to 1 initially and only clear when we determine it is unsafe to
	 * do trivial directory merges.
	 */
	unsigned trivial_merges_okay;

	/*
	 * target_dirs: ancestor directories of rename targets
	 *
	 * target_dirs contains all directory names that are an ancestor of
	 * any rename destination.
	 */
	struct strset target_dirs;
};

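The "optimistic until proven unsafe" handling of trivial_merges_okay described in the comment can be sketched like this. The sketch assumes one instance of the data per merge side, kept in an array of size 3; `struct deferral_sketch`, `deferral_init()` and `deferral_saw_unpaired_source()` are hypothetical names, and the real conditions for clearing the flag are more involved than shown here.

```c
#include <assert.h>

/* Simplified stand-in: one entry per merge side (assumption). */
struct deferral_sketch {
	unsigned trivial_merges_okay; /* 0 = no, 1 = maybe */
};

/* Start optimistic: trivial directory merges might be fine everywhere. */
static void deferral_init(struct deferral_sketch d[3])
{
	int side;

	for (side = 0; side < 3; side++)
		d[side].trivial_merges_okay = 1;
}

/* Hypothetical hook: an unpaired relevant source was seen on this side. */
static void deferral_saw_unpaired_source(struct deferral_sketch d[3], int side)
{
	assert(side == 1 || side == 2);
	d[side].trivial_merges_okay = 0;
}
```

The flag is only ever cleared, never set again, which matches the "1 only means maybe" wording: a side that still holds 1 at the end of the traversal has simply not been ruled out.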
2020-12-14 17:21:30 +01:00
struct rename_info {
2021-01-07 22:35:49 +01:00
	/*
	 * All variables that are arrays of size 3 correspond to data tracked
	 * for the sides in enum merge_side. Index 0 is almost always unused
	 * because we often only need to track information for MERGE_SIDE1 and
	 * MERGE_SIDE2 (MERGE_BASE can't have rename information since renames
	 * are determined relative to what changed since the MERGE_BASE).
	 */

2020-12-14 17:21:30 +01:00
	/*
	 * pairs: pairing of filenames from diffcore_rename()
	 */
	struct diff_queue_struct pairs[3];

2021-01-07 22:35:49 +01:00
	/*
	 * dirs_removed: directories removed on a given side of history.
2021-03-13 23:22:03 +01:00
	 *
	 * The keys of dirs_removed[side] are the directories that were removed
	 * on the given side of history. The value of the strintmap for each
	 * directory is a value from enum dir_rename_relevance.
2021-01-07 22:35:49 +01:00
	 */
2021-03-13 23:22:02 +01:00
	struct strintmap dirs_removed[3];
2021-01-07 22:35:49 +01:00

	/*
	 * dir_rename_count: tracking where parts of a directory were renamed to
	 *
	 * When files in a directory are renamed, they may not all go to the
	 * same location. Each strmap here tracks:
	 *      old_dir => {new_dir => int}
	 * That is, dir_rename_count[side] is a strmap to a strintmap.
	 */
	struct strmap dir_rename_count[3];

	/*
	 * dir_renames: computed directory renames
	 *
	 * This is a map of old_dir => new_dir and is derived in part from
	 * dir_rename_count.
	 */
	struct strmap dir_renames[3];

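To make the old_dir => {new_dir => int} shape of dir_rename_count more concrete, here is a small sketch using plain arrays instead of git's strmap and strintmap. All names (`struct rename_counts`, `bump_count()`, `pick_dir_rename()`) are hypothetical. The selection rule, where the new directory with the strictly highest count wins, is an assumption; the comments above only say that dir_renames is derived in part from dir_rename_count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TARGETS 8

/* One old_dir entry: how many files were renamed into each new_dir. */
struct rename_counts {
	const char *old_dir;
	const char *new_dir[MAX_TARGETS];
	int count[MAX_TARGETS];
	size_t nr;
};

/* Record that one file moved from rc->old_dir into new_dir. */
static void bump_count(struct rename_counts *rc, const char *new_dir)
{
	size_t i;

	for (i = 0; i < rc->nr; i++) {
		if (!strcmp(rc->new_dir[i], new_dir)) {
			rc->count[i]++;
			return;
		}
	}
	assert(rc->nr < MAX_TARGETS);
	rc->new_dir[rc->nr] = new_dir;
	rc->count[rc->nr] = 1;
	rc->nr++;
}

/* Assumed rule: strictly highest count wins; NULL on a tie or no data. */
static const char *pick_dir_rename(const struct rename_counts *rc)
{
	const char *best = NULL;
	int best_count = 0, tie = 0;
	size_t i;

	for (i = 0; i < rc->nr; i++) {
		if (rc->count[i] > best_count) {
			best_count = rc->count[i];
			best = rc->new_dir[i];
			tie = 0;
		} else if (rc->count[i] == best_count) {
			tie = 1;
		}
	}
	return tie ? NULL : best;
}
```

With two files moved from "src" to "lib" and one to "tools", this sketch would pick "lib"; if a second file also moves to "tools", the counts tie and no directory rename is chosen.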
merge-ort: precompute subset of sources for which we need rename detection

Rename detection works by trying to pair all file deletions (or
"sources") with all file additions (or "destinations"), checking
similarity, and then marking the sufficiently similar ones as renames.
This can be expensive if there are many sources and destinations on a
given side of history as it results in an N x M comparison matrix.
However, there are many cases where we can compute in advance that
detecting renames for some of the sources provides no useful information
and thus that we can exclude those sources from the matrix.

To see why, first note that the merge machinery uses detected renames in
two ways:

  * directory rename detection: when one side of history renames a
    directory, and the other side of history adds new files to that
    directory, we want to be able to warn the user about the need to
    choose whether those new files stay in the old directory or move
    to the new one.

  * three-way content merging: in order to do three-way content merging
    of files, we need three different file versions. If one side of
    history renamed a file, then some of the content for the file is
    found under a different path than in the merge base or on the
    other side of history.

Add a simple testcase showing the two kinds of reasons renames are
relevant; it's a testcase that will only pass if we detect both kinds of
needed renames.

Other than the testcase added above, this commit concentrates just on
the three-way content merging; it will punt and mark all sources as
needed for directory rename detection, and leave it to future commits to
narrow that down more.

The point of three-way content merging is to reconcile changes made on
*both* sides of history. What if the file wasn't modified on both
sides? There are two possibilities:

  * If it wasn't modified on the renamed side:
    -> then we get to do exact rename detection, which is cheap.

  * If it wasn't modified on the unrenamed side:
    -> then detection of a rename for that source file is irrelevant

That latter claim might be surprising at first, so let's walk through a
case to show why rename detection for that source file is irrelevant.
Let's use two filenames, old.c & new.c, with the following abbreviated
object ids (and where the value '000000' is used to denote that the file
is missing in that commit):

                 old.c     new.c
  MERGE_BASE:    01d01d    000000
  MERGE_SIDE1:   01d01d    000000
  MERGE_SIDE2:   000000    5e1ec7

If the rename *isn't* detected:
  then old.c looks like it was unmodified on one side and deleted on
  the other and should thus be removed. new.c looks like a new file we
  should keep as-is.

If the rename *is* detected:
  then a three-way content merge is done. Since the version of the
  file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge
  will produce exactly the version of the file whose abbreviated
  object id is 5e1ec7. It will record that file at the path new.c,
  while removing old.c from the directory.

Note that these two results are identical -- a single file named 'new.c'
with object id 5e1ec7. In other words, it doesn't matter if the rename
is detected in the case where the file is unmodified on the unrenamed
side.

Use this information to compute whether we need rename detection for
each source created in add_pair().

It's probably worth noting that there used to be a few other edge or
corner cases besides three-way content merges and directory rename
detection where lack of rename detection could have affected the result,
but those cases actually highlighted where conflict resolution methods
were not consistent with each other. Fixing those inconsistencies was
thus critically important to enabling this optimization. That work
involved the following:

  * bringing consistency to add/add, rename/add, and rename/rename
    conflict types, as done back in the topic merged at commit
    ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04),
    and further extended in commits 2a7c16c980 ("t6422, t6426: be more
    flexible for add/add conflicts involving renames", 2020-08-10) and
    e8eb99d4a6 ("t642[23]: be more flexible for add/add conflicts
    involving pair renames", 2020-08-10)

  * making rename/delete more consistent with modify/delete
    as done in commits 1f3c9ba707 ("t6425: be more flexible with
    rename/delete conflict messages", 2020-08-10) and 727c75b23f
    ("t6404, t6423: expect improved rename/delete handling in ort
    backend", 2020-10-26)

Since the set of relevant_sources we compute has not yet been narrowed
down for directory rename detection, we do not pass it to
diffcore_rename_extended() yet. That will be done after subsequent
commits narrow down the list of relevant_sources needed for directory
rename detection reasons.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
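The content-merge half of the relevance rule argued for in this commit message can be sketched as a single comparison. This is a simplified illustration rather than the logic in add_pair(): object ids are modelled as short strings as in the old.c/new.c example, and `source_is_relevant_for_content()` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

#define MISSING "000000" /* placeholder object id for "file absent" */

/*
 * A path deleted on one side only matters for three-way content merging
 * if the other side changed (or deleted) it relative to the merge base.
 */
static int source_is_relevant_for_content(const char *base_oid,
					  const char *other_side_oid)
{
	assert(base_oid && other_side_oid);
	return strcmp(base_oid, other_side_oid) != 0;
}
```

For the old.c example, the base and MERGE_SIDE1 ids are both 01d01d, so the sketch reports the source as not relevant: whether or not the rename is detected, the result is new.c with object id 5e1ec7.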
2021-03-11 01:38:25 +01:00
	/*
2021-03-13 23:22:07 +01:00
	 * relevant_sources: deleted paths wanted in rename detection, and why
2021-03-11 01:38:25 +01:00
	 *
	 * relevant_sources is a set of deleted paths on each side of
	 * history for which we need rename detection. If a path is deleted
	 * on one side of history, we need to detect if it is part of a
	 * rename if either
	 *    * the file is modified/deleted on the other side of history
2021-03-13 23:22:07 +01:00
	 *    * we need to detect renames for an ancestor directory
2021-03-11 01:38:25 +01:00
	 * If neither of those is true, we can skip rename detection for
2021-03-13 23:22:07 +01:00
	 * that path. The reason is stored as a value from enum
	 * file_rename_relevance, as the reason can inform the algorithm in
	 * diffcore_rename_extended().
merge-ort: precompute subset of sources for which we need rename detection
rename detection works by trying to pair all file deletions (or
"sources") with all file additions (or "destinations"), checking
similarity, and then marking the sufficiently similar ones as renames.
This can be expensive if there are many sources and destinations on a
given side of history as it results in an N x M comparison matrix.
However, there are many cases where we can compute in advance that
detecting renames for some of the sources provides no useful information
and thus that we can exclude those sources from the matrix.
To see why, first note that the merge machinery uses detected renames in
two ways:
* directory rename detection: when one side of history renames a
directory, and the other side of history adds new files to that
directory, we want to be able to warn the user about the need to
choose whether those new files stay in the old directory or move
to the new one.
* three-way content merging: in order to do three-way content merging
of files, we need three different file versions. If one side of
history renamed a file, then some of the content for the file is
found under a different path than in the merge base or on the
other side of history.
Add a simple testcase showing the two kinds of reasons renames are
relevant; it's a testcase that will only pass if we detect both kinds of
needed renames.
Other than the testcase added above, this commit concentrates just on
the three-way content merging; it will punt and mark all sources as
needed for directory rename detection, and leave it to future commits to
narrow that down more.
The point of three-way content merging is to reconcile changes made on
*both* sides of history. What if the file wasn't modified on both
sides? There are two possibilities:
* If it wasn't modified on the renamed side:
-> then we get to do exact rename detection, which is cheap.
* If it wasn't modified on the unrenamed side:
-> then detection of a rename for that source file is irrelevant
That latter claim might be surprising at first, so let's walk through a
case to show why rename detection for that source file is irrelevant.
Let's use two filenames, old.c & new.c, with the following abbreviated
object ids (and where the value '000000' is used to denote that the file
is missing in that commit):
old.c new.c
MERGE_BASE: 01d01d 000000
MERGE_SIDE1: 01d01d 000000
MERGE_SIDE2: 000000 5e1ec7
If the rename *isn't* detected:
then old.c looks like it was unmodified on one side and deleted on
the other and should thus be removed. new.c looks like a new file we
should keep as-is.
If the rename *is* detected:
then a three-way content merge is done. Since the version of the
file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge
will produce exactly the version of the file whose abbreviated
object id is 5e1ec7. It will record that file at the path new.c,
while removing old.c from the directory.
Note that these two results are identical -- a single file named 'new.c'
with object id 5e1ec7. In other words, it doesn't matter if the rename
is detected in the case where the file is unmodified on the unrenamed
side.
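The walkthrough above can be checked with a small sketch that works purely on object ids. It is an illustration under stated assumptions, not merge-ort's code: trivial_three_way() and source_content_relevant() are hypothetical helpers, and object ids are modeled as plain strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: resolve a three-way merge from object ids alone.
 * Returns the winning id, or NULL when both sides changed in different
 * ways and a real content merge would be required.
 */
static const char *trivial_three_way(const char *base, const char *side1,
				     const char *side2)
{
	assert(base && side1 && side2);
	if (!strcmp(base, side1))
		return side2;	/* only side2 changed */
	if (!strcmp(base, side2))
		return side1;	/* only side1 changed */
	if (!strcmp(side1, side2))
		return side1;	/* both sides made the same change */
	return NULL;
}

/*
 * Hypothetical helper: detecting a rename for a source can only matter
 * for content merging if the file was modified on the unrenamed side.
 */
static int source_content_relevant(const char *base,
				   const char *unrenamed_side)
{
	assert(base && unrenamed_side);
	return strcmp(base, unrenamed_side) != 0;
}
```

With the ids from the table, trivial_three_way("01d01d", "01d01d", "5e1ec7") yields "5e1ec7", the same content that lands at new.c when the rename is not detected, and source_content_relevant("01d01d", "01d01d") is 0.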
Use this information to compute whether we need rename detection for
each source created in add_pair().
It's probably worth noting that there used to be a few other edge or
corner cases besides three-way content merges and directory rename
detection where lack of rename detection could have affected the result,
but those cases actually highlighted where conflict resolution methods
were not consistent with each other. Fixing those inconsistencies was
thus critically important to enabling this optimization. That work
involved the following:
* bringing consistency to add/add, rename/add, and rename/rename
conflict types, as done back in the topic merged at commit
ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04),
and further extended in commits 2a7c16c980 ("t6422, t6426: be more
flexible for add/add conflicts involving renames", 2020-08-10) and
e8eb99d4a6 ("t642[23]: be more flexible for add/add conflicts
involving pair renames", 2020-08-10)
* making rename/delete more consistent with modify/delete
as done in commits 1f3c9ba707 ("t6425: be more flexible with
rename/delete conflict messages", 2020-08-10) and 727c75b23f
("t6404, t6423: expect improved rename/delete handling in ort
backend", 2020-10-26)
Since the set of relevant_sources we compute has not yet been narrowed
down for directory rename detection, we do not pass it to
diffcore_rename_extended() yet. That will be done after subsequent
commits narrow down the list of relevant_sources needed for directory
rename detection reasons.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:25 +01:00
	 */
2021-03-13 23:22:02 +01:00
	struct strintmap relevant_sources[3];
merge-ort: precompute subset of sources for which we need rename detection
2021-03-11 01:38:25 +01:00
merge-ort: add data structures for allowable trivial directory resolves
As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match. We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.
If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early. Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.
In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present. Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources. However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.
Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources. Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.
Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.
Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-16 07:22:33 +02:00
	struct deferred_traversal_data deferred[3];

merge-ort: precompute whether directory rename detection is needed
The point of directory rename detection is that if one side of history
renames a directory, and the other side adds new files under the old
directory, then the merge can move those new files into the new
directory. This leads to the following important observation:
* If the other side does not add any new files under the old
directory, we do not need to detect any renames for that directory.
Similarly, directory rename detection had an important requirement:
* If a directory still exists on one side of history, it has not been
renamed on that side of history. (See section 4 of t6423 or
Documentation/technical/directory-rename-detection.txt for more
details).
Using these two bits of information, we note that directory rename
detection is only needed in cases where (1) directories exist in the
merge base and on one side of history (i.e. dirmask == 3 or dirmask ==
5), and (2) where there is some new file added to that directory on the
side where it still exists (thus where the file has filemask == 2 or
filemask == 4, respectively). This has to be done in two steps, because
we have the dirmask when we are first considering the directory, and
won't get the filemasks for the files within it until we recurse into
that directory. So, we save
dir_rename_mask = dirmask - 1
when we hit a directory that is missing on one side, and then later look
for cases of
filemask == dir_rename_mask
One final note is that as soon as we hit a directory that needs
directory rename detection, we will need to detect renames in all
subdirectories of that directory as well due to the "majority rules"
decision when files are renamed into different directory hierarchies.
We arbitrarily use the special value of 0x07 to record when we've hit
such a directory.
The combination of all the above means that we introduce a variable
named dir_rename_mask (couldn't think of a better name) which has one
of the following values as we traverse into a directory:
* 0x00: directory rename detection not needed
* 0x02 or 0x04: directory rename detection only needed if files added
* 0x07: directory rename detection definitely needed
We then pass this value through to add_pair() so that it can mark
location_relevant as true only when dir_rename_mask is 0x07.
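The two-step computation described above can be sketched as follows. This is a minimal illustration, not the code in merge-ort.c: the helper names initial_dir_rename_mask() and update_dir_rename_mask() and the constant DIR_RENAME_NEEDED are made up for the example, and masks are assumed to use bit 0 for the merge base, bit 1 for side 1 and bit 2 for side 2.

```c
#include <assert.h>

#define DIR_RENAME_NEEDED 0x07	/* hypothetical name for the 0x07 value */

/*
 * Step 1: on first seeing a directory, remember which side still has
 * it. Only dirmask 3 (base + side1) and 5 (base + side2) qualify.
 */
static unsigned initial_dir_rename_mask(unsigned dirmask)
{
	assert(dirmask <= 7);
	if (dirmask == 3 || dirmask == 5)
		return dirmask - 1;	/* 2 or 4 */
	return 0;
}

/*
 * Step 2: while recursing, a file that exists only on the side that
 * kept the directory (filemask == dir_rename_mask) means directory
 * rename detection is definitely needed. 0x07 is sticky.
 */
static unsigned update_dir_rename_mask(unsigned dir_rename_mask,
				       unsigned filemask)
{
	assert(filemask <= 7);
	if (dir_rename_mask == DIR_RENAME_NEEDED)
		return DIR_RENAME_NEEDED;
	if ((dir_rename_mask == 2 || dir_rename_mask == 4) &&
	    filemask == dir_rename_mask)
		return DIR_RENAME_NEEDED;
	return dir_rename_mask;
}
```

For example, a directory with dirmask 5 starts with mask 4, and the first file with filemask 4 found inside it raises the mask to 0x07 for the rest of that subtree.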
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:28 +01:00
	/*
	 * dir_rename_mask:
	 *   0: optimization removing unmodified potential rename source okay
	 *   2 or 4: optimization okay, but must check for files added to dir
	 *   7: optimization forbidden; need rename source in case of dir rename
	 */
	unsigned dir_rename_mask:3;

2021-03-11 01:38:26 +01:00
	/*
	 * callback_data_*: supporting data structures for alternate traversal
	 *
	 * We sometimes need to be able to traverse through all the files
	 * in a given tree before all immediate subdirectories within that
	 * tree. Since traverse_trees() doesn't do that naturally, we have
	 * a traverse_trees_wrapper() that stores any immediate
	 * subdirectories while traversing files, then traverses the
	 * immediate subdirectories later. These callback_data* variables
	 * store the information for the subdirectories so that we can do
	 * that traversal order.
	 */
	struct traversal_callback_data *callback_data;
	int callback_data_nr, callback_data_alloc;
	char *callback_data_traverse_path;

merge-ort: add code to check for whether cached renames can be reused
We need to know when renames detected in a previous merge operation can
be reused in a later merge operation. Consider the following setup
(from the git-rebase manpage):
A---B---C topic
/
D---E---F---G master
After rebasing, this will appear as:
A'--B'--C' topic
/
D---E---F---G master
Further, let's say that 'oldfile' was renamed to 'newfile' between E
and G. The rebase or cherry-pick of A onto G will involve a three-way
merge between E (as the merge base) and G and A. After detecting the
rename between E:oldfile and G:newfile, there will be a three-way
content merge of the following:
E:oldfile
G:newfile
A:oldfile
and produce a new result:
A':newfile
Now, when we want to pick B onto A', we will need to do a three-way
merge between A (as the merge-base) and A' and B. This will involve
a three-way content merge of
A:oldfile
A':newfile
B:oldfile
but only if we can detect that A:oldfile is similar enough to A':newfile
to be used together in a three-way content merge, i.e. only if we can
detect that A:oldfile and A':newfile are a rename. But we already know
that A:oldfile and A':newfile are similar enough to be used in a
three-way content merge, because that is precisely where A':newfile came
from in the previous merge.
Note that A & A' both appear in both merges. That gives us the
condition under which we can reuse renames.
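The reuse condition can be written down as a small check. The sketch below is hypothetical: trees are represented by integer ids, struct merge_record and reusable_side() are invented for the example, and the mirrored SIDE2 case is an assumption made by symmetry with the case walked through above.

```c
#include <assert.h>
#include <stddef.h>

enum { NO_SIDE = 0, SIDE1 = 1, SIDE2 = 2 };	/* hypothetical constants */

/* Trees involved in a finished merge, identified by made-up ids. */
struct merge_record {
	int base, side1, side2, result;
};

/*
 * Renames cached from the previous merge can be reused when the new
 * merge base is one side of the previous merge and the previous
 * result is the matching side of the new merge.
 */
static int reusable_side(const struct merge_record *prev,
			 int base, int side1, int side2)
{
	assert(prev != NULL);
	if (base == prev->side2 && side1 == prev->result)
		return SIDE1;	/* e.g. base A, side1 A', after E/G/A -> A' */
	if (base == prev->side1 && side2 == prev->result)
		return SIDE2;	/* assumed mirror image of the case above */
	return NO_SIDE;
}
```

With E=1, G=2, A=3, A'=4 and B=5, the previous merge is {1, 2, 3, 4}, and reusable_side(&prev, 3, 4, 5) reports SIDE1 for the pick of B onto A'.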
There are a couple of important points about this optimization:
- If the rebase or cherry-pick halts for user conflicts, these caches
are NOT saved anywhere. Thus, resuming a halted rebase or
cherry-pick will result in no reused renames for the next commit.
This is intentional, as user resolution can change files
significantly and in ways that violate the similarity assumptions
here.
- Technically, in a *very* narrow case this might give slightly
different results for rename detection. Using the example above,
if:
* E:oldfile had 20 lines
* G:newfile added 10 new lines at the beginning of the file
* A:oldfile deleted all but the first three lines of the file
then
=> A':newfile would have 13 lines, 3 of which match those
in A:oldfile.
Consider the two cases:
* Without this optimization:
- the next step of the rebase operation (moving B to B')
would not detect the rename between A:oldfile and A':newfile
- we'd thus get a modify/delete conflict with the rebase
operation halting for the user to resolve, and have both
A':newfile and B:oldfile sitting in the working tree.
* With this optimization:
- the rename between A:oldfile and A':newfile would be detected
via the cache of renames
- a three-way merge between A:oldfile, A':newfile, and B:oldfile
would commence and be written to A':newfile
Now, is the difference in behavior a bug...or a bugfix? I can't
tell. Given that A:oldfile and A':newfile are not very similar,
when we three-way merge with B:oldfile it seems likely we'll hit a
conflict for the user to resolve. And it shouldn't be too hard for
users to see why we did that three-way merge; oldfile and newfile
*were* renames somewhere in the sequence. So, most of these corner
cases will still behave similarly -- namely, a conflict given to the
user to resolve. Also, consider the interesting case when commit B
is a clean revert of commit A. Without this optimization, a rebase
could not both apply a weird patch like A and then immediately
revert it; users would be forced to resolve merge conflicts. With
this optimization, it would successfully apply the clean revert.
So, there is certainly at least one case that behaves better. Even
if it's considered a "difference in behavior", I think both behaviors
are reasonable, and the time savings provided by this optimization
justify using the slightly altered rename heuristics.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20 08:09:36 +02:00
	/*
	 * merge_trees: trees passed to the merge algorithm for the merge
	 *
	 * merge_trees records the trees passed to the merge algorithm. But,
	 * this data also is stored in merge_result->priv. If a sequence of
	 * merges are being done (such as when cherry-picking or rebasing),
	 * the next merge can look at this and re-use information from
	 * previous merges under certain circumstances.
	 *
	 * See also all the cached_* variables.
	 */
	struct tree *merge_trees[3];

	/*
	 * cached_pairs_valid_side: which side's cached info can be reused
	 *
	 * See the description for merge_trees. For repeated merges, at most
	 * only one side's cached information can be used. Valid values:
	 *   MERGE_SIDE2: cached data from side2 can be reused
	 *   MERGE_SIDE1: cached data from side1 can be reused
	 *   0: no cached data can be reused
merge-ort: restart merge with cached renames to reduce process entry cost
The merge algorithm mostly consists of the following three functions:
collect_merge_info()
detect_and_process_renames()
process_entries()
Prior to the trivial directory resolution optimization of the last half
dozen commits, process_entries() was consistently the slowest, followed
by collect_merge_info(), then detect_and_process_renames(). When the
trivial directory resolution applies, it often dramatically decreases
the amount of time spent in the two slower functions.
Looking at the performance results in the previous commit, the trivial
directory resolution optimization helps amazingly well when there are no
relevant renames. It also helps really well when reapplying a long
series of linear commits (such as in a rebase or cherry-pick), since the
relevant renames may well be cached from the first reapplied commit.
But when there are any relevant renames that are not cached (represented
by the just-one-mega testcase), then the optimization does not help at
all.
Often, I noticed that when the optimization does not apply, it is
because there are a handful of relevant sources -- maybe even only one.
It felt frustrating to need to recurse into potentially hundreds or even
thousands of directories just for a single rename, but it was needed for
correctness.
However, staring at this list of functions and noticing that
process_entries() is the most expensive and knowing I could avoid it if
I had cached renames suggested a simple idea: change
collect_merge_info()
detect_and_process_renames()
process_entries()
into
collect_merge_info()
detect_and_process_renames()
<cache all the renames, and restart>
collect_merge_info()
detect_and_process_renames()
process_entries()
This may seem odd and look like more work. However, note that although
we run collect_merge_info() twice, the second time we get to employ
trivial directory resolves, which makes it much faster, so the increased
time in collect_merge_info() is small. While we run
detect_and_process_renames() again, all renames are cached so it's
nearly a no-op (we don't call into diffcore_rename_extended() but we do
have a little bit of data structure checking and fixing up). And the
big payoff comes from the fact that process_entries() will be much
faster due to having far fewer entries to process.
This restarting only makes sense if we can save recursing into enough
directories to make it worth our while. Introduce a simple heuristic to
guide this. Note that this heuristic uses a "wanted_factor" that I have
virtually no actual real world data for, just some back-of-the-envelope
quasi-scientific calculations that I included in some comments and then
plucked a simple round number out of thin air. It could be that
tweaking this number to make it either higher or lower improves the
optimization. (There's slightly more here; when I first introduced this
optimization, I used a factor of 10, because I was completely confident
it was big enough to not cause slowdowns in special cases. I was
certain it was higher than needed. Several months later, I added the
rough calculations which make me think the optimal number is close to 2;
but instead of pushing to the limit, I just bumped it to 3 to reduce the
risk that there are special cases where this optimization can result in
slowing down the code a little. If the ratio of path counts is below 3,
we probably will only see minor performance improvements at best
anyway.)
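As a rough sketch of the heuristic, assuming the "ratio of path counts" compares the number of paths collected in the first pass with the number that would remain if trivially resolvable directories were not recursed into: the function name worth_restarting() and both parameter names are hypothetical, and the real check in merge-ort.c may be shaped differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical heuristic: restart the merge only if doing so is
 * expected to shrink the number of paths by at least wanted_factor.
 * The comparison uses multiplication to avoid dividing by zero.
 */
static int worth_restarting(size_t paths_collected,
			    size_t paths_if_restarted,
			    size_t wanted_factor)
{
	assert(wanted_factor > 0);
	return paths_collected >= wanted_factor * paths_if_restarted;
}
```

With the factor of 3 mentioned above, 300 collected paths against 100 remaining paths just qualifies for a restart, while 299 against 100 does not.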
Also, note that while the diffstat looks kind of long (nearly 100
lines), more than half of it is in two comments explaining how things
work.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 205.1 ms ± 3.8 ms 204.2 ms ± 3.0 ms
mega-renames: 1.564 s ± 0.010 s 1.076 s ± 0.015 s
just-one-mega: 479.5 ms ± 3.9 ms 364.1 ms ± 7.0 ms
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-16 07:22:37 +02:00
	 *   -1: See redo_after_renames; both sides can be reused.
merge-ort: add code to check for whether cached renames can be reused
2021-05-20 08:09:36 +02:00
	 */
	int cached_pairs_valid_side;

2021-05-20 08:09:34 +02:00
	/*
	 * cached_pairs: Caching of renames and deletions.
	 *
	 * These are mappings recording renames and deletions of individual
	 * files (not directories). They are thus a map from an old
	 * filename to either NULL (for deletions) or a new filename (for
	 * renames).
	 */
	struct strmap cached_pairs[3];

	/*
	 * cached_target_names: just the destinations from cached_pairs
	 *
	 * We sometimes want a fast lookup to determine if a given filename
	 * is one of the destinations in cached_pairs. cached_target_names
	 * is thus duplicative information, but it provides a fast lookup.
	 */
	struct strset cached_target_names[3];

	/*
	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
	 *
	 * If we try to detect a rename for a source path and succeed, it's
	 * part of a rename. If we try to detect a rename for a source path
	 * and fail, then it's a delete. If we do not try to detect a rename
	 * for a path, then we don't know if it's a rename or a delete. If
	 * merge-ort doesn't think the path is relevant, then we just won't
	 * cache anything for that path. But there's a slight problem in
	 * that merge-ort can think a path is RELEVANT_LOCATION, but due to
	 * commit 9bd342137e ("diffcore-rename: determine which
	 * relevant_sources are no longer relevant", 2021-03-13),
	 * diffcore-rename can downgrade the path to RELEVANT_NO_MORE. To
	 * avoid excessive calls to diffcore_rename_extended() we still need
	 * to cache such paths, though we cannot record them as either
	 * renames or deletes. So we cache them here as a "turned out to be
	 * irrelevant *for this commit*" as they are often also irrelevant
	 * for subsequent commits, though we will have to do some extra
	 * checking to see whether such paths become relevant for rename
	 * detection when cherry-picking/rebasing subsequent commits.
	 */
	struct strset cached_irrelevant[3];

merge-ort: restart merge with cached renames to reduce process entry cost
The merge algorithm mostly consists of the following three functions:
collect_merge_info()
detect_and_process_renames()
process_entries()
Prior to the trivial directory resolution optimization of the last half
dozen commits, process_entries() was consistently the slowest, followed
by collect_merge_info(), then detect_and_process_renames(). When the
trivial directory resolution applies, it often dramatically decreases
the amount of time spent in the two slower functions.
Looking at the performance results in the previous commit, the trivial
directory resolution optimization helps amazingly well when there are no
relevant renames. It also helps really well when reapplying a long
series of linear commits (such as in a rebase or cherry-pick), since the
relevant renames may well be cached from the first reapplied commit.
But when there are any relevant renames that are not cached (represented
by the just-one-mega testcase), then the optimization does not help at
all.
Often, I noticed that when the optimization does not apply, it is
because there are a handful of relevant sources -- maybe even only one.
It felt frustrating to need to recurse into potentially hundreds or even
thousands of directories just for a single rename, but it was needed for
correctness.
However, staring at this list of functions and noticing that
process_entries() is the most expensive and knowing I could avoid it if
I had cached renames suggested a simple idea: change
    collect_merge_info()
    detect_and_process_renames()
    process_entries()
into
    collect_merge_info()
    detect_and_process_renames()
    <cache all the renames, and restart>
    collect_merge_info()
    detect_and_process_renames()
    process_entries()
This may seem odd and look like more work. However, note that although
we run collect_merge_info() twice, the second time we get to employ
trivial directory resolves, which makes it much faster, so the increased
time in collect_merge_info() is small. While we run
detect_and_process_renames() again, all renames are cached so it's
nearly a no-op (we don't call into diffcore_rename_extended() but we do
have a little bit of data structure checking and fixing up). And the
big payoff comes from the fact that process_entries() will be much
faster due to having far fewer entries to process.
This restarting only makes sense if we can save recursing into enough
directories to make it worth our while. Introduce a simple heuristic to
guide this. Note that this heuristic uses a "wanted_factor" that I have
virtually no actual real world data for, just some back-of-the-envelope
quasi-scientific calculations that I included in some comments and then
plucked a simple round number out of thin air. It could be that
tweaking this number to make it either higher or lower improves the
optimization. (There's slightly more here; when I first introduced this
optimization, I used a factor of 10, because I was completely confident
it was big enough to not cause slowdowns in special cases. I was
certain it was higher than needed. Several months later, I added the
rough calculations which make me think the optimal number is close to 2;
but instead of pushing to the limit, I just bumped it to 3 to reduce the
risk that there are special cases where this optimization can result in
slowing down the code a little. If the ratio of path counts is below 3,
we probably will only see minor performance improvements at best
anyway.)
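A minimal sketch of the kind of check described above, for illustration only. The function name, its parameters, and the treatment of a ratio of exactly 3 are assumptions; the message only states that a ratio of path counts is compared against a "wanted_factor" of 3.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not taken from merge-ort.c): decide whether
 * restarting the merge looks worthwhile.
 *
 *   paths_before:  number of paths collected by the first pass
 *   paths_after:   estimated number of paths a second pass would collect
 *                  once trivial directory resolves can be used
 *   wanted_factor: minimum shrink ratio that justifies a restart
 *                  (the message above settles on 3)
 *
 * Returns 1 to restart, 0 to carry on with the current data.
 */
static int restart_worthwhile(size_t paths_before, size_t paths_after,
			      size_t wanted_factor)
{
	/* Nothing left to process after a restart: any work saved is a win. */
	if (paths_after == 0)
		return paths_before > 0;

	/*
	 * For positive integers, floor(before / after) >= factor holds
	 * exactly when before >= factor * after, and the division cannot
	 * overflow. Treating "exactly factor" as worthwhile is an
	 * assumption of this sketch.
	 */
	return paths_before / paths_after >= wanted_factor;
}
```

With a factor of 3, shrinking 9000 paths to 1000 would trigger a restart, while shrinking 2500 paths to 1000 would not.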
Also, note that while the diffstat looks kind of long (nearly 100
lines), more than half of it is in two comments explaining how things
work.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
                     Before                After
    no-renames:      205.1 ms ± 3.8 ms     204.2 ms ± 3.0 ms
    mega-renames:    1.564 s ± 0.010 s     1.076 s ± 0.015 s
    just-one-mega:   479.5 ms ± 3.9 ms     364.1 ms ± 7.0 ms
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-16 07:22:37 +02:00
|
|
|
/*
|
|
|
|
* redo_after_renames: optimization flag for "restarting" the merge
|
|
|
|
*
|
|
|
|
* Sometimes it pays to detect renames, cache them, and then
|
|
|
|
* restart the merge operation from the beginning. The reason for
|
|
|
|
* this is that when we know where all the renames are, we know
|
|
|
|
* whether a certain directory has any paths under it affected --
|
|
|
|
* and if a directory is not affected then it permits us to do
|
|
|
|
* trivial tree merging in more cases. Doing trivial tree merging
|
|
|
|
* prevents the need to run process_entry() on every path
|
|
|
|
* underneath trees that can be trivially merged, and
|
|
|
|
* process_entry() is more expensive than collect_merge_info() --
|
|
|
|
* plus, the second collect_merge_info() will be much faster since
|
|
|
|
* it doesn't have to recurse into the relevant trees.
|
|
|
|
*
|
|
|
|
* Values for this flag:
|
|
|
|
* 0 = don't bother, not worth it (or conditions not yet checked)
|
|
|
|
* 1 = conditions for optimization met, optimization worthwhile
|
|
|
|
* 2 = we already did it (don't restart merge yet again)
|
|
|
|
*/
|
|
|
|
unsigned redo_after_renames;
|
|
|
|
|
2020-12-14 17:21:30 +01:00
|
|
|
/*
|
|
|
|
* needed_limit: value needed for inexact rename detection to run
|
|
|
|
*
|
|
|
|
* If the current rename limit wasn't high enough for inexact
|
|
|
|
* rename detection to run, this records the limit needed. Otherwise,
|
|
|
|
* this value remains 0.
|
|
|
|
*/
|
|
|
|
int needed_limit;
|
|
|
|
};
|
|
|
|
|
2020-12-13 09:04:08 +01:00
|
|
|
struct merge_options_internal {
|
|
|
|
/*
|
|
|
|
* paths: primary data structure in all of merge ort.
|
|
|
|
*
|
|
|
|
* The keys of paths:
|
|
|
|
* * are full relative paths from the toplevel of the repository
|
|
|
|
* (e.g. "drivers/firmware/raspberrypi.c").
|
|
|
|
* * store all relevant paths in the repo, both directories and
|
|
|
|
* files (e.g. drivers, drivers/firmware would also be included)
|
|
|
|
* * these keys serve to intern all the path strings, which allows
|
|
|
|
* us to do pointer comparison on directory names instead of
|
|
|
|
* strcmp; we just have to be careful to use the interned strings.
|
|
|
|
*
|
|
|
|
* The values of paths:
|
|
|
|
* * either a pointer to a merged_info, or a conflict_info struct
|
|
|
|
* * merged_info contains all relevant information for a
|
|
|
|
* non-conflicted entry.
|
|
|
|
* * conflict_info contains a merged_info, plus any additional
|
|
|
|
* information about a conflict such as the higher order stages
|
|
|
|
* involved and the names of the paths those came from (handy
|
|
|
|
* once renames get involved).
|
|
|
|
* * a path may start "conflicted" (i.e. point to a conflict_info)
|
|
|
|
* and then a later step (e.g. three-way content merge) determines
|
|
|
|
* it can be cleanly merged, at which point it'll be marked clean
|
|
|
|
* and the algorithm will ignore any data outside the contained
|
|
|
|
* merged_info for that entry
|
|
|
|
* * If an entry remains conflicted, the merged_info portion of a
|
|
|
|
* conflict_info will later be filled with whatever version of
|
|
|
|
* the file should be placed in the working directory (e.g. an
|
|
|
|
* as-merged-as-possible variation that contains conflict markers).
|
|
|
|
*/
|
|
|
|
struct strmap paths;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* conflicted: a subset of keys->values from "paths"
|
|
|
|
*
|
|
|
|
* conflicted is basically an optimization between process_entries()
|
|
|
|
* and record_conflicted_index_entries(); the latter could loop over
|
|
|
|
* ALL the entries in paths AGAIN and look for the ones that are
|
|
|
|
* still conflicted, but since process_entries() has to loop over
|
|
|
|
* all of them, it saves the ones it couldn't resolve in this strmap
|
|
|
|
* so that record_conflicted_index_entries() can iterate just the
|
|
|
|
* relevant entries.
|
|
|
|
*/
|
|
|
|
struct strmap conflicted;
|
|
|
|
|
2020-12-03 16:59:43 +01:00
|
|
|
/*
|
2021-07-30 13:47:39 +02:00
|
|
|
* pool: memory pool for fast allocation/deallocation
|
2020-12-03 16:59:43 +01:00
|
|
|
*
|
2021-07-30 13:47:39 +02:00
|
|
|
* We allocate room for lots of filenames and auxiliary data
|
|
|
|
* structures in merge_options_internal, and it tends to all be
|
|
|
|
* freed together too. Using a memory pool for these provides a
|
|
|
|
* nice speedup.
|
2020-12-03 16:59:43 +01:00
|
|
|
*/
|
2021-07-31 19:27:38 +02:00
|
|
|
struct mem_pool pool;
|
2020-12-03 16:59:43 +01:00
|
|
|
|
merge-ort: add modify/delete handling and delayed output processing
The focus here is on adding a path_msg() which will queue up
warning/conflict/notice messages about the merge for later processing,
storing these in a pathname -> strbuf map. It might seem like a big
change, but it really just is:
* declaration of necessary map with some comments
* initialization and recording of data
* a bunch of code to iterate over the map at print/free time
* at least one caller in order to avoid an error about having an
unused function (which we provide in the form of implementing
modify/delete conflict handling).
At this stage, it is probably not clear why I am opting for delayed
output processing. There are multiple reasons:
1. Merges are supposed to abort if they would overwrite dirty changes
in the working tree. We cannot correctly determine whether changes
would be overwritten until both rename detection has occurred and
full processing of entries with the renames has finalized.
Warning/conflict/notice messages come up at intermediate codepaths
along the way, so unless we want spurious conflict/warning messages
being printed when the merge will be aborted anyway, we need to
save these messages and only print them when relevant.
2. There can be multiple messages for a single path, and we want all
messages for a given path to appear together instead of having them
grouped by conflict/warning type. This was a problem already with
merge-recursive.c but became even more important due to the
splitting apart of conflict types as discussed in the commit
message for 1f3c9ba707 ("t6425: be more flexible with rename/delete
conflict messages", 2020-08-10)
3. Some callers might want to avoid showing the output in certain
cases, such as if the end result is a clean merge. Rebases have
typically done this.
4. Some callers might not want the output to go to stdout or even
stderr, but might want to do something else with it entirely.
For example, a --remerge-diff option to `git show` or `git log
-p` that remerges on the fly and diffs merge commits against the
remerged version would benefit from stdout/stderr not being
written to in the standard form.
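The queueing idea behind the reasons above can be sketched as follows. This is an illustration under stated assumptions, not the path_msg() implementation: the struct names, function names, and fixed-size buffers are hypothetical stand-ins for the pathname -> strbuf map.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PATHS 16	/* hypothetical limits; the real map grows as needed */
#define MAX_TEXT 256

struct path_messages {
	const char *path;	/* key: caller must keep this string alive */
	char text[MAX_TEXT];	/* value: every message queued for the path */
};

struct message_queue {
	struct path_messages entries[MAX_PATHS];
	size_t nr;
};

/* Append msg to the messages for path. Returns 0, or -1 if out of room. */
static int queue_path_msg(struct message_queue *q, const char *path,
			  const char *msg)
{
	size_t i, used;

	for (i = 0; i < q->nr; i++)
		if (!strcmp(q->entries[i].path, path))
			break;
	if (i == q->nr) {
		if (q->nr == MAX_PATHS)
			return -1;
		q->entries[i].path = path;
		q->entries[i].text[0] = '\0';
		q->nr++;
	}
	used = strlen(q->entries[i].text);
	/* Need room for the message, a newline and the trailing NUL. */
	if (used + strlen(msg) + 2 > MAX_TEXT)
		return -1;
	snprintf(q->entries[i].text + used, MAX_TEXT - used, "%s\n", msg);
	return 0;
}

/*
 * Print the queued messages grouped by path, or nothing at all if the
 * merge was aborted. The caller picks the stream, so output need not go
 * to stdout or stderr.
 */
static void flush_path_msgs(const struct message_queue *q, FILE *out,
			    int aborted)
{
	size_t i;

	if (aborted)
		return;
	for (i = 0; i < q->nr; i++)
		fputs(q->entries[i].text, out);
}
```

Because nothing is printed until flush time, messages for one path stay together regardless of the order in which they were queued, and a caller can drop or redirect the whole batch.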
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-03 16:59:46 +01:00
|
|
|
/*
|
|
|
|
* output: special messages and conflict notices for various paths
|
|
|
|
*
|
|
|
|
* This is a map of pathnames (a subset of the keys in "paths" above)
|
|
|
|
* to strbufs. It gathers various warning/conflict/notice messages
|
|
|
|
* for later processing.
|
|
|
|
*/
|
|
|
|
struct strmap output;
|
|
|
|
|
2020-12-13 09:04:08 +01:00
|
|
|
/*
|
2020-12-14 17:21:30 +01:00
|
|
|
* renames: various data relating to rename detection
|
|
|
|
*/
|
|
|
|
struct rename_info renames;
|
|
|
|
|
2021-03-20 01:03:45 +01:00
|
|
|
/*
|
|
|
|
* attr_index: hacky minimal index used for renormalization
|
|
|
|
*
|
|
|
|
* renormalization code _requires_ an index, though it only needs to
|
|
|
|
* find a .gitattributes file within the index. So, when
|
|
|
|
* renormalization is important, we create a special index with just
|
|
|
|
* that one file.
|
|
|
|
*/
|
|
|
|
struct index_state attr_index;
|
|
|
|
|
2020-12-13 09:04:08 +01:00
|
|
|
/*
|
2021-01-19 20:53:50 +01:00
|
|
|
* current_dir_name, toplevel_dir: temporary vars
|
2020-12-13 09:04:08 +01:00
|
|
|
*
|
2021-01-19 20:53:50 +01:00
|
|
|
* These are used in collect_merge_info_callback(), and will set the
|
|
|
|
* various merged_info.directory_name for the various paths we get;
|
|
|
|
* see documentation for that variable and the requirements placed on
|
|
|
|
* that field.
|
2020-12-13 09:04:08 +01:00
|
|
|
*/
|
|
|
|
const char *current_dir_name;
|
2021-01-19 20:53:50 +01:00
|
|
|
const char *toplevel_dir;
|
2020-12-13 09:04:08 +01:00
|
|
|
|
|
|
|
/* call_depth: recursion level counter for merging merge bases */
|
|
|
|
int call_depth;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct version_info {
|
|
|
|
struct object_id oid;
|
|
|
|
unsigned short mode;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct merged_info {
|
|
|
|
/* if is_null, ignore result. otherwise result has oid & mode */
|
|
|
|
struct version_info result;
|
|
|
|
unsigned is_null:1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* clean: whether the path in question is cleanly merged.
|
|
|
|
*
|
|
|
|
* see conflict_info.merged for more details.
|
|
|
|
*/
|
|
|
|
unsigned clean:1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* basename_offset: offset of basename of path.
|
|
|
|
*
|
|
|
|
* perf optimization to avoid recomputing offset of final '/'
|
|
|
|
* character in pathname (0 if no '/' in pathname).
|
|
|
|
*/
|
|
|
|
size_t basename_offset;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* directory_name: containing directory name.
|
|
|
|
*
|
|
|
|
* Note that we assume directory_name is constructed such that
|
|
|
|
* strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
|
|
|
|
* i.e. string equality is equivalent to pointer equality. For this
|
|
|
|
* to hold, we have to be careful setting directory_name.
|
|
|
|
*/
|
|
|
|
const char *directory_name;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct conflict_info {
|
|
|
|
/*
|
|
|
|
* merged: the version of the path that will be written to working tree
|
|
|
|
*
|
|
|
|
* WARNING: It is critical to check merged.clean and ensure it is 0
|
|
|
|
* before reading any conflict_info fields outside of merged.
|
|
|
|
* Allocated merged_info structs will always have clean set to 1.
|
|
|
|
* Allocated conflict_info structs will have merged.clean set to 0
|
|
|
|
* initially. The merged.clean field is how we know if it is safe
|
|
|
|
* to access other parts of conflict_info besides merged; if a
|
|
|
|
* conflict_info's merged.clean is changed to 1, the rest of the
|
|
|
|
* algorithm is not allowed to look at anything outside of the
|
|
|
|
* merged member anymore.
|
|
|
|
*/
|
|
|
|
struct merged_info merged;
|
|
|
|
|
|
|
|
/* oids & modes from each of the three trees for this path */
|
|
|
|
struct version_info stages[3];
|
|
|
|
|
|
|
|
/* pathnames for each stage; may differ due to rename detection */
|
|
|
|
const char *pathnames[3];
|
|
|
|
|
|
|
|
/* Whether this path is/was involved in a directory/file conflict */
|
|
|
|
unsigned df_conflict:1;
|
|
|
|
|
2020-12-03 16:59:42 +01:00
|
|
|
/*
|
|
|
|
* Whether this path is/was involved in a non-content conflict other
|
|
|
|
* than a directory/file conflict (e.g. rename/rename, rename/delete,
|
|
|
|
* file location based on possible directory rename).
|
|
|
|
*/
|
|
|
|
unsigned path_conflict:1;
|
|
|
|
|
2020-12-13 09:04:08 +01:00
|
|
|
/*
|
|
|
|
* For filemask and dirmask, the ith bit corresponds to whether the
|
|
|
|
* ith entry is a file (filemask) or a directory (dirmask). Thus,
|
|
|
|
* filemask & dirmask is always zero, and filemask | dirmask is at
|
|
|
|
* most 7 but can be less when a path does not appear as either a
|
|
|
|
* file or a directory on at least one side of history.
|
|
|
|
*
|
|
|
|
* Note that these masks are related to enum merge_side, as the ith
|
|
|
|
* entry corresponds to side i.
|
|
|
|
*
|
|
|
|
* These values come from a traverse_trees() call; more info may be
|
|
|
|
* found looking at tree-walk.h's struct traverse_info,
|
|
|
|
* particularly the documentation above the "fn" member (note that
|
|
|
|
* filemask = mask & ~dirmask from that documentation).
|
|
|
|
*/
|
|
|
|
unsigned filemask:3;
|
|
|
|
unsigned dirmask:3;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Optimization to track which stages match, to avoid the need to
|
|
|
|
* recompute it in multiple steps. Either 0 or at least 2 bits are
|
|
|
|
* set; if at least 2 bits are set, their corresponding stages match.
|
|
|
|
*/
|
|
|
|
unsigned match_mask:3;
|
|
|
|
};
|
|
|
|
|
2020-12-03 16:59:44 +01:00
|
|
|
/*** Function Grouping: various utility functions ***/
|
|
|
|
|
2020-12-13 09:04:16 +01:00
|
|
|
/*
|
|
|
|
* For the next three macros, see warning for conflict_info.merged.
|
|
|
|
*
|
|
|
|
* In each of the below, mi is a struct merged_info*, and ci was defined
|
|
|
|
* as a struct conflict_info* (but we need to verify ci isn't actually
|
|
|
|
* pointed at a struct merged_info*).
|
|
|
|
*
|
|
|
|
* INITIALIZE_CI: Assign mi to ci, but only if it's safe; set ci to NULL otherwise.
|
|
|
|
* VERIFY_CI: Ensure that something we assigned to a conflict_info* is one.
|
|
|
|
* ASSIGN_AND_VERIFY_CI: Similar to VERIFY_CI but do assignment first.
|
|
|
|
*/
|
|
|
|
#define INITIALIZE_CI(ci, mi) do { \
|
|
|
|
(ci) = (!(mi) || (mi)->clean) ? NULL : (struct conflict_info *)(mi); \
|
|
|
|
} while (0)
|
|
|
|
#define VERIFY_CI(ci) assert((ci) && !(ci)->merged.clean);
|
|
|
|
#define ASSIGN_AND_VERIFY_CI(ci, mi) do { \
|
|
|
|
(ci) = (struct conflict_info *)(mi); \
|
|
|
|
assert((ci) && !(mi)->clean); \
|
|
|
|
} while (0)
|
|
|
|
|
2020-12-13 09:04:27 +01:00
|
|
|
static void free_strmap_strings(struct strmap *map)
|
|
|
|
{
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
|
|
|
|
strmap_for_each_entry(map, &iter, entry) {
|
|
|
|
free((char*)entry->key);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-12-16 23:28:01 +01:00
|
|
|
static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
|
|
|
|
int reinitialize)
|
2020-12-03 16:59:41 +01:00
|
|
|
{
|
2021-01-07 22:35:50 +01:00
|
|
|
struct rename_info *renames = &opti->renames;
|
|
|
|
int i;
|
2021-07-30 13:47:36 +02:00
|
|
|
void (*strmap_clear_func)(struct strmap *, int) =
|
2020-12-16 23:28:01 +01:00
|
|
|
reinitialize ? strmap_partial_clear : strmap_clear;
|
2021-07-30 13:47:36 +02:00
|
|
|
void (*strintmap_clear_func)(struct strintmap *) =
|
2021-03-13 23:22:02 +01:00
|
|
|
reinitialize ? strintmap_partial_clear : strintmap_clear;
|
2021-07-30 13:47:36 +02:00
|
|
|
void (*strset_clear_func)(struct strset *) =
|
2021-05-20 08:09:34 +02:00
|
|
|
reinitialize ? strset_partial_clear : strset_clear;
|
2020-12-03 16:59:41 +01:00
|
|
|
|
2021-07-31 19:27:38 +02:00
|
|
|
strmap_clear_func(&opti->paths, 0);
|
2020-12-03 16:59:41 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* All keys and values in opti->conflicted are a subset of those in
|
|
|
|
* opti->paths. We don't want to deallocate anything twice, so we
|
|
|
|
* don't free the keys and we pass 0 for free_values.
|
|
|
|
*/
|
2021-07-30 13:47:36 +02:00
|
|
|
strmap_clear_func(&opti->conflicted, 0);
|
2020-12-03 16:59:46 +01:00
|
|
|
|
merge-ort: have ll_merge() use a special attr_index for renormalization
ll_merge() needs an index when renormalization is requested. Create one
specifically for just this purpose with just the one needed entry. This
fixes t6418.4 and t6418.5 under GIT_TEST_MERGE_ALGORITHM=ort.
NOTE 1: Even if the user has a working copy or a real index (which is
not a given as merge-ort can be used in bare repositories), we
explicitly ignore any .gitattributes file from either of these
locations. merge-ort can be used to merge two branches that are
unrelated to HEAD, so .gitattributes from the working copy and current
index should not be considered relevant.
NOTE 2: Since we are in the middle of merging, there is a risk that
.gitattributes itself is conflicted...leaving us with an ill-defined
situation about how to perform the rest of the merge. It could be that
the .gitattributes file does not even exist on one of the sides of the
merge, or that it has been modified on both sides. If it's been
modified on both sides, it's possible that it could itself be merged
cleanly, though it's also possible that it only merges cleanly if you
use the right version of the .gitattributes file to drive the merge. It
gets kind of complicated. The only test we ever had that attempted to
test behavior in this area was seemingly unaware of the undefined
behavior, but knew the test wouldn't work for lack of attribute handling
support, marked it as test_expect_failure from the beginning, but
managed to fail for several reasons unrelated to attribute handling.
See commit 6f6e7cfb52 ("t6038: remove problematic test", 2020-08-03) for
details. So there are probably various ways to improve what
initialize_attr_index() picks in the case of a conflicted .gitattributes
but for now I just implemented something simple -- look for whatever
.gitattributes file we can find in any of the higher order stages and
use it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20 01:03:46 +01:00
|
|
|
if (opti->attr_index.cache_nr) /* true iff opt->renormalize */
|
2021-03-20 01:03:45 +01:00
|
|
|
discard_index(&opti->attr_index);
|
|
|
|
|
2021-01-07 22:35:50 +01:00
|
|
|
/* Free memory used by various renames maps */
|
|
|
|
for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
|
2021-07-30 13:47:36 +02:00
|
|
|
strintmap_clear_func(&renames->dirs_removed[i]);
|
|
|
|
strmap_clear_func(&renames->dir_renames[i], 0);
|
|
|
|
strintmap_clear_func(&renames->relevant_sources[i]);
|
2021-05-20 08:09:38 +02:00
|
|
|
if (!reinitialize)
|
|
|
|
assert(renames->cached_pairs_valid_side == 0);
|
2021-07-16 07:22:37 +02:00
|
|
|
if (i != renames->cached_pairs_valid_side &&
|
|
|
|
-1 != renames->cached_pairs_valid_side) {
|
2021-07-30 13:47:36 +02:00
|
|
|
strset_clear_func(&renames->cached_target_names[i]);
|
|
|
|
strmap_clear_func(&renames->cached_pairs[i], 1);
|
|
|
|
strset_clear_func(&renames->cached_irrelevant[i]);
|
2021-05-20 08:09:38 +02:00
|
|
|
partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
|
|
|
|
if (!reinitialize)
|
|
|
|
strmap_clear(&renames->dir_rename_count[i], 1);
|
|
|
|
}
|
2021-01-07 22:35:50 +01:00
|
|
|
}
|
merge-ort: add data structures for allowable trivial directory resolves
As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match. We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.
If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early. Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.
In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present. Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources. However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.
Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources. Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.
Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.
Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-16 07:22:33 +02:00
|
|
|
for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
|
2021-07-30 13:47:36 +02:00
|
|
|
strintmap_clear_func(&renames->deferred[i].possible_trivial_merges);
|
|
|
|
strset_clear_func(&renames->deferred[i].target_dirs);
|
merge-ort: add data structures for allowable trivial directory resolves
As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match. We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.
If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early. Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.
In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present. Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources. However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.
Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources. Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.
Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.
Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.
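The deferral rule described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the data structures this commit adds: `struct dir_info`, `should_defer()`, and `can_resolve_deferred()` are hypothetical names, and tree hashes are modeled as plain strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one directory as seen during traversal. */
struct dir_info {
	const char *base_hash;  /* tree hash in the merge base */
	const char *side1_hash; /* tree hash on side 1 */
	const char *side2_hash; /* tree hash on side 2 */
	int location_relevant;  /* nonzero if RELEVANT_LOCATION cannot be ruled out */
};

/*
 * Defer recursing into a directory when exactly one side matches the
 * merge base (so no source under it is RELEVANT_CONTENT) and the simple
 * checks have also ruled out RELEVANT_LOCATION.
 */
static int should_defer(const struct dir_info *d)
{
	int side1_matches, side2_matches;

	assert(d && d->base_hash && d->side1_hash && d->side2_hash);
	side1_matches = !strcmp(d->side1_hash, d->base_hash);
	side2_matches = !strcmp(d->side2_hash, d->base_hash);

	/* Assumption: a directory unchanged on both sides is handled elsewhere. */
	if (side1_matches && side2_matches)
		return 0;
	return (side1_matches || side2_matches) && !d->location_relevant;
}

/*
 * After all non-deferred paths are traversed, the deferred directories
 * can be resolved without recursion only if rename detection can be
 * skipped: every relevant source (possibly zero of them) already has a
 * cached rename.
 */
static int can_resolve_deferred(size_t relevant_sources,
				size_t relevant_sources_with_cached_rename)
{
	assert(relevant_sources_with_cached_rename <= relevant_sources);
	return relevant_sources == relevant_sources_with_cached_rename;
}
```

If `can_resolve_deferred()` returns 0 in this model, the deferred directories would have to be recursed into after all, since rename detection is still needed.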
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-16 07:22:33 +02:00
|
|
|
renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
|
|
|
|
}
|
merge-ort: add code to check for whether cached renames can be reused
We need to know when renames detected in a previous merge operation can
be reused in a later merge operation. Consider the following setup
(from the git-rebase manpage):
A---B---C topic
/
D---E---F---G master
After rebasing, this will appear as:
A'--B'--C' topic
/
D---E---F---G master
Further, let's say that 'oldfile' was renamed to 'newfile' between E
and G. The rebase or cherry-pick of A onto G will involve a three-way
merge between E (as the merge base) and G and A. After detecting the
rename between E:oldfile and G:newfile, there will be a three-way
content merge of the following:
E:oldfile
G:newfile
A:oldfile
and produce a new result:
A':newfile
Now, when we want to pick B onto A', we will need to do a three-way
merge between A (as the merge-base) and A' and B. This will involve
a three-way content merge of
A:oldfile
A':newfile
B:oldfile
but only if we can detect that A:oldfile is similar enough to A':newfile
to be used together in a three-way content merge, i.e. only if we can
detect that A:oldfile and A':newfile are a rename. But we already know
that A:oldfile and A':newfile are similar enough to be used in a
three-way content merge, because that is precisely where A':newfile came
from in the previous merge.
Note that A & A' both appear in both merges. That gives us the
condition under which we can reuse renames.
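As a rough sketch of that condition, assuming trees are identified by plain strings and using the hypothetical names `struct merge_record` and `reusable_side()` (the real code compares object IDs and records its answer in `renames->cached_pairs_valid_side`):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record of the trees involved in the previous merge. */
struct merge_record {
	const char *base;   /* e.g. E */
	const char *side1;  /* e.g. G */
	const char *side2;  /* e.g. A */
	const char *result; /* e.g. A' */
};

static int same_tree(const char *a, const char *b)
{
	return a && b && !strcmp(a, b);
}

/*
 * Return 1 or 2 for the side whose cached renames can be reused in the
 * next merge, or 0 if neither can. Renames cached for a side are
 * reusable when the other side of the previous merge is the new merge
 * base and the previous result is the new tree on that same side.
 */
static int reusable_side(const struct merge_record *prev,
			 const char *base,
			 const char *side1,
			 const char *side2)
{
	assert(prev && base && side1 && side2);
	if (same_tree(base, prev->side2) && same_tree(side1, prev->result))
		return 1;
	if (same_tree(base, prev->side1) && same_tree(side2, prev->result))
		return 2;
	return 0;
}
```

With the example above, the previous merge is (E, G, A) -> A' and the next merge is (A, A', B), so the side 1 renames (E:oldfile -> G:newfile) are the ones that carry over.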
There are a couple of important points about this optimization:
- If the rebase or cherry-pick halts for user conflicts, these caches
are NOT saved anywhere. Thus, resuming a halted rebase or
cherry-pick will result in no reused renames for the next commit.
This is intentional, as user resolution can change files
significantly and in ways that violate the similarity assumptions
here.
- Technically, in a *very* narrow case this might give slightly
different results for rename detection. Using the example above,
if:
* E:oldfile had 20 lines
* G:newfile added 10 new lines at the beginning of the file
* A:oldfile deleted all but the first three lines of the file
then
=> A':newfile would have 13 lines, 3 of which match those
in A:oldfile.
Consider the two cases:
* Without this optimization:
- the next step of the rebase operation (moving B to B')
would not detect the rename between A:oldfile and A':newfile
- we'd thus get a modify/delete conflict with the rebase
operation halting for the user to resolve, and have both
A':newfile and B:oldfile sitting in the working tree.
* With this optimization:
- the rename between A:oldfile and A':newfile would be detected
via the cache of renames
- a three-way merge between A:oldfile, A':newfile, and B:oldfile
would commence and be written to A':newfile
Now, is the difference in behavior a bug...or a bugfix? I can't
tell. Given that A:oldfile and A':newfile are not very similar,
when we three-way merge with B:oldfile it seems likely we'll hit a
conflict for the user to resolve. And it shouldn't be too hard for
users to see why we did that three-way merge; oldfile and newfile
*were* renames somewhere in the sequence. So, most of these corner
cases will still behave similarly -- namely, a conflict given to the
user to resolve. Also, consider the interesting case when commit B
is a clean revert of commit A. Without this optimization, a rebase
could not both apply a weird patch like A and then immediately
revert it; users would be forced to resolve merge conflicts. With
this optimization, it would successfully apply the clean revert.
So, there is certainly at least one case that behaves better. Even
if it's considered a "difference in behavior", I think both behaviors
are reasonable, and the time savings provided by this optimization
justify using the slightly altered rename heuristics.
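The arithmetic behind the narrow case above can be checked with a small sketch. It assumes a simplified, line-based similarity measure (shared lines divided by the larger file's line count) and a 50% rename threshold; the actual rename detection works on content sizes rather than line counts, and `similarity_percent()` and `is_rename()` are hypothetical helpers.

```c
#include <assert.h>

/*
 * Simplified similarity: lines shared by both files, as a percentage
 * of the larger file's line count (integer division, rounds down).
 */
static int similarity_percent(int shared_lines, int src_lines, int dst_lines)
{
	int max_lines = src_lines > dst_lines ? src_lines : dst_lines;

	assert(shared_lines >= 0 && max_lines > 0);
	assert(shared_lines <= max_lines);
	return shared_lines * 100 / max_lines;
}

/* A pair counts as a rename when it meets the threshold percentage. */
static int is_rename(int percent, int threshold_percent)
{
	return percent >= threshold_percent;
}
```

For A:oldfile (3 lines) and A':newfile (13 lines) with 3 lines in common, this gives 3 * 100 / 13 = 23 percent, which is below a 50 percent threshold. That matches the "without this optimization" case, where the rename is not detected.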
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20 08:09:36 +02:00
|
|
|
renames->cached_pairs_valid_side = 0;
|
|
|
|
renames->dir_rename_mask = 0;
|
2021-01-07 22:35:50 +01:00
|
|
|
|
merge-ort: add modify/delete handling and delayed output processing
The focus here is on adding a path_msg() which will queue up
warning/conflict/notice messages about the merge for later processing,
storing these in a pathname -> strbuf map. It might seem like a big
change, but it really just is:
* declaration of necessary map with some comments
* initialization and recording of data
* a bunch of code to iterate over the map at print/free time
* at least one caller in order to avoid an error about having an
unused function (which we provide in the form of implementing
modify/delete conflict handling).
At this stage, it is probably not clear why I am opting for delayed
output processing. There are multiple reasons:
1. Merges are supposed to abort if they would overwrite dirty changes
in the working tree. We cannot correctly determine whether changes
would be overwritten until both rename detection has occurred and
full processing of entries with the renames has finalized.
Warning/conflict/notice messages come up at intermediate codepaths
along the way, so unless we want spurious conflict/warning messages
being printed when the merge will be aborted anyway, we need to
save these messages and only print them when relevant.
2. There can be multiple messages for a single path, and we want all
messages for a given path to appear together instead of having them
grouped by conflict/warning type. This was a problem already with
merge-recursive.c but became even more important due to the
splitting apart of conflict types as discussed in the commit
message for 1f3c9ba707 ("t6425: be more flexible with rename/delete
conflict messages", 2020-08-10).
3. Some callers might want to avoid showing the output in certain
cases, such as if the end result is a clean merge. Rebases have
typically done this.
4. Some callers might not want the output to go to stdout or even
stderr, but might want to do something else with it entirely.
For example, a --remerge-diff option to `git show` or `git log
-p` that remerges on the fly and diffs merge commits against the
remerged version would benefit from stdout/stderr not being
written to in the standard form.
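A minimal sketch of the delayed-output idea, assuming fixed-size buffers in place of the strmap and strbuf types used by the real code; `struct msg_queue`, `queue_msg()`, and `flush_msgs()` are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATHS 16
#define MAX_PATH_LEN 128
#define MAX_MSG_LEN 512

/* All messages queued so far for one path, newline-separated. */
struct path_output {
	char path[MAX_PATH_LEN];
	char msgs[MAX_MSG_LEN];
};

/* Hypothetical stand-in for the pathname -> strbuf map. */
struct msg_queue {
	struct path_output entries[MAX_PATHS];
	int nr;
};

/* Queue a message for a path instead of printing it; returns -1 if full. */
static int queue_msg(struct msg_queue *q, const char *path, const char *msg)
{
	struct path_output *e = NULL;
	int i;

	assert(q && path && msg);
	for (i = 0; i < q->nr; i++) {
		if (!strcmp(q->entries[i].path, path)) {
			e = &q->entries[i];
			break;
		}
	}
	if (!e) {
		if (q->nr == MAX_PATHS || strlen(path) >= MAX_PATH_LEN)
			return -1;
		e = &q->entries[q->nr++];
		strcpy(e->path, path);
		e->msgs[0] = '\0';
	}
	/* Need room for the message, a newline, and the terminating NUL. */
	if (strlen(e->msgs) + strlen(msg) + 2 > MAX_MSG_LEN)
		return -1;
	strcat(e->msgs, msg);
	strcat(e->msgs, "\n");
	return 0;
}

/* Print everything later, grouped by path, to a caller-chosen stream. */
static void flush_msgs(const struct msg_queue *q, FILE *out)
{
	int i;

	assert(q && out);
	for (i = 0; i < q->nr; i++)
		fputs(q->entries[i].msgs, out);
}
```

A caller that decides to abort the merge, or that only wants output for unclean merges, can simply skip `flush_msgs()`. A caller that wants the text somewhere other than stdout or stderr can pass a different stream.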
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-03 16:59:46 +01:00
|
|
|
if (!reinitialize) {
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *e;
|
|
|
|
|
|
|
|
/* Release and free each strbuf found in output */
|
|
|
|
strmap_for_each_entry(&opti->output, &iter, e) {
|
|
|
|
struct strbuf *sb = e->value;
|
|
|
|
strbuf_release(sb);
|
|
|
|
/*
|
|
|
|
* While strictly speaking we don't need to free(sb)
|
|
|
|
* here because we could pass free_values=1 when
|
|
|
|
* calling strmap_clear() on opti->output, that would
|
|
|
|
* require strmap_clear to do another
|
|
|
|
* strmap_for_each_entry() loop, so we just free it
|
|
|
|
* while we're iterating anyway.
|
|
|
|
*/
|
|
|
|
free(sb);
|
|
|
|
}
|
|
|
|
strmap_clear(&opti->output, 0);
|
|
|
|
}
|
2021-03-11 01:38:26 +01:00
|
|
|
|
2021-07-31 19:27:38 +02:00
|
|
|
mem_pool_discard(&opti->pool, 0);
|
2021-07-30 13:47:39 +02:00
|
|
|
|
2021-03-11 01:38:26 +01:00
|
|
|
/* Clean out callback_data as well. */
|
|
|
|
FREE_AND_NULL(renames->callback_data);
|
|
|
|
renames->callback_data_nr = renames->callback_data_alloc = 0;
|
2020-12-03 16:59:41 +01:00
|
|
|
}
|
|
|
|
|
2021-07-13 10:05:18 +02:00
|
|
|
__attribute__((format (printf, 2, 3)))
|
2020-12-13 09:04:12 +01:00
|
|
|
static int err(struct merge_options *opt, const char *err, ...)
|
|
|
|
{
|
|
|
|
va_list params;
|
|
|
|
struct strbuf sb = STRBUF_INIT;
|
|
|
|
|
|
|
|
strbuf_addstr(&sb, "error: ");
|
|
|
|
va_start(params, err);
|
|
|
|
strbuf_vaddf(&sb, err, params);
|
|
|
|
va_end(params);
|
|
|
|
|
|
|
|
error("%s", sb.buf);
|
|
|
|
strbuf_release(&sb);
|
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2021-01-01 03:34:45 +01:00
|
|
|
static void format_commit(struct strbuf *sb,
|
|
|
|
int indent,
|
2021-10-08 23:08:17 +02:00
|
|
|
struct repository *repo,
|
2021-01-01 03:34:45 +01:00
|
|
|
struct commit *commit)
|
|
|
|
{
|
2021-01-01 03:34:46 +01:00
|
|
|
struct merge_remote_desc *desc;
|
|
|
|
struct pretty_print_context ctx = {0};
|
|
|
|
ctx.abbrev = DEFAULT_ABBREV;
|
|
|
|
|
|
|
|
strbuf_addchars(sb, ' ', indent);
|
|
|
|
desc = merge_remote_util(commit);
|
|
|
|
if (desc) {
|
|
|
|
strbuf_addf(sb, "virtual %s\n", desc->name);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-10-08 23:08:17 +02:00
|
|
|
repo_format_commit_message(repo, commit, "%h %s", sb, &ctx);
|
2021-01-01 03:34:46 +01:00
|
|
|
strbuf_addch(sb, '\n');
|
2021-01-01 03:34:45 +01:00
|
|
|
}
|
|
|
|
|
2020-12-03 16:59:46 +01:00
|
|
|
__attribute__((format (printf, 4, 5)))
|
|
|
|
static void path_msg(struct merge_options *opt,
|
|
|
|
const char *path,
|
|
|
|
int omittable_hint, /* skippable under --remerge-diff */
|
|
|
|
const char *fmt, ...)
|
|
|
|
{
|
|
|
|
va_list ap;
|
2022-02-02 03:37:33 +01:00
|
|
|
struct strbuf *sb, *dest;
|
|
|
|
struct strbuf tmp = STRBUF_INIT;
|
|
|
|
|
|
|
|
if (opt->record_conflict_msgs_as_headers && omittable_hint)
|
2022-02-02 03:37:36 +01:00
|
|
|
return; /* Do not record mere hints in headers */
|
2022-03-02 05:19:21 +01:00
|
|
|
if (opt->priv->call_depth && opt->verbosity < 5)
|
|
|
|
return; /* Ignore messages from inner merges */
|
|
|
|
|
2022-02-02 03:37:33 +01:00
|
|
|
sb = strmap_get(&opt->priv->output, path);
|
|