2006-02-26 01:19:46 +01:00
|
|
|
#include "cache.h"
|
2018-05-16 01:42:15 +02:00
|
|
|
#include "object-store.h"
|
2006-02-26 01:19:46 +01:00
|
|
|
#include "tag.h"
|
|
|
|
#include "blob.h"
|
|
|
|
#include "tree.h"
|
|
|
|
#include "commit.h"
|
2006-02-28 20:24:00 +01:00
|
|
|
#include "diff.h"
|
2020-12-21 16:19:33 +01:00
|
|
|
#include "diff-merges.h"
|
2006-02-26 01:19:46 +01:00
|
|
|
#include "refs.h"
|
|
|
|
#include "revision.h"
|
2018-04-12 02:21:09 +02:00
|
|
|
#include "repository.h"
|
2008-05-04 12:36:54 +02:00
|
|
|
#include "graph.h"
|
2006-09-18 00:43:40 +02:00
|
|
|
#include "grep.h"
|
2007-01-11 11:47:48 +01:00
|
|
|
#include "reflog-walk.h"
|
2007-04-09 12:40:38 +02:00
|
|
|
#include "patch-ids.h"
|
2008-04-03 11:12:06 +02:00
|
|
|
#include "decorate.h"
|
2008-11-03 20:25:46 +01:00
|
|
|
#include "log-tree.h"
|
2010-03-12 18:04:26 +01:00
|
|
|
#include "string-list.h"
|
Implement line-history search (git log -L)
This is a rewrite of much of Bo's work, mainly in an effort to split
it into smaller, easier to understand routines.
The algorithm is built around the struct range_set, which encodes a
series of line ranges as intervals [a,b). This is used in two
contexts:
* A set of lines we are tracking (which will change as we dig through
history).
* To encode diffs, as pairs of ranges.
The main routine is range_set_map_across_diff(). It processes the
diff between a commit C and some parent P. It determines which diff
hunks are relevant to the ranges tracked in C, and computes the new
ranges for P.
The algorithm is then simply to process history in topological order
from newest to oldest, computing ranges and (partial) diffs. At
branch points, we need to merge the ranges we are watching. We will
find that many commits do not affect the chosen ranges, and mark them
TREESAME (in addition to those already filtered by pathspec limiting).
Another pass of history simplification then gets rid of such commits.
This is wired as an extra filtering pass in the log machinery. This
currently only reduces code duplication, but should allow for other
simplifications and options to be used.
Finally, we hook a diff printer into the output chain. Ideally we
would wire directly into the diff logic, to optionally use features
like word diff. However, that will require some major reworking of
the diff chain, so we completely replace the output with our own diff
for now.
As this was a GSoC project, and has quite some history by now, many
people have helped. In no particular order, thanks go to
Jakub Narebski <jnareb@gmail.com>
Jens Lehmann <Jens.Lehmann@web.de>
Jonathan Nieder <jrnieder@gmail.com>
Junio C Hamano <gitster@pobox.com>
Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Will Palmer <wmpalmer@gmail.com>
Apologies to everyone I forgot.
Signed-off-by: Bo Yang <struggleyb.nku@gmail.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-03-28 17:47:32 +01:00
|
|
|
#include "line-log.h"
|
2013-01-05 22:26:45 +01:00
|
|
|
#include "mailmap.h"
|
log: use true parents for diff even when rewriting
When using pathspec filtering in combination with diff-based log
output, parent simplification happens before the diff is computed.
The diff is therefore against the *simplified* parents.
This works okay, arguably by accident, in the normal case:
simplification reduces to one parent as long as the commit is TREESAME
to it. So the simplified parent of any given commit must have the
same tree contents on the filtered paths as its true (unfiltered)
parent.
However, --full-diff breaks this guarantee, and indeed gives pretty
spectacular results when comparing the output of
git log --graph --stat ...
git log --graph --full-diff --stat ...
(--graph internally kicks in parent simplification, much like
--parents).
To fix it, store a copy of the parent list before simplification (in a
slab) whenever --full-diff is in effect. Then use the stored parents
instead of the simplified ones in the commit display code paths. The
latter do not actually check for --full-diff to avoid duplicated code;
they just grab the original parents if save_parents() has not been
called for this revision walk.
For ordinary commits it should be obvious that this is the right thing
to do.
Merge commits are a bit subtle. Observe that with default
simplification, merge simplification is an all-or-nothing decision:
either the merge is TREESAME to one parent and disappears, or it is
different from all parents and the parent list remains intact.
Redundant parents are not pruned, so the existing code also shows them
as a merge.
So if we do show a merge commit, the parent list just consists of the
rewrite result on each parent. Running, e.g., --cc on this in
--full-diff mode is not very useful: if any commits were skipped, some
hunks will disagree with all sides of the merge (with one side,
because commits were skipped; with the others, because they didn't
have those changes in the first place). This triggers --cc showing
these hunks spuriously.
Therefore I believe that even for merge commits it is better to show
the diffs wrt. the original parents.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-31 22:13:20 +02:00
|
|
|
#include "commit-slab.h"
|
2014-01-24 14:40:28 +01:00
|
|
|
#include "dir.h"
|
2014-10-17 02:44:23 +02:00
|
|
|
#include "cache-tree.h"
|
2015-06-29 17:40:30 +02:00
|
|
|
#include "bisect.h"
|
2017-08-19 00:20:36 +02:00
|
|
|
#include "packfile.h"
|
2017-08-23 14:36:52 +02:00
|
|
|
#include "worktree.h"
|
2020-07-28 22:23:39 +02:00
|
|
|
#include "strvec.h"
|
2018-07-20 18:33:04 +02:00
|
|
|
#include "commit-reach.h"
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 14:46:20 +01:00
|
|
|
#include "commit-graph.h"
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 14:46:22 +01:00
|
|
|
#include "prio-queue.h"
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
#include "hashmap.h"
|
2019-06-28 01:39:05 +02:00
|
|
|
#include "utf8.h"
|
2020-04-06 18:59:52 +02:00
|
|
|
#include "bloom.h"
|
2020-04-06 18:59:53 +02:00
|
|
|
#include "json-writer.h"
|
2006-02-26 01:19:46 +01:00
|
|
|
|
2007-11-03 19:11:10 +01:00
|
|
|
volatile show_early_output_fn_t show_early_output;
|
|
|
|
|
2015-06-29 17:40:30 +02:00
|
|
|
static const char *term_bad;
|
|
|
|
static const char *term_good;
|
|
|
|
|
2018-05-19 07:28:24 +02:00
|
|
|
implement_shared_commit_slab(revision_sources, char *);
|
|
|
|
|
line-log: more responsive, incremental 'git log -L'
The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range. This preprocessing affects both responsiveness
and correctness:
- Git doesn't produce any output during this preprocessing step.
Checking whether a commit modified the given line range is
somewhat expensive, so depending on the size of the given revision
range this preprocessing can result in a significant delay before
the first commit is shown.
- Limiting the number of displayed commits (e.g. 'git log -3 -L...')
doesn't limit the amount of work during preprocessing, because
that limit is applied during history traversal. Alas, by that
point this expensive preprocessing step has already churned
through the whole revision range to find all commits modifying the
revision range, even though only a few of them need to be shown.
- It rewrites parents, with no way to turn it off. Without the user
explicitly requesting parent rewriting any parent object ID shown
should be that of the immediate parent, just like in case of a
pathspec-limited history traversal without parent rewriting.
However, after that preprocessing step rewrote history, the
subsequent "regular" history traversal (i.e. get_revision() in a
loop) only sees commits modifying the given line range.
Consequently, it can only show the object ID of the last ancestor
that modified the given line range (which might happen to be the
immediate parent, but many-many times it isn't).
This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:
- Make process_ranges_arbitrary_commit(), the static function in
'line-log.c' deciding whether a commit modifies the given line
range, public by removing the static keyword and adding the
'line_log_' prefix, so it can be called from other parts of the
revision walking machinery.
- If the user didn't explicitly ask for parent rewriting (which, I
believe, is the most common case):
- Call this now-public function during regular history traversal,
namely from get_commit_action() to ignore any commits not
modifying the given line range.
Note that while this check is relatively expensive, it must be
performed before other, much cheaper conditions, because the
tracked line range must be adjusted even when the commit will
end up being ignored by other conditions.
- Skip the line_log_filter() call, i.e. the expensive
preprocessing step, in prepare_revision_walk(), because, thanks
to the above points, the revision walking machinery is now able
to filter out commits not modifying the given line range while
traversing history.
This way the regular history traversal sees the unmodified
history, and is therefore able to print the object ids of the
immediate parents of the listed commits. The eliminated
preprocessing step can greatly reduce the delay before the first
commit is shown, see the numbers below.
- However, if the user did explicitly ask for parent rewriting via
'--parents' or a similar option, then stick with the current
implementation for now, i.e. perform that expensive filtering and
history rewriting in the preprocessing step just like we did
before, leaving the initial delay as long as it was.
I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :) Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.
This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.
The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:
# no parent rewriting requested, no commit-graph present
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m9.570s
user 0m9.494s
sys 0m0.076s
After:
real 0m0.718s
user 0m0.674s
sys 0m0.044s
A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list(). With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:
Before:
real 0m8.874s
user 0m8.816s
sys 0m0.057s
After:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
The next patch will further reduce the remaining delay.
To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.
Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 3.88 s
After: 0.71 s
If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.
As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 13:56:17 +02:00
|
|
|
static inline int want_ancestry(const struct rev_info *revs);
|
|
|
|
|
list-objects: pass full pathname to callbacks
When we find a blob at "a/b/c", we currently pass this to
our show_object_fn callbacks as two components: "a/b/" and
"c". Callbacks which want the full value then call
path_name(), which concatenates the two. But this is an
inefficient interface; the path is a strbuf, and we could
simply append "c" to it temporarily, then roll back the
length, without creating a new copy.
So we could improve this by teaching the callsites of
path_name() this trick (and there are only 3). But we can
also notice that no callback actually cares about the
broken-down representation, and simply pass each callback
the full path "a/b/c" as a string. The callback code becomes
even simpler, then, as we do not have to worry about freeing
an allocated buffer, nor rolling back our modification to
the strbuf.
This is theoretically less efficient, as some callbacks
would not bother to format the final path component. But in
practice this is not measurable. Since we use the same
strbuf over and over, our work to grow it is amortized, and
we really only pay to memcpy a few bytes.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-11 23:28:36 +01:00
|
|
|
void show_object_with_name(FILE *out, struct object *obj, const char *name)
|
2006-02-26 01:19:46 +01:00
|
|
|
{
|
list-objects: pass full pathname to callbacks
When we find a blob at "a/b/c", we currently pass this to
our show_object_fn callbacks as two components: "a/b/" and
"c". Callbacks which want the full value then call
path_name(), which concatenates the two. But this is an
inefficient interface; the path is a strbuf, and we could
simply append "c" to it temporarily, then roll back the
length, without creating a new copy.
So we could improve this by teaching the callsites of
path_name() this trick (and there are only 3). But we can
also notice that no callback actually cares about the
broken-down representation, and simply pass each callback
the full path "a/b/c" as a string. The callback code becomes
even simpler, then, as we do not have to worry about freeing
an allocated buffer, nor rolling back our modification to
the strbuf.
This is theoretically less efficient, as some callbacks
would not bother to format the final path component. But in
practice this is not measurable. Since we use the same
strbuf over and over, our work to grow it is amortized, and
we really only pay to memcpy a few bytes.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-11 23:28:36 +01:00
|
|
|
const char *p;
|
2011-08-17 23:30:34 +02:00
|
|
|
|
2015-11-10 03:22:28 +01:00
|
|
|
fprintf(out, "%s ", oid_to_hex(&obj->oid));
|
2016-02-11 23:24:18 +01:00
|
|
|
for (p = name; *p && *p != '\n'; p++)
|
|
|
|
fputc(*p, out);
|
2011-08-17 23:30:35 +02:00
|
|
|
fputc('\n', out);
|
2011-08-17 23:30:34 +02:00
|
|
|
}
|
|
|
|
|
2006-02-26 01:19:46 +01:00
|
|
|
static void mark_blob_uninteresting(struct blob *blob)
|
|
|
|
{
|
2008-02-18 21:47:54 +01:00
|
|
|
if (!blob)
|
|
|
|
return;
|
2006-02-26 01:19:46 +01:00
|
|
|
if (blob->object.flags & UNINTERESTING)
|
|
|
|
return;
|
|
|
|
blob->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:39 +02:00
|
|
|
static void mark_tree_contents_uninteresting(struct repository *r,
|
|
|
|
struct tree *tree)
|
2006-02-26 01:19:46 +01:00
|
|
|
{
|
2006-05-29 21:20:14 +02:00
|
|
|
struct tree_desc desc;
|
tree_entry(): new tree-walking helper function
This adds a "tree_entry()" function that combines the common operation of
doing a "tree_entry_extract()" + "update_tree_entry()".
It also has a simplified calling convention, designed for simple loops
that traverse over a whole tree: the arguments are pointers to the tree
descriptor and a name_entry structure to fill in, and it returns a boolean
"true" if there was an entry left to be gotten in the tree.
This allows tree traversal with
struct tree_desc desc;
struct name_entry entry;
desc.buf = tree->buffer;
desc.size = tree->size;
while (tree_entry(&desc, &entry) {
... use "entry.{path, sha1, mode, pathlen}" ...
}
which is not only shorter than writing it out in full, it's hopefully less
error prone too.
[ It's actually a tad faster too - we don't need to recalculate the entry
pathlength in both extract and update, but need to do it only once.
Also, some callers can avoid doing a "strlen()" on the result, since
it's returned as part of the name_entry structure.
However, by now we're talking just 1% speedup on "git-rev-list --objects
--all", and we're definitely at the point where tree walking is no
longer the issue any more. ]
NOTE! Not everybody wants to use this new helper function, since some of
the tree walkers very much on purpose do the descriptor update separately
from the entry extraction. So the "extract + update" sequence still
remains as the core sequence, this is just a simplified interface.
We should probably add a silly two-line inline helper function for
initializing the descriptor from the "struct tree" too, just to cut down
on the noise from that common "desc" initializer.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-30 18:45:45 +02:00
|
|
|
struct name_entry entry;
|
2006-02-26 01:19:46 +01:00
|
|
|
|
2018-05-11 20:00:55 +02:00
|
|
|
if (parse_tree_gently(tree, 1) < 0)
|
2006-02-26 01:19:46 +01:00
|
|
|
return;
|
2006-05-29 21:20:14 +02:00
|
|
|
|
2007-03-21 18:08:25 +01:00
|
|
|
init_tree_desc(&desc, tree->buffer, tree->size);
|
tree_entry(): new tree-walking helper function
This adds a "tree_entry()" function that combines the common operation of
doing a "tree_entry_extract()" + "update_tree_entry()".
It also has a simplified calling convention, designed for simple loops
that traverse over a whole tree: the arguments are pointers to the tree
descriptor and a name_entry structure to fill in, and it returns a boolean
"true" if there was an entry left to be gotten in the tree.
This allows tree traversal with
struct tree_desc desc;
struct name_entry entry;
desc.buf = tree->buffer;
desc.size = tree->size;
while (tree_entry(&desc, &entry) {
... use "entry.{path, sha1, mode, pathlen}" ...
}
which is not only shorter than writing it out in full, it's hopefully less
error prone too.
[ It's actually a tad faster too - we don't need to recalculate the entry
pathlength in both extract and update, but need to do it only once.
Also, some callers can avoid doing a "strlen()" on the result, since
it's returned as part of the name_entry structure.
However, by now we're talking just 1% speedup on "git-rev-list --objects
--all", and we're definitely at the point where tree walking is no
longer the issue any more. ]
NOTE! Not everybody wants to use this new helper function, since some of
the tree walkers very much on purpose do the descriptor update separately
from the entry extraction. So the "extract + update" sequence still
remains as the core sequence, this is just a simplified interface.
We should probably add a silly two-line inline helper function for
initializing the descriptor from the "struct tree" too, just to cut down
on the noise from that common "desc" initializer.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-30 18:45:45 +02:00
|
|
|
while (tree_entry(&desc, &entry)) {
|
2007-11-12 00:35:23 +01:00
|
|
|
switch (object_type(entry.mode)) {
|
|
|
|
case OBJ_TREE:
|
2019-01-15 01:39:44 +01:00
|
|
|
mark_tree_uninteresting(r, lookup_tree(r, &entry.oid));
|
2007-11-12 00:35:23 +01:00
|
|
|
break;
|
|
|
|
case OBJ_BLOB:
|
2019-01-15 01:39:44 +01:00
|
|
|
mark_blob_uninteresting(lookup_blob(r, &entry.oid));
|
2007-11-12 00:35:23 +01:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Subproject commit - not in this repository */
|
|
|
|
break;
|
|
|
|
}
|
2006-02-26 01:19:46 +01:00
|
|
|
}
|
2006-05-29 21:20:14 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't care about the tree any more
|
|
|
|
* after it has been marked uninteresting.
|
|
|
|
*/
|
2013-06-06 00:37:39 +02:00
|
|
|
free_tree_buffer(tree);
|
2006-02-26 01:19:46 +01:00
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:39 +02:00
|
|
|
void mark_tree_uninteresting(struct repository *r, struct tree *tree)
|
2014-01-16 00:38:01 +01:00
|
|
|
{
|
2015-12-05 16:27:24 +01:00
|
|
|
struct object *obj;
|
2014-01-16 00:38:01 +01:00
|
|
|
|
|
|
|
if (!tree)
|
|
|
|
return;
|
2015-12-05 16:27:24 +01:00
|
|
|
|
|
|
|
obj = &tree->object;
|
2014-01-16 00:38:01 +01:00
|
|
|
if (obj->flags & UNINTERESTING)
|
|
|
|
return;
|
|
|
|
obj->flags |= UNINTERESTING;
|
2018-09-21 17:57:39 +02:00
|
|
|
mark_tree_contents_uninteresting(r, tree);
|
2006-02-26 01:19:46 +01:00
|
|
|
}
|
|
|
|
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
struct path_and_oids_entry {
|
|
|
|
struct hashmap_entry ent;
|
|
|
|
char *path;
|
|
|
|
struct oidset trees;
|
|
|
|
};
|
|
|
|
|
|
|
|
static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
|
2019-10-07 01:30:37 +02:00
|
|
|
const struct hashmap_entry *eptr,
|
|
|
|
const struct hashmap_entry *entry_or_key,
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
const void *keydata)
|
|
|
|
{
|
2019-10-07 01:30:37 +02:00
|
|
|
const struct path_and_oids_entry *e1, *e2;
|
|
|
|
|
|
|
|
e1 = container_of(eptr, const struct path_and_oids_entry, ent);
|
|
|
|
e2 = container_of(entry_or_key, const struct path_and_oids_entry, ent);
|
|
|
|
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
return strcmp(e1->path, e2->path);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void paths_and_oids_clear(struct hashmap *map)
|
|
|
|
{
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct path_and_oids_entry *entry;
|
|
|
|
|
2019-10-07 01:30:41 +02:00
|
|
|
hashmap_for_each_entry(map, &iter, entry, ent /* member name */) {
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
oidset_clear(&entry->trees);
|
|
|
|
free(entry->path);
|
|
|
|
}
|
|
|
|
|
2020-11-02 19:55:05 +01:00
|
|
|
hashmap_clear_and_free(map, struct path_and_oids_entry, ent);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
static void paths_and_oids_insert(struct hashmap *map,
|
|
|
|
const char *path,
|
|
|
|
const struct object_id *oid)
|
|
|
|
{
|
|
|
|
int hash = strhash(path);
|
|
|
|
struct path_and_oids_entry key;
|
|
|
|
struct path_and_oids_entry *entry;
|
|
|
|
|
2019-10-07 01:30:27 +02:00
|
|
|
hashmap_entry_init(&key.ent, hash);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
|
|
|
|
/* use a shallow copy for the lookup */
|
|
|
|
key.path = (char *)path;
|
|
|
|
oidset_init(&key.trees, 0);
|
|
|
|
|
2019-10-07 01:30:42 +02:00
|
|
|
entry = hashmap_get_entry(map, &key, ent, NULL);
|
2019-10-07 01:30:30 +02:00
|
|
|
if (!entry) {
|
2021-03-13 17:17:22 +01:00
|
|
|
CALLOC_ARRAY(entry, 1);
|
2019-10-07 01:30:27 +02:00
|
|
|
hashmap_entry_init(&entry->ent, hash);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
entry->path = xstrdup(key.path);
|
|
|
|
oidset_init(&entry->trees, 16);
|
2019-10-07 01:30:32 +02:00
|
|
|
hashmap_put(map, &entry->ent);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
oidset_insert(&entry->trees, oid);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void add_children_by_path(struct repository *r,
|
|
|
|
struct tree *tree,
|
|
|
|
struct hashmap *map)
|
|
|
|
{
|
|
|
|
struct tree_desc desc;
|
|
|
|
struct name_entry entry;
|
|
|
|
|
|
|
|
if (!tree)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (parse_tree_gently(tree, 1) < 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
init_tree_desc(&desc, tree->buffer, tree->size);
|
|
|
|
while (tree_entry(&desc, &entry)) {
|
|
|
|
switch (object_type(entry.mode)) {
|
|
|
|
case OBJ_TREE:
|
2019-02-07 07:05:24 +01:00
|
|
|
paths_and_oids_insert(map, entry.path, &entry.oid);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
|
|
|
|
if (tree->object.flags & UNINTERESTING) {
|
2019-02-07 07:05:24 +01:00
|
|
|
struct tree *child = lookup_tree(r, &entry.oid);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
if (child)
|
|
|
|
child->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case OBJ_BLOB:
|
|
|
|
if (tree->object.flags & UNINTERESTING) {
|
2019-02-07 07:05:24 +01:00
|
|
|
struct blob *child = lookup_blob(r, &entry.oid);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
if (child)
|
|
|
|
child->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Subproject commit - not in this repository */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
free_tree_buffer(tree);
|
|
|
|
}
|
|
|
|
|
2019-01-16 19:25:58 +01:00
|
|
|
void mark_trees_uninteresting_sparse(struct repository *r,
|
|
|
|
struct oidset *trees)
|
|
|
|
{
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
unsigned has_interesting = 0, has_uninteresting = 0;
|
2020-11-11 21:02:20 +01:00
|
|
|
struct hashmap map = HASHMAP_INIT(path_and_oids_cmp, NULL);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
struct hashmap_iter map_iter;
|
|
|
|
struct path_and_oids_entry *entry;
|
2019-01-16 19:25:58 +01:00
|
|
|
struct object_id *oid;
|
|
|
|
struct oidset_iter iter;
|
|
|
|
|
|
|
|
oidset_iter_init(trees, &iter);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
while ((!has_interesting || !has_uninteresting) &&
|
|
|
|
(oid = oidset_iter_next(&iter))) {
|
2019-01-16 19:25:58 +01:00
|
|
|
struct tree *tree = lookup_tree(r, oid);
|
|
|
|
|
|
|
|
if (!tree)
|
|
|
|
continue;
|
|
|
|
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
if (tree->object.flags & UNINTERESTING)
|
|
|
|
has_uninteresting = 1;
|
|
|
|
else
|
|
|
|
has_interesting = 1;
|
2019-01-16 19:25:58 +01:00
|
|
|
}
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
|
|
|
|
/* Do not walk unless we have both types of trees. */
|
|
|
|
if (!has_uninteresting || !has_interesting)
|
|
|
|
return;
|
|
|
|
|
|
|
|
oidset_iter_init(trees, &iter);
|
|
|
|
while ((oid = oidset_iter_next(&iter))) {
|
|
|
|
struct tree *tree = lookup_tree(r, oid);
|
|
|
|
add_children_by_path(r, tree, &map);
|
|
|
|
}
|
|
|
|
|
2019-10-07 01:30:41 +02:00
|
|
|
hashmap_for_each_entry(&map, &map_iter, entry, ent /* member name */)
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-16 19:25:59 +01:00
|
|
|
mark_trees_uninteresting_sparse(r, &entry->trees);
|
|
|
|
|
|
|
|
paths_and_oids_clear(&map);
|
2019-01-16 19:25:58 +01:00
|
|
|
}
|
|
|
|
|
2018-05-11 20:02:27 +02:00
|
|
|
struct commit_stack {
|
|
|
|
struct commit **items;
|
|
|
|
size_t nr, alloc;
|
|
|
|
};
|
2021-09-27 14:54:25 +02:00
|
|
|
#define COMMIT_STACK_INIT { 0 }
|
2018-05-11 20:02:27 +02:00
|
|
|
|
|
|
|
static void commit_stack_push(struct commit_stack *stack, struct commit *commit)
|
|
|
|
{
|
|
|
|
ALLOC_GROW(stack->items, stack->nr + 1, stack->alloc);
|
|
|
|
stack->items[stack->nr++] = commit;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct commit *commit_stack_pop(struct commit_stack *stack)
|
|
|
|
{
|
|
|
|
return stack->nr ? stack->items[--stack->nr] : NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void commit_stack_clear(struct commit_stack *stack)
|
|
|
|
{
|
|
|
|
FREE_AND_NULL(stack->items);
|
|
|
|
stack->nr = stack->alloc = 0;
|
|
|
|
}
|
|
|
|
|
2018-05-11 20:03:15 +02:00
|
|
|
static void mark_one_parent_uninteresting(struct commit *commit,
|
|
|
|
struct commit_stack *pending)
|
|
|
|
{
|
|
|
|
struct commit_list *l;
|
|
|
|
|
|
|
|
if (commit->object.flags & UNINTERESTING)
|
|
|
|
return;
|
|
|
|
commit->object.flags |= UNINTERESTING;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Normally we haven't parsed the parent
|
|
|
|
* yet, so we won't have a parent of a parent
|
|
|
|
* here. However, it may turn out that we've
|
|
|
|
* reached this commit some other way (where it
|
|
|
|
* wasn't uninteresting), in which case we need
|
|
|
|
* to mark its parents recursively too..
|
|
|
|
*/
|
|
|
|
for (l = commit->parents; l; l = l->next)
|
|
|
|
commit_stack_push(pending, l->item);
|
|
|
|
}
|
|
|
|
|
2006-02-26 01:19:46 +01:00
|
|
|
void mark_parents_uninteresting(struct commit *commit)
|
|
|
|
{
|
2018-05-11 20:02:27 +02:00
|
|
|
struct commit_stack pending = COMMIT_STACK_INIT;
|
|
|
|
struct commit_list *l;
|
2012-01-14 13:19:53 +01:00
|
|
|
|
|
|
|
for (l = commit->parents; l; l = l->next)
|
2018-05-11 20:03:15 +02:00
|
|
|
mark_one_parent_uninteresting(l->item, &pending);
|
|
|
|
|
|
|
|
while (pending.nr > 0)
|
|
|
|
mark_one_parent_uninteresting(commit_stack_pop(&pending),
|
|
|
|
&pending);
|
2018-05-11 20:02:27 +02:00
|
|
|
|
|
|
|
commit_stack_clear(&pending);
|
2006-02-26 01:19:46 +01:00
|
|
|
}
|
|
|
|
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 00:43:19 +02:00
|
|
|
static void add_pending_object_with_path(struct rev_info *revs,
|
2013-05-25 11:08:07 +02:00
|
|
|
struct object *obj,
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 00:43:19 +02:00
|
|
|
const char *name, unsigned mode,
|
|
|
|
const char *path)
|
2006-02-26 01:19:46 +01:00
|
|
|
{
|
2020-09-02 00:28:07 +02:00
|
|
|
struct interpret_branch_name_options options = { 0 };
|
2011-05-19 03:08:09 +02:00
|
|
|
if (!obj)
|
|
|
|
return;
|
2007-02-28 01:22:52 +01:00
|
|
|
if (revs->no_walk && (obj->flags & UNINTERESTING))
|
Make 'git show' more useful
For some reason, I ended up doing
git show HEAD~5..
as an odd way of asking for a log. I realize I should just have used "git
log", but at the same time it does make perfect conceptual sense. After
all, you _could_ have done
git show HEAD HEAD~1 HEAD~2 HEAD~3 HEAD~4
and saying "git show HEAD~5.." is pretty natural. It's not like "git show"
only ever showed a single commit (or other object) before either! So
conceptually, giving a commit range is a very sensible operation, even
though you'd traditionally have used "git log" for that.
However, doing that currently results in an error
fatal: object ranges do not make sense when not walking revisions
which admittedly _also_ makes perfect sense - from an internal git
implementation standpoint in 'revision.c'.
However, I think that asking to show a range makes sense to a user, while
saying "object ranges no not make sense when not walking revisions" only
makes sense to a git developer.
So on the whole, of the two different "makes perfect sense" behaviors, I
think I originally picked the wrong one. And quite frankly, I don't really
see anybody actually _depending_ on that error case. So why not change it?
So rather than error out, just turn that non-walking error case into a
"silently turn on walking" instead.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-07-13 23:41:12 +02:00
|
|
|
revs->no_walk = 0;
|
2010-01-26 22:48:28 +01:00
|
|
|
if (revs->reflog_info && obj->type == OBJ_COMMIT) {
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
2021-06-10 15:06:43 +02:00
|
|
|
size_t namelen = strlen(name);
|
|
|
|
int len = interpret_branch_name(name, namelen, &buf, &options);
|
2010-01-26 22:48:28 +01:00
|
|
|
|
2021-06-10 15:06:43 +02:00
|
|
|
if (0 < len && len < namelen && buf.len)
|
2010-01-26 22:48:28 +01:00
|
|
|
strbuf_addstr(&buf, name + len);
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 11:14:07 +02:00
|
|
|
add_reflog_for_walk(revs->reflog_info,
|
|
|
|
(struct commit *)obj,
|
|
|
|
buf.buf[0] ? buf.buf: name);
|
2010-01-26 22:48:28 +01:00
|
|
|
strbuf_release(&buf);
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 11:14:07 +02:00
|
|
|
return; /* do not add the commit itself */
|
2010-01-26 22:48:28 +01:00
|
|
|
}
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 00:43:19 +02:00
|
|
|
add_object_array_with_path(obj, name, &revs->pending, mode, path);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void add_pending_object_with_mode(struct rev_info *revs,
|
|
|
|
struct object *obj,
|
|
|
|
const char *name, unsigned mode)
|
|
|
|
{
|
|
|
|
add_pending_object_with_path(revs, obj, name, mode, NULL);
|
2006-02-26 01:19:46 +01:00
|
|
|
}
|
|
|
|
|
2013-05-25 11:08:07 +02:00
|
|
|
void add_pending_object(struct rev_info *revs,
|
|
|
|
struct object *obj, const char *name)
|
2007-06-08 11:24:58 +02:00
|
|
|
{
|
|
|
|
add_pending_object_with_mode(revs, obj, name, S_IFINVALID);
|
|
|
|
}
|
|
|
|
|
2007-12-11 19:09:04 +01:00
|
|
|
void add_head_to_pending(struct rev_info *revs)
|
|
|
|
{
|
2017-05-07 00:10:27 +02:00
|
|
|
struct object_id oid;
|
2007-12-11 19:09:04 +01:00
|
|
|
struct object *obj;
|
2017-05-07 00:10:27 +02:00
|
|
|
if (get_oid("HEAD", &oid))
|
2007-12-11 19:09:04 +01:00
|
|
|
return;
|
2018-09-21 17:57:39 +02:00
|
|
|
obj = parse_object(revs->repo, &oid);
|
2007-12-11 19:09:04 +01:00
|
|
|
if (!obj)
|
|
|
|
return;
|
|
|
|
add_pending_object(revs, obj, "HEAD");
|
|
|
|
}
|
|
|
|
|
2013-05-25 11:08:07 +02:00
|
|
|
static struct object *get_reference(struct rev_info *revs, const char *name,
|
2017-05-07 00:10:27 +02:00
|
|
|
const struct object_id *oid,
|
2013-05-25 11:08:07 +02:00
|
|
|
unsigned int flags)
|
2006-02-26 01:19:46 +01:00
|
|
|
{
|
|
|
|
struct object *object;
|
2021-08-09 10:12:03 +02:00
|
|
|
struct commit *commit;
|
2006-02-26 01:19:46 +01:00
|
|
|
|
2018-12-04 23:42:38 +01:00
|
|
|
/*
|
2021-08-09 10:12:03 +02:00
|
|
|
* If the repository has commit graphs, we try to opportunistically
|
|
|
|
* look up the object ID in those graphs. Like this, we can avoid
|
|
|
|
* parsing commit data from disk.
|
2018-12-04 23:42:38 +01:00
|
|
|
*/
|
2021-08-09 10:12:03 +02:00
|
|
|
commit = lookup_commit_in_graph(revs->repo, oid);
|
|
|
|
if (commit)
|
|
|
|
object = &commit->object;
|
|
|
|
else
|
2018-12-04 23:42:38 +01:00
|
|
|
object = parse_object(revs->repo, oid);
|
|
|
|
|
2011-05-19 03:08:09 +02:00
|
|
|
if (!object) {
|
|
|
|
if (revs->ignore_missing)
|
|
|
|
return object;
|
2017-12-08 16:27:15 +01:00
|
|
|
if (revs->exclude_promisor_objects && is_promisor_object(oid))
|
|
|
|
return NULL;
|
2006-02-26 01:19:46 +01:00
|
|
|
die("bad object %s", name);
|
2011-05-19 03:08:09 +02:00
|
|
|
}
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 01:52:13 +02:00
|
|
|
object->flags |= flags;
|
|
|
|
return object;
|
|
|
|
}
|
|
|
|
|
2017-05-07 00:10:26 +02:00
|
|
|
void add_pending_oid(struct rev_info *revs, const char *name,
|
|
|
|
const struct object_id *oid, unsigned int flags)
|
2011-10-01 17:43:52 +02:00
|
|
|
{
|
2017-05-07 00:10:27 +02:00
|
|
|
struct object *object = get_reference(revs, name, oid, flags);
|
2011-10-01 17:43:52 +02:00
|
|
|
add_pending_object(revs, object, name);
|
|
|
|
}
|
|
|
|
|
2013-05-25 11:08:07 +02:00
|
|
|
static struct commit *handle_commit(struct rev_info *revs,
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 00:43:19 +02:00
|
|
|
struct object_array_entry *entry)
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 01:52:13 +02:00
|
|
|
{
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 00:43:19 +02:00
|
|
|
struct object *object = entry->item;
|
|
|
|
const char *name = entry->name;
|
|
|
|
const char *path = entry->path;
|
|
|
|
unsigned int mode = entry->mode;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 01:52:13 +02:00
|
|
|
unsigned long flags = object->flags;
|
2006-02-26 01:19:46 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Tag object? Look what it points to..
|
|
|
|
*/
|
2006-07-12 05:45:31 +02:00
|
|
|
while (object->type == OBJ_TAG) {
|
2006-02-26 01:19:46 +01:00
|
|
|
struct tag *tag = (struct tag *) object;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 01:52:13 +02:00
|
|
|
if (revs->tag_objects && !(flags & UNINTERESTING))
|
2006-02-26 01:19:46 +01:00
|
|
|
add_pending_object(revs, object, tag->tag);
|
2019-09-05 21:55:55 +02:00
|
|
|
object = parse_object(revs->repo, get_tagged_oid(tag));
|
2009-01-28 08:19:30 +01:00
|
|