2005-04-08 00:16:10 +02:00
|
|
|
/*
|
|
|
|
* GIT - The information manager from hell
|
|
|
|
*
|
|
|
|
* Copyright (C) Linus Torvalds, 2005
|
|
|
|
*/
|
2005-04-08 00:13:13 +02:00
|
|
|
#include "cache.h"
|
2017-06-14 20:07:36 +02:00
|
|
|
#include "config.h"
|
2018-07-01 03:24:55 +02:00
|
|
|
#include "diff.h"
|
|
|
|
#include "diffcore.h"
|
2015-08-10 11:47:45 +02:00
|
|
|
#include "tempfile.h"
|
2014-10-01 12:28:42 +02:00
|
|
|
#include "lockfile.h"
|
2006-04-25 06:18:58 +02:00
|
|
|
#include "cache-tree.h"
|
2007-04-10 06:20:29 +02:00
|
|
|
#include "refs.h"
|
2007-08-11 23:59:01 +02:00
|
|
|
#include "dir.h"
|
2018-05-16 01:42:15 +02:00
|
|
|
#include "object-store.h"
|
2008-07-21 10:24:17 +02:00
|
|
|
#include "tree.h"
|
|
|
|
#include "commit.h"
|
2008-08-21 10:44:53 +02:00
|
|
|
#include "blob.h"
|
2009-12-25 09:30:51 +01:00
|
|
|
#include "resolve-undo.h"
|
2019-02-15 18:59:21 +01:00
|
|
|
#include "run-command.h"
|
2012-04-04 00:53:15 +02:00
|
|
|
#include "strbuf.h"
|
|
|
|
#include "varint.h"
|
2014-06-13 14:19:36 +02:00
|
|
|
#include "split-index.h"
|
2014-12-16 00:15:20 +01:00
|
|
|
#include "utf8.h"
|
2017-09-22 18:35:40 +02:00
|
|
|
#include "fsmonitor.h"
|
2018-10-10 17:59:36 +02:00
|
|
|
#include "thread-utils.h"
|
2018-09-15 19:56:04 +02:00
|
|
|
#include "progress.h"
|
sparse-index: convert from full to sparse
If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.
For now, we avoid converting the index to a sparse index if:
1. the index is split.
2. the index is already sparse.
3. sparse-checkout is disabled.
4. sparse-checkout does not use cone mode.
Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.
The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.
The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.
Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."
We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.
The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30 15:10:55 +02:00
|
|
|
#include "sparse-index.h"
|
read-cache: use hashfile instead of git_hash_ctx
The do_write_index() method in read-cache.c has its own hashing logic
and buffering mechanism. Specifically, the ce_write() method was
introduced by 4990aadc (Speed up index file writing by chunking it
nicely, 2005-04-20) and similar mechanisms were introduced a few months
later in c38138cd (git-pack-objects: write the pack files with a SHA1
csum, 2005-06-26). Based on the timing, in the early days of the Git
codebase, I figured that these roughly equivalent code paths were never
unified only because it got lost in the shuffle. The hashfile API has
since been used extensively in other file formats, such as pack-indexes,
multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to
unify the index writing code to use the same mechanism.
I discovered this disparity while trying to create a new index format
that uses the chunk-format API. That API uses a hashfile as its base, so
it is incompatible with the custom code in read-cache.c.
This rewrite is rather straightforward. It replaces all writes to the
temporary file with writes to the hashfile struct. This takes care of
many of the direct interactions with the_hash_algo.
There are still some git_hash_ctx uses remaining: the extension headers
are hashed for use in the End of Index Entries (EOIE) extension. This
use of the git_hash_ctx is left as-is. There are multiple reasons to not
use a hashfile here, including the fact that the data is not actually
writing to a file, just a hash computation. These hashes do not block
our adoption of the chunk-format API in a future change to the index, so
leave it as-is.
The internals of the algorithms are mostly identical. Previously, the
hashfile API used a smaller 8KB buffer instead of the 128KB buffer from
read-cache.c. The previous change already unified these sizes.
There is one subtle point: we do not pass the CSUM_FSYNC to the
finalize_hashfile() method, which differs from most consumers of the
hashfile API. The extra fsync() call indicated by this flag causes a
significant peformance degradation that is noticeable for quick
commands that write the index, such as "git add". Other consumers can
absorb this cost with their more complicated data structure
organization, and further writing structures such as pack-files and
commit-graphs is rarely in the critical path for common user
interactions.
Some static methods become orphaned in this diff, so I marked them as
MAYBE_UNUSED. The diff is much harder to read if they are deleted during
this change. Instead, they will be deleted in the following change.
In addition to the test suite passing, I computed indexes using the
previous binaries and the binaries compiled after this change, and found
the index data to be exactly equal. Finally, I did extensive performance
testing of "git update-index --force-write" on repos of various sizes,
including one with over 2 million paths at HEAD. These tests
demonstrated less than 1% difference in behavior. As expected, the
performance should be considered unchanged. The previous changes to
increase the hashfile buffer size from 8K to 128K ensured this change
would not create a peformance regression.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-18 20:32:47 +02:00
|
|
|
#include "csum-file.h"
|
2021-07-23 20:52:22 +02:00
|
|
|
#include "promisor-remote.h"
|
2006-04-25 06:18:58 +02:00
|
|
|
|
2012-07-11 11:22:37 +02:00
|
|
|
/* Mask for the name length in ce_flags in the on-disk index */
|
|
|
|
|
|
|
|
#define CE_NAMEMASK (0x0fff)
|
|
|
|
|
2006-04-25 06:18:58 +02:00
|
|
|
/* Index extensions.
|
|
|
|
*
|
|
|
|
* The first letter should be 'A'..'Z' for extensions that are not
|
|
|
|
* necessary for a correct operation (i.e. optimization data).
|
|
|
|
* When new extensions are added that _needs_ to be understood in
|
|
|
|
* order to correctly interpret the index file, pick character that
|
|
|
|
* is outside the range, to cause the reader to abort.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define CACHE_EXT(s) ( (s[0]<<24)|(s[1]<<16)|(s[2]<<8)|(s[3]) )
|
|
|
|
#define CACHE_EXT_TREE 0x54524545 /* "TREE" */
|
2010-02-02 16:33:28 +01:00
|
|
|
#define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
|
2014-06-13 14:19:36 +02:00
|
|
|
#define CACHE_EXT_LINK 0x6c696e6b /* "link" */
|
2015-03-08 11:12:33 +01:00
|
|
|
#define CACHE_EXT_UNTRACKED 0x554E5452 /* "UNTR" */
|
2017-09-22 18:35:40 +02:00
|
|
|
#define CACHE_EXT_FSMONITOR 0x46534D4E /* "FSMN" */
|
2018-10-10 17:59:34 +02:00
|
|
|
#define CACHE_EXT_ENDOFINDEXENTRIES 0x454F4945 /* "EOIE" */
|
2018-10-10 17:59:37 +02:00
|
|
|
#define CACHE_EXT_INDEXENTRYOFFSETTABLE 0x49454F54 /* "IEOT" */
|
2021-03-30 15:10:54 +02:00
|
|
|
#define CACHE_EXT_SPARSE_DIRECTORIES 0x73646972 /* "sdir" */
|
2014-06-13 14:19:36 +02:00
|
|
|
|
|
|
|
/* changes that can be kept in $GIT_DIR/index (basically all extensions) */
|
2014-06-13 14:19:37 +02:00
|
|
|
#define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \
|
2014-06-13 14:19:44 +02:00
|
|
|
CE_ENTRY_ADDED | CE_ENTRY_REMOVED | CE_ENTRY_CHANGED | \
|
2017-09-22 18:35:40 +02:00
|
|
|
SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED | FSMONITOR_CHANGED)
|
2005-04-08 00:13:13 +02:00
|
|
|
|
block alloc: allocate cache entries from mem_pool
When reading large indexes from disk, a portion of the time is
dominated in malloc() calls. This can be mitigated by allocating a
large block of memory and manage it ourselves via memory pools.
This change moves the cache entry allocation to be on top of memory
pools.
Design:
The index_state struct will gain a notion of an associated memory_pool
from which cache_entries will be allocated from. When reading in the
index from disk, we have information on the number of entries and
their size, which can guide us in deciding how large our initial
memory allocation should be. When an index is discarded, the
associated memory_pool will be discarded as well - so the lifetime of
a cache_entry is tied to the lifetime of the index_state that it was
allocated for.
In the case of a Split Index, the following rules are followed. 1st,
some terminology is defined:
Terminology:
- 'the_index': represents the logical view of the index
- 'split_index': represents the "base" cache entries. Read from the
split index file.
'the_index' can reference a single split_index, as well as
cache_entries from the split_index. `the_index` will be discarded
before the `split_index` is. This means that when we are allocating
cache_entries in the presence of a split index, we need to allocate
the entries from the `split_index`'s memory pool. This allows us to
follow the pattern that `the_index` can reference cache_entries from
the `split_index`, and that the cache_entries will not be freed while
they are still being referenced.
Managing transient cache_entry structs:
Cache entries are usually allocated for an index, but this is not always
the case. Cache entries are sometimes allocated because this is the
type that the existing checkout_entry function works with. Because of
this, the existing code needs to handle cache entries associated with an
index / memory pool, and those that only exist transiently. Several
strategies were contemplated around how to handle this:
Chosen approach:
An extra field was added to the cache_entry type to track whether the
cache_entry was allocated from a memory pool or not. This is currently
an int field, as there are no more available bits in the existing
ce_flags bit field. If / when more bits are needed, this new field can
be turned into a proper bit field.
Alternatives:
1) Do not include any information about how the cache_entry was
allocated. Calling code would be responsible for tracking whether the
cache_entry needed to be freed or not.
Pro: No extra memory overhead to track this state
Con: Extra complexity in callers to handle this correctly.
The extra complexity and burden to not regress this behavior in the
future was more than we wanted.
2) cache_entry would gain knowledge about which mem_pool allocated it
Pro: Could (potentially) do extra logic to know when a mem_pool no
longer had references to any cache_entry
Con: cache_entry would grow heavier by a pointer, instead of int
We didn't see a tangible benefit to this approach
3) Do not add any extra information to a cache_entry, but when freeing a
cache entry, check if the memory exists in a region managed by existing
mem_pools.
Pro: No extra memory overhead to track state
Con: Extra computation is performed when freeing cache entries
We decided tracking and iterating over known memory pool regions was
less desirable than adding an extra field to track this stae.
Signed-off-by: Jameson Miller <jamill@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-02 21:49:37 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is an estimate of the pathname length in the index. We use
|
|
|
|
* this for V4 index files to guess the un-deltafied size of the index
|
|
|
|
* in memory because of pathname deltafication. This is not required
|
|
|
|
* for V2/V3 index formats because their pathnames are not compressed.
|
|
|
|
* If the initial amount of memory set aside is not sufficient, the
|
|
|
|
* mem pool will allocate extra memory.
|
|
|
|
*/
|
|
|
|
#define CACHE_ENTRY_PATH_LENGTH 80
|
|
|
|
|
2021-11-29 16:52:41 +01:00
|
|
|
enum index_search_mode {
|
|
|
|
NO_EXPAND_SPARSE = 0,
|
|
|
|
EXPAND_SPARSE = 1
|
|
|
|
};
|
|
|
|
|
block alloc: allocate cache entries from mem_pool
When reading large indexes from disk, a portion of the time is
dominated in malloc() calls. This can be mitigated by allocating a
large block of memory and manage it ourselves via memory pools.
This change moves the cache entry allocation to be on top of memory
pools.
Design:
The index_state struct will gain a notion of an associated memory_pool
from which cache_entries will be allocated from. When reading in the
index from disk, we have information on the number of entries and
their size, which can guide us in deciding how large our initial
memory allocation should be. When an index is discarded, the
associated memory_pool will be discarded as well - so the lifetime of
a cache_entry is tied to the lifetime of the index_state that it was
allocated for.
In the case of a Split Index, the following rules are followed. 1st,
some terminology is defined:
Terminology:
- 'the_index': represents the logical view of the index
- 'split_index': represents the "base" cache entries. Read from the
split index file.
'the_index' can reference a single split_index, as well as
cache_entries from the split_index. `the_index` will be discarded
before the `split_index` is. This means that when we are allocating
cache_entries in the presence of a split index, we need to allocate
the entries from the `split_index`'s memory pool. This allows us to
follow the pattern that `the_index` can reference cache_entries from
the `split_index`, and that the cache_entries will not be freed while
they are still being referenced.
Managing transient cache_entry structs:
Cache entries are usually allocated for an index, but this is not always
the case. Cache entries are sometimes allocated because this is the
type that the existing checkout_entry function works with. Because of
this, the existing code needs to handle cache entries associated with an
index / memory pool, and those that only exist transiently. Several
strategies were contemplated around how to handle this:
Chosen approach:
An extra field was added to the cache_entry type to track whether the
cache_entry was allocated from a memory pool or not. This is currently
an int field, as there are no more available bits in the existing
ce_flags bit field. If / when more bits are needed, this new field can
be turned into a proper bit field.
Alternatives:
1) Do not include any information about how the cache_entry was
allocated. Calling code would be responsible for tracking whether the
cache_entry needed to be freed or not.
Pro: No extra memory overhead to track this state
Con: Extra complexity in callers to handle this correctly.
The extra complexity and burden to not regress this behavior in the
future was more than we wanted.
2) cache_entry would gain knowledge about which mem_pool allocated it
Pro: Could (potentially) do extra logic to know when a mem_pool no
longer had references to any cache_entry
Con: cache_entry would grow heavier by a pointer, instead of int
We didn't see a tangible benefit to this approach
3) Do not add any extra information to a cache_entry, but when freeing a
cache entry, check if the memory exists in a region managed by existing
mem_pools.
Pro: No extra memory overhead to track state
Con: Extra computation is performed when freeing cache entries
We decided tracking and iterating over known memory pool regions was
less desirable than adding an extra field to track this stae.
Signed-off-by: Jameson Miller <jamill@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-02 21:49:37 +02:00
|
|
|
static inline struct cache_entry *mem_pool__ce_alloc(struct mem_pool *mem_pool, size_t len)
|
|
|
|
{
|
|
|
|
struct cache_entry *ce;
|
|
|
|
ce = mem_pool_alloc(mem_pool, cache_entry_size(len));
|
|
|
|
ce->mem_pool_allocated = 1;
|
|
|
|
return ce;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct cache_entry *mem_pool__ce_calloc(struct mem_pool *mem_pool, size_t len)
|
|
|
|
{
|
|
|
|
struct cache_entry * ce;
|
|
|
|
ce = mem_pool_calloc(mem_pool, 1, cache_entry_size(len));
|
|
|
|
ce->mem_pool_allocated = 1;
|
|
|
|
return ce;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct mem_pool *find_mem_pool(struct index_state *istate)
|
|
|
|
{
|
|
|
|
struct mem_pool **pool_ptr;
|
|
|
|
|
|
|
|
if (istate->split_index && istate->split_index->base)
|
|
|
|
pool_ptr = &istate->split_index->base->ce_mem_pool;
|
|
|
|
else
|
|
|
|
pool_ptr = &istate->ce_mem_pool;
|
|
|
|
|
mem-pool: use more standard initialization and finalization
A typical memory type, such as strbuf, hashmap, or string_list can be
stored on the stack or embedded within another structure. mem_pool
cannot be, because of how mem_pool_init() and mem_pool_discard() are
written. mem_pool_init() does essentially the following (simplified
for purposes of explanation here):
void mem_pool_init(struct mem_pool **pool...)
{
*pool = xcalloc(1, sizeof(*pool));
It seems weird to require that mem_pools can only be accessed through a
pointer. It also seems slightly dangerous: unlike strbuf_release() or
strbuf_reset() or string_list_clear(), all of which put the data
structure into a state where it can be re-used after the call,
mem_pool_discard(pool) will leave pool pointing at free'd memory.
read-cache (and split-index) are the only current users of mem_pools,
and they haven't fallen into a use-after-free mistake here, but it seems
likely to be problematic for future users especially since several of
the current callers of mem_pool_init() will only call it when the
mem_pool* is not already allocated (i.e. is NULL).
This type of mechanism also prevents finding synchronization
points where one can free existing memory and then resume more
operations. It would be natural at such points to run something like
mem_pool_discard(pool...);
and, if necessary,
mem_pool_init(&pool...);
and then carry on continuing to use the pool. However, this fails badly
if several objects had a copy of the value of pool from before these
commands; in such a case, those objects won't get the updated value of
pool that mem_pool_init() overwrites pool with and they'll all instead
be reading and writing from free'd memory.
Modify mem_pool_init()/mem_pool_discard() to behave more like
strbuf_init()/strbuf_release()
or
string_list_init()/string_list_clear()
In particular: (1) make mem_pool_init() just take a mem_pool* and have
it only worry about allocating struct mp_blocks, not the struct mem_pool
itself, (2) make mem_pool_discard() free the memory that the pool was
responsible for, but leave it in a state where it can be used to
allocate more memory afterward (without the need to call mem_pool_init()
again).
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-15 19:37:56 +02:00
|
|
|
if (!*pool_ptr) {
|
|
|
|
*pool_ptr = xmalloc(sizeof(**pool_ptr));
|
|
|
|
mem_pool_init(*pool_ptr, 0);
|
|
|
|
}
|
block alloc: allocate cache entries from mem_pool
When reading large indexes from disk, a portion of the time is
dominated in malloc() calls. This can be mitigated by allocating a
large block of memory and manage it ourselves via memory pools.
This change moves the cache entry allocation to be on top of memory
pools.
Design:
The index_state struct will gain a notion of an associated memory_pool
from which cache_entries will be allocated from. When reading in the
index from disk, we have information on the number of entries and
their size, which can guide us in deciding how large our initial
memory allocation should be. When an index is discarded, the
associated memory_pool will be discarded as well - so the lifetime of
a cache_entry is tied to the lifetime of the index_state that it was
allocated for.
In the case of a Split Index, the following rules are followed. 1st,
some terminology is defined:
Terminology:
- 'the_index': represents the logical view of the index
- 'split_index': represents the "base" cache entries. Read from the
split index file.
'the_index' can reference a single split_index, as well as
cache_entries from the split_index. `the_index` will be discarded
before the `split_index` is. This means that when we are allocating
cache_entries in the presence of a split index, we need to allocate
the entries from the `split_index`'s memory pool. This allows us to
follow the pattern that `the_index` can reference cache_entries from
the `split_index`, and that the cache_entries will not be freed while
they are still being referenced.
Managing transient cache_entry structs:
Cache entries are usually allocated for an index, but this is not always
the case. Cache entries are sometimes allocated because this is the
type that the existing checkout_entry function works with. Because of
this, the existing code needs to handle cache entries associated with an
index / memory pool, and those that only exist transiently. Several
strategies were contemplated around how to handle this:
Chosen approach:
An extra field was added to the cache_entry type to track whether the
cache_entry was allocated from a memory pool or not. This is currently
an int field, as there are no more available bits in the existing
ce_flags bit field. If / when more bits are needed, this new field can
be turned into a proper bit field.
Alternatives:
1) Do not include any information about how the cache_entry was
allocated. Calling code would be responsible for tracking whether the
cache_entry needed to be freed or not.
Pro: No extra memory overhead to track this state
Con: Extra complexity in callers to handle this correctly.
The extra complexity and burden to not regress this behavior in the
future was more than we wanted.
2) cache_entry would gain knowledge about which mem_pool allocated it
Pro: Could (potentially) do extra logic to know when a mem_pool no
longer had references to any cache_entry
Con: cache_entry would grow heavier by a pointer, instead of int
We didn't see a tangible benefit to this approach
3) Do not add any extra information to a cache_entry, but when freeing a
cache entry, check if the memory exists in a region managed by existing
mem_pools.
Pro: No extra memory overhead to track state
Con: Extra computation is performed when freeing cache entries
We decided tracking and iterating over known memory pool regions was
less desirable than adding an extra field to track this stae.
Signed-off-by: Jameson Miller <jamill@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-02 21:49:37 +02:00
|
|
|
|
|
|
|
return *pool_ptr;
|
|
|
|
}
|
|
|
|
|
2014-06-13 14:19:24 +02:00
|
|
|
static const char *alternate_index_output;
|
2006-07-26 06:32:18 +02:00
|
|
|
|
2008-01-23 08:01:13 +01:00
|
|
|
static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
|
|
|
|
{
|
2021-03-30 15:10:48 +02:00
|
|
|
if (S_ISSPARSEDIR(ce->ce_mode))
|
|
|
|
istate->sparse_index = 1;
|
|
|
|
|
2008-01-23 08:01:13 +01:00
|
|
|
istate->cache[nr] = ce;
|
2008-03-21 21:16:24 +01:00
|
|
|
add_name_hash(istate, ce);
|
2008-01-23 08:01:13 +01:00
|
|
|
}
|
|
|
|
|
Create pathname-based hash-table lookup into index
This creates a hash index of every single file added to the index.
Right now that hash index isn't actually used for much: I implemented a
"cache_name_exists()" function that uses it to efficiently look up a
filename in the index without having to do the O(logn) binary search,
but quite frankly, that's not why this patch is interesting.
No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.
That, in turn, means that suddenly questions like "does this name exist
in the index under an _equivalent_ name?" becomes much much cheaper.
Guiding principles behind this patch:
- it shouldn't be too costly. In fact, my primary goal here was to
actually speed up "git commit" with a fully populated kernel tree, by
being faster at checking whether a file already existed in the index. I
did succeed, but only barely:
Best before:
[torvalds@woody linux]$ time git commit > /dev/null
real 0m0.255s
user 0m0.168s
sys 0m0.088s
Best after:
[torvalds@woody linux]$ time ~/git/git commit > /dev/null
real 0m0.233s
user 0m0.144s
sys 0m0.088s
so some things are actually faster (~8%).
Caveat: that's really the best case. Other things are invariably going
to be slightly slower, since we populate that index cache, and quite
frankly, few things really use it to look things up.
That said, the cost is really quite small. The worst case is probably
doing a "git ls-files", which will do very little except puopulate the
index, and never actually looks anything up in it, just lists it.
Before:
[torvalds@woody linux]$ time git ls-files > /dev/null
real 0m0.016s
user 0m0.016s
sys 0m0.000s
After:
[torvalds@woody linux]$ time ~/git/git ls-files > /dev/null
real 0m0.021s
user 0m0.012s
sys 0m0.008s
and while the thing has really gotten relatively much slower, we're
still talking about something almost unmeasurable (eg 5ms). And that
really should be pretty much the worst case.
So we lose 5ms on one "benchmark", but win 22ms on another. Pick your
poison - this patch has the advantage that it will _likely_ speed up
the cases that are complex and expensive more than it slows down the
cases that are already so fast that nobody cares. But if you look at
relative speedups/slowdowns, it doesn't look so good.
- It should be simple and clean
The code may be a bit subtle (the reasons I do hash removal the way I
do etc), but it re-uses the existing hash.c files, so it really is
fairly small and straightforward apart from a few odd details.
Now, this patch on its own doesn't really do much, but I think it's worth
looking at, if only because if done correctly, the name hashing really can
make an improvement to the whole issue of "do we have a filename that
looks like this in the index already". And at least it gets real testing
by being used even by default (ie there is a real use-case for it even
without any insane filesystems).
NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just
not ashamed of it enough to really care. I took all the numbers out of my
nether regions - I'm sure it's good enough that it works in practice, but
the whole point was that you can make a really much fancier hash that
hashes characters not directly, but by their upper-case value or something
like that, and thus you get a case-insensitive hash, while still keeping
the name and the index itself totally case sensitive.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-23 03:41:14 +01:00
|
|
|
static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
|
|
|
|
{
|
|
|
|
struct cache_entry *old = istate->cache[nr];
|
|
|
|
|
2014-06-13 14:19:39 +02:00
|
|
|
replace_index_entry_in_base(istate, old, ce);
|
2013-02-28 00:57:48 +01:00
|
|
|
remove_name_hash(istate, old);
|
2018-07-02 21:49:31 +02:00
|
|
|
discard_cache_entry(old);
|
2018-03-15 16:25:20 +01:00
|
|
|
ce->ce_flags &= ~CE_HASHED;
|
Fix name re-hashing semantics
We handled the case of removing and re-inserting cache entries badly,
which is something that merging commonly needs to do (removing the
different stages, and then re-inserting one of them as the merged
state).
We even had a rather ugly special case for this failure case, where
replace_index_entry() basically turned itself into a no-op if the new
and the old entries were the same, exactly because the hash routines
didn't handle it on their own.
So what this patch does is to not just have the UNHASHED bit, but a
HASHED bit too, and when you insert an entry into the name hash, that
involves:
- clear the UNHASHED bit, because now it's valid again for lookup
(which is really all that UNHASHED meant)
- if we're being lazy, we're done here (but we still want to clear the
UNHASHED bit regardless of lazy mode, since we can become unlazy
later, and so we need the UNHASHED bit to always be set correctly,
even if we never actually insert the entry into the hash list)
- if it was already hashed, we just leave it on the list
- otherwise mark it HASHED and insert it into the list
this all means that unhashing and rehashing a name all just works
automatically. Obviously, you cannot change the name of an entry (that
would be a serious bug), but nothing can validly do that anyway (you'd
have to allocate a new struct cache_entry anyway since the name length
could change), so that's not a new limitation.
The code actually gets simpler in many ways, although the lazy hashing
does mean that there are a few odd cases (ie something can be marked
unhashed even though it was never on the hash in the first place, and
isn't actually marked hashed!).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-23 05:37:40 +01:00
|
|
|
set_index_entry(istate, nr, ce);
|
2014-06-13 14:19:39 +02:00
|
|
|
ce->ce_flags |= CE_UPDATE_IN_BASE;
|
2017-09-22 18:35:40 +02:00
|
|
|
mark_fsmonitor_invalid(istate, ce);
|
2014-06-13 14:19:27 +02:00
|
|
|
istate->cache_changed |= CE_ENTRY_CHANGED;
|
Create pathname-based hash-table lookup into index
This creates a hash index of every single file added to the index.
Right now that hash index isn't actually used for much: I implemented a
"cache_name_exists()" function that uses it to efficiently look up a
filename in the index without having to do the O(logn) binary search,
but quite frankly, that's not why this patch is interesting.
No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.
That, in turn, means that suddenly questions like "does this name exist
in the index under an _equivalent_ name?" becomes much much cheaper.
Guiding principles behind this patch:
- it shouldn't be too costly. In fact, my primary goal here was to
actually speed up "git commit" with a fully populated kernel tree, by
being faster at checking whether a file already existed in the index. I
did succeed, but only barely:
Best before:
[torvalds@woody linux]$ time git commit > /dev/null
real 0m0.255s
user 0m0.168s
sys 0m0.088s
Best after:
[torvalds@woody linux]$ time ~/git/git commit > /dev/null
real 0m0.233s
user 0m0.144s
sys 0m0.088s
so some things are actually faster (~8%).
Caveat: that's really the best case. Other things are invariably going
to be slightly slower, since we populate that index cache, and quite
frankly, few things really use it to look things up.
That said, the cost is really quite small. The worst case is probably
doing a "git ls-files", which will do very little except puopulate the
index, and never actually looks anything up in it, just lists it.
Before:
[torvalds@woody linux]$ time git ls-files > /dev/null
real 0m0.016s
user 0m0.016s
sys 0m0.000s
After:
[torvalds@woody linux]$ time ~/git/git ls-files > /dev/null
real 0m0.021s
user 0m0.012s
sys 0m0.008s
and while the thing has really gotten relatively much slower, we're
still talking about something almost unmeasurable (eg 5ms). And that
really should be pretty much the worst case.
So we lose 5ms on one "benchmark", but win 22ms on another. Pick your
poison - this patch has the advantage that it will _likely_ speed up
the cases that are complex and expensive more than it slows down the
cases that are already so fast that nobody cares. But if you look at
relative speedups/slowdowns, it doesn't look so good.
- It should be simple and clean
The code may be a bit subtle (the reasons I do hash removal the way I
do etc), but it re-uses the existing hash.c files, so it really is
fairly small and straightforward apart from a few odd details.
Now, this patch on its own doesn't really do much, but I think it's worth
looking at, if only because if done correctly, the name hashing really can
make an improvement to the whole issue of "do we have a filename that
looks like this in the index already". And at least it gets real testing
by being used even by default (ie there is a real use-case for it even
without any insane filesystems).
NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just
not ashamed of it enough to really care. I took all the numbers out of my
nether regions - I'm sure it's good enough that it works in practice, but
the whole point was that you can make a really much fancier hash that
hashes characters not directly, but by their upper-case value or something
like that, and thus you get a case-insensitive hash, while still keeping
the name and the index itself totally case sensitive.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-23 03:41:14 +01:00
|
|
|
}
|
|
|
|
|
2008-07-21 02:25:56 +02:00
|
|
|
void rename_index_entry_at(struct index_state *istate, int nr, const char *new_name)
|
|
|
|
{
|
2018-02-14 19:59:45 +01:00
|
|
|
struct cache_entry *old_entry = istate->cache[nr], *new_entry;
|
2008-07-21 02:25:56 +02:00
|
|
|
int namelen = strlen(new_name);
|
|
|
|
|
2018-07-02 21:49:31 +02:00
|
|
|
new_entry = make_empty_cache_entry(istate, namelen);
|
2018-02-14 19:59:45 +01:00
|
|
|
copy_cache_entry(new_entry, old_entry);
|
|
|
|
new_entry->ce_flags &= ~CE_HASHED;
|
|
|
|
new_entry->ce_namelen = namelen;
|
|
|
|
new_entry->index = 0;
|
|
|
|
memcpy(new_entry->name, new_name, namelen + 1);
|
2008-07-21 02:25:56 +02:00
|
|
|
|
2018-02-14 19:59:45 +01:00
|
|
|
cache_tree_invalidate_path(istate, old_entry->name);
|
|
|
|
untracked_cache_remove_from_index(istate, old_entry->name);
|
2008-07-21 02:25:56 +02:00
|
|
|
remove_index_entry_at(istate, nr);
|
2018-02-14 19:59:45 +01:00
|
|
|
add_index_entry(istate, new_entry, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE);
|
2008-07-21 02:25:56 +02:00
|
|
|
}
|
|
|
|
|
2013-06-20 10:37:50 +02:00
|
|
|
void fill_stat_data(struct stat_data *sd, struct stat *st)
|
|
|
|
{
|
|
|
|
sd->sd_ctime.sec = (unsigned int)st->st_ctime;
|
|
|
|
sd->sd_mtime.sec = (unsigned int)st->st_mtime;
|
|
|
|
sd->sd_ctime.nsec = ST_CTIME_NSEC(*st);
|
|
|
|
sd->sd_mtime.nsec = ST_MTIME_NSEC(*st);
|
|
|
|
sd->sd_dev = st->st_dev;
|
|
|
|
sd->sd_ino = st->st_ino;
|
|
|
|
sd->sd_uid = st->st_uid;
|
|
|
|
sd->sd_gid = st->st_gid;
|
|
|
|
sd->sd_size = st->st_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
int match_stat_data(const struct stat_data *sd, struct stat *st)
|
|
|
|
{
|
|
|
|
int changed = 0;
|
|
|
|
|
|
|
|
if (sd->sd_mtime.sec != (unsigned int)st->st_mtime)
|
|
|
|
changed |= MTIME_CHANGED;
|
|
|
|
if (trust_ctime && check_stat &&
|
|
|
|
sd->sd_ctime.sec != (unsigned int)st->st_ctime)
|
|
|
|
changed |= CTIME_CHANGED;
|
|
|
|
|
|
|
|
#ifdef USE_NSEC
|
|
|
|
if (check_stat && sd->sd_mtime.nsec != ST_MTIME_NSEC(*st))
|
|
|
|
changed |= MTIME_CHANGED;
|
|
|
|
if (trust_ctime && check_stat &&
|
|
|
|
sd->sd_ctime.nsec != ST_CTIME_NSEC(*st))
|
|
|
|
changed |= CTIME_CHANGED;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (check_stat) {
|
|
|
|
if (sd->sd_uid != (unsigned int) st->st_uid ||
|
|
|
|
sd->sd_gid != (unsigned int) st->st_gid)
|
|
|
|
changed |= OWNER_CHANGED;
|
|
|
|
if (sd->sd_ino != (unsigned int) st->st_ino)
|
|
|
|
changed |= INODE_CHANGED;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef USE_STDEV
|
|
|
|
/*
|
|
|
|
* st_dev breaks on network filesystems where different
|
|
|
|
* clients will have different views of what "device"
|
|
|
|
* the filesystem is on
|
|
|
|
*/
|
|
|
|
if (check_stat && sd->sd_dev != (unsigned int) st->st_dev)
|
|
|
|
changed |= INODE_CHANGED;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (sd->sd_size != (unsigned int) st->st_size)
|
|
|
|
changed |= DATA_CHANGED;
|
|
|
|
|
|
|
|
return changed;
|
|
|
|
}
|
|
|
|
|
2005-05-15 23:23:12 +02:00
|
|
|
/*
|
|
|
|
* This only updates the "non-critical" parts of the directory
|
|
|
|
* cache, ie the parts that aren't tracked by GIT, and only used
|
|
|
|
* to validate the cache.
|
|
|
|
*/
|
2019-05-24 14:23:47 +02:00
|
|
|
void fill_stat_cache_info(struct index_state *istate, struct cache_entry *ce, struct stat *st)
|
2005-05-15 23:23:12 +02:00
|
|
|
{
|
2013-06-20 10:37:50 +02:00
|
|
|
fill_stat_data(&ce->ce_stat_data, st);
|
2006-02-09 06:15:24 +01:00
|
|
|
|
|
|
|
if (assume_unchanged)
|
2008-01-15 01:03:17 +01:00
|
|
|
ce->ce_flags |= CE_VALID;
|
2008-01-19 08:45:24 +01:00
|
|
|
|
2017-09-22 18:35:40 +02:00
|
|
|
if (S_ISREG(st->st_mode)) {
|
2008-01-19 08:45:24 +01:00
|
|
|
ce_mark_uptodate(ce);
|
mark_fsmonitor_valid(): mark the index as changed if needed
Without this bug fix, t7519's four "status doesn't detect unreported
modifications" test cases would fail occasionally (and, oddly enough,
*a lot* more frequently on Windows).
The reason is that these test cases intentionally use the side effect of
`git status` to re-write the index if any updates were detected: they
first clean the worktree, run `git status` to update the index as well
as show the output to the casual reader, then make the worktree dirty
again and expect no changes to reported if running with a mocked
fsmonitor hook.
The problem with this strategy was that the index was written during
said `git status` on the clean worktree for the *wrong* reason: not
because the index was marked as changed (it wasn't), but because the
recorded mtimes were racy with the index' own mtime.
As the mtime granularity on Windows is 100 nanoseconds (see e.g.
https://docs.microsoft.com/en-us/windows/desktop/SysInfo/file-times),
the mtimes of the files are often enough *not* racy with the index', so
that that `git status` call currently does not always update the index
(including the fsmonitor extension), causing the test case to fail.
The obvious fix: if we change *any* index entry's `CE_FSMONITOR_VALID`
flag, we should also mark the index as changed. That will cause the
index to be written upon `git status`, *including* an updated fsmonitor
extension.
Side note: Even though the reader might think that the t7519 issue
should be *much* more prevalent on Linux, given that the ext4 filesystem
(that seems to be used by every Linux distribution) stores mtimes in
nanosecond precision. However, ext4 uses `current_kernel_time()` (see
https://unix.stackexchange.com/questions/11599#comment762968_11599; it
is *amazingly* hard to find any proper source of information about such
ext4 questions) whose accuracy seems to depend on many factors but is
safely worse than the 100-nanosecond granularity of NTFS (again, it is
*horribly* hard to find anything remotely authoritative about this
question). So it seems that the racy index condition that hid the bug
fixed by this patch simply is a lot more likely on Linux than on
Windows. But not impossible ;-)
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-24 14:23:48 +02:00
|
|
|
mark_fsmonitor_valid(istate, ce);
|
2017-09-22 18:35:40 +02:00
|
|
|
}
|
2005-05-15 23:23:12 +02:00
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:31 +02:00
|
|
|
static int ce_compare_data(struct index_state *istate,
|
|
|
|
const struct cache_entry *ce,
|
|
|
|
struct stat *st)
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
{
|
|
|
|
int match = -1;
|
2016-10-28 15:23:07 +02:00
|
|
|
int fd = git_open_cloexec(ce->name, O_RDONLY);
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
|
|
|
|
if (fd >= 0) {
|
2017-08-20 22:09:27 +02:00
|
|
|
struct object_id oid;
|
2018-09-21 17:57:31 +02:00
|
|
|
if (!index_fd(istate, &oid, fd, st, OBJ_BLOB, ce->name, 0))
|
2018-08-28 23:22:59 +02:00
|
|
|
match = !oideq(&oid, &ce->oid);
|
2006-07-31 18:55:15 +02:00
|
|
|
/* index_fd() closed the file descriptor already */
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
}
|
|
|
|
return match;
|
|
|
|
}
|
|
|
|
|
2013-06-02 17:46:52 +02:00
|
|
|
static int ce_compare_link(const struct cache_entry *ce, size_t expected_size)
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
{
|
|
|
|
int match = -1;
|
|
|
|
void *buffer;
|
|
|
|
unsigned long size;
|
2007-02-26 20:55:59 +01:00
|
|
|
enum object_type type;
|
2008-12-17 18:47:27 +01:00
|
|
|
struct strbuf sb = STRBUF_INIT;
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
|
2008-12-17 18:47:27 +01:00
|
|
|
if (strbuf_readlink(&sb, ce->name, expected_size))
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
return -1;
|
2008-12-17 18:47:27 +01:00
|
|
|
|
sha1_file: convert read_sha1_file to struct object_id
Convert read_sha1_file to take a pointer to struct object_id and rename
it read_object_file. Do the same for read_sha1_file_extended.
Convert one use in grep.c to use the new function without any other code
change, since the pointer being passed is a void pointer that is already
initialized with a pointer to struct object_id. Update the declaration
and definitions of the modified functions, and apply the following
semantic patch to convert the remaining callers:
@@
expression E1, E2, E3;
@@
- read_sha1_file(E1.hash, E2, E3)
+ read_object_file(&E1, E2, E3)
@@
expression E1, E2, E3;
@@
- read_sha1_file(E1->hash, E2, E3)
+ read_object_file(E1, E2, E3)
@@
expression E1, E2, E3, E4;
@@
- read_sha1_file_extended(E1.hash, E2, E3, E4)
+ read_object_file_extended(&E1, E2, E3, E4)
@@
expression E1, E2, E3, E4;
@@
- read_sha1_file_extended(E1->hash, E2, E3, E4)
+ read_object_file_extended(E1, E2, E3, E4)
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-12 03:27:53 +01:00
|
|
|
buffer = read_object_file(&ce->oid, &type, &size);
|
2008-12-17 18:47:27 +01:00
|
|
|
if (buffer) {
|
|
|
|
if (size == sb.len)
|
|
|
|
match = memcmp(buffer, sb.buf, size);
|
|
|
|
free(buffer);
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
}
|
2008-12-17 18:47:27 +01:00
|
|
|
strbuf_release(&sb);
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
return match;
|
|
|
|
}
|
|
|
|
|
2013-06-02 17:46:52 +02:00
|
|
|
static int ce_compare_gitlink(const struct cache_entry *ce)
|
2007-04-10 06:20:29 +02:00
|
|
|
{
|
2017-10-16 00:07:06 +02:00
|
|
|
struct object_id oid;
|
2007-04-10 06:20:29 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't actually require that the .git directory
|
2007-05-21 22:08:28 +02:00
|
|
|
* under GITLINK directory be a valid git directory. It
|
2007-04-10 06:20:29 +02:00
|
|
|
* might even be missing (in case nobody populated that
|
|
|
|
* sub-project).
|
|
|
|
*
|
|
|
|
* If so, we consider it always to match.
|
|
|
|
*/
|
refs: convert resolve_gitlink_ref to struct object_id
Convert the declaration and definition of resolve_gitlink_ref to use
struct object_id and apply the following semantic patch:
@@
expression E1, E2, E3;
@@
- resolve_gitlink_ref(E1, E2, E3.hash)
+ resolve_gitlink_ref(E1, E2, &E3)
@@
expression E1, E2, E3;
@@
- resolve_gitlink_ref(E1, E2, E3->hash)
+ resolve_gitlink_ref(E1, E2, E3)
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-16 00:07:07 +02:00
|
|
|
if (resolve_gitlink_ref(ce->name, "HEAD", &oid) < 0)
|
2007-04-10 06:20:29 +02:00
|
|
|
return 0;
|
2018-08-28 23:22:59 +02:00
|
|
|
return !oideq(&oid, &ce->oid);
|
2007-04-10 06:20:29 +02:00
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:31 +02:00
|
|
|
static int ce_modified_check_fs(struct index_state *istate,
|
|
|
|
const struct cache_entry *ce,
|
|
|
|
struct stat *st)
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
{
|
|
|
|
switch (st->st_mode & S_IFMT) {
|
|
|
|
case S_IFREG:
|
2018-09-21 17:57:31 +02:00
|
|
|
if (ce_compare_data(istate, ce, st))
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
return DATA_CHANGED;
|
|
|
|
break;
|
|
|
|
case S_IFLNK:
|
2007-03-07 02:44:37 +01:00
|
|
|
if (ce_compare_link(ce, xsize_t(st->st_size)))
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
return DATA_CHANGED;
|
|
|
|
break;
|
2007-04-13 18:24:13 +02:00
|
|
|
case S_IFDIR:
|
2008-01-15 01:03:17 +01:00
|
|
|
if (S_ISGITLINK(ce->ce_mode))
|
2008-07-29 10:13:44 +02:00
|
|
|
return ce_compare_gitlink(ce) ? DATA_CHANGED : 0;
|
consistently use "fallthrough" comments in switches
Gcc 7 adds -Wimplicit-fallthrough, which can warn when a
switch case falls through to the next case. The general idea
is that the compiler can't tell if this was intentional or
not, so you should annotate any intentional fall-throughs as
such, leaving it to complain about any unannotated ones.
There's a GNU __attribute__ which can be used for
annotation, but of course we'd have to #ifdef it away on
non-gcc compilers. Gcc will also recognize
specially-formatted comments, which matches our current
practice. Let's extend that practice to all of the
unannotated sites (which I did look over and verify that
they were behaving as intended).
Ideally in each case we'd actually give some reasons in the
comment about why we're falling through, or what we're
falling through to. And gcc does support that with
-Wimplicit-fallthrough=2, which relaxes the comment pattern
matching to anything that contains "fallthrough" (or a
variety of spelling variants). However, this isn't the
default for -Wimplicit-fallthrough, nor for -Wextra. In the
name of simplicity, it's probably better for us to support
the default level, which requires "fallthrough" to be the
only thing in the comment (modulo some window dressing like
"else" and some punctuation; see the gcc manual for the
complete set of patterns).
This patch suppresses all warnings due to
-Wimplicit-fallthrough. We might eventually want to add that
to the DEVELOPER Makefile knob, but we should probably wait
until gcc 7 is more widely adopted (since earlier versions
will complain about the unknown warning type).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-21 08:25:41 +02:00
|
|
|
/* else fallthrough */
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
default:
|
|
|
|
return TYPE_CHANGED;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-06-02 17:46:52 +02:00
|
|
|
static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
|
2005-04-09 18:48:20 +02:00
|
|
|
{
|
|
|
|
unsigned int changed = 0;
|
|
|
|
|
2008-01-15 01:03:17 +01:00
|
|
|
if (ce->ce_flags & CE_REMOVE)
|
|
|
|
return MODE_CHANGED | DATA_CHANGED | TYPE_CHANGED;
|
|
|
|
|
|
|
|
switch (ce->ce_mode & S_IFMT) {
|
2005-05-05 14:38:25 +02:00
|
|
|
case S_IFREG:
|
|
|
|
changed |= !S_ISREG(st->st_mode) ? TYPE_CHANGED : 0;
|
2005-10-12 03:45:33 +02:00
|
|
|
/* We consider only the owner x bit to be relevant for
|
|
|
|
* "mode changes"
|
|
|
|
*/
|
|
|
|
if (trust_executable_bit &&
|
2008-01-15 01:03:17 +01:00
|
|
|
(0100 & (ce->ce_mode ^ st->st_mode)))
|
2005-05-06 15:45:01 +02:00
|
|
|
changed |= MODE_CHANGED;
|
2005-05-05 14:38:25 +02:00
|
|
|
break;
|
|
|
|
case S_IFLNK:
|
2007-03-02 22:11:30 +01:00
|
|
|
if (!S_ISLNK(st->st_mode) &&
|
|
|
|
(has_symlinks || !S_ISREG(st->st_mode)))
|
|
|
|
changed |= TYPE_CHANGED;
|
2005-05-05 14:38:25 +02:00
|
|
|
break;
|
2007-05-21 22:08:28 +02:00
|
|
|
case S_IFGITLINK:
|
2008-07-29 10:13:44 +02:00
|
|
|
/* We ignore most of the st_xxx fields for gitlinks */
|
2007-04-10 06:20:29 +02:00
|
|
|
if (!S_ISDIR(st->st_mode))
|
|
|
|
changed |= TYPE_CHANGED;
|
|
|
|
else if (ce_compare_gitlink(ce))
|
|
|
|
changed |= DATA_CHANGED;
|
2007-04-13 18:24:13 +02:00
|
|
|
return changed;
|
2005-05-05 14:38:25 +02:00
|
|
|
default:
|
2018-11-10 06:16:04 +01:00
|
|
|
BUG("unsupported ce_mode: %o", ce->ce_mode);
|
2005-05-05 14:38:25 +02:00
|
|
|
}
|
2005-05-23 00:08:15 +02:00
|
|
|
|
2013-06-20 10:37:50 +02:00
|
|
|
changed |= match_stat_data(&ce->ce_stat_data, st);
|
2005-09-20 00:11:15 +02:00
|
|
|
|
2008-06-10 19:44:43 +02:00
|
|
|
/* Racily smudged entry? */
|
2013-06-20 10:37:50 +02:00
|
|
|
if (!ce->ce_stat_data.sd_size) {
|
2016-09-05 22:07:52 +02:00
|
|
|
if (!is_empty_blob_sha1(ce->oid.hash))
|
2008-06-10 19:44:43 +02:00
|
|
|
changed |= DATA_CHANGED;
|
|
|
|
}
|
|
|
|
|
2005-12-20 21:12:18 +01:00
|
|
|
return changed;
|
|
|
|
}
|
|
|
|
|
2015-03-08 11:12:36 +01:00
|
|
|
static int is_racy_stat(const struct index_state *istate,
|
|
|
|
const struct stat_data *sd)
|
2008-01-21 09:44:50 +01:00
|
|
|
{
|
2015-03-08 11:12:36 +01:00
|
|
|
return (istate->timestamp.sec &&
|
make USE_NSEC work as expected
Since the filesystem ext4 is now defined as stable in Linux v2.6.28,
and ext4 supports nanonsecond resolution timestamps natively, it is
time to make USE_NSEC work as expected.
This will make racy git situations less likely to happen. For 'git
checkout' this means it will be less likely that we have to open, read
the contents of the file into RAM, and check if file is really
modified or not. The result sould be a litle less used CPU time, less
pagefaults and a litle faster program, at least for 'git checkout'.
Since the number of possible racy git situations would increase when
disks gets faster, this patch would be more and more helpfull as times
go by. For a fast Solid State Disk, this patch should be helpfull.
Note that, when file operations starts to take less than 1 nanosecond,
one would again start to get more racy git situations.
For more info on racy git, see Documentation/technical/racy-git.txt
For more info on ext4, see http://kernelnewbies.org/Ext4
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-02-19 21:08:29 +01:00
|
|
|
#ifdef USE_NSEC
|
|
|
|
/* nanosecond timestamped files can also be racy! */
|
2015-03-08 11:12:36 +01:00
|
|
|
(istate->timestamp.sec < sd->sd_mtime.sec ||
|
|
|
|
(istate->timestamp.sec == sd->sd_mtime.sec &&
|
|
|
|
istate->timestamp.nsec <= sd->sd_mtime.nsec))
|
make USE_NSEC work as expected
Since the filesystem ext4 is now defined as stable in Linux v2.6.28,
and ext4 supports nanonsecond resolution timestamps natively, it is
time to make USE_NSEC work as expected.
This will make racy git situations less likely to happen. For 'git
checkout' this means it will be less likely that we have to open, read
the contents of the file into RAM, and check if file is really
modified or not. The result sould be a litle less used CPU time, less
pagefaults and a litle faster program, at least for 'git checkout'.
Since the number of possible racy git situations would increase when
disks gets faster, this patch would be more and more helpfull as times
go by. For a fast Solid State Disk, this patch should be helpfull.
Note that, when file operations starts to take less than 1 nanosecond,
one would again start to get more racy git situations.
For more info on racy git, see Documentation/technical/racy-git.txt
For more info on ext4, see http://kernelnewbies.org/Ext4
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-02-19 21:08:29 +01:00
|
|
|
#else
|
2015-03-08 11:12:36 +01:00
|
|
|
istate->timestamp.sec <= sd->sd_mtime.sec
|
make USE_NSEC work as expected
Since the filesystem ext4 is now defined as stable in Linux v2.6.28,
and ext4 supports nanonsecond resolution timestamps natively, it is
time to make USE_NSEC work as expected.
This will make racy git situations less likely to happen. For 'git
checkout' this means it will be less likely that we have to open, read
the contents of the file into RAM, and check if file is really
modified or not. The result sould be a litle less used CPU time, less
pagefaults and a litle faster program, at least for 'git checkout'.
Since the number of possible racy git situations would increase when
disks gets faster, this patch would be more and more helpfull as times
go by. For a fast Solid State Disk, this patch should be helpfull.
Note that, when file operations starts to take less than 1 nanosecond,
one would again start to get more racy git situations.
For more info on racy git, see Documentation/technical/racy-git.txt
For more info on ext4, see http://kernelnewbies.org/Ext4
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-02-19 21:08:29 +01:00
|
|
|
#endif
|
2015-03-08 11:12:36 +01:00
|
|
|
);
|
|
|
|
}
|
|
|
|
|
split-index: smudge and add racily clean cache entries to split index
Ever since the split index feature was introduced [1], refreshing a
split index is prone to a variant of the classic racy git problem.
Consider the following sequence of commands updating the split index
when the shared index contains a racily clean cache entry, i.e. an
entry whose cached stat data matches with the corresponding file in
the worktree and the cached mtime matches that of the index:
echo "cached content" >file
git update-index --split-index --add file
echo "dirty worktree" >file # size stays the same!
# ... wait ...
git update-index --add other-file
Normally, when a non-split index is updated, then do_write_index()
(the function responsible for writing all kinds of indexes, "regular",
split, and shared) recognizes racily clean cache entries, and writes
them with smudged stat data, i.e. with file size set to 0. When
subsequent git commands read the index, they will notice that the
smudged stat data doesn't match with the file in the worktree, and
then go on to check the file's content and notice its dirtiness.
In the above example, however, in the second 'git update-index'
prepare_to_write_split_index() decides which cache entries stored only
in the shared index should be replaced in the new split index. Alas,
this function never looks out for racily clean cache entries, and
since the file's stat data in the worktree hasn't changed since the
shared index was written, it won't be replaced in the new split index.
Consequently, do_write_index() doesn't even get this racily clean
cache entry, and can't smudge its stat data. Subsequent git commands
will then see that the index has more recent mtime than the file and
that the (not smudged) cached stat data still matches with the file in
the worktree, and, ultimately, will erroneously consider the file
clean.
Modify prepare_to_write_split_index() to recognize racily clean cache
entries, and mark them to be added to the split index. Note that
there are two places where it should check raciness: first those cache
entries that are only stored in the shared index, and then those that
have been copied by unpack_trees() from the shared index while it
constructed a new index. This way do_write_index() will get these
racily clean cache entries as well, and will then write them with
smudged stat data to the new split index.
This change makes all tests in 't1701-racy-split-index.sh' pass, so
flip the two 'test_expect_failure' tests to success. Also add the '#'
(as in nr. of trial) to those tests' description that were omitted
when the tests expected failure.
Note that after this change if the index is split when it contains a
racily clean cache entry, then a smudged cache entry will be written
both to the new shared and to the new split indexes. This doesn't
affect regular git commands: as far as they are concerned this is just
an entry in the split index replacing an outdated entry in the shared
index. It did affect a few tests in 't1700-split-index.sh', though,
because they actually check which entries are stored in the split
index; a previous patch in this series has already made the necessary
adjustments in 't1700'. And racily clean cache entries and index
splitting are rare enough to not worry about the resulting duplicated
smudged cache entries, and the additional complexity required to
prevent them is not worth it.
Several tests failed occasionally when the test suite was run with
'GIT_TEST_SPLIT_INDEX=yes'. Here are those that I managed to trace
back to this racy split index problem, starting with those failing
more frequently, with a link to a failing Travis CI build job for
each. The highlighted line [2] shows when the racy file was written,
which is not always in the failing test but in a preceeding setup
test.
t3903-stash.sh:
https://travis-ci.org/git/git/jobs/385542084#L5858
t4024-diff-optimize-common.sh:
https://travis-ci.org/git/git/jobs/386531969#L3174
t4015-diff-whitespace.sh:
https://travis-ci.org/git/git/jobs/360797600#L8215
t2200-add-update.sh:
https://travis-ci.org/git/git/jobs/382543426#L3051
t0090-cache-tree.sh:
https://travis-ci.org/git/git/jobs/416583010#L3679
There might be others, e.g. perhaps 't1000-read-tree-m-3way.sh' and
others using 'lib-read-tree-m-3way.sh', but I couldn't confirm yet.
[1] In the branch leading to the merge commit v2.1.0-rc0~45 (Merge
branch 'nd/split-index', 2014-07-16).
[2] Note that those highlighted lines are in the 'after failure' fold,
and your browser might unhelpfully fold it up before you could
take a good look.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-11 11:43:09 +02:00
|
|
|
int is_racy_timestamp(const struct index_state *istate,
|
2015-03-08 11:12:36 +01:00
|
|
|
const struct cache_entry *ce)
|
|
|
|
{
|
|
|
|
return (!S_ISGITLINK(ce->ce_mode) &&
|
|
|
|
is_racy_stat(istate, &ce->ce_stat_data));
|
2008-01-21 09:44:50 +01:00
|
|
|
}
|
|
|
|
|
2015-03-08 11:12:37 +01:00
|
|
|
int match_stat_data_racy(const struct index_state *istate,
|
|
|
|
const struct stat_data *sd, struct stat *st)
|
|
|
|
{
|
|
|
|
if (is_racy_stat(istate, sd))
|
|
|
|
return MTIME_CHANGED;
|
|
|
|
return match_stat_data(sd, st);
|
2008-01-21 09:44:50 +01:00
|
|
|
}
|
|
|
|
|
2017-09-22 18:35:40 +02:00
|
|
|
int ie_match_stat(struct index_state *istate,
|
2013-06-02 17:46:52 +02:00
|
|
|
const struct cache_entry *ce, struct stat *st,
|
2007-11-10 09:15:03 +01:00
|
|
|
unsigned int options)
|
2005-12-20 21:12:18 +01:00
|
|
|
{
|
2006-02-09 06:15:24 +01:00
|
|
|
unsigned int changed;
|
2007-11-10 09:15:03 +01:00
|
|
|
int ignore_valid = options & CE_MATCH_IGNORE_VALID;
|
2009-12-14 12:43:58 +01:00
|
|
|
int ignore_skip_worktree = options & CE_MATCH_IGNORE_SKIP_WORKTREE;
|
2007-11-10 09:15:03 +01:00
|
|
|
int assume_racy_is_modified = options & CE_MATCH_RACY_IS_DIRTY;
|
2017-09-22 18:35:40 +02:00
|
|
|
int ignore_fsmonitor = options & CE_MATCH_IGNORE_FSMONITOR;
|
2006-02-09 06:15:24 +01:00
|
|
|
|
2017-09-22 18:35:40 +02:00
|
|
|
if (!ignore_fsmonitor)
|
|
|
|
refresh_fsmonitor(istate);
|
2006-02-09 06:15:24 +01:00
|
|
|
/*
|
|
|
|
* If it's marked as always valid in the index, it's
|
|
|
|
* valid whatever the checked-out copy says.
|
2009-12-14 12:43:58 +01:00
|
|
|
*
|
|
|
|
* skip-worktree has the same effect with higher precedence
|
2006-02-09 06:15:24 +01:00
|
|
|
*/
|
2009-12-14 12:43:58 +01:00
|
|
|
if (!ignore_skip_worktree && ce_skip_worktree(ce))
|
|
|
|
return 0;
|
2008-01-15 01:03:17 +01:00
|
|
|
if (!ignore_valid && (ce->ce_flags & CE_VALID))
|
2006-02-09 06:15:24 +01:00
|
|
|
return 0;
|
2017-09-22 18:35:40 +02:00
|
|
|
if (!ignore_fsmonitor && (ce->ce_flags & CE_FSMONITOR_VALID))
|
|
|
|
return 0;
|
2006-02-09 06:15:24 +01:00
|
|
|
|
2008-11-29 04:56:34 +01:00
|
|
|
/*
|
|
|
|
* Intent-to-add entries have not been added, so the index entry
|
|
|
|
* by definition never matches what is in the work tree until it
|
|
|
|
* actually gets added.
|
|
|
|
*/
|
2015-08-22 03:08:05 +02:00
|
|
|
if (ce_intent_to_add(ce))
|
2008-11-29 04:56:34 +01:00
|
|
|
return DATA_CHANGED | TYPE_CHANGED | MODE_CHANGED;
|
|
|
|
|
2006-02-09 06:15:24 +01:00
|
|
|
changed = ce_match_stat_basic(ce, st);
|
2005-12-20 21:12:18 +01:00
|
|
|
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
/*
|
|
|
|
* Within 1 second of this sequence:
|
|
|
|
* echo xyzzy >file && git-update-index --add file
|
|
|
|
* running this command:
|
|
|
|
* echo frotz >file
|
|
|
|
* would give a falsely clean cache entry. The mtime and
|
|
|
|
* length match the cache, and other stat fields do not change.
|
|
|
|
*
|
|
|
|
* We could detect this at update-index time (the cache entry
|
|
|
|
* being registered/updated records the same time as "now")
|
|
|
|
* and delay the return from git-update-index, but that would
|
|
|
|
* effectively mean we can make at most one commit per second,
|
|
|
|
* which is not acceptable. Instead, we check cache entries
|
|
|
|
* whose mtime are the same as the index file timestamp more
|
2006-02-09 06:15:24 +01:00
|
|
|
* carefully than others.
|
Racy GIT
This fixes the longstanding "Racy GIT" problem, which was pretty
much there from the beginning of time, but was first
demonstrated by Pasky in this message on October 24, 2005:
http://marc.theaimsgroup.com/?l=git&m=113014629716878
If you run the following sequence of commands:
echo frotz >infocom
git update-index --add infocom
echo xyzzy >infocom
so that the second update to file "infocom" does not change
st_mtime, what is recorded as the stat information for the cache
entry "infocom" exactly matches what is on the filesystem
(owner, group, inum, mtime, ctime, mode, length). After this
sequence, we incorrectly think "infocom" file still has string
"frotz" in it, and get really confused. E.g. git-diff-files
would say there is no change, git-update-index --refresh would
not even look at the filesystem to correct the situation.
Some ways of working around this issue were already suggested by
Linus in the same thread on the same day, including waiting
until the next second before returning from update-index if a
cache entry written out has the current timestamp, but that
means we can make at most one commit per second, and given that
the e-mail patch workflow used by Linus needs to process at
least 5 commits per second, it is not an acceptable solution.
Linus notes that git-apply is primarily used to update the index
while processing e-mailed patches, which is true, and
git-apply's up-to-date check is fooled by the same problem but
luckily in the other direction, so it is not really a big issue,
but still it is disturbing.
The function ce_match_stat() is called to bypass the comparison
against filesystem data when the stat data recorded in the cache
entry matches what stat() returns from the filesystem. This
patch tackles the problem by changing it to actually go to the
filesystem data for cache entries that have the same mtime as
the index file itself. This works as long as the index file and
working tree files are on the filesystems that share the same
monotonic clock. Files on network mounted filesystems sometimes
get skewed timestamps compared to "date" output, but as long as
working tree files' timestamps are skewed the same way as the
index file's, this approach still works. The only problematic
files are the ones that have the same timestamp as the index
file's, because two file updates that sandwitch the index file
update must happen within the same second to trigger the
problem.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20 09:02:15 +01:00
|
|
|
*/
|
2008-01-21 09:44:50 +01:00
|
|
|
if (!changed && is_racy_timestamp(istate, ce)) {
|
2006-08-16 06:38:07 +02:00
|
|
|
if (assume_racy_is_modified)
|
|
|
|
changed |= DATA_CHANGED;
|
|
|
|
else
|
2018-09-21 17:57:31 +02:00
|
|
|
changed |= ce_modified_check_fs(istate, ce, st);
|
2006-08-16 06:38:07 +02:00
|
|
|
}
|
2005-09-20 00:11:15 +02:00
|
|
|
|
|