Using background and static indexes simultaneously for large codebases

Hi all, I’m looking for some feedback on the following problem and proposed solution.

Summary:

I want to add a new mode where the background index only indexes changed files and a static index is used for unchanged files.

Problem:

For large codebases, a static index can take hours to generate. It is therefore infeasible to regenerate it every time a developer wants to pull changes into their workspace. The background indexer is also unable to help for large codebases, because it indexes all files, making it slow to find the files of interest and using significant system resources while doing so. If the background index is used with a static index, it also duplicates effort for files that have not changed since the static index was generated.

A developer can open the changed files so that clangd builds them in the in-memory FileIndex, but this is cumbersome and does not persist between loads.

Proposed solution:

An extra ‘mode’ is added to background indexing where it only indexes files changed after a certain time. A static index is still supplied (remote, or with --index-file) to support unchanged files. The background indexing bridges the gap between the static index and the current state of the codebase, without having to open all the changed files.

Summary of modes:

--background-index=off
Equivalent to current --background-index=false

--background-index=full
Equivalent to --background-index=true

--background-index=changed
Only indexes files changed after a certain time (baseline time). The baseline time can be startup or the modification timestamp of a certain file.

In ‘changed’ mode, the background index does not observe updates to the CDB (which is what currently happens for --background-index=true). Instead, a ‘BackgroundFileWatcher’ starts a thread that traverses the file system under a specific directory (given as an argument at startup), looking for files whose modification time is after the baseline time. It enqueues those files as they are found, via a lambda that wraps the BackgroundIndexer’s ‘enqueue’ method (like BackgroundIdx, BackgroundFileWatcher is a field of ClangdServer).

Prototype:

I’ve created a prototype for the above solution that works, and would be interested in making a pull request if there’s interest/approval. I have found that the background index and the static index work well together as designed.

The prototype uses timestamps to find changed files, but the final implementation could support swapping this for different modules depending on how the user wants the filesystem to be monitored; e.g., a git module for git repos.

I’m looking for feedback on whether this problem is considered an actual problem, and, if so, whether the proposed solution sounds suitable.

Thanks in advance.

Hi @patrick-doolan !

Thanks a lot for bringing this up and exploring the area. This is an issue that has been bugging me as well.

So there are basically two pieces to the puzzle:

  • Figuring out which files are to be considered changed on disk vs on an external index (let it be an index file or remote index).
  • Having a way to filter files to be indexed in BackgroundIndex.

Also from your proposal it is unclear whether you are trying to solve the problem of:

  • The external index might be built at a different revision than the local checkout, hence we’d like to index all the files that have been modified since the shared external index was built.
  • User might have disjoint editing sessions, in which different subsets of the files are edited. Any edits done in the previous sessions are lost and we’d like to preserve them.

The former definitely covers the latter, so I am assuming you are implying that one. But I’d like to note that the latter is substantially different if solved on its own.

Figuring out changes

Using timestamps

The solution you propose using timestamps for this change detection makes sense, but there are some shortcomings.

Before we dive into details, I would like to point out that most VCSs (at least git, svn and perforce) by default stamp files copied into the local workspace with the checkout/sync time, not their check-in time. In other words, if you do a clean checkout of LLVM today, all of the source files will have “now” as their modification time.
We can solve this issue by integrating with VCS, as they usually provide a way to query the time a file was checked-in to the repository, but this would again be another system to design and implement.

Using clangd startup time

So one decision that needs to be made, as you mentioned, is figuring out a baseline to start indexing.
I am not sure if making use of clangd startup time makes sense in that regard:

  • Surely this will catch any changes to on-disk contents while a particular clangd instance is running, but it won’t detect changes made on disk while no clangd process was running.
  • Maybe I am getting this all wrong and we are actually after the changes done by the user through their editor (and then saved on disk). So that we can make sure developer’s state is preserved across different editing sessions (i.e. you edit a couple files today and close your editor, tomorrow you make more edits to different files that depend on your previous changes, without necessarily opening those files). In such a scenario we actually have didSave notifications from LSP and can persist in-memory index contents to disk for further retrieval (this gives birth to another problem about when to load a shard though).
  • Moreover, as mentioned once you do a checkout, no matter which revision of the repo you sync to, your VCS will probably mark those files as modified now so clangd will end up indexing all the files no matter what version of those files you have in the external index.
  • As an extra complication, clangd now needs a filewatcher implementation. We considered this in the past, but it was hard to find an efficient solution that works across multiple platforms. LSP actually provides support for watching file changes, but it is unclear which LSP clients implement that. Hence we didn’t want to put too much effort into a solution that would only work for a fraction of the user base.
  • Now we also have the problem of deciding when to load background index shards on startup. Since we are using the startup time, all the shards are definitely stale from that perspective. So the user will have a “fresh” index until they restart clangd. If we load them unconditionally, these shards will definitely go stale at some point. So we’ll probably need to load them iff they match the current contents of the file, which is still imperfect.
  • Let’s say a header has been modified, but most of its dependents were untouched (e.g. you add a default parameter to a function signature). How do we know what other files to index? Do we just index a single TU depending on the header (as that’s the granularity we run our indexer, we don’t have compile flags otherwise) or do we index all the dependents (otherwise xrefs will be broken for that symbol). Indexing all dependents might imply big portions of the codebase if the edited header was a common one.
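To make the last point concrete, here is a rough sketch of what "index all the dependents" means in terms of an include graph. The IncludeGraph type and the transitiveDependents function are hypothetical names for illustration only; this is not how clangd represents includes internally.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical input: maps each file to the files it directly includes.
using IncludeGraph = std::map<std::string, std::vector<std::string>>;

// Returns every file that directly or transitively includes Header,
// found with a breadth-first walk over the reversed include edges.
inline std::set<std::string>
transitiveDependents(const IncludeGraph &Includes, const std::string &Header) {
  // Invert the edges: included file -> files that include it.
  std::map<std::string, std::vector<std::string>> IncludedBy;
  for (const auto &[File, Deps] : Includes)
    for (const auto &Dep : Deps)
      IncludedBy[Dep].push_back(File);

  std::set<std::string> Result;
  std::queue<std::string> Work;
  Work.push(Header);
  while (!Work.empty()) {
    const std::string Current = Work.front();
    Work.pop();
    const auto It = IncludedBy.find(Current);
    if (It == IncludedBy.end())
      continue;
    for (const auto &User : It->second)
      if (Result.insert(User).second) // Visit each dependent only once.
        Work.push(User);
  }
  Result.erase(Header); // Only relevant if the include graph has a cycle.
  return Result;
}
```

For a widely included header, the returned set approaches the whole codebase, which is the cost concern raised above. Indexing a single TU instead corresponds to picking one source file out of this set.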

Summary

Considering all of these, I think we actually need some information from the underlying index about a file’s status (e.g. when it was indexed or what were the contents) and it is probably not feasible to make use of timestamps directly (i.e. without integrating with VCS) to make them work in general.
Unfortunately, this solution is not great either. File contents are not linearly ordered the way timestamps are, so we’ll end up indexing not only “new” files but also “old” ones (though at least we’ll provide a “fresh” index for the local workspace). Indexing “old” files means that developers lagging far behind head might end up indexing huge portions of the codebase locally. Moreover, the same problem around modified headers will also surface with a content-based change detection approach.
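A small sketch of the content-based check discussed in this summary, under the assumption that the external index can report a per-file digest. The DigestMap type, the filesNeedingIndex function, and the use of FNV-1a as the digest are all stand-ins chosen for illustration, not clangd's actual data structures or hash.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// 64-bit FNV-1a, used here only as a stand-in content digest.
inline std::uint64_t digest(const std::string &Contents) {
  std::uint64_t H = 14695981039346656037ULL; // FNV offset basis.
  for (unsigned char C : Contents) {
    H ^= C;
    H *= 1099511628211ULL; // FNV prime.
  }
  return H;
}

// Hypothetical: path -> digest of the contents the external index was built from.
using DigestMap = std::map<std::string, std::uint64_t>;

// Returns the local files whose contents differ from what the external index
// saw, plus files the external index does not know about at all.
inline std::vector<std::string>
filesNeedingIndex(const DigestMap &ExternalDigests,
                  const std::map<std::string, std::string> &LocalContents) {
  std::vector<std::string> Result;
  for (const auto &[Path, Contents] : LocalContents) {
    const auto It = ExternalDigests.find(Path);
    if (It == ExternalDigests.end() || It->second != digest(Contents))
      Result.push_back(Path);
  }
  return Result;
}
```

Note that the comparison is only "equal or not equal": a local file that is older than the indexed revision is reported just like one that is newer, which is the non-linearity drawback mentioned above.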

Filtering files to be indexed

This is somewhat easier compared to the previous problem. But I believe we should still trigger background indexing only at file discovery (or on save events published by the editors). That way we won’t need an extra file watcher, and we can hopefully keep background-index and ClangdServer’s interactions with it simple (well, at least not more complicated than today). I suppose we can discuss this bit further if we figure out what to do about the first bit.

Oh one thing I forgot to mention, clangd actually allows turning on background indexing for only certain parts of a codebase. Would that be an applicable middle-ground solution for you?

e.g. for LLVM one can turn on bg-index only for clangd subdirectory via:

If:
  PathMatch: /path/to/llvm/clang-tools-extra/clangd/.*
Index:
  Background: Build

This will turn on bg-index for all files matching that regex, no matter what the status of the external index is.

Thanks for the detailed and thoughtful feedback, Kadir. You raise important points. I’m going to reply to each of your points to make sure I understand the issues.

The former is indeed the problem I am trying to solve.

Timestamps vs VCS

I think I communicated this poorly - for my prototype, which is used with an unusual VCS, timestamps are the best option. But for other VCS modules I was imagining clangd would harness the diff utilities of the VCS. So, for example, with git you could supply an extra argument for the commit hash which the static index was generated against. Clangd would then exec something like git diff <static index commit hash> HEAD --name-only (perhaps with a git status as well, to pick up uncommitted changes) and monitor the output for changed files. This would avoid dealing with checkout/sync times etc. Alternatively, perhaps this commit hash could be embedded as information in the index file itself, similar to what you were suggesting?
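As a rough illustration of the git module idea, the sketch below builds the diff command and parses its newline-separated output into a set of paths. The helper names (looksLikeCommitHash, buildDiffCommand, parsePathList) are hypothetical, and actually running the command (for example through a process-spawning API) is left out.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Accepts only hex strings, so a user-supplied value cannot smuggle extra
// shell syntax into the command line. The length bounds are an assumption.
inline bool looksLikeCommitHash(const std::string &S) {
  return S.size() >= 7 && S.size() <= 64 &&
         std::all_of(S.begin(), S.end(),
                     [](unsigned char C) { return std::isxdigit(C) != 0; });
}

// Builds the command that lists files differing between BaseCommit and HEAD.
// Returns an empty string if BaseCommit does not look like a commit hash.
inline std::string buildDiffCommand(const std::string &BaseCommit) {
  if (!looksLikeCommitHash(BaseCommit))
    return "";
  return "git diff --name-only " + BaseCommit + " HEAD";
}

// Splits newline-separated path output into a sorted, deduplicated set,
// ignoring blank lines and trailing carriage returns.
inline std::set<std::string> parsePathList(const std::string &Output) {
  std::set<std::string> Paths;
  std::istringstream In(Output);
  std::string Line;
  while (std::getline(In, Line)) {
    if (!Line.empty() && Line.back() == '\r')
      Line.pop_back();
    if (!Line.empty())
      Paths.insert(Line);
  }
  return Paths;
}
```

One possible variation for uncommitted work: git diff --name-only <commit> without HEAD compares the working tree against the commit, and git ls-files --others --exclude-standard lists untracked files. Both print one path per line, so parsePathList could be reused and the resulting sets merged.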

Using clangd startup time

Again, this would only be for my use case, not in an environment with a normal VCS, and I only use this as a fallback if no other time is given. In this case, as you mention, we’re not achieving the main goal of indexing the files which differ between the external static index and the local checkout.

For the baseline time I generally use a file our VCS creates in a new checkout before a user makes their changes, so in my case the files modified after that are the local changes. For non-VCS or unsupported VCS users, they would need to identify (if possible) such a file for their case. For other VCS, which I don’t anticipate will use timestamps (see above), this case should not arise.

I think this is addressed above in the ‘timestamps vs VCS’ comment, but harnessing the VCS would hopefully identify which files in the checkout were different from the external index, provided some hash/identifier of the external index version was given.

For my prototype I found spaced-out polling with llvm::sys::fs::recursive_directory_iterator worked. I agree that doing this through LSP would be too great a requirement on the clients.

It’s also worth noting that the filewatcher would only be in the ‘timestamps’ module, not any VCS modules. The polling for those would be running the VCS command to find changed files.

This is a good point. For the prototype, all the changed files are enqueued for indexing on each startup, which addresses staleness for those shards. In my use case, the cache is deleted between local change sets, so I don’t worry about leftover shards from previous workspaces being erroneously loaded. However, we could enforce loading shards iff they meet the criteria for being enqueued in the first place.

This is an excellent point, and one I had not considered. As you mention, for large codebases and commonly-used headers, trying to index all usages could negate the performance benefit of this feature relative to full background indexing. From my experience with the prototype, I find this to be a small enough case that the feature is still useful without perfect xrefs for symbols such as the one in this example.

Can you elaborate on what clangd does in this circumstance with regular background indexing? Does it reindex all files that include the header on any change to the header?

My prototype supports pulling in changes after clangd has started, without having opened them. My understanding is that if this happens after file discovery, these wouldn’t be detected by the existing background indexer? If not, maybe on file discovery would be okay?

Thanks, this is helpful to know. It doesn’t quite reach my use case because our changed files can be spread over the codebase.

So, from this, it seems like the key issues are:

  • Does using VCS features to detect changes sound reasonable?
  • Does the process for handling stale shards I’ve described sound reasonable?
  • Is the issue with missing xrefs because of changed header files enough to prevent this being upstreamed?

Again, thanks for your detailed feedback and help.

One thing to consider here is: would the VCS interaction be done by clangd directly, or mediated via LSP (with the actual VCS interaction done by the client)? The latter would have the advantage of avoiding duplicating work (since clients typically already have some sort of VCS integration, at least VSCode does), and supporting an open-ended set of VCSs (whichever ones the client has a plugin for) rather than a finite set implemented in clangd.