Glad this is interesting to others too!
Can you please elaborate on what kind of index shards are being pushed to the git repo (binary/text, full/partial index shards)? Would it be a repo that’s set up specifically for storing indexing data, or would you be pushing index-specific refs to the project’s own repo (the project that’s being indexed)?
The idea (and it’s not essential to the concept) is to have a dedicated repo for the index, just used to distribute the data to the serving processes. Some distribution is necessary because the indexer is an intensive batch job that we should run on one machine, and the serving is a latency-sensitive job you probably want to replicate, and certainly don’t want fighting with the indexer for resources.
Because adding deps to LLVM is hard, and networking, security and such are hard, and git is ubiquitous and well-understood and handles all these problems, it’s tempting to just shell out to it (consider what the configuration space for network shares, cloud storage, etc. looks like). But git may not make sense (e.g. if most index files change on every run).
What kind of complexities do you see when it comes to testing the RPC layer? I think that testing it locally by using in-process test RPC should be fairly reliable as it should avoid the IPC/network issues. But of course it’s necessary to have actual IPC/network integration tests as well.
In my experience using real RPCs across components in one process is fairly fine/easy (mostly you’re worrying about picking unused ports, right security settings etc). Faking out the transport is definitely possible but I’m not sure it’s actually worth it (extra code and you test less of the real code). Actual multi-process tests are definitely more work (coordination, debugging failures etc) and I’d think we could limit this to lit integration tests.
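To illustrate the "real RPCs in one process" style, here is a minimal sketch. It uses Python's standard-library XML-RPC purely as a stand-in transport (not a proposal for the actual protocol), and `fuzzy_find` and its symbol list are made up for the example. Binding to port 0 lets the OS pick an unused port, which sidesteps the port-picking problem.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def fuzzy_find(query):
    """Hypothetical index operation: case-insensitive substring match."""
    symbols = ["clang::Sema", "clang::SourceManager", "llvm::StringRef"]
    return [s for s in symbols if query.lower() in s.lower()]


def start_test_server():
    # Port 0 asks the OS for any free port; read it back from server_address.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(fuzzy_find)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def run_roundtrip(query):
    """Start a real server in this process, call it over loopback, shut down."""
    server, thread = start_test_server()
    try:
        port = server.server_address[1]
        with xmlrpc.client.ServerProxy(f"http://127.0.0.1:{port}") as client:
            return client.fuzzy_find(query)
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

The request goes through real serialization and a real socket, so the test exercises most of the production path without any cross-process coordination.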
I’m looking for a solution for indexing a large codebase that has several challenges (a lot of generated source, a lot of build flavors, and many individual translation units that are compiled multiple times with different flags and different headers, within the same project).
An index server will definitely help with generated source, since you can just generate everything before indexing (vs background-indexing which can’t really do this). Build flavors is less clear - of course you can run one index server for each and let clients choose. But supporting multiple “colors” of symbols within one server would require further design. TUs compiled multiple times… hard to say! We might need to iterate on some of these.
I’ve worked on a project that used XMLRPC in the past, with good results and encryption support. Don’t know much about other tools, but there are several that are open-source. A JSON-based RPC implementation might be a good choice, for consistency with other clang-related services
I think JSON would be preferable to XML as a format for consistency (particularly within clangd to reuse ways of marshalling data) and for simplicity.
I’m wary of falling into wiring together an RPC system ourselves out of a JSON encoder, an HTTP client etc - doing that for LSP was fairly expensive (despite only stdin/stdout) and will be a maintenance burden when used over real networks.
Another issue is that we care a lot about latency, and servicing a typical code completion request from the index requires fetching quite a lot of data. A binary format and/or compression, and an RPC system optimized for latency will likely make a measurable difference to user experience. JSON over HTTP is commonly used, but various blogs that benchmarked it against alternatives claim that e.g. gRPC is 5-10x faster. This warrants further investigation.
I am just wondering - are you guys also thinking about support for global index for multiple projects? For example features like “give me all references in Google monorepo to this symbol”.
I’m not quite sure what you’re asking here - surfacing choice of what index to use to the user? Currently the index, once configured, mostly hides silently behind various features.
I think we probably want the ability to specify the index server to use on a per-codebase basis, maybe with a file similar to .clang-format or compile_commands.json.
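As a rough sketch of how such per-codebase discovery could work, the lookup below walks up from a source file until it finds a config file, the way .clang-format is found. The file name `.index-server` is purely hypothetical; nothing here reflects an existing clangd option.

```python
from pathlib import Path

# Hypothetical config file name, chosen only for this example.
CONFIG_NAME = ".index-server"


def find_index_config(source_file, config_name=CONFIG_NAME):
    """Return the nearest config file in the source file's directory or any
    ancestor directory, or None if no ancestor has one."""
    start = Path(source_file).resolve().parent
    for directory in (start, *start.parents):
        candidate = directory / config_name
        if candidate.is_file():
            return candidate
    return None
```

The nearest file wins, so a subproject can point at a different index server than the rest of the tree.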
It seems that with the kind of setup you are aiming for it might be possible to get a lot of that done mostly by just adding some extra information to USR / SymbolId (maybe just a path?) and possibly some kind of FS overlay to translate between paths on the end-user’s machine and “indexer/server” paths.
Certainly such path translation is needed. Clangd uses URIs in the index interface rather than absolute paths to support such translation. (e.g. Symbol.Definition.FileURI is “google3://relative/path.h” in our internal index). I’m not sure what we need to add to SymbolID, though - it would be useful to keep this integer-sized.
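A minimal sketch of that translation, assuming each client has a table from URI scheme to the local checkout root. The function names and the root path are invented for the example, and this is not how clangd implements its URI schemes.

```python
from pathlib import PurePosixPath


def resolve_index_uri(uri, scheme_roots):
    """Map an index URI like 'google3://relative/path.h' to a local path."""
    scheme, sep, relative = uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI: {uri!r}")
    if scheme not in scheme_roots:
        raise KeyError(f"no local root configured for scheme {scheme!r}")
    return str(PurePosixPath(scheme_roots[scheme]) / relative)


def to_index_uri(local_path, scheme_roots):
    """Inverse mapping: local path -> index URI, or None if the path is not
    under any configured root."""
    path = PurePosixPath(local_path)
    for scheme, root in scheme_roots.items():
        try:
            relative = path.relative_to(root)
        except ValueError:
            continue  # path is not under this root, try the next one
        return f"{scheme}://{relative}"
    return None


# Hypothetical client-side configuration: where this user's checkout lives.
roots = {"google3": "/home/user/src/google3"}
local = resolve_index_uri("google3://relative/path.h", roots)
```

Because the index only ever stores the scheme-relative form, the same index data can serve users whose checkouts live in different places.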
I’m thinking about a scenario where a global index could cover multiple projects but also multiple active branches all within the same index.
Yeah, modelling source control is complicated. Even in the absence of branches, developers have source code checked out at different revisions, so one global index won’t reflect what’s actually available.
We haven’t found this to be a big problem in practice, but then again most developers at Google don’t use long-lived branches and this also encourages frequent commits.
Taking the union of multiple indexes from different configurations/branches might work well. Clangd’s index infrastructure has support for overlays too (this is how we combine the static/background index with the dynamic index of opened files). That’s particularly useful when team X owns a branch that modifies only a certain subdirectory: you can build a small index for the branch and overlay it on the main one.
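The overlay idea can be sketched roughly as follows. This is a toy model with made-up class names and dict-based symbol records, not clangd's actual merge logic: lookups consult the small overlay first and fall back to the base index.

```python
class DictIndex:
    """Toy in-memory index: symbol ID -> symbol record (a plain dict)."""

    def __init__(self, symbols):
        self._symbols = dict(symbols)

    def lookup(self, symbol_id):
        return self._symbols.get(symbol_id)

    def symbol_ids(self):
        return set(self._symbols)


class OverlayIndex:
    """Combines two indexes; the overlay's entry wins when both have one."""

    def __init__(self, overlay, base):
        self._overlay = overlay
        self._base = base

    def lookup(self, symbol_id):
        found = self._overlay.lookup(symbol_id)
        return found if found is not None else self._base.lookup(symbol_id)

    def symbol_ids(self):
        return self._overlay.symbol_ids() | self._base.symbol_ids()


# Hypothetical data: a main index plus a small index for team X's branch.
main_index = DictIndex({
    1: {"name": "foo", "file": "proj://lib/foo.h"},
    2: {"name": "bar", "file": "proj://lib/bar.h"},
})
branch_index = DictIndex({
    2: {"name": "bar", "file": "proj://teamx/bar.h"},
    3: {"name": "baz", "file": "proj://teamx/baz.h"},
})
merged = OverlayIndex(branch_index, main_index)
```

Since `OverlayIndex` exposes the same interface as the indexes it wraps, overlays can be stacked (e.g. opened files over branch over main). One thing this sketch ignores is deletion: a symbol removed on the branch would still show up from the base index.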
One point that is not clear to me is how to represent file paths for the different branches that might also not be on the client’s file system - possibly some unique URI can be built. Then navigating to those URIs would require the LSP client to fetch the content transparently.
Yeah, this is an interesting question - LSP seems designed around the idea that URIs will be file:/// for the local system, though it’s not explicit, and the use of URIs is an obvious extension point. There are other uses beyond indexes too: it’d be nice to get rid of the requirement to ship built-in-headers around (just link their content into clangd), but you still want go-to-definition to work. I think this is best thought of as a separate extension, as you say.
About access control, it might be useful to have a customizable layer for this to accommodate various possible corporate authentication mechanisms.
Yes. Though the right tradeoff here might be to write a plugin and re-link clangd - dynamic plugins (processes or shared objects) may be too much complexity for the value they bring. Most options are probably amenable to this: with HTTP-based options we can let plugins mangle the headers, gRPC has a pluggable credential framework, etc.