# Add gRPC endpoint with auth that can merge results from multiple Zoekt nodes
## Problem to solve
Currently, we combine search results from multiple Zoekt nodes using threads, which is inefficient and can lead to excessive memory usage when processing up to 5,000 results. We need a more efficient approach that allows for streaming results and early termination when limits are reached.
## Proposal
Implement a gRPC endpoint in `gitlab-zoekt-webserver` that enables:
- Streaming search results from multiple nodes concurrently
- Early termination when result limits are reached
- Result ranking and deduplication across nodes
- Secure access with JWT authentication
- Client-side pagination without fetching all results
This will improve scalability, reduce memory usage, and enable GitLab to display search results as they become available.
## Implementation Components
### 1. gRPC Service Definition
- Define protobuf service with streaming search endpoint
- Include support for pagination, filtering, and result limits
- Add metadata fields for ranking and relevance scoring
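The bullets above could translate into a protobuf sketch along these lines. All service, message, and field names here are illustrative assumptions, not a final API:

```proto
syntax = "proto3";

package gitlab.zoekt.v1;

// Hypothetical streaming search service; names are illustrative.
service WebserverService {
  // Server-streaming search: matches are sent as they become available,
  // so the client can stop reading once it has enough.
  rpc StreamSearch(StreamSearchRequest) returns (stream StreamSearchResponse);
}

message StreamSearchRequest {
  string query = 1;
  int32 limit = 2;           // hard cap, e.g. 5000
  int32 page_size = 3;       // client-side pagination hint
  repeated string repos = 4; // optional repository filter
}

message StreamSearchResponse {
  repeated FileMatch matches = 1;
  bool done = 2; // true on the final message of the stream
}

message FileMatch {
  string repository = 1;
  string path = 2;
  double score = 3; // relevance score used for cross-node ranking
  repeated LineMatch line_matches = 4;
}

message LineMatch {
  string line = 1;
  int32 line_number = 2;
}
```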
### 2. Node Discovery and Coordination
- Design aggregation strategy to combine results efficiently
- Use a channel-based approach to collect results from multiple nodes
- Implement early cancellation when result limit is reached
- Consider using a priority queue for merging results based on relevance
- Implement load balancing across nodes
### 3. Results Streaming and Ranking
- Stream results as they become available from each node
- Implement relevance ranking algorithm
- Support early termination when limits are reached
### 4. Authentication and Security
- Implement JWT validation for secure access
- Secure node-to-node communication
### 5. Error Handling and Resilience
- Gracefully handle node failures
- Support partial results when some nodes are unavailable
- Implement timeout and cancellation mechanisms
### 6. Client Integration (separate issue)
- Update GitLab search service to consume streaming API
- Support pagination without fetching all results
## Technical Considerations
- **Backward compatibility:**
  - Maintain the existing JSON API for backward compatibility
  - Gradually migrate clients to the gRPC API
- **Security:**
  - Ensure JWT authentication follows industry best practices
- **Deployment:**
  - Update the nginx configuration in the Helm chart to route the gRPC endpoint correctly
## Success Metrics
- Reduce memory usage by at least 50% during large searches
- Efficiently handle up to 5,000 results without performance degradation
## References
- Zoekt maintainer's comment on aggregation
- Add temporary search endpoint to zoekt indexer ... (gitlab-zoekt-indexer!310 - merged)
## Previous issue description
### Problem to solve
Currently, we have to use threads to combine results from multiple nodes. We also have gitlab-zoekt-indexer!310 (merged) to move the existing logic to the indexer, but I think our long-term solution should be a gRPC endpoint added to the webserver binary.
### Proposal
A gRPC endpoint would allow us to stream results from multiple nodes at the same time and stop as soon as we reach 5,000 results (our limit). It would also open the possibility of streaming results to GitLab (or other clients).
```mermaid
sequenceDiagram
    participant Client as Client
    participant Zoekt1 as Zoekt Node 1
    participant Zoekt2 as Zoekt Node 2
    participant Zoekt3 as Zoekt Node 3
    Client->>Zoekt1: Send search request via gRPC
    Zoekt1-->>Zoekt1: Process local search
    Zoekt1->>Zoekt2: Send search request to Zoekt Node 2
    Zoekt1->>Zoekt3: Send search request to Zoekt Node 3
    Zoekt2-->>Zoekt1: Return partial search results
    Zoekt3-->>Zoekt1: Return partial search results
    Zoekt1-->>Client: Aggregate and return results
```
### Context
Other important items:
- The new endpoint should be part of `gitlab-zoekt-webserver`, not the indexer, since we want to split reads and writes into different processes. That was the reason behind gitlab-zoekt-indexer!218 (merged) and gitlab-org/build/CNG!2227 (merged)
- We should add JWT authentication to the new gRPC endpoint
- In this comment, the Zoekt maintainer shares some ideas on how to aggregate shards efficiently
### Technical Challenges Discussed
- **Current JSON endpoint issues:**
  - Sometimes returns hundreds of megabytes of data
  - Must receive the entire payload of 5,000 results before parsing
- **Results streaming discussion:**
  - Current needs are typically the first 10 pages of results
  - Page-size requirements differ between the legacy and new UI
  - Proposal to implement streaming so that only the necessary results are received
  - Total required results can reach up to 5,000 matches, especially for the new UI
- **Pagination and limits:**
  - Discussion around defining "necessary" results and predicting final limits
  - Challenge in determining an optimal limit due to varying use cases between UIs