
Add gRPC endpoint with auth that can merge results from multiple Zoekt nodes

Problem to solve

Currently, we combine search results from multiple Zoekt nodes using threads, which is inefficient and can lead to excessive memory usage when processing up to 5,000 results. We need a more efficient approach that allows for streaming results and early termination when limits are reached.

Proposal

Implement a gRPC endpoint in the gitlab-zoekt-webserver that enables:

  1. Streaming search results from multiple nodes concurrently
  2. Early termination when result limits are reached
  3. Result ranking and deduplication across nodes
  4. Secure access with JWT authentication
  5. Client-side pagination without fetching all results

This will improve scalability, reduce memory usage, and enable GitLab to display search results as they become available.
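
As a rough illustration of the client-side pagination point above, a client could read from the stream only until its page is full and then stop, instead of downloading the full result payload. The sketch below is in Go and assumes a generated server-streaming stub; the `resultStream` interface, `FileMatch` type, and `collectPage` helper are illustrative names, not an existing API (the real GitLab client would live in the Rails code base, this just shows the pattern).

```go
package zoektclient

import "io"

// FileMatch is a stand-in for one streamed search result.
type FileMatch struct {
	Repository string
	FileName   string
}

// resultStream is the minimal shape of a gRPC server-streaming client:
// the stub generated from the new service would expose something like this.
type resultStream interface {
	Recv() (*FileMatch, error)
}

// collectPage reads just enough matches to fill one page and then returns,
// instead of waiting for the full 5,000-result payload. Once the caller
// stops reading (and cancels its request context), the server can abandon
// the remaining work.
func collectPage(stream resultStream, pageSize int) ([]*FileMatch, error) {
	page := make([]*FileMatch, 0, pageSize)
	for len(page) < pageSize {
		m, err := stream.Recv()
		if err == io.EOF {
			break // the server finished before the page filled up
		}
		if err != nil {
			return page, err
		}
		page = append(page, m)
	}
	return page, nil
}
```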

Implementation Components

1. gRPC Service Definition

  • Define protobuf service with streaming search endpoint
  • Include support for pagination, filtering, and result limits
  • Add metadata fields for ranking and relevance scoring
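
A minimal sketch of what the streaming contract could map to on the Go side once the protobuf service is generated; every name here (`SearchRequest`, `FileMatch`, `ResultStream`, `SearchService`) is an assumption for illustration, not the actual proto or generated code.

```go
package zoektgrpc

// SearchRequest carries the query plus the knobs called out above:
// pagination, filtering, and result limits.
type SearchRequest struct {
	Query       string   // Zoekt query string
	MaxResults  int32    // hard cap, e.g. 5000
	PageSize    int32    // client-side page size
	RepoFilters []string // optional repository filters
}

// FileMatch is one streamed result, including metadata for ranking.
type FileMatch struct {
	Repository string
	FileName   string
	Score      float64 // relevance score used when merging across nodes
	Lines      []string
}

// ResultStream is the server-side view of a gRPC server stream.
type ResultStream interface {
	Send(*FileMatch) error
}

// SearchService is the server contract: one server-streaming RPC that
// emits matches as they arrive instead of buffering the full payload.
type SearchService interface {
	Search(req *SearchRequest, stream ResultStream) error
}
```

The real definition would live in a .proto file with a server-streaming RPC (e.g. `rpc Search(SearchRequest) returns (stream FileMatch)`), which the Go gRPC code generator would turn into equivalents of these types.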

2. Node Discovery and Coordination

  • Design aggregation strategy to combine results efficiently (see the Go sketch after this list)
    • Use a channel-based approach to collect results from multiple nodes
    • Implement early cancellation when result limit is reached
    • Consider using a priority queue for merging results based on relevance
  • Implement load balancing across nodes
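
A sketch of the channel-based aggregation and early cancellation, reusing the illustrative `SearchRequest` and `FileMatch` types from the previous sketch. The `searchNode` callback stands in for the per-node gRPC call and is assumed to select on `ctx.Done()` when sending, so cancellation actually stops it; load balancing and priority-queue merging are left out here.

```go
package zoektgrpc

import (
	"context"
	"sync"
)

// fanOutSearch queries every node concurrently, funnels matches into one
// channel, and cancels the remaining work once `limit` results have been
// collected.
func fanOutSearch(ctx context.Context, nodes []string, req *SearchRequest,
	limit int,
	searchNode func(ctx context.Context, node string, req *SearchRequest, out chan<- *FileMatch) error,
) []*FileMatch {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	out := make(chan *FileMatch)
	var wg sync.WaitGroup

	for _, node := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			// Errors are ignored here for brevity; the real endpoint
			// should surface them as partial-result metadata.
			_ = searchNode(ctx, node, req, out)
		}(node)
	}

	// Close the channel once every node goroutine has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	results := make([]*FileMatch, 0, limit)
	for m := range out {
		results = append(results, m)
		if len(results) >= limit {
			// Early termination: stop all in-flight node searches.
			cancel()
			break
		}
	}
	// Drain anything still in flight so no sender blocks on an unread channel.
	go func() {
		for range out {
		}
	}()
	return results
}
```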

3. Results Streaming and Ranking

  • Stream results as they become available from each node
  • Implement relevance ranking algorithm
  • Support early termination when limits are reached
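
One way the ranking and deduplication could work is a bounded min-heap keyed on relevance score, with a seen-set for cross-node duplicates. This is a sketch reusing the illustrative `FileMatch` type from above; the actual scoring function and dedup key are open design questions.

```go
package zoektgrpc

import "container/heap"

// matchHeap is a min-heap ordered by Score, used as a bounded top-K buffer.
type matchHeap []*FileMatch

func (h matchHeap) Len() int           { return len(h) }
func (h matchHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h matchHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *matchHeap) Push(x any)        { *h = append(*h, x.(*FileMatch)) }
func (h *matchHeap) Pop() any {
	old := *h
	n := len(old)
	m := old[n-1]
	*h = old[:n-1]
	return m
}

// topRanked keeps at most k matches, dropping the lowest-scored one when the
// buffer is full, and skips matches already seen (same repository + file).
func topRanked(in <-chan *FileMatch, k int) []*FileMatch {
	seen := make(map[string]bool)
	h := &matchHeap{}

	for m := range in {
		key := m.Repository + "\x00" + m.FileName
		if seen[key] {
			continue // duplicate reported by another node
		}
		seen[key] = true
		heap.Push(h, m)
		if h.Len() > k {
			heap.Pop(h) // evict the current lowest-scored match
		}
	}

	// Pop in ascending order, filling from the back to return highest score first.
	out := make([]*FileMatch, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(*FileMatch)
	}
	return out
}
```

A priority queue like this also doubles as the merge structure mentioned under node coordination; whether ranking happens before or while streaming to the client depends on how strict the ordering guarantee needs to be.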

4. Authentication and Security

  • Implement JWT validation for secure access
  • Secure node-to-node communication
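
For the JWT piece, a gRPC stream interceptor could validate a token passed in request metadata before the handler runs. The sketch below assumes a shared HMAC secret and an `authorization: Bearer <token>` metadata header, and uses the github.com/golang-jwt/jwt/v5 and google.golang.org/grpc packages; the header name, claims, and signing scheme are assumptions, not the final design.

```go
package zoektgrpc

import (
	"fmt"
	"strings"

	"github.com/golang-jwt/jwt/v5"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// streamAuthInterceptor rejects streaming calls that do not carry a valid
// HMAC-signed JWT in the "authorization" metadata header.
func streamAuthInterceptor(secret []byte) grpc.StreamServerInterceptor {
	return func(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
		md, ok := metadata.FromIncomingContext(ss.Context())
		if !ok || len(md.Get("authorization")) == 0 {
			return status.Error(codes.Unauthenticated, "missing token")
		}
		raw := strings.TrimPrefix(md.Get("authorization")[0], "Bearer ")

		_, err := jwt.Parse(raw, func(t *jwt.Token) (interface{}, error) {
			// Refuse unexpected signing algorithms before verifying the signature.
			if _, ok := t.Method.(*jwt.SigningMethodHMAC); !ok {
				return nil, fmt.Errorf("unexpected signing method: %v", t.Header["alg"])
			}
			return secret, nil
		})
		if err != nil {
			return status.Error(codes.Unauthenticated, "invalid token")
		}
		return handler(srv, ss)
	}
}
```

Registering it would look roughly like `grpc.NewServer(grpc.StreamInterceptor(streamAuthInterceptor(secret)))`; node-to-node traffic would additionally want TLS, which is out of scope for this sketch.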

5. Error Handling and Resilience

  • Gracefully handle node failures
  • Support partial results when some nodes are unavailable
  • Implement timeout and cancellation mechanisms
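
A sketch of how per-node timeouts and graceful degradation might fit around the fan-out above (same illustrative types and callback as before): each node gets its own deadline, and a failing or slow node is recorded rather than failing the whole request, so the response can be flagged as partial.

```go
package zoektgrpc

import (
	"context"
	"log"
	"time"
)

// searchWithTimeout wraps one per-node search with its own deadline. A
// failure is logged and reported to the caller instead of aborting the
// whole request, so results from healthy nodes can still be returned and
// the response metadata can mark them as partial.
func searchWithTimeout(ctx context.Context, node string, req *SearchRequest,
	out chan<- *FileMatch, nodeTimeout time.Duration,
	search func(ctx context.Context, node string, req *SearchRequest, out chan<- *FileMatch) error,
) (failed bool) {
	ctx, cancel := context.WithTimeout(ctx, nodeTimeout)
	defer cancel()

	if err := search(ctx, node, req, out); err != nil {
		log.Printf("zoekt node %s failed or timed out: %v", node, err)
		return true
	}
	return false
}
```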

6. Client Integration (separate issue)

  • Update GitLab search service to consume streaming API
  • Support pagination without fetching all results

Technical Considerations

  1. Backward Compatibility:
    • Maintain the existing JSON API for backward compatibility
    • Gradually migrate clients to the gRPC API
  2. Security:
    • Ensure JWT authentication follows industry best practices
  3. Deployment:
    • Update the nginx config in the Helm chart to route the gRPC endpoint correctly

Success Metrics

  • Reduce memory usage by at least 50% during large searches
  • Efficiently handle up to 5,000 results without performance degradation

References

Previous issue description

Problem to solve

Currently, we have to use threads to combine results from multiple nodes. We also have gitlab-zoekt-indexer!310 (merged) to move the existing logic to the indexer, but I think our long-term solution should be a gRPC endpoint added to the webserver binary.

Proposal

A gRPC endpoint would allow us to stream results from multiple nodes at the same time and stop as soon as we reach 5,000 results (our limit). It would also open up the possibility of streaming results to GitLab (or other clients).

sequenceDiagram
    participant Client as Client
    participant Zoekt1 as Zoekt Node 1
    participant Zoekt2 as Zoekt Node 2
    participant Zoekt3 as Zoekt Node 3

    Client->>Zoekt1: Send search request via gRPC
    Zoekt1-->>Zoekt1: Process local search
    Zoekt1->>Zoekt2: Send search request to Zoekt Node 2
    Zoekt1->>Zoekt3: Send search request to Zoekt Node 3
    Zoekt2-->>Zoekt1: Return partial search results
    Zoekt3-->>Zoekt1: Return partial search results
    Zoekt1-->>Client: Aggregate and return results

Context

Other important items:

Technical Challenges Discussed

  1. Current JSON Endpoint Issues:
    • Sometimes returns hundreds of megabytes of data
    • Must receive entire payload of 5,000 results for parsing
  2. Results Streaming Discussion:
    • Current needs are typically first 10 pages of results
    • Page size requirements differ between legacy and new UI
    • Proposal to implement streaming to receive only necessary results
    • Total required results can potentially reach up to 5,000 matches, especially for new UI
  3. Pagination and Limits:
    • Discussion around defining "necessary" results and predicting final limits
    • Challenge in determining optimal limit due to varying use cases between UIs