What AI Engineers Can Learn From OpenSearch's Move to gRPC: Performance, Scalability

Why a search engine changing its transport protocol quietly became one of the most important infrastructure decisions for RAG and vector workloads in 2025

Let's start with a number that deserves more attention than it's getting.

Uber's data engineering team migrated their Apache Spark bulk ingestion jobs from the standard REST API to OpenSearch's new gRPC Bulk endpoint. The result? A 20–35% reduction in job runtime. Not 2–3%. Not a marginal improvement you'd chalk up to noise. Twenty to thirty-five percent — on jobs that were already running in production, at scale, on tuned infrastructure.

That number tells a bigger story than most people realize. Because it isn't really about gRPC. It's about what happens when a system built for text search suddenly has to move floating-point vectors at volume — and the protocol underneath it hasn't caught up.

The Setup: Why OpenSearch Matters to AI Engineers

If you're building a RAG pipeline today, your stack probably looks something like this:

User query
   ↓
LLM (OpenAI / Bedrock / Gemini)
   ↓
Embedding model
   ↓
Vector store / Search engine  ← This is where OpenSearch lives
   ↓
Retrieved context → LLM response

OpenSearch sits at a critical junction: it's where your embeddings live, where semantic search happens, and increasingly where hybrid retrieval (combining BM25 with k-NN) runs. If the ingestion pipeline into that layer is slow or resource-hungry, everything upstream feels it.

OpenSearch has been around since 2021, when AWS forked it from Elasticsearch 7.10. For most of that time, it spoke one language to the outside world: HTTP + JSON over REST. That was fine. JSON is readable. REST is predictable. The entire ecosystem understood it.

Then AI workloads showed up.

The Real Problem Isn't HTTP

Here's where most discussions around this topic get it wrong. The common assumption is that REST is slow because of HTTP/1.1 limitations — connection overhead, head-of-line blocking, the cost of establishing a new TCP connection for every request.

But OpenSearch was already mitigating this. HTTP Keep-Alive was supported. Connection pooling existed. Engineers tuning production deployments knew how to reuse connections.

The actual bottleneck sits somewhere else entirely: JSON serialization.

Consider what a vector looks like in a bulk indexing request:

{
  "index": { "_index": "product-embeddings", "_id": "12345" }
}
{
  "product_id": "12345",
  "name": "Running shoe",
  "embedding": [0.023, -0.187, 0.412, 0.009, -0.341, ... ]
}

That embedding field? A single text-embedding-3-small vector from OpenAI has 1,536 dimensions. Each float, serialized as a decimal string in JSON, takes roughly 8–12 bytes including its separator. So one embedding field alone comes to roughly 12–18 KB of ASCII text in a JSON payload, versus about 6 KB (1,536 × 4 bytes) for the same values as raw float32.
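To make the size arithmetic concrete, here is a rough sketch using only the Python standard library. It assumes values rounded to seven decimal places as a stand-in for how a float32 typically prints. Real embeddings and real JSON encoders will differ somewhat, so treat the output as an order-of-magnitude check rather than a benchmark.

```python
import json
import random
import struct

DIMENSIONS = 1536  # text-embedding-3-small, as cited above


def make_embedding(dimensions: int, seed: int = 0) -> list[float]:
    """Generate a stand-in embedding; seven decimals is an assumption, not a spec."""
    rng = random.Random(seed)
    return [round(rng.uniform(-1.0, 1.0), 7) for _ in range(dimensions)]


def json_size(values: list[float]) -> int:
    """Bytes used by the array as compact JSON text."""
    return len(json.dumps(values, separators=(",", ":")).encode("utf-8"))


def float32_size(values: list[float]) -> int:
    """Bytes used by the same values packed as little-endian IEEE 754 float32."""
    return len(struct.pack(f"<{len(values)}f", *values))


embedding = make_embedding(DIMENSIONS)
json_bytes = json_size(embedding)
binary_bytes = float32_size(embedding)

print(f"JSON text:   {json_bytes} bytes")
print(f"raw float32: {binary_bytes} bytes")
print(f"ratio:       {json_bytes / binary_bytes:.2f}x")
```

The binary size is fixed at four bytes per dimension, while the JSON size depends on how many digits each value prints with, which is why the text form is both larger and less predictable.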

Now multiply that by a Spark job ingesting a few million product records.

Every document goes through the same cycle: serialize the Python/JVM float array to a JSON string, send it over the wire, then OpenSearch receives it, parses the JSON string back into a float array, and passes it to Lucene for indexing. Two full float-array conversions — write and read — for every single document. At millions of documents, that's not a transport problem. That's a compute problem hiding inside the transport layer.

Anyone who's worked with high-throughput data pipelines recognizes this pattern. The network isn't the constraint. The serialization tax is.
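The cycle described above can be sketched in a few lines. This is a minimal illustration, not OpenSearch's actual code path: the client side builds a newline-delimited bulk body (one action line and one source line per document), and a Python stand-in for the server parses every source line back into floats. The helper names and the synthetic product records are hypothetical, and the real server does its parsing on the JVM, so the timings only show that both conversions cost CPU, not how much.

```python
import json
import random
import time

INDEX_NAME = "product-embeddings"  # index name from the example above
DIMENSIONS = 1536


def build_bulk_body(docs: list[dict]) -> str:
    """Client side: one action line plus one source line per document (NDJSON)."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": INDEX_NAME, "_id": doc["product_id"]}}
        lines.append(json.dumps(action, separators=(",", ":")))
        lines.append(json.dumps(doc, separators=(",", ":")))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def parse_bulk_body(body: str) -> list[dict]:
    """Stand-in for the server side: parse every source line back into floats."""
    lines = body.splitlines()
    return [json.loads(source) for source in lines[1::2]]


def make_docs(count: int, seed: int = 0) -> list[dict]:
    """Hypothetical product records with random stand-in embeddings."""
    rng = random.Random(seed)
    return [
        {
            "product_id": str(i),
            "name": f"product-{i}",
            "embedding": [rng.uniform(-1.0, 1.0) for _ in range(DIMENSIONS)],
        }
        for i in range(count)
    ]


docs = make_docs(200)

start = time.perf_counter()
body = build_bulk_body(docs)
serialize_seconds = time.perf_counter() - start

start = time.perf_counter()
parsed = parse_bulk_body(body)
parse_seconds = time.perf_counter() - start

print(f"serialize: {serialize_seconds * 1000:.1f} ms for {len(docs)} docs")
print(f"parse:     {parse_seconds * 1000:.1f} ms for {len(docs)} docs")
```

Note that the bulk body puts each action and each document on a single line; the pretty-printed JSON shown earlier is only for readability.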

What gRPC + Protobuf Actually Changes

gRPC is Google's open-source RPC framework. It uses Protocol Buffers (Protobuf) as its wire format — a binary encoding that's typed, compact, and schema-defined.

For a repeated float field in Protobuf, the encoding is straightforward: a field tag, a length prefix, and then raw IEEE 754 bytes. No decimal conversion. No string parsing. The float goes in as bytes, travels as bytes, and comes out as bytes.

Here's what the Protobuf definition for a vector document might look like:

message VectorDocument {
  string id = 1;
  string name = 2;
  repeated float embedding = 3 [packed = true];
}

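That packed = true on the float field tells Protobuf to encode all 1,536 floats as a contiguous binary block rather than individually tagged values. (In proto3 this is already the default for repeated numeric fields; the option only needs to be spelled out in proto2.) The result is a payload that's a fraction of the JSON equivalent.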
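To show what that contiguous block looks like on the wire, here is a hand-rolled sketch of the packed encoding for a single repeated float field, using only the standard library. It is an illustration of the wire layout, not the protobuf library and not OpenSearch's actual schema; the function names are made up for this example, and field number 3 simply mirrors the embedding field in the message above.

```python
import struct

LENGTH_DELIMITED = 2  # protobuf wire type used for packed repeated fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def decode_varint(data: bytes, offset: int) -> tuple[int, int]:
    """Return (value, next_offset) for the varint starting at offset."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_packed_floats(field_number: int, values: list[float]) -> bytes:
    """Field tag, then payload byte length, then raw little-endian float32 values."""
    tag = encode_varint((field_number << 3) | LENGTH_DELIMITED)
    payload = struct.pack(f"<{len(values)}f", *values)
    return tag + encode_varint(len(payload)) + payload


def decode_packed_floats(data: bytes) -> list[float]:
    """Inverse of encode_packed_floats for a buffer holding just this one field."""
    _tag, offset = decode_varint(data, 0)
    length, offset = decode_varint(data, offset)
    return list(struct.unpack(f"<{length // 4}f", data[offset:offset + length]))


# Field 3 with 1,536 floats: 1 tag byte + 2 length bytes + 6,144 payload bytes.
encoded = encode_packed_floats(3, [0.0] * 1536)
print(len(encoded))
```

The overhead is three bytes for the whole array, and decoding is a single unpack of fixed-width values with no text parsing involved.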

OpenSearch's own benchmarks back this up: Protobuf delivers a 53% reduction in payload size compared to JSON for the same data. And if you stack SMILE (a binary JSON encoding from the Jackson project) on top for the document content, the total payload reduction hits 65% versus REST + JSON.

Beyond payload size, gRPC brings HTTP/2 multiplexing: multiple concurrent requests over a single connection, no HTTP-level head-of-line blocking (a lost TCP packet can still stall every stream on that connection), and built-in flow control. For bulk ingestion pipelines hammering a cluster with parallel requests, this matters more than most benchmarks capture.

Uber's Results Tell the Full Story

Uber is one of the largest OpenSearch operators in the world. Their M3 metrics platform, their Eats recommendations, their driver matching — significant parts of their data infrastructure run through OpenSearch clusters.

According to Uber's engineering blog, when they migrated their Spark ingestion jobs to the gRPC Bulk API, the 20–35% runtime reduction wasn't the only improvement. They also measured maximum indexing delay during failovers — a metric that directly impacts customer experience. Under high write throughput, the delay dropped by 20–35% with gRPC compared to REST.

The gains were especially pronounced for vector search queries, which makes intuitive sense when you think about the serialization argument. A vector query's embedding has to be serialized on the client, sent to OpenSearch, deserialized there, and only then used for k-NN search. With JSON, that query vector alone is a multi-kilobyte ASCII blob. With Protobuf, it's raw bytes.

Uber's delivery shopping list recommendations — a workload involving grocery item embeddings for search on Uber Eats — saw roughly a 53% reduction in p50 search latency (from 83ms to 38ms) and around a 43% reduction in p95 latency (from 114ms to 64ms). Those aren't micro-benchmark numbers. Those are production search latencies on real user traffic.

The pattern is clear: the more vector-heavy the workload, the bigger the gRPC advantage. And as AI workloads push more embeddings through more pipelines, this advantage only compounds.

What This Means for Your RAG Stack

Here's the core thesis, broken down to a typical RAG ingestion flow:

The REST path:

Embedding model output (float32 array)
   → serialize to JSON string       # CPU cost
   → HTTP/1.1 request               # network
   → OpenSearch receives JSON       # ~2.1x the Protobuf size
   → parse JSON back to float32     # CPU cost again
   → k-NN index

The gRPC path:

Embedding model output (float32 array)
   → encode to Protobuf bytes       # near-zero cost
   → HTTP/2 gRPC request            # smaller payload, multiplexed
   → OpenSearch receives Protobuf   # already binary
   → decode bytes to float32        # trivial
   → k-NN index

The two expensive JSON conversions (serialize on the client, parse on the server) are replaced by a cheap binary encode and decode. For teams ingesting embeddings at any meaningful volume, whether that means processing PDFs, chunking documents, or running nightly re-embedding jobs, this translates directly to measurable time and cost savings.

There's also an agentic AI angle worth considering. As RAG systems evolve toward multi-step retrieval with agents that issue dozens of search calls per user request, per-call latency adds up. A 4.74% improvement in p50 search latency (as measured in OpenSearch's own k-NN benchmarks) doesn't sound dramatic, and the percentage itself doesn't grow with more calls. The absolute time saved does: across 20 sequential retrieval calls in an agentic pipeline, every millisecond removed from each hop is multiplied by 20 in the user's wait time.
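A quick back-of-the-envelope calculation shows the scale involved. This sketch assumes purely sequential calls and, as a loud assumption, borrows Uber's 83 ms REST p50 as the per-call baseline for both scenarios, even though the 4.74% figure comes from a different benchmark. The function names are illustrative.

```python
def sequential_total_ms(per_call_ms: float, calls: int) -> float:
    """Total latency when each retrieval waits for the previous one."""
    return per_call_ms * calls


def saved_ms(per_call_ms: float, reduction: float, calls: int) -> float:
    """Absolute time saved across the chain for a fractional per-call reduction."""
    return per_call_ms * reduction * calls


CALLS = 20          # sequential retrievals, as in the example above
BASELINE_MS = 83.0  # assumption: Uber's REST p50 reused as a per-call baseline

scenarios = {
    "4.74% reduction (OpenSearch k-NN benchmark)": 0.0474,
    "83 ms -> 38 ms (Uber p50)": (83.0 - 38.0) / 83.0,
}

total = sequential_total_ms(BASELINE_MS, CALLS)
print(f"baseline chain: {total:.0f} ms")
for label, reduction in scenarios.items():
    print(f"{label}: saves {saved_ms(BASELINE_MS, reduction, CALLS):.1f} ms")
```

Under these assumptions the 20-call chain takes 1,660 ms at baseline. The 4.74% case saves about 79 ms, while the larger Uber-style reduction saves 900 ms, which is the difference between a tweak and a visibly faster response.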

Should Your Team Consider Migrating?

OpenSearch's gRPC support landed experimentally in version 3.0 (May 2025) and matured through 3.2 and 3.3, where it was highlighted as one of the key contributors to query latency improvements averaging 11x compared to OpenSearch 1.3. The 2026 roadmap explicitly targets enhanced gRPC APIs as a major architectural initiative.

That said, migration isn't a universal recommendation. It depends on what you're actually running.

Strong case for gRPC:

You're ingesting embeddings at scale (millions of documents, nightly batch jobs)
You're running vector search on a latency-sensitive, user-facing path
Your Spark or Flink jobs are a meaningful line item on the cloud bill
You're building agentic pipelines with high retrieval fan-out

Less urgent:

Small-scale RAG prototypes or internal tools
Primarily text-based, non-vector workloads
Teams still on OpenSearch 2.x (the gRPC transport requires 3.0 or later)

One practical consideration: the gRPC client ecosystem for OpenSearch is newer and less battle-tested than the REST clients. If your team runs a polyglot environment, verify client support for your language stack before committing to a migration timeline. OpenSearch's gRPC transport also provides an SPI for plugins like k-NN to extend, which is a good sign for long-term coverage — but it's still evolving.

What About Elasticsearch?

Elasticsearch doesn't currently offer gRPC transport support in its public API. This is becoming a meaningful divergence point between the two projects.

If you're on Elasticsearch and running high-volume vector workloads, you're working around the same JSON serialization cost — but without this particular escape valve. That doesn't make Elasticsearch the wrong choice; it has its own strengths on other dimensions. But on the specific problem of vector ingestion and query throughput, OpenSearch has moved faster, and that gap is worth tracking.

The Bigger Pattern

What makes this story worth paying attention to goes beyond one feature in one search engine. It illustrates something that happens repeatedly in infrastructure: a system designed for one workload gets adopted for a different one, and eventually the protocol has to catch up.

OpenSearch was built for log search and full-text retrieval. JSON was the right wire format for that world — human-readable, flexible, universally supported. Then vector search arrived, and suddenly the system was being asked to move gigabytes of floating-point arrays through a format optimized for structured text. The fit was never wrong, exactly. It was just inefficient in ways that only became visible at scale.

gRPC didn't fix OpenSearch. It fixed the mismatch between what OpenSearch was designed to carry and what AI infrastructure actually needs to send.

That's the insight worth carrying into your own architectural decisions. As your systems absorb more AI workloads — more embeddings, more vector queries, more high-dimensional data — the question isn't just which model or which vector store. It's whether the pipes connecting them are sized for the data you're actually moving.

Sources: Uber Engineering Blog — "Accelerating Search and Ingestion with High-Performance gRPC in OpenSearch" (April 2026). OpenSearch Blog — "Advancing OpenSearch with gRPC and Protocol Buffers" (March 2026). OpenSearch 3.0–3.3 release notes.
