The Ingestion Bottleneck
When dealing with high-velocity data streams in AI systems, the traditional "Request-Response" model for document ingestion is a recipe for disaster. If your embedding service goes down or experiences a spike in latency, your entire API hangs.
In our recent infrastructure overhaul, we moved to a Kafka-first ingestion architecture.
The Architecture: Decoupling for Resilience
We designed a three-tier pipeline:
- The Ingestor (Producer): A lightweight Spring Boot service that validates the schema and drops the raw payload into a
raw-documentstopic. - The Processor (Consumer): A pool of workers that handle chunking and embedding generation.
- The Indexer (Consumer): A final stage that writes to both PostgreSQL and the Vector Store.
Why Kafka instead of RabbitMQ?
We chose Kafka for its Replayability. If we decide to change our chunking strategy or move to a different embedding model (e.g., switching from OpenAI to a self-hosted Llama-index), we can simply reset the consumer offsets and re-process months of data without re-triggering the source system.
Throughput Optimization: Batching and Parallelism
To achieve our 30% throughput improvement, we tuned the Kafka consumers:
- Batch Size: We increased
fetch.min.bytesto 50KB to reduce the number of I/O requests. - Parallelism: We matched the number of partitions to our consumer thread pool size to ensure no worker was idle.
@KafkaListener(topics = "document-chunks", containerFactory = "batchFactory")
public void handleBatch(List<ConsumerRecord<String, ChunkDTO>> records) {
// Parallel processing using Java Parallel Streams for CPU-bound embedding tasks
records.parallelStream().forEach(this::processChunk);
}
Handling Failures: The Dead Letter Topic (DLT) Pattern
In a distributed system, failure is a guarantee. We implemented a robust retry strategy with exponential backoff:
- Retry Topic: Transient failures (like a 503 from the embedding API) are moved to a retry topic.
- DLT: Permanent failures (invalid content, encoding errors) are moved to a DLT for manual inspection.
This ensures that one bad document doesn't stall the entire pipeline.
Lessons from Production
One mistake we made early on was not monitoring Consumer Lag closely enough. A slow embedding service caused a 2-hour backlog in minutes. We now have Prometheus/Grafana alerts that trigger auto-scaling of our consumer pods whenever the lag exceeds 5,000 records.
Summary
Designing for scale means assuming every downstream service will eventually fail. By using Kafka as our "architectural buffer," we turned a brittle ingestion system into a fault-tolerant engine capable of sustaining 100k+ events/sec.