July 3, 2026

What I learned building a RAG pipeline in Go

I spent a few weekends building a small retrieval-augmented generation pipeline over a pile of internal docs: runbooks, incident postmortems, old setup guides, the kind of stuff that lives in a wiki nobody opens until something is on fire. The goal was plain: type a question in normal language, get back an answer grounded in an actual document instead of a guess.

We already had keyword search over the same docs, and it was fine for exact terms. Search “connection pool exhausted” and you’d find the postmortem that used those exact words. Search “why does the API slow down under load” and you’d get nothing, because none of the postmortems phrased it that way. That gap, matching intent instead of matching words, is the entire reason to reach for embeddings instead of full-text search.

The pipeline is boring on purpose

Nothing about the architecture is exotic. Documents get split into chunks, each chunk gets embedded, the embeddings go into Postgres with pgvector, and at query time the question gets embedded and compared against every chunk with cosine distance:

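import (
	"context"
	"fmt"

	openai "github.com/sashabaranov/go-openai" // assumed client library
)

// embed returns the embedding vector for a single piece of text.
func embed(ctx context.Context, client *openai.Client, text string) ([]float32, error) {
	resp, err := client.CreateEmbeddings(ctx, openai.EmbeddingRequest{
		Model: openai.SmallEmbedding3,
		Input: []string{text},
	})
	if err != nil {
		return nil, fmt.Errorf("embed: %w", err)
	}
	if len(resp.Data) == 0 {
		return nil, fmt.Errorf("embed: response contained no embeddings")
	}
	return resp.Data[0].Embedding, nil
}

-- $1 is the query embedding; <=> is pgvector's cosine distance operator.
select content, doc_id, 1 - (embedding <=> $1) as similarity
from chunks
order by embedding <=> $1
limit 5;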
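
For anyone who has not looked inside the `<=>` operator, here is a minimal sketch of the cosine distance that the query above ranks by, written in plain Go. The function name `cosineDistance` and its error handling for zero vectors are my own choices for illustration, not anything pgvector exposes.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b): 0 for vectors pointing the same way,
// 1 for orthogonal vectors. Returning an error for mismatched lengths or
// zero vectors is an illustrative choice made for this sketch.
func cosineDistance(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0, fmt.Errorf("cosine distance is undefined for a zero vector")
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB)), nil
}

func main() {
	query := []float32{1, 0}
	same := []float32{2, 0}       // same direction, different magnitude
	orthogonal := []float32{0, 3} // unrelated direction

	d1, _ := cosineDistance(query, same)
	d2, _ := cosineDistance(query, orthogonal)
	fmt.Printf("same direction: %.2f, orthogonal: %.2f\n", d1, d2)
	// prints "same direction: 0.00, orthogonal: 1.00"
}
```

Magnitude drops out of the result, which is why the SQL can turn distance into a similarity score with a plain `1 - distance`.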

That is the whole trick. The interesting part, the part that actually determines whether the answers are any good, is everything upstream of this query: how the chunks get made in the first place.

Chunking is where most of the real work is

My first pass split documents by a fixed character count with no overlap. It worked well enough to demo and badly enough in practice that I almost gave up on the idea. Chunks that end mid-sentence hurt less than you’d think, because embeddings are fairly tolerant of that. What actually hurts is a chunk boundary that falls between a step and the caveat that makes the step safe.

One runbook had a restart procedure that read roughly like this:

Step 3: Set DRAIN_TIMEOUT=30s before restarting, otherwise in-flight
requests get dropped instead of finishing.
Step 4: Restart the service with systemctl restart api.

My fixed-size chunking split right between those two lines. A question like “how do I restart the API service” retrieved the chunk starting at “Step 4”, which is a correct-looking, confidently wrong answer, because the one line that made the restart safe was sitting in the previous chunk that never got retrieved. Nobody hit this in testing because I was testing with questions close to how the docs were written. It showed up the first time a teammate asked a question in their own words.
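
The failure is easy to reproduce. Below is a rough sketch of a fixed-size splitter, not my original code: `chunkFixed`, `indexOfChunk`, and the 100-character size are all made up for the demo, and the runbook text is the two steps above joined into one string.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkFixed cuts text into pieces of at most size runes, with no overlap
// and no regard for sentence, paragraph, or list boundaries.
func chunkFixed(text string, size int) []string {
	if size <= 0 {
		return nil
	}
	runes := []rune(text)
	var chunks []string
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
	}
	return chunks
}

// indexOfChunk returns the index of the first chunk containing needle, or -1.
func indexOfChunk(chunks []string, needle string) int {
	for i, c := range chunks {
		if strings.Contains(c, needle) {
			return i
		}
	}
	return -1
}

const runbook = "Step 3: Set DRAIN_TIMEOUT=30s before restarting, otherwise in-flight requests get dropped instead of finishing.\n" +
	"Step 4: Restart the service with systemctl restart api."

func main() {
	chunks := chunkFixed(runbook, 100) // 100 is an arbitrary size for the demo
	fmt.Println("caveat is in chunk", indexOfChunk(chunks, "DRAIN_TIMEOUT"))
	fmt.Println("command is in chunk", indexOfChunk(chunks, "systemctl restart api"))
}
```

The caveat lands in chunk 0 and the command in chunk 1, so a search that only returns the chunk with the command has no way to surface the warning.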

The fix was overlap, a couple hundred characters carried over from the end of one chunk into the start of the next, plus splitting on paragraph and list boundaries instead of raw character counts wherever the source format allowed it. Overlap alone does not make chunking correct, but it makes the failure mode softer: a caveat near a boundary now shows up in two chunks instead of exactly zero.
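
Here is a minimal sketch of that idea, under a few assumptions of mine: blocks are separated by blank lines, sizes are counted in runes, and the name `chunkWithOverlap` and the toy sizes in `main` are hypothetical. It packs whole blocks into a chunk and starts each new chunk with the tail of the previous one.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	step3 = "Step 3: Set DRAIN_TIMEOUT=30s before restarting, otherwise in-flight requests get dropped instead of finishing."
	step4 = "Step 4: Restart the service with systemctl restart api."
)

// chunkWithOverlap packs whole blank-line-separated blocks into chunks of
// roughly maxLen runes. Each new chunk starts with the last `overlap` runes
// of the previous chunk. A chunk can run over maxLen by about `overlap`
// runes, and a single block longer than maxLen is kept whole.
func chunkWithOverlap(text string, maxLen, overlap int) []string {
	if overlap < 0 {
		overlap = 0
	}
	var chunks []string
	var cur []rune
	for _, block := range strings.Split(text, "\n\n") {
		para := []rune(strings.TrimSpace(block))
		if len(para) == 0 {
			continue
		}
		// Close the current chunk if this block would push it past maxLen.
		if len(cur) > 0 && len(cur)+1+len(para) > maxLen {
			chunks = append(chunks, string(cur))
			tail := cur
			if len(tail) > overlap {
				tail = tail[len(tail)-overlap:]
			}
			cur = append([]rune(nil), tail...) // carry the tail forward
		}
		if len(cur) > 0 {
			cur = append(cur, '\n')
		}
		cur = append(cur, para...)
	}
	if len(cur) > 0 {
		chunks = append(chunks, string(cur))
	}
	return chunks
}

func main() {
	// Toy sizes so the two-step example splits at all; real values would be larger.
	chunks := chunkWithOverlap(step3+"\n\n"+step4, 150, 120)
	for i, c := range chunks {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With these sizes the second chunk carries the whole Step 3 paragraph in front of Step 4, so the caveat travels with the command. A smaller overlap would carry only the end of the paragraph, which is the softer-but-not-correct behavior described above.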

Similar wording is not the same as similar meaning

The second failure was subtler. A question like “how do we roll back a bad deploy” kept surfacing a chunk about rolling back a database migration instead of the actual deployment rollback runbook. Both chunks share the word “rollback” and a lot of surrounding vocabulary about services and versions, and at the embedding-similarity level that was enough to occasionally outrank the chunk that was actually relevant.

Embeddings capture a lot of semantic meaning, but for short, jargon-heavy technical text, shared vocabulary still pulls a lot of weight in the similarity score. The fix that helped most was not a fancier embedding model. It was giving each chunk more surrounding context before embedding it, by prefixing every chunk with its document title and section heading. With that prefix, “Deployment rollback procedure” chunks and “Database migration rollback” chunks stopped looking so similar to each other just because they both contained the word “rollback” in isolation.
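
A sketch of the prefixing step, assuming each chunk already knows which document and section it came from. The `Chunk` type, its field names, and the newline-joined layout are hypothetical. The only idea carried over from the pipeline is that the title and heading go in front of the body before the text is embedded.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical record for one piece of a document.
type Chunk struct {
	DocTitle string
	Section  string
	Body     string
}

// embeddingInput builds the text that gets embedded: document title first,
// then section heading, then the body. Empty fields are skipped.
func embeddingInput(c Chunk) string {
	var parts []string
	for _, s := range []string{c.DocTitle, c.Section, c.Body} {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	return strings.Join(parts, "\n")
}

func main() {
	// Two chunks with identical bodies that should not look identical.
	body := "Roll back to the previous version and confirm the service is healthy."
	deploy := Chunk{DocTitle: "Deployment rollback procedure", Section: "Rollback", Body: body}
	migration := Chunk{DocTitle: "Database migration rollback", Section: "Rollback", Body: body}

	fmt.Println(embeddingInput(deploy))
	fmt.Println("---")
	fmt.Println(embeddingInput(migration))
}
```

One design choice worth considering: embed the prefixed text but store and display the plain body, so the prefix helps ranking without being repeated in every retrieved snippet.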

Latency and cost show up in specific places

Embedding the whole corpus once at ingestion time is cheap and forgettable. What is not cheap is doing it again every time you change the chunking strategy, and I changed the chunking strategy more times than I want to admit, which meant re-embedding a few thousand chunks more than once.

Per-query latency is dominated by the generation step, not retrieval. The vector search against an indexed pgvector column comes back in single-digit milliseconds even with tens of thousands of rows. The embedding call for the incoming question adds maybe 100 to 200 milliseconds. The actual LLM completion call is the slow and expensive part, and it gets slower and more expensive in direct proportion to how much retrieved context you stuff into the prompt. Pulling the top 10 chunks “to be safe” felt harmless until the token count made every query noticeably slower and noticeably pricier than pulling the top 3 and trusting the ranking.
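
To make the top 3 versus top 10 difference concrete, here is a rough sketch of prompt assembly with a crude size estimate. Everything in it is an assumption for illustration: the `Retrieved` type, the prompt wording, the 1,200-character fake chunks, and the four-characters-per-token rule of thumb, which is only a ballpark for English text.

```go
package main

import (
	"fmt"
	"strings"
)

// Retrieved is a hypothetical shape for one search hit, best match first.
type Retrieved struct {
	DocID   string
	Content string
}

// buildPrompt keeps only the top k hits and labels each with its doc ID so
// an answer can be traced back to a source document.
func buildPrompt(question string, hits []Retrieved, k int) string {
	if k > len(hits) {
		k = len(hits)
	}
	if k < 0 {
		k = 0
	}
	var b strings.Builder
	b.WriteString("Answer using only the context below. Cite the doc ID you used.\n\n")
	for _, h := range hits[:k] {
		fmt.Fprintf(&b, "[%s]\n%s\n\n", h.DocID, h.Content)
	}
	b.WriteString("Question: " + question + "\n")
	return b.String()
}

// roughTokens is a crude estimate of about 4 characters per token. Use the
// model's real tokenizer when the number matters.
func roughTokens(s string) int {
	return (len(s) + 3) / 4
}

func main() {
	// Ten fake hits of 1,200 characters each, standing in for real chunks.
	var hits []Retrieved
	for i := 0; i < 10; i++ {
		hits = append(hits, Retrieved{
			DocID:   fmt.Sprintf("doc-%d", i),
			Content: strings.Repeat("x", 1200),
		})
	}
	q := "how do I restart the API service"
	fmt.Println("top 3 :", roughTokens(buildPrompt(q, hits, 3)), "tokens (rough)")
	fmt.Println("top 10:", roughTokens(buildPrompt(q, hits, 10)), "tokens (rough)")
}
```

The prompt grows roughly linearly with k, so going from 3 chunks to 10 a bit more than triples the context the model has to read on every query.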

The surprising part

Swapping the generation model for a better one barely moved answer quality. Fixing chunk boundaries and adding title context to each chunk moved it a lot. I went in assuming the interesting engineering problem was picking the right model, and it turned out to be closer to data hygiene: how the source text gets cut up and labeled before it ever reaches a model at all.

Where this leaves me on RAG

I do not think retrieval-augmented generation is some kind of universal upgrade to how software gets built. It is a specific, useful answer to a specific problem: getting a model to answer from text you can point at instead of from whatever it memorized during training. That matters because it makes wrong answers traceable and cheap to fix. If the answer is wrong, you fix the document or the chunking, not the model. That is a much smaller feedback loop than anything involving retraining, and it is the actual reason I would reach for this pattern again: not because it is trendy, but because being able to point at the source is worth the extra moving parts.