Prompt engineering in production: what we learned building with LLMs

The most underrated part of building an LLM product is not the model, the RAG, or fine-tuning. It's the discipline of iterating prompts until they produce what you need, again and again, without surprises. I'm not going to publish ReclamaAI's specific prompts — those are the work of many iterations and are a real part of the product. But the principles can be shared, and they're the ones I wish someone had told me at the start.

Models learn by imitation, not by instruction

The difference between writing "draft this professionally, concise, in formal Spanish" and showing two or three complete examples of what you want is enormous. Instructions are ambiguous: "professional" means different things to different humans and is noise to the model. Examples are unambiguous.

If you find yourself writing three paragraphs of instructions about tone, format, or style, what you probably need instead is one or two well-chosen examples. That alone changes how much and how consistently the model produces.

Separate "who you are" from "what I do"

The system prompt defines personality — the "you": tone, audience, what to avoid, what to prioritize. The user prompt defines the concrete task — the "this": case data, user input, what to produce.

When you mix the two in one block, the model loses track of who it's being as context fills up. Keeping them separated isn't just cleaner organizationally; it's more reliable in production.

Anti-examples matter as much as examples

Showing what NOT to do is sometimes more useful than showing what to do. A well-chosen anti-example eliminates an error pattern that ten lines of instruction can't erase. It works especially well for protocol errors, tone errors, or domain-specific conventions: things where "you know it when you see it" but you struggle to articulate the rule.

Evaluation matters more than the prompt itself

The question isn't "is this prompt good?". The question is "how do I know this prompt is better than the previous one?". Without systematic evaluation, you're iterating in the dark.

You don't need fancy evaluation to start. What you do need is a set of representative cases covering your product's real variety, and a consistent review process. If every prompt change doesn't pass through that filter, you'll eventually break something in production that was invisible locally.

Treat prompts like code

Versioned in git. With history. With a change process. If "updating the prompt" in your process means "opening a file and editing it in production", you'll have uncomfortable nights.

The simple mental rule I apply: if you break a prompt, you should be able to revert in under 30 seconds. If you break a prompt and need an hour to remember what you changed, your versioning system failed.

What doesn't work (and people keep trying)

Some things I tried that don't deliver what they promise:

Asking the model to "think step by step" when it doesn't apply. Chain-of-thought helps in explicit reasoning, not in creative or writing tasks where the expected output already has its own structure.
Making ultra-long prompts "just in case". More instructions isn't more quality. Each additional instruction disperses the model's attention and increases cost. Long prompts rarely win against short prompts with good examples.
Using another LLM to verify the first one. Sounds safe, isn't. The model that got it wrong has a high chance of "validating" its own mistake. Reliable verification comes from explicit rules or humans, not from another model layer.

What's still hard

The first one is how to measure "subjective quality" reproducibly. In domains where an answer can be correct in many different ways, your eval ends up being opinion dressed in metric clothing. I've tried rubrics, Likert scales, paired comparisons, and they all share the same problem: the result depends more on the evaluator than on the model.

The second one is how to evolve a prompt system when the model changes. Every new provider version can break subtleties you invested weeks in: tones that no longer come out the same, formats the new model prefers to ignore, instructions that used to work and now don't. There's no clean fix — only the discipline of having a stable eval that catches the regression fast.

What did change how I think about the problem: I stopped looking for isolated tricks and started thinking about the whole system of how I write, evaluate, version, and revert prompts. That's the difference between a pipeline that improves month over month and one that breaks the week the provider ships a new version.

Prompt engineering in production: what we learned building with LLMs

Models learn by imitation, not by instruction

Separate "who you are" from "what I do"

Anti-examples matter as much as examples

Evaluation matters more than the prompt itself

Treat prompts like code

What doesn't work (and people keep trying)

What's still hard

The cron job that cost three days: real BullMQ + Upstash edge cases

Tutela in 4 minutes: anatomy of ReclamaAI's most-used document

Designing for non-power-user audiences in Latin America

Efrain Hernandez

Prompt engineering in production: what we learned building with LLMs

Models learn by imitation, not by instruction

Separate "who you are" from "what I do"

Anti-examples matter as much as examples

Evaluation matters more than the prompt itself

Treat prompts like code

What doesn't work (and people keep trying)

What's still hard

Keep reading

The cron job that cost three days: real BullMQ + Upstash edge cases

Tutela in 4 minutes: anatomy of ReclamaAI's most-used document

Designing for non-power-user audiences in Latin America

Efrain Hernandez