Boolean & Beyond — Field Notes

Nº 01 · LLM Engineering · 2026

The pillar essay

We build around the model.

Anyone can call a model. The advantage is the engineering around it: retrieval, tools, evaluation, and guardrails. We design, build, and hand over LLM applications that are accurate, private, and cheap enough to run.

AI & LLM Engineering / Bangalore · Coimbatore / built & owned by you

2 wks · to a costed plan

4–6 wks · to a live pilot

100% · yours at handover

every release · regression-tested

§01

The real work

A model is a stateless function. Everything that makes it useful sits outside it.

The model is one box in nine.

A large language model knows nothing about your business, cannot act, and cannot be trusted without verification. Production AI is the system that closes those gaps. The model is one component, and rarely the one that decides whether the project succeeds.

Teams that treat AI as “add a chatbot” ship a convincing demo and stall. Teams that treat it as systems engineering ship something that survives contact with real users. This page is how we do the second thing.

00 The Model (stateless · one box in nine)
01 Retrieval
02 Tools
03 Memory
04 Evaluation
05 Guardrails
06 Permissions
07 Routing
08 Observability

Fig. 01 · Anatomy of a production AI system — the model (00) and its eight supporting subsystems

§02

Model selection

We are model-agnostic. The choice is a measurement, not a preference.

Two families. We choose by measurement.

Frontier, hosted

Claude · GPT

—Strongest reasoning
—Least operational burden
—Per-token cost
—Data leaves your network

Open-weight, private

Llama · Mistral · Qwen · Phi

—Runs on your hardware
—Flat, predictable cost
—Full data control
—More setup, slightly weaker

Many systems we build use both: a small private model for high-volume retrieval and classification, and a frontier model reserved for the few requests that genuinely need deep reasoning.
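One way to picture that split is a small router sitting in front of both models. The sketch below is illustrative only: the `Request` fields, the task labels, and the tier names `small-private` and `frontier-hosted` are hypothetical, and the rule that sensitive content always stays private is an assumption. A real router would be tuned against an evaluation set, not hard-coded.

```python
from dataclasses import dataclass

# Hypothetical labels for the two model tiers; these are not real endpoints.
SMALL_PRIVATE = "small-private"
FRONTIER_HOSTED = "frontier-hosted"

# Assumed task types that a small model handles well at high volume.
HIGH_VOLUME_TASKS = {"classification", "retrieval"}


@dataclass(frozen=True)
class Request:
    task: str                              # e.g. "classification", "analysis"
    contains_sensitive_data: bool = False  # assumed to be set upstream


def route(request: Request) -> str:
    """Pick a model tier for one request."""
    # Assumption: sensitive content never leaves the private model.
    if request.contains_sensitive_data:
        return SMALL_PRIVATE
    # High-volume, well-bounded work goes to the small private model.
    if request.task in HIGH_VOLUME_TASKS:
        return SMALL_PRIVATE
    # Everything else is treated as needing deeper reasoning.
    return FRONTIER_HOSTED
```

Keeping the routing rule in one small function makes it something the evaluation suite can test like any other change.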

§03

The build decision

The most expensive mistake in AI is training when a cheaper rung would do.

Climb only as far as the problem needs.

We climb this ladder in order and stop at the first rung that meets the goal. Each rung up adds cost, latency, and maintenance. In practice, most business value is captured on the first three.

Fig. 02 · The build-decision ladder — cost rises with each rung; the shaded zone is where most work ships

01 · Prompt & context · lowest
02 · Retrieval (RAG) · low
03 · Tools & agents · medium
04 · Fine-tuning · med–high
05 · Preference tuning · high
06 · Pretraining · near-never
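The "stop at the first rung that meets the goal" rule can be sketched as a simple loop. This is a minimal illustration, not a tool we ship: `score_at` is a hypothetical callback standing in for "prototype at this rung and measure it on the evaluation set", and the scores in the usage lines are made up.

```python
from typing import Callable, Optional

# The rungs from Fig. 02, cheapest first.
LADDER = [
    "Prompt & context",
    "Retrieval (RAG)",
    "Tools & agents",
    "Fine-tuning",
    "Preference tuning",
    "Pretraining",
]


def first_sufficient_rung(
    score_at: Callable[[str], float], target: float
) -> Optional[str]:
    """Return the cheapest rung whose measured score meets the target.

    `score_at` is a hypothetical callback: it evaluates a prototype built
    at the given rung and returns a score between 0.0 and 1.0.
    """
    for rung in LADDER:
        if score_at(rung) >= target:
            return rung  # stop climbing: higher rungs only add cost
    return None  # no rung met the target; revisit the goal or the data


# Made-up scores, for illustration only.
scores = {"Prompt & context": 0.71, "Retrieval (RAG)": 0.93}
chosen = first_sufficient_rung(lambda rung: scores.get(rung, 0.0), target=0.90)
print(chosen)  # prints: Retrieval (RAG)
```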

§04

The toolkit

Each links to a deeper page. This is the index above them.

Specialised practices, one standard.

Retrieval & knowledge systems

Chunking, embeddings, hybrid search, rerankers, grounded answers with citations.

Agents, tools & MCP

Models that plan and act through clean tool contracts, with hard human-in-the-loop limits.

Private & on-prem LLMs

Open-weight models in your own cloud or data centre. Fine-tuned with LoRA where it earns its place.

Enterprise copilots

Embedded assistants wired into the systems and permissions your team already uses.

MCP & tool integration

Reusable, typed interfaces between models and the tools they are allowed to call.
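As a rough picture of a "clean tool contract" with a hard human-in-the-loop limit, consider the sketch below. It is plain Python and deliberately not the MCP API; the `Tool` type, the `requires_approval` flag, and the `call_tool` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]       # the action the model may request
    requires_approval: bool = False  # hard human-in-the-loop limit


class ApprovalRequired(Exception):
    """Raised when a tool call needs a human sign-off first."""


def call_tool(
    tools: dict[str, Tool], name: str, args: dict, approved: bool = False
) -> str:
    """Execute a model-requested call, enforcing allow-list and approvals."""
    if name not in tools:
        raise KeyError(f"tool {name!r} is not on the allow-list")
    tool = tools[name]
    if tool.requires_approval and not approved:
        raise ApprovalRequired(name)
    return tool.run(args)
```

The point of the design is that the model can only ask; the allow-list and the approval flag decide what actually runs.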

§05

Why demos lie

Before much code is written, the eval set exists.

A demo works once. Production answers the rest.

We build an evaluation set from real questions with known-correct answers, then run every change against the whole set. When a user reports a bad answer, that case joins the set permanently. A good demo is not evidence; a passing suite is.

$ eval run --suite=release --cases=412

answer correctness    98%     PASS
citation accuracy     95%     PASS
p95 latency           1.2s    PASS
refusal rate          3%      PASS

2 of 412 cases regressed → release blocked
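The gating rule behind that run can be sketched in a few lines. The `eval` command above is this page's own illustration; the Python here is a separate, hypothetical sketch that assumes each case records whether it passed on the last release and on the candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed_before: bool  # result on the last released version
    passed_now: bool     # result on the candidate release


def regressions(results: list[CaseResult]) -> list[str]:
    """Cases that used to pass and now fail."""
    return [r.case_id for r in results if r.passed_before and not r.passed_now]


def release_allowed(results: list[CaseResult]) -> bool:
    """Block the release if any previously passing case regressed,
    even when the aggregate pass rate still looks healthy."""
    return not regressions(results)


def add_reported_case(suite: list[str], case_id: str) -> list[str]:
    """A user-reported bad answer joins the suite and is never removed."""
    return suite if case_id in suite else suite + [case_id]
```

This is why a run can show every aggregate metric passing and still block: the gate looks at individual cases, not only at averages.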

§06

Trust is engineered

A failure in one layer is caught by the next.

Five layers of defence, not one promise.

L1 · Grounded generation: answers only from retrieved evidence

L2 · Citations required: every claim carries a source you can verify

L3 · Permission-aware retrieval: never reads a document the asker cannot see

L4 · PII detection & redaction: sensitive fields caught at indexing time

L5 · Refusal path: a clean “I do not know” over a confident wrong answer
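A minimal sketch of how some of these layers might compose is shown below. It covers L1, L2, L3, and L5 only; PII redaction (L4) happens at indexing time and is left out. The `Document` and `Answer` types, the group-based permission model, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass

REFUSAL = "I do not know."


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # assumed group-based access control


@dataclass(frozen=True)
class Answer:
    text: str
    citations: tuple[str, ...]  # doc_ids the answer claims to rely on


def visible_to(user_groups: set[str], docs: list[Document]) -> list[Document]:
    """L3: drop every document the asker is not permitted to read."""
    return [d for d in docs if d.allowed_groups & user_groups]


def finalise(draft: Answer, evidence: list[Document]) -> Answer:
    """L1, L2, L5: refuse unless the draft cites only retrieved evidence."""
    evidence_ids = {d.doc_id for d in evidence}
    if not evidence or not draft.citations:
        return Answer(REFUSAL, ())
    if not set(draft.citations) <= evidence_ids:
        return Answer(REFUSAL, ())
    return draft
```

Each check is independent, so a draft that slips past one layer can still be stopped by the next.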

§07

Built for India

DPDP Act, sector rules, residency clauses — all push the same way.

Your data never has to leave your network.

For many enterprises, where data lives is a legal requirement, not a preference. We design for that from day one: documents stay in your cloud or data centre, the index stays in your network, and the model is a private open-weight model on your own GPU. For the strictest cases, the whole system runs on-premise with no outbound internet access.

Your boundary

Documents

Vector index

Private model

Audit logs

— nothing crosses this line —

Case study · Nº 01 · proof, not theory

An enterprise AI agent, designed, built, and handed over.

Retrieval, tool use, and strict guardrails inside the client’s own environment, with an evaluation suite gating every release. The decision ladder applied end to end, from discovery to a system the client’s team now runs.

→ Read the case study

§09

How we work

Discovery to ownership, in stages.

Discovery

≈ 2 weeks

Real documents, users, and goals become a costed roadmap with the rung each part needs.

Pilot

4–6 weeks

One workflow, in front of real users, measured against an evaluation set from day one.

Production

3–6 months

Harden, integrate, monitor, and expand to more workflows and sources.

Handover

ongoing

Documented and trained. Your team owns and operates it. No lock-in.

§10

A good partner says no

When not to use an LLM

If a problem is deterministic and well-specified, ordinary software is cheaper, faster, and more reliable. If you cannot measure correctness, you are not ready to ship AI into anything that matters. If the data is missing or messy, fixing the data comes first.

We would rather scope a smaller AI project that works than a large one that impresses in a demo and erodes trust in production. That is the whole philosophy in one line.

§11

Asked first

Questions, answered.

01 What is the difference between RAG and fine-tuning?

RAG gives a model fresh, private knowledge at question time by searching your documents and grounding the answer in what it finds. Fine-tuning changes the model itself by training it on examples, which suits style, format, or vocabulary, not facts that keep changing. For most enterprise knowledge problems RAG is the right first tool: cheaper, updatable in minutes, easy to cite. Fine-tuning comes later, and only when a measured gap justifies it.

02 How much does LLM application development cost in India?

A focused pilot typically runs from 8 to 25 lakh rupees, depending on how many systems it touches and how strict the privacy requirements are. A full production build with evaluation, monitoring, and handover usually ranges from 30 lakh to 1.5 crore rupees. Running cost depends on whether you use a hosted API (variable, billed per token) or a private model on your own GPU (flatter and more predictable). We cost both options before you commit.

03 When should we use a private LLM instead of a hosted API?

Choose private when data residency or compliance forbids sending content to a third party, when usage is high enough that per-token cost outweighs GPU cost, or when latency and availability must be under your control. Choose a hosted frontier model when you need the strongest reasoning and your policy allows it. Many systems use both.
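The "usage is high enough" test is a break-even calculation. The sketch below is back-of-envelope arithmetic under stated assumptions: it compares a flat monthly GPU bill with a per-token API bill and ignores engineering time, GPU utilisation, and quality differences. The function and parameter names are illustrative.

```python
def break_even_tokens_per_month(
    gpu_cost_per_month: float, api_cost_per_million_tokens: float
) -> float:
    """Monthly token volume at which a flat GPU bill equals the API bill."""
    if api_cost_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return gpu_cost_per_month / api_cost_per_million_tokens * 1_000_000


def cheaper_option(
    tokens_per_month: float,
    gpu_cost_per_month: float,
    api_cost_per_million_tokens: float,
) -> str:
    """Return "private" or "hosted", whichever costs less at this volume."""
    api_bill = tokens_per_month / 1_000_000 * api_cost_per_million_tokens
    return "private" if gpu_cost_per_month < api_bill else "hosted"
```

With placeholder inputs of 200,000 rupees a month for the GPU and 500 rupees per million tokens (made-up figures, not quotes), the break-even is 400 million tokens a month; below that volume the hosted API is the cheaper of the two.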

04 How do you stop an LLM from making things up?

Hallucination is controlled in layers, not with one trick. We force grounded generation, require a citation on every claim, validate structured outputs against a schema, and add a refusal path. Above all we measure it: a regression suite checks correctness and citation accuracy on every change before it ships.
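Validating structured output against a schema can be as simple as the sketch below. It uses only the Python standard library; the two-field contract (`answer`, `citations`) is a hypothetical example, and a production system would more likely use a dedicated schema library.

```python
import json

# Hypothetical output contract: field name -> required Python type.
ANSWER_SCHEMA = {"answer": str, "citations": list}


def parse_model_output(raw: str) -> dict:
    """Parse raw model text and reject anything that breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in ANSWER_SCHEMA.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(
                f"field {field!r} missing or not {expected_type.__name__}"
            )
    if not data["citations"]:
        raise ValueError("at least one citation is required")
    return data
```

A rejected output never reaches the user; the caller can retry or fall back to the refusal path.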

05 Do we need our own machine learning team to maintain this?

No. We design for handover. Your existing engineers operate the system with ordinary tools: version control, CI, observability, and a prompt-and-eval workflow we document and train them on.

06 Which models do you build with?

We are model-agnostic and pick per workload: frontier hosted models such as Claude and GPT for the hardest reasoning, and open-weight models such as Llama, Mistral, Qwen, and Phi when privacy, cost, or control matter more. The choice is driven by measurement on your data, not a favourite vendor.

07 Can you integrate AI into our existing product instead of building new?

Yes, and it is often the highest-return path. We add retrieval, agents, copilots, or automation into systems you already run, behind your existing authentication and permissions, without a rebuild.

08 How do you measure whether the AI is actually working?

We build an evaluation set from real questions and correct answers before writing much code, then track correctness, citation accuracy, latency, and refusal rate on every release. When a user reports a bad answer, that case enters the set permanently.

Colophon

Bring the workflow, the data, the constraint.

In one conversation we can usually tell you which rung it needs and what a realistic scope looks like.

Start a conversation →
