Tagged · Engineering

Field notes,
Engineering.

134 articles in this tag — part of the Jaypore Labs journal.

01
Engineering
Determinism harnesses for non-deterministic systems
Apr 30, 2026 · 2 min read
02
Engineering
Multi-agent orchestration: from kitchen brigade to opera
Apr 30, 2026 · 3 min read
03
Engineering
Retry strategies that don't compound errors
Apr 30, 2026 · 3 min read
04
Engineering
Your first MCP server (Node)
Apr 29, 2026 · 8 min read
05
Engineering
MCP error handling: tell the model what went wrong
Apr 29, 2026 · 2 min read
06
Engineering
What makes an eval good
Apr 29, 2026 · 7 min read
07
Engineering
MCP for CI/CD: build-system tools as agent inputs
Apr 28, 2026 · 2 min read
08
Engineering
Trend evals vs. threshold evals
Apr 28, 2026 · 2 min read
09
Engineering
Fall-back chains: cheap → expensive → human
Apr 27, 2026 · 3 min read
10
Engineering
Integration tests for AI features: contract or behavioural?
Apr 27, 2026 · 3 min read
11
Engineering
CI strategy: smoke vs. full suite for LLM apps
Apr 24, 2026 · 2 min read
12
Engineering
Self-consistency: when N=3 beats a smarter prompt
Apr 24, 2026 · 3 min read
13
Engineering
Cost guardrails: stop runaway agents before billing does
Apr 23, 2026 · 6 min read
14
Engineering
End-to-end tests for AI workflows: scope and survival
Apr 23, 2026 · 2 min read
15
Engineering
MCP for actioning tools (PR creator, ticket closer)
Apr 23, 2026 · 2 min read
16
Engineering
MCP and the Claude Code workflow specifically
Apr 22, 2026 · 2 min read
17
Engineering
Pairwise judges: A/B agreement at scale
Apr 22, 2026 · 2 min read
18
Engineering
Pinning model versions through provider migrations
Apr 22, 2026 · 2 min read
19
Engineering
Drift catchers: detecting style shifts
Apr 21, 2026 · 2 min read
20
Engineering
Eval CI: the pass/fail gate that's actually useful
Apr 21, 2026 · 2 min read
21
Engineering
Prompt invariance: prompts that survive paraphrase
Apr 21, 2026 · 3 min read
22
Engineering
Tool failure modes: timeouts, retries, idempotency
Apr 21, 2026 · 4 min read
23
Engineering
Context engineering: what to load, what to defer
Apr 20, 2026 · 4 min read
24
Engineering
Output validation: pydantic, zod, and friends in production
Apr 20, 2026 · 2 min read
25
Engineering
Versioning model + prompt as a unit
Apr 20, 2026 · 3 min read
26
Engineering
Building agents that explain themselves
Apr 16, 2026 · 3 min read
27
Engineering
Constrained decoding: the underrated lever
Apr 16, 2026 · 3 min read
28
Engineering
Safety guardrails: refusal patterns that don't make agents useless
Apr 16, 2026 · 3 min read
29
Engineering
Confidence calibration: when 'I don't know' is the answer
Apr 15, 2026 · 3 min read
30
Engineering
Counter-example mining
Apr 15, 2026 · 3 min read
31
Engineering
The post-launch test plan: what runs forever
Apr 15, 2026 · 3 min read
32
Engineering
Retiring an agent
Apr 14, 2026 · 3 min read
33
Engineering
Long-horizon tasks: keeping an agent on rails for hours
Apr 13, 2026 · 4 min read
34
Engineering
MCP authorization: per-user permissions
Apr 13, 2026 · 2 min read
35
Engineering
MCP composition: when one server should call another
Apr 13, 2026 · 2 min read
36
Engineering
MCP server versioning: shipping breaking changes safely
Apr 13, 2026 · 2 min read
37
Engineering
MCP transport: stdio vs. HTTP vs. SSE
Apr 13, 2026 · 2 min read
38
Engineering
Deploying agents in CI: scoped, audited, repeatable
Apr 10, 2026 · 7 min read
39
Engineering
Caching deterministic prefixes
Apr 10, 2026 · 3 min read
40
Engineering
Eval result storage and versioning
Apr 10, 2026 · 2 min read
41
Engineering
Tests for retrieval pipelines
Apr 10, 2026 · 2 min read
42
Engineering
Beyond MCP: tool-use specs in major models
Apr 9, 2026 · 2 min read
43
Engineering
Cost tests: catching the prompt that doubled spend
Apr 9, 2026 · 2 min read
44
Engineering
The judge pattern for confidence
Apr 9, 2026 · 3 min read
45
Engineering
MCP in 10 minutes
Apr 9, 2026 · 6 min read
46
Engineering
Versioning agent behaviour: prompts as source code
Apr 8, 2026 · 3 min read
47
Engineering
UX tests for AI-generated content
Apr 8, 2026 · 2 min read
48
Engineering
Agent observability: traces that tell you what happened
Apr 7, 2026 · 6 min read
49
Engineering
Eval anti-patterns: when evals make products worse
Apr 7, 2026 · 3 min read
50
Engineering
Browsing agents: scraping vs. structured tools
Apr 6, 2026 · 3 min read
51
Engineering
Eval-driven prompt iteration
Apr 6, 2026 · 2 min read
52
Engineering
Tool-use evals: right tool, right order
Apr 6, 2026 · 2 min read
53
Engineering
Voice-first agents: the latency budget you live within
Apr 6, 2026 · 3 min read
54
Engineering
Agent memory: what to write down, what to forget
Apr 3, 2026 · 3 min read
55
Engineering
Hallucination checks: cite-or-it-didn't-happen
Apr 3, 2026 · 3 min read
56
Engineering
MCP server observability
Apr 3, 2026 · 2 min read
57
Engineering
Prompt evolution: how agents get worse without you noticing
Apr 3, 2026 · 3 min read
58
Engineering
Red-teaming your own prompt
Apr 3, 2026 · 3 min read
59
Engineering
Tests for tool-using agents: trace assertions
Apr 2, 2026 · 3 min read
60
Engineering
MCP authentication: tokens, scopes, OAuth
Apr 1, 2026 · 2 min read
61
Engineering
MCP server rate limits: the polite-rejection pattern
Apr 1, 2026 · 2 min read
62
Engineering
Property-based testing for LLM features
Apr 1, 2026 · 2 min read
63
Engineering
Building your first eval set from scratch
Mar 31, 2026 · 8 min read
64
Engineering
Evals for agents: trajectory + outcome
Mar 31, 2026 · 7 min read
65
Engineering
MCP and secrets management
Mar 31, 2026 · 2 min read
66
Engineering
MCP server hosting: local, sidecar, remote
Mar 31, 2026 · 2 min read
67
Engineering
MCP tool naming: making tools discoverable
Mar 31, 2026 · 2 min read
68
Engineering
LLM-as-judge: when to trust it, when not
Mar 30, 2026 · 7 min read
69
Engineering
MCP for data tools (Postgres, BigQuery, S3)
Mar 30, 2026 · 2 min read
70
Engineering
Structured output: JSON mode, schemas, why one beats the other
Mar 30, 2026 · 7 min read
71
Engineering
Idempotency keys for LLM calls
Mar 27, 2026 · 3 min read
72
Engineering
Why we need MCP at all
Mar 27, 2026 · 2 min read
73
Engineering
Human eval workflows: instructions that don't vary
Mar 26, 2026 · 2 min read
74
Engineering
Judging open-ended output without a rubric
Mar 26, 2026 · 2 min read
75
AI
MCP servers are USB-C for AI
Mar 26, 2026 · 5 min read
76
Engineering
MCP tool schemas: arg shapes that help
Mar 26, 2026 · 2 min read
77
Engineering
Regression cohorts: catching what evals miss
Mar 26, 2026 · 3 min read
78
Engineering
Code-writing agents: the test-first discipline
Mar 25, 2026 · 3 min read
79
Engineering
Drift tests vs. functional tests: separate lanes
Mar 25, 2026 · 3 min read
80
Engineering
Plan vs. act: the agent loop everyone gets wrong
Mar 25, 2026 · 6 min read
81
Engineering
Privacy tests: PII redaction assertions
Mar 24, 2026 · 2 min read
82
Engineering
Sub-agents: when 1+1 actually equals 2
Mar 24, 2026 · 4 min read
83
Engineering
Calibrating your judge: meta-evals
Mar 23, 2026 · 2 min read
84
Engineering
Tool design: write tools the way you write APIs
Mar 23, 2026 · 8 min read
85
Engineering
Golden-set discipline
Mar 20, 2026 · 3 min read
86
Engineering
Why probabilistic systems still need deterministic contracts
Mar 20, 2026 · 7 min read
87
Engineering
Refusal grammars: predictable, not surprising
Mar 20, 2026 · 3 min read
88
Engineering
MCP for internal tools (Linear, Notion, Slack analogues)
Mar 19, 2026 · 2 min read
89
Engineering
Multimodal agents: when adding vision actually helps
Mar 19, 2026 · 4 min read
90
Engineering
Test-data management for AI: synthetic vs. real
Mar 19, 2026 · 2 min read
91
Engineering
Behavioural assertions: testing 'should-ness'
Mar 18, 2026 · 2 min read
92
Engineering
Eval taxonomy: golden, behavioural, drift, safety
Mar 18, 2026 · 3 min read
93
Engineering
Evals for retrieval: separating retrieval from synthesis
Mar 18, 2026 · 2 min read
94
Engineering
Your first MCP server (Python)
Mar 18, 2026 · 2 min read
95
Engineering
Agent A/B tests: comparing without confusing your users
Mar 17, 2026 · 3 min read
96
Engineering
The deterministic-envelope pattern
Mar 17, 2026 · 3 min read
97
Engineering
MCP and prompt injection: ambient instructions
Mar 17, 2026 · 2 min read
98
Engineering
Few-shot drift: why golden examples poison new versions
Mar 16, 2026 · 3 min read
99
Engineering
The judge pattern: agents that grade other agents
Mar 16, 2026 · 4 min read
100
Engineering
PII in test fixtures: the boring legal slope
Mar 16, 2026 · 3 min read
101
Engineering
Skills files: recipes the model can call
Mar 13, 2026 · 4 min read
102
Engineering
Evals that survive a model bump
Mar 12, 2026 · 3 min read
103
Engineering
Managed agents: when to reach for them
Mar 12, 2026 · 4 min read
104
Engineering
Mock LLMs in tests: when to fake, when to call
Mar 12, 2026 · 3 min read
105
Engineering
The red set: adversarial cases you're allowed to fail
Mar 12, 2026 · 2 min read
106
Engineering
The new test pyramid for AI products
Mar 11, 2026 · 7 min read
107
Engineering
Per-feature evals vs. per-model evals
Mar 11, 2026 · 2 min read
108
Engineering
Sampling production traffic for eval
Mar 11, 2026 · 2 min read
109
Engineering
Security tests: prompt-injection regression suite
Mar 10, 2026 · 2 min read
110
Engineering
Temperature, top-p, and the production tradeoff
Mar 10, 2026 · 3 min read
111
Engineering
The future of MCP
Mar 6, 2026 · 2 min read
112
Engineering
MCP testing: harnesses, fixtures, regressions
Mar 6, 2026 · 2 min read
113
Engineering
Output post-processors that don't hide the truth
Mar 6, 2026 · 3 min read
114
Engineering
Authoring eval cases
Mar 5, 2026 · 2 min read
115
Engineering
Snapshot tests: where they help, where they trap
Mar 5, 2026 · 2 min read
116
Engineering
Tests for streaming responses
Mar 5, 2026 · 2 min read
117
Engineering
Agent rollback: kill switches on day one
Mar 4, 2026 · 3 min read
118
Engineering
Determinism for tool calls: keys, ordering, side-effects
Mar 4, 2026 · 2 min read
119
Engineering
Output diffing in CI
Mar 4, 2026 · 3 min read
120
Engineering
Reading an eval dashboard
Mar 4, 2026 · 2 min read
121
Engineering
Accessibility tests for AI surfaces
Mar 3, 2026 · 2 min read
122
Engineering
Eval-driven development
Mar 3, 2026 · 3 min read
123
Engineering
Eval ownership in an org: PM, eng, or QA?
Mar 3, 2026 · 2 min read
124
Engineering
Performance tests: token budgets and latency SLAs
Mar 3, 2026 · 2 min read
125
AI
The agent maturity curve
Mar 2, 2026 · 9 min read
126
Engineering
Auto-generated eval cases from production logs
Mar 2, 2026 · 2 min read
127
Engineering
Eval cost management
Mar 2, 2026 · 2 min read
128
Engineering
AI-native debugging: the rubber duck got smarter
Feb 26, 2026 · 4 min read
129
Engineering
Semantic caching: why your top 1% of queries cost 60% of your bill
Feb 17, 2026 · 4 min read
130
Engineering
AI canary deployments: 1% traffic, 100% paranoia
Feb 2, 2026 · 4 min read
131
Engineering
AI incident response: the postmortem template you'll wish you had
Jan 15, 2026 · 4 min read
132
Engineering
HIPAA and AI: the BAA is the first conversation
Dec 26, 2025 · 4 min read
133
AI
AI and the symphony conductor: orchestration is older than software
Dec 19, 2025 · 4 min read
134
AI
AI and air traffic control: a 70-year-old playbook for safe autonomy
Dec 18, 2025 · 4 min read

