Curated reading list

References

A curated reading list of articles, papers, and essays about context engineering, tools, memory, retrieval, and reliable AI agent workflows.

This page collects resources I find valuable because they are especially useful, well written, thought-provoking, or worth revisiting over time. The emphasis is on material that helps when building systems that need to plan, use tools, retrieve information, and stay reliable over long runs.

Curated, tutorial-ordered links

Anthropic

#agent_workflows
#context_engineering
#reliability
#tools

Building Effective AI Agents

Best starting point on the page for understanding agentic systems. It separates workflows from agents, argues for simple composable patterns, and explains why tools, retrieval, and memory should be added only when the task truly needs them.

#agent_workflows
#context_engineering
#memory
#reliability
#retrieval

Effective context engineering for AI agents

The clearest explanation here of context as a finite budget that has to be curated every turn. Strong on context rot, high-signal context selection, and the shift from prompt-writing to managing an evolving context state.

#agent_workflows
#evaluation
#reliability
#testing

Demystifying evals for AI agents

A strong guide to moving from ad hoc checks to a real eval program. It covers tasks, trials, graders, and the tradeoffs between deterministic, model-based, and human evaluation for multi-turn agents.

#agent_workflows
#context_engineering
#reliability
#tools

Effective harnesses for long-running agents

Useful for designing long-running harnesses that keep plans, context, and execution state aligned instead of letting the session drift as work stretches across many steps.

#agent_workflows
#context_engineering
#reliability
#tools

Harness design for long-running application development

A practical companion piece on trimming harness bulk while still preserving the context, checkpoints, and controls that keep long-running development runs reliable.

#agent_workflows
#context_engineering
#multi-agent
#reliability

How we built our multi-agent research system

Best advanced multi-agent case study on the page: split broad research across specialist agents, preserve plan checkpoints, and keep citations attached as the work is merged back together.

#context_engineering
#memory
#retrieval

Introducing Contextual Retrieval

A concrete retrieval technique: prepend a short description of the surrounding document to each chunk before indexing it, so the chunk keeps enough meaning to be found later. This improves recall without reducing retrieval to raw keyword matching.

#agent_workflows
#reliability
#safety

Our framework for developing safe and trustworthy agents

Useful for the non-happy-path side of autonomy: oversight, privacy, security, and deployment controls for agents that retrieve information or trigger actions.

#agent_workflows
#documentation
#evaluation
#tools

Writing effective tools for agents - with agents

One of the highest-ROI reads once you move past the intro material. It treats tool design as an interface problem: prototype the tool, improve names and outputs, then evaluate and iterate until the agent uses it well.

#agent_workflows
#documentation
#testing
#tools

Best Practices for Claude Code

Operational guidance for running Claude Code in structured loops, with strong emphasis on context boundaries, iterative verification, and keeping planning separate from implementation work.

#agent_workflows
#context_engineering
#reliability
#teams
#testing

How Anthropic teams use Claude Code

A practical look at human review loops where the agent drives specs, tests, and edits, which makes it useful for designing team-level operating habits around coding agents.

arXiv

#agent_workflows
#context_engineering
#memory
#reliability
#testing
#tools

Building Effective AI Coding Agents for the Terminal

A dense terminal-agent paper on scaffolding, compaction, memory, verification, and the other mechanics needed for long-running command-line work.

#agent_workflows
#architecture
#modularity

Theory of Code Space: Do Code Agents Understand Software Architecture?

Useful for understanding whether an agent can build and maintain a real map of architecture, not just make local edits.

ArchUnit

#architecture
#modularity
#testing
#tools

User Guide

Shows how to turn architecture rules into executable tests so boundaries stay enforced as code changes.

Augment Code

#agent_workflows
#context_engineering
#documentation

What spec-driven development gets wrong

Useful counterpoint on static specs, stale context, and why generated implementation plans still need feedback from the actual codebase.

Bazel

#reliability
#testing

Hermeticity

A strong explanation of why isolated, reproducible builds make automated changes easier to trust and debug.

Birgitta Böckeler

#agent_workflows
#architecture
#reliability
#testing

Harness engineering for coding agent users

Strong operational framing for coding agents: the harness is not just plumbing but the system that provides steering loops, maintainability checks, architecture fitness checks, and behavior feedback.

#agent_workflows
#context_engineering
#documentation
#tools

Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl

A careful comparison of modern spec-driven development tools and the tradeoff between generating implementation guidance and keeping intent understandable.

#agent_workflows
#evaluation
#reliability
#testing

How far can we push AI autonomy in code generation?

A useful closing corrective. It documents how autonomous coding loops overreach, rationalize failing tests, and invent extra behavior, which makes it a good final chapter on where supervision still matters.

Carlos E. Jimenez et al.

#agent_workflows
#evaluation

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

The standard reality check for code agents: real GitHub issues, real repositories, and hard multi-file fixes.

Chris Richardson

#architecture
#modularity
#teams

Service per team

Good reminder that service boundaries work best when they line up with ownership, which also keeps agent tasks narrower.

Cognition

#agent_workflows
#multi-agent
#reliability

Don't Build Multi-Agents

Clear argument for sharing full traces and carrying decisions forward, instead of fanning work out before the problem really needs it.

#agent_workflows
#reliability
#teams

How Cognition Uses Devin to Build Devin

Interesting because it shows an agent being used inside the product loop, exposing where automation helps and where human steering still matters.

Craig Larman

#architecture
#modularity
#tools

Protected Variation: The Importance of Being Closed

Classic guidance for putting stable interfaces around change points, which is the same move you need for tools, prompts, and APIs.

Cucumber

#documentation
#testing

BDD

Useful for turning examples into shared, testable expectations that humans and automation can agree on.

#documentation
#testing

Gherkin Reference

Worth keeping nearby when you want scenarios to stay precise enough to drive tests, documentation, or agent checks.

D. L. Parnas

#architecture
#modularity

On the Criteria To Be Used in Decomposing Systems into Modules

Still the clearest foundation for the whole page: modularize around design decisions that are likely to change, so changes stay local and both humans and agents can work without carrying the entire system in their heads.

G. Kiczales et al.

#architecture
#modularity

Aspect-Oriented Programming

A good historical reference for cross-cutting concerns and the tradeoff between local clarity and shared behavior.

GitHub Blog

#agent_workflows
#memory
#reliability
#retrieval

Building an agentic memory system for GitHub Copilot

A concrete memory chapter for coding agents: store repository-scoped facts with citations, verify them just in time before reuse, and let memories self-heal instead of turning stale notes into permanent steering errors.

#agent_workflows
#context_engineering
#reliability
#testing
#tools

How to build reliable AI workflows with agentic primitives and context engineering

Strong practical guide to splitting planning, implementation, and testing into separate sessions, then loading only the context and tools each phase actually needs.

#agent_workflows
#context_engineering
#documentation
#testing
#tools

Spec-driven development with AI: Get started with a new open source toolkit

A four-phase loop (Specify, Plan, Tasks, Implement) that turns human intent into durable artifacts before the agent starts making code changes.

GitHub Spec Kit

#agent_workflows
#context_engineering
#documentation
#tools

Quick Start Guide

Concrete walkthrough of the Spec Kit loop, useful when turning high-level intent into specs, plans, and task artifacts an agent can execute.

#agent_workflows
#context_engineering
#documentation
#tools

Specification-Driven Development (SDD)

Primary methodology document for Spec Kit, including the principles and artifacts behind its specification-driven workflow.

Google Research

#agent_workflows
#reliability
#testing

AI in software engineering at Google: Progress and the path ahead

Useful for seeing where AI already helps at scale and where the next gains are likely to come from, especially testing, understanding, and maintenance.

Google SRE

#reliability

Service Level Objectives

Core reading for turning vague reliability goals into measurements that an automated system can actually optimize against.

Google Testing Blog

#reliability
#testing

Just Say No to More End-to-End Tests

A durable argument for pushing feedback down the stack so failures are faster, cheaper, and easier to localize.

H. Gall

#architecture
#modularity

Detection of Logical Coupling Based on Product Release History

Useful for showing that hidden dependencies can be inferred from release behavior, not just source structure.

Herbert Graca

#architecture
#modularity

Packaging & namespacing

A helpful way to think about packages and namespaces as real architecture boundaries instead of just file organization.

Import Linter

#architecture
#modularity
#testing
#tools

Layers

A practical example of enforcing dependency direction in Python, which keeps generated or agent-edited code from breaking architecture.

J. Becker

#agent_workflows
#evaluation

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Good empirical guardrail for productivity claims, especially when you want to know what AI actually changes in real developer work.

Jimmy Bogard

#agent_workflows
#architecture
#modularity

Vertical Slice Architecture

Useful for organizing code around use cases so changes stay localized and agents can work on one slice without touching a whole layer stack.

John Ousterhout

#architecture
#modularity

A Philosophy of Software Design

A strong second chapter after Parnas because it translates modularity into day-to-day judgment: keep modules deep, reduce interface surface area, and treat complexity as the primary thing to manage.

Katie Hempenius

#performance
#reliability

Performance Budgets 101

A concrete way to keep an agent honest about cost, because performance limits need explicit budgets instead of vague aspirations.

Kent C. Dodds

#architecture
#modularity

AHA Programming

A short case for waiting on abstraction until the variation actually shows up, which helps avoid over-generalized code from both humans and agents.

LangChain Blog

#agent_workflows
#context_engineering
#memory
#retrieval

Context Engineering

A useful taxonomy for writing, selecting, compressing, and isolating context as separate problems instead of one vague prompt challenge.

#agent_workflows
#context_engineering

The rise of "context engineering"

A concise explanation of why context quality, structure, and format matter more than clever wording once systems become dynamic.

M. Cataldo

#architecture
#reliability
#teams

Software Dependencies, Work Dependencies, and Their Impact on Failures

Useful for understanding how coordination and communication links can become failure modes in large delivery systems.

Martin Fowler

#architecture
#modularity

Bounded Context

Strong way to keep domain boundaries clear so an agent or teammate does not have to solve the whole system at once.

#architecture
#migration
#modularity

Branch By Abstraction

A practical migration pattern when you need to replace behavior gradually without freezing the rest of the system.

#architecture
#teams

Conway's Law

A reminder that team structure leaks into architecture, which matters when agent workflows mirror org boundaries.

#modularity
#tools

Flag Argument

Useful warning about APIs that hide multiple behaviors behind one parameter and become hard for agents to use correctly.

#agent_workflows
#architecture
#reliability
#teams

Humans and Agents in Software Engineering Loops

Best read here for the distinction between the why loop and the how loop, and for why the surrounding harness deserves as much attention as the model itself.

#architecture
#documentation
#modularity

Is Design Dead?

Durable framing for keeping design continuous and evolutionary instead of treating upfront plans as either complete or worthless.

#architecture
#reliability

Is High Quality Software Worth the Cost?

A clear argument that internal quality pays for itself through cheaper, faster change, which matters when agents repeatedly modify the same codebase.

#architecture
#modularity
#teams

Linking Modular Architecture to Development Teams

Shows that modularity only pays off when the team structure and developer experience support the boundaries.

#architecture
#migration

Patterns of Legacy Displacement

Very practical for replacing old systems in stages instead of turning modernization into a risky big-bang rewrite.

#reliability
#testing

Test Pyramid

Still one of the cleanest heuristics for placing verification where it is cheapest and most informative.

#architecture
#reliability

Technical Debt Quadrant

Useful vocabulary for distinguishing deliberate tradeoffs from careless debt, which keeps agent-produced shortcuts visible instead of accidental.

#performance

Yet Another Optimization Article

A good antidote to speculative tuning, especially when an agent or engineer starts optimizing before the bottleneck is real.

Martin Fowler and Birgitta Böckeler

#agent_workflows
#context_engineering
#multi-agent
#reliability
#tools

Context Engineering for Coding Agents

A grounded overview of current context features in coding tools, and how to think about prompts, rules, skills, and subagents as one context system instead of disconnected knobs.

Martin Fowler and James Lewis

#architecture
#modularity
#teams

Microservices

The classic case for splitting services along independently deployable boundaries, which is still the right default when agent work needs a smaller surface area.

Microsoft Learn

#performance
#reliability
#testing

Use Test Impact Analysis

Useful for running only the tests likely to be affected by a change, which is exactly the sort of bounded feedback loop agents need.

N. Ajienka

#architecture
#evaluation
#modularity

Managing Hidden Dependencies in OO Software: A Study Based on Open Source Projects

Useful evidence that hidden dependencies are not theoretical; they show up in real codebases and matter for maintenance.

NASA Software Engineering Handbook

#documentation
#reliability
#testing

SWE-055 - Requirements Validation

Authoritative requirements-validation guidance for checking that documented intent is correct, complete, consistent, and testable before implementation relies on it.

Neal Ford

#architecture
#reliability
#testing

Building Evolutionary Architectures

Good framework for using fitness functions and other checks to let an architecture evolve without losing control.

NIST

#agent_workflows
#evaluation
#reliability
#safety

Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile

Authoritative risk-management profile for generative AI systems, useful for grounding agent workflows in documented governance, evaluation, and human-oversight concerns.

OpenAI

#agent_workflows
#multi-agent
#reliability
#safety
#tools

A practical guide to building agents

The best OpenAI companion to Anthropic's overview. It connects model choice, instructions, tools, guardrails, and orchestration patterns, while arguing that you should exhaust what a single capable agent can do before splitting the work across multiple specialists.

#agent_workflows
#documentation
#reliability
#testing

Harness engineering: leveraging Codex in an agent-first world

A concrete account of shaping repo structure, docs, and verification so the harness gives agents room to work without making the system or decision trail opaque.

OpenAI Developers

#agent_workflows
#teams
#testing

Building an AI-Native Engineering Team

Useful organizational guidance for separating planning, implementation, and testing work so agents fit into the team instead of becoming a side experiment.

#agent_workflows
#context_engineering
#reliability
#testing

Run long-horizon tasks with Codex

A useful long-run case study on keeping a single session productive for hours through checkpoints, validation, and status artifacts that preserve context as the task evolves.

#agent_workflows
#documentation
#reliability
#testing
#tools

Build Code Review with the Codex SDK

Shows how to structure Codex for automated review with JSON output, deterministic review structure, and inline PR-comment integration.

#agent_workflows
#reliability
#safety
#tools

Guardrails and human review

A compact operational guide to approval surfaces. Useful for deciding which checks should be automatic, where runs should pause for review, and how to separate validation from human authorization.

#agent_workflows
#evaluation
#reliability
#testing

Testing Agent Skills Systematically with Evals

Good guide for turning prompt or skill quality into repeatable checks instead of subjective impressions, especially when skills evolve over time.

S. Amann

#evaluation
#reliability
#tools

A Systematic Evaluation of Static API-Misuse Detectors

Useful for understanding where static misuse detectors help and where they still miss enough to require stronger checks.

Sandi Metz

#architecture
#modularity

The Wrong Abstraction

Classic reminder that premature abstraction is often worse than direct code, especially when the variation you are abstracting for has not appeared yet.

Simon Willison

#agent_workflows
#multi-agent
#reliability
#testing
#tools

Agentic Engineering Patterns

Best practical handbook here for day-to-day coding-agent work. It collects habits, anti-patterns, subagent use, and testing loops that make the rest of the theory usable in real repositories.

Software Engineering at Google

#documentation
#reliability

Chapter 10: Documentation

Useful for thinking about docs as an engineering artifact that agents should be able to rely on and keep in sync.

#reliability
#testing

Chapter 11: Testing Overview

A broad map of testing as an engineering system, helpful when you need reliable feedback loops for agent-run changes.

#reliability
#testing

Chapter 12: Unit Testing

Good reference for the fastest, most localized form of feedback an agent can get while iterating.

#context_engineering
#documentation
#tools

Chapter 17: Code Search

Useful for making large codebases navigable, because searchability is often what lets an agent find the right context at all.

Thoughtworks

#architecture
#reliability
#testing

Fitness function-driven development

A strong pattern for translating architectural goals into executable checks that run continuously.

#agent_workflows
#context_engineering
#documentation

Spec-driven development

Technology Radar entry that captures spec-driven development as an emerging technique for making AI-assisted implementation more traceable and intentional.

W. P. Stevens

#architecture
#modularity

Structured Design

Foundational module-design reading on keeping responsibilities separate and interfaces clean.
