Curated reading list
References
A curated reading list of articles, papers, and essays about context engineering, tools, memory, retrieval, and reliable AI agent workflows.
This page collects resources I find valuable, whether because they are especially useful, well-written, thought-provoking, or worth revisiting over time. The emphasis is on material that helps when building systems that need to plan, use tools, retrieve information, and stay reliable over long runs.
Anthropic
Building Effective AI Agents
A good foundation for deciding when to use a workflow, when to use an agent, and when the simplest single-call setup is enough.
Effective context engineering for AI agents
Best high-level framing of context as a scarce resource that must be assembled deliberately at each turn.
Effective harnesses for long-running agents
Useful for designing harnesses that keep long sessions moving without letting the context window or task state drift apart.
Harness design for long-running application development
A practical companion piece on reducing harness bulk while keeping long-running development runs reliable.
How we built our multi-agent research system
Shows how to split broad research into coordinated subagents, checkpoint plans, and keep citations attached to the final answer.
Introducing Contextual Retrieval
Concrete retrieval technique for giving chunks just enough surrounding meaning before indexing them, which helps agents pull the right evidence later.
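The core move the article describes can be sketched in a few lines: before embedding, each chunk gets a short blurb that situates it in the full document, and the contextualized text is what gets indexed. This is a minimal sketch of the idea, not Anthropic's implementation; `summarize_chunk_in_context` is a hypothetical stand-in for the LLM call that generates the situating context.

```python
def summarize_chunk_in_context(document: str, chunk: str) -> str:
    # Placeholder: a real system would ask a model to situate the
    # chunk within the overall document. Here we just cite the title.
    title = document.splitlines()[0]
    return f"From '{title}':"

def contextualize_chunks(document: str, chunks: list[str]) -> list[str]:
    contextualized = []
    for chunk in chunks:
        context = summarize_chunk_in_context(document, chunk)
        # The contextualized text, not the raw chunk, is what gets
        # embedded and indexed, so retrieval sees the surrounding meaning.
        contextualized.append(f"{context}\n{chunk}")
    return contextualized

doc = "Q3 Report\nRevenue grew 3%.\nCosts fell 2%."
print(contextualize_chunks(doc, ["Revenue grew 3%."])[0])
```

The point is that a bare sentence like "Revenue grew 3%." is ambiguous at retrieval time; prefixing even minimal document context makes the indexed chunk self-describing.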
Our framework for developing safe and trustworthy agents
Useful for the oversight, privacy, and security questions that sit beside autonomy in real systems.
Writing effective tools for agents - with agents
Excellent for turning tool definitions into clearer contracts, with better namespacing, outputs, and descriptions.
arXiv
Building Effective AI Coding Agents for the Terminal
A dense terminal-agent paper that covers scaffolding, compaction, memory, and other mechanics needed for long-running command-line work.
Theory of Code Space: Do Code Agents Understand Software Architecture?
Useful for understanding whether an agent can build and maintain a real map of architecture, not just make local edits.
ArchUnit
User Guide
Shows how to turn architecture rules into executable tests so boundaries stay enforced as code changes.
Bazel
Hermeticity
A strong explanation of why isolated, reproducible builds make automated changes easier to trust and debug.
Carlos E. Jimenez et al.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
The standard reality check for code agents: real GitHub issues, real repositories, and hard multi-file fixes.
Chris Richardson
Service per team
Good reminder that service boundaries work best when they line up with ownership, which also keeps agent tasks narrower.
Cognition
Don't Build Multi-Agents
Clear argument for sharing full traces and carrying decisions forward, instead of fanning work out before the problem really needs it.
How Cognition Uses Devin to Build Devin
Interesting because it shows an agent being used inside the product loop, exposing where automation helps and where human steering still matters.
Craig Larman
Protected Variation: The Importance of Being Closed
Classic guidance for putting stable interfaces around change points, which is the same move you need for tools, prompts, and APIs.
Cucumber
BDD
Useful for turning examples into shared, testable expectations that humans and automation can agree on.
Gherkin Reference
Worth keeping nearby when you want scenarios to stay precise enough to drive tests, documentation, or agent checks.
D. L. Parnas
On the Criteria To Be Used in Decomposing Systems into Modules
Foundational reading on modularity as information hiding, especially the idea that you should decompose around likely change.
G. Kiczales et al.
Aspect-Oriented Programming
A good historical reference for cross-cutting concerns and the tradeoff between local clarity and shared behavior.
GitHub Blog
Building an agentic memory system for GitHub Copilot
Shows a memory design that stores useful facts with citations and verifies them before reuse, which is the right way to avoid stale steering signals.
How to build reliable AI workflows with agentic primitives and context engineering
Strong practical guide to splitting planning, implementation, and testing into separate sessions and loading only the context each phase needs.
Google Research
AI in software engineering at Google: Progress and the path ahead
Useful for seeing where AI already helps at scale and where the next gains are likely to come from, especially testing, understanding, and maintenance.
Google SRE
Service Level Objectives
Core reading for turning vague reliability goals into measurements that an automated system can actually optimize against.
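The arithmetic behind that measurability is simple and worth internalizing: an availability SLO over a rolling window implies a concrete error budget. A minimal sketch, with illustrative numbers rather than anything from the chapter:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% availability SLO over 30 days leaves roughly 43.2 minutes
# of budget to spend on incidents, risky deploys, and experiments.
print(round(error_budget_minutes(0.999), 1))
```

Once the budget is a number, an automated system can track spend against it instead of arguing about whether the service feels reliable.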
Google Testing Blog
Just Say No to More End-to-End Tests
A durable argument for pushing feedback down the stack so failures are faster, cheaper, and easier to localize.
H. Gall
Detection of Logical Coupling Based on Product Release History
Useful for showing that hidden dependencies can be inferred from release behavior, not just source structure.
Herberto Graça
Packaging & namespacing
A helpful way to think about packages and namespaces as real architecture boundaries instead of just file organization.
Import Linter
Layers
A practical example of enforcing dependency direction in Python, which keeps generated or agent-edited code from breaking architecture.
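For reference, a layers contract in Import Linter is a small config file checked by the `lint-imports` command; the package and module names below are hypothetical. Layers are listed from highest to lowest, and lower layers may not import from higher ones:

```ini
[importlinter]
root_package = myapp

[importlinter:contract:layers]
name = Dependency direction
type = layers
layers =
    myapp.api
    myapp.domain
    myapp.data
```

Because the check runs in CI like any other test, an agent that accidentally imports `myapp.api` from `myapp.data` gets a hard failure instead of silently eroding the architecture.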
J. Becker
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
Good empirical guardrail for productivity claims, especially when you want to know what AI actually changes in real developer work.
Jimmy Bogard
Vertical Slice Architecture
Useful for organizing code around use cases so changes stay localized and agents can work on one slice without touching a whole layer stack.
John Ousterhout
A Philosophy of Software Design
One of the best general references for reducing complexity by designing modules that stay small, coherent, and easy to reason about.
Katie Hempenius
Performance Budgets 101
A concrete way to keep an agent honest about cost, because performance limits need explicit budgets instead of vague aspirations.
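A budget only helps if something enforces it. A minimal sketch of a budget gate, with illustrative metric names and limits (not taken from the article):

```python
# Explicit limits: total JavaScript in KB, Largest Contentful Paint in ms.
BUDGETS = {"js_kb": 170, "lcp_ms": 2500}

def over_budget(measured: dict[str, float]) -> dict[str, float]:
    """Return each metric that exceeds its budget, with the overage."""
    return {
        metric: measured[metric] - limit
        for metric, limit in BUDGETS.items()
        if measured.get(metric, 0) > limit
    }

# A build that ships 190 KB of JS blows the budget by 20 KB and fails.
print(over_budget({"js_kb": 190, "lcp_ms": 2100}))
```

Wiring a check like this into CI turns "keep it fast" into a pass/fail signal an agent can act on.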
Kent C. Dodds
AHA Programming
A short case for waiting on abstraction until the variation actually shows up, which helps avoid over-generalized code from both humans and agents.
LangChain Blog
Context Engineering
A useful taxonomy for writing, selecting, compressing, and isolating context as separate problems instead of one vague prompt challenge.
The rise of "context engineering"
A concise explanation of why context quality, structure, and format matter more than clever wording once systems become dynamic.
M. Cataldo
Software Dependencies, Work Dependencies, and Their Impact on Failures
Useful for understanding how coordination and communication links can become failure modes in large delivery systems.
Martin Fowler
Bounded Context
Strong way to keep domain boundaries clear so an agent or teammate does not have to solve the whole system at once.
Branch By Abstraction
A practical migration pattern when you need to replace behavior gradually without freezing the rest of the system.
Conway's Law
A reminder that team structure leaks into architecture, which matters when agent workflows mirror org boundaries.
Flag Argument
Useful warning about APIs that hide multiple behaviors behind one parameter and become hard for agents to use correctly.
Humans and Agents in Software Engineering Loops
Best read here for the distinction between the "why" loop and the "how" loop, and for why the harness deserves as much attention as the model.
Linking Modular Architecture to Development Teams
Shows that modularity only pays off when the team structure and developer experience support the boundaries.
Patterns of Legacy Displacement
Very practical for replacing old systems in stages instead of turning modernization into a risky big-bang rewrite.
Test Pyramid
Still one of the cleanest heuristics for placing verification where it is cheapest and most informative.
Yet Another Optimization Article
A good antidote to speculative tuning, especially when an agent or engineer starts optimizing before the bottleneck is real.
Martin Fowler and Birgitta Böckeler
Context Engineering for Coding Agents
A grounded overview of the current context features in coding tools and how to think about prompts, rules, skills, and subagents as a system.
Martin Fowler and James Lewis
Microservices
The classic case for splitting services along independently deployable boundaries, which is still the right default when agent work needs a smaller surface area.
Microsoft Learn
Use Test Impact Analysis
Useful for running only the tests likely to be affected by a change, which is exactly the sort of bounded feedback loop agents need.
N. Ajienka
Managing Hidden Dependencies in OO Software: A Study Based on Open Source Projects
Useful evidence that hidden dependencies are not theoretical; they show up in real codebases and matter for maintenance.
Neal Ford
Building Evolutionary Architectures
Good framework for using fitness functions and other checks to let an architecture evolve without losing control.
OpenAI
Harness engineering: leveraging Codex in an agent-first world
A concrete account of shaping repo structure, docs, and verification so agents can do the bulk work without making the system opaque.
OpenAI Developers
Building an AI-Native Engineering Team
Useful organizational guidance for separating planning, implementation, and testing work so agents fit into the team instead of becoming a side experiment.
Run long-horizon tasks with Codex
A useful long-run case study on keeping a single session productive for hours through checkpoints, validation, and good status artifacts.
Testing Agent Skills Systematically with Evals
Good guide for turning prompt or skill quality into repeatable checks instead of subjective impressions.
S. Amann
A Systematic Evaluation of Static API-Misuse Detectors
Useful for understanding where static misuse detectors help and where they still miss enough to require stronger checks.
Sandi Metz
The Wrong Abstraction
Classic reminder that premature abstraction is often worse than direct code, especially when the variation you are abstracting for has not appeared yet.
Software Engineering at Google
Chapter 10: Documentation
Useful for thinking about docs as an engineering artifact that agents should be able to rely on and keep in sync.
Chapter 11: Testing Overview
A broad map of testing as an engineering system, helpful when you need reliable feedback loops for agent-run changes.
Chapter 12: Unit Testing
Good reference for the fastest, most localized form of feedback an agent can get while iterating.
Chapter 17: Code Search
Useful for making large codebases navigable, because searchability is often what lets an agent find the right context at all.
Thoughtworks
Fitness function-driven development
A strong pattern for translating architectural goals into executable checks that run continuously.
W. P. Stevens
Structured Design
Foundational module-design reading on keeping responsibilities separate and interfaces clean.