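[{"data":1,"prerenderedAt":702},["ShallowReactive",2],{"blog-ai-agent-harness-engineering":3},{"id":4,"title":5,"body":6,"category":686,"date":687,"description":688,"extension":689,"meta":690,"navigation":691,"path":692,"seo":693,"stem":694,"tags":695,"__hash__":701},"blog\u002Fblog\u002Fai-agent-harness-engineering.md","The Engineering Philosophy of AI Agents: Core Principles of Harness Design",{"type":7,"value":8,"toc":673},"minimark",[9,13,26,33,36,41,44,47,131,138,144,148,151,162,168,171,187,194,199,203,210,216,222,225,230,234,237,240,334,341,348,354,357,362,366,369,395,398,401,429,434,438,441,464,470,475,479,491,494,499,503,506,525,528,554,559,563,566,569,589,592,597,601,604,660,663,666],[10,11,5],"h1",{"id":12},"ai-agent-的工程化哲学harness-设计的核心原则",[14,15,16,17,21,22,25],"p",{},"Many people, the first time they use Claude Code or Cursor, get the impression that it is remarkably smart and can do anything. After using these products for a while, though, you realize that what makes them truly impressive is ",[18,19,20],"strong",{},"not the model itself"," (underneath, they all call Claude or GPT) ",[18,23,24],{},"but the \"scaffolding\" built around the model",".",[14,27,28,29,32],{},"In AI engineering circles this scaffolding has a dedicated name: ",[18,30,31],{},"Harness",". The word originally refers to the gear fitted on a horse to control its direction; by extension, it means all the control code built around an LLM: deciding when to call the model, what to put into the context, how to validate output, how to retry on failure, and how to save state so that work can resume after an interruption.",[14,34,35],{},"The LLM is the engine; the Harness is every part needed to put that engine into a car.",[37,38,40],"h2",{"id":39},"_1-context-engineering每个-token-都是预算","1. 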
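Context Engineering: Every Token Is Budget",[14,42,43],{},"Input to an LLM should be precise. Everyone agrees with that sentence, but what does it mean in practice?",[14,45,46],{},"\"Stating the requirement clearly\" is not enough. Context Engineering is a complete curation strategy:",[48,49,50,63],"table",{},[51,52,53],"thead",{},[54,55,56,60],"tr",{},[57,58,59],"th",{},"Dimension",[57,61,62],{},"In practice",[64,65,66,77,87,97,107,117],"tbody",{},[54,67,68,74],{},[69,70,71],"td",{},[18,72,73],{},"System Prompt",[69,75,76],{},"State the role, goal, constraints, and output format explicitly",[54,78,79,84],{},[69,80,81],{},[18,82,83],{},"Few-shot Examples",[69,85,86],{},"High-quality examples work better than long lists of rules: models are far better at imitating code than at writing code from a natural-language specification",[54,88,89,94],{},[69,90,91],{},[18,92,93],{},"Tool Descriptions",[69,95,96],{},"Make descriptions specific and include counterexamples (\"do not call this tool in scenario X\")",[54,98,99,104],{},[69,100,101],{},[18,102,103],{},"Working Context",[69,105,106],{},"Inject the current task state as a structured block rather than mixing it into the conversation history",[54,108,109,114],{},[69,110,111],{},[18,112,113],{},"Retrieved Context",[69,115,116],{},"Label material pulled in by RAG with its source, timestamp, and trustworthiness",[54,118,119,124],{},[69,120,121],{},[18,122,123],{},"Negative Context",[69,125,126,127,130],{},"Tell the model ",[18,128,129],{},"what not to do","; this is often more effective than positive instructions",[14,132,133,134,137],{},"Core insight: the context window is a scarce resource, and ",[18,135,136],{},"every token should serve the current subtask",". The anti-pattern is dumping the entire project codebase, the whole conversation history, and every tool that might be needed into the context. This is called Context Stuffing, and the result is that signal gets drowned in noise.",[14,139,140,143],{},[18,141,142],{},"Summary:"," Precise input is more than \"saying things clearly\"; it is a discipline of information curation. Discard what is irrelevant, label sources, illustrate with examples, and state prohibitions explicitly: four measures applied together.",[37,145,147],{"id":146},"_2-workflow-与-agent-的光谱外层固定内层自由","2. 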
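Workflow and Agent as a Spectrum: Fixed Outer Layer, Free Inner Layer",[14,149,150],{},"Agent design sits on a spectrum:",[152,153,158],"pre",{"className":154,"code":156,"language":157},[155],"language-text","Pure Workflow             ←──────────────────────────────→  Pure Agent\nAll branches hard-coded   Preset skeleton + LLM decisions   Fully free ReAct loop\nPredictable, auditable                                      Flexible, adaptable\nEasy to debug                                               Hard to debug, hard to budget\n","text",[159,160,156],"code",{"__ignoreMap":161},"",[14,163,164,165],{},"The engineering principle for large, complex tasks is clear: ",[18,166,167],{},"whatever can be expressed as a workflow should not be handed to an Agent to decide autonomously.",[14,169,170],{},"Take a large \"code review → test → deploy\" task:",[172,173,174,181],"ul",{},[175,176,177,180],"li",{},[18,178,179],{},"The macro flow is a Workflow"," (three stages in a fixed order, expressed as a state machine)",[175,182,183,186],{},[18,184,185],{},"The judgment inside each stage is an Agent"," (the LLM decides which files to read and which tests to run)",[14,188,189,190,193],{},"This is the precise meaning of a \"relatively fixed flow\": ",[18,191,192],{},"the outer skeleton is fixed, and inner decisions are free",". LangGraph's StateGraph fits this naturally: the StateGraph defines the workflow skeleton, and each node can contain an Agent subgraph.",[14,195,196,198],{},[18,197,142],{}," Treating an Agent as an \"all-purpose autonomous decision-maker\" is the mistake beginners make most easily. Experienced designers use a fixed flow as the skeleton and let Agent capabilities fill in the flesh.",[37,200,202],{"id":201},"_3-plan-then-execute先规划再动手","3. Plan-Then-Execute: Plan First, Then Act",[14,204,205,206,209],{},"Letting an LLM dive straight into the work tends to fail far more often than having it produce a plan first. In industry this is called the ",[18,207,208],{},"Plan-Then-Execute pattern",":",[152,211,214],{"className":212,"code":213,"language":157},[155],"Input\n  ↓\nStage 1: Clarification\n  - The LLM asks the user questions to confirm the real goal\n  - HITL (human confirmation) when necessary\n  ↓\nStage 2: Planning\n  - Output a structured task list\n  - Each item includes: description, dependencies, expected artifact, verification method\n  ↓\nStage 3: Execution\n  - Iterate over the task list, executing item by item\n  - Update state after each item completes\n  ↓\nStage 4: Consolidation\n  - Check that every goal was met and merge the outputs\n",[159,215,213],{"__ignoreMap":161},[14,217,218,221],{},[18,219,220],{},"One key point is easy to overlook: the output of the Planning stage should be a data structure, not natural language."," A structured plan is machine-readable, can drive a progress bar, supports parallel execution as a DAG, and lets recovery after an interruption pinpoint the exact task.",[14,223,224],{},"Claude Code's TodoWrite is an explicit implementation of Plan-Then-Execute. It is not an optional helper feature but a core safeguard that keeps complex tasks on track.",[14,226,227,229],{},[18,228,142],{}," Have the LLM produce a structured plan first, let a human review and confirm it, then execute item by item. Those 30 seconds of \"braking\" can save 30 minutes of rollback later.",[37,231,233],{"id":232},"_4-verifier-loop没有验证的-agent-就是没有刹车的跑车","4. 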
Verifier Loop: An Agent Without Verification Is a Sports Car Without Brakes",[14,235,236],{},"This is the most critical and most easily overlooked principle in all of Harness design. LLMs will confidently make things up, and only an objective Verifier can stop them.",[14,238,239],{},"Verification methods in AI coding fall into six levels (strongest to weakest):",[48,241,242,255],{},[51,243,244],{},[54,245,246,249,252],{},[57,247,248],{},"Verification level",[57,250,251],{},"Method",[57,253,254],{},"Reliability",[64,256,257,270,283,295,308,321],{},[54,258,259,264,267],{},[69,260,261],{},[18,262,263],{},"Execution-level",[69,265,266],{},"Run tests, compile, execute",[69,268,269],{},"★★★★★",[54,271,272,277,280],{},[69,273,274],{},[18,275,276],{},"Static analysis",[69,278,279],{},"TypeScript \u002F lint \u002F AST checks",[69,281,282],{},"★★★★",[54,284,285,290,293],{},[69,286,287],{},[18,288,289],{},"Schema-level",[69,291,292],{},"Validate output structure with zod",[69,294,282],{},[54,296,297,302,305],{},[69,298,299],{},[18,300,301],{},"LLM-as-Judge",[69,303,304],{},"Scoring by another LLM",[69,306,307],{},"★★★",[54,309,310,315,318],{},[69,311,312],{},[18,313,314],{},"Regex \u002F string matching",[69,316,317],{},"Check that keywords appear",[69,319,320],{},"★★",[54,322,323,328,331],{},[69,324,325],{},[18,326,327],{},"No verification",[69,329,330],{},"Trust the model",[69,332,333],{},"★",[14,335,336,337,340],{},"In engineering practice: ",[18,338,339],{},"every subtask needs at least one Verifier",". On failure, run a Critic-Actor loop: observe the error → fix → retry.",[14,342,343,344,347],{},"But a Verifier's value is more than a binary \"pass\u002Ffail\" signal. ",[18,345,346],{},"A good Verifier returns a structured failure reason"," that the LLM can use to correct itself:",[152,349,352],{"className":350,"code":351,"language":157},[155],"❌ Bad verifier:  \"Test failed\"\n✓ Good verifier:  \"Test 'should return user id' failed:\n                   expected 'user-123', got undefined.\n                   Possible cause: getUserById does not handle the id argument it receives.\"\n",[159,353,351],{"__ignoreMap":161},[14,355,356],{},"TDD is the most natural implementation of the Verifier Loop: tests are the executable version of the requirements, and the red → green → refactor rhythm suits the way an Agent works.",[14,358,359,361],{},[18,360,142],{}," The Verifier sets the ceiling on Agent quality. Execution-level verification (running tests, compiling) is the most reliable, followed by static analysis. Every subtask must have at least one.",[37,363,365],{"id":364},"_5-state-is-the-backbone分层的状态管理","5. 
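State is the Backbone: Layered State Management",[14,367,368],{},"The most essential infrastructure of a production-grade Agent is state management. By granularity it has four layers:",[172,370,371,377,383,389],{},[175,372,373,376],{},[18,374,375],{},"Task level",": status \u002F result \u002F errors for each subtask",[175,378,379,382],{},[18,380,381],{},"Step level",": input \u002F output \u002F tokens \u002F latency for each LLM call",[175,384,385,388],{},[18,386,387],{},"Event level",": arguments, return value, and duration for each tool call",[175,390,391,394],{},[18,392,393],{},"Session level",": global metadata (user_id, session_id, budget_used)",[14,396,397],{},"A well-designed Agent State, when printed out, should let an engineer who has just joined understand \"what is being done right now and how far along it is\".",[14,399,400],{},"Recovery from interruption comes in three tiers:",[402,403,404,410,423],"ol",{},[175,405,406,409],{},[18,407,408],{},"Crash Recovery",": the program crashed; the checkpointer stores state in Redis\u002FPostgres and loads it after restart",[175,411,412,415,416,419,420],{},[18,413,414],{},"Human Pause Recovery",": a human steps in and pauses the run; ",[159,417,418],{},"interrupt()"," + ",[159,421,422],{},"resume",[175,424,425,428],{},[18,426,427],{},"Long-task Resume",": multi-day tasks; persist as each subtask completes to avoid redoing work",[14,430,431,433],{},[18,432,142],{}," State is not just a technical detail; it is the Agent's \"memory spine\". An Agent without layered, persistent state can only handle short tasks of about five minutes. Beyond that window, a crash means starting from zero.",[37,435,437],{"id":436},"_6-预算控制给你的-agent-戴上三个紧箍咒","6. Budget Control: Three Reins on Your Agent",[14,439,440],{},"Agents tend to \"run too long\", retrying in a loop and burning tokens. Set budgets along three dimensions:",[172,442,443,449,458],{},[175,444,445,448],{},[18,446,447],{},"Token budget",": a cap on cumulative tokens; once exceeded, force a summary or give up",[175,450,451,454,455],{},[18,452,453],{},"Step budget",": at most N loop iterations, corresponding to LangGraph's ",[159,456,457],{},"recursionLimit",[175,459,460,463],{},[18,461,462],{},"Wall-time budget",": a wall-clock time limit, implemented with AbortSignal",[14,465,466,467],{},"If any one is exceeded → take the degradation path: partial delivery, notifying the user, or handing off to a human. ",[18,468,469],{},"An Agent failing is not the danger; silently burning money is.",[14,471,472,474],{},[18,473,142],{}," These three budgets are not optional optimizations; they are the seat belts of a production-grade Agent. The biggest gap between open-source demos and closed-source products is often not model capability but budget control.",[37,476,478],{"id":477},"_7-错误是信号不是终点","7. 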
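Errors Are Signals, Not Endpoints",[14,480,481,482,485,486,488],{},"A basic Harness hits an error → retries the same operation (useless)",[483,484],"br",{},"\nAn intermediate Harness hits an error → retries with different parameters",[483,487],{},[18,489,490],{},"An advanced Harness hits an error → lets the LLM read the error details and replan",[14,492,493],{},"Feeding the error back as part of the context is a key scenario where an Agent shows \"intelligence\". This is not a simple retry: the Harness has to inject the error message into the context of the next LLM call in structured form, so the model understands \"what just happened, why it failed, and how to adjust now\".",[14,495,496,498],{},[18,497,142],{}," An Agent's intelligence lies not in never making mistakes, but in being able to read the error message and adjust its strategy after making one. This requires the Harness to make the \"error → context → replan\" chain a standard feature.",[37,500,502],{"id":501},"_8-工具设计少即是多","8. Tool Design: Less Is More",[14,504,505],{},"Giving an Agent 20 tools ≠ an Agent that can do 20 kinds of things. Too many tools cause problems:",[172,507,508,511,522],{},[175,509,510],{},"Diluted attention: every turn has to go through the selection again",[175,512,513,514,517,518,521],{},"Similar tools get confused: ",[159,515,516],{},"read_file"," vs ",[159,519,520],{},"load_file",", which one should it pick?",[175,523,524],{},"Descriptions interfere with each other: tool A's description happens to contain tool B's trigger words",[14,526,527],{},"Principles:",[172,529,530,536,547],{},[175,531,532,533],{},"Keep the tools bound to a single Agent to ",[18,534,535],{},"fewer than 10",[175,537,538,539,542,543,546],{},"Merge similar capabilities (a single ",[159,540,541],{},"file_operation"," tool that uses an ",[159,544,545],{},"op"," parameter to distinguish read\u002Fwrite\u002Fdelete)",[175,548,549,550,553],{},"For large tool sets, use ",[18,551,552],{},"multi-Agent routing"," (a Supervisor decides which sub-Agent to use, and each sub-Agent carries only the tools it needs)",[14,555,556,558],{},[18,557,142],{}," Tool design follows the same principles as function design: single responsibility, less is more. Do not lay a complex tool set out flat; organize it with an Agent hierarchy.",[37,560,562],{"id":561},"_9-observability-first不要接受黑盒","9. 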
Observability-First: Do Not Accept a Black Box",[14,564,565],{},"A production Agent must be wired to tracing from day one. Otherwise debugging comes down to luck.",[14,567,568],{},"Recommended options:",[172,570,571,577,583],{},[175,572,573,576],{},[18,574,575],{},"LangSmith"," (native to the LangChain family): shows the input\u002Foutput\u002Flatency of every step",[175,578,579,582],{},[18,580,581],{},"OpenTelemetry"," (a general-purpose option that can integrate with the company's existing Grafana\u002FDatadog)",[175,584,585,588],{},[18,586,587],{},"Your own event log"," (the bare minimum: jsonl format, one event per line)",[14,590,591],{},"At a minimum, trace: prompt\u002Ftokens\u002Flatency for every LLM call, arguments\u002Freturn value\u002Fduration for every tool call, and start\u002Fcompletion\u002Ffailure for every subtask.",[14,593,594,596],{},[18,595,142],{}," \"Why did this Agent give this answer?\" Without tracing, you will never be able to answer that question.",[37,598,600],{"id":599},"总结九条原则背后的一个核心信念","Summary: One Core Belief Behind the Nine Principles",[14,602,603],{},"Harness Engineering is not arcane knowledge; it simply carries these pieces of engineering common sense over into the AI setting:",[402,605,606,612,618,624,630,636,642,648,654],{},[175,607,608,611],{},[18,609,610],{},"Context Engineering",": every token serves the current subtask",[175,613,614,617],{},[18,615,616],{},"Workflow + Agent hybrid",": hard-code the macro level, free up the micro level",[175,619,620,623],{},[18,621,622],{},"Plan-Then-Execute",": produce a structured plan first, then execute item by item",[175,625,626,629],{},[18,627,628],{},"Verifier Loop",": every subtask must have objective verification",[175,631,632,635],{},[18,633,634],{},"Fine-grained State",": layered persistence that supports recovery at any granularity",[175,637,638,641],{},[18,639,640],{},"Budget Control",": three budgets (token\u002Fstep\u002Ftime), degrade when exceeded",[175,643,644,647],{},[18,645,646],{},"Errors as Context",": feed errors back as new information and replan",[175,649,650,653],{},[18,651,652],{},"Sparse Tools",": a small, well-chosen tool set; complex capabilities go through multi-Agent routing",[175,655,656,659],{},[18,657,658],{},"Observability-First",": tracing from day one",[14,661,662],{},"If you have studied LangGraph, you will notice that most of these principles map onto a native capability (StateGraph, Conditional Edge, MemorySaver, recursionLimit). Harness is not a new concept; it is these primitives applied in combination.",[14,664,665],{},"The real moat of products like Claude Code, Cursor, and Devin is not in the model layer but in the Harness layer. Once you understand these nine principles, you have a blueprint for building your own production-grade Agent.",[14,667,668],{},[669,670,672],"a",{"href":671},"\u002Fblog\u002F","Back to blog list",{"title":161,"searchDepth":674,"depth":674,"links":675},2,[676,677,678,679,680,681,682,683,684,685],{"id":39,"depth":674,"text":40},{"id":146,"depth":674,"text":147},{"id":201,"depth":674,"text":202},{"id":232,"depth":674,"text":233},{"id":364,"depth":674,"text":365},{"id":436,"depth":674,"text":437},{"id":477,"depth":674,"text":478},{"id":501,"depth":674,"text":502},{"id":561,"depth":674,"text":562},{"id":599,"depth":674,"text":600},"AI\u002FLLM","2026-05-03","Many people, the first time they use Claude Code or Cursor, get the impression that it is remarkably smart and can do anything. After using these products for a while, though, you realize that what makes them truly impressive is not the model itself (underneath, they all call Claude or GPT) but the \"scaffolding\" built around the model.","md",{},true,"\u002Fblog\u002Fai-agent-harness-engineering",{"title":5,"description":688},"blog\u002Fai-agent-harness-engineering",[696,697,698,699,700],"AI Agent","Harness Engineering","LangGraph","工程化","LLM","K0vtNQ4LO7pBgLid5SIVQdXoUh7dHxtKD0g9QxZHCfU",1779959652906]