#048 Why Your AI Agents Need Permission to Act, Not Just Read
Nicolay here,
most AI conversations obsess over capabilities. This one focuses on constraints - the right ones that make AI actually useful rather than just impressive demos.
Today I have the chance to talk to Dexter Horthy, who recently put out a long piece called “12-Factor Agents”.
It’s like the 10 commandments, but for building agents.
One of them is “Contact humans with tool calls”: the LLM can call humans for high-stakes decisions or “writes”.
The key insight is brutally simple. AI can get to 90% accuracy on most tasks - good enough for spam-like activities but disastrous for anything that requires trust. The solution isn't to wait for models to get smarter; it's to add a human approval layer for critical actions.
Imagine you are writing to a database or sending an email. Each “write” has to be approved by a human. So you post the email in a Slack channel, and in most cases your salespeople will approve it. In the other 10%, it’s stopped in its tracks and a human can take over. You stop the slop and collect good training data in the meantime.
Dexter’s company is building exactly this: an approval mechanism that lets AI agents send requests to humans before executing.
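To make the pattern concrete, here is a minimal sketch of such an approval gate in Python. This is not HumanLayer's actual API; the decorator, the Slack helper, and the in-memory store are all assumptions for illustration.

```python
import functools
import uuid
from typing import Any, Callable

# Pending tool calls awaiting human sign-off. In production this would be
# a database or durable queue; every name here is an illustrative assumption.
PENDING: dict[str, dict[str, Any]] = {}

def post_for_approval(channel: str, summary: str, request_id: str) -> None:
    """Placeholder for posting the proposed action to Slack, email, etc."""
    print(f"[{channel}] approval requested ({request_id}): {summary}")

def requires_approval(channel: str) -> Callable:
    """Wrap a 'write' tool so it only executes after explicit human approval."""
    def decorator(tool: Callable) -> Callable:
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            request_id = str(uuid.uuid4())
            PENDING[request_id] = {"tool": tool, "args": args, "kwargs": kwargs}
            post_for_approval(channel, f"{tool.__name__} {kwargs}", request_id)
            # The agent's turn ends here; nothing has been written yet.
            return {"status": "pending_approval", "request_id": request_id}
        return wrapper
    return decorator

def resolve(request_id: str, approved: bool) -> Any:
    """Wired to the approve/reject buttons: run the call or drop it."""
    call = PENDING.pop(request_id)
    if not approved:
        return {"status": "rejected"}  # a human takes over from here
    return call["tool"](*call["args"], **call["kwargs"])

@requires_approval(channel="#sales-approvals")
def send_email(to: str, subject: str, body: str) -> dict:
    # The actual send (SMTP, an email API, ...) would happen here.
    return {"status": "sent", "to": to, "subject": subject}
```

When the agent calls `send_email(...)`, the request is just parked; only `resolve(request_id, approved=True)`, triggered by the human in Slack, performs the write.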
In the podcast, we also touch on a bunch of other things:
- MCP, and why today’s “MCP servers” are (at the moment) really just thin clients
- Are we training LLMs toward mediocrity?
- What infrastructure do we need for human in the loop (e.g. DBOS)?
- and more
💡 Core Concepts
- Context Engineering: Crafting the information representation for LLMs - selecting optimal data structures, metadata, and formats to ensure models receive precisely what they need to perform effectively.
- Token Bloat Prevention: Ruthlessly eliminating irrelevant information from context windows to maintain agent focus during complex tasks, preventing the pattern of repeating failed approaches.
- Human-in-the-loop Approval Flows: Achieving 99% reliability through a "90% AI + 10% human oversight" framework where agents analyze data and suggest actions but request explicit permission before execution.
- Rubric Engineering: Systematically evaluating AI outputs through dimension-specific scoring criteria to provide precise feedback and identify exceptional results, helping escape the trap of models converging toward mediocrity.
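As a rough illustration of the rubric idea (the dimensions and weights below are made up for the example, not something from the episode), a reviewer scores each output along explicit axes instead of giving a single thumbs-up:

```python
from dataclasses import dataclass

# Hypothetical scoring dimensions; a real rubric would be task-specific.
RUBRIC = {
    "factual_accuracy": 0.4,  # weight of each dimension in the total
    "tone_match": 0.3,
    "novelty": 0.3,           # explicitly reward non-average outputs
}

@dataclass
class Evaluation:
    output_id: str
    scores: dict[str, int]    # 1-5 per dimension, from a human or LLM judge

    def weighted_score(self) -> float:
        return sum(RUBRIC[dim] * score for dim, score in self.scores.items())

ev = Evaluation(
    output_id="draft-017",
    scores={"factual_accuracy": 5, "tone_match": 4, "novelty": 2},
)
print(ev.weighted_score())    # 3.8 -> accurate but bland; flag for rework
```

The per-dimension scores are what make the feedback actionable: “accurate but generic” points the next iteration (and the training data you collect) away from the average.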
📶 Connect with Dexter:
📶 Connect with Nicolay:
- X / Twitter
- Bluesky
- Website
- My Agency Aisbach (for AI implementations / strategy)
⏱️ Important Moments
- MCP Servers as Clients: [03:07] Dexter explains why what many call "MCP servers" actually function more like clients when examining the underlying code.
- Authentication Challenges: [04:45] The discussion shifts to how authentication should be handled in MCP implementations and whether it belongs in the protocol.
- Asynchronous Agent Execution: [08:18] Exploring how to handle agents that need to pause for human input without wasting tokens on continuous polling (a rough sketch follows this list).
- Token Bloat Prevention: [14:41] Strategies for keeping context windows focused and efficient, moving beyond standard chat formats.
- Context Engineering: [29:06] The concept that everything in AI agent development ultimately comes down to effective context engineering.
- Fine-tuning vs. RAG for Writing Style: [20:05] Contrasting fine-tuning on personal writing style with providing examples in the context window.
- Generating Options vs. Deterministic Outputs: [19:44] The unexplored potential of having AI generate diverse creative options for human selection.
- The "Mediocrity Convergence" Question: [37:11] The philosophical concern that popular LLMs may inevitably trend toward average quality.
- Data Labeling Interfaces: [35:25] Discussion about the need for better, lower-friction interfaces to collect human feedback on AI outputs.
- Human-in-the-loop Approval Flows: [42:46] The core approach of HumanLayer, allowing agents to ask permission before taking action.
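One way to picture the asynchronous-execution point from [08:18] above (a sketch under assumptions, not DBOS's API or the exact approach discussed): instead of polling for the human's answer, the agent persists its state, exits, and a webhook resumes it when the approval arrives.

```python
import json
from pathlib import Path

STATE_DIR = Path("agent_state")  # stand-in for a durable store (DB, queue, ...)
STATE_DIR.mkdir(exist_ok=True)

def pause_for_human(run_id: str, state: dict) -> None:
    """Persist the run and return; no tokens are spent while waiting."""
    (STATE_DIR / f"{run_id}.json").write_text(json.dumps(state))

def on_approval_webhook(run_id: str, human_reply: str) -> None:
    """Called by the approval UI (Slack action, email link, ...)."""
    state = json.loads((STATE_DIR / f"{run_id}.json").read_text())
    state["messages"].append({"role": "human", "content": human_reply})
    resume_agent(state)  # re-enter the agent loop with the human's reply

def resume_agent(state: dict) -> None:
    # Placeholder: feed the updated context back into the LLM loop.
    print(f"resuming {state['run_id']} with {len(state['messages'])} messages")

# Agent side: hit a high-stakes step, park the run, return immediately.
pause_for_human("run-42", {"run_id": "run-42", "messages": [
    {"role": "assistant", "content": "Draft ready, requesting sign-off."},
]})
# Later, when the human responds, the webhook fires:
on_approval_webhook("run-42", "Approved, but soften the second paragraph.")
```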
🛠️ Tools & Tech Mentioned
📚 Recommended Resources
🔮 What's Next
Next week, we continue digging into getting generative AI into production, talking to Vibhav from BAML.
💬 Join The Conversation
Follow How AI Is Built on YouTube, Bluesky, or Spotify.
If you have any suggestions for future guests, feel free to leave them in the comments or write me (Nicolay) directly on LinkedIn, X, or Bluesky. Or at nicolay.gerold@gmail.com.
I will be opening a Discord soon to get you guys more involved in the episodes! Stay tuned for that.
♻️ I am trying to build the new platform for engineers to share their experience that they have earned after building and deploying stuff into production. I am trying to produce the best content possible - informative, actionable, and engaging. I'm asking for two things: hit subscribe now to show me what content you like (so I can do more of it), and if this episode helped you, pay it forward by sharing with one engineer who's facing similar challenges. That's the agreement - I deliver practical value, you help grow this resource for everyone. ♻️