1,762 subscribers


Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726
Today, we're joined by Maohao Shen, PhD student at MIT, to discuss his paper, “Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search.” We dig into how Satori leverages reinforcement learning to improve language model reasoning by enabling self-reflection, self-correction, and exploration of alternative solutions. We explore the Chain-of-Action-Thought (COAT) approach, which uses three special tokens (continue, reflect, and explore) to guide the model through distinct reasoning actions, allowing it to navigate complex reasoning tasks without external supervision. We also break down Satori’s two-stage training process: format tuning, which teaches the model to understand and use the special action tokens, and reinforcement learning, which optimizes reasoning through trial-and-error self-improvement. We cover key techniques such as “restart and explore,” which allows the model to self-correct and generalize beyond its training domain. Finally, Maohao reviews Satori’s performance and how it compares to other models, the reward design, the benchmarks used, and the surprising observations made during the research.
The complete show notes for this episode can be found at https://twimlai.com/go/726.
758 episodes
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
All episodes

Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli - #738 (1:00:29)
Building the Internet of Agents with Vijoy Pandey - #737 (56:13)
LLMs for Equities Feature Forecasting at Two Sigma with Ben Wellington - #736 (59:31)
Zero-Shot Auto-Labeling: The End of Annotation for Computer Vision with Jason Corso - #735 (56:45)
Grokking, Generalization Collapse, and the Dynamics of Training Deep Neural Networks with Charles Martin - #734 (1:25:21)
Google I/O 2025 Special Edition - #733 (26:21)
RAG Risks: Why Retrieval-Augmented LLMs are Not Safer with Sebastian Gehrmann - #732 (57:09)
From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731 (1:01:25)
How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730 (1:07:27)
CTIBench: Evaluating LLMs in Cyber Threat Intelligence with Nidhi Rastogi - #729 (56:18)
Generative Benchmarking with Kelly Hong - #728 (54:17)
Exploring the Biology of LLMs with Circuit Tracing with Emmanuel Ameisen - #727 (1:34:06)
Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726 (51:45)
Waymo's Foundation Model for Autonomous Driving with Drago Anguelov - #725 (1:09:07)
Dynamic Token Merging for Efficient Byte-level Language Models with Julie Kallini - #724 (50:32)