CUDA graphs

PyTorch Developer Podcast

Added four years ago

Content provided by PyTorch, Edward Yang, and Team PyTorch. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by PyTorch, Edward Yang, and Team PyTorch or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://he.player.fm/legal.

4 years ago 13:55

What are CUDA graphs? How are they implemented? What does it take to actually use them in PyTorch?

Further reading.

NVIDIA has docs on CUDA graphs https://developer.nvidia.com/blog/cuda-graphs/
Nuts and bolts implementation PRs from mcarilli: https://github.com/pytorch/pytorch/pull/51436 https://github.com/pytorch/pytorch/pull/46148
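
To make the capture-and-replay idea behind these questions concrete, here is a toy, pure-Python sketch. It is not the CUDA or PyTorch API (in PyTorch the real entry points are `torch.cuda.CUDAGraph` and `torch.cuda.graph`); `ToyGraph`, `run`, and the buffer names are hypothetical and exist only for illustration. The sketch assumes the usual model: work is recorded once against fixed buffers, and each replay re-runs the same work on the same buffers, so fresh inputs have to be copied into the captured input buffer.

```python
class ToyGraph:
    """Hypothetical stand-in for a captured graph: a fixed list of recorded steps."""

    def __init__(self):
        self.steps = []  # each step is (function, input_buffer, output_buffer)

    def record(self, fn, src, dst):
        # Capture stores references to the buffers, not copies of their contents.
        self.steps.append((fn, src, dst))

    def replay(self):
        # Replay runs the same steps against the same buffers every time.
        for fn, src, dst in self.steps:
            dst[:] = [fn(x) for x in src]


# Static buffers: allocated once, reused by every replay.
static_in = [0.0, 0.0, 0.0]
hidden = [0.0, 0.0, 0.0]
static_out = [0.0, 0.0, 0.0]

graph = ToyGraph()
graph.record(lambda x: x * 2.0, static_in, hidden)
graph.record(lambda x: x + 1.0, hidden, static_out)


def run(new_input):
    # New data must be copied INTO the captured input buffer; rebinding
    # static_in to a new list would leave the graph reading the old one.
    static_in[:] = new_input
    graph.replay()
    return list(static_out)


first = run([1.0, 2.0, 3.0])
second = run([10.0, 20.0, 30.0])
```

The in-place copy in `run` is the point of the sketch: the recorded steps never see a new list object, only new contents in the buffers they were recorded against.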

83 episodes

#Tech #PyTorch #Edward Yang #Team PyTorch #Deep Learning #Machine Learning

All episodes

Compiler collectives 16:33

49 weeks ago

Compiler collectives are a PT2 feature whereby compiler instances across multiple ranks use NCCL collectives to communicate information to other instances. This is used to ensure we consistently decide if inputs are static or dynamic across all ranks. See also PR at https://github.com/pytorch/pytorch/pull/130935…
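
As a rough illustration of why ranks need to exchange information, the following pure-Python simulation has each rank make a local static-versus-dynamic guess and then agree on one answer. Everything here is an assumption made for the sketch: `all_gather` is a toy stand-in for a real collective, `local_decision` is a hypothetical rule, and combining the votes with `any` is just one plausible policy, not necessarily what PT2 does.

```python
def all_gather(values_by_rank):
    """Toy stand-in for a collective: every rank receives every rank's value."""
    return [list(values_by_rank) for _ in values_by_rank]


def local_decision(sizes_seen):
    # Hypothetical rule: a rank wants a dimension to be dynamic
    # if it has observed more than one size for it.
    return len(set(sizes_seen)) > 1


# Sizes each of three ranks has observed for the same input dimension.
sizes_per_rank = [[8, 8], [8, 12], [8, 8]]

local = [local_decision(sizes) for sizes in sizes_per_rank]

# After the exchange, every rank applies the same rule to the same data,
# so all ranks reach the same conclusion.
gathered = all_gather(local)
final = [any(view) for view in gathered]
```

Without the exchange, rank 1 would compile a dynamic-shape program while ranks 0 and 2 compiled static ones; with it, all three agree.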

TORCH_TRACE and tlparse 15:28

1 year ago

TORCH_TRACE and tlparse are a structured log and log parser for PyTorch 2. They give useful information about what code was compiled and what the intermediate build products look like.

Higher order operators 17:10

1 year ago

Higher order operators are a special form of operators in torch.ops which have relaxed input argument requirements: in particular, they can accept any form of argument, including Python callables. Their name is based off of their most common use case, which is to represent higher order functions like control flow operators. However, they are also used to implement other variants of basic operators and can also be used to smuggle in Python data that is quite unusual. They are implemented using a Python dispatcher.…
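
A minimal pure-Python sketch of the idea, assuming nothing about the real `torch.ops` machinery: `cond` below is a hypothetical toy operator whose arguments include Python callables, and the optional `trace` list shows how a tracer could record the call as a single node that refers to both branches.

```python
def cond(pred, true_fn, false_fn, operands, trace=None):
    """Toy higher order operator: true_fn and false_fn are Python callables."""
    if trace is not None:
        # A tracer can record the whole call as one node naming both branches.
        trace.append(("cond", true_fn.__name__, false_fn.__name__))
    branch = true_fn if pred else false_fn
    return branch(*operands)


def double(x):
    return x * 2


def negate(x):
    return -x


def f(x, trace=None):
    return cond(x > 0, double, negate, (x,), trace=trace)


recorded = []
pos = f(3, trace=recorded)
neg = f(-4)
```

An ordinary operator only takes tensors and simple values; what makes this one "higher order" is that `double` and `negate` are passed in as arguments rather than being called before the operator runs.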

Inductor - Post-grad FX passes 24:07

1 year ago

The post-grad FX passes in Inductor run after AOTAutograd has functionalized and normalized the input program into separate forward/backward graphs. As such, they generally can assume that the graph in question is functionalized, except for some mutations to inputs at the end of the graph. At the end of post-grad passes, there are special passes that reintroduce mutation into the graph before going into the rest of Inductor lowering, which is generally aware of mutation. The post-grad FX passes are varied but are typically domain-specific passes making local changes to specific parts of the graph.…

CUDA graph trees 20:50

1 year ago

CUDA graph trees are the internal implementation of CUDA graphs used in PT2 when you say mode="reduce-overhead". Their primary innovation is that they allow the reuse of memory across multiple CUDA graphs, as long as they form a tree structure of potential paths you can go down with the CUDA graph. This greatly reduced the memory usage of CUDA graphs in PT2. There are some operational implications to using CUDA graphs which are described in the podcast.…

Min-cut partitioner 15:56

1 year ago

The min-cut partitioner makes decisions about what to save for backwards when splitting the forward and backwards graph from the joint graph traced by AOTAutograd. Crucially, it doesn't actually do a "split"; instead, it is deciding how much of the joint graph should be used for backwards. I also talk about the backward retracing problem.…

AOTInductor 17:30

1 year ago

AOTInductor is a feature in PyTorch that lets you export an inference model into a self-contained dynamic library, which can subsequently be loaded and used to run optimized inference. It is aimed primarily at CUDA and CPU inference applications, for situations when your model needs to be exported once while your runtime may still get continuous updates. One of the big underlying organizing principles is a limited ABI which does not include libtorch, which allows these libraries to stay stable over updates to the runtime. There are many export-like use cases you might be interested in using AOTInductor for, and some of the pieces should be useful, but AOTInductor does not necessarily solve them.…

Tensor subclasses and PT2 13:25

1 year ago

Tensor subclasses allow you to extend PyTorch with new types of tensors without having to write any C++. They have been used to implement DTensor, FP8, Nested Jagged Tensor and Complex Tensor. Recent work by Brian Hirsh means that we can compile tensor subclasses in PT2, eliminating their overhead. The basic mechanism by which this compilation works is a desugaring process in AOTAutograd. There are some complications involving views, dynamic shapes and tangent metadata mismatch.…

Compiled autograd 18:07

1 year ago

Compiled autograd is an extension to PT2 that permits compiling the entirety of a backward() call in PyTorch. This allows us to fuse accumulate grad nodes as well as trace through arbitrarily complicated Python backward hooks. Compiled autograd is an important part of our plans for compiled DDP/FSDP as well as for whole-graph compilation.…

PT2 extension points 15:54

1 year ago

We discuss some extension points for customizing PT2 behavior across Dynamo, AOTAutograd and Inductor.

Inductor - Define-by-run IR 12:06

1 year ago

Define-by-run IR is how Inductor defines the internal compute of a pointwise/reduction operation. It is characterized by a function that calls a number of functions in the 'ops' namespace, where these ops can be overridden by different handlers depending on what kind of semantic analysis you need to do. The ops Inductor supports include regular arithmetic operators, but also memory load/store, indirect indexing, masking and collective operations like reductions.…
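
The handler-override pattern described here can be sketched in plain Python. This is an illustration only, not Inductor's actual classes: `inner_fn`, `EvalHandler`, `CodegenHandler`, and `ReadsHandler` are hypothetical names, and the op set is cut down to `load`, `constant`, `add`, and `mul`.

```python
def inner_fn(ops, index):
    """Body of a hypothetical pointwise kernel: out[i] = a[i] * b[i] + 1."""
    a = ops.load("a", index)
    b = ops.load("b", index)
    return ops.add(ops.mul(a, b), ops.constant(1))


class EvalHandler:
    """Interprets the ops on real Python numbers."""

    def __init__(self, buffers):
        self.buffers = buffers

    def load(self, name, index):
        return self.buffers[name][index]

    def constant(self, value):
        return value

    def add(self, x, y):
        return x + y

    def mul(self, x, y):
        return x * y


class CodegenHandler:
    """Interprets the same ops as string building, yielding source text."""

    def load(self, name, index):
        return f"{name}[{index}]"

    def constant(self, value):
        return repr(value)

    def add(self, x, y):
        return f"({x} + {y})"

    def mul(self, x, y):
        return f"({x} * {y})"


class ReadsHandler(CodegenHandler):
    """An analysis: records which buffers the body reads."""

    def __init__(self):
        self.reads = []

    def load(self, name, index):
        self.reads.append(name)
        return super().load(name, index)


value = inner_fn(EvalHandler({"a": [2, 3], "b": [5, 7]}), 1)
source = inner_fn(CodegenHandler(), "i")
analysis = ReadsHandler()
inner_fn(analysis, "i")
```

The kernel body is written once as ordinary Python; running it under a different handler is what turns it into an evaluation, a code generator, or a dependency analysis.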

Unsigned integers 13:07

1 year ago

Traditionally, unsigned integer support in PyTorch was not great; we only supported uint8. Recently, we added support for uint16, uint32 and uint64. Bare bones functionality works, but I'm entreating the community to help us build out the rest. In particular, for most operations, we plan to use PT2 to build anything else. But if you have an eager kernel you really need, send us a PR and we'll put it in. While most of the implementation was straightforward, there are some weirdnesses related to type promotion inconsistencies with numpy and dealing with the upper range of uint64. There is also upcoming support for sub-byte dtypes uint1-7, and these will exclusively be implemented via PT2.…
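
One plausible reason the upper range of uint64 is awkward (an assumption here, since the description does not spell it out) is that values above 2^63 - 1 have no int64 representation. The standard-library sketch below reinterprets the same 8 bytes as signed to show what happens at that boundary; `reinterpret_u64_as_i64` is a hypothetical helper name.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def reinterpret_u64_as_i64(value):
    """Reinterpret the 8 bytes of a uint64 as a signed int64."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# Values in the shared range survive the round trip unchanged.
small = reinterpret_u64_as_i64(42)

# Values above INT64_MAX come back negative.
boundary = reinterpret_u64_as_i64(INT64_MAX + 1)
wrapped = reinterpret_u64_as_i64(UINT64_MAX)
```

Any code path that funnels scalars through a signed 64-bit integer has to treat the top half of the uint64 range specially, or it will silently see negative numbers.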

Inductor - IR 18:00

1 year ago

Inductor IR is an intermediate representation that lives between ATen FX graphs and the final Triton code generated by Inductor. It was designed to faithfully represent PyTorch semantics and accordingly models views, mutation and striding. When you write a lowering from ATen operators to Inductor IR, you get a TensorBox for each Tensor argument which contains a reference to the underlying IR (via StorageBox, and then a Buffer/ComputedBuffer) that says how the Tensor was computed. The inner computation is represented via define-by-run, which allows for compact definition of IR representation, while still allowing you to extract an FX graph out if you desire. Scheduling then takes buffers of Inductor IR and decides what can be fused. Inductor IR may have too many nodes; this would be a good thing to refactor in the future.…

Dynamo - VariableTracker 15:55

1 year ago

I talk about VariableTracker in Dynamo. VariableTracker is Dynamo's representation of Python values. I talk about some recent changes, namely eager guards and mutable VT. I also tell you how to find the functionality you care about in VariableTracker ( https://docs.google.com/document/d/1XDPNK3iNNShg07jRXDOrMk2V_i66u1hEbPltcsxE-3E/edit#heading=h.i6v7gqw5byv6 ).…

Unbacked SymInts 21:31

2 years ago

This podcast goes over the basics of unbacked SymInts. You might want to listen to this one before listening to https://pytorch-dev-podcast.simplecast.com/episodes/zero-one-specialization

Some questions we answer (h/t from Gregory Chanan):
- Are unbacked symints only for export? Because otherwise I could just break / wait for the actual size. But maybe I can save some retracing / graph breaks perf if I have them too? So the correct statement is "primarily" for export?
- Why am I looking into the broadcasting code at all? Naively, I would expect the export graph to be just a list of ATen ops strung together. Why do I recurse that far down? Why can't I annotate DONT_TRACE_ME_BRO?
- How does 0/1 specialization fit into this? I understand we may want to 0/1 specialize in a dynamic shape regime in "eager" mode (is there a better term?), but that doesn't seem to matter for export?
- So far we've mainly been talking about how to handle our own library code. There is a worry about pushing complicated constraints downstream, similar to torchscript. What constraints does this actually push?…
