Just enough CUDA to be dangerous

PyTorch Developer Podcast

Added four years ago

Content provided by PyTorch, Edward Yang, and Team PyTorch. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by PyTorch, Edward Yang, and Team PyTorch or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here https://he.player.fm/legal.

4 years ago 16:32

Ever wanted to learn about CUDA but not sure where to start? In this sixteen-minute episode I try to jam in as much CUDA knowledge as could reasonably be expected in a podcast. You won't know how to write a kernel after this episode, but you'll know what a GPU is, what the general CUDA programming model is, why asynchronous execution makes everything complicated, and some general principles PyTorch abides by when designing CUDA kernels.

Further reading:

PyTorch docs on CUDA semantics https://pytorch.org/docs/stable/notes/cuda.html
The book I was recommended for learning CUDA when I first showed up at PyTorch: Programming Massively Parallel Processors https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0128119861
The environment variable that makes CUDA synchronous is CUDA_LAUNCH_BLOCKING=1. cuda-memcheck is also useful for debugging CUDA problems https://docs.nvidia.com/cuda/cuda-memcheck/index.html
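
As a minimal sketch of how the variable might be applied, the helper below builds an environment with CUDA_LAUNCH_BLOCKING=1 for a child process. Only the variable name comes from the notes above; the helper name `blocking_env` and the script name `train.py` in the final comment are hypothetical.

```python
import os


def blocking_env(base=None):
    """Return a copy of an environment with synchronous CUDA launches requested."""
    env = dict(os.environ if base is None else base)
    # Set the variable before the CUDA-using process starts, so that it is
    # already in place when that process initializes CUDA.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


env = blocking_env({"PATH": "/usr/bin"})
print(env["CUDA_LAUNCH_BLOCKING"])

# Hypothetical usage: subprocess.run(["python", "train.py"], env=blocking_env())
```

Passing a copied environment to a child process keeps the setting out of the parent shell, which is convenient when the slowdown from synchronous launches is only wanted for one debugging run.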

83 episodes

#Tech #PyTorch #Edward Yang #Team PyTorch #Deep Learning #Machine Learning

All episodes

Compiler collectives 16:33

43 weeks ago

Compiler collectives are a PT2 feature whereby compiler instances across multiple ranks use NCCL collectives to communicate information to other instances. This is used to ensure we consistently decide whether inputs are static or dynamic across all ranks. See also PR at https://github.com/pytorch/pytorch/pull/130935…

TORCH_TRACE and tlparse 15:28

1 year ago

TORCH_TRACE and tlparse are a structured log and log parser for PyTorch 2. Together they give useful information about what code was compiled and what the intermediate build products look like.

Higher order operators 17:10

1 year ago

Higher order operators are a special form of operators in torch.ops which have relaxed input argument requirements: in particular, they can accept any form of argument, including Python callables. Their name is based off of their most common use case, which is to represent higher order functions like control flow operators. However, they are also used to implement other variants of basic operators and can also be used to smuggle in Python data that is quite unusual. They are implemented using a Python dispatcher.…

Inductor - Post-grad FX passes 24:07

1 year ago

The post-grad FX passes in Inductor run after AOTAutograd has functionalized and normalized the input program into separate forward/backward graphs. As such, they generally can assume that the graph in question is functionalized, except for some mutations to inputs at the end of the graph. At the end of post-grad passes, there are special passes that reintroduce mutation into the graph before going into the rest of Inductor lowering which is generally aware of passes. The post-grad FX passes are varied but are typically domain specific passes making local changes to specific parts of the graph.…

CUDA graph trees 20:50

1 year ago

CUDA graph trees are the internal implementation of CUDA graphs used in PT2 when you say mode="reduce-overhead". Their primary innovation is that they allow the reuse of memory across multiple CUDA graphs, as long as they form a tree structure of potential paths you can go down with the CUDA graph. This greatly reduced the memory usage of CUDA graphs in PT2. There are some operational implications to using CUDA graphs which are described in the podcast.…

Min-cut partitioner 15:56

1 year ago

The min-cut partitioner makes decisions about what to save for backwards when splitting the forward and backwards graph from the joint graph traced by AOTAutograd. Crucially, it doesn't actually do a "split"; instead, it is deciding how much of the joint graph should be used for backwards. I also talk about the backward retracing problem.…

AOTInductor 17:30

1 year ago

AOTInductor is a feature in PyTorch that lets you export an inference model into a self-contained dynamic library, which can subsequently be loaded and used to run optimized inference. It is aimed primarily at CUDA and CPU inference applications, for situations when your model needs to be exported once while your runtime may still get continuous updates. One of the big underlying organizing principles is a limited ABI which does not include libtorch, which allows these libraries to stay stable over updates to the runtime. There are many export-like use cases you might be interested in using AOTInductor for, and some of the pieces should be useful, but AOTInductor does not necessarily solve them.…

Tensor subclasses and PT2 13:25

1 year ago

Tensor subclasses allow you to extend PyTorch with new types of tensors without having to write any C++. They have been used to implement DTensor, FP8, Nested Jagged Tensor and Complex Tensor. Recent work by Brian Hirsh means that we can compile tensor subclasses in PT2, eliminating their overhead. The basic mechanism by which this compilation works is a desugaring process in AOTAutograd. There are some complications involving views, dynamic shapes and tangent metadata mismatch.…

Compiled autograd 18:07

1 year ago

Compiled autograd is an extension to PT2 that permits compiling the entirety of a backward() call in PyTorch. This allows us to fuse accumulate grad nodes as well as trace through arbitrarily complicated Python backward hooks. Compiled autograd is an important part of our plans for compiled DDP/FSDP as well as for whole-graph compilation.…

PT2 extension points 15:54

1 year ago

We discuss some extension points for customizing PT2 behavior across Dynamo, AOTAutograd and Inductor.

Inductor - Define-by-run IR 12:06

1 year ago

Define-by-run IR is how Inductor defines the internal compute of a pointwise/reduction operation. It is characterized by a function that calls a number of functions in the 'ops' namespace, where these ops can be overridden by different handlers depending on what kind of semantic analysis you need to do. The ops Inductor supports include regular arithmetic operators, but also memory load/store, indirect indexing, masking and collective operations like reductions.…
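
The idea of one function body interpreted by interchangeable handlers can be sketched in plain Python. This is only an illustration of the pattern described above: the class names `EvalHandler` and `PrintHandler`, the function `inner_fn`, and the method set are all hypothetical and are not Inductor's actual API.

```python
class EvalHandler:
    """Handler that actually computes values from in-memory buffers."""

    def __init__(self, buffers):
        self.buffers = buffers

    def load(self, name, index):
        return self.buffers[name][index]

    def constant(self, value):
        return value

    def add(self, a, b):
        return a + b

    def mul(self, a, b):
        return a * b


class PrintHandler:
    """Handler that builds a source string instead of computing anything."""

    def load(self, name, index):
        return f"{name}[{index}]"

    def constant(self, value):
        return repr(value)

    def add(self, a, b):
        return f"({a} + {b})"

    def mul(self, a, b):
        return f"({a} * {b})"


def inner_fn(ops, index):
    # The "IR" is just this function body; running it against a handler
    # decides whether we evaluate, print, or analyze the computation.
    x = ops.load("x", index)
    y = ops.load("y", index)
    return ops.add(ops.mul(x, ops.constant(2)), y)


value = inner_fn(EvalHandler({"x": [1, 2, 3], "y": [10, 20, 30]}), 1)
source = inner_fn(PrintHandler(), "i")
print(value)
print(source)
```

The same `inner_fn` is run twice with different handlers, once to get a number and once to get a string, which is the sense in which the representation is "defined by running" it.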

Unsigned integers 13:07

1 year ago

Traditionally, unsigned integer support in PyTorch was not great; we only supported uint8. Recently, we added support for uint16, uint32 and uint64. Bare-bones functionality works, but I'm entreating the community to help us build out the rest. In particular, for most operations, we plan to use PT2 to build anything else. But if you have an eager kernel you really need, send us a PR and we'll put it in. While most of the implementation was straightforward, there are some weirdnesses related to type promotion inconsistencies with numpy and dealing with the upper range of uint64. There is also upcoming support for sub-byte dtypes uint1-7, and these will exclusively be implemented via PT2.…

Inductor - IR 18:00

1 year ago

Inductor IR is an intermediate representation that lives between ATen FX graphs and the final Triton code generated by Inductor. It was designed to faithfully represent PyTorch semantics and accordingly models views, mutation and striding. When you write a lowering from ATen operators to Inductor IR, you get a TensorBox for each Tensor argument which contains a reference to the underlying IR (via StorageBox, and then a Buffer/ComputedBuffer) that says how the Tensor was computed. The inner computation is represented via define-by-run, which allows for compact definition of IR representation, while still allowing you to extract an FX graph out if you desire. Scheduling then takes buffers of Inductor IR and decides what can be fused. Inductor IR may have too many nodes; this would be a good thing to refactor in the future.…

Dynamo - VariableTracker 15:55

1 year ago

I talk about VariableTracker in Dynamo. VariableTracker is Dynamo's representation of Python values. I talk about some recent changes, namely eager guards and mutable VT. I also tell you how to find the functionality you care about in VariableTracker ( https://docs.google.com/document/d/1XDPNK3iNNShg07jRXDOrMk2V_i66u1hEbPltcsxE-3E/edit#heading=h.i6v7gqw5byv6 ).…

Unbacked SymInts 21:31

2 years ago

This podcast goes over the basics of unbacked SymInts. You might want to listen to this one before listening to https://pytorch-dev-podcast.simplecast.com/episodes/zero-one-specialization

Some questions we answer (h/t from Gregory Chanan):

- Are unbacked symints only for export? Because otherwise I could just break / wait for the actual size. But maybe I can save some retracing / graph breaks perf if I have them too? So the correct statement is "primarily" for export?
- Why am I looking into the broadcasting code at all? Naively, I would expect the export graph to be just a list of ATen ops strung together. Why do I recurse that far down? Why can't I annotate DONT_TRACE_ME_BRO?
- How does 0/1 specialization fit into this? I understand we may want to 0/1 specialize in a dynamic shape regime in "eager" mode (is there a better term?), but that doesn't seem to matter for export?
- So far we've mainly been talking about how to handle our own library code. There is a worry about pushing complicated constraints downstream, similar to torchscript. What constraints does this actually push?…

DataLoader with multiple workers leaks memory 16:38

4 years ago

Today I'm going to talk about a famous issue in PyTorch, DataLoader with num_workers > 0 causes memory leak ( https://github.com/pytorch/pytorch/issues/13246 ). This bug is a good opportunity to talk about Dataset/DataLoader design in PyTorch, fork and copy-on-write memory in Linux, and Python reference counting; you have to know about all of these things to understand why this bug occurs, but once you do, it also explains why the workarounds help.

Further reading:

A nice summary of the full issue https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662
DataLoader architecture RFC https://github.com/pytorch/pytorch/issues/49440
Cinder Python https://github.com/facebookincubator/cinder…
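
The reference-counting half of the story can be observed directly in CPython. The sketch below only shows that taking a new reference to an object changes its reference count, which is stored with the object itself; the connection to copy-on-write pages in forked workers is the reasoning the episode describes, not something this snippet measures. The variable names are hypothetical.

```python
import sys

# A list of ordinary Python objects, standing in for per-sample metadata.
records = [object() for _ in range(3)]

before = sys.getrefcount(records[0])
alias = records[0]  # merely holding the object bumps its reference count
after = sys.getrefcount(records[0])

print(after - before)
```

Because the count lives alongside the object, a worker that only reads such objects still writes to the memory that holds them, which is why memory shared with the parent after fork can gradually be copied.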

Batching 13:37

4 years ago

PyTorch operates on its input data in a batched manner, typically processing multiple batches of an input at once (rather than one at a time, as would be the case in typical programming). In this podcast, we talk a little about the implications of batching operations in this way, and then also about how PyTorch's API is structured for batching (hint: poorly) and how NumPy introduced a concept of ufunc/gufuncs to standardize over broadcasting and batching behavior. There is some overlap between this podcast and previous podcasts about TensorIterator and vmap; you may also be interested in those episodes.

Further reading:

ufuncs and gufuncs https://numpy.org/doc/stable/reference/ufuncs.html and https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html
A brief taxonomy of PyTorch operators by shape behavior http://blog.ezyang.com/2020/05/a-brief-taxonomy-of-pytorch-operators-by-shape-behavior/
Related episodes on TensorIterator and vmap https://pytorch-dev-podcast.simplecast.com/episodes/tensoriterator and https://pytorch-dev-podcast.simplecast.com/episodes/vmap…
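
For readers who have not met broadcasting before, the NumPy-style shape rule that ufuncs standardize can be sketched in a few lines of plain Python. The function name `broadcast_shape` is hypothetical; the rule itself (align shapes from the right, and a dimension of size 1 stretches to match) is the standard one.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes using NumPy-style rules (align from the right)."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        # Missing leading dimensions behave like size 1.
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # A size-1 dimension stretches to the other operand's size.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


print(broadcast_shape((8, 1, 3), (4, 3)))
print(broadcast_shape((5,), ()))
```

Leading dimensions that only one operand has are what usually play the role of batch dimensions.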

Multiple dispatch in __torch_function__ 14:20

4 years ago

Python is a single dispatch OO language, but there are some operations such as binary magic methods which implement a simple form of multiple dispatch. __torch_function__ (through its NumPy predecessor __array_function__) generalizes this mechanism so that invocations of torch.add with different subclasses work properly. This podcast describes how this mechanism works and how it can be used (in an unconventional way) to build composable subclasses à la JAX in functorch.

Further reading:

This podcast in written form https://dev-discuss.pytorch.org/t/functorch-levels-as-dynamically-allocated-classes/294
Multiple dispatch resolution rules in the RFC https://github.com/pytorch/rfcs/blob/master/RFC-0001-torch-function-for-methods.md#process-followed-during-a-functionmethod-call…
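
The "simple form of multiple dispatch" in binary magic methods is ordinary Python behavior and can be shown without PyTorch. The sketch below is not the __torch_function__ protocol itself; the class names are hypothetical and only the built-in `__add__`/`__radd__` rules are exercised.

```python
class Plain:
    def __add__(self, other):
        if not isinstance(other, Plain):
            return NotImplemented  # decline, so Python asks the right operand
        return "Plain.__add__"


class Wrapper:
    def __radd__(self, other):
        return "Wrapper.__radd__"


class Base:
    def __add__(self, other):
        return "Base.__add__"

    def __radd__(self, other):
        return "Base.__radd__"


class Sub(Base):
    # A subclass that overrides the reflected method is tried first,
    # even when it is the right-hand operand.
    def __radd__(self, other):
        return "Sub.__radd__"


print(Plain() + Plain())
print(Plain() + Wrapper())
print(Base() + Sub())
```

The last case is the interesting one: the subclass on the right wins over the base class on the left, which is the kind of precedence rule a dispatch protocol has to generalize beyond two operands.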

Multithreading 18:34

4 years ago

Writing multithreaded code has always been a pain, and in PyTorch there are buckets and buckets of multithreading-related issues you have to be aware of and deal with when writing code that makes use of it. We'll cover how you interface with multithreading in PyTorch, what goes into implementing those interfaces (thread pools!) and also some miscellaneous stuff like TLS, forks and data structure thread safety that is also relevant.

Further reading:

TorchScript CPU inference threading documentation https://github.com/pytorch/pytorch/blob/master/docs/source/notes/cpu_threading_torchscript_inference.rst
c10 thread pool https://github.com/pytorch/pytorch/blob/master/c10/core/thread_pool.h and autograd thread pool https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/engine.cpp
Tracking issue for TLS propagation across threads https://github.com/pytorch/pytorch/issues/28520…
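
Why thread-local state (TLS) needs explicit propagation can be seen with Python's own `threading.local`, used here as an analogy for C++ thread-local variables. The attribute name `grad_enabled` is a hypothetical stand-in for any per-thread flag; nothing here calls PyTorch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

state = threading.local()


def read_flag():
    # Each thread sees its own copy of `state`; a thread that never set
    # the attribute falls back to the default given here.
    return getattr(state, "grad_enabled", "unset")


state.grad_enabled = False  # set only in the main thread

with ThreadPoolExecutor(max_workers=1) as pool:
    in_worker = pool.submit(read_flag).result()

in_main = read_flag()
print(in_main, in_worker)
```

The worker does not inherit the value set by the thread that submitted the work, so a thread pool that should honor such flags has to copy them over when it dispatches a task.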

Asynchronous versus synchronous execution 15:03

4 years ago

CUDA is asynchronous, CPU is synchronous. Making them play well together can be one of the more thorny and easy-to-get-wrong aspects of the PyTorch API. I talk about why non_blocking is difficult to use correctly, a hypothetical "asynchronous CPU" device which would help smooth over some of the API problems, and also why it used to be difficult to implement async CPU (but it's not hard anymore!). At the end, I also briefly talk about how async/sync impedance can also show up in unusual places, namely the CUDA caching allocator.

Further reading:

CUDA semantics which discuss non_blocking somewhat https://pytorch.org/docs/stable/notes/cuda.html
Issue requesting async cpu https://github.com/pytorch/pytorch/issues/44343…
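
One consequence of asynchronous execution is that a failure shows up at the next synchronization point rather than at the call that queued the work. The toy model below uses a single worker thread as a stand-in for an in-order device queue. It is an analogy under that assumption, not CUDA or PyTorch code, and `ToyStream`, `launch`, `synchronize` and `bad_kernel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyStream:
    """A single worker thread standing in for an in-order device queue."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def launch(self, fn, *args):
        # Returns immediately; the work runs later on the worker thread.
        self._pending.append(self._pool.submit(fn, *args))

    def synchronize(self):
        # Wait for all queued work; the first failure is raised here,
        # far from the launch() call that caused it.
        pending, self._pending = self._pending, []
        for future in pending:
            future.result()


def bad_kernel():
    raise RuntimeError("illegal memory access")


stream = ToyStream()
stream.launch(bad_kernel)  # no exception at the launch site
try:
    stream.synchronize()
    outcome = "ok"
except RuntimeError as exc:
    outcome = f"error surfaced at synchronize: {exc}"
print(outcome)
```

In this model, forcing every `launch` to wait for its result would put the error back at the launch site, at the cost of losing the overlap between caller and worker.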

gradcheck 16:58

4 years ago

We talk about gradcheck, the property based testing mechanism that we use to verify the correctness of analytic gradient formulas in PyTorch. I'll talk a bit about testing in general, property based testing and why gradcheck is a particularly useful property based test. There will be some calculus, although I've tried to keep the math mostly to intuitions and pointers on what to read up on elsewhere.

Further reading:

Gradcheck mechanics, a detailed mathematical explanation of how it works https://pytorch.org/docs/stable/notes/gradcheck.html In particular, it also explains how gradcheck extends to complex numbers
JAX has a pretty good explanation about vjp and jvp at https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html
Fast gradcheck tracking issue https://github.com/pytorch/pytorch/issues/53876…
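
The property being tested can be sketched for a scalar function: at every sampled point, a finite-difference estimate of the derivative should agree with the analytic formula. This is a simplified illustration, not how `torch.autograd.gradcheck` is implemented; the function names, step size and tolerance are assumptions chosen for the example.

```python
def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def check_grad(f, analytic_grad, points, atol=1e-5):
    """Property: analytic and numerical gradients agree at every sampled point."""
    return all(
        abs(numerical_grad(f, x) - analytic_grad(x)) <= atol for x in points
    )


def cube(x):
    return x ** 3


points = [-2.0, 0.5, 3.0]
correct = check_grad(cube, lambda x: 3 * x ** 2, points)  # right formula
wrong = check_grad(cube, lambda x: 2 * x ** 2, points)    # buggy formula
print(correct, wrong)
```

The test needs no hand-written expected values, only the function and its claimed derivative, which is what makes it a property based test.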

torch.use_deterministic_algorithms 10:50

4 years ago

torch.use_deterministic_algorithms lets you force PyTorch to use deterministic algorithms. It's very useful for debugging! There are some errors in the recording: the feature is called torch.use_deterministic_algorithms, and there is not actually a capability to warn (this was in an old version of the PR but taken out), we just error if you hit nondeterministic code. Docs: https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms…

Reference counting 15:14

4 years ago

Reference counting is a common memory management technique in C++ but PyTorch does its reference counting in a slightly idiosyncratic way using intrusive_ptr. We'll talk about why intrusive_ptr exists, the reason why refcount bumps are slow in C++ (but not in Python), what's up with const Tensor& everywhere, why the const is a lie and how TensorRef lets you create a const Tensor& from a TensorImpl* without needing to bump your reference count.

Further reading:

Why you shouldn't feel bad about passing tensor by reference https://dev-discuss.pytorch.org/t/we-shouldnt-feel-bad-about-passing-tensor-by-reference/85
Const correctness in PyTorch https://github.com/zdevito/ATen/issues/27
TensorRef RFC https://github.com/pytorch/rfcs/pull/16…

Memory layout 16:26

4 years ago

Memory layout specifies how the logical multi-dimensional tensor maps its elements onto physical linear memory. Some layouts admit more efficient implementations, e.g., NCHW versus NHWC. Memory layout makes use of striding to allow users to conveniently represent their tensors with different physical layouts without having to explicitly tell every operator what to do.

Further reading:

Tutorial https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html
Memory format RFC https://github.com/pytorch/pytorch/issues/19092
Layout permutation proposal (not implemented) https://github.com/pytorch/pytorch/issues/32078…
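
How strides let one logical shape map onto different physical layouts can be worked through in plain Python. The sketch computes element strides for a logical (N, C, H, W) shape stored densely in NCHW order and in NHWC order. The helper names are hypothetical and no PyTorch calls are involved.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides for a logical (N, C, H, W) shape stored physically as NHWC."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Linear memory offset of a logical index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)
nchw = contiguous_strides(shape)
nhwc = channels_last_strides(shape)
print(nchw, nhwc)
# Moving one step along the channel dimension:
print(offset((0, 1, 0, 0), nchw), offset((0, 1, 0, 0), nhwc))
```

The logical index is the same in both cases; only the strides differ, so in NHWC the channels of one pixel sit next to each other in memory while in NCHW they are a full H*W plane apart.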

pytorch-probot 13:06

4 years ago

pytorch-probot is a GitHub application that we use to automate common tasks in GitHub. I talk about what it does and some design philosophy for it. Repo is at: https://github.com/pytorch/pytorch-probot

API design via lexical and dynamic scoping 21:44

4 years ago

Lexical and dynamic scoping are useful tools to reason about various API design choices in PyTorch, related to context managers, global flags, dynamic dispatch, and how to deal with BC-breaking changes. I'll walk through three case studies, one from Python itself (changing the meaning of division to true division), and two from PyTorch (device context managers, and torch function for factory functions).

Further reading:

Me unsuccessfully asking around if there was a way to simulate __future__ in libraries https://stackoverflow.com/questions/66927362/way-to-opt-into-bc-breaking-changes-on-methods-within-a-single-module
A very old issue asking for a way to change the default GPU device https://github.com/pytorch/pytorch/issues/260 and a global GPU flag https://github.com/pytorch/pytorch/issues/7535
A more modern issue based off the lexical module idea https://github.com/pytorch/pytorch/issues/27878
Array module NEP https://numpy.org/neps/nep-0037-array-module.html…
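
The difference between the two kinds of scoping can be sketched with a context manager. A setting installed by a `with` block is dynamically scoped: it affects everything called while the block is active, however far away that code was written. Passing the choice as an argument keeps it lexical. All names below (`default_device`, `library_helper`, `lexical_helper`) are hypothetical and do not correspond to PyTorch APIs.

```python
from contextlib import contextmanager

_default_device = "cpu"  # hypothetical global flag


@contextmanager
def default_device(name):
    """Dynamically scoped override: affects everything called inside the block."""
    global _default_device
    previous = _default_device
    _default_device = name
    try:
        yield
    finally:
        _default_device = previous  # restored even if the block raises


def library_helper():
    # Code written far from the `with` block still observes the override.
    return _default_device


def lexical_helper(device="cpu"):
    # Lexically scoped alternative: the choice is visible at the call site.
    return device


with default_device("cuda"):
    inside = library_helper()
outside = library_helper()
print(inside, outside, lexical_helper("cuda"))
```

The dynamic version is convenient because no signatures change, but a reader of `library_helper` cannot tell from its source which device it will see.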

Intro to distributed 15:41

4 years ago

Today, Shen Li (mrshenli) joins me to talk about distributed computation in PyTorch. What is distributed? What kinds of things go into making distributed work in PyTorch? What's up with all of the optimizations people want to do here?

Further reading:

PyTorch distributed overview https://pytorch.org/tutorials/beginner/dist_overview.html
Distributed data parallel https://pytorch.org/docs/stable/notes/ddp.html…

Double backwards 16:39

4 years ago

Double backwards is PyTorch's way of implementing higher order differentiation. Why might you want it? How does it work? What are some of the weird things that happen when you do this? Further reading. Epic PR that added double backwards support for convolution initially https://github.com/pytorch/pytorch/pull/1643…

Functional modules 14:34

4 years ago

Functional modules are a proposed mechanism to take PyTorch's existing NN module API and transform it into a functional form, where all the parameters are explicit arguments. Why would you want to do this? What does functorch have to do with it? How come PyTorch's existing APIs don't seem to need this? What are the design problems?

Further reading:

Proposal in GitHub issues https://github.com/pytorch/pytorch/issues/49171
Linen design in flax https://flax.readthedocs.io/en/latest/design_notes/linen_design_principles.html…
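
What "all the parameters are explicit" means can be shown with a toy scalar layer in plain Python. This is a sketch of the general idea only, not the API from the proposal; `Linear`, `functional_linear` and `to_functional` are hypothetical names.

```python
class Linear:
    """Stateful style: parameters live on the object."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, x):
        return self.weight * x + self.bias


def functional_linear(params, x):
    """Functional style: the same computation with parameters passed in."""
    return params["weight"] * x + params["bias"]


def to_functional(module):
    """Split a module into (function, params). Hypothetical helper."""
    params = {"weight": module.weight, "bias": module.bias}
    return functional_linear, params


layer = Linear(2.0, 1.0)
fn, params = to_functional(layer)
print(layer(3.0), fn(params, 3.0))

# With explicit parameters, evaluating at different values needs no mutation.
perturbed = fn({**params, "weight": 3.0}, 3.0)
print(perturbed, layer.weight)
```

Because the function depends only on its arguments, transformations that treat parameters as inputs have something well-defined to act on.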

CUDA graphs 13:55

4 years ago

What are CUDA graphs? How are they implemented? What does it take to actually use them in PyTorch?

Further reading:

NVIDIA has docs on CUDA graphs https://developer.nvidia.com/blog/cuda-graphs/
Nuts and bolts implementation PRs from mcarilli: https://github.com/pytorch/pytorch/pull/51436 https://github.com/pytorch/pytorch/pull/46148…
