Modern PyTorch Guide
Foundations
Building Models
Performance
Domains
Production
Advanced
API Reference
Community
Forums
torch.compile
Introduction to torch.compile
Compiler Backends
Graph Breaks
Dynamic Shapes
Inductor
AOTAutograd
Troubleshooting
Compiled Autograd
Regional Compilation
Cache Control
torch.export
Overview
IR Spec
Control Flow
Dynamic Shapes
AOTInductor
Quantization
Quantization Overview
Post-Training Quantization
Quantization-Aware Training
Dynamic Quantization
FX Graph Mode
Backends
Distributed Training
Distributed Training Overview
DistributedDataParallel (DDP)
Fully Sharded Data Parallel (FSDP)
FSDP2
DeepSpeed Integration
Tensor Parallelism
Pipeline Parallelism
Context Parallelism
RPC Framework
Zero Redundancy Optimizer
Communication Hooks
Join Context Manager
Monarch
Hardware Utilization
CUDA Semantics
GPU Memory Management
Multi-GPU Setup
NVLink & InfiniBand
Automatic Mixed Precision (AMP)
Channels Last
MPS Backend
HIP (ROCm)
Intel GPU
Memory Optimization
Gradient Checkpointing
Mixed Precision Deep Dive
CPU/Disk Offloading
Activation Checkpointing
Profiling & Benchmarking
PyTorch Profiler
TensorBoard Profiling
Inductor Profiling
Benchmark Utils
Bottleneck
Flight Recorder
Code Transforms
Overview
Graph Transformations
Fusion Patterns
Experimental