Jonathan Chang’s Blog
Maximizing PyTorch Throughput with FastAPI
Minimizing GPU Idle Time with Async Operations
This post demonstrates how to maximize throughput by overlapping CPU work with GPU computation. Using asyncio and CUDA’s asynchronous execution APIs, we can minimize GPU idle time and handle multiple inference requests efficiently. The implementation code is available on GitHub: vFLUX
Oct 28, 2024
Jonathan Chang
Additive Rotary Embedding - A Competitive Alternative to RoPE
Rotary Embedding (RoPE) is a relative positional embedding used by most language models. [1]
Jul 31, 2024
Jonathan Chang
Exploring the Effective Rank of Projection Weights in Attention
Disclaimer: Most of the code was written by GPT/Copilot and is not optimized for presentation.
May 13, 2024
Jonathan Chang