Posts – All Posts

Bond Pricing and Interest Rate Sensitivity

Finance

Python

While working at Allianz 2-3 years ago, I made this plot to understand why short-term bonds offer better protection against hikes in interest rates. I decided to release it…

FlashSampling / FMMS: The Fused Matmul-Sample Kernel

An Efficient and Exact LLM Sampling Algorithm

GPUs

Triton

vLLM

Transformers

Update: We renamed FMMS to FlashSampling and released a paper (arXiv) about it. GitHub link.

Forecasting the Performance of Full CUDA Graphs for Speculative Decoding

Estimating Speedups from Piecewise to Full CUDA Graphs

vLLM

CUDA

Performance

In a previous post, I benchmarked speculative decoding with draft models in vLLM V1. This follow-up analyzes the potential performance gains from supporting full CUDA graphs…

Up to 3.55x Faster: Contributing Speculative Decoding with Draft Models to vLLM V1

Benchmarks and Key Learnings

vLLM

PyTorch

Triton

Transformers

I recently contributed speculative decoding with draft models to vLLM V1 (PR #24322). In this post, I benchmark the performance of my implementation (draft_model) and…

PyTorch and CPU-GPU Synchronizations

Writing Fast PyTorch Code

GPUs

PyTorch

Triton

TL;DR: This post is a guide to understanding and preventing CPU-GPU synchronizations, which will help you write fast and efficient PyTorch programs 🚀. I explain the concept with…

A Guide to Classify 35,000 Videos with a 72B Multimodal LLM on the vLLM Engine

GPUs

vLLM

Transformers

In this post, I describe how we used the vLLM inference engine to classify 35k videos collected from TikTok for a research project. I share lessons learned about computing…

Drilling Down into Multimodal Attention

Transformers

Attention

This post explains how to inspect the attention patterns of vision-language models (VLMs) using a new module I created on a fork of the circuitsviz library. To interact…

How Does Tiling Speed Up Matrix Multiplications on GPUs?

Mathematics

GPUs

TL;DR: Tiling is a technique used to reduce the number of memory accesses performed during matrix multiplication. We see how it improves compute intensity and how it speeds…

Grokking an Inner Product Inequality With Python on WebAssembly

Mathematics

Python

The purpose of this post is two-fold:

A Closed-Form Solution to Linearly Fine-Tune LLMs for Binary Classification

Machine Learning

In this post, I show how to linearly fine-tune a large language model (LLM) using a closed-form solution based on the Moore-Penrose inverse. I will focus on the special case…