Posts
Categories: All (4), Attention (1), GPUs (1), Machine Learning (1), Mathematics (2), Python (1), Transformers (1)
Drilling Down into Multimodal Attention
Transformers, Attention
This post explains how to inspect the attention patterns of vision-language models (VLMs) using a new module I created on a fork of the circuitsviz library. To interact…
Feb 1, 2025
Tomas Ruiz
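As a rough illustration of the kind of data the post above visualizes (not the post's own code): the sketch below pulls per-layer attention tensors out of a Hugging Face VLM by passing `output_attentions=True`. The checkpoint name, prompt template, and blank placeholder image are assumptions made for the sketch; the circuitsviz fork is what would actually render these tensors.

```python
# Hedged sketch: extract attention patterns from a VLM with transformers.
# The checkpoint and prompt format are assumptions, not the post's setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed LLaVA-style checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.new("RGB", (336, 336), color="white")  # placeholder image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per
# layer; image patches and text tokens share the same sequence axis.
attn = torch.stack(outputs.attentions)  # (layers, batch, heads, seq, seq)
print(attn.shape)
```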
How Does Tiling Speed Up Matrix Multiplications on GPUs?
Mathematics, GPUs
TL;DR: Tiling is a technique used to reduce the number of memory accesses performed during matrix multiplication. We see how it improves compute intensity and how it speeds…
Dec 23, 2024
Tomas Ruiz
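To make the idea in the post above concrete, here is a minimal NumPy sketch of tiled matrix multiplication. It mirrors on the CPU what shared-memory tiling does on a GPU: each tile of A and B is read once and reused across a whole block of the output, which is what raises compute intensity (FLOPs per byte moved). The tile size and matrix shapes are arbitrary choices for the sketch.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute A @ B by accumulating tile-sized blocks.

    Each (tile x tile) block of A and B is loaded once and reused for a
    whole tile of C; on a GPU these blocks would live in shared memory.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                a_blk = A[i0:i0 + tile, k0:k0 + tile]
                b_blk = B[k0:k0 + tile, j0:j0 + tile]
                C[i0:i0 + tile, j0:j0 + tile] += a_blk @ b_blk
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```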
Grokking an Inner Product Inequality With Python on WebAssembly
Mathematics, Python
The purpose of this post is two-fold:
Sep 12, 2024
Tomas Ruiz
A Closed-Form Solution to Linearly Fine-Tune LLMs for Binary Classification
Machine Learning
In this post, I show how to linearly fine-tune a large language model (LLM) using a closed-form solution based on the Moore-Penrose inverse. I will focus on the special case…
Aug 2, 2024
Tomas Ruiz
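As a hedged sketch of the general idea in the post above (not its exact derivation): with features from a frozen LLM, a linear head for binary classification can be fit in closed form as the least-squares solution w = X⁺y, where X⁺ is the Moore-Penrose pseudoinverse. The random feature matrix below is only a stand-in for precomputed LLM hidden states.

```python
import numpy as np

# Stand-in data: X plays the role of frozen LLM features (one row per
# example), y holds binary labels encoded as +/-1.
rng = np.random.default_rng(0)
n, d = 200, 64
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

# Closed-form weights via the Moore-Penrose pseudoinverse: w = X^+ y
w = np.linalg.pinv(X) @ y

pred = np.sign(X @ w)
print("train accuracy:", (pred == y).mean())
```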