SFU MOCAD Seminar: Ricardo Baptista
Topic
Processing Language, Images and Other Data Modalities
Speakers
Ricardo Baptista
Details
A fundamental problem in artificial intelligence is how to jointly process data from different sources, such as audio, images, text, and video, collectively known as multimodal data. In this talk, I will present a mathematical framework for studying this question, focusing primarily on text and images. I will begin by describing how large language models (LLMs) operate, addressing the challenging issue of using real-number algorithms to process language. In particular, I will explain next-token prediction, the core of current LLM methodology. I will then focus on the canonical problem of measuring alignment between image and text data (contrastive learning). Finally, I will describe how images can be generated from text prompts (conditional generative modeling). From a mathematical perspective, a unifying theme underlying this work is the minimization of divergences defined on spaces of probability measures. A second key mathematical idea is the attention mechanism, a form of nonlinear correlation between vector-valued sequences. I aim to explain these concepts and their relevance to modern machine learning algorithms in an accessible fashion for a broad audience from the mathematical and computational sciences.
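To make the next-token prediction idea concrete, here is a minimal sketch (not taken from the talk): a language model maps a token sequence to a vector of real-valued scores (logits) over a vocabulary, and a softmax turns those scores into a probability measure from which the next token is predicted. The vocabulary and logit values below are hypothetical stand-ins for a trained model's output.

```python
import numpy as np

def softmax(z):
    """Map real-valued scores to a probability vector (sums to 1)."""
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical tiny vocabulary and model scores for the next token
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([1.2, 0.3, 2.5, -0.7])  # stand-in for a model's output

probs = softmax(logits)                    # probability measure over vocab
next_token = vocab[int(np.argmax(probs))]  # greedy next-token prediction
print(next_token)                          # → "sat"
```

Sampling from `probs` instead of taking the argmax gives the stochastic generation used in practice; training adjusts the logits by minimizing a divergence (cross-entropy) between this predicted measure and the observed next token.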
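The attention mechanism mentioned above can likewise be sketched in a few lines. This is an illustrative implementation of standard scaled dot-product self-attention, assuming the common formulation where each output vector is a softmax-weighted average of the input sequence, with weights given by a nonlinear function of pairwise inner products; the dimensions and random input are arbitrary.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax: each row becomes a probability vector."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: a nonlinear correlation between
    vector-valued sequences. Each output row is a convex combination
    of the rows of V, weighted by query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # pairwise similarities
    weights = softmax_rows(scores)      # each row is a probability measure
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # a sequence of 5 vectors in R^8
out = attention(X, X, X)      # self-attention: sequence attends to itself
print(out.shape)              # → (5, 8)
```

Note the connection to the talk's unifying theme: the attention weights form probability measures (each row sums to one), so attention can be read as an expectation of the value vectors under a data-dependent measure.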