Research

Upper and lower memory capacity bounds of transformers for next-token prediction

Published in arXiv, 2024

Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences for which a decoder-only transformer can interpolate the next-token distributions has not been established. To fill this gap, we prove upper and lower bounds on this number, which match up to a multiplicative constant. We prove these bounds in the general setting, where next-token distributions can be arbitrary, as well as in the empirical setting, where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer transformers, and our proofs highlight an important injectivity property satisfied by self-attention. Furthermore, we provide numerical evidence that the minimal number of parameters for memorization is sufficient for training the model to the entropy lower bound.
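For readers unfamiliar with the task, the sketch below shows a one-layer, single-head decoder-only transformer mapping a context sequence to a next-token probability distribution. The dimensions, random weights, and NumPy implementation are illustrative choices only, not the construction analyzed in the paper.

```python
# Minimal sketch of next-token prediction with a one-layer, single-head
# decoder-only transformer. All names and dimensions are placeholders.
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 10, 8, 5                 # vocab size, model width, context length

E  = rng.normal(size=(V, d))       # token embeddings
Wq = rng.normal(size=(d, d))       # query projection
Wk = rng.normal(size=(d, d))       # key projection
Wv = rng.normal(size=(d, d))       # value projection
Wo = rng.normal(size=(d, V))       # output (unembedding) projection

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def next_token_distribution(tokens):
    """Map a context sequence of token ids to a distribution over the next token."""
    X = E[tokens]                              # (T, d) embedded context
    Q, K, Val = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # causal mask: position t attends only to positions <= t
    scores[np.triu(np.ones_like(scores), k=1).astype(bool)] = -np.inf
    A = softmax(scores, axis=-1) @ Val         # (T, d) self-attention output
    logits = A[-1] @ Wo                        # read out the last position
    return softmax(logits)                     # next-token probabilities

context = rng.integers(0, V, size=T)
p = next_token_distribution(context)
print(p.shape, p.sum())                        # (10,) 1.0
```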

Recommended citation: Madden, L; Fox, C; Thrampoulidis, C. Upper and lower memory capacity bounds of transformers for next-token prediction. arXiv preprint arXiv:2405.13718, 2024
Download Paper

A Study of the Edge of Stability in Deep Learning

Published in Master's Thesis, 2023

Optimization of deep neural networks has been studied extensively, but our understanding of it is still very limited. In particular, it is still unclear how to optimally set hyperparameters such as the step size for gradient descent when training neural networks. We explore the issue of tuning the step size for full-batch gradient descent, examining a phenomenon proposed in the literature called the “edge of stability”. This refers to a phase of training in which the training loss decreases non-monotonically over time while the sharpness (the maximum eigenvalue of the Hessian matrix of the loss) oscillates around the threshold of 2/(step size). In this thesis, we start by providing the necessary background to understand the current state of the art in this area. We then perform various experiments on the sharpness and study its trajectory over the course of training with full-batch gradient descent. Unlike the vast majority of the previous literature, we focus on investigating the sharpness of the individual layers of a neural network. We observe that different layers of a neural network exhibit differing behavior in their sharpness trajectories. We focus on fully connected networks and examine how varying hyperparameters of the network, such as the number of layers or hidden units per layer, impacts the sharpness. Finally, we explore how various architecture choices impact step-size selection when training with gradient descent. We see that changing just one design choice of a deep neural network, such as the non-linear activation function used in each layer, can greatly impact which step sizes lead to divergence during training. This motivates further investigation into how architecture choices affect hyperparameter selection.
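As a concrete illustration of the quantity tracked above, the sketch below estimates the sharpness of a small fully connected network by power iteration on Hessian-vector products and compares it to the 2/(step size) threshold. The model, data, step size, and PyTorch implementation are placeholder choices for illustration, not the experimental setup of the thesis.

```python
# Sketch: estimate the sharpness (top Hessian eigenvalue of the loss) of a
# small fully connected network via power iteration on Hessian-vector products.
import torch

torch.manual_seed(0)
X = torch.randn(64, 10)
y = torch.randn(64, 1)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
params = [p for p in model.parameters() if p.requires_grad]

def loss_fn():
    return torch.nn.functional.mse_loss(model(X), y)

def hvp(vec):
    """Hessian-vector product of the loss w.r.t. all parameters (double backprop)."""
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    return torch.autograd.grad(flat @ vec, params)

def sharpness(iters=50):
    n = sum(p.numel() for p in params)
    v = torch.randn(n)
    v /= v.norm()
    for _ in range(iters):
        hv = torch.cat([h.reshape(-1) for h in hvp(v)])
        v = hv / hv.norm()
    hv = torch.cat([h.reshape(-1) for h in hvp(v)])
    return (v @ hv).item()   # Rayleigh quotient ≈ Hessian eigenvalue of largest magnitude

step_size = 0.05
print(f"sharpness ≈ {sharpness():.3f}, edge-of-stability threshold 2/eta = {2 / step_size:.1f}")
```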

Recommended citation: Fox, C. A Study of the Edge of Stability in Deep Learning. Master's Thesis, 2023
Download Paper

Deep Learning of Immune Cell Differentiation

Published in Proceedings of the National Academy of Sciences of the United States of America, 2020

Although we know many sequence-specific transcription factors (TFs), how the DNA sequence of cis-regulatory elements is decoded and orchestrated on the genome scale to determine immune cell differentiation is beyond our grasp. Leveraging a granular atlas of chromatin accessibility across 81 immune cell types, we asked whether a convolutional neural network (CNN) could learn to infer cell-type-specific chromatin accessibility solely from regulatory DNA sequences. With a tailored architecture and an ensemble approach to CNN parameter interpretation, we show that our trained network (“AI-TAC”) does so by rediscovering ab initio the binding motifs for known regulators and some unknown ones. Motif positions that AI-TAC learns as functionally important overlap strikingly well with binding positions determined by chromatin immunoprecipitation for several TFs. AI-TAC establishes a hierarchy of TFs and their interactions that drives lineage specification and also identifies stage-specific interactions, like Pax5/Ebf1 vs. Pax5/Prdm1, or the role of different NF-κB dimers in different cell types. AI-TAC assigns Spi1/Cebp and Pax5/Ebf1 as the drivers necessary for myeloid and B lineage fates, respectively, but no factor appears as dominantly required for T cell differentiation, which may represent a fall-back pathway. Mouse-trained AI-TAC can parse human DNA, revealing a strikingly similar ranking of influential TFs and providing additional support that AI-TAC is a generalizable regulatory sequence decoder. Thus, deep learning can reveal the regulatory syntax predictive of the full differentiative complexity of the immune system.
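To make the modeling setup concrete, the sketch below shows a 1D CNN mapping one-hot encoded regulatory DNA sequence to per-cell-type accessibility scores. The layer sizes, sequence length, and filter widths are placeholders loosely following the abstract (81 output cell types), not the actual AI-TAC architecture.

```python
# Sketch: 1D CNN from one-hot DNA sequence to per-cell-type accessibility scores.
# Architecture details are illustrative placeholders, not AI-TAC itself.
import torch
import torch.nn as nn

SEQ_LEN, N_CELL_TYPES = 250, 81          # bases per region, immune cell types

class AccessibilityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=19, padding=9),   # motif-like filters over A/C/G/T channels
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=5),
            nn.Conv1d(64, 128, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Linear(128, N_CELL_TYPES)           # one accessibility score per cell type

    def forward(self, x):                # x: (batch, 4, SEQ_LEN) one-hot DNA
        h = self.features(x).squeeze(-1)
        return self.head(h)

# one-hot encode a random batch of sequences and run a forward pass
batch = torch.nn.functional.one_hot(
    torch.randint(0, 4, (8, SEQ_LEN)), 4
).float().permute(0, 2, 1)
print(AccessibilityCNN()(batch).shape)   # torch.Size([8, 81])
```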

Recommended citation: Maslova, A; Ramirez, R; Ma, K; Schmutz, H; Wang, C; Fox, C; Ng, B; Benoist, C; Mostafavi, S; The Immunological Genome Project. Deep Learning of Immune Cell Differentiation. Proceedings of the National Academy of Sciences of the United States of America, 2020
Download Paper