Posts by Collection

portfolio

publications

Deep Learning of Immune Cell Differentiation

Published in Proceedings of the National Academy of Sciences of the United States of America, 2020

Although we know many sequence-specific transcription factors (TFs), how the DNA sequence of cis-regulatory elements is decoded and orchestrated on the genome scale to determine immune cell differentiation is beyond our grasp. Leveraging a granular atlas of chromatin accessibility across 81 immune cell types, we asked if a convolutional neural network (CNN) could learn to infer cell type-specific chromatin accessibility solely from regulatory DNA sequences. With a tailored architecture and an ensemble approach to CNN parameter interpretation, we show that our trained network (“AI-TAC”) does so by rediscovering ab initio the binding motifs for known regulators and some unknown ones. Motifs that the network learns as functionally important overlap strikingly well with binding positions determined by chromatin immunoprecipitation for several TFs. AI-TAC establishes a hierarchy of TFs and their interactions that drives lineage specification and also identifies stage-specific interactions, like Pax5/Ebf1 vs. Pax5/Prdm1, or the role of different NF-κB dimers in different cell types. AI-TAC assigns Spi1/Cebp and Pax5/Ebf1 as the drivers necessary for myeloid and B lineage fates, respectively, but no factors seemed as dominantly required for T cell differentiation, which may represent a fall-back pathway. Mouse-trained AI-TAC can parse human DNA, revealing a strikingly similar ranking of influential TFs and providing additional support that AI-TAC is a generalizable regulatory sequence decoder. Thus, deep learning can reveal the regulatory syntax predictive of the full differentiative complexity of the immune system.
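A first-layer convolutional filter can be read as a motif scanner over one-hot-encoded DNA. The toy sketch below is purely illustrative (the filter weights and sequence are made up, and this is not AI-TAC's actual architecture): a hand-built filter produces its strongest response at the position where its motif occurs, which is the mechanism by which a trained CNN can rediscover binding motifs.

```python
import numpy as np

# One-hot encode a DNA sequence (rows: A, C, G, T).
def one_hot(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        x[idx[b], j] = 1.0
    return x

def scan(x, filt):
    """Slide a (4, k) filter over the sequence, as a 1D conv layer would."""
    k = filt.shape[1]
    return np.array([np.sum(x[:, j:j + k] * filt) for j in range(x.shape[1] - k + 1)])

# A toy filter matching the motif "GATA": +1 for the motif base, -1 otherwise.
filt = one_hot("GATA") * 2.0 - 1.0
seq = "CCCGATACC"
scores = scan(one_hot(seq), filt)
print(int(np.argmax(scores)))  # strongest response at the motif's start position
```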

Recommended citation: A. Maslova, R. N. Ramirez, K. Ma, H. Schmutz, C. Wang, C. Fox, B. Ng, C. Benoist, S. Mostafavi, Immunological Genome Project. "Deep Learning of Immune Cell Differentiation". Proceedings of the National Academy of Sciences of the United States of America, 2020
Download Paper

A Study of the Edge of Stability in Deep Learning

Published in Master's Thesis, 2023

Optimization of deep neural networks has been studied extensively, but our understanding of it is still very limited. In particular, it is still unclear how to optimally set hyperparameters such as the step size for gradient descent when training neural networks. We explore the issue of tuning the step size for full batch gradient descent, examining a proposed phenomenon in the literature called the “edge of stability”. This refers to a phase of training for neural networks when the training loss decreases non-monotonically over time while the sharpness (the maximum eigenvalue of the Hessian matrix) oscillates around the threshold of precisely two divided by the step size. In this thesis, we start by providing the necessary background to understand the current state of the art in this area. We then perform various experiments on the sharpness and study its trajectory over the course of training with full batch gradient descent. Unlike the vast majority of the previous literature, we investigate the sharpness with respect to the individual layers of neural networks. We observe that different layers of a neural network have differing behavior in their sharpness trajectories. We focus on fully connected networks and examine how varying hyperparameters of the network, such as the number of layers or hidden units per layer, impacts the sharpness. Finally, we explore how various architecture choices impact step size selection when training with gradient descent. We see that changing just one parameter of a deep neural network, such as the non-linear activation functions used in each layer, can greatly impact which step sizes lead to divergence during training. This motivates further investigation into how architecture choices affect hyperparameter selection.
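The divergence threshold described above is easy to verify on a toy quadratic (an illustrative check, not an experiment from the thesis): for f(x) = ½ xᵀHx the sharpness is the largest eigenvalue of H, and gradient descent diverges once the step size exceeds two divided by the sharpness.

```python
import numpy as np

H = np.diag([1.0, 10.0])                # Hessian of f(x) = 0.5 x^T H x
sharpness = np.linalg.eigvalsh(H).max()  # largest eigenvalue = 10

def run_gd(step, iters=200):
    x = np.ones(2)
    for _ in range(iters):
        x = x - step * (H @ x)           # gradient of 0.5 x^T H x is H x
    return np.linalg.norm(x)

stable = run_gd(0.9 * 2 / sharpness)     # below 2/sharpness: converges
unstable = run_gd(1.1 * 2 / sharpness)   # above 2/sharpness: diverges
print(stable < 1e-3, unstable > 1e3)
```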

Recommended citation: C. Fox. "A Study of the Edge of Stability in Deep Learning". Master's Thesis, 2023
Download Paper

Nonmonotone Line Searches Operate at the Edge of Stability

Published in NeurIPS OPT Workshop, 2024

The traditional convergence analysis of Gradient Descent (GD) assumes the step size to be bounded from above by twice the reciprocal of the sharpness, i.e., the largest eigenvalue of the Hessian of the objective function. However, recent numerical observations on neural networks have shown that GD also converges with larger step sizes. In this case, GD may enter the so-called edge of stability phase, in which the objective function decreases faster than with smaller steps, but nonmonotonically. Interestingly enough, this same behavior was already observed roughly 40 years ago when the first nonmonotone line search was proposed. These methods are designed to accept larger steps than their monotone counterparts (e.g., Armijo) as they do not impose a decrease in the objective function, while still being provably convergent. In this paper, we show that nonmonotone line searches are able to operate in the edge of stability regime right from the start of training. Moreover, we design a new resetting technique that speeds up training and yields flatter solutions, by keeping GD at the edge of stability without requiring hyperparameter tuning or prior knowledge of problem-dependent constants. To conclude, we observe that the large steps yielded by our method seem to mimic the behavior of the well-known Barzilai-Borwein method.
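A minimal sketch of a max-type nonmonotone line search, in the spirit of the classical method referenced above (this is not the paper's exact algorithm, and the resetting technique is omitted): a step is accepted if it improves on the maximum of the last M function values, so the method can take large steps that a monotone Armijo search would reject.

```python
import numpy as np

def nonmonotone_gd(f, grad, x0, t0=1.0, c=1e-4, M=10, iters=50):
    """Gradient descent with a max-type nonmonotone backtracking line search."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(iters):
        g = grad(x)
        ref = max(history[-M:])          # nonmonotone reference value
        t = t0
        # Accept if f(x - t*g) <= max(last M values) - c * t * ||g||^2.
        while f(x - t * g) > ref - c * t * (g @ g):
            t *= 0.5                     # backtrack until the condition holds
        x = x - t * g
        history.append(f(x))
    return x, history

# Toy quadratic test problem: f(x) = 0.5 x^T H x.
H = np.diag([1.0, 2.0])
x, hist = nonmonotone_gd(lambda x: 0.5 * x @ H @ x, lambda x: H @ x, [1.0, 1.0])
print(hist[-1])
```

On this example the search initially accepts the step t = 1, which equals two divided by the sharpness, before the nonmonotone reference value forces a smaller step that lands on the minimizer.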

Recommended citation: C. Fox, L. Galli, M. Schmidt, H. Rauhut. "Nonmonotone Line Searches Operate at the Edge of Stability". NeurIPS OPT Workshop, 2024
Download Paper

Glocal Smoothness: Line Search can really help!

Published in arXiv, 2025

Iteration complexities for first-order optimization algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local (“glocal”) smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption, it is straightforward to show the advantages of line searches over fixed step sizes, and that in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well as for other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.
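The advantage of line searches under varying local smoothness can be seen on a one-dimensional toy function (an illustrative sketch, not an experiment from the paper): log cosh has global gradient-Lipschitz constant L = 1, but is nearly flat away from the origin, where a backtracking search can take far larger steps than the fixed step size 1/L.

```python
import math

f = lambda x: math.log(math.cosh(x))   # global L = 1; locally flat for large |x|
grad = lambda x: math.tanh(x)

def fixed_step_gd(x, step=1.0, iters=15):
    """Gradient descent with the fixed step size 1/L."""
    for _ in range(iters):
        x -= step * grad(x)
    return f(x)

def line_search_gd(x, t0=100.0, c=1e-4, iters=15):
    """Gradient descent with Armijo backtracking from a large initial step."""
    for _ in range(iters):
        g = grad(x)
        t = t0
        while f(x - t * g) > f(x) - c * t * g * g:  # Armijo condition
            t *= 0.5
        x -= t * g
    return f(x)

f_fixed, f_ls = fixed_step_gd(20.0), line_search_gd(20.0)
print(f_ls < f_fixed)  # line search exploits the flat region and gets further
```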

Recommended citation: C. Fox, A. Mishkin, S. Vaswani, M. Schmidt. "Glocal Smoothness: Line Search can really help!". arXiv preprint arXiv:2506.12648, 2025
Download Paper

Upper and lower memory capacity bounds of transformers for next-token prediction

Published in IEEE Transactions on Information Theory, 2025

Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences for which a decoder-only transformer can interpolate next-token distributions has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers and our proofs highlight an important injectivity property satisfied by self-attention. Furthermore, we provide numerical evidence that the minimal number of parameters for memorization is sufficient to train the model to the entropy lower bound.
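In the empirical setting mentioned above, next-token distributions are just conditional frequencies over contexts. A toy sketch (word-level tokens, and the documents and helper name are made up for illustration):

```python
from collections import Counter, defaultdict

def next_token_distributions(documents, context_len=2):
    """Empirical next-token distributions from a finite set of documents."""
    counts = defaultdict(Counter)
    for doc in documents:
        for i in range(len(doc) - context_len):
            ctx = tuple(doc[i:i + context_len])
            counts[ctx][doc[i + context_len]] += 1
    # Normalize the counts for each context into a probability distribution.
    return {
        ctx: {tok: n / sum(c.values()) for tok, n in c.items()}
        for ctx, c in counts.items()
    }

docs = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
dist = next_token_distributions(docs)
print(dist[("the", "cat")])  # → {'sat': 0.5, 'ran': 0.5}
```

The capacity question in the paper asks how many distinct contexts (keys of this mapping) a transformer of a given size can fit exactly.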

Recommended citation: L. Madden, C. Fox, C. Thrampoulidis. "Next-token Prediction Capacity: General Upper Bounds and a Lower Bound for Transformers". IEEE Transactions on Information Theory, 2025
Download Paper

talks

teaching

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2015

CPSC 110 - Computation, Programs, and Programming

Fundamental program and computation structures. Introductory programming skills. Computation as a tool for information processing, simulation and modelling, and interacting with the world.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2016

CPSC 110 - Computation, Programs, and Programming

Fundamental program and computation structures. Introductory programming skills. Computation as a tool for information processing, simulation and modelling, and interacting with the world.

Teaching Assistant

Undergraduate Course, Department of Statistics, University of British Columbia, 2016

STAT 200 - Elementary Statistics for Applications

Classical, nonparametric, and robust inferences about means, variances, and analysis of variance, using computers. Emphasis on problem formulation, assumptions, and interpretation.

Teaching Assistant

Undergraduate Course, Department of Statistics, University of British Columbia, 2016

STAT 302 - Introduction to Probability

Basic notions of probability, random variables, expectation and conditional expectation, limit theorems.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2017

CPSC 213 - Introduction to Computer Systems

Software architecture, operating systems, and I/O architectures. Relationships between application software, operating systems, and computing hardware; critical sections, deadlock avoidance, and performance; principles and operation of disks and networks.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2017

CPSC 221 - Basic Algorithms and Data Structures

Design and analysis of basic algorithms and data structures; algorithm analysis methods, searching and sorting algorithms, basic data structures, graphs and concurrency.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2018

CPSC 421 - Introduction to Theory of Computing

Characterizations of computability (using machines, languages and functions). Universality, equivalence and Church’s thesis. Unsolvable problems. Restricted models of computation. Finite automata, grammars and formal languages.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2019

CPSC 406 - Computational Optimization

Formulation and analysis of algorithms for continuous and discrete optimization problems; linear, nonlinear, network, dynamic, and integer optimization; large-scale problems; software packages and their implementation; duality theory and sensitivity.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2021

CPSC 302 - Numerical Computation for Algebraic Problems

Numerical techniques for basic mathematical processes involving no discretization, and their analysis. Solution of linear systems, including analysis of round-off errors; norms and condition number; introduction to iterative techniques in linear algebra, including eigenvalue problems; solution to nonlinear equations.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2022

CPSC 340 - Machine Learning and Data Mining

Models of algorithms for dimensionality reduction, nonlinear regression, classification, clustering and unsupervised learning; applications to computer graphics, computer games, bio-informatics, information retrieval, e-commerce, databases, computer vision and artificial intelligence.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2022

CPSC 406 - Computational Optimization

Formulation and analysis of algorithms for continuous and discrete optimization problems; linear, nonlinear, network, dynamic, and integer optimization; large-scale problems; software packages and their implementation; duality theory and sensitivity.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2023

CPSC 213 - Introduction to Computer Systems

Software architecture, operating systems, and I/O architectures. Relationships between application software, operating systems, and computing hardware; critical sections, deadlock avoidance, and performance; principles and operation of disks and networks.

Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2023

CPSC 340 - Machine Learning and Data Mining

Models of algorithms for dimensionality reduction, nonlinear regression, classification, clustering and unsupervised learning; applications to computer graphics, computer games, bio-informatics, information retrieval, e-commerce, databases, computer vision and artificial intelligence.

Head Teaching Assistant

Undergraduate Course, Department of Computer Science, University of British Columbia, 2024

CPSC 340 - Machine Learning and Data Mining

Models of algorithms for dimensionality reduction, nonlinear regression, classification, clustering and unsupervised learning; applications to computer graphics, computer games, bio-informatics, information retrieval, e-commerce, databases, computer vision and artificial intelligence.