Papers

Towards Large Scale Training on Apple Silicon

July 2025

ICML 2025 ES-FoMo III Workshop

We introduce K-POP, a novel optimizer that applies Adam in the Kronecker-factored eigenbasis (KFE). We show that this optimizer is particularly well suited to Apple Silicon hardware: it trades higher memory usage for greater efficiency per FLOP. Additionally, the eigenvector calculation can be offloaded to the CPU while the forward-backward pass is performed, making the most of Apple Silicon's unified memory architecture.
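The core idea can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function names and the choice of recomputing the eigenbases elsewhere are our assumptions. The gradient of a weight matrix is rotated into the KFE (spanned by eigenbases `Q_l`, `Q_r` of the Kronecker factors), Adam's moment updates are applied in that basis, and the resulting step is rotated back:

```python
import numpy as np

# Illustrative sketch (not the paper's code): one Adam step applied
# in the Kronecker-factored eigenbasis (KFE) of a single weight matrix.
# Q_l, Q_r are eigenbases of the Kronecker factors; in practice they
# would be recomputed periodically (e.g. on the CPU, as in the paper).
def adam_step_in_kfe(grad, Q_l, Q_r, m, v, t, lr=1e-3,
                     b1=0.9, b2=0.999, eps=1e-8):
    g = Q_l.T @ grad @ Q_r          # rotate the gradient into the KFE
    m = b1 * m + (1 - b1) * g       # first moment, tracked in the KFE
    v = b2 * v + (1 - b2) * g**2    # second moment, tracked in the KFE
    m_hat = m / (1 - b1**t)         # standard Adam bias correction
    v_hat = v / (1 - b2**t)
    step = m_hat / (np.sqrt(v_hat) + eps)
    update = Q_l @ step @ Q_r.T     # rotate the update back to weight space
    return -lr * update, m, v
```

With identity eigenbases this reduces to plain Adam, which is a useful sanity check when experimenting with the rotation.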

EXO Gym

EXO Gym: a simulation environment for low-bandwidth training, July 2025

ICML 2025 CODEML Workshop

EXO Gym allows distributed training methods (LocalSGD, DiLoCo, Deep Gradient Compression, etc.) to be simulated on a single device. This lowers the barrier to entry for experimenting with distributed training: small-scale experiments can be run easily, then scaled up within the same codebase.
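As a flavour of what single-device simulation means, here is a minimal LocalSGD round on a toy quadratic loss. This is our own illustrative sketch, not EXO Gym's API: each simulated "worker" is just a parameter copy that takes a few local SGD steps on its own data shard before parameters are averaged, mimicking one communication round without any real networking:

```python
import numpy as np

# Minimal single-device LocalSGD simulation on a toy loss
# 0.5 * ||w - x||^2 (illustrative; EXO Gym's actual API differs).
# Each of the K simulated workers takes `local_steps` SGD steps on
# its own shard, then parameters are averaged once (one "round").
def local_sgd_round(params, shards, lr=0.1, local_steps=4):
    workers = [params.copy() for _ in shards]   # one copy per worker
    for w, data in zip(workers, shards):
        for _ in range(local_steps):
            grad = w - data.mean(axis=0)        # grad of the toy loss
            w -= lr * grad                      # local SGD step
    return np.mean(workers, axis=0)             # simulated all-reduce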

SPARTA

Improving the Efficiency of Distributed Training using Sparse Parameter Averaging, March 2025

ICLR 2025 MCDC Workshop

SPARTA is a novel distributed training method in which a small subset of parameters is averaged across workers at each training step. We show that SPARTA achieves the same performance as DiLoCo with 100x lower communication overhead.
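The communication pattern is simple enough to sketch directly. The following is our own illustration of the idea (index selection and the fraction `p` are assumptions, not the paper's exact scheme): each step, a random fraction of parameter indices is averaged across workers, so only that slice of the model is communicated:

```python
import numpy as np

# Illustrative sketch of sparse parameter averaging in the spirit of
# SPARTA (details are ours, not the paper's): each training step, a
# random fraction p of parameter indices is averaged across workers,
# so each step communicates only ~p of the model.
def sparse_average_step(workers, p=0.01, rng=None):
    rng = rng or np.random.default_rng()
    n = workers[0].size
    idx = rng.choice(n, size=max(1, int(p * n)), replace=False)
    avg = np.mean([w.flat[idx] for w in workers], axis=0)
    for w in workers:
        w.flat[idx] = avg        # selected slice is now identical everywhere
    return idx
```

With p = 0.01, each step moves roughly 1% of the parameters over the network, which is where the large communication savings over full-model averaging come from.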

Embedded Transfer Learning

May 2024

For my master's degree essay, supervised by Professor Mihaela van der Schaar, I proposed a new area of transfer learning called embedded transfer. In embedded transfer, the target covariate domain is dimensionally contained within the source domain; to model this, parameter duplication is used in the transfer step. In the paper, I demonstrate the power of this new paradigm by applying it to ECG classification.