Finding the Right Optimization for Mixture-of-Experts @ EPFL - Machine Learning and Optimization Lab
Feb 2025 – July 2025
Comprehensive study of optimization strategies for Mixture-of-Experts models, exploring trade-offs between load balancing, expressivity, and performance.
- Differentiated learning-rate schedules for expert vs. non-expert parameters (parameter-group sketch after this list)
- Tuned auxiliary load-balancing loss coefficients and explored auxiliary-loss-free methods (loss sketched below)
- Compared optimizers (AdamW vs. Shampoo) and router activation functions (sigmoid vs. softmax)
- Varied the number of experts to assess the impact on validation loss, perplexity, and accuracy
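
A minimal sketch of how expert and non-expert parameters can be placed on separate learning-rate schedules in PyTorch. The parameter-name convention (`"experts"` in the name), the learning rates, and the decay factors are illustrative assumptions, not the settings used in the study.

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    expert_lr: float = 3e-4,
                    dense_lr: float = 1e-4):
    """Split parameters into expert / non-expert groups so each group
    can follow its own learning-rate schedule.
    Assumes expert weights are identifiable by 'experts' in their name."""
    expert_params, dense_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (expert_params if "experts" in name else dense_params).append(param)

    optimizer = torch.optim.AdamW([
        {"params": expert_params, "lr": expert_lr},  # expert group
        {"params": dense_params, "lr": dense_lr},    # non-expert group
    ])
    # One schedule per parameter group, e.g. decay the expert LR faster.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=[lambda step: 0.999 ** step,    # expert group
                   lambda step: 0.9995 ** step],  # non-expert group
    )
    return optimizer, scheduler
```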
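
And a sketch of the kind of auxiliary load-balancing loss whose coefficient gets tuned, written here in the Switch-Transformer style; the exact formulation and coefficient values in the study may differ.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        num_experts: int,
                        aux_coef: float = 1e-2) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: penalizes mismatch between
    the fraction of tokens routed to each expert and the mean router
    probability assigned to it.
    router_logits: [num_tokens, num_experts]."""
    probs = F.softmax(router_logits, dim=-1)        # router probabilities
    top1 = probs.argmax(dim=-1)                     # chosen expert per token
    # f_i: fraction of tokens dispatched to expert i
    tokens_per_expert = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    mean_probs = probs.mean(dim=0)
    return aux_coef * num_experts * torch.sum(tokens_per_expert * mean_probs)
```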
