Mixture of Experts: A New Approach to Scaling AI with Specialized Intelligence


Mixture of Experts (MoE) is a machine learning technique in which multiple specialized models (experts) work together, with a gating network selecting the best expert(s) for each input. In the race to build ever-larger and more capable AI systems, this architecture is gaining traction. Unlike traditional dense models, which activate every neuron in every layer during a forward pass, an MoE model selectively routes each input through only a subset of specialized sub-networks (experts). This drastically reduces computation while maintaining or even improving performance.

The mixture of experts (MoE) technique addresses the challenge of scaling by breaking large models down into smaller, specialized networks. The concept originated in the 1991 paper "Adaptive Mixtures of Local Experts." Since then, MoEs have been employed in trillion-parameter models, such as the open-sourced 1.6-trillion-parameter Switch Transformer.

In this post, we’ll break down what MoE is, why it’s important, and how it’s shaping the future of scalable machine learning.

What Is a Mixture of Experts (MoE)?

Imagine an AI model as a team of specialists, each with their own unique expertise. A mixture of experts (MoE) model operates on this principle by dividing a complex task among smaller, specialized networks known as “experts.”

Each expert focuses on a specific aspect of the problem, enabling the model to address the task more efficiently and accurately. It’s similar to having a doctor for medical issues, a mechanic for car problems, and a chef for cooking—each expert handles what they do best.

By collaborating, these specialists can solve a broader range of problems more effectively than a single generalist.

Let’s take a look at the diagram below—we’ll explain it shortly after.

[Figure: From Sparse to Soft Mixture of Experts]

Let’s break down the components of this diagram:

  • Input: This is the problem or data you want the AI to handle.
  • Experts: These are smaller AI models, each trained to be really good at a specific part of the overall problem. Think of them like the different specialists on your team.
  • Gating network: This is like a manager who decides which expert is best suited for each part of the problem. It looks at the input and figures out who should work on what.
  • Output: This is the final answer or solution that the AI model produces after the experts have done their work.

The advantages of using MoE are:

  • Efficiency: Only the experts who are good at a particular part of the problem are used, saving time and computing power.
  • Flexibility: You can easily add more experts or change their specialties, making the system adaptable to different problems.
  • Better results: Because each expert focuses on what they’re good at, the overall solution is usually more accurate and reliable.

At its core, a Mixture of Experts model consists of:

  • A pool of expert networks (e.g., small feed-forward neural nets).

  • A gating network that decides which experts to activate for a given input.

  • Sparse activation — only a few experts are used per input, not all of them.

How it works:

  1. Input data is fed into the model.

  2. A lightweight gating function determines which experts are most relevant.

  3. Only those experts (say, 2 out of 16) are activated.

  4. The outputs from those experts are combined (usually weighted) and passed onward.

This means that for a model with 16 experts and top-2 routing, only ~12.5% of the expert parameters (2 of 16 experts) are active for any given input, as the quick calculation below illustrates.
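As a quick sanity check on that number, here is a tiny calculation in plain Python; the per-expert parameter count is an arbitrary assumption used only to make the arithmetic concrete.

```python
# Sparse activation arithmetic: with top-k routing, only k of the num_experts
# expert networks run for each token.
num_experts = 16
top_k = 2
params_per_expert = 50_000_000  # assumed size of one expert; any value works

total_expert_params = num_experts * params_per_expert
active_expert_params = top_k * params_per_expert
print(f"Active expert parameters per token: {active_expert_params / total_expert_params:.1%}")
# -> 12.5%
```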


Expert networks

Expert Networks in Mixture of Experts (MoE) refer to the individual specialized sub-networks (called “experts”) within an MoE model that are selectively activated during inference or training. Each expert is typically a neural network or MLP trained to handle a specific subset of input patterns or tasks.

Think of the “expert networks” in an MoE model as a team of specialists. Instead of having one AI model do everything, each expert focuses on a particular type of task or data. In an MoE model, these experts are individual neural networks, each trained on different data or tasks. They are designed to be sparse, meaning only a few are active at any given time, depending on the nature of the input. This prevents the system from being overwhelmed and ensures that the most relevant experts are working on the problem.


Here’s a detailed breakdown:

Role of Expert Networks in MoE

1. Specialization

Each expert learns to specialize in certain patterns. For example:

  • In language modeling, one expert may specialize in legal text, another in code, and another in conversation.

  • In vision, one expert may specialize in textures, another in shapes.

2. Sparsity

Unlike standard dense models, only a few experts are active per input (e.g., top-2 out of 64). This reduces computation while maintaining model capacity.

3. Routing via Gating

A gating network (usually a softmax over the experts’ scores) routes each token or input to the most relevant expert(s). The gating function can be:

  • Deterministic (Top-k)

  • Stochastic (roulette-style or noisy gating; see the sketch below)
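To make the stochastic variant concrete, here is a minimal noisy top-k gating sketch in PyTorch, loosely in the spirit of noisy gating; the layer names, noise parameterization, and shapes are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating: add learned, input-dependent noise to the gate
    logits before selecting the top-k experts (encourages exploration)."""

    def __init__(self, dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(dim, num_experts, bias=False)
        self.w_noise = nn.Linear(dim, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: [tokens, dim]
        clean_logits = self.w_gate(x)
        if self.training:
            noise_std = F.softplus(self.w_noise(x))                 # per-expert noise scale
            logits = clean_logits + torch.randn_like(clean_logits) * noise_std
        else:
            logits = clean_logits                                   # deterministic at inference
        top_vals, top_idx = logits.topk(self.k, dim=-1)             # [tokens, k]
        weights = F.softmax(top_vals, dim=-1)                       # renormalize over selected experts
        return weights, top_idx

# Example: route 4 tokens of dimension 8 across 6 experts, keeping the top 2.
gate = NoisyTopKGate(dim=8, num_experts=6, k=2)
w, idx = gate(torch.randn(4, 8))
print(w.shape, idx.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```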

Architecture of Expert Networks

Each expert is typically a lightweight MLP or transformer feed-forward block. In Google’s Switch Transformer and GShard, for example, the experts share the same architecture but have independent weights.


Common design:

Each expert processes the input independently, and the outputs are combined (e.g., weighted sum or concatenation).
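For concreteness, a single expert in transformer-style MoE layers is often just a small position-wise feed-forward block; the hidden width and activation below are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class ExpertFFN(nn.Module):
    """One expert: a small position-wise feed-forward network. An MoE layer
    keeps a pool of these (e.g., an nn.ModuleList of N experts) that share
    the architecture but have independent weights."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),                 # activation choice varies by model
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A pool of 8 experts with identical architecture but separate parameters.
experts = nn.ModuleList([ExpertFFN(dim=512, hidden=2048) for _ in range(8)])
```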

⚙️ Training Challenges

  1. Load Balancing:

    • Some experts might get overused.

    • Solution: Add auxiliary losses (e.g., entropy-based) to encourage even distribution of tokens.

  2. Expert Collapse:

    • All inputs route to a few experts.

    • Solution: Regularization and noise in gating.

  3. Communication Overhead (in distributed settings):

    • Experts may be sharded across machines (e.g., in TPU pods).

    • Requires careful design for fast routing and execution.

Benefits of Expert Networks in MoE:


  • Scalability: Enables trillion-parameter models without proportional compute cost.
  • Efficiency: Only a subset of model parameters is used per forward pass.
  • Interpretability: Experts can be analyzed to understand what types of data they specialize in.
  • Transfer learning: Experts can be fine-tuned or added for domain-specific tasks.

Examples of MoE Models Using Expert Networks

  • GShard (Google): First large-scale MoE, with routing via top-k gating.
  • Switch Transformer: Uses top-1 routing, simplifying MoE enough to scale to 1T+ parameters.
  • GLaM (Google): Combines MoE with dense layers for robustness.
  • V-MoE (vision): Applies MoE in vision models; experts are routed based on image patches.
  • DeepSpeed-MoE (Microsoft): Optimized for training large MoE models efficiently.


Gating Networks in Mixture of Experts (MoE)

A Gating Network is the component in a Mixture of Experts (MoE) model that selects which expert(s) to activate for each input. Think of it as the controller that decides where to send each input, based on its content.

The gating network (the router) is another type of neural network that learns to analyze the input data (like a sentence to be translated) and determine which experts are best suited to handle it.

It does this by assigning a “weight” or importance score to each expert based on the characteristics of the input.  The experts with the highest weights are then selected to process the data.

There are various ways (called “routing algorithms”) that the gating network can select the right experts. Here are a few common ones:

  1. Top-k routing: This is the simplest method. The gating network picks the top ‘k’ experts with the highest affinity scores and sends the input data to them.
  2. Expert choice routing: In this method, instead of the data choosing the experts, the experts decide which data they can handle best. This strategy aims to achieve the best load balancing and allows for a varied way of mapping data to experts (a short code sketch of this strategy follows the list).
  3. Sparse routing: This approach only activates a few experts for each piece of data, creating a sparse network. Sparse routing uses less computational power compared to dense routing, where all experts are active for every piece of data.
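To illustrate the second strategy, here is a minimal expert choice routing sketch in PyTorch: instead of each token picking its top-k experts, each expert picks the tokens it scores highest, up to a fixed capacity. The capacity formula and tensor shapes are assumptions for illustration.

```python
import torch

def expert_choice_routing(scores: torch.Tensor, capacity_factor: float = 1.0):
    """scores: [num_tokens, num_experts] router probabilities.
    Each expert selects its top-`capacity` tokens instead of each token
    selecting its top-k experts, which balances load by construction."""
    num_tokens, num_experts = scores.shape
    capacity = int(capacity_factor * num_tokens / num_experts)

    # For every expert (column), pick the `capacity` highest-scoring tokens.
    gate_vals, token_idx = scores.topk(capacity, dim=0)   # both [capacity, num_experts]
    return gate_vals, token_idx

scores = torch.softmax(torch.randn(16, 4), dim=-1)        # 16 tokens, 4 experts
gate_vals, token_idx = expert_choice_routing(scores)
print(token_idx.shape)                                    # torch.Size([4, 4]): 4 tokens per expert
```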

When making predictions, the model combines the outputs from the selected experts, weighting them by the same gating scores that were used to route the input. For a single input, more than one expert might be involved, depending on how complex and varied the problem is.

Here’s a step-by-step mathematical breakdown of the gating network’s operations in a Mixture of Experts (MoE) layer, including tensor shapes and intermediate computations. A PyTorch sketch that mirrors these steps is given after the key equations.

Notation
– Input: \( x \in \mathbb{R}^{B \times T \times D} \)
(Batch size \( B \), Sequence length \( T \), Feature dimension \( D \))
– Number of experts: \( N \)
– Top-\( k \) experts selected per token.

Step 1: Flatten Input
The input is reshaped to treat each token independently:
\[
x_{\text{flat}} \in \mathbb{R}^{(B \cdot T) \times D}
\]

Step 2: Compute Gating Logits
The gating network (a linear layer) computes raw scores for each expert:
\[
\text{logits} = x_{\text{flat}} W_g + b_g \quad \text{where} \quad W_g \in \mathbb{R}^{D \times N}, \ b_g \in \mathbb{R}^{N}
\]
\[
\text{logits} \in \mathbb{R}^{(B \cdot T) \times N}
\]

 

Step 3: Convert Logits to Probabilities
Apply softmax to get normalized expert scores:
\[
\text{scores} = \text{softmax}(\text{logits}) \in \mathbb{R}^{(B \cdot T) \times N}
\]

Step 4: Select Top-\( k \) Experts
For each token, pick the top-\( k \) experts and their weights:
\[
\text{top\_k\_weights}, \text{top\_k\_indices} = \text{topk}(\text{scores}, k)
\]
\[
\text{top\_k\_weights} \in \mathbb{R}^{(B \cdot T) \times k}, \quad \text{top\_k\_indices} \in \mathbb{R}^{(B \cdot T) \times k}
\]

 

 

Step 5: Create Expert Mask
Construct a sparse mask where only top-\( k \) experts are active:
\[
\text{expert\_mask} \in \mathbb{R}^{(B \cdot T) \times N}
\]
\[
\text{expert\_mask}[i, j] =
\begin{cases}
\text{top\_k\_weights}[i, m] & \text{if } j = \text{top\_k\_indices}[i, m] \\
0 & \text{otherwise}
\end{cases}
\]

Step 6: Compute Expert Outputs
Conceptually, each expert processes only the tokens routed to it; in the formulation below, every expert is applied to all tokens and the unselected outputs are zeroed out by the mask:
\[
E_i(x_{\text{flat}}) \in \mathbb{R}^{(B \cdot T) \times D} \quad \text{for } i \in \{1, \dots, N\}
\]
\[
\text{expert\_outputs} = \text{stack}([E_1(x), \dots, E_N(x)]) \in \mathbb{R}^{(B \cdot T) \times N \times D}
\]

 

Step 7: Weighted Combination
Multiply expert outputs by their gating weights and sum:
\[
y_{\text{flat}} = \sum_{i=1}^N \text{expert\_mask}[:, i] \cdot E_i(x_{\text{flat}})
\]
\[
y_{\text{flat}} \in \mathbb{R}^{(B \cdot T) \times D}
\]

Step 8: Reshape Output
Restore the original batch and sequence dimensions:
\[
y \in \mathbb{R}^{B \times T \times D}
\]

Load Balancing Loss (Optional)
To ensure experts are used equally:
\[
\text{router\_prob} = \text{mean}(\text{expert\_mask}, \text{dim}=0) \in \mathbb{R}^N
\]
\[
\text{expert\_prob} = \text{mean}(\text{scores}, \text{dim}=0) \in \mathbb{R}^N
\]
\[
L_{\text{load}} = N \cdot \sum_{i=1}^N \text{router\_prob}_i \cdot \text{expert\_prob}_i
\]

Summary of Tensor Shapes

  • Input x ([B, T, D]): Original input with batch size B, sequence length T, and embedding dimension D.
  • Flattened x ([B⋅T, D]): Tokens are flattened so each can be processed individually.
  • Gating logits ([B⋅T, N]): Raw scores for the N experts from the gating network.
  • Softmax scores ([B⋅T, N]): Normalized probabilities over the experts.
  • Top-k indices ([B⋅T, k]): Indices of the top k selected experts per token.
  • Top-k weights ([B⋅T, k]): Gating weights (softmax values) for the selected experts.
  • Expert mask ([B⋅T, N]): Sparse mask holding the gating weights of the selected experts (zero elsewhere).
  • Expert outputs ([B⋅T, N, D]): Outputs from all experts (zeroed out if not selected).
  • Combined output ([B⋅T, D]): Weighted sum of the selected expert outputs.
  • Final output y ([B, T, D]): Output reshaped back to match the original input shape.

Key Equations
1. Gating:
\[
\text{scores} = \text{softmax}(W_g x + b_g)
\]
2. Top-\( k \) Selection:
\[
\text{top\_k\_weights}, \text{top\_k\_indices} = \text{topk}(\text{scores}, k)
\]
3. Expert Combination:
\[
y = \sum_{i \in \text{Top-}k} \text{score}_i \cdot E_i(x)
\]
4. Load Balancing Loss:
\[
L_{\text{load}} = N \cdot \sum_{i=1}^N (\text{mean}(\text{mask}_i) \cdot \text{mean}(\text{scores}_i))
\]
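Putting the eight steps and the load-balancing loss together, here is a minimal PyTorch sketch of an MoE layer. It is deliberately didactic: every expert processes every token and the gating mask zeroes out unselected outputs, whereas production systems dispatch only the selected tokens to each expert. The class name, expert architecture, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Didactic Mixture-of-Experts layer following Steps 1-8 above."""

    def __init__(self, dim: int, hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)            # W_g, b_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.k = k
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor):
        B, T, D = x.shape
        x_flat = x.reshape(B * T, D)                                   # Step 1: [B*T, D]

        logits = self.gate(x_flat)                                     # Step 2: [B*T, N]
        scores = F.softmax(logits, dim=-1)                             # Step 3: [B*T, N]

        top_k_weights, top_k_indices = scores.topk(self.k, dim=-1)     # Step 4: [B*T, k]

        # Step 5: sparse mask holding gating weights of the selected experts.
        expert_mask = torch.zeros_like(scores).scatter(1, top_k_indices, top_k_weights)

        # Step 6: apply every expert (didactic; real systems dispatch only routed tokens).
        expert_outputs = torch.stack([expert(x_flat) for expert in self.experts], dim=1)  # [B*T, N, D]

        y_flat = (expert_mask.unsqueeze(-1) * expert_outputs).sum(dim=1)  # Step 7: [B*T, D]
        y = y_flat.reshape(B, T, D)                                    # Step 8: [B, T, D]

        # Optional load-balancing loss: N * sum_i mean(mask_i) * mean(scores_i)
        router_prob = expert_mask.mean(dim=0)                          # [N]
        expert_prob = scores.mean(dim=0)                               # [N]
        load_loss = self.num_experts * torch.sum(router_prob * expert_prob)

        return y, load_loss

# Example: 2 sequences of 10 tokens, 64-dim embeddings, 8 experts, top-2 routing.
moe = MoELayer(dim=64, hidden=256, num_experts=8, k=2)
y, aux = moe(torch.randn(2, 10, 64))
print(y.shape, aux.item())  # torch.Size([2, 10, 64]) and a scalar auxiliary loss
```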



How Mixture of Experts (MoE) Works

MoE operates in two stages:

  1. The training phase
  2. The inference phase

Training phase

Similar to other machine learning models, MoE begins by training on a dataset. However, the training process is not applied to the entire model but is instead conducted on its components individually.

Expert training

Each component of an MoE framework undergoes training on a specific subset of data or tasks. The aim is to enable each component to focus on a particular aspect of the broader problem.

This focus is achieved by providing each component with data relevant to its assigned task. For instance, in a language processing task, one component might concentrate on syntax while another on semantics.

The training for each component follows a standard neural network training process, where the model learns to minimize the loss function for its specific data subset.

Gating network training

The gating network is tasked with learning to select the most suitable expert for a given input.

The gating network is trained alongside the expert networks. It receives the same input as the experts and learns to predict a probability distribution over them. This distribution indicates which expert is best suited to handle the current input.

The gating network is typically trained with an objective that accounts for both its routing decisions and the performance of the experts it selects.

Joint training

In the joint training phase, the entire MoE system, which includes both the expert models and the gating network, is trained together.

This strategy ensures that both the gating network and the experts are optimized to work in harmony. The loss function in joint training combines the losses from the individual experts and the gating network, encouraging a collaborative optimization approach.

The combined loss gradients are then propagated through both the gating network and the expert models, facilitating updates that improve the overall performance of the MoE system.
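As a rough sketch of what this joint objective can look like in code, the snippet below reuses the MoELayer sketch from the gating-network section and combines a task loss with the auxiliary load-balancing loss; the task head, dummy data, and loss weight are placeholders, not values from any particular system.

```python
import torch
import torch.nn.functional as F

# Hypothetical joint-training step: the task loss and the auxiliary
# load-balancing loss are combined, and gradients flow through both the
# gating network and the experts. MoELayer is the sketch defined earlier.
model = MoELayer(dim=64, hidden=256, num_experts=8, k=2)
head = torch.nn.Linear(64, 10)                             # assumed task head (10 classes)
optimizer = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=1e-3)
alpha = 0.01                                               # weight of the auxiliary loss (tuned in practice)

x = torch.randn(4, 16, 64)                                 # dummy batch
targets = torch.randint(0, 10, (4, 16))

y, load_loss = model(x)
task_loss = F.cross_entropy(head(y).reshape(-1, 10), targets.reshape(-1))
loss = task_loss + alpha * load_loss                       # combined objective

optimizer.zero_grad()
loss.backward()                                            # updates both the experts and the gate
optimizer.step()
```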

Inference phase

Inference involves generating outputs by combining the gating network’s routing decisions with the outputs of the selected experts. In MoE, this process is designed to keep inference costs minimal.

Input routing

In the context of MoE, the role of the gating network is pivotal: it decides which experts should process a specific input.

Upon receiving an input, the gating network assesses it and produces a probability distribution across all the experts. This distribution then directs the input to the most suitable experts, leveraging the patterns learned during the training phase. This ensures that the right expertise is applied to each task, optimizing the decision-making process.

Expert selection

Only a few experts, typically one or two, are chosen to process each input. This selection is determined by the probabilities assigned by the gating network.

Choosing a limited number of experts for each input helps make efficient use of computational resources while still benefiting from the specialized knowledge within the MoE framework.

The gating network’s output ensures that the chosen experts are the most appropriate for handling the input, thereby improving the system’s overall efficiency and performance.

Output combination

The last step in the inference process involves merging the outputs from the selected experts.

This merging is often achieved through weighted averaging, where the weights reflect the probabilities assigned by the gating network. In certain scenarios, alternative methods such as voting or learned combination techniques might be employed to merge the expert outputs. The aim is to integrate the varied insights from the selected experts into a unified and accurate final prediction, thereby leveraging the strengths of the MoE architecture.

With the rapid advancement of technology, there is an increasing need for fast, efficient, and optimized techniques to handle large models. MoE is emerging as a promising solution in this regard. What other benefits does MoE offer?

Benefits of Mixture of Experts (MoE)

Mixture of Experts (MoE) architecture offers several advantages:

  1. Performance: By selectively activating only the relevant experts for a given task, MoE models avoid unnecessary computation, leading to improved speed and reduced resource consumption.
  2. Flexibility: The diverse capabilities of experts make MoE models highly flexible. By calling on experts with specialized capabilities, the MoE model can succeed in a wider range of tasks.
  3. Fault tolerance: MoE’s “divide and conquer” approach, where tasks are executed separately, enhances the model’s resilience to failures. If one expert encounters an issue, it doesn’t necessarily affect the entire model’s functionality.
  4. Scalability: Decomposing complex problems into smaller, more manageable tasks helps MoE models handle increasingly complicated inputs.

Applications of Mixture of Experts (MoE)

MoEs have been around for more than 30 years, which has made them a widely used technique across different areas of machine learning.

Natural language processing (NLP)

MoE offers a unique approach to training large models with improved efficiency, faster pre-training, and competitive inference speeds.

In traditional dense models, all parameters are used for all inputs. Sparsity, however, allows the model to run only specific parts of the system based on the input, significantly reducing computation.

One example is Microsoft’s translation API, Z-code. The MoE architecture in Z-code supports a massive scale of model parameters while keeping the amount of compute constant.

Computer vision

Google’s V-MoE, a sparse architecture based on the Vision Transformer (ViT), showcases the effectiveness of MoE in computer vision tasks.

By partitioning images into smaller patches and feeding them to a gating/routing layer, V-MoEs can dynamically select the most appropriate experts for each patch, optimizing both accuracy and efficiency.

A notable advantage of this approach is its flexibility. You can decrease the number of selected experts per token to save time and compute, without any further training on the model weights.

Recommendation systems

MoE has also been successfully applied to recommendation systems. For example, Google researchers have proposed an MMoE (Multi-Gate Mixture of Experts) based ranking system for YouTube video recommendations.

They first group their ranking objectives into two categories: engagement and satisfaction. Given the list of candidate videos from the retrieval step, their ranking system uses candidate, user, and context features to learn to predict the probabilities corresponding to the two categories of user behavior.

One thing to note in this approach is that they did not apply the MoE layer directly to the input because the high dimensionality of input would lead to significant model training and serving costs.

MoEs have seen a wide-scale adoption in the industry for several applications. Their learning procedure divides the task into appropriate subtasks, each of which can be solved by a very simple expert network. This capability translates to parallelizable training and fast inference, making MoEs lucrative for large-scale systems.

Mixture of Experts (MoE): Challenges

MoE models are particularly beneficial in high-throughput scenarios involving many machines. Given a fixed compute budget for pretraining, a sparse model can be more efficient than a dense one.

However, sparse models require substantial memory during execution, as all experts need to be stored in memory. This can be a significant limitation in systems with low VRAM, where such models may struggle.

Let’s explore other limitations of MoEs.

Training complexity

Training MoE models is more complex than training a single model. Here’s why:

  1. Coordination: You need the gating network to learn how to correctly route inputs to the right experts while each expert specializes in different parts of the data. Balancing this can be tricky.
  2. Optimization: The loss function used in joint training must balance the performance of the experts and the gating network, which complicates the optimization process.
  3. Hyperparameter tuning: MoE models have more hyperparameters, such as the number of experts and the architecture of the gating network. Tuning these can be time-consuming and complicated.

Inference efficiency

Inference in MoE models can be less efficient due to a few factors:

  1. Gating network: The gating network needs to run for each input to determine the right experts. This adds extra computation.
  2. Expert selection and activation: Even though only a subset of experts is activated for each input, selecting and activating these experts adds overhead, potentially increasing inference times.
  3. Parallelism: Running multiple experts in parallel can be challenging, especially in environments with limited computational resources. Effective parallelism requires advanced scheduling and resource management.

Increased model size

MoE models tend to be larger than single models due to the multiple experts:

  1. Storage requirements: Storing multiple expert networks and the gating network increases the overall storage needs, which can be a drawback in storage-limited environments.
  2. Memory usage: Training and inference require more memory because multiple models need to be loaded and maintained in memory at the same time. This can be problematic in resource-constrained settings.
  3. Deployment challenges: Deploying MoE models is harder due to their size and complexity. Efficient deployment on various platforms, including edge devices, may require additional optimization and engineering efforts.

Conclusion

In this article, we explored the Mixture of Experts (MoE) technique, a sophisticated approach for scaling neural networks to handle complex tasks and diverse data. MoE uses multiple specialized experts and a gating network to route inputs effectively.

We covered the core components of MoE, including expert networks and the gating network, and discussed its training and inference processes.

Benefits such as improved performance, scalability, and adaptability were highlighted, along with applications in natural language processing, computer vision, and recommendation systems.

Despite challenges in training complexity and model size, MoE offers a promising method for advancing AI capabilities.
