Introducing SmolLM3: Small, Efficient, and Highly Capable
The AI community continues to push the boundaries of small language models (SLMs), proving that bigger isn’t always better. Today, we’re excited to introduce SmolLM3, a 3B-parameter model that outperforms competitors like Llama-3.2-3B and Qwen2.5-3B while rivaling larger 4B models (Qwen3 & Gemma3).
What makes SmolLM3 special?
✅ Multilingual (English, French, Spanish, German, Italian, Portuguese)
✅ 128K long-context support (via NoPE + YaRN)
✅ Dual-mode reasoning (think/no_think for explicit vs. direct answers)
✅ Fully open weights & training recipe
Why SmolLM3 Stands Out
Performance Highlights
- Beats Llama-3.2-3B & Qwen2.5-3B across reasoning, math, and coding tasks
- Competitive with 4B models (Qwen3-4B, Gemma3-4B) at lower compute cost
- Strong multilingual ability (tested on Global MMLU, MLMM HellaSwag, Belebele)
- 128K context handling (via NoPE + YaRN extrapolation; see the sketch below)
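SmolLM3 reaches 128K tokens by extrapolating beyond its trained context window with YaRN. Below is a minimal sketch of what a YaRN override could look like when loading the model with transformers; the scaling factor is illustrative, and whether SmolLM3's config accepts exactly these keys is an assumption to verify against the model card.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"

# Illustrative YaRN override: factor 2.0 would double the usable context window.
# The key names and the factor are assumptions, not documented SmolLM3 settings.
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"rope_type": "yarn", "factor": 2.0}

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```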
Key Architectural Improvements
- Grouped Query Attention (GQA) – Reduces KV cache size without performance loss (see the sketch after this list)
- NoPE (No Positional Embeddings in select layers) – Better long-context handling
- Intra-Document Masking – Improves training stability for long sequences
- Embedding Layer Optimization – Removes weight decay for smoother training
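To make the GQA saving concrete, here is a back-of-the-envelope sketch comparing KV-cache memory for standard multi-head attention against grouped-query attention. The layer counts and head dimensions are illustrative placeholders, not SmolLM3's actual configuration.

```python
def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Rough KV-cache size: keys + values for every layer, stored in bf16/fp16."""
    return 2 * num_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative placeholder dimensions (not SmolLM3's real config).
layers, query_heads, head_dim, seq_len = 36, 16, 128, 65_536

mha = kv_cache_bytes(layers, kv_heads=query_heads, head_dim=head_dim, seq_len=seq_len)
gqa = kv_cache_bytes(layers, kv_heads=4, head_dim=head_dim, seq_len=seq_len)  # 4 KV groups

print(f"MHA cache: {mha / 2**30:.1f} GiB vs. GQA cache: {gqa / 2**30:.1f} GiB")
# GQA stores one K/V pair per group instead of per query head, a 4x cache reduction here.
```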
The Full Training Blueprint
Unlike proprietary models, we’re releasing everything:
- Data mixtures (11.2T tokens across web, math, and code)
- Three-stage pretraining (progressive domain specialization)
- Mid-training for reasoning & long-context adaptation
- Post-training with SFT & Anchored Preference Optimization (APO); a rough APO sketch follows this list
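For the post-training step, here is a rough illustration of how anchored preference optimization can be run with off-the-shelf tooling: TRL exposes APO as a loss variant of its DPO trainer. The dataset, hyperparameters, and output path below are placeholders, and this is a sketch, not the exact recipe used for SmolLM3.

```python
# Hedged sketch: preference tuning with TRL's APO loss variant.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Any preference dataset with prompt/chosen/rejected columns works; this one is a placeholder.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="smollm3-apo",   # placeholder path
    loss_type="apo_zero",       # anchored preference optimization variant
    beta=0.1,                   # illustrative value
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```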
Three-Stage Pretraining
| Stage | Web | Code | Math | Focus |
|---|---|---|---|---|
| Stage 1 (0-8T) | 85% | 12% | 3% | General capabilities |
| Stage 2 (8-10T) | 75% | 15% | 10% | Higher-quality math/code |
| Stage 3 (10-11.1T) | 63% | 24% | 13% | Reasoning & long-context |
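As a toy illustration of how stage mixtures like these are typically applied, the sketch below draws a training domain per document according to the Stage 3 weights from the table. It is illustrative only, not SmolLM3's actual data pipeline.

```python
import random

# Stage 3 mixture from the table above, used as sampling weights.
stage3_mixture = {"web": 0.63, "code": 0.24, "math": 0.13}

def sample_domain(mixture, rng=random):
    """Pick a data domain with probability proportional to its mixture weight."""
    domains, weights = zip(*mixture.items())
    return rng.choices(domains, weights=weights, k=1)[0]

# Each training document would then be drawn from the sampled domain's corpus.
counts = {domain: 0 for domain in stage3_mixture}
for _ in range(10_000):
    counts[sample_domain(stage3_mixture)] += 1
print(counts)  # roughly 6300 web, 2400 code, 1300 math
```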
Dual-Mode Reasoning: Think vs. No-Think
SmolLM3 supports two response modes:
- /think – Explicit reasoning traces (like Chain-of-Thought)
- /no_think – Direct answers (faster inference)
Example:
```python
messages = [
    {"role": "system", "content": "/think"},  # or "/no_think"
    {"role": "user", "content": "Explain quantum entanglement."},
]
```
Tool Calling Support
```python
tools = [{
    "name": "get_weather",
    "description": "Fetch weather data",
    "parameters": {"city": {"type": "string"}}
}]
```
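A minimal sketch of wiring that tool list into generation follows. The xml_tools keyword mirrors the SmolLM3 model card's tool-calling example, but treat the keyword, the parsing step, and the message flow as assumptions to verify against the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    xml_tools=tools,              # tool schemas defined above (keyword is an assumption)
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=256)

# The model should emit a structured call to get_weather; your application parses it,
# runs the tool, and appends the result as a new message before generating again.
print(tokenizer.decode(outputs[0][inputs.shape[1]:]))
```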
⚡ How to Use SmolLM3
Install & Run
```bash
pip install -U transformers
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Generate with reasoning
messages = [{"role": "user", "content": "Solve 3x + 5 = 20"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
Recommended Sampling: temperature=0.6, top_p=0.95
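These settings plug directly into generate, continuing the snippet above (max_new_tokens here is just an illustrative value):

```python
# Apply the recommended sampling settings to the earlier generation call.
outputs = model.generate(
    inputs,
    max_new_tokens=512,   # illustrative budget
    do_sample=True,       # enable sampling so temperature/top_p take effect
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```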
Why This Matters
SmolLM3 proves that small models can be highly capable when optimized correctly. Key takeaways:
✅ Efficiency matters – 3B models can rival 4B with the right training
✅ Long-context is achievable – NoPE + YaRN enables 128K support
✅ Open weights & recipes accelerate research – No more black-box models!
Resources
- GitHub Repo (Training configs & eval code)
- Model Collection (Quantized checkpoints)
- Training Logs
What will YOU build with SmolLM3? Let us know in the comments!