DeepSeek-V3
DeepSeek has released DeepSeek-V3, a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token. The model delivers strong performance while remaining remarkably cost-efficient to train. As an open-source alternative, it competes with models like GPT-4o and Claude-3.5-Sonnet. Full training consumed only 2.788 million H800 GPU hours, roughly $5.576 million assuming a rental price of $2 per GPU hour, making it one of the most economical models of its scale.
Key Innovations
Multi-head Latent Attention (MLA): Improves inference efficiency by compressing keys and values into a small latent vector, so only that latent needs to be cached. This shrinks the Key-Value (KV) cache and speeds up generation without sacrificing performance (see the MLA sketch below).
DeepSeekMoE with Auxiliary-Loss-Free Load Balancing: Keeps expert loads balanced by dynamically adjusting a per-expert routing bias, avoiding the auxiliary losses that can degrade model performance (see the routing sketch below).
Multi-Token Prediction (MTP): Predicts multiple future tokens at each position during training, densifying the training signal and improving data efficiency; the extra prediction modules can also be reused for speculative decoding at inference (see the MTP sketch below).
FP8 Mixed Precision Training: Runs most compute-heavy operations in the low-precision FP8 format with fine-grained scaling, significantly reducing GPU memory usage and accelerating computation while maintaining accuracy (see the FP8 sketch below).
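To make the MLA idea concrete, here is a minimal PyTorch sketch (not DeepSeek's implementation): keys and values are reconstructed from a shared low-rank latent, and only that latent is cached between decoding steps. The dimensions (d_model=512, d_latent=64) are hypothetical, and details such as the decoupled RoPE path, low-rank query compression, and causal masking are omitted.

```python
import torch
import torch.nn as nn

class SimpleMLA(nn.Module):
    """Simplified Multi-head Latent Attention: K/V are reconstructed from a
    small shared latent, so only the latent is cached at inference time."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress to latent
        self.k_up = nn.Linear(d_latent, d_model)      # reconstruct keys
        self.v_up = nn.Linear(d_latent, d_model)      # reconstruct values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.kv_down(x)                       # (b, t, d_latent)
        if latent_cache is not None:                   # append to the latent cache
            latent = torch.cat([latent_cache, latent], dim=1)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5  # causal mask omitted
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent              # cache the latent, not k/v

# toy decoding: prefill 16 tokens, then decode 1 more token reusing the cache
mla = SimpleMLA()
out, cache = mla(torch.randn(1, 16, 512))
out, cache = mla(torch.randn(1, 1, 512), latent_cache=cache)
```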
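A rough sketch of the auxiliary-loss-free routing idea, assuming sigmoid-style positive affinity scores: a per-expert bias is added to the scores only for top-k selection (the gating weights still come from the unbiased scores), and after each step the bias is nudged up for underloaded experts and down for overloaded ones. The update rule and the step size gamma here are illustrative, not DeepSeek's exact schedule.

```python
import torch

def biased_topk_routing(scores, expert_bias, k=2):
    """Select experts using bias-adjusted scores; gating weights still use the
    original scores, so the bias steers load without adding a loss term."""
    biased = scores + expert_bias                  # bias affects selection only
    topk_idx = biased.topk(k, dim=-1).indices
    gates = torch.gather(scores, -1, topk_idx)     # assumes positive scores
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return topk_idx, gates

def update_bias(expert_bias, topk_idx, n_experts, gamma=0.001):
    """After each step, nudge the bias down for overloaded experts and up for
    underloaded ones, steering future routing toward balance."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    expert_bias += gamma * torch.sign(load.mean() - load)
    return expert_bias

# toy usage: 4 tokens routed over 8 experts
scores = torch.rand(4, 8)                          # e.g. sigmoid affinities in [0, 1]
bias = torch.zeros(8)
idx, gates = biased_topk_routing(scores, bias, k=2)
bias = update_bias(bias, idx, n_experts=8)
```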
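The MTP objective can be sketched as an extra cross-entropy term per prediction depth, where depth d predicts the token d positions further ahead of the usual next-token target. This is a simplified stand-in: the actual design chains small transformer modules to produce the deeper predictions, and the weighting factor lam here is a hypothetical choice.

```python
import torch
import torch.nn.functional as F

def mtp_loss(main_logits, mtp_logits_list, targets, lam=0.3):
    """Combine the standard next-token loss with extra losses, one per depth.

    main_logits:     (b, t, vocab) next-token logits from the main model
    mtp_logits_list: list of (b, t, vocab), one tensor per extra depth
    targets:         (b, t) token ids shifted by one vs. the inputs
    """
    b, t, v = main_logits.shape
    loss = F.cross_entropy(main_logits.reshape(-1, v), targets.reshape(-1))
    for d, logits in enumerate(mtp_logits_list, start=1):
        # the depth-d predictions are compared to targets shifted d further ahead
        pred = logits[:, : t - d].reshape(-1, v)
        gold = targets[:, d:].reshape(-1)
        loss = loss + lam * F.cross_entropy(pred, gold)
    return loss

# toy usage: batch of 2, sequence length 8, vocab of 100, one extra depth
main = torch.randn(2, 8, 100)
extra = [torch.randn(2, 8, 100)]
tgt = torch.randint(0, 100, (2, 8))
print(mtp_loss(main, extra, tgt))
```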
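The sketch below illustrates the core FP8 idea with fake quantization: each block of a tensor is scaled so its largest magnitude fits the E4M3 range, cast to 8 bits, then dequantized before a standard matmul. It assumes a recent PyTorch with the float8_e4m3fn dtype; the real training framework runs the matmul itself in FP8 on hardware, with fine-grained per-tile and per-block scaling and higher-precision accumulation.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def fake_fp8(x, block=128):
    """Illustrative per-block scaling: scale each block so its max magnitude
    fits the FP8 range, cast to float8, then dequantize for the matmul.
    Assumes the element count is a multiple of the block size."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)           # lossy 8-bit storage
    return (q.to(torch.float32) * scale).reshape(orig_shape)

# a linear layer computed from fake-quantized inputs and weights; the
# accumulation stays in FP32, mimicking higher-precision accumulation
x = torch.randn(4, 256)
w = torch.randn(512, 256)
y = fake_fp8(x) @ fake_fp8(w).t()
```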