DeepSeek-V3
DeepSeek has released DeepSeek-V3, a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token. The model delivers strong performance while remaining remarkably cost-efficient to train. As an open-source alternative, it competes with models like GPT-4o and Claude-3.5-Sonnet. Full training consumed only 2.788 million H800 GPU hours, roughly $5.576 million assuming a rental price of $2 per GPU hour, making it one of the most economical models of its scale.
Key Innovations
Multi-head Latent Attention (MLA): Improves inference efficiency by compressing keys and values into a small latent vector, so only that latent needs to be cached. This shrinks the Key-Value (KV) cache and speeds up generation without sacrificing performance (see the MLA sketch below).
DeepSeekMoE with Auxiliary-Loss-Free Load Balancing: Keeps expert loads balanced by dynamically adjusting a per-expert routing bias, avoiding the auxiliary losses that can degrade model performance (see the routing sketch below).
Multi-Token Prediction (MTP): Predicts multiple future tokens at each position during training, densifying the training signal and improving data efficiency; the extra prediction modules can also be reused for speculative decoding at inference (see the MTP sketch below).
FP8 Mixed Precision Training: Runs most compute-heavy operations in the low-precision FP8 format with fine-grained scaling, significantly reducing GPU memory usage and accelerating computation while maintaining accuracy (see the FP8 sketch below).
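To make the MLA idea concrete, here is a minimal PyTorch sketch (not DeepSeek's implementation): keys and values are reconstructed from a shared low-rank latent, and only that latent is cached between decoding steps. The dimensions (d_model=512, d_latent=64) are hypothetical, and details such as the decoupled RoPE path, low-rank query compression, and causal masking are omitted.

```python
import torch
import torch.nn as nn

class SimpleMLA(nn.Module):
    """Simplified Multi-head Latent Attention: K/V are reconstructed from a
    small shared latent, so only the latent is cached at inference time."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress to latent
        self.k_up = nn.Linear(d_latent, d_model)      # reconstruct keys
        self.v_up = nn.Linear(d_latent, d_model)      # reconstruct values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.kv_down(x)                       # (b, t, d_latent)
        if latent_cache is not None:                   # append to the latent cache
            latent = torch.cat([latent_cache, latent], dim=1)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5  # causal mask omitted
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent              # cache the latent, not k/v

# toy decoding: prefill 16 tokens, then decode 1 more token reusing the cache
mla = SimpleMLA()
out, cache = mla(torch.randn(1, 16, 512))
out, cache = mla(torch.randn(1, 1, 512), latent_cache=cache)
```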
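A rough sketch of the auxiliary-loss-free routing idea, assuming sigmoid-style positive affinity scores: a per-expert bias is added to the scores only for top-k selection (the gating weights still come from the unbiased scores), and after each step the bias is nudged up for underloaded experts and down for overloaded ones. The update rule and the step size gamma here are illustrative, not DeepSeek's exact schedule.

```python
import torch

def biased_topk_routing(scores, expert_bias, k=2):
    """Select experts using bias-adjusted scores; gating weights still use the
    original scores, so the bias steers load without adding a loss term."""
    biased = scores + expert_bias                  # bias affects selection only
    topk_idx = biased.topk(k, dim=-1).indices
    gates = torch.gather(scores, -1, topk_idx)     # assumes positive scores
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return topk_idx, gates

def update_bias(expert_bias, topk_idx, n_experts, gamma=0.001):
    """After each step, nudge the bias down for overloaded experts and up for
    underloaded ones, steering future routing toward balance."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    expert_bias += gamma * torch.sign(load.mean() - load)
    return expert_bias

# toy usage: 4 tokens routed over 8 experts
scores = torch.rand(4, 8)                          # e.g. sigmoid affinities in [0, 1]
bias = torch.zeros(8)
idx, gates = biased_topk_routing(scores, bias, k=2)
bias = update_bias(bias, idx, n_experts=8)
```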
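The MTP objective can be sketched as an extra cross-entropy term per prediction depth, where depth d predicts the token d positions further ahead of the usual next-token target. This is a simplified stand-in: the actual design chains small transformer modules to produce the deeper predictions, and the weighting factor lam here is a hypothetical choice.

```python
import torch
import torch.nn.functional as F

def mtp_loss(main_logits, mtp_logits_list, targets, lam=0.3):
    """Combine the standard next-token loss with extra losses, one per depth.

    main_logits:     (b, t, vocab) next-token logits from the main model
    mtp_logits_list: list of (b, t, vocab), one tensor per extra depth
    targets:         (b, t) token ids shifted by one vs. the inputs
    """
    b, t, v = main_logits.shape
    loss = F.cross_entropy(main_logits.reshape(-1, v), targets.reshape(-1))
    for d, logits in enumerate(mtp_logits_list, start=1):
        # the depth-d predictions are compared to targets shifted d further ahead
        pred = logits[:, : t - d].reshape(-1, v)
        gold = targets[:, d:].reshape(-1)
        loss = loss + lam * F.cross_entropy(pred, gold)
    return loss

# toy usage: batch of 2, sequence length 8, vocab of 100, one extra depth
main = torch.randn(2, 8, 100)
extra = [torch.randn(2, 8, 100)]
tgt = torch.randint(0, 100, (2, 8))
print(mtp_loss(main, extra, tgt))
```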
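The sketch below illustrates the core FP8 idea with fake quantization: each block of a tensor is scaled so its largest magnitude fits the E4M3 range, cast to 8 bits, then dequantized before a standard matmul. It assumes a recent PyTorch with the float8_e4m3fn dtype; the real training framework runs the matmul itself in FP8 on hardware, with fine-grained per-tile and per-block scaling and higher-precision accumulation.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def fake_fp8(x, block=128):
    """Illustrative per-block scaling: scale each block so its max magnitude
    fits the FP8 range, cast to float8, then dequantize for the matmul.
    Assumes the element count is a multiple of the block size."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)           # lossy 8-bit storage
    return (q.to(torch.float32) * scale).reshape(orig_shape)

# a linear layer computed from fake-quantized inputs and weights; the
# accumulation stays in FP32, mimicking higher-precision accumulation
x = torch.randn(4, 256)
w = torch.randn(512, 256)
y = fake_fp8(x) @ fake_fp8(w).t()
```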