This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/pseud0nym on 2025-02-19 02:02:05+00:00.
Recent research is shedding light on an unexpected problem in modern large language models: the deeper layers aren't pulling their weight.
A recent paper, “The Curse of Depth in Large Language Models”, highlights a critical issue:
- Deep layers in LLMs contribute significantly less to learning than earlier ones.
- Many of these layers can be pruned without serious performance loss, raising questions about training efficiency.
- The culprit? Pre-Layer Normalization (Pre-LN), which causes output variance to explode in deeper layers, making them act almost like identity functions.
- A simple fix? LayerNorm Scaling, which controls this variance and improves training efficiency (see the sketch after this list).
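For anyone who wants to see what the fix might look like in code: as I read the paper, LayerNorm Scaling multiplies each LayerNorm output by 1/sqrt(l), where l is the 1-based layer index. Below is a minimal PyTorch sketch of that idea in a Pre-LN block; the names (`ScaledPreLNBlock`, `layer_idx`, `d_model`, `n_heads`) and the generic attention/MLP sub-layers are mine for illustration, not the paper's code.

```python
import math
import torch
import torch.nn as nn

class ScaledPreLNBlock(nn.Module):
    """Pre-LN transformer block with LayerNorm Scaling (sketch).

    The LayerNorm output feeding each sub-layer is multiplied by
    1 / sqrt(layer_idx), which is meant to keep the residual-stream
    variance from blowing up with depth.
    """

    def __init__(self, d_model: int, n_heads: int, layer_idx: int):
        super().__init__()
        self.scale = 1.0 / math.sqrt(layer_idx)  # layer_idx starts at 1
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard Pre-LN would feed self.ln1(x) straight into the sub-layer;
        # the extra multiplicative factor is the LayerNorm Scaling change.
        h = self.ln1(x) * self.scale
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.ln2(x) * self.scale
        x = x + self.mlp(h)
        return x

# Usage: stack blocks with 1-based layer indices.
blocks = nn.ModuleList(
    ScaledPreLNBlock(d_model=512, n_heads=8, layer_idx=i + 1) for i in range(12)
)
x = torch.randn(2, 16, 512)  # (batch, seq_len, d_model)
for block in blocks:
    x = block(x)
```

Note the scaling sits on the LayerNorm output rather than on the residual branch, so the skip connection itself is left untouched.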
This has major implications for LLM architecture, training efficiency, and scaling laws. If half the layers in models like LLaMA, Mistral, and DeepSeek aren’t contributing effectively, how much computational waste are we dealing with?
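On the "how much waste" question, one crude way to poke at this yourself is to check how close each block's output stays to its input on a trained model; values near 1.0 suggest the block is behaving roughly like an identity map. This is just a toy diagnostic of my own (reusing the hypothetical `blocks` stack from the sketch above), not a measurement from the paper:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_identity_scores(blocks, x):
    """Cosine similarity between each block's input and output.

    Scores near 1.0 mean the block barely changes the residual stream,
    i.e. it is contributing little beyond an identity mapping.
    """
    scores = []
    for block in blocks:
        y = block(x)
        sim = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(sim.item())
        x = y
    return scores

# Usage (only meaningful on trained weights; random init is just a smoke test):
x = torch.randn(2, 16, 512)
for i, s in enumerate(layer_identity_scores(blocks, x), start=1):
    print(f"layer {i:2d}: input/output cosine similarity = {s:.3f}")
```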
Key questions for discussion:
1) Should we be rethinking deep-layer training strategies to improve efficiency?
2) Does this impact the assumption that deeper = better in transformer architectures?
3) Could insights from this paper help with LLM compression, fine-tuning, or distillation techniques?
Paper link: arXiv preprint 2502.05795v1
Let’s discuss—what are your thoughts on the Curse of Depth?