Differential Transformer V2: Faster Decoding and Improved Stability
Microsoft's Differential Transformer V2 achieves decoding speeds comparable to those of standard Transformers while reducing language modeling loss by 0.02-0.03 at 1T tokens.