
feat: add BitNet Distillation (BitDistill) pipeline code#382

Open
yushuiwx wants to merge 2 commits into microsoft:main from yushuiwx:feature/bitdistill

Conversation

@yushuiwx

🚀 What does this PR do?

This PR introduces BitNet Distillation (BitDistill), a lightweight distillation and fine-tuning pipeline that converts off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit (ternary) BitNet models for specific downstream tasks.

BitDistill enables task-specific adaptation of BitNet models while preserving strong performance and significantly reducing memory and inference cost.
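To make the "1.58-bit (ternary)" conversion concrete, here is a minimal sketch of round-to-nearest ternary quantization with an absmean scale, in the spirit of BitNet b1.58. This is an illustrative example only; the function name and the per-tensor scaling granularity are our assumptions, not the code in this PR.

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Illustrative ternary (1.58-bit) quantization with absmean scaling.

    Each weight is mapped to {-1, 0, +1} after dividing by the mean
    absolute value of the tensor; w is then approximated by q * scale.
    """
    scale = np.mean(np.abs(w)) + eps          # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)   # values in {-1, 0, +1}
    return q, scale

w = np.array([[0.8, -0.05, -1.2],
              [0.3,  0.0,   2.1]])
q, s = ternary_quantize(w)
# q holds only {-1, 0, +1}; small weights snap to 0, large ones saturate
```

Three distinct weight values carry log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" figure comes from.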


Background & Motivation

While BitNet demonstrates strong efficiency advantages with 1.58-bit weights, directly fine-tuning low-bit models often leads to a noticeable performance gap compared to full-precision counterparts, especially on downstream tasks.

BitDistill addresses this gap by combining:

  • architectural improvements from BitNet,
  • a continual pre-training warm-up stage,
  • and representation-level distillation techniques

to improve the scalability and downstream task performance of low-bit models.

This PR is based on the paper:

BitNet Distillation
https://arxiv.org/abs/2510.13998


Key Techniques Implemented

This implementation includes the following core components:

  1. SubLN module
     • Improves training stability for low-bit models
  2. Continual pre-training warm-up
     • Mitigates the performance gap between full-precision and 1.58-bit models
  3. Multi-head attention distillation
     • Distills attention patterns from full-precision teacher models
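The SubLN idea (item 1 above) is to normalize a sublayer's output once more before its output projection, which keeps the activation scale entering quantized weights stable. Below is a minimal numpy sketch under that reading; all names (`layer_norm`, `sublayer_with_subln`) are ours for illustration and do not reflect this repository's API.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain (affine-free) layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_with_subln(x, sublayer_fn, out_proj):
    """SubLN sketch: apply an extra LayerNorm to the sublayer body's
    output *before* the output projection, then add the residual.

    sublayer_fn stands in for the attention or FFN body; out_proj is
    the (possibly ternary-quantized) output projection matrix.
    """
    h = sublayer_fn(x)       # attention / FFN body
    h = layer_norm(h)        # the extra SubLN normalization
    return x + h @ out_proj  # residual connection

x = np.random.randn(2, 4)
w_out = 0.1 * np.random.randn(4, 4)
y = sublayer_with_subln(x, np.tanh, w_out)
```

The design intent is that `out_proj` always sees inputs with roughly unit scale, so quantization error in the low-bit weights does not compound with activation drift across layers.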

Results & Benefits

  • Achieves comparable task performance to full-precision models across model sizes
  • Enables up to 10× memory reduction
  • Achieves up to 2.65× faster CPU inference

Note: This PR replaces a previous one that mistakenly added BitDistill as a submodule. The code is now fully included.
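The multi-head attention distillation mentioned above can be sketched as a KL divergence between teacher and student attention distributions, MiniLM-style, averaged over heads and query positions. This is a simplified stand-in written for illustration; the exact BitDistill objective may differ in detail (e.g., which relations or layers are matched).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_distill_loss(teacher_scores, student_scores):
    """KL(teacher || student) over attention distributions.

    Scores are pre-softmax logits shaped (heads, queries, keys);
    the loss is averaged over heads and query positions.
    """
    p = softmax(teacher_scores)  # teacher attention probabilities
    q = softmax(student_scores)  # student attention probabilities
    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1)
    return float(kl.mean())

t = np.random.randn(4, 8, 8)
attention_distill_loss(t, t)  # identical distributions give 0.0
```

Because attention patterns are computed in full precision on both sides, this signal survives weight quantization and gives the 1.58-bit student a dense, representation-level target beyond the task labels.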

…fine-tuning

This commit introduces BitDistill, a lightweight distillation framework
that fine-tunes full-precision LLMs into 1.58-bit BitNet models for
task-specific applications.

Key components include:
- SubLN module for training stability
- Multi-head attention distillation inspired by MiniLM
- Continual pre-training warm-up to reduce performance gap

Based on the paper: https://arxiv.org/abs/2510.13998

yushuiwx commented Mar 3, 2026

@microsoft-github-policy-service agree company="Microsoft"

