
feat: add BitNet Distillation (BitDistill) pipeline code#382

Open
yushuiwx wants to merge 2 commits into microsoft:main from yushuiwx:feature/bitdistill

Conversation

@yushuiwx

🚀 What does this PR do?

This PR introduces BitNet Distillation (BitDistill), a lightweight distillation and fine-tuning pipeline that converts off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit (ternary) BitNet models for specific downstream tasks.

BitDistill enables task-specific adaptation of BitNet models while preserving strong performance and significantly reducing memory and inference cost.
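To make the "1.58-bit (ternary)" conversion concrete, here is a minimal sketch of round-to-nearest ternary quantization with an absmean scale, in the spirit of BitNet b1.58. This is an illustrative example only; the function name and the per-tensor scaling granularity are our assumptions, not the code in this PR.

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Illustrative ternary (1.58-bit) quantization with absmean scaling.

    Each weight is mapped to {-1, 0, +1} after dividing by the mean
    absolute value of the tensor; w is then approximated by q * scale.
    """
    scale = np.mean(np.abs(w)) + eps          # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)   # values in {-1, 0, +1}
    return q, scale

w = np.array([[0.8, -0.05, -1.2],
              [0.3,  0.0,   2.1]])
q, s = ternary_quantize(w)
# q holds only {-1, 0, +1}; small weights snap to 0, large ones saturate
```

Three distinct weight values carry log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" figure comes from.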


Background & Motivation

While BitNet demonstrates strong efficiency advantages with 1.58-bit weights, directly fine-tuning low-bit models often leads to a noticeable performance gap compared to full-precision counterparts, especially on downstream tasks.

BitDistill addresses this gap by combining:

  • architectural improvements from BitNet,
  • a continual pre-training warm-up stage,
  • and representation-level distillation techniques

to improve the scalability and downstream task performance of low-bit models.

This PR is based on the paper:

BitNet Distillation
https://arxiv.org/abs/2510.13998


Key Techniques Implemented

This implementation includes the following core components:

  1. SubLN module
     • Improves training stability for low-bit models
  2. Continual pre-training warm-up
     • Mitigates the performance gap between full-precision and 1.58-bit models
  3. Multi-head attention distillation
     • Distills attention patterns from full-precision teacher models
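The SubLN idea (item 1 above) is to normalize a sublayer's output once more before its output projection, which keeps the activation scale entering quantized weights stable. Below is a minimal numpy sketch under that reading; all names (`layer_norm`, `sublayer_with_subln`) are ours for illustration and do not reflect this repository's API.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain (affine-free) layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_with_subln(x, sublayer_fn, out_proj):
    """SubLN sketch: apply an extra LayerNorm to the sublayer body's
    output *before* the output projection, then add the residual.

    sublayer_fn stands in for the attention or FFN body; out_proj is
    the (possibly ternary-quantized) output projection matrix.
    """
    h = sublayer_fn(x)       # attention / FFN body
    h = layer_norm(h)        # the extra SubLN normalization
    return x + h @ out_proj  # residual connection

x = np.random.randn(2, 4)
w_out = 0.1 * np.random.randn(4, 4)
y = sublayer_with_subln(x, np.tanh, w_out)
```

The design intent is that `out_proj` always sees inputs with roughly unit scale, so quantization error in the low-bit weights does not compound with activation drift across layers.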

Results & Benefits

  • Achieves comparable task performance to full-precision models across model sizes
  • Enables up to 10× memory reduction
  • Achieves up to 2.65× faster CPU inference

Note: This PR replaces a previous one that mistakenly added BitDistill as a submodule. The code is now fully included.
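The multi-head attention distillation mentioned above can be sketched as a KL divergence between teacher and student attention distributions, MiniLM-style, averaged over heads and query positions. This is a simplified stand-in written for illustration; the exact BitDistill objective may differ in detail (e.g., which relations or layers are matched).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_distill_loss(teacher_scores, student_scores):
    """KL(teacher || student) over attention distributions.

    Scores are pre-softmax logits shaped (heads, queries, keys);
    the loss is averaged over heads and query positions.
    """
    p = softmax(teacher_scores)  # teacher attention probabilities
    q = softmax(student_scores)  # student attention probabilities
    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1)
    return float(kl.mean())

t = np.random.randn(4, 8, 8)
attention_distill_loss(t, t)  # identical distributions give 0.0
```

Because attention patterns are computed in full precision on both sides, this signal survives weight quantization and gives the 1.58-bit student a dense, representation-level target beyond the task labels.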

…fine-tuning

This commit introduces BitDistill, a lightweight distillation framework
that fine-tunes full-precision LLMs into 1.58-bit BitNet models for
task-specific applications.

Key components include:
- SubLN module for training stability
- Multi-head attention distillation inspired by MiniLM
- Continual pre-training warm-up to reduce performance gap

Based on the paper: https://arxiv.org/abs/2510.13998

yushuiwx commented Mar 3, 2026

@microsoft-github-policy-service agree company="Microsoft"

