Add PowerPC support (POWER8 VSX + Power Mac G5 AltiVec) #404
Scottcjn wants to merge 10 commits into microsoft:main from
Conversation
Port Microsoft BitNet to IBM POWER8 (ppc64le). Adds scalar fallback implementations for all 5 I2_S dot product kernel functions and the quantize_i2_s function. Also adds PowerPC defines to gemm-config.h.

Benchmarks on POWER8 S824 (16c/128t, 512GB RAM, scalar-only):
- BitNet Large 700M: 21.5 t/s pp128, 11.2 t/s tg32
- BitNet 2B: 8.0 t/s pp128, 4.1 t/s tg32
- Llama3-8B BitNet: 2.6 t/s pp128, 1.6 t/s tg32

Files changed:
- gemm-config.h: Add PARALLEL_SIZE/ROW_BLOCK_SIZE for __VSX__/__powerpc64__
- ggml-bitnet-mad.cpp: Scalar fallbacks for quantize_i2_s, ggml_vec_dot_i2_i8_s_1x1, _1x4_32W, _1xN, _Nx1
- bitnet-lut-kernels.h: Stub header for POWER8 (LUT kernels are x86/ARM)

Next: VSX-optimized kernels using vec_perm for 10-16x speedup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
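For orientation, a scalar sketch of what one I2_S block dot product does. The packing details here (4 two-bit codes per byte, high bits first, code 0/1/2 mapping to -1/0/+1) are assumptions for illustration; the actual layout lives in ggml-bitnet-mad.cpp and may differ.

```c
#include <stdint.h>

/* Hypothetical scalar model: one 32-weight I2_S block dotted against
 * int8 activations.  Assumes 2-bit codes packed 4 per byte, high bits
 * first, with code 0 -> -1, 1 -> 0, 2 -> +1 (illustrative only). */
static int32_t i2s_dot_block_scalar(const uint8_t *w, const int8_t *a) {
    int32_t sum = 0;
    for (int i = 0; i < 8; i++) {              /* 8 bytes -> 32 weights */
        for (int j = 0; j < 4; j++) {
            int code = (w[i] >> (6 - 2 * j)) & 0x3;
            sum += (code - 1) * a[i * 4 + j];  /* ternary weight {-1,0,+1} */
        }
    }
    return sum;
}
```

A loop like this is what the later VSX commit replaces with `vec_msum`, 16 bytes at a time.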
Replace the scalar fallback with POWER8 AltiVec/VSX optimized kernels for all 5 I2_S functions using the vec_msum (vmsummbm) instruction.

Key optimizations:
- vec_msum: 16 signed*unsigned byte multiply-accumulates per cycle
- dcbt prefetch hints for weight/activation data
- i2s_vsx_half() helper processes 16-byte blocks in vectorized form
- All 5 kernels: quantize, 1x1, 1x4_32W, 1xN, Nx1

Benchmark results (POWER8 S824, 64 threads):
- 700M: pp128 21.48 -> 211.48 t/s (9.8x)
- 2B:   pp128  8.04 ->  73.03 t/s (9.1x)
- 8B:   pp128  2.60 ->  27.39 t/s (10.5x)
- 8B:   tg32   1.61 ->   4.90 t/s (3.0x)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
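To make the vec_msum win concrete, here is a portable scalar model of what a single vmsummbm computes: 16 unsigned-byte by signed-byte products, summed in groups of 4 into four int32 accumulator lanes.

```c
#include <stdint.h>

/* Scalar model of vec_msum (vmsummbm): for each of the 4 int32 lanes,
 * accumulate 4 unsigned*signed byte products.  One instruction does all
 * 16 multiply-accumulates on POWER8/G5. */
static void vmsummbm_model(const uint8_t u[16], const int8_t s[16],
                           int32_t acc[4]) {
    for (int lane = 0; lane < 4; lane++)
        for (int k = 0; k < 4; k++)
            acc[lane] += (int32_t)u[lane * 4 + k] * (int32_t)s[lane * 4 + k];
}
```

After the vector loop, a horizontal sum of the 4 lanes yields the scalar dot product.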
Use dcbt with TH=0x10 (L3 resident hint) for weight block prefetch instead of transient dcbt. Keeps weight data sticky in L3 cache between token generation steps, avoiding re-fetch from DRAM.

Key design: only prefetch the next block (not a bulk scan) to avoid overhead on prompt processing. Bulk prefetch hurts pp because BitNet I2_S blocks are tiny (32 bytes) and vec_msum is so fast that prefetch overhead dominates.

Results (vs VSX-only baseline):
- 700M tg32: 22.77 -> 24.02 t/s (+5.5%)
- 2B   tg32: 10.93 -> 11.99 t/s (+9.7%)
- 8B   tg32:  4.90 ->  5.63 t/s (+14.9%)
- pp unchanged (within noise)

Full speedup from scalar baseline:
- 8B pp128: 2.60 -> 26.98 t/s (10.4x)
- 8B tg32:  1.61 ->  5.63 t/s (3.5x)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
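The "next block only" pattern can be sketched portably. The function and constant names below are illustrative; the PR's POWER8 code issues a dcbt with TH=0x10 via inline asm, for which GCC's `__builtin_prefetch` with high temporal locality is shown here as a rough portable analogue.

```c
#include <stddef.h>

/* Illustrative sketch: prefetch exactly one block ahead while summing
 * the current 32-byte block.  No bulk scan, so prompt processing is not
 * penalized by prefetch overhead. */
enum { I2S_BLOCK_BYTES = 32 };  /* hypothetical block size constant */

static long sum_blocks_with_prefetch(const unsigned char *w, int nblocks) {
    long total = 0;
    for (int b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)  /* next block only, never a bulk scan */
            __builtin_prefetch(w + (size_t)(b + 1) * I2S_BLOCK_BYTES, 0, 3);
        for (int i = 0; i < I2S_BLOCK_BYTES; i++)
            total += w[(size_t)b * I2S_BLOCK_BYTES + i];
    }
    return total;
}
```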
Adds POWER8/PowerPC README section with:
- Build instructions for ppc64le
- Three optimization levels explained (scalar, VSX, dcbt)
- Full benchmark tables for 700M, 2B, and 8B models
- Scalar-to-VSX speedup comparison (9-10x)
- Key technical details (vec_msum, dcbt resident, NUMA)
- Model sources and conversion notes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add patches and build infrastructure for running BitNet on PowerPC 970 (Power Mac G5) big-endian systems. The GGUF format is always little-endian on disk; this adds byte-swap support for all multi-byte scalar reads and tensor data.

Key changes:
- g5-big-endian.patch: gguf_fread_val() byte-swap function for the GGUF reader, tensor data byte-swap for F32/F16/I2_S at load time, sizeof(bool)==4 fix for PowerPC GCC
- regex-ppc.h: POSIX regex wrapper replacing broken std::regex on PPC big-endian
- build_g5.sh: Build script with G5-safe compiler flags (-Os)

Tested on a Power Mac G5 Dual 2.0 GHz, Mac OS X 10.5, GCC 10.5.0. Produces coherent text at 4.31 t/s prompt eval, 1.61 t/s generation with bitnet_b1_58-large (728M).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
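The byte-swap idea can be sketched as below. The helper names are illustrative, not the patch's actual API; the real patch hooks this into gguf_fread_val() and the tensor loader.

```c
#include <stdint.h>
#include <string.h>

/* GGUF scalars are little-endian on disk; on a big-endian host each
 * multi-byte read must be swapped.  On little-endian hosts this is a
 * no-op. */
static uint32_t le32_to_host(uint32_t v) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
#else
    return v;
#endif
}

/* F32 tensor data: swap through an integer view so no FP operation
 * touches a byte-swapped (possibly signaling-NaN) pattern. */
static float le_f32_to_host(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits = le32_to_host(bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

I2_S data is byte-granular, but its per-block scale fields are multi-byte and need the same treatment.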
Port POWER8 VSX SIMD kernels to G5 AltiVec using 4 compatibility macros that abstract the ISA differences. One code path, works on both targets.

Compatibility macros:
- I2S_VEC_LD_UC/I2S_VEC_LD_SC: vec_vsx_ld (POWER8) vs vec_ld (G5)
- I2S_DCBT_*: TH-hint dcbt (POWER8) vs basic dcbt (G5)

Key changes across all 4 kernel functions (1x1, 1x4_32W, 1xN, Nx1):
- vec_vsx_ld → I2S_VEC_LD_UC / I2S_VEC_LD_SC (22 sites)
- static const vector arrays → vec_splat_u8() macros (avoids Mach-O alignment issues on old Darwin, generates the vspltisb instruction)
- hsum_i32_4_vsx → hsum_i32_4_ppc with a G5 branch using vec_sums
- POWER8-specific dcbt TH hints → G5-safe basic dcbt fallback
- Architecture guard extended with __powerpc__ and __ppc__
- Build script updated: -Os → -O3 (safe now that vector constants are in-register), added llama-bench target

Scalar baseline on G5 Dual 2.0 GHz: pp5 = 4.31 t/s, tg = 1.61 t/s
Target with AltiVec: 12-20 t/s (3-5x speedup via vmsummbm)

No endian changes needed: vec_msum accumulates all 16 bytes into 4 int32 lanes, then hsum reduces all 4 — the total is identical regardless of lane assignment order on BE vs LE.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
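A sketch of how the two load macros might be defined (the patch's exact definitions and casts may differ). The key ISA difference: VSX's vec_vsx_ld tolerates unaligned addresses, while G5 AltiVec's vec_ld silently masks the low 4 address bits, so the G5 path must guarantee 16-byte alignment at the call sites.

```c
/* Illustrative compatibility macros; real header covers only PPC. */
#if defined(__VSX__)            /* POWER8: unaligned vector loads exist */
#  include <altivec.h>
#  define I2S_VEC_LD_UC(off, p) vec_vsx_ld(off, (const unsigned char *)(p))
#  define I2S_VEC_LD_SC(off, p) vec_vsx_ld(off, (const signed char *)(p))
#elif defined(__ALTIVEC__)      /* G5: only 16-byte-aligned vec_ld */
#  include <altivec.h>
#  define I2S_VEC_LD_UC(off, p) vec_ld(off, (const unsigned char *)(p))
#  define I2S_VEC_LD_SC(off, p) vec_ld(off, (const signed char *)(p))
#endif
```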
- Change build_g5.sh from -O3 to -Os: higher optimization levels cause Bus errors on G5 due to Mach-O ABI stack alignment issues with aggressive vector register spills from GCC 10.
- Add __attribute__((always_inline)) to i2s_ppc_half and hsum_i32_4_ppc: without this, the Mach-O ABI generates VRsave save/restore sequences (mfspr/mtspr, ~20 cycles each) on every function call, devastating performance in the inner dot product loop.
- Recommend -t 1 for G5 inference: a single thread is faster because ggml_barrier() overhead on 870 graph nodes per token exceeds the benefit of 2-thread parallelism.
- Remove llama-bench from the G5 build (C++ compat issues with GCC 10).

G5 AltiVec kernel microbenchmark: 16.1x raw speedup (5.84 vs 0.36 GMAC/s). End-to-end gains are limited by Amdahl's law: matmul is 12-24% of total inference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
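The always_inline fix follows this shape (a simplified stand-in for the real hsum_i32_4_ppc, which operates on a vector register rather than an array): forcing the tiny helpers inline keeps the Darwin PPC ABI from emitting VRsave save/restore around every call.

```c
/* Simplified stand-in for hsum_i32_4_ppc.  GCC requires the `inline`
 * keyword alongside always_inline; without forced inlining, the Darwin
 * PPC ABI wraps each call in mfspr/mtspr VRsave bookkeeping. */
static inline int __attribute__((always_inline))
hsum_i32_4_model(const int v[4]) {
    return v[0] + v[1] + v[2] + v[3];
}
```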
…tize_row_i8_s

Two changes applied via g5-altivec-framework.patch:

1. ggml.c: Add an #elif defined(__ALTIVEC__) GGML_SIMD block after POWER9
   - Vectorizes all ggml_vec_* functions (scale, dot, mad, add, mul)
   - Uses vec_ld/vec_st (aligned) instead of VSX vec_xl/vec_xst
   - Uses vec_madd(a,b,zero) instead of vec_mul (no vmulfp on G5)
   - F16 via scalar FP16<->FP32 conversion (same approach as the WASM backend)

2. ggml-quants.c: Add an AltiVec path in quantize_row_i8_s (168 calls/token)
   - Pass 1: vec_abs + vec_max for finding the max absolute value
   - Pass 2: vec_madd + vec_round + vec_cts + vec_packs for int8 quantization
   - vec_sums leaves its result in element 3 (big-endian PPC)

Build script updated: -O3 for C (safe), -Os for C++ (GCC 10.5 miscompile workaround), -lm for roundf().

Benchmark (bitnet_b1_58-large I2_S, G5 dual 2.0 GHz, -t 1):
- AltiVec+framework: pp6=5.14 t/s, tg=1.51 t/s
- Scalar baseline:   pp6=4.78 t/s, tg=1.84 t/s
(+7.5% prompt processing, -18% generation due to the -Os C++ constraint)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MK_CFLAGS/MK_CXXFLAGS on the make command line override the Makefile's internal += appends, so the -fopenmp added by the GGML_NO_OPENMP= path was silently discarded. The code compiled with GGML_USE_OPENMP defined (via MK_CPPFLAGS, which was not overridden), causing it to take the OpenMP codepath — but without -fopenmp, the compiler treated #pragma omp parallel as a no-op. Only thread 0 ever ran.

Fix: add -fopenmp explicitly to MK_CFLAGS and MK_CXXFLAGS.

Benchmarks (Dual 2.0 GHz G5, BitNet 700M I2_S, -n 20):
- -t 1: 721 ms/token, MUL_MAT 3186 us/call
- -t 2: 498 ms/token, MUL_MAT 2189 us/call (1.45x speedup)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
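A minimal sketch of the fix, assuming llama.cpp-style Makefile variables (the extra flags shown are illustrative): variables set on the make command line replace, rather than append to, the Makefile's internal values, so -fopenmp must appear in the overriding value itself.

```shell
# Command-line variables suppress the Makefile's own
# `MK_CFLAGS += -fopenmp`, so pass the flag explicitly:
make MK_CFLAGS="-Os -fopenmp" MK_CXXFLAGS="-Os -fopenmp"
```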
Replace the scalar scale correction loops in ggml_compute_forward_mul_mat (both GEMM and GEMV I2_S paths) with AltiVec vectorized equivalents.

Algebraic refactoring eliminates the per-element division:
- Original:  (x - act_sums) / act_scales * scale   (sub + div + mul)
- Optimized: x * (scale/act_scales) + (-(act_sums * scale/act_scales))

This maps to a single vec_madd (vmaddfp) processing 4 floats per cycle. Precomputing factor and neg_offset once per column replaces thousands of divisions with two scalar divides in total.

Wrapped in #if defined(__ALTIVEC__) || defined(__VSX__) with a scalar fallback, so non-PPC builds are unaffected.

Benchmarks (Dual 2.0 GHz G5, BitNet 700M I2_S, -t 2):
- Scale correction: 91,588 us → 18,395 us (5.0x faster)
- Per-token total:  498 ms → 490 ms (~1.6% end-to-end)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
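The refactoring is plain algebra and easy to check in scalar code. Function names below are illustrative; the second form is the one that maps onto a single vmaddfp per 4 floats.

```c
/* Original per-element form: sub + div + mul. */
static float scale_correct_ref(float x, float act_sums,
                               float act_scales, float scale) {
    return (x - act_sums) / act_scales * scale;
}

/* Refactored form: factor and neg_offset are hoisted out of the loop
 * (two scalar divides per column), leaving one fused multiply-add per
 * element. */
static float scale_correct_fma(float x, float factor, float neg_offset) {
    return x * factor + neg_offset;   /* vec_madd / vmaddfp shape */
}
```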
👋 Friendly ping — any chance this could get a review? This PR adds PowerPC support (POWER8 VSX + Power Mac G5 AltiVec), which opens BitNet to a whole new architecture. The code is tested and working on real POWER8 hardware. Happy to address any feedback!
@tsong-ms @XsquirrelC @potassiummmm Hey team! This PR has been open for 3 days with no review activity. What it adds:
The code is clean and ready to merge. Would really appreciate a review when you get a chance!

Related: PR #394 also adds POWER8 support (different approach). Happy to consolidate if preferred.
Checking on this?!
Friendly ping — this has been open 4 weeks with no maintainer feedback. PR #394 is now closed in favor of this one (superset). This adds:
CLA is signed, CI passes. Happy to address any feedback or rebase if needed. cc @tsong-ms @XsquirrelC @potassiummmm

Signed-off-by: Scott Boudreaux <scott@elyanlabs.ai>
## Summary

Adds PowerPC support (POWER8 VSX + Power Mac G5 AltiVec). All PPC changes are behind `#ifdef` guards.

## What's included

### Core kernel (`src/ggml-bitnet-mad.cpp`)

- `ggml_gemv_i2_i8_s` / `ggml_gemm_i2_i8_s` with AltiVec `vec_msum` (16 multiply-accumulates per cycle)
- `#elif defined(__VSX__) || defined(__ALTIVEC__)` blocks — zero impact on x86/ARM

### Big-endian support (`patches/g5-big-endian.patch`, `patches/regex-ppc.h`)

### Framework vectorization (`patches/g5-altivec-framework.patch`)

- `#elif defined(__ALTIVEC__)` added to ggml.c's GGML_SIMD macro chain
- `ggml_vec_*` functions (scale, dot, add, mul, mad) via AltiVec
- `quantize_row_i8_s` with `vec_abs` / `vec_packs` / `vec_cts`

### Scale correction vectorization (`patches/g5-altivec-scale.patch`)

- `(x-A)/B*C` → `x*(C/B) + (-(A*C/B))` = a single fused multiply-add (`vec_madd`)

### Build script (`patches/build_g5.sh`)

- G5-safe compiler flags (`-Os` for C++, `-O3` for C)
- `-fopenmp` in compiler flags

## Benchmarks

Power Mac G5 Dual 2.0 GHz (PowerPC 970, 6GB DDR400, Mac OS X 10.5):

IBM POWER8 S824 (16-core, 512GB RAM, ppc64le):

## Files changed

- `src/ggml-bitnet-mad.cpp`
- `include/bitnet-lut-kernels.h`
- `include/gemm-config.h`
- `patches/build_g5.sh`
- `patches/g5-big-endian.patch`
- `patches/g5-altivec-framework.patch`
- `patches/g5-altivec-scale.patch`
- `patches/regex-ppc.h`
- `README.md`

## Test plan

- All PPC code behind `#ifdef` guards, no changes to x86/ARM paths

🤖 Generated with Claude Code