Skip to content

[REFACTOR][TIR] Remove body from AllocBuffer and DeclBuffer#18876

Open
tqchen wants to merge 4 commits intoapache:mainfrom
tqchen:remove-allocbuffer-declbuffer-body-v2
Open

[REFACTOR][TIR] Remove body from AllocBuffer and DeclBuffer#18876
tqchen wants to merge 4 commits intoapache:mainfrom
tqchen:remove-allocbuffer-declbuffer-body-v2

Conversation

@tqchen
Copy link
Member

@tqchen tqchen commented Mar 5, 2026

Summary

  • Remove body field from AllocBufferNode and DeclBufferNode, making them flat statements consistent with Bind
  • Buffer scope extends to end of enclosing scope via flat SeqStmt semantics
  • 60 files changed across core IR, codegen backends, transforms, script IR builder, and tests

Test plan

  • All existing test suites pass (tir-transform, tir-base, tvmscript, s_tir, codegen, C++)

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of the TIR (Tensor IR) by removing the explicit 'body' field from AllocBuffer and DeclBuffer statements. This change simplifies the IR structure, aligning these buffer operations with other flat statements like Bind. The primary impact is a more consistent and streamlined representation of buffer allocation and declaration, requiring extensive updates across the compiler stack to adapt to this new flat semantic model. The refactor aims to improve the clarity and maintainability of the IR, ensuring that buffer lifetimes are managed through sequential statement ordering rather than nested scopes.

Highlights

  • Core IR Refactoring: The body field has been removed from AllocBufferNode and DeclBufferNode, making these statements 'flat' and consistent with Bind semantics. This means buffer scope now implicitly extends to the end of the enclosing SeqStmt.
  • TVMScript IR Builder Updates: The Python T.alloc_buffer and T.decl_buffer functions in TVMScript now directly return Buffer objects and emit flat statements, rather than being used as context managers. This simplifies their usage and aligns with the new IR structure.
  • Widespread Adaption: Over 60 files across core IR, codegen backends (LLVM, NVPTX, AMDGPU, C, CUDA, Metal, WebGPU, SPIRV), various transforms (e.g., compact_buffer_region, inject_double_buffer, lower_thread_allreduce), and analysis passes have been updated to correctly handle the new flat AllocBuffer and DeclBuffer nodes.
  • Deferred Remapping for Flat IR: New mechanisms, such as pending_dbuffer_allocs_ in InjectDoubleBuffer and a DeferredRemapper pass in LowerThreadAllreduce, have been introduced to correctly handle buffer remappings in the context of flat IR where allocation statements might appear before their remapping logic.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • include/tvm/script/ir_builder/tir/ir.h
    • Updated DeclBuffer and AllocBuffer FFI functions to return Buffer instead of frame objects.
  • include/tvm/tir/stmt.h
    • Removed the body field from DeclBufferNode and AllocBufferNode.
    • Adjusted constructors for DeclBuffer and AllocBuffer to no longer accept a body parameter.
  • python/tvm/script/ir_builder/tir/ir.py
    • Modified alloc_buffer and decl_buffer Python functions to return Buffer objects directly.
    • Updated docstrings and examples to reflect the non-context-manager usage.
  • python/tvm/tir/stmt.py
    • Removed the body attribute and corresponding parameter from AllocBuffer and DeclBuffer Python classes.
  • src/relax/op/tensor/inspect.cc
    • Adapted LegalizeTensorShape to use the flat DeclBuffer within a SeqStmt.
  • src/s_tir/analysis/calculate_allocated_memory.cc
    • Modified AllocBufferCalculator to visit the AllocBufferNode directly instead of its now-removed body.
  • src/s_tir/analysis/estimate_flops.cc
    • Updated FlopEstimator to handle AllocBufferNode and DeclBufferNode without a body.
  • src/s_tir/backend/adreno/inject_texture_alloc.cc
    • Adjusted TextureAllocInjector to construct SeqStmt with flat Bind and AllocBuffer statements.
  • src/s_tir/schedule/primitive/cache_read_write.cc
    • Simplified InsertCacheStage and CacheLocDetector by removing logic that previously traversed nested DeclBuffer/AllocBuffer bodies.
  • src/s_tir/schedule/transform.cc
    • Removed code that peeled off DeclBuffer and AllocBuffer nodes from SBlock bodies in LeafBlockRemovalPlan.
  • src/s_tir/transform/compact_buffer_region.cc
    • Updated BufferAccessRegionCollector to track pending flat AllocBuffer nodes and compact them upon scope exit.
    • Introduced CompactPendingFlatAllocBuffers method to process deferred AllocBuffer compaction.
  • src/s_tir/transform/inject_double_buffer.cc
    • Modified DoubleBufferInjector to defer processing of AllocBuffer nodes, storing them in pending_dbuffer_allocs_ for later re-emission.
  • src/s_tir/transform/inject_ptx_ldg32.cc
    • Replaced nested AllocBuffer statements with SeqStmt::Flatten for AllocBuffer.
  • src/s_tir/transform/inject_virtual_thread.cc
    • Updated VarTouchedAnalysis and VTInjector to reflect the absence of a body in AllocBuffer nodes.
  • src/s_tir/transform/lower_match_buffer.cc
    • Adjusted MatchBufferLower to correctly handle buffer remapping with flat IR, ensuring remapped buffers are accessible.
  • src/s_tir/transform/lower_opaque_block.cc
    • Used SeqStmt::Flatten to combine AllocBuffer with the block body.
  • src/s_tir/transform/lower_thread_allreduce.cc
    • Introduced deferred remapping logic for AllocBuffer and DeclBuffer nodes in ThreadAllreduceBuilder.
    • Added a DeferredRemapper post-processing pass to apply these deferred remappings for flat IR.
    • Modified allocation prepending to use SeqStmt::Flatten for new allocations.
  • src/s_tir/transform/lower_vtcm_alloc.cc
    • Updated VtcmAllocator to directly return a Bind statement for AllocBuffer without a nested body.
  • src/s_tir/transform/merge_shared_memory_allocations.cc
    • Modified SharedMemoryRewriter to use SeqStmt::Flatten for new AllocBuffer statements.
    • Changed AllocBuffer and DeclBuffer visit methods to return Evaluate(0) when the buffer is merged or unused.
  • src/s_tir/transform/profile_instrumentation.cc
    • Updated LoopAnalyzer to no longer traverse the body of AllocBufferNode.
  • src/script/ir_builder/tir/frame.cc
    • Modified AllocateFrameNode and DeclBufferFrameNode to emit flat AllocBuffer/DeclBuffer statements followed by their body statements, handling single vs. multiple statements for parent frames.
  • src/script/ir_builder/tir/ir.cc
    • Updated DeclBuffer and AllocBuffer FFI functions to directly add statements to the parent and return the Buffer object.
  • src/script/printer/tir/stmt.cc
    • Adjusted DeclBufferDoc and AllocBuffer printing logic to reflect the flat statement structure, removing context manager syntax.
  • src/target/llvm/codegen_amdgpu.cc
    • Removed VisitStmt(op->body) call for AllocBufferNode.
  • src/target/llvm/codegen_llvm.cc
    • Removed VisitStmt(op->body) calls for AllocBufferNode and DeclBufferNode.
  • src/target/llvm/codegen_nvptx.cc
    • Removed VisitStmt(op->body) call for AllocBufferNode.
  • src/target/source/codegen_c.cc
    • Removed PrintStmt(op->body) calls for DeclBufferNode and AllocBufferNode.
  • src/target/source/codegen_cuda.cc
    • Removed PrintStmt(op->body) call for AllocBufferNode.
  • src/target/source/codegen_metal.cc
    • Removed PrintStmt(op->body) call for AllocBufferNode.
  • src/target/source/codegen_webgpu.cc
    • Removed PrintStmt(op->body) call for AllocBufferNode.
  • src/target/spirv/codegen_spirv.cc
    • Removed VisitStmt(op->body) calls for AllocBufferNode and DeclBufferNode.
  • src/tir/ir/script/script_complete.cc
    • Simplified ScriptCompleter for AllocBufferNode and DeclBufferNode to only register the buffer, as they no longer have bodies.
  • src/tir/ir/stmt.cc
    • Updated DeclBuffer and AllocBuffer constructors to remove the body parameter.
  • src/tir/ir/stmt_functor.cc
    • Removed VisitStmt(op->body) calls from StmtVisitor and StmtMutator for AllocBufferNode and DeclBufferNode.
  • src/tir/ir/tir_visitor_with_path.cc
    • Modified TIRVisitorWithPath to push AllocBuffer and DeclBuffer definitions into the current scope as siblings, rather than visiting a nested body.
  • src/tir/ir/tir_visitor_with_path.h
    • Updated TIRVisitorWithPath to allow DefContext<Buffer> in bind_scope_.
  • src/tir/transform/common_subexpr_elim.cc
    • Simplified CommonSubexpressionEliminator for AllocBufferNode and DeclBufferNode to reflect their new flat semantics.
  • src/tir/transform/flatten_buffer.cc
    • Used SeqStmt::Flatten when inserting DeclBuffer statements.
  • src/tir/transform/ir_utils.cc
    • Updated MergeNest to use SeqStmt::Flatten for AllocBuffer and DeclBuffer nodes.
    • Simplified IRConvertSSA for DeclBufferNode and AllocBufferNode by removing scope management around their bodies.
  • src/tir/transform/lower_tvm_builtin.cc
    • Adapted BuiltinLower to handle flat AllocBuffer nodes, pushing free statements to the enclosing scope and constructing the allocation as a SeqStmt.
  • src/tir/transform/lower_warp_memory.cc
    • Adapted WarpAccessRewriter to take a body parameter for Rewrite and used SeqStmt::Flatten for AllocBuffer.
    • Modified WarpMemoryRewriter to process SeqStmt to find and rewrite warp AllocBuffer nodes.
  • src/tir/transform/remove_no_op.cc
    • Removed logic that checked for body in AllocBufferNode and DeclBufferNode.
  • src/tir/transform/split_host_device.cc
    • Used SeqStmt::Flatten when inserting DeclBuffer statements.
  • src/tir/transform/storage_rewrite.cc
    • Updated StoragePlanRewriter to return Evaluate(0) for stripped AllocBuffer/DeclBuffer nodes and used flat AllocBuffer in PrepareNewAlloc.
  • src/tir/transform/tvm_ffi_binder.cc
    • Used flat DeclBuffer in DecodeAllParams.
  • src/tir/transform/unsupported_dtype_legalize.cc
    • Updated ComputeLegalizer and StorageLegalizer to use flat DeclBuffer and AllocBuffer.
  • tests/cpp/ir_functor_test.cc
    • Updated test cases to reflect AllocBuffer and DeclBuffer no longer having bodies, using SeqStmt to combine them with other statements.
  • tests/python/s_tir/transform/test_s_tir_transform_rewrite_unsafe_select.py
    • Adjusted access paths for Evaluate nodes due to the new flat SeqStmt structure.
  • tests/python/tir-base/test_tir_constructor.py
    • Removed assertions related to the body field for AllocBuffer.
  • tests/python/tir-base/test_tir_specialize.py
    • Adjusted access path for A_buf due to the new flat SeqStmt structure.
  • tests/python/tir-transform/test_tir_transform_convert_ssa.py
    • Updated test case to use SeqStmt for AllocBuffer and Evaluate.
  • tests/python/tir-transform/test_tir_transform_lower_tvm_builtin.py
    • Removed structural equality assertion and added checks for specific keywords in the script output to verify lowered builtins and flat DeclBuffer.
  • tests/python/tir-transform/test_tir_transform_narrow_datatype.py
    • Updated lower_func_body to correctly navigate SeqStmt to find For nodes.
  • tests/python/tir-transform/test_tir_transform_remove_no_op.py
    • Added xfail markers for tests related to dead allocation removal, indicating it's not yet implemented for flat AllocBuffer.
  • tests/python/tir-transform/test_tir_transform_simplify.py
    • Updated navigation logic to find For loops within SeqStmt.
  • tests/python/tir-transform/test_tir_transform_storage_rewrite.py
    • Adjusted assertions to account for AllocBuffer being the first element of a SeqStmt.
  • tests/python/tir-transform/test_tir_transform_unroll_loop.py
    • Updated test cases to use flat AllocBuffer and DeclBuffer syntax.
  • tests/python/tvmscript/test_tvmscript_ir_builder_tir.py
    • Updated assertions for T.allocate and T.decl_buffer to expect a SeqStmt containing the buffer statement and an Evaluate statement.
  • tests/python/tvmscript/test_tvmscript_printer_structural_equal.py
    • Adjusted AccessPath for structural equality checks due to SeqStmt wrapping.
  • tests/python/tvmscript/test_tvmscript_printer_tir.py
    • Updated expected print output for T.allocate and T.decl_buffer to reflect flat statements and SeqStmt usage.
Activity
  • The pull request introduces a fundamental change to the TIR structure by removing the body field from AllocBuffer and DeclBuffer nodes.
  • This refactoring required extensive modifications across the codebase, touching core IR definitions, script builders, code generation backends, and various transformation passes.
  • The author has confirmed that all existing test suites pass, indicating a successful integration of this significant change.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant refactoring that removes the body from AllocBuffer and DeclBuffer nodes in TIR, making them flat statements. This change simplifies the IR structure and has been applied consistently across a large number of files, including core IR definitions, visitors, mutators, transform passes, and codegen backends. The updates to tests and IR builder utilities also reflect these new semantics. The overall change is well-executed. However, I've found a potential issue in the AllocBufferCalculator which seems to no longer correctly calculate peak memory usage due to the change in allocation scoping.

Note: Security Review did not run due to the size of the PR.

Make AllocBuffer and DeclBuffer flat (no body field), like Bind.
Buffer scope extends to end of enclosing scope instead of being
defined by a nested body.

- Remove body field from AllocBufferNode/DeclBufferNode
- Update all visitors, mutators, codegen backends
- Add scope-stack tracking in calculate_allocated_memory and
  lower_tvm_builtin for correct peak memory and LIFO free ordering
- Remove obsolete AllocateFrame/DeclBufferFrame/AllocBufferFrame
- Remove T.allocate() (replaced by T.alloc_buffer())
- Two-pass approach in lower_warp_memory (pre-scan function body)
- Deferred remapping in lower_thread_allreduce for flat IR
@tqchen tqchen force-pushed the remove-allocbuffer-declbuffer-body-v2 branch from d22b208 to 2b9541f Compare March 5, 2026 19:49
@tqchen
Copy link
Member Author

tqchen commented Mar 5, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR is a major refactoring to remove the body from AllocBuffer and DeclBuffer nodes, making them flat statements. This simplifies the IR structure and makes it more consistent with Bind. The changes are extensive, touching many files across the codebase, including IR definitions, IR builders, printers, and numerous analysis and transformation passes.

The overall refactoring is well-executed. Many passes have been correctly updated to handle the new flat semantics, often by adopting patterns like deferred processing or analyzing SeqStmt to understand allocation scopes. This is a significant improvement to TIR's design.

I have found one critical issue in the lower_thread_allreduce pass where the volatile_scope attribute for shared memory buffers is not correctly applied after this refactoring. This could lead to incorrect code generation for GPU targets.

Additionally, as noted in the test changes, the RemoveNoOp pass currently doesn't handle dead allocation elimination for the new flat AllocBuffer nodes. This seems to be a known limitation of this PR.

Note: Security Review is unavailable for this PR.

@tqchen
Copy link
Member Author

tqchen commented Mar 5, 2026

@gemini-code-assist review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request performs an extensive refactoring to flatten AllocBuffer and DeclBuffer nodes, making them simple statements by removing their bodies. However, a critical vulnerability has been introduced where attributes like volatile_scope and storage_alignment are now applied to empty or truncated scopes. This breaks synchronization on GPUs and loses alignment metadata, which can lead to data races and incorrect computation results. Specifically, the handling of volatile_scope for shared memory in src/s_tir/transform/lower_thread_allreduce.cc is broken, as the new implementation emits an AttrStmt with an empty body, rendering the volatile attribute ineffective. A pattern similar to the correct implementation in lower_warp_memory.cc, which gathers siblings into a new body, should be applied to address this.

tqchen added 3 commits March 5, 2026 21:08
With flat IR, T.decl_buffer() returns a plain Buffer instead of a
frame (context manager). Move buffer declarations out of frame_scope
lists in scan, sort, searchsorted, and nms_util.
… codegen

For AllocBuffer, codegen now reads alignment directly from the buffer's
data_alignment field instead of relying on a separate AttrStmt. Remove
the AttrStmt emission from lower_tvm_builtin. AttrStmt storage_alignment
is kept for function parameter buffers (from tvm_ffi_binder).
@tqchen
Copy link
Member Author

tqchen commented Mar 6, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a large and impactful refactoring to flatten AllocBuffer and DeclBuffer statements in TIR, removing their body field. The changes are extensive, touching core IR definitions, TVMScript, numerous TIR passes, and codegen backends. The overall implementation looks solid and consistent. The adaptation of visitors and mutators to the new flat scoping rules, particularly using deferred processing in passes like InjectDoubleBuffer and LowerThreadAllreduce, is well-handled. The replacement of the volatile_scope attribute with a tir.volatile annotation is also a good cleanup. I have one suggestion for improvement.

Comment on lines +85 to +89
ffi::Array<Stmt> seq;
seq.push_back(stmt);
seq.push_back(Bind(op->buffer->data,
Call(op->buffer->data.dtype(), builtin::nd_mem_alloc_with_scope(), args)));
stmt = SeqStmt(seq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This pass appears to be lowering AllocBuffer to a Bind statement that performs the actual memory allocation. However, the current implementation replaces the AllocBuffer node with a SeqStmt containing both the original AllocBuffer and the new Bind statement. This seems redundant, as the AllocBuffer is a high-level construct that this lowering pass should replace.

Other lowering passes in this PR (e.g., LowerVtcmAlloc) replace AllocBuffer with just a Bind statement. To be consistent and avoid redundancy, this pass should probably do the same.

Suggested change
ffi::Array<Stmt> seq;
seq.push_back(stmt);
seq.push_back(Bind(op->buffer->data,
Call(op->buffer->data.dtype(), builtin::nd_mem_alloc_with_scope(), args)));
stmt = SeqStmt(seq);
stmt = Bind(op->buffer->data,
Call(op->buffer->data.dtype(), builtin::nd_mem_alloc_with_scope(), args));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants