DeviceGraphBuilder
struct DeviceGraphBuilder
Builder for explicit device graph construction.
A DeviceGraphBuilder is obtained from
DeviceContext.create_graph_builder().
Callers add kernel nodes via add_function() and then call
instantiate() to produce a reusable DeviceGraph.
Example:

```mojo
from std.gpu.host import DeviceContext

def kernel(x: Int):
    print("Value:", x)

with DeviceContext() as ctx:
    var compiled_fn = ctx.compile_function[kernel, kernel]()
    var builder = ctx.create_graph_builder()
    builder.add_function(compiled_fn, 42, grid_dim=1, block_dim=1)
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
```
Implemented traits
AnyType,
ImplicitlyDestructible,
Movable,
_FunctionEnqueuer
comptime members
enqueue_fn_name
comptime enqueue_fn_name = StringSlice("AsyncRT_DeviceGraphBuilder_addFunctionDirect")
C runtime function name used by _FunctionEnqueuer to add a kernel node.
Methods
__init__
__init__(out self, *, copy: Self)
Creates a copy of an existing graph builder by incrementing its reference count.
Args:
- copy (Self): The graph builder to copy.
__del__
__del__(deinit self)
Releases resources associated with this graph builder.
handle
handle(self) -> UnsafePointer[NoneType, MutExternalOrigin]
Gets the underlying C builder handle.
Returns:
UnsafePointer[NoneType, MutExternalOrigin]: The underlying C builder handle as an opaque pointer.
add_function
add_function[*Ts: DevicePassable](self, f: DeviceFunction[target=f.target, compile_options=f.compile_options, link_options=f.link_options, _ptxas_info_verbose=f._ptxas_info_verbose], *args: *Ts.values, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None)))
Adds a type-checked compiled kernel function as a node in this graph.
Parameters:
- *Ts (DevicePassable): Argument types (must be DevicePassable).
Args:
- f (DeviceFunction[target=f.target, compile_options=f.compile_options, link_options=f.link_options, _ptxas_info_verbose=f._ptxas_info_verbose]): The type-checked compiled function to add. Must have been compiled via DeviceContext.compile_function().
- *args (*Ts.values): Arguments to pass to the kernel.
- grid_dim (Dim): Dimensions of the compute grid.
- block_dim (Dim): Dimensions of each thread block.
- cluster_dim (OptionalReg[Dim]): Cluster dimensions (optional).
- shared_mem_bytes (OptionalReg[Int]): Amount of dynamic shared memory per block.
- attributes (List[LaunchAttribute]): Launch attributes.
- constant_memory (List[ConstantMemoryMapping]): Constant memory mappings.
Raises:
If adding the node fails.
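As a minimal sketch of the optional launch arguments this overload accepts (the kernel, grid shape, and shared-memory size below are illustrative assumptions, not from this reference):

```mojo
from std.gpu.host import DeviceContext

def noop():
    pass

with DeviceContext() as ctx:
    var compiled_fn = ctx.compile_function[noop, noop]()
    var builder = ctx.create_graph_builder()
    # Keyword-only launch options mirror those of an ordinary kernel
    # launch; grid_dim/block_dim accept multi-dimensional shapes and
    # shared_mem_bytes requests dynamic shared memory per block.
    builder.add_function(
        compiled_fn,
        grid_dim=(4, 4),
        block_dim=(16, 16),
        shared_mem_bytes=4096,
    )
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
```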
add_function[FuncType: def() register_passable -> None, //, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, func: FuncType, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None)))
Compiles and adds a capturing kernel closure as a node in this graph.
This overload is for kernels that capture variables from their
enclosing scope using the {var} capture syntax. Compilation is
performed automatically using the DeviceContext that created this
builder, so no separate compile step is needed.
Example:

```mojo
from std.gpu import global_idx
from std.gpu.host import DeviceContext

with DeviceContext() as ctx:
    var scale: Float32 = 2.0
    var buf = ctx.enqueue_create_buffer[DType.float32](256)
    var ptr = buf.unsafe_ptr()

    def scale_kernel() {var}:
        var i = global_idx.x
        ptr[i] = Float32(i) * scale

    var builder = ctx.create_graph_builder()
    builder.add_function(scale_kernel, grid_dim=1, block_dim=256)
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
```
Parameters:
- FuncType (def() register_passable -> None): The type of the closure function (usually inferred).
- dump_asm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
- _dump_sass (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
- _ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).
Args:
- func (FuncType): The capturing kernel closure to compile and add as a graph node.
- grid_dim (Dim): Dimensions of the compute grid.
- block_dim (Dim): Dimensions of each thread block.
- cluster_dim (OptionalReg[Dim]): Cluster dimensions (optional).
- shared_mem_bytes (OptionalReg[Int]): Amount of dynamic shared memory per block.
- attributes (List[LaunchAttribute]): Launch attributes.
- constant_memory (List[ConstantMemoryMapping]): Constant memory mappings.
Raises:
If adding the node fails.
add_copy
add_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: HostBuffer[dtype])
Adds a host-to-device memcpy node to the graph.
The number of bytes copied is determined by the size of the device buffer.
Parameters:
- dtype (DType): Type of the data being copied.
Args:
- dst_buf (DeviceBuffer[dtype]): Device buffer to copy to.
- src_buf (HostBuffer[dtype]): Host buffer to copy from.
Raises:
If adding the node fails.
add_copy[dtype: DType](self, dst_buf: HostBuffer[dtype], src_buf: DeviceBuffer[dtype])
Adds a device-to-host memcpy node to the graph.
The number of bytes copied is determined by the size of the device buffer.
Parameters:
- dtype (DType): Type of the data being copied.
Args:
- dst_buf (HostBuffer[dtype]): Host buffer to copy to.
- src_buf (DeviceBuffer[dtype]): Device buffer to copy from.
Raises:
If adding the node fails.
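A minimal sketch of a graph that stages data onto the device and copies the result back, using both directions of add_copy (buffer sizes are illustrative assumptions):

```mojo
from std.gpu.host import DeviceContext

with DeviceContext() as ctx:
    var host_in = ctx.enqueue_create_host_buffer[DType.float32](256)
    var device_buf = ctx.enqueue_create_buffer[DType.float32](256)
    var host_out = ctx.enqueue_create_host_buffer[DType.float32](256)
    ctx.synchronize()

    var builder = ctx.create_graph_builder()
    builder.add_copy(device_buf, host_in)   # host-to-device node
    builder.add_copy(host_out, device_buf)  # device-to-host node
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
```

A kernel node added between the two copies (via add_function) would run against the staged data each time the graph is replayed.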
add_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: DeviceBuffer[dtype])
Adds a device-to-device memcpy node to the graph.
Both buffers must belong to the same context as this builder; cross-context copies are not supported in graphs. The number of bytes copied is determined by the size of the source buffer.
Parameters:
- dtype (DType): Type of the data being copied.
Args:
- dst_buf (DeviceBuffer[dtype]): Device buffer to copy to.
- src_buf (DeviceBuffer[dtype]): Device buffer to copy from. Must be the same size as dst_buf.
Raises:
If adding the node fails.
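A minimal sketch of the device-to-device case, with both buffers created from the same context as the builder (the buffer size is illustrative):

```mojo
from std.gpu.host import DeviceContext

with DeviceContext() as ctx:
    var src = ctx.enqueue_create_buffer[DType.float32](1024)
    var dst = ctx.enqueue_create_buffer[DType.float32](1024)

    var builder = ctx.create_graph_builder()
    # Both buffers come from `ctx`, satisfying the same-context rule.
    builder.add_copy(dst, src)
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
```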
add_memset
add_memset[dtype: DType](self, dst: DeviceBuffer[dtype], val: Scalar[dtype])
Adds a memset node to the graph that sets all elements of dst to val.
Parameters:
- dtype (DType): Type of the data stored in the buffer.
Args:
- dst (DeviceBuffer[dtype]): Destination buffer.
- val (Scalar[dtype]): Value to set all elements of dst to.
Raises:
If adding the node fails. The underlying graph APIs cannot express an 8-byte memset whose high and low 32-bit halves differ as a single node, so such patterns will return an error.
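A minimal sketch of zero-initializing a device buffer as a graph node (buffer size and fill value are illustrative assumptions):

```mojo
from std.gpu.host import DeviceContext

with DeviceContext() as ctx:
    var buf = ctx.enqueue_create_buffer[DType.float32](512)

    var builder = ctx.create_graph_builder()
    # Every element of `buf` is set to 0.0 each time the graph replays.
    builder.add_memset(buf, Float32(0))
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
```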
instantiate
instantiate(var self) -> DeviceGraph
Instantiates the constructed graph into an executable device graph.
Finalizes the graph construction and produces a DeviceGraph that
can be replayed multiple times.
Returns:
DeviceGraph: The instantiated device graph.
Raises:
If instantiation fails.