Mojo v1.0.0b1
Highlights
- `fn` is deprecated; use `def`. Mojo now emits a compiler warning on uses of `fn`; this will become a compilation error in the next release. This completes the `def`/`fn` unification begun in v0.26.2: `def` is now Mojo's standard function declaration keyword, with the same non-raising semantics `fn` had. See Language changes.
- Unified closures. This release continues the closure unification work begun in earlier releases: stateless closures auto-lift to top-level functions (and can be passed as FFI callbacks), the `ref` capture convention is supported, default capture conventions can be combined with explicit capture lists, and a new `thin` function effect declares a plain function pointer type that doesn't carry captured state. See Language enhancements.
- `UnsafePointer` is non-null by design. The default null constructor and `__bool__()` method are deprecated, and `UnsafePointer` no longer conforms to `Defaultable` or `Boolable`. Express nullability with `Optional[UnsafePointer[...]]`, which shares `UnsafePointer`'s layout (the null address is the `None` niche) so nullable pointers remain zero-overhead and FFI-safe. See Pointer and memory.
- Bounds-checked collections by default. Negative indexing has been removed from all standard library collections — `x[-1]` is now a compile-time error; use `x[len(x) - 1]` instead — and bounds checking is now on by default for all collections on CPU. Out-of-bounds accesses report the user's call site. Bounds checking remains off by default on GPU for performance; use `mojo build -D ASSERT=all` to enable. See Collections and iterators.
- `NDBuffer` removed. `NDBuffer` has been fully removed from the standard library. Migrate to `TileTensor`. See Collections and iterators.
- Expanded GPU hardware support. Apple Metal becomes a much more capable Mojo target — `print()` works, dynamic threadgroup memory (`external_memory[]()`) is supported, Apple M5 MMA intrinsics enable hardware matrix multiply-accumulate, and Apple GPU targets prefer `metal4` features by default. Added support for AMD MI250X and NVIDIA B300 (`sm_103a`) accelerators. See GPU programming.
- GPU primitive id accessors migrated `UInt` → `Int`. `thread_idx`, `block_idx`, `block_dim`, `grid_dim`, `global_idx`, `lane_id`, `warp_id`, and the cluster accessors now return `Int` as part of a broader migration to standardize on `Int` for sizes and offsets. Temporary `*_uint` aliases provide a migration path; they will eventually be deprecated and removed. See GPU programming.
- CPU `DeviceContext` expansion. `DeviceContext(api="cpu")` is now a stream-ordered execution context for CPU work, paving the way for NUMA-aware CPU dispatch. New `enqueue_cpu_function()` and `enqueue_cpu_range()` enqueue host functions and parallel ranges with stream ordering relative to surrounding work. See GPU programming.
- Grapheme cluster support in `String` and `StringSlice`. Added UAX #29 grapheme cluster segmentation with `graphemes()`, `count_graphemes()`, the `[grapheme=...]` slicing syntax, and reverse iteration. Correctly handles combining marks, emoji ZWJ sequences, flag emoji, Hangul syllables, and other multi-codepoint clusters. See String and text.
- Type refinement. The compiler now narrows types from `where` clauses, `comptime if` statements, and `comptime assert` statements, driven by `conforms_to()` expressions. This makes `trait_downcast` unnecessary in the common case — Mojo recognizes when a type satisfies a trait inside a refined scope and lets you call its trait methods directly. See Language enhancements.
- Unified reflection API. A new `reflect[T]()` entry point in `std.reflection` returns a `Reflected[T]` handle, replacing the family of `struct_field_*` free functions and `get_type_name`/`get_base_type_name`. `reflect` is auto-imported via the prelude. The legacy free functions and the `ReflectedType[T]` wrapper are now `@deprecated`. See Other library changes.
Documentation
- Added a new Mojo language reference covering lexical elements, expressions, statements, numeric types, structs, and traits. The reference includes new pages on Functions, the `@doc_hidden` decorator, and inline MLIR, with negative examples that highlight common errors.
- Added a manual section on `TileTensor` and TileTensor layouts.
- Separated the Mojo layout library docs from the MAX kernels library, reflecting that the layout library ships with `mojo` and the rest of the kernels library ships with `max`.
- Added a new Compilation targets doc covering how to inspect your platform, select a target configuration, and cross-compile for other CPUs, operating systems, and accelerators.
- Added a new Packaging guide for building Mojo packages, currently covering the `rattler-build` workflow.
- Restructured the Mojo and MAX system requirements docs into a two-level "Continuously tested" / "Known compatible" taxonomy, with a dedicated Mojo GPU compatibility page and per-vendor hardware tables. Added a new Troubleshooting GPU detection section.
- Split the operators page into separate manual pages, refreshed coverage, and added a new tutorial and reference page.
Language enhancements
- Added type refinement based on compile-time assumptions, enabling Mojo to narrow types from `where` clauses, `comptime if` statements, and `comptime assert` statements. Refinements in a scope are driven by `conforms_to()` expressions.

  Before:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if trait_downcast[Equatable](item) == trait_downcast[Equatable](value):
              return True
      return False
  ```

  After:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if item == value:
              return True
      return False
  ```

- Unified closure improvements. This release continues the closure unification work begun in earlier releases: stateless closures auto-lift, the `ref` capture convention is supported, default capture conventions can be combined with explicit capture lists, and a new `thin` function effect declares a plain function pointer type that doesn't carry captured state.

  ```mojo
  def main() raises:
      var a, b, c, d = 1, 2, 3, 4
      var x = "hello"

      # Legacy closure: no capture list. Cannot capture variables.
      def hello():
          print("hi")

      # Unified closure with no captures (stateless). Stateless closures
      # lift to top-level functions and can be passed as FFI callbacks.
      def add_one(n: Int) {} -> Int:
          return n + 1

      # Unified closure with explicit captures and a default capturing
      # convention:
      def my_fn() {mut a, b, c^, read}:
          # capture:
          # `a` by mut reference
          # `b` by immut reference
          # `c` by moving
          # `d` by immut reference (the default `read` convention)
          use(a, b, c, d)

      # Unified closure that captures `x` by ref (carries an
      # origin-mutability parameter):
      def show_x() {ref x}:
          print(x)

      # Function effects come before the capture list. The calling context
      # must handle errors raised from a `raises` closure.
      def fallible() raises {}:
          raise Error("nope")

      # Closures are invoked like ordinary functions:
      hello()
      print(add_one(41))
      my_fn()
      show_x()
      try:
          fallible()
      except e:
          print(e)

      # The `thin` function effect declares a plain function pointer
      # type that doesn't carry captured state. Stateless closures and
      # top-level functions are compatible with `thin` function pointers:
      var fn_ptr: def(Int) thin -> Int = add_one
      print(fn_ptr(99))
  ```

- Added `abi("C")` as a function effect for declaring the C calling convention on function definitions and function pointer types. Functions marked with `abi("C")` use the platform C ABI (System V x86-64 / ARM64 AAPCS) for struct arguments and return values, enabling safe interop with C libraries. `DLHandle.get_function()` now enforces that the type parameter carries `abi("C")`, preventing silent ABI mismatches when loading C symbols.

  ```mojo
  # C-ABI function definition (safe as a callback into C code)
  def add(a: Int32, b: Int32) abi("C") -> Int32:
      return a + b

  # C-ABI function pointer type (safe for use with DLHandle.get_function)
  var f = handle.get_function[def(Float64) abi("C") -> Float64]("sqrt")
  ```

- Added support for conditional `RegisterPassable` conformance.
- The ternary `if`/`else` expression now coerces each element to its contextual type when obvious. For example, this works instead of producing an incompatible-metatypes error:

  ```mojo
  comptime some_type: Movable = Int if cond else String
  ```

- Variadic lists and packs can be forwarded through runtime calls with `*pack` when the callee takes a compatible variadic list or pack.

  ```mojo
  def callee[*Ts: Writable](*args: *Ts):
      comptime for i in range(args.__len__()):
          print(args[i])

  def forwarder[*Ts: Writable](*args: *Ts):
      callee(*args)

  forwarder(1, "hello", 3.14)  # prints each value on a separate line
  ```

- Heterogeneous variadic packs can now be specified with a `SomeTypeList` helper. These two are equivalent:

  ```mojo
  def foo[*arg_types: Copyable](*args: *arg_types) -> Int: ...
  def foo(*args: *SomeTypeList[Copyable]) -> Int: ...
  ```

- String literals now support `\uXXXX` and `\UXXXXXXXX` Unicode escape sequences, matching Python. The resulting code point is stored as UTF-8. Invalid code points and surrogates are rejected at parse time.
- T-strings can now be used in `comptime assert` messages:

  ```mojo
  def foo[i: Int]():
      comptime assert i > 5, t"expected i > 5, got {i}"
  ```
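The `abi("C")` enforcement described above parallels how Python's `ctypes` requires callers to declare a symbol's C signature before calling it. A minimal Python sketch (assuming a Unix-like system where `sqrt` from the C math library is resolvable in the running process):

```python
import ctypes

# dlopen(NULL): look up symbols already linked into this process.
libc = ctypes.CDLL(None)

# Declare the C ABI of sqrt explicitly -- the same idea as requiring
# abi("C") on the type passed to DLHandle.get_function in Mojo. Without
# this, ctypes defaults to int argument/return types and miscalls it.
libc.sqrt.restype = ctypes.c_double
libc.sqrt.argtypes = [ctypes.c_double]

print(libc.sqrt(9.0))  # 3.0
```

The failure mode both designs guard against is the same: calling a C symbol through a signature the C side never agreed to.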
Language changes
- The `fn` keyword for function declarations is deprecated. Mojo now emits a compiler warning on uses of `fn`; this will become a compilation error in the next release. Use `def` instead.
- The `unified` keyword has been removed; specify unified-closure semantics with an explicit capture list `{...}` after the function signature. An empty capture list `{}` denotes unified with no captures; closures without any capture list are legacy. Mojo also now warns when a function pointer type omits the `thin` effect; specify `thin` explicitly to silence the warning.
- Import statements of the form `from pkg import ...` no longer make `pkg` available to the module.
- Removed support for comparing tuples of differing lengths or types. Such comparisons (for example, `(1, 2) != (4, 5, 6)`) are now rejected statically by the type system instead of silently returning not-equal.
- Variadic parameter lists are now `ParameterList` and `TypeList` instead of `!kgen.param_list`, so they can be used like ordinary types:

  ```mojo
  def callee[*values: Int]():
      var v = 0
      for i in range(len(values)):
          v += values[i]
      for elt in values:
          v += elt
  ```

- Each Mojo function now has its own unique function-literal type. Two separately-defined functions, even with identical signatures, are not interchangeable through their literal types; use a function pointer type (for example, `def(Int) thin -> Int`) to abstract over them.
- `A if comptime(C) else B` now skips elaboration of the dead branch, treating the ternary expression as a compile-time evaluation contract analogous to `comptime if C: A else: B`.
- `@explicit_destroy` is now rejected at parse time when paired with an unconditional `ImplicitlyDestructible` conformance; it remains valid only on conditional (where-clause-constrained) conformances.
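For context on the tuple-comparison change, the removed behavior matched Python's dynamic semantics, where comparing mismatched tuples silently yields a value rather than an error:

```python
# Python compares tuples of different lengths dynamically; equality is
# simply False. Mojo now rejects this comparison at compile time instead
# of returning the silent result.
print((1, 2) == (4, 5, 6))  # False
print((1, 2) != (4, 5, 6))  # True
```

Rejecting the comparison statically turns a likely logic bug (a comparison that can never be equal) into a compile-time diagnostic.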
Library changes
Type system and traits
- The `Boolable`, `Defaultable`, and `Writable` traits no longer inherit from `ImplicitlyDestructible`. Generic code that needs the destructor bound must now request it explicitly: `T: Writable & ImplicitlyDestructible`.
- Standard library types now use conditional conformances:
  - `Span`: `Writable`, `Hashable`
  - `Tuple`, `Optional`, `Variant`, and `UnsafeMaybeUninit`: `RegisterPassable`
  - `Tuple`: `Defaultable` (when all element types are `Defaultable`)
  - `Variant`: `Copyable`, `ImplicitlyCopyable`
  - `Optional`: `DevicePassable` (conditional on element type)
- `ArcPointer` now conditionally conforms to `Hashable` and `Equatable` when its inner type does, with `__eq__()` and `__hash__()` delegating to the managed value (matching C++ `shared_ptr` and Rust `Arc` semantics). This makes `ArcPointer` usable as a `Dict` key or `Set` element with value-based equality; pointer identity remains available via the `is` operator.
- `Path` now conforms to `Comparable`, enabling lexicographic ordering and use with `sort()`.
Atomic operations
- Atomic operations have moved to a dedicated `std.atomic` module. The `Consistency` type has been renamed to `Ordering` and its `MONOTONIC` member to `RELAXED`, to align with conventions used by other languages. Update existing code as follows:

  ```mojo
  # Before
  from std.os import Atomic
  from std.os.atomic import Atomic, Consistency, fence
  _ = atom.load[ordering=Consistency.MONOTONIC]()

  # After
  from std.atomic import Atomic, Ordering, fence
  _ = atom.load[ordering=Ordering.RELAXED]()
  ```

- Swapped the ordering arguments of `Atomic.compare_exchange()` so `success_ordering` is listed before `failure_ordering`, matching the convention used by C++, Rust, and other languages.
- `Ordering` now has a default constructor that selects `RELAXED` on Apple GPU (where `release` ordering is not supported by Metal) and `SEQUENTIAL` on all other targets. All `Atomic` methods and `fence()` use this platform-aware default instead of hard-coding `SEQUENTIAL`.
Pointer and memory
- `UnsafePointer` is now non-null by design. The default null constructor `__init__(out self)` and `__bool__(self)` method are deprecated, and `UnsafePointer` no longer conforms to `Defaultable` or `Boolable`. See the non-null pointer proposal for the full design.

  To migrate, express nullability with `Optional[UnsafePointer[...]]`, which shares the layout of `UnsafePointer` (the null address is the `None` niche) so nullable pointers remain zero-overhead and FFI-safe.

  ```mojo
  # Before: null default construction
  var ptr = UnsafePointer[Int, origin]()

  # After: express absence with Optional
  var ptr: Optional[UnsafePointer[Int, origin]] = None

  # Before: Bool-based null check
  if ptr:
      use(ptr[])

  # After: check the Optional, then unwrap
  if ptr:
      use(ptr.value()[])
  ```

  For a non-null placeholder for a field that will be populated later (for example, a buffer allocated on demand), use `UnsafePointer.unsafe_dangling()` — a well-aligned but dangling pointer. It's not a null sentinel; lazy-init types must track initialization separately.

- `CStringSlice` can no longer represent a null pointer. To represent nullability use `Optional[CStringSlice]`, which is guaranteed to have the same size and layout as `const char*` (with `NULL` as the empty `Optional`).
- `OwnedDLHandle.get_symbol()` now returns `Optional[UnsafePointer[...]]` instead of aborting when a symbol is not found, allowing callers to handle missing symbols gracefully.
- `alloc[T](count, alignment)` now aborts if the underlying allocation fails.
- Added `std.memory.forget_deinit()` to enable low-level code to skip running a destructor for a value. Use rarely, only when building low-level abstractions.
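The migration pattern throughout this section is the familiar one from languages with option types: encode absence in the type, check, then unwrap. A Python sketch of the same shape (the Mojo-specific layout guarantee has no Python analog; `find_index` is an illustrative name):

```python
from typing import Optional

def find_index(xs: list, v) -> Optional[int]:
    # Return the position of v, or None when absent -- absence lives in
    # the return type rather than in a magic sentinel value.
    for i, x in enumerate(xs):
        if x == v:
            return i
    return None

idx = find_index([10, 20, 30], 20)
if idx is not None:   # check the Optional...
    print(idx)        # ...then use the unwrapped value: 1
```

The type system then forces every caller to handle the "not found" case, which is exactly what dropping `Boolable` from `UnsafePointer` achieves for null.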
Collections and iterators
- `NDBuffer` has been fully removed. Migrate to `TileTensor`.
- Negative indexing has been removed from all stdlib collections (`List`, `Span`, `InlineArray`, `String`, `StringSlice`, `LinkedList`, `Deque`, `IntTuple`) to enable cheap CPU bounds checks by default. Using a negative `IntLiteral` for indexing now triggers a compile-time error:

  ```
  constraint failed: negative indexing is not supported, use e.g. `x[len(x) - 1]` instead
  ```

  Update any `x[-1]` to `x[len(x) - 1]`.

- Bounds checking is now on by default for all collections on CPU. Out-of-bounds accesses report the user's call site:

  ```mojo
  def main():
      var x = [1, 2, 3]
      print(x[3])
  ```

  ```
  At: /tmp/main.mojo:3:12: Assert Error: index 3 is out of bounds, valid range is 0 to 2
  ```

  Bounds checking is still off by default on GPU for performance. Use `mojo build -D ASSERT=all` to enable bounds checking on GPU; use `-D ASSERT=none` to disable all asserts including CPU bounds checking.

- `range()` overloads that took differently-typed arguments or arguments that were `Intable`/`IntableRaising` but not `Indexer` have been removed. Callers should pass consistent integral argument types.
- Added an `IterableOwned` trait to the iteration module. Types conforming to `IterableOwned` implement `__iter__(var self)`, which consumes the collection and returns an iterator that owns the underlying elements. `List`, `Optional`, `Deque`, `LinkedList`, `Dict`, `Set`, `Counter`, and `InlineArray` now conform; `Span` conforms conditionally on `T: Copyable`, with the owned iterator yielding copies by value.

  Iterator adaptors (`enumerate()`, `zip()`, `map()`, `peekable()`, `take_while()`, `drop_while()`, `product()`, `cycle()`, `count()`, `repeat()`) now conform to `IterableOwned`. Added owned overloads of `enumerate()`, `zip()`, `map()`, `peekable()`, `take_while()`, `drop_while()`, `product()`, and `cycle()` that consume the input iterable.

- Added `map()` and `and_then()` methods to `Optional`. `map()` applies a function to the contained value (returning `Optional[To]`); `and_then()` flat-maps over operations that themselves return an `Optional`.

  ```mojo
  var o = Optional[Int](42)

  def closure(n: Int) {} -> String:
      return String(n + 1)

  var mapped: Optional[String] = o.map[To=String](closure)
  print(mapped)  # Optional("43")
  ```

- Added `Optional.destroy_with(destroy_func)`, which destroys an `Optional[T]` in-place using a caller-provided destructor. This enables `Optional` to hold element types that are not `ImplicitlyDestructible` (for example, types marked `@explicit_destroy`), mirroring `Variant.destroy_with()`. Both `destroy_with()` methods now accept closures that capture local state in addition to plain function references. `Variant.destroy_with()` callers must now pass the destroyed type explicitly (for example, `v^.destroy_with[Int](destroy_func)`) since `T` can no longer be inferred from the closure type.
- Added a generic `__contains__()` method to `Span` for any element type conforming to `Equatable`, not just `Scalar` types.
- `assert_raises()` now catches custom `Writable` error types, not just `Error`.
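Both indexing migration points behave like their Python counterparts, where list access has always been bounds-checked:

```python
x = [1, 2, 3]

# Replacement for the removed negative-index form x[-1]:
print(x[len(x) - 1])  # 3

# Out-of-bounds access fails loudly instead of reading garbage:
try:
    x[3]
except IndexError as e:
    print(e)  # list index out of range
```

The difference is that Mojo rejects negative index *literals* at compile time, so the mistake never reaches runtime.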
String and text
- `String.__len__()` is deprecated. Use `String.byte_length()` or `String.count_codepoints()` instead.
- Grapheme cluster support in `String` and `StringSlice`. Added UAX #29 grapheme cluster segmentation, correctly handling combining marks, emoji ZWJ sequences, flag emoji, Hangul syllables, and other multi-codepoint clusters.
  - `graphemes()` returns a `GraphemeSliceIter` yielding each user-perceived "character" as a `StringSlice`; `count_graphemes()` returns the grapheme cluster count.
  - `StringSlice` supports slicing by grapheme cluster via the `grapheme=` keyword argument, mirroring the existing `byte=` indexer (for example, `s[grapheme=0:3]`). Because grapheme boundaries are discovered by a forward scan, this is O(n) in byte length — prefer `byte=` when byte offsets are known.
  - Grapheme-aware algorithms `grapheme_indices()`, `nth_grapheme(n)`, and `split_at_grapheme(n)` mirror Rust's `str::grapheme_indices` and friends, useful for editors and UIs mapping cursor byte positions to grapheme boundaries.
  - `GraphemeSliceIter` supports reverse iteration via `next_back()`, `peek_back()`, and the `graphemes_reversed()` constructors on `String`/`StringSlice`. Reverse iteration costs more per cluster than forward iteration because the UAX #29 state machine is forward-scanning.
  - `GraphemeSliceIter.remaining_byte_length()` reports the iterator's remaining byte range in O(1).
  - `count_graphemes()` takes a fast path over printable-ASCII runs: ~10x faster on pure-ASCII text, ~5–6x faster on ASCII-dominant mixed text. Pure non-ASCII text (Arabic, Russian, Chinese) is unchanged.
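The gap these APIs close is the mismatch between code points and user-perceived characters. Python, whose `len()` counts code points, illustrates the problem the grapheme APIs solve:

```python
# One visible character, two code points: 'e' plus a combining acute.
s = "e\u0301"  # renders as "é"
print(len(s))  # 2 -- code points, not graphemes

# A flag emoji is two regional-indicator code points forming a single
# grapheme cluster.
flag = "\U0001F1EB\U0001F1F7"  # 🇫🇷
print(len(flag))  # 2
```

Mojo's `count_graphemes()` reports 1 for each of these strings, matching what a user perceives and what a cursor should treat as one unit.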
Diagnostics and debug
- `abort(message)` now includes the call site location in its output. You can also pass an explicit `SourceLocation` to override it:

  ```mojo
  abort("something went wrong")
  # prints: ABORT: path/to/file.mojo:42:5: something went wrong

  var loc = current_location()
  abort("something went wrong", location=loc)
  ```

- `abort(message)` now prints its message on NVIDIA and AMD GPUs, including block and thread IDs. Previously, the message was silently suppressed on these GPUs. On Apple GPUs, the message is still silently suppressed for now.
- New diagnostics report the user's call site rather than stdlib source: `check_bounds()` for collections asserts on out-of-range indices, and `debug_assert()` now accepts a `call_location` parameter for callers to override the reported `SourceLocation`.
- `SourceLocation` fields are now private; use the `line()`, `column()`, and `file_name()` accessor methods instead.
- Added uninitialized memory read detection for float loads. When compiled with `-D MOJO_STDLIB_SIMD_UNINIT_CHECK=true`, every float load is checked against the debug allocator's poison pattern (the largest finite value of the float type, for example `FLT_MAX` for `Float32`); a match triggers `abort()`. The poison is non-NaN so it coexists with `nan-check` in kernels that intentionally write only active positions. Zero runtime overhead when disabled (the default).
- `InlineArray`'s storage constructor now uses `debug_assert[assert_mode="safe"]` for the element-count check, so size mismatches are caught by default instead of only with `-D ASSERT=all`.
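A toy Python sketch of the poison-pattern mechanism described above (names are hypothetical; in Mojo the check is wired into SIMD float loads, not a helper function):

```python
import sys

POISON = sys.float_info.max  # largest finite float, like FLT_MAX

def checked_load(buf, i):
    # A load that observes the allocator's poison pattern is, with high
    # probability, reading memory that was never initialized. Because
    # the poison is finite (not NaN), NaN-based checks still work.
    v = buf[i]
    if v == POISON:
        raise RuntimeError(f"uninitialized float read at index {i}")
    return v

buf = [POISON] * 4  # freshly "allocated" debug memory, poisoned
buf[0] = 1.5        # only index 0 is initialized

print(checked_load(buf, 0))  # 1.5
try:
    checked_load(buf, 1)     # reads the poison -> flagged
except RuntimeError as e:
    print(e)
```

The trade-off is a vanishingly small false-positive chance (a real value equal to the poison), which is why the check is opt-in via the build flag.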
TileTensor and Layout
- `TileTensor` API extensions:
  - Added `TileTensor.bitcast[target_dtype]()`, which returns a new `TileTensor` viewing the same storage and layout under a different element dtype, replacing the `TileTensor(x.ptr.bitcast[Scalar[T]](), x.layout)` idiom.
  - Added `TileTensor.flat_load()` and `TileTensor.flat_store()` as raw-flat accessors that read and write the underlying storage at a linear offset, bypassing the tensor's layout.
  - Added a `TileTensor.tile()` overload that takes the tile shape as a runtime or compile-time parameter argument, complementing the existing tile APIs.
  - GPU `TileTensor.load()` and `load_linear()` now default `invariant=True` for immutable tensors, enabling the compiler to use `ldg` for read-only memory accesses.
  - Added compile-time bounds checks to `TileTensor`, `ManagedTensorSlice`, and `crd2idx()` to catch out-of-range coordinate accesses at compile time.
- Layout library extensions:
  - Added a compile-time `coalesce()` function for `TensorLayout`, mirroring the legacy `Layout.coalesce()` algorithm (skip shape-1 dims and merge contiguous dims).
  - Added `write_repr_to()` to `Layout` for writing a debug representation to a `Writer`.
  - `vectorize()` and `distribute()` now accept layouts with runtime dimensions.
  - `row_major()` now accepts coord-like arguments directly, no longer requiring them to be wrapped in tuples.
  - Introduced weakly compatible layouts, enabling structural compatibility comparisons between layouts and coordinate indices (up to depth 4). Structural equality is now checked via a `comptime assert` rather than a `where` clause.
  - Changed `CoordLike.value()` to return `Scalar[Self.DTYPE]` instead of `Int`, providing a more expressive return type for layout coordinate values.
  - `Coord`, `RowMajorLayout`, and `ColMajorLayout` now take their parameters as variadic arguments, improving ergonomics when specifying individual coords. Use `*` splat to pass an existing list.
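The coalesce rule mentioned above (skip extent-1 dims, merge contiguous neighbors) can be sketched in a few lines of Python; the function name and conventions are illustrative, not the Mojo API:

```python
def coalesce(shape, stride):
    # Walk dims innermost-first; drop extent-1 dims, and merge a dim
    # into its inner neighbor when it is contiguous with it, i.e. its
    # stride equals inner_extent * inner_stride.
    acc = []  # (extent, stride) pairs, innermost first
    for extent, st in zip(reversed(shape), reversed(stride)):
        if extent == 1:
            continue
        if acc and st == acc[-1][0] * acc[-1][1]:
            acc[-1] = (acc[-1][0] * extent, acc[-1][1])
        else:
            acc.append((extent, st))
    acc.reverse()
    return tuple(e for e, _ in acc), tuple(s for _, s in acc)

# A contiguous row-major (2, 3) layout collapses to one dim of 6:
print(coalesce((2, 3), (3, 1)))        # ((6,), (1,))
# Extent-1 dims are skipped before merging:
print(coalesce((2, 1, 4), (4, 4, 1)))  # ((8,), (1,))
```

Fewer, larger dims after coalescing mean simpler index arithmetic and better opportunities for vectorized copies.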
GPU programming
- Added support for AMD MI250X accelerators.
- Expanded Apple silicon GPU support. Apple Metal GPU is now a more capable Mojo target.
  - `print()` and `_printf()` now work on Apple Metal GPU. Output is chunked through the Metal `os_log` path, with a Float32-only formatter that matches Metal's hardware constraints. `_printf()` currently emits the format string only (not interpolated arguments); `|x| < 1e-7` is truncated to `0.0`.
  - `external_memory[]()` (dynamic threadgroup memory) is now supported on Apple silicon, so existing GPU kernels using `external_memory[]()` work unchanged.
  - Apple M5 MMA intrinsics (`apple_mma_load()`, `apple_mma_store()`, `_mma_apple()`) in `std.gpu.compute.arch.mma_apple` enable hardware matrix multiply-accumulate on Apple GPUs.
  - Added `CompilationTarget.is_apple_m5()` to `std.sys` for detecting Apple M5 targets at compile time; `is_apple_silicon()` now includes M5 in its check.
  - Apple GPU targets now prefer `metal4` features by default when the toolchain supports them, automatically appending `-metal4` to the arch instead of requiring explicit `m5-metal4` selection.
  - Atomic ordering: `release` ordering is not supported on Metal. Apple GPU targets now use `monotonic` (relaxed) atomic ordering by default.
  - Floating-point widths: the compiler now rejects floating-point types wider than 32 bits (`Float64`/`Float80`/`Float128`) for Apple GPU targets, since Metal supports only `Float16` and `Float32`.
- GPU device APIs:
  - Added support for NVIDIA B300 (`sm_103a`) accelerators. New helpers in `std.sys.info` and `std.gpu.host.info` recognize B300 targets for correct kernel dispatch on Blackwell B300.
  - Added `DeviceStream.enqueue_host_func(func, user_data)` exposing the `cuLaunchHostFunc` primitive for Mojo kernels and custom ops. Takes a `thin def(OpaquePointer[MutAnyOrigin]) -> None` callback and an opaque `user_data` pointer. CUDA-only today; non-CUDA backends raise.
  - `DeviceContext` initialization now runs an automatic GPU health check that detects hardware throttling, uncorrectable ECC errors, and zombie VRAM, and fails device creation with an actionable error message on unhealthy GPUs. Added `DeviceContext.run_healthcheck()` to re-invoke the check explicitly. Set `MODULAR_DEVICE_CONTEXT_DISABLE_HEALTHCHECK=true` to disable.
  - Optimized GPU `elementwise()` index computation and dispatch with a `use_32bit` fast path, 4x unrolled grid-stride processing, warp-aligned block sizes, and SM100+ single-tile routing.
- GPU primitive id accessors (`thread_idx`, `block_idx`, `block_dim`, `grid_dim`, `global_idx`, `lane_id`, `warp_id`, `cluster_dim`, `cluster_idx`, and `block_id_in_cluster`) have migrated from `UInt` to `Int`.

  This is part of a broader migration to standardize on the `Int` type for all sizes and offsets in Mojo. As a related step in the same migration, `TensorCore.load_a()` and `TensorCore.load_b()` now also take `Int` arguments instead of `UInt`.

  To provide a gradual migration path, `*_uint` aliases of the seven non-cluster accessors are temporarily available:

  | Accessor | Legacy `UInt` alias |
  | --- | --- |
  | `thread_idx` | `thread_idx_uint` |
  | `block_idx` | `block_idx_uint` |
  | `block_dim` | `block_dim_uint` |
  | `grid_dim` | `grid_dim_uint` |
  | `global_idx` | `global_idx_uint` |
  | `lane_id` | `lane_id_uint` |
  | `warp_id` | `warp_id_uint` |

  The three cluster accessors (`cluster_dim`, `cluster_idx`, `block_id_in_cluster`) migrated directly without `*_uint` aliases, since their usage was limited.

  Code can preserve its prior `UInt` behavior by using a renaming import of the `*_uint` alias:

  ```diff
  - from std.gpu import thread_idx
  + from std.gpu import thread_idx_uint as thread_idx
  ```

  The temporary `*_int` accessors that briefly existed during the phased migration as a forward-compatibility aid have been removed; use the unprefixed accessors (which now return `Int` by default). The `*_uint` aliases will eventually be deprecated and removed.

- CPU `DeviceContext` expansion. `DeviceContext(api="cpu")` is now usable as a stream-ordered execution context for CPU work, paving the way for NUMA-aware CPU dispatch.
  - Added `DeviceContext.enqueue_cpu_function()` and `DeviceContext.enqueue_cpu_range()` for stream-ordered execution of host functions on CPU `DeviceContext` instances. `enqueue_cpu_function()` enqueues a single host function; `enqueue_cpu_range()` enqueues a parallel range whose tasks run concurrently but are stream-ordered relative to surrounding work. Argument passing is not yet supported.
  - `parallelize()`, `parallelize_over_rows()` (in `std.algorithm.backend.cpu.parallelize`), and the `elementwise()` overloads in `std.algorithm.functional` now accept an optional trailing `ctx: Optional[DeviceContext] = None`. When supplied, the context is forwarded to `sync_parallelize()`; otherwise behavior is unchanged.
  - Added a `parallelism_level()` overload that takes a CPU `DeviceContext` and returns the thread-pool size for that specific context, enabling NUMA-specific introspection.
- AMD GPU intrinsics:
  - Added the `ds_read_tr8_b64()` AMD GPU intrinsic in `std.gpu.intrinsics`, performing a 64-bit LDS transpose load of 8-bit elements via `llvm.amdgcn.ds.read.tr8.b64`. Supported on AMD CDNA4+ GPUs.
  - Added a `Scalar[dtype]` overload of `readfirstlane()` so callers no longer need bitcast workarounds to broadcast non-`Int32` scalar values across an AMD GPU wavefront.
  - `AMDBufferResource.load_to_lds()` in `std.gpu.intrinsics` now lowers to the `.ptr.` form of the AMDGPU buffer-load-to-LDS intrinsic, fixing a strided-layout regression. A new `async_copies: Bool = False` parameter opts in to attaching the `amdgpu.AsyncCopies` alias scope on the load, enabling LLVM `vmcnt` relaxation.
  - Added a `broadcast=True` parameter to GPU `warp_id()` (and related id accessors) so callers can avoid manual `warp.broadcast(warp_id())` patterns.
- `tile_io` module for `TileTensor` data movement. Added a `tile_io` module providing `TileTensor` copier traits and copy utilities for moving data between memory hierarchies (DRAM/SRAM). The module includes:
  - `GenericToSharedAsyncTileCopier`, which moves a `TileTensor` from generic memory into shared memory via NVIDIA's `cp.async`. On AMD and Apple GPUs the underlying `async_copy()` falls back to synchronous loads/stores.
  - An optional `swizzle: Swizzle` parameter on `GenericToSharedAsyncTileCopier`, mirroring the swizzled write path in `LocalToSharedTileCopier`.
  - A `masked: Bool = False` parameter on `GenericToSharedAsyncTileCopier`. When enabled, out-of-bounds vectors receive a zero-byte copy with zero-fill, matching `LayoutTensor.copy_from_async[is_masked=True, fill=Fill.ZERO]`.
  - An `AsyncTileCopier` trait abstracting copier conformance.
- TMA `gather4` for sparse 2D tensor loads. Added a TMA `gather4` operation on SM100 (Blackwell) for loading 4 non-contiguous rows from a 2D tensor in a single TMA instruction, surfaced as the `cp_async_bulk_tensor_2d_gather4()` intrinsic in `std.gpu.memory` and integrated with `TMATensorTile`. The API supports:
  - Full 2D tile sparse loads with arbitrary `tile_height` (multiple of 4) and `tile_width`, replacing the prior 4-row-per-call limit.
  - Arbitrary `row_width` — previously restricted to the swizzle box width. The API automatically computes the box width from the swizzle constraint and supports non-divisible widths via TMA hardware zero-fill on the last column group, so kernels no longer need to hand-code column-group loops.
- 1D TMA instructions for SM90+ NVIDIA GPUs. Added 1D TMA (Tensor Memory Accelerator) instruction support in `std.gpu.memory`. 1D TMA copies do not require a pre-allocated tensormap object on the host, providing greater flexibility than the existing 2D–5D TMA path. New functions: `cp_async_bulk_shared_cluster_global()`, `cp_async_bulk_global_shared_cta()`, `cp_async_bulk_prefetch()`, and `cp_async_bulk_reduce_global_shared_cta()`.
- Readable GPU kernel names in profilers. GPU kernels in the standard library and across MAX kernels (elementwise, GEMV, multistage matmul, attention, convolution, MoE, normalization, quantization, BMM, grouped matmul, SM100 matmul, AMD matmul, communication, and sampling) now expose human-readable names in profiler traces such as Nsight Systems, replacing previously mangled KGEN symbols.
- Added `Span`-based overloads for `enqueue_copy()`, `enqueue_copy_from()`, and `enqueue_copy_to()` on `DeviceContext`, `DeviceBuffer`, and `HostBuffer`, providing a safer alternative to raw `UnsafePointer` for host-device memory transfers.
Other library changes
- Removed `trait_downcast()` and `trait_downcast_var()` from across the standard library, replaced by type refinement (see Language enhancements). Public APIs are unchanged.
- `external_call()`'s `return_type` requirement has been relaxed from `TrivialRegisterPassable` to `RegisterPassable`.
- Several standard library APIs gained unified-closure overloads: `parallelize()` and `parallelize_over_rows()` (in `std.algorithm.backend.cpu.parallelize`), `bench.bencher()`, `DeviceContext.execution_time()`, and `DeviceContext.enqueue_function()` (the GPU enqueue path, renamed from the previous `enqueue_closure()`).
- Consolidated the reflection APIs in `std.reflection` behind a unified entry point `reflect[T]()` returning a `Reflected[T]` handle. `reflect()` is auto-imported via the prelude. Methods on the handle replace the family of `struct_field_*` free functions (dropping the `struct_` prefix — only structs have fields) and the `get_type_name()`/`get_base_type_name()` free functions:

  ```mojo
  struct Point:
      var x: Int
      var y: Float64

  def main():
      comptime r = reflect[Point]()
      print(r.name())                          # "Point"
      print(r.field_count())                   # 2
      print(r.field_names()[0])                # x
      comptime y_type = r.field_type["y"]()    # Reflected[Float64]
      print(y_type.name())                     # "SIMD[DType.float64, 1]"
      print(reflect[List[Int]]().base_name())  # "List"
      var v: y_type.T = 3.14
  ```

  The legacy free functions and the `ReflectedType[T]` wrapper are now `@deprecated`; they will be removed in a future release.

- `align_down()` and `align_up()` now accept generic `SIMD[dtype, width]` integer values, replacing the prior `UInt`-only overloads.
- Extended `FastDiv` and `mulhi()` to support 64-bit integer types.
- Added a `Variadic.contains_value` comptime alias to check whether a variadic sequence contains a specific value at compile time.
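The arithmetic behind the two integer utilities above, sketched in Python (assuming power-of-two alignments for the bit-mask form, and unsigned 64-bit operands for `mulhi`):

```python
def align_down(x: int, a: int) -> int:
    # Round x down to a multiple of a (a must be a power of two).
    return x & ~(a - 1)

def align_up(x: int, a: int) -> int:
    # Round x up to a multiple of a.
    return (x + a - 1) & ~(a - 1)

MASK64 = (1 << 64) - 1

def mulhi_u64(a: int, b: int) -> int:
    # High 64 bits of the full 128-bit product of two u64 values --
    # the building block FastDiv uses to replace division by a
    # runtime-constant divisor with a multiply and shift.
    return ((a & MASK64) * (b & MASK64)) >> 64

print(align_down(37, 16))  # 32
print(align_up(37, 16))    # 48
print(mulhi_u64(1 << 40, 1 << 40))  # 65536 (2**80 >> 64)
```

Python's arbitrary-precision integers make the 128-bit intermediate product free; in Mojo the same result comes from a widening multiply.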
Tooling changes
-
Removed the legacy
MOJO_ENABLE_STACK_TRACE_ON_ERRORandMOJO_ENABLE_STACK_TRACE_ON_CRASHenvironment variables. Instead, set theMODULAR_DEBUGenvironment variable tostack_trace_on_errorto enable generation of stack traces when a Mojo program raises an error. -
- Debugger UX:

  - The Mojo debugger now shows a `Variant` variable's active type name and
    value in LLDB (for example, `Int(42)` or `String("hello")`) instead of
    exposing raw `_DefaultVariantStorage` internals.
  - The Mojo debugger now displays `Optional[T]` variables as `None` or
    `Some(value)` in LLDB instead of exposing raw `_DefaultVariantStorage`
    internals.
  - The Mojo debugger now displays scalar types (for example, `UInt8`,
    `Float32`) as plain values instead of `([0] = value)`, and elides internal
    `_mlir_value` wrapper fields from struct display.
  - The Mojo debugger now correctly displays `UnsafePointer[T]` values in LLDB
    for all pointed-to types, including signed integers (no longer rendered as
    huge unsigned values), `Bool` (`True`/`False`), and floats.
  - The Mojo debugger now displays `StringSlice`, `StaticString`, and their
    underlying `Span[Byte]` values as quoted strings in LLDB.
  - At `-O0`, trivially destructible types (`Int`, `Float`, `Bool`, `SIMD`,
    etc.) now remain visible in the debugger through the end of their lexical
    scope instead of disappearing at the ASAP destruction point.
- LSP and REPL responsiveness:

  - Code completion and signature help in REPL/notebook contexts are now
    amortized O(1) per request by caching parsed prior cells across requests,
    eliminating a quadratic O(N²) slowdown in long sessions.
  - LSP parse time is reduced by deferring body resolution of imported
    bytecode declarations and resolving named imports lazily, avoiding eager
    pulls of large transitive dependencies. Files with docstring code blocks
    parse roughly 2x faster.
  - Added a `--mojo-version` flag to `mojo-lsp-server` for verifying the Mojo
    version that the LSP is using.
- `mojo` CLI and toolchain:

  - `mojo --version` now prints a semantic Mojo version (for example,
    `1.0.0...`) instead of an internal build identifier, and the same version
    is used wherever the compiler performs version checks.
  - `mojo build --print-supported-targets` now lists registered targets
    sorted alphabetically, with a graceful empty-list message.
  - The compiler now selects the target's baseline CPU when cross-compiling
    with `--target-triple` without `--target-cpu` and the host and target
    architectures differ.
  - ASAN-instrumented Mojo binaries on macOS now use `llvm-symbolizer`
    instead of `atos`, so stack traces report the full inlined call chain
    through user functions.
  - Mojo package files (`.mojopkg`) now use format version 2 with
    zstd-compressed MLIR bytecode, significantly reducing package, wheel, and
    Docker image sizes.
- `mojo doc` and docstring validation:

  - `mojo doc` now preserves parameterized type names (for example, `List[K]`,
    `Optional[V]`, `UnsafePointer[Scalar[dtype]]`) in the API doc JSON
    `"type"` fields, instead of emitting only the bare base name.
  - `mojo doc` now emits a diagnostic when a public Mojo module has no
    module-level docstring and `-mojo-diagnose-missing-doc-strings` is
    active. Private modules and modules nested inside private packages are
    exempt.
  - Docstring validation no longer requires inferred parameters (those before
    `//` in a parameter list) to be documented; documenting them remains
    valid.
  - Docstring validation now accepts `!` and `?` as valid sentence-ending
    punctuation throughout.
  - `def ... raises` functions now require a `Raises:` docstring section like
    any other raising function.
  - The `isDef` field has been removed from `mojo doc` JSON output.
- `mojo format` (mblack):

  - No longer supports the deprecated `fn` keyword or the removed `owned`
    argument convention.
  - Now correctly parses the new unified-closure syntax, including
    `raises {captures}` effect ordering, and no longer inserts a spurious
    space between `^` and the operand in `var^` captures.
- Comptime function calls now print more nicely in error messages and
  generated documentation, omitting `VariadicList`/`VariadicPack` and
  including keyword argument labels when required.
### Removed
- The `escaping` function effect is no longer supported. Migrate
  `def(...) escaping -> T` closures to use an explicit capture list `{...}`
  (see Language enhancements).
- Several constructs deprecated in v0.26.2 are no longer accepted:

  - The `@register_passable` and `@register_passable("trivial")` decorators
    are no longer supported. Conform to the `RegisterPassable` and
    `TrivialRegisterPassable` traits instead. Use of either decorator now
    produces a hard error pointing to the trait equivalent.
  - The legacy `__moveinit__()` and `__copyinit__()` method names are no
    longer auto-rewritten to the unified `__init__()` form. Rename these
    methods to `__init__()` with keyword-only `take: Self` and `copy: Self`
    arguments, respectively, as introduced by init unification in v0.26.2.
    Existing legacy spellings now fail to compile with errors such as
    `no matching function in initialization` rather than being silently
    rewritten.
- The deprecated `@doc_private` decorator has been removed. Use `@doc_hidden`
  instead.
- Removed the `store_release()`, `store_relaxed()`, `load_acquire()`, and
  `load_relaxed()` helpers from `std.gpu.intrinsics`. Use
  `Atomic[dtype, scope=...].store()` and `Atomic[dtype, scope=...].load()`
  with the desired `Ordering` instead:

  ```mojo
  # Before
  from std.gpu.intrinsics import store_release, load_acquire

  store_release[scope=Scope.GPU](ptr, value)
  var v = load_acquire[scope=Scope.GPU](ptr)

  # After
  from std.atomic import Atomic, Ordering

  Atomic[dtype, scope="device"].store[ordering=Ordering.RELEASE](ptr, value)
  var v = Atomic[dtype, scope="device"].load[ordering=Ordering.ACQUIRE](ptr)
  ```
- API removals:

  - Removed the `param_env.mojo` module. Use `defines.mojo` instead.
  - Removed `LinkedList.__getitem__()`. Indexing a `LinkedList` is O(n), and
    exposing `__getitem__()` encouraged accidentally quadratic code; iterate
    the list instead.
  - Removed the unused `UIntSized` trait and its prelude re-export.
  - Removed the `pdl_level` parameter from `elementwise()`, `reduction()`,
    and `reducescatter()` kernel APIs. PDL usage is now an internal
    compile-time default.
### Fixed
- Fixed `math.sqrt()` on `Float64` on NVIDIA GPU producing a cryptic
  `could not find LLVM intrinsic: "llvm.nvvm.sqrt.approx.d"` failure at LLVM
  IR translation time. `math.sqrt()` now rejects `Float64` on NVIDIA GPU at
  compile time with the message
  `DType.float64 isn't supported for approx sqrt on NVIDIA GPU`. The
  `math.sin()` and `math.cos()` constraint messages were similarly sharpened
  to name the op. (Issue #6434)
- Fixed pack inference failing with
  `could not infer type of parameter pack ... given value with unresolved type`
  when passing list, dict, set, or slice literals to a `*Ts`-bound variadic
  pack parameter (for example, `def foo[*Ts: Iterable](*args: *Ts)`). Pack
  inference now applies the same default-type fallback that single-argument
  trait-bound parameters already use, so `foo([1, 2, 3], [4, 5, 6])` resolves
  each literal to its default type (for example, `List[Int]`) before binding
  the pack.
- Fixed `mojo` aborting at startup with `std::filesystem::filesystem_error`
  when `$HOME` is not traversable by the running UID (common in containerized
  CI where the image's build-time UID differs from the runtime UID). The
  config search now treats permission errors as "not found" and falls through
  to the next candidate. (Issue #6412)
- `mojo run` and `mojo debug` now honor `-Xlinker` flags by loading the
  referenced shared libraries into the in-process JIT. Previously the flags
  were dropped (with a `-Xlinker argument unused` warning), leaving programs
  that called into external shared libraries via `external_call()` unable to
  resolve those symbols at runtime. Supported `-Xlinker` forms mirror what the
  system linker accepts; flags with no JIT meaning are reported as a warning
  and ignored. (Issue #6155)
- Fixed `libpython` auto-discovery failing for Python 3.14 free-threaded
  builds. The discovery script constructed the library filename without the
  ABI flags suffix (for example, it looked for `libpython3.14.dylib` instead
  of `libpython3.14t.dylib`). (Issue #6366)
- Fixed `RTLD.LOCAL` having the wrong value on Linux. It was set to `4`
  (`RTLD_NOLOAD`) instead of `0`, causing `dlopen()` with
  `RTLD.NOW | RTLD.LOCAL` to fail. (Issue #6410)
- Fixed `mojo format` crashing after upgrading Mojo versions due to a stale
  grammar cache. (Issue #6144)
- Fixed `atof()` producing incorrect results for floats near the
  normal/subnormal boundary (for example,
  `Float64("4.4501363245856945e-308")` returned half the correct value).
  (Issue #6196)
- Fixed a compiler crash (`'get_type_name' requires a concrete type`) when
  using default `Writable`, `Equatable`, or `Hashable` implementations on
  structs with MLIR-type fields (for example, `__mlir_type.index`). The
  compiler now correctly reports that the field does not implement the
  required trait. (Issue #5872)
- Fixed `Atomic.store()` silently dropping the requested `scope`. The
  previous implementation lowered to `atomicrmw xchg` without forwarding
  `syncscope`, so `Atomic[..., scope="device"].store(...)` was emitting a
  system-scope store on NVPTX (extra L2/NVLink fences) and an
  over-synchronized store on AMDGPU. `Atomic.store()` now lowers via
  `pop.store atomic syncscope(...)`, emitting `st.release.<scope>` on NVPTX
  and a properly-scoped LLVM atomic store on AMDGPU. The Mojo API surface is
  unchanged.
- Fixed `Process.run()` not inheriting the parent's environment variables.
  Child processes spawned via `Process.run()` now correctly receive the
  parent's environment.
- Fixed `\xhh` and `\ooo` escape sequences in string literals being
  interpreted as raw bytes instead of Unicode code points, which produced
  malformed UTF-8 for values `>= 0x80`. The escapes now match Python `str`
  semantics (and the existing `\u`/`\U` handling): `"\x85"` encodes U+0085
  (NEL) as two UTF-8 bytes and `ord("\x85")` returns `133` instead of `5`.
  Code that relied on `\xhh` to emit a single raw byte for non-ASCII values
  must construct the bytes explicitly (for example, via a `List[Byte]`
  literal). (Issue #2842)
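Since the new behavior is defined to match Python `str` semantics, the reference behavior can be checked directly in Python:

```python
# Python reference semantics that Mojo string literals now follow:
# \xhh denotes a Unicode code point, not a raw byte.
s = "\x85"                   # U+0085 (NEL)
print(ord(s))                # 133
print(s.encode("utf-8"))     # b'\xc2\x85' -- the code point is two UTF-8 bytes

# To get a single raw 0x85 byte, construct bytes explicitly instead:
print(bytes([0x85]))         # b'\x85'
```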
- Fixed incorrect data layout for MI250X AMDGPU architectures. (Issue #6451)
- Fixed several Apple-target issues: macOS 26 target detection no longer
  produces unrecognized arch strings like `metal:2-metal4` when the installed
  Xcode cannot compile Metal 4.0 (the `-metal4` suffix is now applied only
  when the toolchain supports it); `UnsafePointer.gather()`,
  `UnsafePointer.scatter()`, and `strided_load()` no longer silently read
  zero on Apple GPUs (the per-lane fallback now uses typed pointer
  arithmetic; NVIDIA, AMD, and CPU paths are unchanged); and `rotate_left()`
  and `rotate_right()` intrinsics now lower correctly to the Apple AIR
  backend.
- Fixed several `TileTensor` issues: `write_to()` now correctly handles 1D,
  3D+, nested-layout, and dynamic-shape tensors via a generic elementwise
  fallback (with bracket-delimited, comma-separated formatting at all ranks);
  incorrect alignment in `__getitem__()` is fixed; default alignment in
  `load()` and `store()` now uses the caller-specified `width` parameter
  instead of `Self.element_size`; and SIMD loads/stores on CPU now use
  `alignment=1` to prevent segfaults on naturally-unaligned data (GPU still
  uses aligned access where the layout guarantees alignment).
- Fixed several `tile_layout` issues: `blocked_product()` now zips block and
  tiler dimensions per mode (matching the legacy `blocked_product()`
  behavior); `complement()` now propagates `UNKNOWN_VALUE` instead of
  returning a static shape of `0`, so downstream layout algebra falls back to
  runtime dimensions and bounds checks for
  `LayoutTensor.flatten().vectorize[N]()` are correct; and `idx2crd()` now
  returns correct coordinates for nested layouts.
- Fixed `mojo --version` printing the MAX version instead of the Mojo
  compiler version.
- Fixed comptime `and`/`or` expressions to accept any `Boolable` operands,
  matching runtime behavior. This also enables mixed-type expressions like
  `comptime if some_Bool and some_Optional`.
- Fixed several codegen correctness issues affecting valid Mojo programs: an
  SRoA miscompile that incorrectly promoted arrays accessed via dynamic
  offsets through a constant GEP; a use-after-free where destructors of live
  owned values were inserted before, rather than after, a `lit.ref.store`
  into a ref with `#lit.any.origin`; silent memory corruption when calling
  `abi("C")` functions that returned structs via `sret`; and bogus
  `existing function with conflicting attributes` errors when calling the
  same external function more than once with an `sret`/`byval` ABI.
- Fixed several `mojo-lsp-server` crashes affecting REPL/notebook contexts,
  parameter-pack-related diagnostics, files importing from `.mojopkg`, and
  files using stateless closures. The LSP also no longer mistakes REPL buffer
  identifiers (which contain a `.mojo` extension) for relative module
  imports.
- Fixed several debugger display issues: variables after their ASAP
  destruction point at `-O0` now correctly show "not available" instead of
  stale values; unsigned integers (`UInt`, `UInt8`, etc.) display with
  correct unsigned semantics; `ref` loop variables show `index` instead of
  `pointer<index>`; and `String` fields typed as `Scalar[T]` and `Tuple`
  values display correctly.
- Fixed two `mojo format` (mblack) issues: it no longer loses the `t` prefix
  when splitting long t-string literals across lines, and no longer inserts a
  stray space between `*` and a complex operand in variadic pack unpacking
  annotations.
- Fixed `BitSet.set_all()` and `BitSet.toggle_all()` writing `~0` to every
  underlying 64-bit word, including bits beyond the logical `size` when
  `size` was not a multiple of 64. Those stray high bits were counted by
  `__len__()`, producing incorrect population counts; the methods now mask
  off the unused high bits.
- Fixed `syncwarp()` on AMD GPUs, which was previously implemented as a
  no-op. It now lowers to `llvm.amdgcn.wave.barrier`, providing the
  control-flow synchronization required to correctly sequence shared-memory
  writes followed by reads across lanes.
- Fixed `isnan()`, `isinf()`, and `isfinite()` failing during LLVM lowering
  for `float8_e3m4` and `float4_e2m1fn`. `float4_e2m1fn` (no NaN/Inf
  encodings) folds to constant branches; `float8_e3m4` casts through
  `bfloat16` to reuse the existing `llvm.is.fpclass` path.
### Special thanks
Special thanks to our community contributors:
aweifh1-29gh20 (@isaacuselman), Ben Wibking (@BenWibking), Bernhard Merkle (@bmerkle), Brian Grenier (@bgreni), Byungchul Chae (@byungchul-sqzb), c-pozzi (@c-pozzi), Christoph Schlumpf (@christoph-schlumpf), Evan Owen (@ulmentflam), Frost Ming (@frostming), jglee-sqbits (@jglee-sqbits), Manuel Saelices (@msaelices), martinvuyk (@martinvuyk), minkyu (@kkimmk), Parsa Bahraminejad (@prsabahrami), pei0033 (@pei0033), Pierre Gordon (@pierlon), soraros (@soraros), Turcik (@alexturcea)