Mojo v1.0.0b1
Highlights
- `fn` is deprecated; use `def`. Mojo now emits a compiler warning on uses of `fn`; this will become a compilation error in the next release. This completes the `def`/`fn` unification begun in v0.26.2: `def` is now Mojo's standard function declaration keyword, with the same non-raising semantics `fn` had. See Language changes.
- Unified closures. This release continues the closure unification work begun in earlier releases: stateless closures auto-lift to top-level functions (and can be passed as FFI callbacks), the `ref` capture convention is supported, default capture conventions can be combined with explicit capture lists, and a new `thin` function effect declares a plain function pointer type that doesn't carry captured state. See Language enhancements.
- `UnsafePointer` is non-null by design. The default null constructor and `__bool__()` method are deprecated, and `UnsafePointer` no longer conforms to `Defaultable` or `Boolable`. Express nullability with `Optional[UnsafePointer[...]]`, which shares `UnsafePointer`'s layout (the null address is the `None` niche) so nullable pointers remain zero-overhead and FFI-safe. See Pointer and memory.
- Bounds-checked collections by default. Negative indexing has been removed from all standard library collections — `x[-1]` is now a compile-time error; use `x[len(x) - 1]` instead — and bounds checking is now on by default for all collections on CPU. Out-of-bounds accesses report the user's call site. Bounds checking remains off by default on GPU for performance; use `mojo build -D ASSERT=all` to enable. See Collections and iterators.
- `NDBuffer` removed. `NDBuffer` has been fully removed from the standard library. Migrate to `TileTensor`. See Collections and iterators.
- Expanded GPU hardware support. Apple Metal becomes a much more capable Mojo target — `print()` works, dynamic threadgroup memory (`external_memory[]()`) is supported, Apple M5 MMA intrinsics enable hardware matrix multiply-accumulate, and Apple GPU targets prefer `metal4` features by default. Added support for AMD MI250X and NVIDIA B300 (`sm_103a`) accelerators. See GPU programming.
- GPU primitive id accessors migrated `UInt` → `Int`. `thread_idx`, `block_idx`, `block_dim`, `grid_dim`, `global_idx`, `lane_id`, `warp_id`, and the cluster accessors now return `Int` as part of a broader migration to standardize on `Int` for sizes and offsets. Temporary `*_uint` aliases provide a migration path; they will eventually be deprecated and removed. See GPU programming.
- CPU `DeviceContext` expansion. `DeviceContext(api="cpu")` is now a stream-ordered execution context for CPU work, paving the way for NUMA-aware CPU dispatch. New `enqueue_cpu_function()` and `enqueue_cpu_range()` enqueue host functions and parallel ranges with stream ordering relative to surrounding work. See GPU programming.
- Grapheme cluster support in `String` and `StringSlice`. Added UAX #29 grapheme cluster segmentation with `graphemes()`, `count_graphemes()`, the `[grapheme=...]` slicing syntax, and reverse iteration. Correctly handles combining marks, emoji ZWJ sequences, flag emoji, Hangul syllables, and other multi-codepoint clusters. See String and text.
- Type refinement. The compiler now narrows types from `where` clauses, `comptime if` statements, and `comptime assert` statements, driven by `conforms_to()` expressions. This makes `trait_downcast` unnecessary in the common case — Mojo recognizes when a type satisfies a trait inside a refined scope and lets you call its trait methods directly. See Language enhancements.
- Unified reflection API. A new `reflect[T]()` entry point in `std.reflection` returns a `Reflected[T]` handle, replacing the family of `struct_field_*` free functions and `get_type_name`/`get_base_type_name`. `reflect` is auto-imported via the prelude. The legacy free functions and the `ReflectedType[T]` wrapper are now `@deprecated`. See Other library changes.
Documentation
- Added a new Mojo language reference covering lexical elements, expressions, statements, numeric types, structs, and traits. The reference includes new pages on Functions, the `@doc_hidden` decorator, and inline MLIR, with negative examples that highlight common errors.
- Added a manual section on `TileTensor` and TileTensor layouts.
- Separated the Mojo layout library docs from the MAX kernels library, reflecting that the layout library ships with `mojo` and the rest of the kernels library ships with `max`.
- Added a new Compilation targets doc covering how to inspect your platform, select a target configuration, and cross-compile for other CPUs, operating systems, and accelerators.
- Added a new Packaging guide for building Mojo packages, currently covering the `rattler-build` workflow.
- Restructured the Mojo and MAX system requirements docs into a two-level "Continuously tested" / "Known compatible" taxonomy, with a dedicated Mojo GPU compatibility page and per-vendor hardware tables. Added a new Troubleshooting GPU detection section.
- Split the operators page into separate manual pages, refreshed coverage, and added a new tutorial and reference page.
Language enhancements
- Added type refinement based on compile-time assumptions, enabling Mojo to narrow types from `where` clauses, `comptime if` statements, and `comptime assert` statements. Refinements in a scope are driven by `conforms_to()` expressions.

  Before:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if trait_downcast[Equatable](item) == trait_downcast[Equatable](value):
              return True
      return False
  ```

  After:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if item == value:
              return True
      return False
  ```

- Unified closure improvements. This release continues the closure unification work begun in earlier releases: stateless closures auto-lift, the `ref` capture convention is supported, default capture conventions can be combined with explicit capture lists, and a new `thin` function effect declares a plain function pointer type that doesn't carry captured state.

  ```mojo
  def main() raises:
      var a, b, c, d = 1, 2, 3, 4
      var x = "hello"

      # Legacy closure: no capture list. Cannot capture variables.
      def hello():
          print("hi")

      # Unified closure with no captures (stateless). Stateless closures
      # lift to top-level functions and can be passed as FFI callbacks.
      def add_one(n: Int) {} -> Int:
          return n + 1

      # Unified closure with explicit captures and a default capturing
      # convention:
      def my_fn() {mut a, b, c^, read}:
          # capture:
          # `a` by mut reference
          # `b` by immut reference
          # `c` by moving
          # `d` by immut reference (the default `read` convention)
          use(a, b, c, d)

      # Unified closure that captures `x` by ref (carries an
      # origin-mutability parameter):
      def show_x() {ref x}:
          print(x)

      # Function effects come before the capture list. The calling context
      # must handle errors raised from a `raises` closure.
      def fallible() raises {}:
          raise Error("nope")

      # Closures are invoked like ordinary functions:
      hello()
      print(add_one(41))
      my_fn()
      show_x()
      try:
          fallible()
      except e:
          print(e)

      # The `thin` function effect declares a plain function pointer
      # type that doesn't carry captured state. Stateless closures and
      # top-level functions are compatible with `thin` function pointers:
      var fn_ptr: def(Int) thin -> Int = add_one
      print(fn_ptr(99))
  ```

- Added `abi("C")` as a function effect for declaring the C calling convention on function definitions and function pointer types. Functions marked with `abi("C")` use the platform C ABI (System V x86-64 / ARM64 AAPCS) for struct arguments and return values, enabling safe interop with C libraries. `DLHandle.get_function()` now enforces that the type parameter carries `abi("C")`, preventing silent ABI mismatches when loading C symbols.

  ```mojo
  # C-ABI function definition (safe as a callback into C code)
  def add(a: Int32, b: Int32) abi("C") -> Int32:
      return a + b

  # C-ABI function pointer type (safe for use with DLHandle.get_function)
  var f = handle.get_function[def(Float64) abi("C") -> Float64]("sqrt")
  ```

- Added support for conditional `RegisterPassable` conformance.
- The ternary `if`/`else` expression now coerces each element to its contextual type when obvious. For example, this works instead of producing an incompatible-metatypes error:

  ```mojo
  comptime some_type: Movable = Int if cond else String
  ```

- Variadic lists and packs can be forwarded through runtime calls with `*pack` when the callee takes a compatible variadic list or pack.

  ```mojo
  def callee[*Ts: Writable](*args: *Ts):
      comptime for i in range(args.__len__()):
          print(args[i])

  def forwarder[*Ts: Writable](*args: *Ts):
      callee(*args)

  forwarder(1, "hello", 3.14)  # prints each value on a separate line
  ```

- Heterogeneous variadic packs can now be specified with a `SomeTypeList` helper. These two are equivalent:

  ```mojo
  def foo[*arg_types: Copyable](*args: *arg_types) -> Int: ...
  def foo(*args: *SomeTypeList[Copyable]) -> Int: ...
  ```

- String literals now support `\uXXXX` and `\UXXXXXXXX` Unicode escape sequences, matching Python. The resulting code point is stored as UTF-8. Invalid code points and surrogates are rejected at parse time.
- T-strings can now be used in `comptime assert` messages:

  ```mojo
  def foo[i: Int]():
      comptime assert i > 5, t"expected i > 5, got {i}"
  ```
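The `abi("C")` enforcement described above parallels how Python's `ctypes` requires callers to declare a symbol's C signature before calling it. A minimal Python sketch (assuming a Unix-like system where `sqrt` from the C math library is resolvable in the running process):

```python
import ctypes

# dlopen(NULL): look up symbols already linked into this process.
libc = ctypes.CDLL(None)

# Declare the C ABI of sqrt explicitly -- the same idea as requiring
# abi("C") on the type passed to DLHandle.get_function in Mojo. Without
# this, ctypes defaults to int argument/return types and miscalls it.
libc.sqrt.restype = ctypes.c_double
libc.sqrt.argtypes = [ctypes.c_double]

print(libc.sqrt(9.0))  # 3.0
```

The failure mode both designs guard against is the same: calling a C symbol through a signature the C side never agreed to.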
Language changes
- The `fn` keyword for function declarations is deprecated. Mojo now emits a compiler warning on uses of `fn`; this will become a compilation error in the next release. Use `def` instead.
- The `unified` keyword has been removed; specify unified-closure semantics with an explicit capture list `{...}` after the function signature. An empty capture list `{}` denotes unified with no captures; closures without any capture list are legacy. Mojo also now warns when a function pointer type omits the `thin` effect; specify `thin` explicitly to silence the warning.
- Import statements of the form `from pkg import ...` no longer make `pkg` available to the module.
- Removed support for comparing tuples of differing lengths or types. Such comparisons (for example, `(1, 2) != (4, 5, 6)`) are now rejected statically by the type system instead of silently returning not-equal.
- Variadic parameter lists are now `ParameterList` and `TypeList` instead of `!kgen.param_list`, so they can be used like ordinary types:

  ```mojo
  def callee[*values: Int]():
      var v = 0
      for i in range(len(values)):
          v += values[i]
      for elt in values:
          v += elt
  ```

- Each Mojo function now has its own unique function-literal type. Two separately-defined functions, even with identical signatures, are not interchangeable through their literal types; use a function pointer type (for example, `def(Int) thin -> Int`) to abstract over them.
- `A if comptime(C) else B` now skips elaboration of the dead branch, treating the ternary expression as a compile-time evaluation contract analogous to `comptime if C: A else: B`.
- `@explicit_destroy` is now rejected at parse time when paired with an unconditional `ImplicitlyDestructible` conformance; it remains valid only on conditional (where-clause-constrained) conformances.
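For context on the tuple-comparison change, the removed behavior matched Python's dynamic semantics, where comparing mismatched tuples silently yields a value rather than an error:

```python
# Python compares tuples of different lengths dynamically; equality is
# simply False. Mojo now rejects this comparison at compile time instead
# of returning the silent result.
print((1, 2) == (4, 5, 6))  # False
print((1, 2) != (4, 5, 6))  # True
```

Rejecting the comparison statically turns a likely logic bug (a comparison that can never be equal) into a compile-time diagnostic.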
Library changes
Type system and traits
- The `Boolable`, `Defaultable`, and `Writable` traits no longer inherit from `ImplicitlyDestructible`. Generic code that needs the destructor bound must now request it explicitly: `T: Writable & ImplicitlyDestructible`.
- Standard library types now use conditional conformances:
  - `Span`: `Writable`, `Hashable`
  - `Tuple`, `Optional`, `Variant`, and `UnsafeMaybeUninit`: `RegisterPassable`
  - `Tuple`: `Defaultable` (when all element types are `Defaultable`)
  - `Variant`: `Copyable`, `ImplicitlyCopyable`
  - `Optional`: `DevicePassable` (conditional on element type)
- `ArcPointer` now conditionally conforms to `Hashable` and `Equatable` when its inner type does, with `__eq__()` and `__hash__()` delegating to the managed value (matching C++ `shared_ptr` and Rust `Arc` semantics). This makes `ArcPointer` usable as a `Dict` key or `Set` element with value-based equality; pointer identity remains available via the `is` operator.
- `Path` now conforms to `Comparable`, enabling lexicographic ordering and use with `sort()`.
Atomic operations
- Atomic operations have moved to a dedicated `std.atomic` module. The `Consistency` type has been renamed to `Ordering` and its `MONOTONIC` member to `RELAXED`, to align with conventions used by other languages. Update existing code as follows:

  ```mojo
  # Before
  from std.os import Atomic
  from std.os.atomic import Atomic, Consistency, fence
  _ = atom.load[ordering=Consistency.MONOTONIC]()

  # After
  from std.atomic import Atomic, Ordering, fence
  _ = atom.load[ordering=Ordering.RELAXED]()
  ```

- Swapped the ordering arguments of `Atomic.compare_exchange()` so `success_ordering` is listed before `failure_ordering`, matching the convention used by C++, Rust, and other languages.
- `Ordering` now has a default constructor that selects `RELAXED` on Apple GPU (where `release` ordering is not supported by Metal) and `SEQUENTIAL` on all other targets. All `Atomic` methods and `fence()` use this platform-aware default instead of hard-coding `SEQUENTIAL`.
Pointer and memory
- `UnsafePointer` is now non-null by design. The default null constructor `__init__(out self)` and `__bool__(self)` method are deprecated, and `UnsafePointer` no longer conforms to `Defaultable` or `Boolable`. See the non-null pointer proposal for the full design.

  To migrate, express nullability with `Optional[UnsafePointer[...]]`, which shares the layout of `UnsafePointer` (the null address is the `None` niche) so nullable pointers remain zero-overhead and FFI-safe.

  ```mojo
  # Before: null default construction
  var ptr = UnsafePointer[Int, origin]()

  # After: express absence with Optional
  var ptr: Optional[UnsafePointer[Int, origin]] = None

  # Before: Bool-based null check
  if ptr:
      use(ptr[])

  # After: check the Optional, then unwrap
  if ptr:
      use(ptr.value()[])
  ```

  For a non-null placeholder for a field that will be populated later (for example, a buffer allocated on demand), use `UnsafePointer.unsafe_dangling()` — a well-aligned but dangling pointer. It's not a null sentinel; lazy-init types must track initialization separately.

- `CStringSlice` can no longer represent a null pointer. To represent nullability use `Optional[CStringSlice]`, which is guaranteed to have the same size and layout as `const char*` (with `NULL` as the empty `Optional`).
- `OwnedDLHandle.get_symbol()` now returns `Optional[UnsafePointer[...]]` instead of aborting when a symbol is not found, allowing callers to handle missing symbols gracefully.
- `alloc[T](count, alignment)` now aborts if the underlying allocation fails.
- Added `std.memory.forget_deinit()` to enable low-level code to skip running a destructor for a value. Use rarely, only when building low-level abstractions.
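The migration pattern throughout this section is the familiar one from languages with option types: encode absence in the type, check, then unwrap. A Python sketch of the same shape (the Mojo-specific layout guarantee has no Python analog; `find_index` is an illustrative name):

```python
from typing import Optional

def find_index(xs: list, v) -> Optional[int]:
    # Return the position of v, or None when absent -- absence lives in
    # the return type rather than in a magic sentinel value.
    for i, x in enumerate(xs):
        if x == v:
            return i
    return None

idx = find_index([10, 20, 30], 20)
if idx is not None:   # check the Optional...
    print(idx)        # ...then use the unwrapped value: 1
```

The type system then forces every caller to handle the "not found" case, which is exactly what dropping `Boolable` from `UnsafePointer` achieves for null.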
Collections and iterators
- `NDBuffer` has been fully removed. Migrate to `TileTensor`.
- Negative indexing has been removed from all stdlib collections (`List`, `Span`, `InlineArray`, `String`, `StringSlice`, `LinkedList`, `Deque`, `IntTuple`) to enable cheap CPU bounds checks by default. Using a negative `IntLiteral` for indexing now triggers a compile-time error:

  ```
  constraint failed: negative indexing is not supported, use e.g. `x[len(x) - 1]` instead
  ```

  Update any `x[-1]` to `x[len(x) - 1]`.

- Bounds checking is now on by default for all collections on CPU. Out-of-bounds accesses report the user's call site:

  ```mojo
  def main():
      var x = [1, 2, 3]
      print(x[3])
  ```

  ```
  At: /tmp/main.mojo:3:12: Assert Error: index 3 is out of bounds, valid range is 0 to 2
  ```

  Bounds checking is still off by default on GPU for performance. Use `mojo build -D ASSERT=all` to enable bounds checking on GPU; use `-D ASSERT=none` to disable all asserts including CPU bounds checking.

- `range()` overloads that took differently-typed arguments or arguments that were `Intable`/`IntableRaising` but not `Indexer` have been removed. Callers should pass consistent integral argument types.
- Added an `IterableOwned` trait to the iteration module. Types conforming to `IterableOwned` implement `__iter__(var self)`, which consumes the collection and returns an iterator that owns the underlying elements. `List`, `Optional`, `Deque`, `LinkedList`, `Dict`, `Set`, `Counter`, and `InlineArray` now conform; `Span` conforms conditionally on `T: Copyable`, with the owned iterator yielding copies by value.

  Iterator adaptors (`enumerate()`, `zip()`, `map()`, `peekable()`, `take_while()`, `drop_while()`, `product()`, `cycle()`, `count()`, `repeat()`) now conform to `IterableOwned`. Added owned overloads of `enumerate()`, `zip()`, `map()`, `peekable()`, `take_while()`, `drop_while()`, `product()`, and `cycle()` that consume the input iterable.

- Added `map()` and `and_then()` methods to `Optional`. `map()` applies a function to the contained value (returning `Optional[To]`); `and_then()` flat-maps over operations that themselves return an `Optional`.

  ```mojo
  var o = Optional[Int](42)

  def closure(n: Int) {} -> String:
      return String(n + 1)

  var mapped: Optional[String] = o.map[To=String](closure)
  print(mapped)  # Optional("43")
  ```

- Added `Optional.destroy_with(destroy_func)`, which destroys an `Optional[T]` in-place using a caller-provided destructor. This enables `Optional` to hold element types that are not `ImplicitlyDestructible` (for example, types marked `@explicit_destroy`), mirroring `Variant.destroy_with()`. Both `destroy_with()` methods now accept closures that capture local state in addition to plain function references. `Variant.destroy_with()` callers must now pass the destroyed type explicitly (for example, `v^.destroy_with[Int](destroy_func)`) since `T` can no longer be inferred from the closure type.
- Added a generic `__contains__()` method to `Span` for any element type conforming to `Equatable`, not just `Scalar` types.
- `assert_raises()` now catches custom `Writable` error types, not just `Error`.
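Both indexing migration points behave like their Python counterparts, where list access has always been bounds-checked:

```python
x = [1, 2, 3]

# Replacement for the removed negative-index form x[-1]:
print(x[len(x) - 1])  # 3

# Out-of-bounds access fails loudly instead of reading garbage:
try:
    x[3]
except IndexError as e:
    print(e)  # list index out of range
```

The difference is that Mojo rejects negative index *literals* at compile time, so the mistake never reaches runtime.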
String and text
- `String.__len__()` is deprecated. Use `String.byte_length()` or `String.count_codepoints()` instead.
- Grapheme cluster support in `String` and `StringSlice`. Added UAX #29 grapheme cluster segmentation, correctly handling combining marks, emoji ZWJ sequences, flag emoji, Hangul syllables, and other multi-codepoint clusters.
  - `graphemes()` returns a `GraphemeSliceIter` yielding each user-perceived "character" as a `StringSlice`; `count_graphemes()` returns the grapheme cluster count.
  - `StringSlice` supports slicing by grapheme cluster via the `grapheme=` keyword argument, mirroring the existing `byte=` indexer (for example, `s[grapheme=0:3]`). Because grapheme boundaries are discovered by a forward scan, this is O(n) in byte length — prefer `byte=` when byte offsets are known.
  - Grapheme-aware algorithms `grapheme_indices()`, `nth_grapheme(n)`, and `split_at_grapheme(n)` mirror Rust's `str::grapheme_indices` and friends, useful for editors and UIs mapping cursor byte positions to grapheme boundaries.
  - `GraphemeSliceIter` supports reverse iteration via `next_back()`, `peek_back()`, and the `graphemes_reversed()` constructors on `String`/`StringSlice`. Reverse iteration costs more per cluster than forward iteration because the UAX #29 state machine is forward-scanning.
  - `GraphemeSliceIter.remaining_byte_length()` reports the iterator's remaining byte range in O(1).
  - `count_graphemes()` takes a fast path over printable-ASCII runs: ~10x faster on pure-ASCII text, ~5–6x faster on ASCII-dominant mixed text. Pure non-ASCII text (Arabic, Russian, Chinese) is unchanged.
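The gap these APIs close is the mismatch between code points and user-perceived characters. Python, whose `len()` counts code points, illustrates the problem the grapheme APIs solve:

```python
# One visible character, two code points: 'e' plus a combining acute.
s = "e\u0301"  # renders as "é"
print(len(s))  # 2 -- code points, not graphemes

# A flag emoji is two regional-indicator code points forming a single
# grapheme cluster.
flag = "\U0001F1EB\U0001F1F7"  # 🇫🇷
print(len(flag))  # 2
```

Mojo's `count_graphemes()` reports 1 for each of these strings, matching what a user perceives and what a cursor should treat as one unit.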
Diagnostics and debug
- `abort(message)` now includes the call site location in its output. You can also pass an explicit `SourceLocation` to override it:

  ```mojo
  abort("something went wrong")
  # prints: ABORT: path/to/file.mojo:42:5: something went wrong

  var loc = current_location()
  abort("something went wrong", location=loc)
  ```

- `abort(message)` now prints its message on NVIDIA and AMD GPUs, including block and thread IDs. Previously, the message was silently suppressed on these GPUs. On Apple GPUs, the message is still silently suppressed for now.
- New diagnostics report the user's call site rather than stdlib source: `check_bounds()` for collections asserts on out-of-range indices, and `debug_assert()` now accepts a `call_location` parameter for callers to override the reported `SourceLocation`.
- `SourceLocation` fields are now private; use the `line()`, `column()`, and `file_name()` accessor methods instead.
- Added uninitialized memory read detection for float loads. When compiled with `-D MOJO_STDLIB_SIMD_UNINIT_CHECK=true`, every float load is checked against the debug allocator's poison pattern (the largest finite value of the float type, for example `FLT_MAX` for `Float32`); a match triggers `abort()`. The poison is non-NaN so it coexists with `nan-check` in kernels that intentionally write only active positions. Zero runtime overhead when disabled (the default).
- `InlineArray`'s storage constructor now uses `debug_assert[assert_mode="safe"]` for the element-count check, so size mismatches are caught by default instead of only with `-D ASSERT=all`.
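A toy Python sketch of the poison-pattern mechanism described above (names are hypothetical; in Mojo the check is wired into SIMD float loads, not a helper function):

```python
import sys

POISON = sys.float_info.max  # largest finite float, like FLT_MAX

def checked_load(buf, i):
    # A load that observes the allocator's poison pattern is, with high
    # probability, reading memory that was never initialized. Because
    # the poison is finite (not NaN), NaN-based checks still work.
    v = buf[i]
    if v == POISON:
        raise RuntimeError(f"uninitialized float read at index {i}")
    return v

buf = [POISON] * 4  # freshly "allocated" debug memory, poisoned
buf[0] = 1.5        # only index 0 is initialized

print(checked_load(buf, 0))  # 1.5
try:
    checked_load(buf, 1)     # reads the poison -> flagged
except RuntimeError as e:
    print(e)
```

The trade-off is a vanishingly small false-positive chance (a real value equal to the poison), which is why the check is opt-in via the build flag.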
TileTensor and Layout
- `TileTensor` API extensions:
  - Added `TileTensor.bitcast[target_dtype]()`, which returns a new `TileTensor` viewing the same storage and layout under a different element dtype, replacing the `TileTensor(x.ptr.bitcast[Scalar[T]](), x.layout)` idiom.
  - Added `TileTensor.flat_load()` and `TileTensor.flat_store()` as raw-flat accessors that read and write the underlying storage at a linear offset, bypassing the tensor's layout.
  - Added a `TileTensor.tile()` overload that takes the tile shape as a runtime or compile-time parameter argument, complementing the existing tile APIs.
  - GPU `TileTensor.load()` and `load_linear()` now default `invariant=True` for immutable tensors, enabling the compiler to use `ldg` for read-only memory accesses.
  - Added compile-time bounds checks to `TileTensor`, `ManagedTensorSlice`, and `crd2idx()` to catch out-of-range coordinate accesses at compile time.
- Layout library extensions:
  - Added a compile-time `coalesce()` function for `TensorLayout`, mirroring the legacy `Layout.coalesce()` algorithm (skip shape-1 dims and merge contiguous dims).
  - Added `write_repr_to()` to `Layout` for writing a debug representation to a `Writer`.
  - `vectorize()` and `distribute()` now accept layouts with runtime dimensions.
  - `row_major()` now accepts coord-like arguments directly, no longer requiring them to be wrapped in tuples.
  - Introduced weakly compatible layouts, enabling structural compatibility comparisons between layouts and coordinate indices (up to depth 4). Structural equality is now checked via a `comptime assert` rather than a `where` clause.
  - Changed `CoordLike.value()` to return `Scalar[Self.DTYPE]` instead of `Int`, providing a more expressive return type for layout coordinate values.
  - `Coord`, `RowMajorLayout`, and `ColMajorLayout` now take their parameters as variadic arguments, improving ergonomics when specifying individual coords. Use `*` splat to pass an existing list.
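The coalesce rule mentioned above (skip extent-1 dims, merge contiguous neighbors) can be sketched in a few lines of Python; the function name and conventions are illustrative, not the Mojo API:

```python
def coalesce(shape, stride):
    # Walk dims innermost-first; drop extent-1 dims, and merge a dim
    # into its inner neighbor when it is contiguous with it, i.e. its
    # stride equals inner_extent * inner_stride.
    acc = []  # (extent, stride) pairs, innermost first
    for extent, st in zip(reversed(shape), reversed(stride)):
        if extent == 1:
            continue
        if acc and st == acc[-1][0] * acc[-1][1]:
            acc[-1] = (acc[-1][0] * extent, acc[-1][1])
        else:
            acc.append((extent, st))
    acc.reverse()
    return tuple(e for e, _ in acc), tuple(s for _, s in acc)

# A contiguous row-major (2, 3) layout collapses to one dim of 6:
print(coalesce((2, 3), (3, 1)))        # ((6,), (1,))
# Extent-1 dims are skipped before merging:
print(coalesce((2, 1, 4), (4, 4, 1)))  # ((8,), (1,))
```

Fewer, larger dims after coalescing mean simpler index arithmetic and better opportunities for vectorized copies.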
GPU programming
- Added support for AMD MI250X accelerators.
- Expanded Apple silicon GPU support. Apple Metal GPU is now a more capable Mojo target.
  - `print()` and `_printf()` now work on Apple Metal GPU. Output is chunked through the Metal `os_log` path, with a Float32-only formatter that matches Metal's hardware constraints. `_printf()` currently emits the format string only (not interpolated arguments); `|x| < 1e-7` is truncated to `0.0`.
  - `external_memory[]()` (dynamic threadgroup memory) is now supported on Apple silicon, so existing GPU kernels using `external_memory[]()` work unchanged.
  - Apple M5 MMA intrinsics (`apple_mma_load()`, `apple_mma_store()`, `_mma_apple()`) in `std.gpu.compute.arch.mma_apple` enable hardware matrix multiply-accumulate on Apple GPUs.
  - Added `CompilationTarget.is_apple_m5()` to `std.sys` for detecting Apple M5 targets at compile time; `is_apple_silicon()` now includes M5 in its check.
  - Apple GPU targets now prefer `metal4` features by default when the toolchain supports them, automatically appending `-metal4` to the arch instead of requiring explicit `m5-metal4` selection.
  - Atomic ordering: `release` ordering is not supported on Metal. Apple GPU targets now use `monotonic` (relaxed) atomic ordering by default.
  - Floating-point widths: the compiler now rejects floating-point types wider than 32 bits (`Float64`/`Float80`/`Float128`) for Apple GPU targets, since Metal supports only `Float16` and `Float32`.
- GPU device APIs:
  - Added support for NVIDIA B300 (`sm_103a`) accelerators. New helpers in `std.sys.info` and `std.gpu.host.info` recognize B300 targets for correct kernel dispatch on Blackwell B300.
  - Added `DeviceStream.enqueue_host_func(func, user_data)` exposing the `cuLaunchHostFunc` primitive for Mojo kernels and custom ops. Takes a `thin def(OpaquePointer[MutAnyOrigin]) -> None` callback and an opaque `user_data` pointer. CUDA-only today; non-CUDA backends raise.
  - `DeviceContext` initialization now runs an automatic GPU health check that detects hardware throttling, uncorrectable ECC errors, and zombie VRAM, and fails device creation with an actionable error message on unhealthy GPUs. Added `DeviceContext.run_healthcheck()` to re-invoke the check explicitly. Set `MODULAR_DEVICE_CONTEXT_DISABLE_HEALTHCHECK=true` to disable.
  - Optimized GPU `elementwise()` index computation and dispatch with a `use_32bit` fast path, 4x unrolled grid-stride processing, warp-aligned block sizes, and SM100+ single-tile routing.
- GPU primitive id accessors (`thread_idx`, `block_idx`, `block_dim`, `grid_dim`, `global_idx`, `lane_id`, `warp_id`, `cluster_dim`, `cluster_idx`, and `block_id_in_cluster`) have migrated from `UInt` to `Int`.

  This is part of a broader migration to standardize on the `Int` type for all sizes and offsets in Mojo. As a related step in the same migration, `TensorCore.load_a()` and `TensorCore.load_b()` now also take `Int` arguments instead of `UInt`.

  To provide a gradual migration path, `*_uint` aliases of the seven non-cluster accessors are temporarily available:

  | Accessor | Legacy `UInt` alias |
  | --- | --- |
  | `thread_idx` | `thread_idx_uint` |
  | `block_idx` | `block_idx_uint` |
  | `block_dim` | `block_dim_uint` |
  | `grid_dim` | `grid_dim_uint` |
  | `global_idx` | `global_idx_uint` |
  | `lane_id` | `lane_id_uint` |
  | `warp_id` | `warp_id_uint` |

  The three cluster accessors (`cluster_dim`, `cluster_idx`, `block_id_in_cluster`) migrated directly without `*_uint` aliases, since their usage was limited.

  Code can preserve its prior `UInt` behavior by using a renaming import of the `*_uint` alias:

  ```diff
  - from std.gpu import thread_idx
  + from std.gpu import thread_idx_uint as thread_idx
  ```

  The temporary `*_int` accessors that briefly existed during the phased migration as a forward-compatibility aid have been removed; use the unprefixed accessors (which now return `Int` by default). The `*_uint` aliases will eventually be deprecated and removed.

- CPU `DeviceContext` expansion. `DeviceContext(api="cpu")` is now usable as a stream-ordered execution context for CPU work, paving the way for NUMA-aware CPU dispatch.
  - Added `DeviceContext.enqueue_cpu_function()` and `DeviceContext.enqueue_cpu_range()` for stream-ordered execution of host functions on CPU `DeviceContext` instances. `enqueue_cpu_function()` enqueues a single host function; `enqueue_cpu_range()` enqueues a parallel range whose tasks run concurrently but are stream-ordered relative to surrounding work. Argument passing is not yet supported.
  - `parallelize()`, `parallelize_over_rows()` (in `std.algorithm.backend.cpu.parallelize`), and the `elementwise()` overloads in `std.algorithm.functional` now accept an optional trailing `ctx: Optional[DeviceContext] = None`. When supplied, the context is forwarded to `sync_parallelize()`; otherwise behavior is unchanged.
  - Added a `parallelism_level()` overload that takes a CPU `DeviceContext` and returns the thread-pool size for that specific context, enabling NUMA-specific introspection.
- AMD GPU intrinsics:
  - Added the `ds_read_tr8_b64()` AMD GPU intrinsic in `std.gpu.intrinsics`, performing a 64-bit LDS transpose load of 8-bit elements via `llvm.amdgcn.ds.read.tr8.b64`. Supported on AMD CDNA4+ GPUs.
  - Added a `Scalar[dtype]` overload of `readfirstlane()` so callers no longer need bitcast workarounds to broadcast non-`Int32` scalar values across an AMD GPU wavefront.
  - `AMDBufferResource.load_to_lds()` in `std.gpu.intrinsics` now lowers to the `.ptr.` form of the AMDGPU buffer-load-to-LDS intrinsic, fixing a strided-layout regression. A new `async_copies: Bool = False` parameter opts in to attaching the `amdgpu.AsyncCopies` alias scope on the load, enabling LLVM `vmcnt` relaxation.
  - Added a `broadcast=True` parameter to GPU `warp_id()` (and related id accessors) so callers can avoid manual `warp.broadcast(warp_id())` patterns.
- `tile_io` module for `TileTensor` data movement. Added a `tile_io` module providing `TileTensor` copier traits and copy utilities for moving data between memory hierarchies (DRAM/SRAM). The module includes:
  - `GenericToSharedAsyncTileCopier`, which moves a `TileTensor` from generic memory into shared memory via NVIDIA's `cp.async`. On AMD and Apple GPUs the underlying `async_copy()` falls back to synchronous loads/stores.
  - An optional `swizzle: Swizzle` parameter on `GenericToSharedAsyncTileCopier`, mirroring the swizzled write path in `LocalToSharedTileCopier`.
  - A `masked: Bool = False` parameter on `GenericToSharedAsyncTileCopier`. When enabled, out-of-bounds vectors receive a zero-byte copy with zero-fill, matching `LayoutTensor.copy_from_async[is_masked=True, fill=Fill.ZERO]`.
  - An `AsyncTileCopier` trait abstracting copier conformance.
- TMA `gather4` for sparse 2D tensor loads. Added a TMA `gather4` operation on SM100 (Blackwell) for loading 4 non-contiguous rows from a 2D tensor in a single TMA instruction, surfaced as the `cp_async_bulk_tensor_2d_gather4()` intrinsic in `std.gpu.memory` and integrated with `TMATensorTile`. The API supports:
  - Full 2D tile sparse loads with arbitrary `tile_height` (multiple of 4) and `tile_width`, replacing the prior 4-row-per-call limit.
  - Arbitrary `row_width` — previously restricted to the swizzle box width. The API automatically computes the box width from the swizzle constraint and supports non-divisible widths via TMA hardware zero-fill on the last column group, so kernels no longer need to hand-code column-group loops.
- 1D TMA instructions for SM90+ NVIDIA GPUs. Added 1D TMA (Tensor Memory Accelerator) instruction support in `std.gpu.memory`. 1D TMA copies do not require a pre-allocated tensormap object on the host, providing greater flexibility than the existing 2D–5D TMA path. New functions: `cp_async_bulk_shared_cluster_global()`, `cp_async_bulk_global_shared_cta()`, `cp_async_bulk_prefetch()`, and `cp_async_bulk_reduce_global_shared_cta()`.
- Readable GPU kernel names in profilers. GPU kernels in the standard library and across MAX kernels (elementwise, GEMV, multistage matmul, attention, convolution, MoE, normalization, quantization, BMM, grouped matmul, SM100 matmul, AMD matmul, communication, and sampling) now expose human-readable names in profiler traces such as Nsight Systems, replacing previously mangled KGEN symbols.
- Added `Span`-based overloads for `enqueue_copy()`, `enqueue_copy_from()`, and `enqueue_copy_to()` on `DeviceContext`, `DeviceBuffer`, and `HostBuffer`, providing a safer alternative to raw `UnsafePointer` for host-device memory transfers.
Other library changes
- Removed `trait_downcast()` and `trait_downcast_var()` from across the standard library, replaced by type refinement (see Language enhancements). Public APIs are unchanged.
- `external_call()`'s `return_type` requirement has been relaxed from `TrivialRegisterPassable` to `RegisterPassable`.
- Several standard library APIs gained unified-closure overloads: `parallelize()` and `parallelize_over_rows()` (in `std.algorithm.backend.cpu.parallelize`), `bench.bencher()`, `DeviceContext.execution_time()`, and `DeviceContext.enqueue_function()` (the GPU enqueue path, renamed from the previous `enqueue_closure()`).
- Consolidated the reflection APIs in `std.reflection` behind a unified entry point `reflect[T]()` returning a `Reflected[T]` handle. `reflect()` is auto-imported via the prelude. Methods on the handle replace the family of `struct_field_*` free functions (dropping the `struct_` prefix — only structs have fields) and the `get_type_name()`/`get_base_type_name()` free functions:

  ```mojo
  struct Point:
      var x: Int
      var y: Float64

  def main():
      comptime r = reflect[Point]()
      print(r.name())                          # "Point"
      print(r.field_count())                   # 2
      print(r.field_names()[0])                # x
      comptime y_type = r.field_type["y"]()    # Reflected[Float64]
      print(y_type.name())                     # "SIMD[DType.float64, 1]"
      print(reflect[List[Int]]().base_name())  # "List"
      var v: y_type.T = 3.14
  ```

  The legacy free functions and the `ReflectedType[T]` wrapper are now `@deprecated`; they will be removed in a future release.

- `align_down()` and `align_up()` now accept generic `SIMD[dtype, width]` integer values, replacing the prior `UInt`-only overloads.
- Extended `FastDiv` and `mulhi()` to support 64-bit integer types.
- Added a `Variadic.contains_value` comptime alias to check whether a variadic sequence contains a specific value at compile time.
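The arithmetic behind the two integer utilities above, sketched in Python (assuming power-of-two alignments for the bit-mask form, and unsigned 64-bit operands for `mulhi`):

```python
def align_down(x: int, a: int) -> int:
    # Round x down to a multiple of a (a must be a power of two).
    return x & ~(a - 1)

def align_up(x: int, a: int) -> int:
    # Round x up to a multiple of a.
    return (x + a - 1) & ~(a - 1)

MASK64 = (1 << 64) - 1

def mulhi_u64(a: int, b: int) -> int:
    # High 64 bits of the full 128-bit product of two u64 values --
    # the building block FastDiv uses to replace division by a
    # runtime-constant divisor with a multiply and shift.
    return ((a & MASK64) * (b & MASK64)) >> 64

print(align_down(37, 16))  # 32
print(align_up(37, 16))    # 48
print(mulhi_u64(1 << 40, 1 << 40))  # 65536 (2**80 >> 64)
```

Python's arbitrary-precision integers make the 128-bit intermediate product free; in Mojo the same result comes from a widening multiply.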
Tooling changes
-
Removed the legacy
MOJO_ENABLE_STACK_TRACE_ON_ERRORandMOJO_ENABLE_STACK_TRACE_ON_CRASHenvironment variables. Instead, set theMODULAR_DEBUGenvironment variable tostack_trace_on_errorto enable generation of stack traces when a Mojo program raises an error. -
- Debugger UX:

  - The Mojo debugger now shows a `Variant` variable's active type name and
    value in LLDB (for example, `Int(42)` or `String("hello")`) instead of
    exposing raw `_DefaultVariantStorage` internals.
  - The Mojo debugger now displays `Optional[T]` variables as `None` or
    `Some(value)` in LLDB instead of exposing raw `_DefaultVariantStorage`
    internals.
  - The Mojo debugger now displays scalar types (for example, `UInt8`,
    `Float32`) as plain values instead of `([0] = value)`, and elides internal
    `_mlir_value` wrapper fields from struct display.
  - The Mojo debugger now correctly displays `UnsafePointer[T]` values in LLDB
    for all pointed-to types, including signed integers (no longer rendered as
    huge unsigned values), `Bool` (`True`/`False`), and floats.
  - The Mojo debugger now displays `StringSlice`, `StaticString`, and their
    underlying `Span[Byte]` values as quoted strings in LLDB.
  - At `-O0`, trivially destructible types (`Int`, `Float`, `Bool`, `SIMD`,
    etc.) now remain visible in the debugger through the end of their lexical
    scope instead of disappearing at the ASAP destruction point.
- LSP and REPL responsiveness:

  - Code completion and signature help in REPL/notebook contexts are now
    amortized O(1) per request by caching parsed prior cells across requests,
    eliminating a quadratic O(N²) slowdown in long sessions.
  - LSP parse time is reduced by deferring body resolution of imported
    bytecode declarations and resolving named imports lazily, avoiding eager
    pulls of large transitive dependencies. Files with docstring code blocks
    parse roughly 2x faster.
  - Added a `--mojo-version` flag to `mojo-lsp-server` for verifying the Mojo
    version that the LSP is using.
- `mojo` CLI and toolchain:

  - `mojo --version` now prints a semantic Mojo version (for example,
    `1.0.0...`) instead of an internal build identifier, and the same version
    is used wherever the compiler performs version checks.
  - `mojo build --print-supported-targets` now lists registered targets
    sorted alphabetically, with a graceful empty-list message.
  - The compiler now selects the target's baseline CPU when cross-compiling
    with `--target-triple` without `--target-cpu` and the host and target
    architectures differ.
  - ASAN-instrumented Mojo binaries on macOS now use `llvm-symbolizer`
    instead of `atos`, so stack traces report the full inlined call chain
    through user functions.
  - Mojo package files (`.mojopkg`) now use format version 2 with
    zstd-compressed MLIR bytecode, significantly reducing package, wheel, and
    Docker image sizes.
- `mojo doc` and docstring validation:

  - `mojo doc` now preserves parameterized type names (for example, `List[K]`,
    `Optional[V]`, `UnsafePointer[Scalar[dtype]]`) in the API doc JSON
    `"type"` fields, instead of emitting only the bare base name.
  - `mojo doc` now emits a diagnostic when a public Mojo module has no
    module-level docstring and `-mojo-diagnose-missing-doc-strings` is
    active. Private modules and modules nested inside private packages are
    exempt.
  - Docstring validation no longer requires inferred parameters (those before
    `//` in a parameter list) to be documented; documenting them remains
    valid.
  - Docstring validation now accepts `!` and `?` as valid sentence-ending
    punctuation throughout.
  - `def ... raises` functions now require a `Raises:` docstring section like
    any other raising function.
  - The `isDef` field has been removed from `mojo doc` JSON output.
- `mojo format` (mblack):

  - No longer supports the deprecated `fn` keyword or the removed `owned`
    argument convention.
  - Now correctly parses the new unified-closure syntax, including
    `raises {captures}` effect ordering, and no longer inserts a spurious
    space between `^` and the operand in `var^` captures.
- Comptime function calls now print more nicely in error messages and
  generated documentation, omitting `VariadicList`/`VariadicPack` and
  including keyword argument labels when required.
### Removed
- The `escaping` function effect is no longer supported. Migrate
  `def(...) escaping -> T` closures to use an explicit capture list `{...}`
  (see Language enhancements).
- Several constructs deprecated in v0.26.2 are no longer accepted:

  - The `@register_passable` and `@register_passable("trivial")` decorators
    are no longer supported. Conform to the `RegisterPassable` and
    `TrivialRegisterPassable` traits instead. Use of either decorator now
    produces a hard error pointing to the trait equivalent.
  - The legacy `__moveinit__()` and `__copyinit__()` method names are no
    longer auto-rewritten to the unified `__init__()` form. Rename these
    methods to `__init__()` with keyword-only `take: Self` and `copy: Self`
    arguments, respectively, as introduced by init unification in v0.26.2.
    Existing legacy spellings now fail to compile with errors such as
    `no matching function in initialization` rather than being silently
    rewritten.
- The deprecated `@doc_private` decorator has been removed. Use `@doc_hidden`
  instead.
- Removed the `store_release()`, `store_relaxed()`, `load_acquire()`, and
  `load_relaxed()` helpers from `std.gpu.intrinsics`. Use
  `Atomic[dtype, scope=...].store()` and `Atomic[dtype, scope=...].load()`
  with the desired `Ordering` instead:

  ```mojo
  # Before
  from std.gpu.intrinsics import store_release, load_acquire

  store_release[scope=Scope.GPU](ptr, value)
  var v = load_acquire[scope=Scope.GPU](ptr)

  # After
  from std.atomic import Atomic, Ordering

  Atomic[dtype, scope="device"].store[ordering=Ordering.RELEASE](ptr, value)
  var v = Atomic[dtype, scope="device"].load[ordering=Ordering.ACQUIRE](ptr)
  ```
- API removals:

  - Removed the `param_env.mojo` module. Use `defines.mojo` instead.
  - Removed `LinkedList.__getitem__()`. Indexing a `LinkedList` is O(n), and
    exposing `__getitem__()` encouraged accidentally quadratic code; iterate
    the list instead.
  - Removed the unused `UIntSized` trait and its prelude re-export.
  - Removed the `pdl_level` parameter from `elementwise()`, `reduction()`,
    and `reducescatter()` kernel APIs. PDL usage is now an internal
    compile-time default.
### Fixed
- Fixed `math.sqrt()` on `Float64` on NVIDIA GPU producing a cryptic
  `could not find LLVM intrinsic: "llvm.nvvm.sqrt.approx.d"` failure at LLVM
  IR translation time. `math.sqrt()` now rejects `Float64` on NVIDIA GPU at
  compile time with the message
  `DType.float64 isn't supported for approx sqrt on NVIDIA GPU`. The
  `math.sin()` and `math.cos()` constraint messages were similarly sharpened
  to name the op. (Issue #6434)
- Fixed pack inference failing with
  `could not infer type of parameter pack ... given value with unresolved type`
  when passing list, dict, set, or slice literals to a `*Ts`-bound variadic
  pack parameter (for example, `def foo[*Ts: Iterable](*args: *Ts)`). Pack
  inference now applies the same default-type fallback that single-argument
  trait-bound parameters already use, so `foo([1, 2, 3], [4, 5, 6])` resolves
  each literal to its default type (for example, `List[Int]`) before binding
  the pack.
- Fixed `mojo` aborting at startup with `std::filesystem::filesystem_error`
  when `$HOME` is not traversable by the running UID (common in containerized
  CI where the image's build-time UID differs from the runtime UID). The
  config search now treats permission errors as "not found" and falls through
  to the next candidate. (Issue #6412)
- `mojo run` and `mojo debug` now honor `-Xlinker` flags by loading the
  referenced shared libraries into the in-process JIT. Previously the flags
  were dropped (with a `-Xlinker argument unused` warning), leaving programs
  that called into external shared libraries via `external_call()` unable to
  resolve those symbols at runtime. Supported `-Xlinker` forms mirror what the
  system linker accepts; flags with no JIT meaning are reported as a warning
  and ignored. (Issue #6155)
- Fixed `libpython` auto-discovery failing for Python 3.14 free-threaded
  builds. The discovery script constructed the library filename without the
  ABI flags suffix (for example, it looked for `libpython3.14.dylib` instead
  of `libpython3.14t.dylib`). (Issue #6366)
- Fixed `RTLD.LOCAL` having the wrong value on Linux. It was set to `4`
  (`RTLD_NOLOAD`) instead of `0`, causing `dlopen()` with
  `RTLD.NOW | RTLD.LOCAL` to fail. (Issue #6410)
- Fixed `mojo format` crashing after upgrading Mojo versions due to a stale
  grammar cache. (Issue #6144)
- Fixed `atof()` producing incorrect results for floats near the
  normal/subnormal boundary (for example,
  `Float64("4.4501363245856945e-308")` returned half the correct value).
  (Issue #6196)
- Fixed a compiler crash (`'get_type_name' requires a concrete type`) when
  using default `Writable`, `Equatable`, or `Hashable` implementations on
  structs with MLIR-type fields (for example, `__mlir_type.index`). The
  compiler now correctly reports that the field does not implement the
  required trait. (Issue #5872)
- Fixed `Atomic.store()` silently dropping the requested `scope`. The
  previous implementation lowered to `atomicrmw xchg` without forwarding
  `syncscope`, so `Atomic[..., scope="device"].store(...)` was emitting a
  system-scope store on NVPTX (extra L2/NVLink fences) and an
  over-synchronized store on AMDGPU. `Atomic.store()` now lowers via
  `pop.store atomic syncscope(...)`, emitting `st.release.<scope>` on NVPTX
  and a properly-scoped LLVM atomic store on AMDGPU. The Mojo API surface is
  unchanged.
- Fixed `Process.run()` not inheriting the parent's environment variables.
  Child processes spawned via `Process.run()` now correctly receive the
  parent's environment.
- Fixed `\xhh` and `\ooo` escape sequences in string literals being
  interpreted as raw bytes instead of Unicode code points, which produced
  malformed UTF-8 for values `>= 0x80`. The escapes now match Python `str`
  semantics (and the existing `\u`/`\U` handling): `"\x85"` encodes U+0085
  (NEL) as two UTF-8 bytes and `ord("\x85")` returns `133` instead of `5`.
  Code that relied on `\xhh` to emit a single raw byte for non-ASCII values
  must construct the bytes explicitly (for example, via a `List[Byte]`
  literal). (Issue #2842)
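Since the new behavior is defined to match Python `str` semantics, the reference behavior can be checked directly in Python:

```python
# Python reference semantics that Mojo string literals now follow:
# \xhh denotes a Unicode code point, not a raw byte.
s = "\x85"                   # U+0085 (NEL)
print(ord(s))                # 133
print(s.encode("utf-8"))     # b'\xc2\x85' -- the code point is two UTF-8 bytes

# To get a single raw 0x85 byte, construct bytes explicitly instead:
print(bytes([0x85]))         # b'\x85'
```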
- Fixed incorrect data layout for MI250X AMDGPU architectures. (Issue #6451)
- Fixed several Apple-target issues: macOS 26 target detection no longer
  produces unrecognized arch strings like `metal:2-metal4` when the installed
  Xcode cannot compile Metal 4.0 (the `-metal4` suffix is now applied only
  when the toolchain supports it); `UnsafePointer.gather()`,
  `UnsafePointer.scatter()`, and `strided_load()` no longer silently read
  zero on Apple GPUs (the per-lane fallback now uses typed pointer
  arithmetic; NVIDIA, AMD, and CPU paths are unchanged); and `rotate_left()`
  and `rotate_right()` intrinsics now lower correctly to the Apple AIR
  backend.
- Fixed several `TileTensor` issues: `write_to()` now correctly handles 1D,
  3D+, nested-layout, and dynamic-shape tensors via a generic elementwise
  fallback (with bracket-delimited, comma-separated formatting at all ranks);
  incorrect alignment in `__getitem__()` is fixed; default alignment in
  `load()` and `store()` now uses the caller-specified `width` parameter
  instead of `Self.element_size`; and SIMD loads/stores on CPU now use
  `alignment=1` to prevent segfaults on naturally-unaligned data (GPU still
  uses aligned access where the layout guarantees alignment).
- Fixed several `tile_layout` issues: `blocked_product()` now zips block and
  tiler dimensions per mode (matching the legacy `blocked_product()`
  behavior); `complement()` now propagates `UNKNOWN_VALUE` instead of
  returning a static shape of `0`, so downstream layout algebra falls back to
  runtime dimensions and bounds checks for
  `LayoutTensor.flatten().vectorize[N]()` are correct; and `idx2crd()` now
  returns correct coordinates for nested layouts.
- Fixed `mojo --version` printing the MAX version instead of the Mojo
  compiler version.
- Fixed comptime `and`/`or` expressions to accept any `Boolable` operands,
  matching runtime behavior. This also enables mixed-type expressions like
  `comptime if some_Bool and some_Optional`.
- Fixed several codegen correctness issues affecting valid Mojo programs: an
  SRoA miscompile that incorrectly promoted arrays accessed via dynamic
  offsets through a constant GEP; a use-after-free where destructors of live
  owned values were inserted before, rather than after, a `lit.ref.store`
  into a ref with `#lit.any.origin`; silent memory corruption when calling
  `abi("C")` functions that returned structs via `sret`; and bogus
  `existing function with conflicting attributes` errors when calling the
  same external function more than once with an `sret`/`byval` ABI.
- Fixed several `mojo-lsp-server` crashes affecting REPL/notebook contexts,
  parameter-pack-related diagnostics, files importing from `.mojopkg`, and
  files using stateless closures. The LSP also no longer mistakes REPL buffer
  identifiers (which contain a `.mojo` extension) for relative module
  imports.
- Fixed several debugger display issues: variables after their ASAP
  destruction point at `-O0` now correctly show "not available" instead of
  stale values; unsigned integers (`UInt`, `UInt8`, etc.) display with
  correct unsigned semantics; `ref` loop variables show `index` instead of
  `pointer<index>`; and `String` fields typed as `Scalar[T]` and `Tuple`
  values display correctly.
- Fixed two `mojo format` (mblack) issues: it no longer loses the `t` prefix
  when splitting long t-string literals across lines, and no longer inserts a
  stray space between `*` and a complex operand in variadic pack unpacking
  annotations.
- Fixed `BitSet.set_all()` and `BitSet.toggle_all()` writing `~0` to every
  underlying 64-bit word, including bits beyond the logical `size` when
  `size` was not a multiple of 64. Those stray high bits were counted by
  `__len__()`, producing incorrect population counts; the methods now mask
  off the unused high bits.
- Fixed `syncwarp()` on AMD GPUs, which was previously implemented as a
  no-op. It now lowers to `llvm.amdgcn.wave.barrier`, providing the
  control-flow synchronization required to correctly sequence shared-memory
  writes followed by reads across lanes.
- Fixed `isnan()`, `isinf()`, and `isfinite()` failing during LLVM lowering
  for `float8_e3m4` and `float4_e2m1fn`. `float4_e2m1fn` (no NaN/Inf
  encodings) folds to constant branches; `float8_e3m4` casts through
  `bfloat16` to reuse the existing `llvm.is.fpclass` path.
### Special thanks
Special thanks to our community contributors:
aweifh1-29gh20 (@isaacuselman), Ben Wibking (@BenWibking), Bernhard Merkle (@bmerkle), Brian Grenier (@bgreni), Byungchul Chae (@byungchul-sqzb), c-pozzi (@c-pozzi), Christoph Schlumpf (@christoph-schlumpf), Evan Owen (@ulmentflam), Frost Ming (@frostming), jglee-sqbits (@jglee-sqbits), Manuel Saelices (@msaelices), martinvuyk (@martinvuyk), minkyu (@kkimmk), Parsa Bahraminejad (@prsabahrami), pei0033 (@pei0033), Pierre Gordon (@pierlon), soraros (@soraros), Turcik (@alexturcea)