SmartSpectrum releases GLM-5.1 High-Speed API, setting a new global speed record of 400 tokens per second

According to monitoring by Watchful Beating, SmartSpectrum has introduced the GLM-5.1 High-Speed API for select enterprise clients. The model's inference speed reaches 400 tokens/s, which sets a new end-to-end speed record among official APIs for large models worldwide.
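To put the headline figure in perspective, the arithmetic below converts a decoding rate into per-token latency and total generation time. It is a back-of-the-envelope sketch: the function names and the 1,000-token response length are illustrative assumptions, and the calculation ignores time to first token and network overhead.

```python
def per_token_latency_ms(tokens_per_second):
    """Average time spent on each generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_time_s(num_tokens, tokens_per_second):
    """Time to stream num_tokens at a steady decoding rate, in seconds."""
    return num_tokens / tokens_per_second


# 400 tokens/s is the rate reported in the article.
latency = per_token_latency_ms(400)      # 2.5 ms per token
# 1000 tokens is a hypothetical response length chosen for illustration.
total = generation_time_s(1000, 400)     # 2.5 s for the whole response
print(latency, total)
```

At the reported rate, each token takes 2.5 ms on average, so a 1,000-token answer would stream in about 2.5 seconds under these assumptions.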

This high-speed version fully retains the capabilities of the original flagship model and is powered by a high-performance inference engine jointly developed by SmartSpectrum and TileRT. The engine fundamentally restructures the GPU's execution scheduling: at runtime it compiles the model into a persistent Engine Kernel that stays resident on the GPU. During single-card inference, computation, asynchronous I/O, and communication are all decomposed into tile-level microtasks, and the kernel is launched only once. Intermediate results between operators are passed directly through registers and shared caches, eliminating the latency bubbles that frequent kernel launches and GPU memory reads and writes cause in traditional inference pipelines.
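The difference between per-operator launches and a single persistent kernel can be illustrated with a small CPU-side analogy. This is only a conceptual sketch, not TileRT's implementation: `run_per_operator`, `run_persistent`, and the toy operators are hypothetical names, a Python dict stands in for GPU memory, and a local variable stands in for registers and shared caches.

```python
from collections import deque


def run_per_operator(ops, x):
    """Conventional style: every operator is its own launch, and each
    intermediate result makes a round trip through 'global memory'."""
    launches = 0
    global_memory = {"buf": x}            # stand-in for GPU memory
    for op in ops:
        launches += 1                     # one launch per operator
        value = global_memory["buf"]      # read the intermediate back
        global_memory["buf"] = op(value)  # write the result out again
    return global_memory["buf"], launches


def run_persistent(ops, x):
    """Persistent style: a single launch, after which a resident loop
    drains a queue of microtasks and keeps the intermediate local."""
    launches = 1                          # launched once, then stays resident
    queue = deque(ops)                    # stand-in for the microtask queue
    local = x                             # stand-in for registers/shared cache
    while queue:
        task = queue.popleft()
        local = task(local)               # no round trip between tasks
    return local, launches


# Three toy operators standing in for the layers of a model.
ops = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_per_operator(ops, 5))
print(run_persistent(ops, 5))
```

Both paths compute the same result, but the launch count grows with the number of operators in the first case and stays at one in the second, which is the source of the latency savings the article describes.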

When scaled to multiple cards, TileRT further extends the specialization-parallelism concept to an entire 8-card NVL topology, turning the originally homogeneous GPU nodes into heterogeneous Workers that handle different tasks. When computing the attention layers of GLM-5.1, the system assigns GPU 0 to run the sparse index Worker, which is dedicated to sparse index construction and routing decisions, while GPUs 1 to 7 run the MLA Workers, which are responsible for the compute-intensive phase. Communication is fully pipelined at the tile level, achieving deep overlap between computation and inter-card communication.
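The role split and the tile-level pipelining described above can be sketched as follows. This is a simplified illustration under stated assumptions: `assign_roles` and `pipeline_schedule` are hypothetical helpers, the role labels are made up for readability, and the schedule assumes one compute step and one communication step per tile with equal durations.

```python
def assign_roles(num_gpus=8):
    """Map each GPU index to a specialized role, following the article:
    GPU 0 builds sparse indices, the remaining GPUs run MLA."""
    roles = {0: "sparse_index"}
    for gpu in range(1, num_gpus):
        roles[gpu] = "mla"
    return roles


def pipeline_schedule(num_tiles):
    """Return one (compute_tile, comm_tile) pair per time step.
    Sending tile i overlaps with computing tile i + 1, so the work
    finishes in num_tiles + 1 steps instead of 2 * num_tiles."""
    steps = []
    for t in range(num_tiles + 1):
        compute = t if t < num_tiles else None  # nothing left to compute
        comm = t - 1 if t >= 1 else None        # nothing to send yet
        steps.append((compute, comm))
    return steps


roles = assign_roles()
schedule = pipeline_schedule(3)
print(roles)
print(schedule)
```

With three tiles, the overlapped schedule takes four steps, whereas running each compute and communication step back to back would take six. Real kernels have unequal step durations, so the actual gain depends on how well the two phases are balanced.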

The high-speed service is currently open to select enterprise clients of the SmartSpectrum MaaS platform. Going forward, the technology will be further optimized for FP8 inference and ultra-long-context production environments, to provide more deterministic performance for latency-sensitive scenarios such as AI programming, real-time interaction, and real-time voice.
