SmartSpectrum releases GLM-5.1 High-Speed API, setting a new global speed record of 400 tokens per second

According to Watchful Beating monitoring, SmartSpectrum has launched the GLM-5.1 High-Speed API for select enterprise clients. The model's inference speed reaches 400 tokens/s, setting a new end-to-end speed record among the official APIs of large models worldwide.
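To put the 400 tokens/s figure in perspective, the short sketch below converts throughput into per-token latency and total generation time. The function names and the 2,000-token response length are illustrative assumptions, and the calculation ignores time to first token and network overhead, which the article does not report.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average time spent on each generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate.

    Assumption: constant throughput, with no time-to-first-token
    or network delay included.
    """
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    rate = 400.0  # tokens/s, the figure quoted in the article
    print(ms_per_token(rate))              # 2.5 ms per token
    print(generation_seconds(2000, rate))  # 5.0 s for a hypothetical 2,000-token reply
```

At this rate each token takes about 2.5 ms on average, so a hypothetical 2,000-token response would stream in roughly 5 seconds.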

The high-speed version fully retains the capabilities of the original flagship model and is powered by a high-performance inference engine jointly developed by SmartSpectrum and TileRT. The engine fundamentally restructures the GPU's execution scheduling mechanism: at runtime, the model is compiled into a persistent Engine Kernel that stays resident on the GPU. During single-card inference, computation, asynchronous I/O, and communication are all decomposed into tile-level microtasks, and the kernel is launched only once. Intermediate results between operators are passed directly through registers and shared caches, eliminating the latency bubbles that frequent kernel launches and GPU memory reads and writes cause in traditional inference pipelines.
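The effect of launching the kernel only once can be illustrated with a toy latency model. This is a minimal sketch, not TileRT's implementation: the function names and every timing value are hypothetical, chosen only to show where the launch and memory round-trip bubbles come from.

```python
def per_operator_latency_us(op_times_us, launch_overhead_us, memory_roundtrip_us):
    """Traditional path: every operator is a separate kernel launch, and
    each intermediate result is written to and read back from GPU memory."""
    num_ops = len(op_times_us)
    launches = num_ops * launch_overhead_us
    roundtrips = (num_ops - 1) * memory_roundtrip_us  # one per operator boundary
    return sum(op_times_us) + launches + roundtrips


def persistent_kernel_latency_us(op_times_us, launch_overhead_us):
    """Persistent path: one launch in total; intermediates are assumed to
    stay on chip, so no per-boundary memory round trip is charged."""
    return sum(op_times_us) + launch_overhead_us


if __name__ == "__main__":
    ops = [10, 10, 10, 10]  # hypothetical operator times in microseconds
    print(per_operator_latency_us(ops, launch_overhead_us=5, memory_roundtrip_us=2))  # 66
    print(persistent_kernel_latency_us(ops, launch_overhead_us=5))                    # 45
```

In this model the compute time is identical in both paths. The gap comes entirely from overhead that scales with the number of operators, which is why the saving matters most when each operator is short, as in token-by-token decoding.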

When scaled to multiple cards, TileRT extends this specialization-based parallelism across the entire 8-card NVL topology, turning the originally homogeneous GPU nodes into heterogeneous Workers that each handle a different task. When computing the attention layers of GLM-5.1, the system assigns GPU 0 to run the sparse index Worker, which is dedicated to sparse index construction and routing decisions, while GPUs 1 to 7 run the MLA Workers, which are responsible for the compute-intensive phase. Communication is fully pipelined at the tile level, achieving deep overlap between computation and inter-card communication.
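The role split and the benefit of tile-level pipelining can be sketched as follows. The `Role` enum, function names, and timing values are assumptions made for illustration; only the GPU 0 versus GPUs 1 to 7 assignment comes from the article. The overlap estimate uses a standard two-stage pipeline model rather than anything specific to TileRT.

```python
from enum import Enum


class Role(Enum):
    SPARSE_INDEX = "sparse_index"  # sparse index construction and routing
    MLA = "mla"                    # compute-intensive attention phase


def assign_roles(num_gpus: int = 8) -> dict:
    """GPU 0 runs the sparse index Worker; all other GPUs run MLA Workers."""
    if num_gpus < 2:
        raise ValueError("need at least one sparse index GPU and one MLA GPU")
    return {gpu: (Role.SPARSE_INDEX if gpu == 0 else Role.MLA)
            for gpu in range(num_gpus)}


def serial_time_us(num_tiles: int, compute_us: int, comm_us: int) -> int:
    """No overlap: each tile is computed, then sent, before the next starts."""
    return num_tiles * (compute_us + comm_us)


def pipelined_time_us(num_tiles: int, compute_us: int, comm_us: int) -> int:
    """Two-stage pipeline: a tile's transfer overlaps the next tile's compute,
    so the steady state is bounded by the slower of the two stages."""
    return compute_us + (num_tiles - 1) * max(compute_us, comm_us) + comm_us


if __name__ == "__main__":
    roles = assign_roles(8)
    print(roles[0].value, roles[7].value)  # sparse_index mla
    print(serial_time_us(4, 3, 2))         # 20
    print(pipelined_time_us(4, 3, 2))      # 14
```

With these hypothetical numbers, overlapping transfers with compute hides most of the communication cost: only the final tile's transfer remains exposed.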

The high-speed service is currently open to select enterprise clients of the SmartSpectrum MaaS platform. Going forward, the technology will be further optimized for FP8 inference and ultra-long-context production environments, providing more deterministic performance for latency-sensitive scenarios such as AI programming, real-time interaction, and real-time voice.
