Thursday April 9, 2026 9:30am - 11:30am GMT+07

Authors - Koutaro HACHIYA, Ioannis PATIAS
Abstract - Inference latency remains a critical bottleneck in deploying large language models in real-time and resource-constrained environments. Prior work has proposed formulations that express latency as a function of key parameters. However, these formulations often assume a linear dependence on sequence length, which fails to generalize to tasks involving significantly longer sequences, such as document-level language modeling, long-context retrieval, or time-series forecasting, where latency scales nonlinearly and unpredictably. This paper addresses the limitations of existing latency formulations by proposing three complementary enhancements to improve generalization across varying sequence lengths. First, we introduce a nonlinear term for sequence length, capturing the superlinear growth in latency observed in transformer-based architectures due to quadratic attention mechanisms and memory overhead. Second, we propose a sequence-length-dependent scaling factor for the sequence length parameter itself, allowing the model to adaptively adjust its sensitivity based on empirical latency profiles across different tasks and hardware configurations. Third, we incorporate an empirical correction term enabling calibration of the latency model to account for hardware-specific and implementation-level nuances. By explicitly modeling the nonlinear and context-sensitive behavior of sequence length, our approach offers a more faithful representation of latency dynamics. This work lays the foundation for more adaptive and hardware-aware latency estimation frameworks, with implications for model deployment, scheduling, and cost optimization in production systems. We conclude by discussing future directions for integrating dynamic profiling and reinforcement learning to further refine latency predictions in evolving runtime environments.
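The three enhancements described in the abstract can be sketched as a simple parametric model. The function below is an illustrative assumption only: all names, coefficients, and defaults are hypothetical, not the authors' actual formulation.

```python
# Hypothetical sketch of the enhanced latency model from the abstract.
# All parameter names and default values are illustrative assumptions.

def predicted_latency(seq_len, base=1.0, linear_coef=0.01,
                      nonlin_coef=1e-5, exponent=2.0,
                      scale_fn=None, correction_fn=None):
    """Estimate latency (ms) for a given sequence length.

    Combines a constant base cost, a linear term with a
    length-dependent scaling factor (enhancement 2), a nonlinear
    term capturing superlinear attention/memory growth
    (enhancement 1), and an empirical hardware/implementation
    correction (enhancement 3).
    """
    # Enhancement 2: adaptive sensitivity of the linear term
    scale = scale_fn(seq_len) if scale_fn else 1.0
    # Enhancement 3: calibrated correction for this hardware/runtime
    correction = correction_fn(seq_len) if correction_fn else 0.0
    return (base
            + linear_coef * scale * seq_len       # scaled linear term
            + nonlin_coef * seq_len ** exponent   # nonlinear growth
            + correction)
```

In practice, `scale_fn` and `correction_fn` would be fitted to empirical latency profiles measured on the target hardware.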
Paper Presenter
Virtual Room D Bangkok, Thailand

