Authors - Yuuki Ario, Yuyu Araki, Hiroshi Sakamoto
Abstract - Crowd analysis has become a critical component of modern urban and smart surveillance systems, where effective monitoring of densely populated public areas is essential for resource management, emergency response, and public safety. YOLO-based models are widely used for detecting people and objects. In this study, we present a comprehensive evaluation of three state-of-the-art object detection architectures: YOLOv5, YOLOv8, and YOLOv11. We implemented YOLO models that detect groups of people as integrated entities, which enables crowd classification by group size: individuals, small groups, and large crowds. The evaluation was conducted on four diverse benchmark datasets: VSCrowd, Crowd Mall, Crowd11, and NWPU-Crowd, with all images annotated using LabelImg. Each model was trained and tested under consistent conditions. Experimental results show that on the VSCrowd dataset, YOLOv5s achieved an mAP@0.5 of 0.454, while YOLOv5l improved this slightly to 0.459. YOLOv8m performed better, with an mAP@0.5 of 0.530. On the Crowd Dataset, YOLOv5m achieved an mAP@0.5 of 0.300, YOLOv8m obtained 0.306, and YOLOv11m achieved 0.302. These results indicate that newer YOLO architectures provide enhanced detection capabilities in highly crowded scenes, exhibiting better generalization, robustness, and adaptability for real-world crowd analysis applications.
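The figures above are reported at an IoU threshold of 0.5, so a predicted box counts as correct only if it overlaps a ground-truth box by at least half of their combined area. The following is a minimal sketch of that matching criterion, not the evaluation code used in the study. The function names `iou` and `count_true_positives`, the `(x1, y1, x2, y2)` box layout, and the greedy confidence-ordered matching are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(predictions, ground_truths, iou_threshold=0.5):
    """Count predictions that match a distinct ground-truth box.

    predictions: list of (confidence, box) tuples (hypothetical layout).
    ground_truths: list of boxes for the same image and class.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    # Higher-confidence predictions get first pick of ground-truth boxes.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives
```

Average precision for a class is then obtained from the precision-recall curve built over such matches, and mAP@0.5 is the mean of that value across classes (here, presumably the group-size categories).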