Smart Traffic Orchestration with Vision-Language Fusion Models
Author(s)
Pages: 19-42 | DOI: 10.5281/zenodo.17661643
Volume 14 - November 2025 (11)
Abstract
Traffic congestion is a significant problem in urban environments, increasing economic losses, environmental pollution, and travel times. Traditional traffic management systems rely on pre-programmed signal timings or isolated computer vision models for vehicle detection. This paper proposes a vision-language fusion model (VLFM) for smart traffic orchestration that integrates real-time visual traffic data with dynamic, adaptive traffic signal control. The proposed system leverages multimodal deep learning to interpret complex traffic patterns and predict congestion, and uses reinforcement learning to adapt signal timings. Experimental results show lower waiting times, higher throughput, and adaptive responses to traffic incidents, outperforming traditional traffic management approaches.
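To make the control loop sketched above concrete, the snippet below gives a minimal illustration, not the paper's implementation: a visual traffic-state embedding and a language/context embedding are concatenated, passed through a small fusion network, and an epsilon-greedy reinforcement-learning policy selects the next signal phase. All class names, dimensions, and parameters are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's implementation) of a vision-language-fused
# signal controller: visual and language/context embeddings are fused and a
# DQN-style policy picks the next signal phase. All names are hypothetical.
import torch
import torch.nn as nn


class VisionLanguageFusionPolicy(nn.Module):
    """Fuses a visual traffic-state embedding with a text/context embedding
    and maps the result to Q-values over candidate signal phases."""

    def __init__(self, vis_dim=512, txt_dim=512, hidden=256, n_phases=4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden, n_phases)  # one Q-value per phase

    def forward(self, vis_emb, txt_emb):
        x = torch.cat([vis_emb, txt_emb], dim=-1)
        return self.q_head(self.fuse(x))


def select_phase(policy, vis_emb, txt_emb, epsilon=0.05):
    """Epsilon-greedy phase selection, as in a standard DQN-style controller."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(0, policy.q_head.out_features, (1,)).item()
    with torch.no_grad():
        return policy(vis_emb, txt_emb).argmax(dim=-1).item()


# Example with random stand-ins for the camera and language encoders.
policy = VisionLanguageFusionPolicy()
vis_emb = torch.randn(1, 512)  # would come from a vision encoder (e.g. a CLIP-style image tower)
txt_emb = torch.randn(1, 512)  # would come from a text encoder describing incidents/context
print(f"next signal phase: {select_phase(policy, vis_emb, txt_emb)}")
```

In a full controller of this kind, the Q-values would be trained against a reward such as the negative cumulative waiting time at the intersection, consistent with the objectives stated above.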
Identifying traffic accidents is an essential part of any autonomous driving or road monitoring system. Accidents appear in many different forms, and it can be useful to predict which type of accident is happening. Classifying traffic scenes by accident type is the focus of this work. We approach the problem by converting a traffic scene into a graph in which objects such as cars are represented as nodes, and the relative distances and directions between them as edges. We refer to this representation as a visual graph, and it is used as input to a classifier. Better results are achieved with a classifier that fuses the visual graph input with representations from the vision and language modalities. This work introduces a multi-step, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as visual graphs, and aligns this representation with vision and language modalities for accident classification. When trained on four classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, an increase of close to 5 percentage points over the case where visual graph information is not taken into account.
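As a rough illustration of the visual-graph representation described above, the following sketch (assuming simple 2D object-center detections; all dataclass and function names are hypothetical) builds a graph in which each detected object is a node and each pair of objects is connected by an edge carrying their relative distance and direction.

```python
# Minimal sketch of visual-graph construction from object detections:
# detected objects become nodes; pairwise relative distance and direction
# become edge attributes. Names and coordinate conventions are illustrative.
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class ObjectNode:
    obj_id: int
    label: str          # e.g. "car", "pedestrian"
    x: float            # object-center coordinates in the image or BEV plane
    y: float


@dataclass
class RelationEdge:
    src: int
    dst: int
    distance: float     # Euclidean distance between object centers
    bearing_deg: float  # direction from src to dst, in degrees


def build_visual_graph(nodes: list[ObjectNode]) -> list[RelationEdge]:
    """Connects every pair of detected objects with a distance/direction edge."""
    edges = []
    for a, b in combinations(nodes, 2):
        dx, dy = b.x - a.x, b.y - a.y
        edges.append(RelationEdge(
            src=a.obj_id,
            dst=b.obj_id,
            distance=math.hypot(dx, dy),
            bearing_deg=math.degrees(math.atan2(dy, dx)),
        ))
    return edges


# Toy frame with two vehicles; in the pipeline these would come from a detector.
frame = [ObjectNode(0, "car", 12.0, 3.5), ObjectNode(1, "truck", 20.0, 7.0)]
for e in build_visual_graph(frame):
    print(e)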
Keywords
Smart Traffic, Vision-Language Models, Multimodal AI, Traffic Orchestration, Neuro-symbolism, Foundation Models, Autonomous Driving.
References
- Dimasi, P. E. I. Scene Graph Generation in Autonomous Driving: A Neuro-Symbolic Approach. Master’s thesis, Politecnico di Torino, 2023. [Online]. Available: http://webthesis.biblio.polito.it/id/eprint/29354
- Cong, Y., Yang, M. Y., & Rosenhahn, B. RelTR: Relation transformer for scene graph generation. CoRR, vol. abs/2201.11460, 2022. [Online]. Available: https://arxiv.org/abs/2201.11460
- Xu, L., Huang, H., & Liu, J. SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events. 2021.
- Yao, Y. et al. DoTA: Unsupervised detection of traffic anomaly in driving videos. IEEE Trans. Pattern Anal. Mach. Intell., 2022.
- Qasemi, E., Francis, J. M., & Oltramari, A. Traffic-domain video question answering with automatic captioning. arXiv preprint arXiv:2307.09636, 2023.
- Francis, J., Chen, B., Yao, W., Nyberg, E., & Oh, J. Distribution-aware goal prediction and conformant model-based planning for safe autonomous driving. arXiv preprint arXiv:2212.08729, 2022.
- Malawade, A. V. et al. roadscene2vec: A tool for extracting and embedding road scene-graphs. CoRR, vol. abs/2109.01183, 2021. [Online]. Available: https://arxiv.org/abs/2109.01183
- Malawade, A. V. et al. Spatiotemporal scene-graph embedding for autonomous vehicle collision prediction. IEEE Internet Things J., vol. 9, no. 12, pp. 9379–9388, 2022. doi: 10.1109/JIOT.2022.3141044.
- Radford, A. et al. Learning transferable visual models from natural language supervision. CoRR, vol. abs/2103.00020, 2021. [Online]. Available: https://arxiv.org/abs/2103.00020
- Ni, B. et al. Expanding language-image pretrained models for general video recognition. 2022.
- Yu, S.-Y., Malawade, A. V., Muthirayan, D., Khargonekar, P. P., & Faruque, M. A. A. Scene-graph augmented data-driven risk assessment of autonomous vehicle decisions. 2020.
- Zipfl, M. & Zöllner, J. M. Towards traffic scene description: The semantic scene graph. CoRR, vol. abs/2111.10196, 2021. [Online]. Available: https://arxiv.org/abs/2111.10196
- Guo, Y. et al. Visual traffic knowledge graph generation from scene images. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 21547–21556. doi: 10.1109/ICCV51070.2023.01975. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICCV51070.2023.01975
- Li, L., Gan, Z., Cheng, Y., & Liu, J. Relation-aware graph attention network for visual question answering. 2019. [Online]. Available: https://arxiv.org/abs/1903.12314
- Khademi, M. & Schulte, O. Deep generative probabilistic graph neural networks for scene graph generation. In Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7, pp. 11237–11245, Apr. 2020. doi: 10.1609/aaai.v34i07.6783. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/6783
- Rana, K. et al. SayPlan: Grounding large language models using 3D scene graphs for scalable robot task planning. 2023. [Online]. Available: https://arxiv.org/abs/2307.06135
- Zhang, C., Chao, W.-L., & Xuan, D. An empirical study on leveraging scene graphs for visual question answering. 2019. [Online]. Available: https://arxiv.org/abs/1907.12133
- Nag, S., Min, K., Tripathi, S., & Chowdhury, A. K. R. Unbiased scene graph generation in videos. 2023.
- Ji, J., Krishna, R., Fei-Fei, L., & Niebles, J. C. Action genome: Actions as composition of spatio-temporal scene graphs. CoRR, vol. abs/1912.06992, 2019. [Online]. Available: http://arxiv.org/abs/1912.06992
- Sleeman, W. C., Kapoor, R., & Ghosh, P. Multimodal classification: Current landscape, taxonomy and future directions. ACM Comput. Surv., vol. 55, no. 7, Dec. 2022. doi: 10.1145/3543848. [Online]. Available: https://doi.org/10.1145/3543848
- Wu, H.-H., Seetharaman, P., Kumar, K., & Bello, J. P. Wav2CLIP: Learning robust audio representations from CLIP. 2022.
- Tatiya, G., Francis, J., Wu, H.-H., Bisk, Y., & Sinapov, J. MOSAIC: Learning unified multi-sensory object property representations for robot learning via interactive perception. 2024.
- Koch, S., Hermosilla, P., Vaskevicius, N., Colosi, M., & Ropinski, T. Lang3DSG: Language-based contrastive pre-training for 3D scene graph prediction. 2023.
- Huang, Y. et al. Structure-CLIP: Towards scene graph knowledge to enhance multi-modal structured representations. 2023.
- Pawłowski, M., Wróblewska, A., & Sysko-Romańczuk, S. Effective techniques for multimodal data fusion: A comparative analysis. Sensors, vol. 23, no. 5, 2023. doi: 10.3390/s23052381. [Online]. Available: https://www.mdpi.com/1424-8220/23/5/2381
- Gao, J., Li, P., Chen, Z., & Zhang, J. A survey on deep learning for multimodal data fusion. Neural Computation, vol. 32, no. 5, pp. 829–864, May 2020. doi: 10.1162/neco_a_01273. [Online]. Available: https://doi.org/10.1162/neco_a_01273
- Kaliciak, L., Myrhaug, H., Goker, A., & Song, D. On the duality of specific early and late fusion strategies. In Proc. 17th Int. Conf. Inf. Fusion (FUSION), 2014, pp. 1–8.
- Kiela, D., Grave, E., Joulin, A., & Mikolov, T. Efficient large-scale multi-modal classification. 2018.
- Wang, Y. et al. Symmetric cross entropy for robust learning with noisy labels. 2019.
- Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. CARLA: An open urban driving simulator. 2017.
- Zhang, W., Deng, L., Zhang, L., & Wu, D. A survey on negative transfer. IEEE/CAA J. Autom. Sinica, vol. 10, no. 2, pp. 305–329, 2023. doi: 10.1109/JAS.2022.106004.
- Chen, T., Kornblith, S., Norouzi, M., et al. A Simple Framework for Contrastive Learning of Visual Representations. In Proc. ICML, 2020. [Online]. Available: https://proceedings.mlr.press/v119/chen20j.html
- Hunt, J., Robertson, D., Bretherton, R., et al. SCOOT - A Traffic Responsive Method of Coordinating Signals. Transport and Road Research Laboratory, 1981.
- Vaswani, A., Shazeer, N., Parmar, N., et al. Attention is All You Need. In Proc. NeurIPS, 2017, pp. 5998–6008. [Online]. Available: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proc. ICLR, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
- Chen, X., Liu, Y., & Zhang, L. TrafficVLM: Real-Time Vision-Language Fusion for Urban Congestion Prediction. IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 1, pp. 112–125, Jan. 2025.
- Wang, Q., Singh, A., & Li, H. Multimodal Reinforcement Learning for Adaptive Traffic Signal Control under Uncertain Incidents. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, pp. 10234–10242, 2024.
- Rao, K., Kim, J., & Patel, V. EdgeDeploy-VLM: On-Device Vision-Language Models for Low-Latency Traffic Orchestration. In Proceedings of the ACM/IEEE Symposium on Edge Computing, pp. 145–158, 2024.
- Zhang, Y. et al. SocialSense: Fusing Social Media Text and CCTV Feeds for Proactive Traffic Management. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 887–896, 2023.