Which AI Sees Like Us? Investigating the Cognitive Plausibility of Language and Vision Models via Eye-Tracking in Human-Robot Interaction

Ghamati Ghamsari, Khashayar, Banitalebi Dehkordi, Maryam and Zaraki, Abolfazl (2025) Which AI Sees Like Us? Investigating the Cognitive Plausibility of Language and Vision Models via Eye-Tracking in Human-Robot Interaction. Sensors, 25 (15): 4687. ISSN 1424-8220


As large language models (LLMs) and vision–language models (VLMs) become increasingly used in robotics, a crucial question arises: to what extent do these models replicate human-like cognitive processes, particularly within socially interactive contexts? Whilst these models demonstrate impressive multimodal reasoning and perception capabilities, their cognitive plausibility remains underexplored. In this study, we address this gap by using human visual attention as a behavioural proxy for cognition in a naturalistic human-robot interaction (HRI) scenario. Eye-tracking data were previously collected from participants engaging in social human-human interactions, providing frame-level gaze fixations as a human attentional ground truth. We then prompted a state-of-the-art VLM (LLaVA) to generate scene descriptions, which were processed by four LLMs (DeepSeek-R1-Distill-Qwen-7B, Qwen1.5-7B-Chat, LLaMA-3.1-8b-instruct, and Gemma-7b-it) to infer saliency points. Critically, we evaluated each model in both stateless and memory-augmented (short-term memory, STM) modes to assess the influence of temporal context on saliency prediction. Our results show that whilst stateless LLaVA most closely replicates human gaze patterns, STM confers measurable benefits only for DeepSeek, whose lexical anchoring mirrors human rehearsal mechanisms. Other models exhibited degraded performance with memory due to prompt interference or limited contextual integration. This work introduces a novel, empirically grounded framework for assessing cognitive plausibility in generative models and underscores the role of short-term memory in shaping human-like visual attention in robotic systems.
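
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state which distance metric, memory size, or prompt wording the authors used, so everything below is an assumption made for illustration: the mean Euclidean distance in pixels, the names `mean_gaze_distance` and `ShortTermMemory`, the `capacity` parameter, and the "Previous frame"/"Current frame" prompt format are all hypothetical. Setting `capacity=0` stands in for the stateless mode.

```python
from collections import deque
from math import hypot


def mean_gaze_distance(predicted, fixations):
    """Mean Euclidean distance (in pixels) between per-frame predicted
    saliency points and human gaze fixations.

    Assumed metric for illustration only; the abstract does not name one.
    """
    if not predicted or len(predicted) != len(fixations):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(
        hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, fixations)
    )
    return total / len(predicted)


class ShortTermMemory:
    """Hypothetical rolling buffer of recent scene descriptions.

    capacity=0 keeps nothing, which models the stateless condition.
    """

    def __init__(self, capacity=3):
        self._items = deque(maxlen=capacity)

    def add(self, description):
        # The oldest description is dropped once capacity is reached.
        self._items.append(description)

    def build_prompt(self, current):
        # Prepend remembered descriptions to the current one.
        lines = [f"Previous frame: {d}" for d in self._items]
        lines.append(f"Current frame: {current}")
        return "\n".join(lines)


if __name__ == "__main__":
    stm = ShortTermMemory(capacity=2)
    for text in ["a person waves", "a person points at a cup"]:
        stm.add(text)
    print(stm.build_prompt("a person picks up the cup"))
    print(mean_gaze_distance([(0, 0), (3, 4)], [(0, 0), (0, 0)]))
```

Under these assumptions, a lower mean distance for a given model and memory mode would indicate closer agreement with the human gaze ground truth.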

Item Type	Article
Identification Number	10.3390/s25154687
Additional information	© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Date Deposited	30 Jul 2025 10:12
Last Modified	25 Feb 2026 01:06

File	sensors-25-04687_2_.pdf (Published Version), available under Creative Commons: BY 4.0
