Yohei Kiyono, Ryosuke Yamanishi: High-context intention within one-word speech, The 2024 Technologies and Applications of Artificial Intelligence Conference, 2024

In human communication, how we say things is just as important as what we say. This “how” is called paralanguage: the pitch of our voice, how we stress words, how fast we speak, and the rhythm of our speech. These elements help us understand what someone means, even if they say only one word, such as “ha.” In our study, we focus on how people use and understand the word “ha” in different situations as an extreme challenge for paralanguage recognition. We used a game to collect data on how people say “ha” in eight different contexts. When people could see and hear others saying “ha,” they guessed the correct context about 63% of the time. We then used a machine learning model to analyze the acoustic features of these “ha” utterances and estimate the intended context. The model estimated the correct context with an average F1-score of about 0.58. This showed that even a machine learning model, working from sound alone, could understand some of the meaning behind how we say “ha” at almost the same level as humans who could also watch the speakers’ faces and movements.
We found some interesting differences by comparing how well humans and a machine learning model could estimate the context. The findings have helped us understand more about how we communicate with just a single word.
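
As a rough illustration of what an "average F1-score" over several contexts can mean, the sketch below computes a macro-averaged F1 (the unweighted mean of the per-context F1 scores). This is an assumption: the abstract does not say which averaging scheme was used. The function name `macro_f1`, the context labels, and the toy predictions are all hypothetical and are not data from the study.

```python
from collections import Counter


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1   # correct guess for this context
        else:
            fp[guess] += 1   # guessed context was wrong
            fn[truth] += 1   # true context was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical contexts (not the study's data).
truth = ["context_a", "context_a", "context_b", "context_b"]
guess = ["context_a", "context_b", "context_b", "context_b"]
score = macro_f1(truth, guess)
print(round(score, 3))
```

With eight equally likely contexts, guessing at random would be correct about one time in eight (12.5%), which gives a sense of scale for the figures reported above.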