This study examined modality-specific processing in impression formation by comparing environmental-sound and text representations of identical urban environments. A total of 796 participants evaluated five urban scenes across fifteen dimensions, including Russell's valence-arousal model and the ISO 12913 soundscape scales. A MANOVA with modality (environmental sound: n = 455; text: n = 314) and scene (five locations) as factors revealed significant main effects and interactions. Modality main effects were largest for valence (ηp² = .130) and pleasantness (ηp² = .167), while scene effects peaked for the chaotic dimension (ηp² = .162). Valence-arousal correlations showed modality dependence: environmental sounds (r = .335) versus text. Modality × scene interactions remained small, and modality effects persisted after controlling for imagery vividness and confidence. These findings indicate that multimodal AI systems require modality-specific architectures that account for differential dimensional independence and integration.