This study examined modality-specific processing in impression formation by comparing environmental sounds and text representations of identical urban environments. A total of 769 participants evaluated five urban scenes on fifteen dimensions, including Russell’s valence-arousal model and the ISO 12913 soundscape scales. A MANOVA with modality and scene as factors revealed significant main effects and interactions. Modality main effects were largest for valence and pleasantness, while scene effects peaked for the chaotic dimension. Valence-arousal correlations showed modality dependence: environmental sounds yielded no significant correlation, whereas text showed a moderate to large positive correlation. Modality×scene interactions remained small, and modality effects persisted after controlling for imagery vividness and confidence. These findings indicate that multimodal AI systems require modality-specific architectures that account for differential dimensional independence and integration across modalities.