While the multimodality of ELT textbooks has in recent years been increasingly studied for its pedagogic implications, textbooks, unlike other genres such as webpages, advertisements, picture books, and newspapers, have less often been viewed as semiotic artefacts, a perspective that leads to the study of how image and text interact across modes. To address this gap in the literature, this study explores visual and verbal interaction in ELT textbooks. In particular, the article focuses on the conversation section of an EFL senior high school textbook in Taiwan. The study found that the multimodal nature of face-to-face communication and the turn-taking mechanisms involved in conversations render the relations between images and texts unique in language learning materials. Consequently, frameworks of image-text relations developed from narratives (such as picture books) or information texts are not sufficient for understanding this particular genre of multimodal text. This article discusses the distinct ways in which visual and verbal modes interact in multimodal ELT conversation texts and provides a preliminary framework for future examinations of a broader range of language learning materials.