Use of language in face-to-face context is multimodal. Production and perception of speech take place in the context of visual articulators such as lips, face, or hand gestures which convey relevant information to what is expressed in speech at different levels of language. While lips convey information at the phonological level, gestures contribute to semantic, pragmatic, and syntactic information, as well as to discourse cohesion. This chapter overviews recent findings showing that speech and gesture (e.g. a drinking gesture as someone says, “Would you like a drink?”) interact during production and comprehension of language at the behavioral, cognitive, and neural levels. Implications of these findings for current psycholinguistic theories and how they can be expanded to consider the multimodal context of language processing are discussed.