Hand gestures and speech form a single integrated system of meaning during language comprehension, but is gesture processed with speech in a unique fashion? We had subjects watch multimodal videos that presented auditory (words) and visual (gestures and actions on objects) information. Half of the subjects related the audio information to a written prime presented before the video, and the other half related the visual information to the written prime. For half of the multimodal video stimuli, the audio and visual information contents were congruent, and for the other half, they were incongruent. For all subjects, stimuli in which the gestures and actions were incongruent with the speech produced more errors and longer response times than did stimuli that were congruent, but this effect was less prominent for speech-action stimuli than for speech-gesture stimuli. However, subjects focusing on visual targets were more accurate when processing actions than gestures. These results suggest that although actions may be easier to process than gestures, gestures may be more tightly tied to the processing of accompanying speech.