The underlying units of visual representation often transcend lower-level properties, for example when we see objects in terms of a small number of generic stimulus types (e.g. animals, plants, faces). There has been much less attention, however, to the possibility that we also represent dynamic information in terms of a small number of primitive *event types*, such as twisting, rotating, bouncing, and rolling. (In models that posit a “language of vision”, these would be the foundational visual *verbs*.) We explored the possibility that such “event type” representations are formed quickly and spontaneously during visual perception, even when they are entirely task-irrelevant. We did so by exploiting the phenomenon of *categorical perception*, wherein the differences between two stimuli are more readily noticed when they are represented in terms of different underlying categories. Observers simply viewed pairs of images or animations (presented very briefly, one at a time) and reported for each pair whether they were the same or different in any way. Cross-Type changes involved switches in the underlying event type (e.g. a towel being *twisted* in someone’s hands, replaced by a towel being *rotated* in someone’s hands), while Within-Type changes maintained the same event type (e.g. a towel being more or less twisted in someone’s hands). Critically, this distinction was always task-irrelevant, and Within-Type changes were always objectively greater in magnitude than Cross-Type changes. Nevertheless, Cross-Type changes were much more readily noticed. Additional controls confirmed that such effects could not be explained by appeal to lower-level stimulus differences (such as the different hand positions involved in twisting vs. rotating). This spontaneous perception of a potentially continuous range of stimuli in terms of a smaller set of primitive “visual verbs” might promote both generalization and prediction about how events are likely to unfold.