Dynamic Bayesian Networks for Multimodal Interaction
Dynamic Bayesian networks (DBNs) offer a natural upgrade path beyond classical hidden Markov models and become especially relevant when temporal data contains higher order structure, multiple modalities or multi-person interaction. We describe several instantiations of dynamic Bayesian networks that are useful for modeling temporal phenomena spanning audio, video and haptic channels in single, two-person and multi-person activity. These models include input-output hidden Markov models, switched Kalman filters and, most generally, dynamical systems trees (DSTs). These models are used to learn audio-video interaction in social activities, video interaction in multi-person game playing and haptic-video interaction in robotic laparoscopy. Model parameters are estimated from data in an unsupervised setting using generalized expectation maximization methods. Subsequently, these models can predict, synthesize and classify various types of rich multimodal human activity. Experiments in gesture interaction, audio-video conversation, football game playing and surgical drill evaluation are shown.
Author: Tony Jebara, Columbia University