Article

COSM2IC: Optimizing Real-Time Multi-Modal Instruction Comprehension

Journal

IEEE ROBOTICS AND AUTOMATION LETTERS
Volume 7, Issue 4, Pages 10697-10704

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/LRA.2022.3194683

Keywords

Data sets for robotic vision; deep learning for visual perception; embedded systems for robotics and automation; human-robot collaboration; RGB-D perception

Funding

  1. National Research Foundation, Singapore [NRF-NRFI05-2019-0007]
  2. Ministry of Education, Singapore [19-C220-SMU-008]
  3. Agency for Science, Technology and Research (A*STAR), Singapore [A18A2b0046]


The article addresses the challenge of executing multi-modal referring instruction comprehension models on embedded devices and proposes the COSM2IC framework, which assesses instructional complexity from multiple sensor inputs and dynamically switches between models to reduce computational cost, underscoring the importance of resource-efficient comprehension for embodied Human-Robot Interaction.
Supporting real-time, on-device execution of multi-modal referring instruction comprehension models is an important challenge to be tackled in embodied Human-Robot Interaction. However, state-of-the-art deep learning models are resource-intensive and unsuitable for real-time execution on embedded devices. While model compression can achieve a reduction in computational resources up to a certain point, further optimizations result in a severe drop in accuracy. To minimize this loss in accuracy, we propose the COSM2IC framework, with a lightweight Task Complexity Predictor, that uses multiple sensor inputs to assess the instructional complexity and thereby dynamically switch between a set of models of varying computational intensity such that computationally less demanding models are invoked whenever possible. To demonstrate the benefits of COSM2IC, we utilize a representative human-robot collaborative table-top target acquisition task, to curate a new multi-modal instruction dataset where a human issues instructions in a natural manner using a combination of visual, verbal, and gestural (pointing) cues. We show that COSM2IC achieves a 3-fold reduction in comprehension latency when compared to a baseline DNN model while suffering an accuracy loss of only ~5%. When compared to state-of-the-art model compression methods, COSM2IC is able to achieve a further 30% reduction in latency and energy consumption for a comparable performance.
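The core idea of the abstract — a lightweight complexity predictor that gates which comprehension model is invoked — can be sketched as follows. This is a minimal, hypothetical illustration: the feature set, scoring heuristic, threshold, and function names are assumptions for exposition, not COSM2IC's actual Task Complexity Predictor.

```python
# Hypothetical sketch of complexity-gated model switching.
# All names, weights, and the 0.5 threshold are illustrative only.

def predict_complexity(instruction, gesture_present, num_objects):
    """Toy complexity score in [0, 1] from multi-modal cues."""
    score = min(num_objects / 10.0, 1.0) * 0.5   # cluttered scenes are harder
    score += 0.0 if gesture_present else 0.3     # a pointing cue disambiguates
    score += 0.2 if len(instruction.split()) > 8 else 0.0  # long referring phrases
    return min(score, 1.0)

def comprehend(instruction, gesture_present, num_objects,
               light_model, heavy_model, threshold=0.5):
    """Invoke the cheaper model whenever the instruction looks simple."""
    c = predict_complexity(instruction, gesture_present, num_objects)
    model = light_model if c < threshold else heavy_model
    return model(instruction)

# Stand-in "models" that just tag which path was taken.
light = lambda s: ("light", s)
heavy = lambda s: ("heavy", s)

print(comprehend("pick up the red cup", True, 3, light, heavy))
# short, gesture-assisted instruction in a sparse scene -> ('light', ...)
```

The design point this sketch captures is that the predictor must be far cheaper than the models it routes between; otherwise the switching overhead erases the latency savings.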

