Article

Student engagement detection in online environment using computer vision and multi-dimensional feature fusion

Journal

MULTIMEDIA SYSTEMS
Volume -, Issue -, Pages -

Publisher

SPRINGER
DOI: 10.1007/s00530-023-01153-3

Keywords

Feature fusion; Facial expression; Head pose; Student engagement

Abstract

In the post-COVID-19 era, online learning has become a normalized teaching method, but it faces challenges of low participation and high dropout rates. To address these issues, this paper proposes a system that utilizes image data from individual webcams to accurately detect students' classroom engagement levels. The system incorporates multi-dimensional feature fusion and multimodal analysis techniques to provide real-time support for teachers and enhance students' engagement in online courses.
In the post-COVID-19 era, online learning has changed from an emergency teaching method into a new, normalized one. However, compared to offline learning, online learning is often plagued by low participation and high dropout rates. A critical way to address these issues is the accurate detection of student engagement, which helps teachers promptly assess learners' status. Image data are one of the most straightforward ways to reflect student engagement levels. However, traditional image-based engagement detection methods either rely on manual analysis or interfere with student behavior, which undermines the objectivity of the resulting engagement levels. This paper proposes a system that utilizes images obtained from individual webcams in online classrooms. Based on multi-dimensional feature fusion and multimodal analysis techniques, the system can rapidly detect and output students' classroom engagement levels, providing real-time support for teachers to adjust their teaching methods during class, with the aim of enhancing students' engagement in online courses. In the feature extraction module, VGG16 is utilized to recognize students' facial expressions, ResNet-101 is designed to estimate the head pose in each image, and MediaPipe is applied to extract facial landmarks that reflect eye and mouth behavior. Subsequently, in the data fusion module, a BP (back-propagation) neural network is constructed to fuse these multi-dimensional features and output the engagement level for each image. The method is evaluated on the wacv2016 dataset and achieves an accuracy of 62.03%, outperforming single-dimensional methods. It is also applied in online courses to further demonstrate its validity in a real-world scenario. The Pearson correlation between engagement levels calculated by our multi-dimensional fusion method and NSSE-China survey scores filled out by students is 0.714, indicating that the method enables real-time monitoring of students' classroom engagement with results similar to traditional questionnaires while requiring little human effort and time.
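The abstract describes a three-branch pipeline whose features are fused by a BP neural network. The PyTorch sketch below illustrates one plausible wiring of such a fusion: the backbone choices follow the abstract (VGG16 for expression, ResNet-101 for head pose, landmark statistics from MediaPipe), but all layer sizes, projection dimensions, and the four-level output are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class EngagementFusionNet(nn.Module):
    """Three-branch engagement model sketched from the abstract:
    facial expression (VGG16), head pose (ResNet-101), and eye/mouth
    landmark statistics (e.g., from MediaPipe), fused by a small
    fully connected ('BP') network. All dimensions are assumptions."""

    def __init__(self, num_landmark_feats=10, num_levels=4):
        super().__init__()
        # Expression branch: VGG16 convolutional backbone, classifier removed.
        vgg = models.vgg16(weights=None)
        self.expr_backbone = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())
        self.expr_proj = nn.Linear(512 * 7 * 7, 128)
        # Head-pose branch: ResNet-101 up to its global average pool.
        resnet = models.resnet101(weights=None)
        self.pose_backbone = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())
        self.pose_proj = nn.Linear(2048, 128)
        # Landmark branch: hand-crafted eye/mouth features (aspect ratios, etc.).
        self.lmk_proj = nn.Linear(num_landmark_feats, 32)
        # Fusion 'BP' (back-propagation) network over the concatenated features.
        self.fusion = nn.Sequential(
            nn.Linear(128 + 128 + 32, 64),
            nn.ReLU(),
            nn.Linear(64, num_levels),  # one logit per engagement level
        )

    def forward(self, face_img, landmark_feats):
        expr = self.expr_proj(self.expr_backbone(face_img))
        pose = self.pose_proj(self.pose_backbone(face_img))
        lmk = self.lmk_proj(landmark_feats)
        return self.fusion(torch.cat([expr, pose, lmk], dim=1))

# Smoke test on random inputs (batch of 2 face crops at 224x224).
net = EngagementFusionNet()
logits = net(torch.randn(2, 3, 224, 224), torch.randn(2, 10))
print(logits.shape)  # torch.Size([2, 4])
```

The reported real-world validation correlates per-student model scores with NSSE-China questionnaire scores via a Pearson correlation. A minimal version of that check (the numbers here are made up, purely illustrative) is:

```python
import numpy as np

model_scores = np.array([2.1, 3.4, 1.8, 3.9, 2.7])   # hypothetical model outputs
survey_scores = np.array([2.5, 3.6, 1.5, 3.8, 3.0])  # hypothetical NSSE-China scores
r = np.corrcoef(model_scores, survey_scores)[0, 1]
print(f"Pearson r = {r:.3f}")
```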
