Article

Automated Scoring of Chinese Grades 7-9 Students' Competence in Interpreting and Arguing from Evidence

Journal

JOURNAL OF SCIENCE EDUCATION AND TECHNOLOGY
Volume 30, Issue 2, Pages 269-282

Publisher

SPRINGER
DOI: 10.1007/s10956-020-09859-z

Keywords

Automated scoring; Scientific argumentation; Chinese writing; LightSIDE

Funding

  1. International Joint Research Project of Faculty of Education, Beijing Normal University
  2. China Scholarship Council (CSC) [201806040088]


The study found that at least 800 human-scored student responses were needed as the training sample to build accurate models for automated scoring of Chinese written responses. Human and computer-automated scores showed nearly perfect agreement for both holistic and analytic scoring.
Assessing scientific argumentation is one of the main challenges in science education. Constructed-response (CR) items can be used to measure the coherence of student ideas and inform science instruction on argumentation. Published research on automated scoring of CR items has been conducted mostly on English writing, rarely in other languages. The objective of this study is to investigate issues related to the automated scoring of Chinese written responses. LightSIDE was used to score students' written responses in Chinese. The sample consisted of 4,000 students in grades 7-9 from Beijing. Items on an ecological topic developed by the Stanford NGSS Assessment Project were translated into Chinese and used to assess students' competence in interpreting data and making claims. The results show that: (1) at least 800 human-scored student responses were needed as the training sample size to build accurate scoring models; doubling the training sample size increased kappa only slightly, by 0.03-0.04; (2) there was nearly perfect agreement between human scoring and computer-automated scoring for both holistic and analytic scores, although analytic scores produced slightly better accuracy than holistic scores; (3) automated scoring accuracy did not differ substantially by student response length, although shorter texts produced slightly higher human-machine agreement. These findings suggest that automated scoring of Chinese writing achieves a level of accuracy similar to that reported in the literature for English writing, although specific considerations, e.g., training data set size, scoring rubric, and text length, must be accounted for when applying automated scoring to student written responses in Chinese.
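The human-machine agreement discussed above is conventionally quantified with Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. As a minimal illustration (the scores below are hypothetical, not the study's data):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    scoring the same set of responses with categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of responses with identical scores
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap given each rater's marginal score distribution
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. machine scores on a 0-2 analytic rubric
human   = [0, 1, 2, 2, 1, 0, 2, 1, 1, 2]
machine = [0, 1, 2, 1, 1, 0, 2, 1, 0, 2]
print(round(cohen_kappa(human, machine), 3))  # -> 0.697
```

On common benchmarks (e.g., Landis and Koch), kappa above roughly 0.8 is read as "nearly perfect" agreement, which is the standard the abstract invokes.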
