Article

Learning lenient parsing & typing via indirect supervision

Journal

EMPIRICAL SOFTWARE ENGINEERING
Volume 26, Issue 2

Publisher

SPRINGER
DOI: 10.1007/s10664-021-09942-y

Keywords

Program repair; Naturalness; Deep learning

Funding

  1. National Science Foundation, NSF CISE SHF grant [1414172] (Directorate for Computer & Information Science & Engineering, Division of Computing and Communication Foundations)
  2. Microsoft PhD Fellowship
  3. UC Davis Dean's Distinguished Graduate Fellowship

Summary

This paper presents a method for training a lenient parser without human-curated training data, exploiting the large corpus of correct code on GitHub and the learning capacity of Transformer-based neural architectures. The approach achieves reasonable performance in parsing and typing imperfect code, and performs well on shorter student error programs.
Abstract

Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common on StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or the students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this would make the code more amenable to use within IDEs and allow early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors.

Training a machine learner to leniently parse and type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel, indirectly supervised approach to training a lenient parser without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on GitHub, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of code fragments with corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic the corruptions found in StackOverflow and student data. Using this data, we train high-capacity Transformer models to overcome both fragmentation and corruption.

With this novel approach, we achieve reasonable performance on parsing & typing StackOverflow fragments; we also show that our approach performs well on shorter student error programs and achieves best-in-class performance on longer programs with more than 400 tokens. Finally, we show that by blending DeepFix and our tool, we achieve 77% accuracy, outperforming all previously reported student error correction tools.
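The data-construction step described in the abstract (corrupt a correct fragment, keep the original as the training target) is easy to sketch. Below is a minimal, hypothetical Python illustration of that error-seeding idea; the names (corrupt, make_pairs, PUNCT), the token-level corruption operators, and the whitespace tokenization are all illustrative assumptions, not the paper's actual implementation, which works over GitHub-derived tree fragments and type annotations.

import random

PUNCT = {";", "{", "}", "(", ")"}

def corrupt(tokens, rng):
    """Return a copy of `tokens` with one seeded error (hypothetical operators)."""
    toks = list(tokens)
    op = rng.choice(["drop_punct", "drop_token", "swap_adjacent"])
    if op == "drop_punct":
        idxs = [i for i, t in enumerate(toks) if t in PUNCT]
        if idxs:
            del toks[rng.choice(idxs)]   # e.g. lose a semicolon or brace
            return toks
        op = "drop_token"                # no punctuation present: fall back
    if op == "drop_token" and toks:
        del toks[rng.randrange(len(toks))]
        return toks
    if len(toks) >= 2:                   # swap_adjacent
        i = rng.randrange(len(toks) - 1)
        toks[i], toks[i + 1] = toks[i + 1], toks[i]
    return toks

def make_pairs(fragments, n_per_fragment=3, seed=0):
    """Build (corrupted, correct) pairs: the correct fragment supervises the
    model indirectly, so no human-labeled repairs are needed."""
    rng = random.Random(seed)
    pairs = []
    for frag in fragments:
        toks = frag.split()              # stand-in for a real lexer
        for _ in range(n_per_fragment):
            pairs.append((corrupt(toks, rng), toks))
    return pairs

if __name__ == "__main__":
    frags = ["int x = foo ( a , b ) ;", "if ( x > 0 ) { y ++ ; }"]
    for src, tgt in make_pairs(frags, n_per_fragment=1, seed=42):
        print("corrupted:", " ".join(src))
        print("target:   ", " ".join(tgt))

In the paper's pipeline, pairs like these (with tree fragments and type annotations on the output side) would then train a high-capacity Transformer to recover the correct parse despite the seeded corruption.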
