Code4ML: a large-scale dataset of annotated Machine Learning code

PEERJ COMPUTER SCIENCE (2023)

Journal

PEERJ COMPUTER SCIENCE

Volume 9, Issue -, Pages -

Publisher

PEERJ INC

DOI: 10.7717/peerj-cs.1230

Keywords

ML code dataset; Jupyter code snippets

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Theory & Methods

Summary

Data scientists increasingly use program code as a data source for a variety of purposes. However, machine learning models are of limited use without annotated code snippets. To address this, the Code4ML corpus is introduced, which consists of annotated code snippets collected from Kaggle. The dataset can support data-driven approaches to software engineering and data science problems, such as semantic code classification and code generation from natural language specifications of ML tasks.

Abstract

The use of program code as a data source is expanding among data scientists. Its applications range from the semantic classification of code to the automatic generation of programs. However, machine learning models are of limited use when the code snippets are not annotated. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of approximately 2.5 million snippets of ML code collected from approximately 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface designed for that purpose. The Code4ML dataset can help address a number of software engineering and data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.
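To make the idea of code snippets drawn from Jupyter notebooks more concrete, here is a minimal sketch of how code cells could be pulled out of a notebook file. It is an illustration only, not the authors' collection pipeline: it assumes that one snippet corresponds to one non-empty code cell (the abstract does not say this), and the function name `extract_code_snippets` and the toy notebook are hypothetical.

```python
import json


def extract_code_snippets(notebook_json: str) -> list[str]:
    """Return the source text of every non-empty code cell.

    Assumes the nbformat 4 layout: a top-level "cells" list whose items
    carry a "cell_type" and a "source" field.
    """
    notebook = json.loads(notebook_json)
    snippets = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be a single string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            snippets.append(source)
    return snippets


# Toy notebook used only for illustration (not taken from the corpus).
example_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["import pandas as pd\n", "df = pd.read_csv('train.csv')"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ""},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": "df.head()"},
    ],
})

snippets = extract_code_snippets(example_notebook)
```

Each returned string could then be paired with a semantic label to form one annotated record; how labels are defined and assigned is described in the paper itself and is not modeled here.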
