
CodeEditorBench

This is the official repository for the paper "CodeEditorBench: Evaluating Code Editing Capability of Large Language Models".

Introduction

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, a pioneering evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development.

We curated diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluating 17 LLMs revealed that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem type and prompt sensitivity. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners in the field.


Quick Start

Set Environment

  • Enter the workspace: git clone https://github.com/CodeEditorBench/CodeEditorBench.git

  • Create a new conda environment with coder.yml: conda env create -f coder.yml

  • Activate the coder environment: conda activate coder

Download Data

Our datasets are available on CodeEditorBench.

To organize the datasets, create a folder named data with mkdir data, then move the downloaded dataset files into the data/ folder, as sketched below.
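A minimal sketch of this step, assuming the dataset files have already been downloaded into the current directory (the file names below are placeholders; use the actual files from the CodeEditorBench release):

# Create the data folder expected by the inference scripts.
mkdir -p data
# Move the downloaded dataset files into data/ (placeholder file names).
mv code_debug.jsonl code_translate.jsonl code_polishment.jsonl code_switch.jsonl data/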

Download Models

Before running inference with open models, make sure you have downloaded all of them from Hugging Face.

We suggest using huggingface-cli to accelerate the download process.

huggingface-cli download --resume-download deepseek-ai/deepseek-coder-33b-instruct --local-dir ./model/deepseek-coder-33b-instruct
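If you need several of the supported models, you can loop over them with the same command; a sketch (the model list here is illustrative; check vllm_inference.sh for the full set of supported models):

# Download each model into ./model/<model-name> (illustrative model list).
for model in deepseek-ai/deepseek-coder-33b-instruct deepseek-ai/deepseek-coder-6.7b-instruct; do
    huggingface-cli download --resume-download "$model" --local-dir "./model/$(basename "$model")"
done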

Inference

We use vLLM for inference with open models. You can simply run bash vllm_inference.sh to run inference with all the open models we support. Make sure you have created the output folders first:

mkdir -p greedy_result/{code_debug,code_translate,code_polishment,code_switch}

Here is a demo snippet that explains the hyperparameters:

python vllm_inference.py \
    --base_model "$base_model" \
    --dataset "$dataset" \
    --input_data_dir "./data/" \
    --output_data_dir "./greedy_result/" \
    --batch_size 64 \
    --num_of_sequences 1 \
    --num_gpus 8 \
    --prompt_type "zero" \
    --start_idx 0 \
    --end_idx -1
  • --base_model: The open model used to generate output. You can view all supported models in vllm_inference.sh.
  • --dataset: The dataset used for the inference process.
  • --input_data_dir: The directory where the input data is located.
  • --output_data_dir: The destination path where the output data will be saved.
  • --batch_size: Number of samples processed in parallel.
  • --num_of_sequences: Number of sequences generated for the same question.
  • --num_gpus: Number of GPUs used for computation.
  • --prompt_type: Type of prompt the model uses to generate output (e.g., "zero" for zero-shot).
  • --start_idx: The starting index for processing the dataset.
  • --end_idx: The ending index for processing the dataset.

To fully understand these hyperparameters, consult the source code of vllm_inference.py.
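For instance, a concrete single-model run might look like the following, assuming the dataset splits share the names of the output folders above (e.g. code_debug) and that the model was downloaded to ./model/ as in the previous step; adjust --num_gpus to your hardware:

python vllm_inference.py \
    --base_model "./model/deepseek-coder-33b-instruct" \
    --dataset "code_debug" \
    --input_data_dir "./data/" \
    --output_data_dir "./greedy_result/" \
    --batch_size 64 \
    --num_of_sequences 1 \
    --num_gpus 8 \
    --prompt_type "zero" \
    --start_idx 0 \
    --end_idx -1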

Evaluation

Evaluation is performed inside Docker. To run evaluation on CodeEditorBench, please refer to the Evaluation README.md for more details.

Our evaluation module is a secondary development based on HUSTOJ; the content within the evaluation module adheres to the GPL-2.0 license.
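As a rough, hypothetical sketch of the Docker workflow (the actual image name, build context, and mounted paths are defined in the Evaluation README.md and will likely differ):

# Build the evaluation image (hypothetical tag and build context).
docker build -t codeeditorbench-eval ./evaluation
# Run the container with the generated outputs mounted in (hypothetical paths).
docker run -it -v "$(pwd)/greedy_result:/workspace/greedy_result" codeeditorbench-eval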

Results

We propose evaluating LLMs across four scenarios capturing various code editing capabilities: code debug, code translate, code polish, and code requirement switch. The figure depicts the performance of various models across the four scenarios in CodeEditorBench_Plus as a radial plot, highlighting how relative differences between models change across scenarios. We also report the performance of open-source and closed-source models on CodeEditorBench_Plus in the zero-shot setting, evaluated through win_rate.

Citation

@misc{guo2024codeeditorbench,
      title={CodeEditorBench: Evaluating Code Editing Capability of Large Language Models}, 
      author={Jiawei Guo and Ziming Li and Xueling Liu and Kaijing Ma and Tianyu Zheng and Zhouliang Yu and Ding Pan and Yizhi LI and Ruibo Liu and Yue Wang and Shuyue Guo and Xingwei Qu and Xiang Yue and Ge Zhang and Wenhu Chen and Jie Fu},
      year={2024},
      eprint={2404.03543},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Acknowledgement
