These labels were automatically added by AI and may be inaccurate. For details, see About Literature Database.
Abstract
Semantic understanding of programs has attracted great attention in the
community. Inspired by recent successes of large language models (LLMs) in
natural language understanding, tremendous progress has been made by treating
programming language as another sort of natural language and training LLMs on
corpora of program code. However, programs are essentially different from texts
after all, in a sense that they are normally heavily structured and
syntax-strict. In particular, programs and their basic units (i.e., functions
and subroutines) are designed to demonstrate a variety of behaviors and/or
provide possible outputs, given different inputs. The relationship between
inputs and possible outputs/behaviors represents the functions/subroutines and
profiles the program as a whole. Therefore, we propose to incorporate such a
relationship into learning, for achieving a deeper semantic understanding of
programs. To obtain inputs that are representative enough to trigger the
execution of most part of the code, we resort to fuzz testing and propose fuzz
tuning to boost the performance of program understanding and code
representation learning, given a pre-trained LLM. The effectiveness of the
proposed method is verified on two program understanding tasks including code
clone detection and code classification, and it outperforms current
state-of-the-arts by large margins. Code is available at
https://github.com/rabbitjy/FuzzTuning.