Infrastructure as Code (IaC) automates the provisioning and management of IT
infrastructure through scripts and tools, streamlining software deployment.
Prior studies have shown that IaC scripts often contain recurring security
misconfigurations, and several detection and mitigation approaches have been
proposed. Most of these rely on static analysis, using statistical code
representations or Machine Learning (ML) classifiers to distinguish insecure
configurations from safe code.
In this work, we introduce a novel approach that enhances static analysis
with semantic understanding by jointly leveraging natural language and code
representations. Our method builds on two complementary ML models: CodeBERT, to
capture semantics across code and text, and LongFormer, to represent long IaC
scripts without losing contextual information. We evaluate our approach on
misconfiguration datasets from two widely used IaC tools, Ansible and Puppet.
To validate its effectiveness, we conduct two ablation studies (removing code
text from the natural language input and truncating scripts to reduce context)
and compare against four large language models (LLMs) and prior work. Results
show that semantic enrichment substantially improves detection, raising
precision and recall from 0.46 and 0.79 to 0.92 and 0.88 on Ansible, and from
0.55 and 0.97 to 0.87 and 0.75 on Puppet, respectively.