One of the most significant challenges in the field of software code auditing
is the presence of vulnerabilities in software source code. Every year, more
and more software flaws are discovered, either internally in proprietary code
or publicly disclosed. These flaws are highly likely to be exploited and can
lead to system compromise, data leakage, or denial of service. To create a
large-scale machine learning system for function-level vulnerability
identification, we used a sizable dataset of C and C++ open-source code
containing millions of functions with potential buffer overflow exploits. We
have developed an efficient and scalable vulnerability detection method based
on neural network models that learn features extracted from the source code.
The source code is first converted into an intermediate representation to
remove unnecessary components and shorten dependencies. We preserve the
semantic and syntactic information using state-of-the-art word embedding
algorithms such as GloVe and fastText. The embedded vectors are subsequently
fed into neural networks such as LSTM, BiLSTM, LSTM Autoencoder, word2vec,
BERT, and GPT2 to classify the possible vulnerabilities. Furthermore, we have
proposed a neural network model that can overcome issues associated with
traditional neural networks. We have used evaluation metrics such as F1 score,
precision, recall, accuracy, and total execution time to measure
performance. We have conducted a comparative analysis between results derived
from features containing a minimal text representation and results derived
from features containing semantic and syntactic information.
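
As an illustration of the evaluation metrics named above, the following is a minimal sketch of how accuracy, precision, recall, F1 score, and execution time could be computed for a binary vulnerable/non-vulnerable classifier. The names `binary_metrics`, `timed`, and `Metrics`, and the label convention (1 = vulnerable, 0 = not vulnerable), are assumptions made for this example and are not taken from our implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute standard metrics from 0/1 labels (1 = vulnerable, assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f1)


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical labels and predictions, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    metrics, seconds = timed(binary_metrics, y_true, y_pred)
    print(metrics, f"elapsed={seconds:.6f}s")
```

In practice, `timed` would wrap the model's training or inference call rather than the metric computation, so that the reported time reflects the cost of the classifier itself.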