CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails

Source

https://arxiv.org/abs/2010.03484

PDF

https://arxiv.org/pdf/2010.03484

Paper Information

Authors: Younghoo Lee, Joshua Saxe, Richard Harang
Published: 10-8-2020
Affiliation: Sophos AI
Country: United Kingdom
Conference: Computing Research Repository (CoRR)

Labels Estimated by AI

Model Architecture, Improvement of Learning, Machine Learning

These labels were automatically added by AI and may be inaccurate.
For details, see About Literature Database.

Abstract

Targeted phishing emails are on the rise and facilitate the theft of billions of dollars from organizations each year. While malicious signals from attached files or malicious URLs in emails can be detected by conventional malware signatures or machine learning technologies, it is challenging to identify hand-crafted social engineering emails that contain no malicious code and share no word choices with known attacks. To tackle this problem, we fine-tune a pre-trained BERT model, replacing half of its Transformer blocks with simple adapters to efficiently learn sophisticated representations of the syntax and semantics of natural language. Our Context-Aware network also learns context representations that relate an email's content to context features from its headers. Our CatBERT (Context-Aware Tiny BERT) achieves an 87% detection rate, compared with DistilBERT, LSTM, and logistic regression baselines, which achieve 83%, 79%, and 54% detection rates, respectively, at a false positive rate of 1%. Our model is also faster than competing Transformer approaches and is resilient to adversarial attacks that deliberately replace keywords with typos or synonyms.
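
The abstract reports detection rates at a fixed false positive rate of 1%. The following is a minimal sketch of how such a figure can be computed from classifier scores. It is not the authors' evaluation code: the function name `detection_rate_at_fpr`, its parameters, and the toy scores are assumptions made for illustration. The threshold is set so that at most 1% of benign emails score above it, and the detection rate is the fraction of malicious emails that score above the same threshold.

```python
def detection_rate_at_fpr(scores, labels, max_fpr=0.01):
    """Return (threshold, detection_rate) at a false positive budget.

    Hypothetical helper for illustration only.
    scores: classifier outputs, higher means more likely malicious.
    labels: 1 for malicious, 0 for benign.
    An email is flagged when its score is strictly above the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign emails we may flag without exceeding max_fpr.
    # The small epsilon guards against floating-point round-down.
    allowed = int(max_fpr * len(benign) + 1e-9)

    if allowed >= len(benign):
        # Budget covers every benign email, so everything can be flagged.
        threshold = float("-inf")
    else:
        # At most `allowed` benign scores are strictly above this value.
        threshold = benign[allowed]

    detected = sum(1 for s in malicious if s > threshold)
    return threshold, detected / len(malicious)


# Toy data: 100 benign emails with scores 0..99 and 4 malicious emails.
toy_scores = list(range(100)) + [150, 200, 50, 98]
toy_labels = [0] * 100 + [1] * 4
toy_threshold, toy_rate = detection_rate_at_fpr(toy_scores, toy_labels, max_fpr=0.01)
```

On the toy data, one benign email (1% of 100) is allowed above the threshold, and two of the four malicious emails exceed it. Comparing models at the same false positive budget, as the abstract does, keeps the comparison independent of each model's raw score scale.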

External Datasets

large-scale dataset

labelled target dataset

dataset of about five million emails

training dataset

validation dataset

test dataset