Abstract
Phishing attacks are increasingly prevalent, with adversaries creating
deceptive webpages to steal sensitive information. Despite advancements in
machine learning and deep learning for phishing detection, attackers constantly
develop new tactics to bypass detection models. As a result, phishing webpages
continue to reach users, particularly those unable to recognize phishing
indicators. To improve detection accuracy, models must be trained on large
datasets containing both phishing and legitimate webpages, including URLs,
webpage content, screenshots, and logos. However, existing tools struggle to
collect these resources, especially given the short lifespan of phishing
webpages, which limits how comprehensive the resulting datasets can be. In
response, we introduce
Phish-Blitz, a tool that downloads phishing and legitimate webpages along with
their associated resources, such as screenshots. Unlike existing tools,
Phish-Blitz captures live webpage screenshots and updates resource file paths
to maintain the original visual integrity of the webpage. We provide a dataset
containing 8,809 legitimate and 5,000 phishing webpages, including all
associated resources. Our dataset and tool are publicly available on GitHub,
contributing to the research community by offering a more complete dataset for
phishing detection.
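
To illustrate what updating resource file paths can involve, the following is a minimal Python sketch and not Phish-Blitz's actual implementation. It assumes a saved page whose `img`, `script`, and `link` references should point at locally stored copies. The function names (`local_name`, `rewrite_resource_paths`), the `resources/` directory, and the hash-based naming scheme are all hypothetical, and the regular expression is a deliberate simplification; a real tool would more likely use a proper HTML parser.

```python
import hashlib
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Simplified pattern (an assumption for this sketch): the first src/href
# attribute on an <img>, <script>, or <link> tag, with a quoted value.
RESOURCE_ATTR = re.compile(
    r'(<(?:img|script|link)\b[^>]*?\b(?:src|href)=)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def local_name(resource_url: str) -> str:
    """Map a resource URL to a stable local path (hypothetical naming scheme)."""
    ext = posixpath.splitext(urlparse(resource_url).path)[1] or ".bin"
    digest = hashlib.sha256(resource_url.encode("utf-8")).hexdigest()[:16]
    return f"resources/{digest}{ext}"


def rewrite_resource_paths(html: str, page_url: str) -> tuple[str, dict[str, str]]:
    """Return the rewritten HTML and a map of absolute URL -> local path."""
    mapping: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        prefix, quote, raw = match.group(1), match.group(2), match.group(3)
        # Leave inline data, script pseudo-URLs, fragments, and empty values alone.
        if not raw.strip() or raw.startswith(("data:", "javascript:", "#")):
            return match.group(0)
        absolute = urljoin(page_url, raw)  # resolve relative references
        mapping[absolute] = local_name(absolute)
        return f"{prefix}{quote}{mapping[absolute]}{quote}"

    return RESOURCE_ATTR.sub(replace, html), mapping
```

The returned mapping lists which absolute URLs would need to be downloaded and where each copy would be stored, so the saved page renders from local files once the original host goes offline. Anchor tags (`<a href=...>`) are intentionally left untouched in this sketch, since they point to other pages, not to embedded resources.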