Model Interpretability

Towards LLM Guardrails via Sparse Representation Steering

Authors: Zeqing He, Zhibo Wang, Huiyu Xu, Kui Ren | Published: 2025-03-21
Sparse Representation Method
Model Interpretability
Role of Machine Learning

A Synergistic Approach In Network Intrusion Detection By Neurosymbolic AI

Authors: Alice Bizzarri, Chung-En Yu, Brian Jalaian, Fabrizio Riguzzi, Nathaniel D. Bastian | Published: 2024-06-03
NSAI Integration
Model Interpretability
Unknown Attack Detection

Explainable Malware Detection with Tailored Logic Explained Networks

Authors: Peter Anthony, Francesco Giannini, Michelangelo Diligenti, Martin Homola, Marco Gori, Stefan Balogh, Jan Mojzis | Published: 2024-05-05
Malware Classification
Model Interpretability
Evaluation Method

Explainability Guided Adversarial Evasion Attacks on Malware Detectors

Authors: Kshitiz Aryal, Maanak Gupta, Mahmoud Abdelsalam, Moustafa Saleh | Published: 2024-05-02
Watermarking
Malware Classification
Model Interpretability

Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots

Authors: Xi Xin, Giles Hooker, Fei Huang | Published: 2024-04-29 | Updated: 2024-05-01
Model Interpretability
Adversarial Training
Watermark Evaluation

MISLEAD: Manipulating Importance of Selected features for Learning Epsilon in Evasion Attack Deception

Authors: Vidit Khazanchi, Pavan Kulkarni, Yuvaraj Govindarajulu, Manojkumar Parmar | Published: 2024-04-24 | Updated: 2024-05-02
Model Interpretability
Attack Method
Adversarial Training

Decomposing and Editing Predictions by Modeling Model Computation

Authors: Harshay Shah, Andrew Ilyas, Aleksander Madry | Published: 2024-04-17
Watermarking
Model Interpretability
Model editing techniques

Conformal Predictions for Probabilistically Robust Scalable Machine Learning Classification

Authors: Alberto Carlevaro, Teodoro Alamo Cantarero, Fabrizio Dabbene, Maurizio Mongelli | Published: 2024-03-15
Watermarking
Model Interpretability
Quantification of Uncertainty

An Explainable Transformer-based Model for Phishing Email Detection: A Large Language Model Approach

Authors: Mohammad Amaz Uddin, Iqbal H. Sarker | Published: 2024-02-21
Phishing Detection
Model Interpretability
Model Performance Evaluation

LIPSTICK: Corruptibility-Aware and Explainable Graph Neural Network-based Oracle-Less Attack on Logic Locking

Authors: Yeganeh Aghamohammadi, Amin Rezaei | Published: 2024-02-06
Graph Neural Network
Model Interpretability
Watermark Evaluation