Model Interpretability

Towards LLM Guardrails via Sparse Representation Steering

Authors: Zeqing He, Zhibo Wang, Huiyu Xu, Kui Ren | Published: 2025-03-21

Sparse Representation Method

Model Interpretability

Role of Machine Learning

2025.03.21 2025.05.27

Literature Database

A Synergistic Approach In Network Intrusion Detection By Neurosymbolic AI

Authors: Alice Bizzarri, Chung-En Yu, Brian Jalaian, Fabrizio Riguzzi, Nathaniel D. Bastian | Published: 2024-06-03

NSAI Integration

Model Interpretability

Unknown Attack Detection

2024.06.03 2025.05.27

Explainable Malware Detection with Tailored Logic Explained Networks

Authors: Peter Anthony, Francesco Giannini, Michelangelo Diligenti, Martin Homola, Marco Gori, Stefan Balogh, Jan Mojzis | Published: 2024-05-05

Malware Classification

Model Interpretability

Evaluation Method

2024.05.05 2025.05.27

Explainability Guided Adversarial Evasion Attacks on Malware Detectors

Authors: Kshitiz Aryal, Maanak Gupta, Mahmoud Abdelsalam, Moustafa Saleh | Published: 2024-05-02

Watermarking

Malware Classification

Model Interpretability

2024.05.02 2025.05.27

Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots

Authors: Xi Xin, Giles Hooker, Fei Huang | Published: 2024-04-29 | Updated: 2024-05-01

Model Interpretability

Adversarial Training

Watermark Evaluation

2024.04.29 2025.05.27

MISLEAD: Manipulating Importance of Selected features for Learning Epsilon in Evasion Attack Deception

Authors: Vidit Khazanchi, Pavan Kulkarni, Yuvaraj Govindarajulu, Manojkumar Parmar | Published: 2024-04-24 | Updated: 2024-05-02

Model Interpretability

Attack Method

Adversarial Training

2024.04.24 2025.05.27

Decomposing and Editing Predictions by Modeling Model Computation

Authors: Harshay Shah, Andrew Ilyas, Aleksander Madry | Published: 2024-04-17

Watermarking

Model Interpretability

Model Editing Techniques

2024.04.17 2025.05.27

Conformal Predictions for Probabilistically Robust Scalable Machine Learning Classification

Authors: Alberto Carlevaro, Teodoro Alamo Cantarero, Fabrizio Dabbene, Maurizio Mongelli | Published: 2024-03-15

Watermarking

Model Interpretability

Quantification of Uncertainty

2024.03.15 2025.05.27

An Explainable Transformer-based Model for Phishing Email Detection: A Large Language Model Approach

Authors: Mohammad Amaz Uddin, Iqbal H. Sarker | Published: 2024-02-21

Phishing Detection

Model Interpretability

Model Performance Evaluation

2024.02.21 2025.05.27

LIPSTICK: Corruptibility-Aware and Explainable Graph Neural Network-based Oracle-Less Attack on Logic Locking

Authors: Yeganeh Aghamohammadi, Amin Rezaei | Published: 2024-02-06

Graph Neural Network

Model Interpretability

Watermark Evaluation

2024.02.06 2025.05.27
