Explainable Ensemble Learning for IoT Intrusion Detection: Multi-device Evaluation using SHAP-based Interpretability and Class Balancing

Muhammad Irfan

doi:10.63163/jpehss.v4i2.1433

Authors

Muhammad Irfan, MNS University of Agriculture, Multan. Email: dairfankhan382@gmail.com

Abstract

Existing work on IoT intrusion detection (ID) has two common drawbacks: most models are evaluated on a single device type, and they reveal little about why they reach their decisions. This paper addresses both by running interpretable ensemble-learning experiments on seven types of IoT devices, spanning consumer appliances, industrial sensors, and environmental monitors, and by examining the behaviour of the resulting models in detail. Three challenges are identified: extreme class imbalance, in which attacks make up a very small share of the samples; limited interpretability, which restricts how much trust security teams can place in detection results; and a lack of evidence that detection generalizes across device types. Gradient-boosting ensembles (LightGBM and XGBoost) were combined with class-balancing techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) and with interpretability analysis based on SHapley Additive exPlanations (SHAP). On 197,811 traffic samples, SMOTE raised ROC-AUC from 0.88–0.94 to 0.94–1.00, while inference latency increased from 2.3–3.1 ms to 3.2–3.4 ms. SHAP analysis showed that packet-size statistics, inter-arrival timing, and protocol attributes were the most important feature groups, together accounting for about 68–73% of the predictive signal. Compared with a Long Short-Term Memory (LSTM) baseline (ROC-AUC 0.86–0.91, latency 47 ms), the ensemble models achieved higher recall and substantially better interpretability, with a memory footprint under 50 MB that suits edge deployment.
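The class-balancing step mentioned in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic attack sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is not the authors' implementation. The function name `smote_oversample`, the two toy features, and all values below are hypothetical, and a real pipeline would more likely use an established implementation such as `SMOTE` from the imbalanced-learn library, applied to the training split only.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data with two features per flow:
# (mean packet size in bytes, mean inter-arrival time in seconds).
benign = [[500.0 + i, 0.10] for i in range(20)]
attack = [[60.0, 0.01], [64.0, 0.02], [70.0, 0.015]]

# Oversample the attack class until it matches the benign class in size.
balanced_attack = attack + smote_oversample(attack, len(benign) - len(attack), k=2)
```

Because every synthetic point is an interpolation of two real attack samples, it stays inside the region already spanned by the minority class, which is why SMOTE tends to be less prone to overfitting than duplicating attack rows verbatim.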