NDSS'22 Summer Paper #102 Reviews and Comments
===========================================================================
Paper #102: Interpretable Federated Transformer Log Learning for Cloud Threat Forensics


Review #102A
===========================================================================

Overall Recommendation
----------------------
2. Leaning towards reject

Writing Quality
---------------
3. Adequate

Reviewer Confidence
-------------------
2. Passable confidence

Paper Summary
-------------
This paper applies Federated Learning (FL), which aggregates the learned model parameters from local models to generate a global model, to identify cyber threat activities in system logs (syslogs). Compared with previous centralized approaches, this FL-based method protects clients' privacy by not exposing syslogs to central servers. The proposed approach also integrates an attention layer to make the results interpretable. For evaluation, the authors used one public dataset, HDFS, and one dataset collected by themselves, CTDD. Their FL-based method achieved comparable performance on HDFS. They also tested the proposed model in 2 real-world scenarios, which demonstrates the applicability of its interpretability.

Strengths
---------
+ Adapts Federated Learning to syslog analysis
+ Releases a dataset (CTDD) consisting of syslogs representing cloud collaboration services and systems compromised under different classes of cyber-attacks
+ Extensive evaluation

Weaknesses
----------
- The accuracy and recall on the CTDD dataset are not high.
- Whether the interpretability provided by attention is sufficient in the cyber-security setting is unjustified.
- Adaptive attacks against FL and against the approach itself are not discussed.

Detailed Comments for Authors
-----------------------------
FL is a widely used approach in many fields, such as CV and visualization, to solve the privacy issues in training an ML model. Applying it to syslogs to detect cyber-attacks appears to be a new application. Though the way FL is used is standard, it is always exciting to explore the combination of new AI techniques and cyber-security applications.

My major concern is the performance on CTDD (about 80% precision, recall, and F-score). This might make the approach less practical in real-world scenarios with a large volume of logs to process. I also suggest the authors test the 4 SOTA methods on CTDD for comparison. CTDD contains simulated cyber-attacks, so results on it should be more convincing than on HDFS, whose anomalies are system anomalies and not necessarily cyber-attacks.

Also, the proposed model is based on 5 assumptions. One of them is that "Federated server and clients are trusted." (Section III, Page 3) This is a strong assumption, since clients in FL-based models are not always trustworthy in the real world: recent work has shown that, during the training phase, a malicious client can poison the global model at the central server [3]. Some defenses have also been proposed [1,2]. There are also discussions in other domains such as CV [4,5].

[1] Xiaoyu Cao, Minghong Fang, Jia Liu, and Neil Zhenqiang Gong. "FLTrust: Byzantine-robust Federated Learning via Trust Bootstrapping". In ISOC Network and Distributed System Security Symposium (NDSS), 2021.
[2] Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. "Provably Secure Federated Learning against Malicious Clients". In AAAI Conference on Artificial Intelligence (AAAI), 2021.
[3] Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. "Local Model Poisoning Attacks to Byzantine-Robust Federated Learning". In USENIX Security Symposium, 2020.
[4] Q. Li, X. Wei, H. Lin, Y. Liu, T. Chen, and X. Ma. "Inspecting the Running Process of Horizontal Federated Learning via Visual Analytics". In IEEE Transactions on Visualization and Computer Graphics, 2021. doi: 10.1109/TVCG.2021.3074010.
[5] A. Ghosh, J. Hong, D. Yin, and K. Ramchandran. "Robust Federated Learning in a Heterogeneous Environment". arXiv preprint arXiv:1906.06629, 2019.
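To make the trust assumption concrete: if the aggregation follows standard FedAvg, which is my reading of the paper, then a single client that amplifies its poisoned update can dominate the weighted average. A minimal numpy sketch (dimensions, scaling factor, and client counts are hypothetical, not taken from the paper):

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Standard FedAvg: average client parameter vectors weighted by
    the number of local training samples."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

rng = np.random.default_rng(0)
w_star = rng.normal(size=8)                                 # "true" model
honest = [w_star + 0.01 * rng.normal(size=8) for _ in range(10)]
sizes = [100] * 10

# One malicious client replaces its update with an amplified opposite
# direction, as in local model poisoning attacks [3].
poisoned = honest[:-1] + [-10.0 * w_star]

print(np.linalg.norm(fedavg(honest, sizes) - w_star))       # small
print(np.linalg.norm(fedavg(poisoned, sizes) - w_star))     # large
```

Defenses such as FLTrust [1] replace this plain average with robust aggregation; the paper should at least discuss why its trust assumption is reasonable in a cloud forensics setting.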
"Local Model Poisoning Attacks to Byzantine-Robust Federated Learning". In USENIX Security Symposium, 2020. [4] Q. Li, X. Wei, H. Lin, Y. Liu, T. Chen and X. Ma, "Inspecting the Running Process of Horizontal Federated Learning via Visual Analytics," in IEEE Transactions on Visualization and Computer Graphics, doi: 10.1109/TVCG.2021.3074010. [5] Ghosh, A., Hong, J., Yin, D. and Ramchandran, K., 2019. Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629. The definition of interpretability also concerns me. The author uses histograms and heat maps generated by an attention-based model in order to represent interpretability. However, in security domain, I'm not sure if this is the best approach. There have been other approaches for cyber-security [6]. I hope the authors can justify their choice or compare to the other methods. [6] LEMNA: Explaining Deep Learning based Security Applications, CCS'18. Another issue is the labeling of log keys in CTDD. Since the interpretability essentially tells which log keys are related to a cyber-attack, they need to be labeled individually as well, rather than as a whole on each VM. I don't find detail in the log key labeling. The closest I found is "In addition, a total of 2,501 syslogs were collected from 16 compromised VMs running malicious threat samples presented in Table II". For 16 compromised VMs, there must be syslogs generated by system routine processes which are normal. It seems the authors don't differentiate the log keys at the compromised VMs. For writing, I think motivating the problem with Covid-19 is unnecessary. Even without COVID-19, protecting syslogs with sensitive information is still a very important problem. This paper also has a number of typos. To name a few: - Page 3: Many *researches* addressed - Page 9, Section IV.C: Our model it's limit Review #102B =========================================================================== Overall Recommendation ---------------------- 3. Major revision Writing Quality --------------- 2. Needs improvement Reviewer Confidence ------------------- 2. Passable confidence Paper Summary ------------- In this paper, the authors proposed an interpretable federated transformer log learning algorithm and compared it with SOTA methods. The authors also proposed a dataset CTDD of a cloud-based operating system. The experiments show that the attention of the proposed method can distinguish between normal operation and cyber threats. Strengths --------- + a novel perspective: an interpretable method in federated learning scenarios to do threat detection. + interesting finding: the distribution of the model's attention to normal and abnormal operations is very different. Weaknesses ---------- - The proposed method does not outperform the SOTA works. - The comparative experiment is conducted on only one dataset. - There are some mistakes that need careful proofreading. For example, in the annotation of Fig2(b), $\upsilon \in \Upsilon$ but $\Upsilon$ is missing. Detailed Comments for Authors ----------------------------- (1) What’s the meaning of duration in Table I? This paper lacks the explanation of 38.7 hours and 235 days. (2) The experimental results of the proposed method in the paper are insufficient. Furthermore, the SOTA methods should also be evaluated on CTDD. Otherwise, it is difficult to demonstrate the usefulness of the dataset. (3) One of the reasons for the interpretation of AI is that explaining the wrong decision of AI can help improve the model. 
However, this paper lacks the explanation and analysis of examples of prediction errors. Such an example should contain at least the following three parts: 1) the behavior that is predicted incorrectly; 2) the explanation of the incorrectly predicted behavior according to the attention mechanism; and 3) an analysis of the prediction and its explanation.

(4) The authors trained the proposed method under a centralized setting for comparison. However, federated learning can also be used to train the other AI-based methods. Therefore, the performance of the other SOTA methods under the federated setting should also be evaluated for a comprehensive comparison.

(5) Attention is the core mechanism of the transformer. I am curious about two questions: 1) Can other transformer-based methods (e.g., HitAnomaly) use this technique to explain their decisions? 2) Furthermore, there are many interpretation methods for machine learning (e.g., LIME [1]). Can the finding in the third contribution of the introduction be generalized to such interpretation methods?

[1] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ""Why Should I Trust You?": Explaining the Predictions of Any Classifier". In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144, 2016.


Review #102C
===========================================================================

Overall Recommendation
----------------------
3. Major revision

Writing Quality
---------------
2. Needs improvement

Reviewer Confidence
-------------------
4. High confidence

Paper Summary
-------------
The paper presents a cyber forensics system that applies pattern-based anomaly detection to log patterns in system logs to detect potential signs of cyber attacks. The approach uses federated learning to aggregate intelligence about normal log entry patterns over several different systems, thereby increasing accuracy. The system also tries to identify the log keys contributing to potential anomalies, thereby providing additional information about the potential triggering events of cyber attacks. The approach transforms log entries into a sequence of log keys that are used to train an autoencoder-type machine learning model capturing normal log key patterns. Locally trained models are sent to a central entity that aggregates them into a global model. This model is then used to estimate whether sequences of log keys appearing in log files represent anomalies or not.

Strengths
---------
- More efficient ways to process system logs for security purposes in an autonomous way are very welcome
- Novel extension of federated learning-based anomaly detection to cyber intelligence on log files

Weaknesses
----------
- The paper is at places hard to follow, and the formalisms used are inconsistent.

Detailed Comments for Authors
-----------------------------
I find this approach of extending FL-based anomaly detection, as it has been employed in, e.g., IoT intrusion detection, an interesting new application area. Considering logs as sequences of entry types represented by log keys makes it possible to capture patterns of normal behavior efficiently and thereby make better use of system logs in managing overall system security.

I would like to see a bit more information on how the log keys are extracted and managed in practice. Are there known matching patterns / templates for the keys? How is it ensured that all systems participating in the federation extract the same keys for the same log entry types?
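To illustrate the kind of deterministic extraction I am asking about, consider the following minimal sketch; the masking rules and messages are hypothetical, not the parser actually used in the paper:

```python
import hashlib
import re

# Masking rules applied in a fixed order, so that every client in the
# federation derives the same template (and hence the same log key)
# for the same log entry type.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def log_key(message: str) -> str:
    """Reduce a raw syslog message to its template, then hash the
    template into a short, stable log key."""
    template = message
    for pattern, token in MASKS:
        template = pattern.sub(token, template)
    return hashlib.sha1(template.encode()).hexdigest()[:8]

# Two entries of the same type map to the same key on any client:
assert log_key("Failed password for root from 192.168.1.7 port 5022") == \
       log_key("Failed password for root from 10.0.0.3 port 41122")
```

If the paper instead relies on a learned or frequency-based parser, the authors should explain how key consistency across clients is guaranteed.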
Unfortunately, I find the formalisms used at places very hard to follow, especially since capital letters are used to denote sets, vectors, and scalars alike, and at places lowercase letters are also used to denote vectors. The use of super- and subscripts is also quite excessive, making it difficult to appreciate what they stand for in particular cases. For example, what is the difference between eqs. (6) and (8)?

In many places the description of concepts is very terse and difficult to understand, e.g.: "Self-attention, used to define the relationships between every key and the other elements in the sequence, is calculated using 3 vectors, namely, query ($Q$), key ($K$), and value ($V$), generated by multiplying every vector $x_i$ with 3 matrices $W^Q$, $W^K$, $W^V$. The resulting vectors $Q$, $K$, $V$ have the corresponding dimensions $d_q$, $d_k$, $d_v$, all being smaller than $d_m$. The output of the self-attention layer $Z_s$ is calculated as follows:" Here, the key concept of self-attention has not been introduced nor defined, and suddenly vectors and matrices appear without any explanation of what they stand for or where they come from. This makes the paragraph almost impossible to understand. (Presumably the intended computation is the standard scaled dot-product attention, $Z_s = \mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, but the paper should state this explicitly.)

Another difficult-to-understand item is Fig. 1, which is used to introduce the system in Sect. 3. The figure is far too complex and detailed to be understood without a very thorough explanation in the main text. I would therefore recommend greatly simplifying this figure to increase understandability. I would also recommend adding a high-level description of the system before diving into the details of the implementation of individual components.

The discussion of model interpretability in Sect. 3.D should also be extended, as it is not quite clear what is meant by the 'attention' of each key and how it is calculated in simple terms. Due to the problems with the formalism mentioned above, the subsequent discussion is very hard to follow.

As a minor note, I think it is typographically more conventional to place table captions above the table, not beneath it.


Review #102D
===========================================================================

Overall Recommendation
----------------------
2. Leaning towards reject

Writing Quality
---------------
3. Adequate

Reviewer Confidence
-------------------
3. Sufficient confidence

Paper Summary
-------------
The paper proposes a transformer-based framework for analyzing log files. The attention weights of this model are then used to offer insights into what triggered the model's behavior. Additionally, the framework deploys federated learning for added privacy.

Strengths
---------
+ Novel dataset.

Weaknesses
----------
- Not clear what criterion the model uses for its prediction.
- Problems with the evaluation.

Detailed Comments for Authors
-----------------------------
The paper tackles an important problem. I specifically applaud the authors' efforts to collect a novel dataset, which is always a problem in security research. However, I have two major problems with the paper:

### Not clear what criterion the model uses for its prediction

It is unclear to me how the model detects abnormal behavior. First, the authors introduce a mechanism in III-C based on the prediction of the model: if reality deviates from the model's prediction (to some predefined degree), they flag the system as abnormal. This is also reflected in Figure 4. Additionally, they compute a $\chi^2$ test on the local aggregate attention and the global attention of all models. They use this test in Section V to argue that the attention of systems under attack deviates heavily from that of a normal system. Is this also used to flag abnormal behavior?
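For the record, the following numpy sketch is my best reconstruction of the test; the normalization and the example distributions are my own assumptions, not taken from the paper. If this is what is intended, the authors should state where $D_i$ actually enters the detection pipeline:

```python
import numpy as np

def attention_divergence(local_attn, global_attn, eps=1e-12):
    """Chi-squared statistic between a client's aggregate attention
    distribution over log keys and the global one."""
    obs = np.asarray(local_attn, dtype=float)
    exp = np.asarray(global_attn, dtype=float)
    obs, exp = obs / obs.sum(), exp / exp.sum()
    return float(np.sum((obs - exp) ** 2 / (exp + eps)))

global_attn = [0.40, 0.30, 0.20, 0.10]
print(attention_divergence([0.38, 0.31, 0.21, 0.10], global_attn))  # benign: small
print(attention_divergence([0.05, 0.05, 0.10, 0.80], global_attn))  # attacked: large
```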
The test is also used in Algorithm 1. However, the computed test statistic $D_i$ is never used, and I am also unsure why it is computed as part of the training loop.

As far as I understand the paper, these so-called "attention weights" are never used in any computation by the model. They are a simple aggregation of the attention over the entire sequence $S$ and all keys $K$ (or split by individual keys). Thus, the weights are entirely determined by the local weights $W$ of the model, and there is nothing to update in step III. I would advise the authors to rename the "attention weights", since the term "weights" is usually reserved for model parameters.

### Problems with the evaluation

While I applaud the authors for collecting a new dataset, the other methods are never evaluated on it. This makes it hard to judge the performance of their model. Additionally, the use cases chosen by the authors are insufficient: they investigate a SYN flood and an NTP DDoS attack, and according to Table II both attacks produce a large number of logs. It would be interesting to see whether stealthier attacks also show up so clearly.

Overall, I am not convinced of the model's performance or its interpretability capabilities.


Review #102E
===========================================================================

Overall Recommendation
----------------------
3. Major revision

Writing Quality
---------------
3. Adequate

Reviewer Confidence
-------------------
2. Passable confidence

Paper Summary
-------------
To detect abnormal log files for threat forensics, the authors propose an interpretable transformer-based federated learning model. The authors evaluate the model against other centralized detection models on the benchmark dataset HDFS. The performance of the model is also evaluated on a cluster of virtual machines built by the authors. To further show the model's interpretability, the authors examine the attention distribution of the log keys in normal and abnormal log files.

Strengths
---------
+ It is interesting to see the authors demonstrate model interpretability by evaluating the attention distributions of the features used in classification.

Weaknesses
----------
- The motivation of this work is weak.
- The methodology lacks novelty.
- The model interpretability needs deeper evaluation and discussion.

Detailed Comments for Authors
-----------------------------
This paper focuses on threat forensics based on interpretable federated transformer log learning. The authors propose an interpretable federated transformer log learning model to detect abnormal log files and evaluate its performance on the benchmark dataset HDFS and on the CTDD dataset they generated. The model's interpretability is also evaluated. However, some issues exist in this paper.

- Lack of sufficient evaluation of model interpretability

In the evaluation of model interpretability, the authors reveal and compare the attention distributions across all log keys for normal operations and attacks. I appreciate the authors' attempt in this part. However, the evaluation of the attention distribution is superficial. It is suggested that the authors show which log keys have higher weights and what these log keys mean. Besides, it is essential to further discuss why these log keys are more important, to reveal more insights. A sketch of the kind of analysis I mean follows.
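Concretely, something as simple as the following sketch would already help; the key names are hypothetical, and `attention` is assumed to hold one attention distribution per input sequence:

```python
import numpy as np

def top_keys(attention, key_names, k=10):
    """Rank log keys by their attention mass aggregated over all
    input sequences, so the highest-weighted keys can be inspected."""
    mass = np.asarray(attention, dtype=float).sum(axis=0)
    order = np.argsort(mass)[::-1][:k]
    return [(key_names[i], float(mass[i])) for i in order]

attention = np.array([[0.7, 0.2, 0.1],
                      [0.6, 0.1, 0.3]])
keys = ["sshd: failed password", "cron: session opened", "kernel: oom"]
print(top_keys(attention, keys, k=2))
```

Pairing such a ranking with the semantics of each key, and with an explanation of why the model should attend to it during an attack, would make the interpretability claim much stronger.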
- The weakness of the motivation

Cyber-attack detection and preserving users' data privacy motivated the authors to propose the interpretable federated transformer log learning model. However, for federated learning against cyber attacks, one key characteristic is organization-specific model design, given the differences in log features and templates between organizations. In this work, the model is designed generically, not motivated by the threats of a specific scenario. This weak motivation also means that the feature engineering is based only on a public log parser, without considering features specific to a given threat scenario. Features that capture the behaviors of a specific threat would be more meaningful for the machine learning and for understanding the results.

- Lack of novelty in the methodology

The interpretable federated transformer log learning model is essentially a transformer-based federated learning model. Other works have already started to apply the transformer framework in the federated learning setting, e.g., "FedNLP: A Research Platform for Federated Learning in Natural Language Processing" by Lin et al. The model design therefore lacks novelty with respect to log-based attack detection.


Response by Joseph Khoury (Author) (0 words)
---------------------------------------------------------------------------


Comment @A1 by Reviewer C
---------------------------------------------------------------------------
Thank you for your contributions and comments so far. After discussion, the reviewers see promise in the work presented in this paper. However, some issues still need to be addressed to improve the quality of the paper. In particular, the following points should be addressed:

1. The evaluation should be extended so that it also includes prior work on the CTDD dataset, in order to make comparison of this paper with related work easier.
2. The case studies should be extended and made broader, so that negative cases and more hidden attacks are also considered.
3. The role and use of attention weights should be made clearer, as it is not sufficiently clear in the present manuscript.
4. In the evaluation, examples of prediction errors should be added and an analysis of prediction errors should be provided.

In addition, the other comments provided by the reviewers should also be considered.

MajorRevision


Response by Joseph Khoury (Author)
---------------------------------------------------------------------------
***Response Letter***

**Title: Interpretable Federated Transformer Log Learning for Threat Forensics**
**Paper: #102**

*NDSS'22 Summer*

**(September 21, 2021)**

We are grateful to the reviewers for carefully evaluating the submitted manuscript and providing their constructive feedback. We have updated the revised manuscript by addressing the raised concerns and incorporating the given suggestions. Additionally, we respond to each reviewer's comments in this document. For their convenience, we follow each comment with a highlighted response, which justifies our choices and elaborates on further actions taken to address the comments throughout the paper. We also attach a "diff.pdf" file, which highlights all implemented changes to the manuscript with respect to the original submission.
We hope that the reviewers find our answers and actions satisfactory and that they have indeed improved the scientific quality and readability of the presented work. We first summarize the main changes, then the minor changes, and finally our responses to the reviewers' comments.

**Main changes:**

- The evaluation of the paper should be extended so that it also includes and takes into consideration prior work on the CTDD dataset in order to make comparison of this paper with related work easier.

We thank the reviewers for their comments and appreciate their careful review and detailed feedback.

**Action(s):** To address this comment, we revisited the experiments section and reproduced the DeepLog model [@du2017deeplog] (i.e., the only model with publicly available and reproducible code), evaluating it on our `CTDD` dataset in a centralized setting. This evaluation revealed that DeepLog's performance drops drastically when trained and tested on our dataset. Accordingly, we extended our analysis of the variability and complexity of our dataset and compared it to the `HDFS` dataset (i.e., Figure 2 added to the manuscript). To further support our statement, we manually generated a set of parsing rules that tested each log in the `HDFS` dataset against an exact template match observed in the log. We demonstrated that the `HDFS` dataset is composed of only 50 log templates and presented our results in Pareto charts that validate the complexity of our `CTDD` dataset compared to the `HDFS` dataset.

- The case studies should be extended and made broader, so that negative cases and more hidden attacks are also considered.

We thank the reviewers for this comment.

**Action(s):** To address this comment, we developed a hidden attack, more specifically a low-footprint ransomware attack based on the Data Encryption Standard (DES) that targets Linux-based systems (i.e., described in Table II). The executed attack produced a very low number of syslogs, demonstrating its low footprint. Nonetheless, our model was able to detect its activities as malicious with high precision. In addition, one of the previously presented attack cases, namely ech0raix, manifested as a negative case. We dedicated a subsection to these two test cases with regard to hidden attacks and negative cases against our proposed model.

- The role and use of attention weights should be made clearer, as it is not sufficiently clear in the present manuscript.

We appreciate the reviewers' observation.

**Action(s):** To address this observation, we first updated our definitions by replacing the term "interpretability attention-based weights" with "interpretability weights" throughout the manuscript. Subsequently, we revised the manuscript's structure as well as its narrative, figures, charts, and tables to clearly state the role and use of the interpretability weights. More precisely, in Section V, we present the objective of the attention mechanism, identify the top 10 false positive log keys (ground truth) and the top 10 true positive log keys (ground truth), and explain the model's use of attention that leads to each classification.

- In the evaluation, examples of prediction errors should be added and an analysis of prediction errors should be provided.

We thank the reviewers for this suggestion.
**Action(s):** To address this comment, we performed an analysis of the logs identified as true positive cases and false positive cases (i.e., Figure 3 added to the manuscript). We identified the log keys (ground truths) with the top aggregated attention across all of their input sequence keys. We found three log keys (ground truth) common to these two groups and explained how the model arrives at a prediction error or a correct prediction. This was supported by relating the events in the input sequence to the actual running system processes, demonstrating the influence (i.e., attention) each had on the model's prediction.

**Minor changes:**

- In addition, the other comments provided by the reviewers should also be considered.

We thank the reviewers for this comment.

**Action(s):** To address this comment, we resolved all editorial remarks in the revised manuscript, including long sentences, table captions, typos, the COVID-19 generalizations in the introduction, mathematical formulas, and figure simplification.

**First-round reviewer comments:**

1. R1: Trustworthiness of FL clients.

We thank the reviewer for this comment.

**Action(s):** To address the reviewer's comment, we revised the manuscript and included the security mechanisms that ensure the integrity of the data collected from uncompromised systems. These mechanisms include IP-based access restricted to authorized users, firewall rules, and security groups that restricted ingress traffic on most network ports. Moreover, during the period of data collection, all environments were monitored using a SIEM based on Elasticsearch, which would have notified us of any observed anomaly. We also present the major differences between each of the three uncompromised end-user environments with respect to administrator privileges. Moreover, each user assigned to each VM was a known, reputable user. This combined set of actions allows us to ensure the trustworthiness of the FL clients.

2. R2: Granular description of the simulation environment and the collected CTDD dataset.

We thank the reviewer for this comment.

**Action(s):** To address the reviewer's comment, we included in the manuscript information about the process implemented in our pipeline to collect log data, the removal of data with errors, the segregation of logs that are also found in specialized log files (i.e., filtered syslogs, kernel logs, auth logs), and the mechanisms in our pipeline that ensure data integrity. We also present information on the restrictions of admin privileges for each deployed environment, based on specific pre-configured cloud images.