A key element of any security solution, whether it's a WAF, NGWAF, RASP, or even a SIEM or a classic IDS, is the ability to correctly detect whether an incoming API request is malicious.

The traditional way to do this is with signatures and regular expressions (regex). Some signature sets are open source, such as the Core Rule Set; others are commercial. Although widespread, signature-based classification of inputs is not very accurate. The problems arise not only because signatures are not updated often enough, but also because of the logic issues and loops that can appear when regular expressions are combined across thousands of individual rules.
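To make the accuracy problem concrete, here is a minimal sketch of signature-based detection. The three patterns are hypothetical and written only for this illustration; they are not taken from the Core Rule Set or from any commercial signature set.

```python
import re

# Hypothetical signatures, invented for this example only.
SIGNATURES = [
    re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),
    re.compile(r"\bor\b\s+\d+\s*=\s*\d+", re.IGNORECASE),
    re.compile(r"'\s*--"),
]


def matches_signature(value: str) -> bool:
    """Return True if any signature matches the input value."""
    return any(sig.search(value) for sig in SIGNATURES)


if __name__ == "__main__":
    samples = [
        "1' OR 1=1 --",                          # attack, detected
        "student union members select a chair",  # benign, flagged anyway
        "1' OR/**/1=1#",                         # attack, missed (comment replaces the space)
    ]
    for sample in samples:
        print(repr(sample), "->", matches_signature(sample))
```

The second sample is a false positive and the third is a false negative. Tightening or adding rules to cover such cases is what makes large rule sets hard to maintain.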

Open Source ML-based false-positive detection project: WallNet

An alternative approach to detection is to apply a predictive model based on a machine learning neural network. Wallarm has created and open-sourced an implementation that does just that.

The code can be found here: https://github.com/wallarm/WallNet

The current implementation shows how to detect SQL injections (SQLi), but similar AI-powered approaches can be applied to XSS, XXE, path traversal, and other threats.

For this model, the most interesting insight is representing an injection as a time series. From there, variations of the models commonly used in time series classification can be applied.
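As a rough illustration of the time series view, the sketch below turns an input string into an ordered, fixed-length sequence of integer ids, which is the kind of input a sequence model consumes. The character-level vocabulary, the `PAD_ID` and `UNK_ID` values, and the function names are assumptions made for this example; they are not WallNet's actual preprocessing.

```python
PAD_ID = 0  # assumed id used to pad short sequences
UNK_ID = 1  # assumed id for symbols not seen when building the vocabulary


def build_vocab(samples):
    """Assign an integer id to every distinct character in the samples."""
    vocab = {}
    for sample in samples:
        for ch in sample:
            if ch not in vocab:
                vocab[ch] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(text, vocab, max_len):
    """Map text to a sequence of ids, truncated or padded to max_len."""
    ids = [vocab.get(ch, UNK_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    vocab = build_vocab(["1' OR 1=1 --", "id=42"])
    print(encode("1' OR 2=2", vocab, max_len=16))
```

Each position in the resulting list plays the role of a time step, so the order of symbols in the request is preserved for the model.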

The neural network architecture is shown below. The key processing layer is a Bidirectional Long Short-Term Memory (BiLSTM) layer, a variant of a Recurrent Neural Network (RNN). In this implementation, the BiLSTM network contains two sub-networks, one for the forward and one for the backward sequence context. After this initial analysis, the data are processed by max pooling and average pooling layers.
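The role of the pooling layers can be sketched without a deep learning framework. The example below assumes the BiLSTM has already produced one forward and one backward hidden vector per time step (the numbers are made up), concatenates them, and then collapses the variable-length sequence into fixed-size vectors. How WallNet combines these vectors with its attention layer is not shown here.

```python
def concat_directions(forward, backward):
    """Join forward and backward hidden states at each time step."""
    return [f + b for f, b in zip(forward, backward)]


def max_pool(states):
    """Take the maximum of each feature across all time steps."""
    return [max(column) for column in zip(*states)]


def avg_pool(states):
    """Take the mean of each feature across all time steps."""
    return [sum(column) / len(states) for column in zip(*states)]


if __name__ == "__main__":
    # Made-up hidden states: 3 time steps, 2 forward and 1 backward feature.
    forward = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
    backward = [[0.0], [3.0], [1.5]]
    states = concat_directions(forward, backward)
    print(max_pool(states))  # one value per feature
    print(avg_pool(states))  # one value per feature
```

Whatever the length of the input sequence, both pooled vectors have as many entries as there are features, which lets later layers work with a fixed-size input.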

Figure 1: WallNet Layers: Embedding Tokens, BLSTM Layer * 2, Max pooling Layer, Average pooling Layer, Attention Layer, and Output Layer

Results of the Model

We applied our model to the Wallarm ML Hackathon dataset. The dataset contains 275329 lines in the training set and 71962 lines in the test set; the training set contains 122606 lines with SQL injections. We first convert some bytes to tokens using the rules in Table 1, followed by some additional rules. After that, we use SentencePiece as the tokenizer.

Table 1: Rules for converting bytes to tokens

We chose the Receiver Operating Characteristic Area Under Curve (ROC AUC) as the evaluation metric. On the test set, the model reached a ROC AUC score of 0.98931 after 8 epochs of training. Our additional tests on private real-world data indicate that this approach is useful in practice. Moreover, the learning curves suggest that our results on the Wallarm ML Hackathon dataset can be improved further by applying hyperparameter optimization techniques, including during pre- and post-processing.
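For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions, not the evaluation code or data used for the result above.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC by comparing every positive/negative score pair.

    labels: iterable of 0 (benign) or 1 (malicious)
    scores: iterable of model scores, higher meaning more likely malicious
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; for a dataset of the size described here, a rank-based implementation from an established library would be the practical choice.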

Malicious Intent Detection Challenge Hackathon

To promote the use of machine learning for attack detection, Wallarm is sponsoring a Kaggle competition/hackathon. In this competition, Kagglers will develop machine learning models that identify injections among neutral input vectors. The hope is that the community will develop and use more scalable methods to accurately detect attacks and injections, helping the good guys stay ahead of the hackers.

Prizes and Timeline

Submissions will be accepted from November 16, 2018 through December 12.

The following prizes will be awarded based on the results of the leaderboard:

  1. First prize: $1,000
  2. Second prize: $500
  3. Third prize: $500

In addition to the prizes based on the leaderboard, Wallarm’s own Ivan Novikov and Stepan Ilyin will review the top five submissions for their applicability to deployment in production and award an additional prize of one Ethereum coin.

Join the challenge today on Kaggle.