Putting Wallarm Management Console on a Fast Track

How we reduced the visualization lag from minutes to 5 seconds

The main objective of Wallarm WAF is to protect apps and APIs. At the same time, our users, fellow security pros, expect managing the solution to be effective and enjoyable, and the product interfaces to be fast and responsive.

Recently we faced an unpleasant situation: because of the explosive growth in our customer base and in the amount of data that Wallarm cloud has to handle, the delay between a potential attack being identified by the filtering node and the same information appearing in the management console could be as long as ten minutes. Contributing to the issue was the high number of data transformation steps the information had to go through before showing up in the console, which made both insert and read operations quite costly.

We huffed, and we puffed, and we changed the product architecture. In the end, we were able to improve the performance of our system by a factor of 16.

Wallarm Management Console’s back end uses Elasticsearch. This is how we deliver a clean search-based interface and help our users quickly find the subset of attacks they are interested in.

We performed two big optimizations:

Updating Elasticsearch from 1.6 to the newer 6.7 version
System architecture changes

What is Elasticsearch?

“Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.” (from https://www.elastic.co/products/elasticsearch). Elasticsearch is fast and horizontally scalable. It communicates with the world through the HTTP API and it can generate and exchange JSON documents for indexing and storage.

To get the benefits of the newer, optimized version, we upgraded our Elasticsearch system from 1.6 to 6.7.

To do this we had to carefully go over all our indices and migrate each of them to the new version. This was no small feat, as every monthly index could be as large as 45 million records. And we managed to do it without downtime!

Our tooling

From the very beginning, we understood that successfully migrating such a huge amount of data would be very difficult without special tooling. So we decided to develop two dedicated utilities: Migrator and Zanuda (Russian slang for pedant).

Migrator is a massively parallel tool that copies historical client data from the old Elasticsearch cluster to the new one. The nature of the data we migrated let us parallelize the workload efficiently across several servers without any side effects on production. Being able to configure which clients and which date ranges to copy allowed us to perform migrations on demand.
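To make the idea of splitting the work by client and date range more concrete, here is a minimal sketch. It is not Migrator's actual code: the function names, the month-sized slices, and the thread pool are all assumptions for illustration, and copy_slice is a hypothetical placeholder where a real tool would read from the old cluster and bulk-index into the new one.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def month_ranges(start, end):
    """Yield (month_start, next_month_start) pairs covering start..end.

    The first slice begins on the first day of start's month.
    """
    current = start.replace(day=1)
    while current < end:
        # Day 28 plus 4 days always lands in the next month.
        next_month = (current.replace(day=28) + timedelta(days=4)).replace(day=1)
        yield current, next_month
        current = next_month


def plan_tasks(client_ids, start, end):
    """Build one independent task per (client, month) pair."""
    return [(client, lo, hi)
            for client in client_ids
            for lo, hi in month_ranges(start, end)]


def copy_slice(task):
    """Hypothetical placeholder for copying one client's month of data."""
    client_id, lo, hi = task
    return (client_id, lo.isoformat(), hi.isoformat())


def migrate(client_ids, start, end, workers=4):
    """Run all slices in parallel; results come back in task order."""
    tasks = plan_tasks(client_ids, start, end)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_slice, tasks))
```

Because each task touches only one client and one date range, slices can be distributed across workers or servers and retried individually.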

At the end of each migration, we want to be able to say with confidence: “We have migrated with no errors, clusters are exactly the same”. To check that this statement is actually true, we developed our second tool, the Zanuda checker, specifically for this task. Zanuda is a massively parallel tool for simplifying quality assurance and data validation. It performs Wallarm-specific queries against the old and new Elasticsearch clusters via the Attacks API and compares the results.
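The comparison step can be sketched roughly as follows. This is an illustration only, not Zanuda's real logic: it assumes each query returns a list of attack records as dictionaries with an "id" field, which is a hypothetical shape chosen for the example.

```python
def compare_results(old, new):
    """Return a list of human-readable differences between two result sets.

    Records are matched by their "id" field, so ordering differences
    between the two clusters are ignored.
    """
    old_by_id = {record["id"]: record for record in old}
    new_by_id = {record["id"]: record for record in new}
    diffs = []
    for record_id in sorted(old_by_id.keys() | new_by_id.keys()):
        if record_id not in new_by_id:
            diffs.append(f"missing in new: {record_id}")
        elif record_id not in old_by_id:
            diffs.append(f"extra in new: {record_id}")
        elif old_by_id[record_id] != new_by_id[record_id]:
            diffs.append(f"mismatch: {record_id}")
    return diffs
```

An empty list means the two responses agree for that query; anything else points at the exact records to investigate.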

Both tools helped a lot during the migration and allowed us to find and eliminate some difficult bugs. Last but not least, they can be reused in future migrations for almost any API.

Migration process

The idea is quite simple: we have two clusters, new (Elasticsearch 6.7) and old (Elasticsearch 1.6) running at the same time in production. We need to migrate data from old to new without downtime.

The microservice architecture of Wallarm gave us the ability to copy all Attacks API functionality from the old version supporting Elasticsearch 1.6 into a new version supporting 6.7, so that the two work concurrently. It also allowed us to mirror traffic in different proportions, for write-only as well as read-only requests, to the new and old clusters. We started to mirror write-only traffic in September 2019.

It was time for Zanuda to check the correctness of mirroring data in production. Its flexibility and parallelism allowed us to prove the correctness in a very short time.

The entire migration process took about two months to complete. It took that long because of the huge amount of historical data to be migrated. Migrator and Zanuda again came to the rescue.

So in November, we had a proven mirror of Elasticsearch and the Wallarm Attacks API, but all read-only traffic was still going to the old cluster. Again, the microservice architecture allowed us to switch traffic from the old cluster to the new one using a blue/green approach, gradually increasing the share of traffic sent to the new system: 1%, 10%, 100%.
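One common way to implement such a gradual switch is to hash each request into one of 100 buckets and route the lowest buckets to the new system. The sketch below assumes that technique; the article does not say how the split was actually done, and the function names and the choice of routing key are hypothetical.

```python
import hashlib


def bucket(request_key):
    """Map a key deterministically to a bucket in the range 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_cluster(request_key, new_share_percent):
    """Return "new" for roughly new_share_percent of keys, else "old"."""
    return "new" if bucket(request_key) < new_share_percent else "old"
```

With this scheme a key that is routed to the new cluster at 1% stays there at 10% and 100%, so raising the share only ever moves traffic in one direction.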

Finally, the main requirements were met: we migrated 2,500 million hits; we didn’t lose any sensitive client data; we managed the migration without downtime and moved 100% of traffic onto the new cluster.

System architecture changes

Migration to the new Elasticsearch 6.7 helped a lot, but we were still far from the desired 5 seconds. Below we explain how we got down to 5 seconds from when an attack starts to when it shows in the UI.

An attack is a cluster of similar hits grouped together by several parameters: domain, IP, method, path, etc. In the old architecture, an attack was built sequentially in batches as new hits arrived. While this approach is straightforward, it put a heavy load on the Elasticsearch cluster.

By its nature, Elasticsearch is a distributed storage, which means that you can only achieve eventual consistency of the data across the whole cluster. Technically, a hit detected by one Wallarm Node could be written to ANY node in the Elasticsearch cluster. As mentioned above, an attack is a group of hits with the same parameter values. Thus, to calculate an attack, we first have to perform group and aggregate queries over the entire Elasticsearch cluster.

The key challenge of that implementation is that attack data become available and consistent for hit batching only at the end of the Elasticsearch cluster's refresh interval. The refresh interval is a very important Elasticsearch configuration parameter that strongly affects the responsiveness of the whole cluster. This limitation means that the first attack cannot be shown in the UI in less than two times refresh_interval (in seconds). In our case it was 120 seconds.

Here is how we dealt with the problem.

The system architecture optimization we implemented is the use of an in-memory Redis database as a kind of write-through cache. Redis is a NoSQL key-value database focused on achieving maximum performance in atomic operations, which lets our users find the right attack data quickly. Gone are the days of having to calculate attack aggregates using slow, resource-intensive Query DSL requests to Elasticsearch, the database of recorded hits.

First, an attack is created, even before a hit is detected. The attack_id is saved to Redis together with the parameters by which we determine which hits belong to it. For each subsequent new hit, we get the previously created attack_id from Redis. Thus, there is no need for a separate deferred attack indexing job or for a cluster-wide search for hits in Elasticsearch.
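The lookup described above can be sketched as a "get or create" on a key derived from the grouping parameters. This is a simplified illustration under stated assumptions, not Wallarm's implementation: InMemoryStore is a stand-in so the example runs without a server (against real Redis the same step would typically be a SET with the NX option followed by a GET), and the key format, field list, and function names are hypothetical.

```python
import hashlib
import uuid


class InMemoryStore:
    """Tiny stand-in for a Redis client: get plus set-if-absent only."""

    def __init__(self):
        self._data = {}

    def set_if_absent(self, key, value):
        """Store value only if key is new; return True if it was stored."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


# Assumed subset of the grouping parameters named in the article.
GROUPING_FIELDS = ("domain", "ip", "method", "path")


def attack_key(hit):
    """Derive one stable key from the parameters that define an attack."""
    raw = "|".join(str(hit[field]) for field in GROUPING_FIELDS)
    return "attack:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


def attack_id_for(store, hit):
    """Return the attack_id for a hit, creating it on first sight."""
    key = attack_key(hit)
    candidate = uuid.uuid4().hex
    if store.set_if_absent(key, candidate):
        return candidate
    return store.get(key)
```

Every hit with the same parameter values resolves to the same attack_id with a single key lookup, so no aggregation query over the hit store is needed to decide which attack it belongs to.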

That approach allowed us to show attacks in the UI within seconds of the first hits reaching the cloud. Our ops team deployed a Redis Sentinel cluster so that the solution is fault tolerant and reliable.

A picture is worth a thousand words

Observe the resulting drastic improvement! Below are the Grafana screenshots showing the system performance before and after optimization.

Before optimizations:

After optimizations:

Or, looking at the maximum time per client, before:

After:

Conclusion

With these changes we significantly reduced the Elasticsearch load and thereby improved the responsiveness of the user interface, with an attack showing up within seconds of being detected. Our new user-friendly interface has a lag time of no more than 5 seconds.

If you are an active Wallarm user, please let us know what else you would like to work better in Wallarm Console.