The previous two blog articles in this series describe how to set up the Wallarm Ingress controller and configure it to allow traffic from trusted IP addresses and block traffic from suspicious or malicious ones.  This is the core functionality of Wallarm’s Ingress controller, but it isn’t enough for production environments.

In a production environment, it is essential that security products provide high availability, and that the operator has a high degree of visibility into their operations.  This enables easy investigation of issues and performance monitoring during normal operations.

By making a few tweaks to the configuration of your Wallarm Ingress controller, you can improve the availability of the assets that it protects and gain access to monitoring metrics.

Configuring Wallarm to Ensure High Availability

For mission-critical systems, running a single Wallarm Ingress controller pod may not be sufficient.  A sudden surge in request volume could overwhelm the controller’s ability to manage inbound traffic.  If the pod crashes or experiences high latency, then user experience suffers.

Many of these issues can be addressed by taking advantage of the configuration settings provided in the values.yml file.  Useful modifications include:

  • Increase the controller replica count: Run multiple controller pod instances so that a single pod failure does not interrupt traffic filtering.
controller:
    replicaCount: 2
  • Implement pod anti-affinity: Distribute controller pods over multiple nodes using Kubernetes’ pod anti-affinity feature to decrease the impact of a node failure.
controller:
    affinity:
        podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - nginx-ingress
              topologyKey: "kubernetes.io/hostname"
  • Activate controller autoscaling: Enable Kubernetes’ horizontal pod autoscaling feature to allow pods to scale to meet demand.
controller:
    autoscaling:
        enabled: true
        minReplicas: 1
        maxReplicas: 11
        targetCPUUtilizationPercentage: 50
        targetMemoryUtilizationPercentage: 50
  • Increase the Tarantool replica count: Run multiple instances of the Wallarm postanalytics service, which is based on the Tarantool database.
controller:
    wallarm:
        tarantool:
            replicaCount: 2
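To get a feel for how the autoscaling settings above behave, the Horizontal Pod Autoscaler’s core scaling rule can be sketched in Python.  This is an illustrative model of the standard Kubernetes HPA formula, not Wallarm-specific code, and the utilization figures are made up.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=11):
    """Sketch of the Kubernetes HPA scaling rule:
    desired = ceil(currentReplicas * currentUtilization / targetUtilization),
    clamped to the configured minReplicas/maxReplicas bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(desired, max_replicas))

# With targetCPUUtilizationPercentage: 50, two pods averaging 80% CPU
# would be scaled out to four replicas.
print(desired_replicas(2, 80, 50))  # 4
```

Under sustained load the replica count climbs toward maxReplicas; once utilization falls back below the target, the controller pods are scaled back down toward minReplicas.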

By making these modifications to the configuration of your Wallarm Ingress controller, you can increase its resiliency and scalability.  This helps to ensure that implementing strong security on your Kubernetes cluster does not impact application availability.

Monitoring Performance via Metrics

The Wallarm Ingress controller is designed to protect a Kubernetes cluster with minimal impact on network performance and latency.  The metrics service provides valuable visibility into the controller and how well it is doing its job.

Whether during normal operations or while diagnosing a network or performance issue, it is helpful to know if the controller is successfully blocking inbound attacks and how efficiently it is doing so.  Configuring the metrics service and querying it when needed provides data that can help to answer these questions.

The operation of the metrics service is controlled by values.yml.  To enable metrics on your Wallarm Ingress controller, modify this file as shown below.

wallarm:
    metrics:
      enabled: true
      service:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/path: /wallarm-metrics
          prometheus.io/port: "18080"

Making the above modifications to the values.yml file enables the /wallarm-metrics endpoint on the Wallarm Ingress controller.  By querying this newly exposed endpoint, you can access metrics about the Ingress controller in Prometheus format as shown below.

# HELP nginx_wallarm_requests requests count
# TYPE nginx_wallarm_requests gauge
nginx_wallarm_requests 5
# HELP nginx_wallarm_attacks attack requests count
# TYPE nginx_wallarm_attacks gauge
nginx_wallarm_attacks 5
# HELP nginx_wallarm_blocked blocked requests count
# TYPE nginx_wallarm_blocked gauge
nginx_wallarm_blocked 5
# HELP nginx_wallarm_abnormal abnormal requests count
# TYPE nginx_wallarm_abnormal gauge
nginx_wallarm_abnormal 5
# HELP nginx_wallarm_tnt_errors tarantool write errors count
# TYPE nginx_wallarm_tnt_errors gauge
nginx_wallarm_tnt_errors 0
# HELP nginx_wallarm_api_errors API write errors count
# TYPE nginx_wallarm_api_errors gauge
nginx_wallarm_api_errors 0
# HELP nginx_wallarm_requests_lost lost requests count
# TYPE nginx_wallarm_requests_lost gauge
nginx_wallarm_requests_lost 0
# HELP nginx_wallarm_overlimits_time overlimits_time count
# TYPE nginx_wallarm_overlimits_time gauge
nginx_wallarm_overlimits_time 0
# HELP nginx_wallarm_segfaults segmentation faults count
# TYPE nginx_wallarm_segfaults gauge
nginx_wallarm_segfaults 0
# HELP nginx_wallarm_memfaults vmem limit reached events count
# TYPE nginx_wallarm_memfaults gauge
nginx_wallarm_memfaults 0
# HELP nginx_wallarm_softmemfaults request memory limit reached events count
# TYPE nginx_wallarm_softmemfaults gauge
nginx_wallarm_softmemfaults 0
# HELP nginx_wallarm_proton_errors libproton non-memory related libproton faults events count
# TYPE nginx_wallarm_proton_errors gauge
nginx_wallarm_proton_errors 0
# HELP nginx_wallarm_time_detect_seconds time spent for detection
# TYPE nginx_wallarm_time_detect_seconds gauge
nginx_wallarm_time_detect_seconds 0
# HELP nginx_wallarm_db_id proton.db file id
# TYPE nginx_wallarm_db_id gauge
nginx_wallarm_db_id 9
# HELP nginx_wallarm_lom_id LOM file id
# TYPE nginx_wallarm_lom_id gauge
nginx_wallarm_lom_id 38
# HELP nginx_wallarm_proton_instances proton instances count
# TYPE nginx_wallarm_proton_instances gauge
nginx_wallarm_proton_instances{status="success"} 4
nginx_wallarm_proton_instances{status="fallback"} 0
nginx_wallarm_proton_instances{status="failed"} 0
# HELP nginx_wallarm_stalled_worker_time_seconds time a worker stalled in libproton
# TYPE nginx_wallarm_stalled_worker_time_seconds gauge

The first set of metrics describes how well the controller is processing requests.  The metrics service tracks the total requests, attacks, blocked requests, and abnormal traffic:

  • requests: The number of requests that have been processed by the filter node.
  • attacks: The number of recorded attacks.
  • blocked: The number of blocked requests.
  • abnormal: The number of requests the application deems abnormal.
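As a quick illustration of how these counters can be consumed, the Python sketch below parses the flat (unlabeled) gauges from the sample output above and computes the fraction of requests that were blocked.  The parsing helper is our own, not part of Wallarm.

```python
def parse_flat_metrics(text):
    """Parse unlabeled Prometheus text-format lines such as
    'nginx_wallarm_requests 5' into a {name: value} dict,
    skipping comment lines and labeled series."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, value = line.split()
        metrics[name] = float(value)
    return metrics

# Values taken from the sample /wallarm-metrics output above.
sample = """\
nginx_wallarm_requests 5
nginx_wallarm_attacks 5
nginx_wallarm_blocked 5
nginx_wallarm_abnormal 5
"""

m = parse_flat_metrics(sample)
print(m["nginx_wallarm_blocked"] / m["nginx_wallarm_requests"])  # 1.0
```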

If a local post-analytics module is being used, the following metrics apply:

  • tnt_errors: Requests that were not analyzed by the post-analytics module.
  • api_errors: Requests that were not submitted to the API for further analysis.
  • requests_lost: Requests that did not reach the post-analytics module and API.
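A simple health check over these counters might look like the sketch below.  Treating any nonzero error or lost-request count as unhealthy is our own suggested threshold, not official Wallarm guidance.

```python
def postanalytics_healthy(metrics):
    """Report healthy only when no requests failed to reach the local
    post-analytics module or the Wallarm API."""
    errors = (metrics.get("nginx_wallarm_tnt_errors", 0)
              + metrics.get("nginx_wallarm_api_errors", 0)
              + metrics.get("nginx_wallarm_requests_lost", 0))
    return errors == 0

# With the all-zero values from the sample output above:
print(postanalytics_healthy({
    "nginx_wallarm_tnt_errors": 0,
    "nginx_wallarm_api_errors": 0,
    "nginx_wallarm_requests_lost": 0,
}))  # True
```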

Errors in the operation of the Wallarm Ingress controller can affect the availability and security of the Kubernetes cluster.  The metrics track segmentation and memory faults:

  • segfaults: The number of issues that led to the emergency termination of the worker process.
  • memfaults: The number of times the virtual memory limit was reached.
  • softmemfaults: The number of times the per-request memory limit was reached.

The time_detect metric reports how long the service has spent analyzing requests:

  • time_detect: The total time spent on request analysis.

The Wallarm Ingress controller uses proton.db and LOM, and the metrics service tracks a number of pieces of information about these two services:

  • db_id: proton.db version.
  • lom_id: LOM version.
  • proton_instances: Information about proton.db + LOM pairs:
  1. total: The number of proton.db + LOM pairs.
  2. success: The number of successfully uploaded proton.db + LOM pairs.
  3. fallback: The number of proton.db + LOM pairs loaded from the last saved files.
  4. failed: The number of proton.db + LOM pairs that were not initialized and run in the “do not analyze” mode.
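Since the total is not exported as its own sample, it can be recovered by summing the labeled nginx_wallarm_proton_instances series.  The small parser below is illustrative only.

```python
import re

def proton_instance_counts(text):
    """Collect nginx_wallarm_proton_instances{status="..."} samples
    into a {status: count} dict."""
    pattern = re.compile(
        r'nginx_wallarm_proton_instances\{status="(\w+)"\}\s+(\d+)')
    return {status: int(value) for status, value in pattern.findall(text)}

# Labeled series from the sample output above.
sample = """\
nginx_wallarm_proton_instances{status="success"} 4
nginx_wallarm_proton_instances{status="fallback"} 0
nginx_wallarm_proton_instances{status="failed"} 0
"""

counts = proton_instance_counts(sample)
print(sum(counts.values()))  # total proton.db + LOM pairs: 4
```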

Starting with filter node version 2.12, the metrics also provide information about stalled workers.  This can help to diagnose performance issues with the Wallarm Ingress controller:

  • stalled_workers_count: The number of workers that exceeded the time limit for request processing.
  • stalled_workers: The list of the workers that exceeded the time limit for request processing and the amount of time spent on request processing.

The metrics included with the Wallarm Ingress controller provide valuable data about its current health.  It is recommended that you query them regularly to monitor system health and to inform investigations into performance issues when needed.

High-Performance Kubernetes Security with Wallarm

This series of three articles provides a walkthrough for configuring and deploying Wallarm’s Ingress controller to protect a Kubernetes cluster.  After testing these steps in a development environment and confirming that the controller works as expected, duplicate the configuration in production to take full advantage of its security benefits.