To my knowledge, the first reference to the idea and principles of signatures for detecting network attacks dates back to 1987, in a scientific paper by Dorothy E. Denning of the Stanford Research Institute (SRI) (here’s the link to the paper). According to the publication’s records, the paper was sent to the editors in 1985 but was published almost two years later (“Manuscript was received December 20, 1985; revised August 1, 1986. This work was supported by the Space and Naval Warfare Command (SPAWAR) under Contract 83F830100 and by the National Science Foundation under Grant MCS-8313650.”). A fun fact: the author of this article is younger than that paper.
The fundamental work mentioned above described the principles of intrusion detection systems, which have remained in use for the 31 years since. Without going too deeply into technical details, I would like to note that the paper discussed various models for profiling abnormal behavior of information systems, including statistical algorithms and Markov chains. The study defined a signature as a description of the normal activity of a given subject with respect to a given object.
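To make the idea of a "description of normal activity" more concrete, here is a minimal sketch of one simple statistical model of this kind: a mean and standard deviation check. The function names, the login counts, and the threshold of three standard deviations are illustrative assumptions, not values taken from the paper.

```python
import statistics

def build_profile(observations):
    """Summarize normal activity as (mean, standard deviation)."""
    return statistics.mean(observations), statistics.stdev(observations)

def is_abnormal(value, profile, d=3.0):
    """Flag a value lying more than d standard deviations from the mean.

    The default d=3.0 is an assumed threshold chosen for illustration.
    """
    mean, stdev = profile
    return abs(value - mean) > d * stdev

# Hypothetical data: logins per hour by one subject (a user)
# against one object (a server), observed during normal operation.
normal_logins = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
profile = build_profile(normal_logins)
```

With this profile, `is_abnormal(6, profile)` stays within the norm, while a burst such as `is_abnormal(40, profile)` is flagged. A real system would keep one such profile per subject and object pair and update it over time.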
In the 20 years since the publication of that very first article, the concept of signatures has changed somewhat, driven more by marketing than by technical advancement. Originally based on deviations from the norm and on more general detection rules, signatures may now also mean file hash sums or regular expressions that describe one or more signs of an attack, based on the presence of certain data identified analytically. This is how the following kinds of signatures appeared: regular expressions for viruses and malware (which, by the way, are often simply markers for binary data of known components used in viruses), hash-based signatures containing hash sums of known virus modifications, and regular expressions for finding attacks in traffic, targeting both low-level network services and web applications.
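The two modern kinds of signatures described above can be sketched in a few lines. This is only an illustration under stated assumptions: the "known bad" sample, the signature names, and the two regular expressions are hypothetical and far simpler than anything a vendor would ship.

```python
import hashlib
import re

# Hypothetical signature database: SHA-256 sums of known malicious files.
KNOWN_BAD_SHA256 = {hashlib.sha256(b"malicious sample v1").hexdigest()}

# Hypothetical traffic signatures: a name mapped to a regular expression.
TRAFFIC_PATTERNS = {
    "sql_injection": re.compile(r"(?i)\bunion\b.+\bselect\b"),
    "path_traversal": re.compile(r"\.\./"),
}

def match_file(data):
    """Hash-based signature: exact match against known file hashes."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256

def match_request(text):
    """Regex-based signatures: names of all patterns found in the text."""
    return [name for name, pattern in TRAFFIC_PATTERNS.items()
            if pattern.search(text)]
```

Note that `match_file(b"malicious sample v1")` matches, while a one-byte variation such as `b"malicious sample v2"` does not. A hash signature covers only the exact modification it was computed from.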
An entire industry has emerged to collect and sell signatures for various types of network attacks. At the same time, this provoked the other side of the barricade to keep developing methods of circumventing signatures, including repackaging, obfuscation, polymorphism, and others. As a result of this arms race, cybersecurity vendors began to resort to a different approach, supplementing static file or network request signatures with behavioral mechanisms.
The general approach to behavioral analysis can be described as a sandbox, where we safely execute an object we want to check, analyze its behavior, and decide whether to block it or not. Sandboxes can be either completely isolated from protected systems (e.g., a sandbox installed on a separate server that checks all files in incoming emails) or integrated into protected systems (i.e., applications or operating systems). Most modern applications, such as browsers, also have integrated sandboxes that prevent the operating system from being compromised if vulnerabilities in the application are exploited.
The most reliable option seems to be a fully isolated sandbox, but it has several limitations. In this scenario, the security system has to emulate or virtualize every variety of the protected systems, which is sometimes impossible. Web applications are one such case, as their variety is unlimited. Attackers have also found ways to bypass emulation and virtualization by checking for indirect signs of them: from language settings and device names to how the runtime environment responds to the attacker’s various actions (for example, window opening time, file availability, etc.). Moreover, code execution time is always limited during emulation, while a real system can be compromised a few days after the malicious code was started.
Machine learning and neural networks marked the third fundamentally new milestone in the development of protection tools and systems. Without going into technical details, this approach has the following advantages:
- Adjustment to the object of protection (training)
- Ability to detect malicious activities “similar”, but not equivalent, to those observed in the past
- Ease of adding new examples of malicious activity for detection
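The second and third advantages can be illustrated without any neural network at all. The sketch below is a deliberately simplified stand-in: a nearest-example detector that compares character trigrams using Jaccard similarity. The class name, the method names, and the 0.5 threshold are assumptions made for illustration. A production system would use a trained model rather than this comparison.

```python
def trigrams(text):
    """Return the set of lowercase character trigrams in the text."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class SimilarityDetector:
    """Flags payloads that resemble previously seen malicious examples."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off, tuned per system
        self.examples = []

    def add_example(self, payload):
        """Adding a new malicious example is a single append."""
        self.examples.append(trigrams(payload))

    def score(self, payload):
        """Highest similarity to any known malicious example."""
        grams = trigrams(payload)
        return max((jaccard(grams, ex) for ex in self.examples), default=0.0)

    def is_malicious(self, payload):
        return self.score(payload) >= self.threshold

detector = SimilarityDetector()
detector.add_example("<script>alert(1)</script>")
```

Here `detector.is_malicious("<script>alert(2)</script>")` is true even though that exact payload was never seen, whereas an exact-match signature would miss it. An unrelated input such as `"hello world"` scores zero.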
Along with the advantages, one should immediately note a number of technical problems that held back the widespread use of neural networks for years. Those are:
- Long training time, which most often depends on the complexity of the model used and the volume of the training set
- Non-transparent decision-making: when such a system is triggered, the user usually cannot find out why it made this decision rather than a different one
- Lack of operational readiness due to the immaturity of the technology
However, thanks to the rapid growth of computing power, investment in development, and the million-dollar machine learning community, researchers have resolved almost all of these technical limitations and made it possible to use neural networks widely over the past 5–8 years. When talking about computing resources, it is especially worth noting the significant progress made by Intel, NVIDIA, Google, and other corporations. They have not only achieved a qualitative technological breakthrough by increasing the performance of computing devices, but have also provided manufacturers with ready-made libraries and frameworks for applied programming with machine learning technologies.
The main advantage of the machine learning approach is the ability to automatically adjust a protection system to the environment of the system it protects. Here is a practical case from web application security: suppose you need to block attacks on a resource hosting technical content, like Stack Overflow or Habrahabr. For such a product (a protected system), it is normal to accept attack descriptions in plain text. With signature analysis, you would have to manually disable the signature that blocks users from submitting articles about attacks with examples, just to eliminate the false positives. Machine learning, by contrast, enables the system to adapt to this behavior automatically, without disabling signatures for a specific attack type. Note that protection systems previously handled such situations by “disinfecting” data based on various encodings, an approach that stopped paying off when new applications using APIs appeared.
In conclusion, it should be noted once again that almost all basic approaches to organizing cyber defenses were proposed back in 1985. Since then, almost nothing has changed except the technical aspects of their implementation. The use of machine learning, and neural networks in particular, has recently become widespread thanks to the development of the industry as a whole: from the performance of computing devices to libraries, frameworks, and the expert community. Emerging cloud technologies have also played an important role in solving the practical problems of applying machine learning. They made it possible to redistribute resources more efficiently, providing performance and scalability that could not be achieved within a limited on-premises environment. As a result, although security tools are still installed on a local system with limited resources, that system’s metrics can now be processed at high performance in the vendor’s cloud, and the security system can be trained to respond quickly and accurately. This hybrid approach to security system architecture (i.e., agents, firewalls, antiviruses, etc. are installed locally, while cloud services analyze their metrics and develop adapted attack detection methods using machine learning) is currently the most popular among vendors.