Finding Anomalies in Your Server Traffic

By Parminder Singh

Whether it's API abuse, scraping, DDoS attacks, or a threat actor probing your endpoints, detecting anomalies in server traffic can help contain attacks and improve a system's resilience. Building real-time detection pipelines, however, can be challenging for a variety of reasons, including infrastructure complexity and cost. Offline anomaly detection is an important tool: it can stand in when real-time detection is absent, and it can also complement it, supporting use cases such as root cause analysis, forensics and security audits, shadow testing, and training and tuning real-time detectors. In this article, I present a simple offline pipeline for detecting anomalies in server traffic.

The example here uses Apache Tomcat access logs, but the principles can be applied to any text-based log format.

At a high level, the implementation does the following:

  • Parse log lines
  • Generate text embeddings using a lightweight language model
  • Cluster the embeddings using DBSCAN
  • Flag outliers as anomalies

You should be able to run this on a commodity server with minimal dependencies. I've hosted the code on this GitHub Gist.

Parse the Logs

def read_tomcat_logs(log_path):
    with open(log_path, 'r') as f:
        return [line.strip() for line in f if line.strip()]
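The reader above keeps each raw line intact, which is what the embedding step expects. If you later want structured fields (status code, request path, response size), you can parse them out with a regular expression. Here's a sketch assuming Tomcat's default access-log pattern (`%h %l %u %t "%r" %s %b`); adjust the pattern to match your `AccessLogValve` configuration:

```python
import re

# Assumed default Tomcat access-log format: %h %l %u %t "%r" %s %b
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Structured fields are handy for filtering (e.g. only embedding 4xx/5xx lines) before the more expensive embedding step.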

Create Log Embeddings

from sentence_transformers import SentenceTransformer

def embed_logs(log_lines):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    return model.encode(log_lines, show_progress_bar=True)

We use a compact transformer model to convert each log line into a vector representation. This allows us to compare logs based on semantics, not just string matching. I'm using all-MiniLM-L6-v2, which is lightweight and fast, but you can choose any model that fits your needs.
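One detail worth noting: DBSCAN's default metric is Euclidean, while sentence embeddings are usually compared by cosine similarity. L2-normalizing the vectors makes Euclidean distance a monotonic proxy for cosine distance, so the two orderings agree. A minimal numpy sketch (the input vectors here are placeholders, not real embeddings):

```python
import numpy as np

def l2_normalize(embeddings):
    """Scale each row to unit length so Euclidean distance tracks cosine distance."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return embeddings / np.clip(norms, 1e-12, None)
```

`SentenceTransformer.encode` also accepts a `normalize_embeddings=True` argument that achieves the same effect.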

Detect Anomalies with DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that can find clusters of varying shapes and sizes. In simple terms, it groups similar items together and labels points that don't fit any group as noise. The code below uses the DBSCAN implementation from scikit-learn.

from sklearn.cluster import DBSCAN

def detect_anomalies(embeddings, log_lines, eps=0.5, min_samples=20):
    clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(embeddings)
    labels = clustering.labels_
    return [log for log, label in zip(log_lines, labels) if label == -1]

A label of -1 indicates an outlier, which we treat as an anomaly. You can tune the sensitivity of the detection with two parameters:

  • eps: the maximum distance between two samples for one to be considered as in the neighborhood of the other. In simple terms, it defines how close points need to be to be considered part of the same cluster.
  • min_samples: the number of samples in a neighborhood required for a point to be considered a core point of a cluster.
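A common heuristic for picking eps is the k-distance method: compute each point's distance to its min_samples-th nearest neighbor, sort the values, and set eps near the "elbow" where they start rising sharply. A small numpy sketch (the toy points below are placeholders standing in for embeddings):

```python
import numpy as np

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    points = np.asarray(points, dtype=np.float64)
    # Full pairwise Euclidean distance matrix (fine for offline batch sizes).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the distance to itself (0),
    # so column k is the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)
```

Plotting the result and choosing eps just below the bend means points whose k-distance sits above the bend, such as rare or malformed log lines, end up labeled as noise.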

Combined Code

if __name__ == "__main__":
    log_path = "localhost_access_log.txt"
    logs = read_tomcat_logs(log_path)
    embeddings = embed_logs(logs)
    anomalies = detect_anomalies(embeddings, logs)

    for a in anomalies:
        print("[ANOMALY]", a)

Here's a screenshot of the output from my local test. I generated log lines with a simple script that simulates normal traffic plus a few anomalies; the full code, including the detection pipeline, is available in this GitHub Gist.

Anomaly Detection Output
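If you want to try the pipeline without real traffic, a generator along these lines works; the paths and the injected anomaly below are made up for illustration and are not the actual Gist code:

```python
import random

NORMAL_PATHS = ["/index.html", "/api/users", "/api/orders", "/static/app.js"]

def make_log_line(path, status=200, size=512):
    """Emit one line in the common Tomcat access-log shape."""
    host = f"192.168.1.{random.randint(1, 50)}"
    return f'{host} - - [10/Oct/2023:13:55:36 +0000] "GET {path} HTTP/1.1" {status} {size}'

def generate_logs(n_normal=500, n_anomalies=5, seed=42):
    """Mostly routine traffic with a handful of suspicious-looking requests."""
    random.seed(seed)
    lines = [make_log_line(random.choice(NORMAL_PATHS)) for _ in range(n_normal)]
    lines += [make_log_line("/admin/../../etc/passwd", status=403, size=0)
              for _ in range(n_anomalies)]
    random.shuffle(lines)
    return lines
```

Writing the output of `generate_logs()` to `localhost_access_log.txt` gives the combined code above something to chew on.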

It won't catch everything. But it's fast to implement, easy to run nightly/hourly, and surprisingly effective at surfacing weird traffic. For high-volume systems, this kind of batch detection can be the foundation for training smarter real-time detectors. For simple use cases, it can be a great way to get started with anomaly detection without the complexity of real-time systems.

In one of my previous articles, I discussed setting up OpenSearch for log analysis, including a section on configuring anomaly detection. You can check it out here.

What measures do you take to detect anomalies in your server traffic? Do you have a real-time detection pipeline or do you rely on offline detection?