SiftLog: Correlation

[Figure: abstract visualization of distributed log streams merging into a unified pipeline]

SiftLog – Signal-First Log Analysis for Distributed Systems

Modern distributed systems generate logs at a scale that makes manual triage nearly impossible. When something breaks at 2AM, you’re not looking for a needle in a haystack – you’re looking for a needle in a warehouse full of haystacks, each one from a different service, each one timestamped by a different clock. Most log tools make this worse: they show you everything, paginated into oblivion, and leave the correlation work to you.

SiftLog is a different approach. It’s a Go command-line tool built around one idea: surface the signal, suppress the noise. Not a dashboard. Not a query interface. A pipeline that ingests logs from multiple distributed sources, merges them into a single globally-ordered stream, and tells you what broke, in what order, and why it probably spread.

The Problem with Existing Tools

Log aggregation tools like Splunk, Datadog, and Grafana Loki are genuinely powerful. But they’re designed for exploration – you bring a hypothesis and the tool helps you verify it. That’s useful when you have time. At 2AM with an SLA burning, you don’t have a hypothesis yet. You have a pager alert, a screen full of dashboards, and three services all logging errors at once.

The standard workflow looks like this: open Grafana, pull up the service that triggered the alert, scroll through errors, notice a trace ID, copy it, search for it in two other services, realize the timestamps don’t quite line up because the clocks are skewed, give up and start from the beginning. This takes 20 minutes on a good day.

SiftLog compresses that to a single command.

How It Works

At its core, SiftLog is a k-way merge pipeline. It ingests event streams from multiple sources simultaneously – files, Loki, CloudWatch, Elasticsearch – and merges them into a single time-ordered stream using a min-heap. Every event, regardless of origin, is sorted globally by timestamp before it reaches the signal detectors. That sounds simple, but it has a meaningful implication: correlation becomes a first-class operation rather than an afterthought.

Adapters

SiftLog ships with four adapters in v0.1.0:

File and stdin – reads JSON structured logs or plain text, handles eight timestamp formats including Unix epoch (seconds and milliseconds), RFC3339, and several common syslog variants. Feed it a log file, pipe it from another tool, or point it at stdin. It handles 1MB+ lines without flinching.

Loki – queries the query_range API with full pagination, bearer token auth, and a LogQL builder that constructs label selectors from your source config. Useful for pulling recent history from an existing Loki deployment without needing to open a browser.

CloudWatch – uses the FilterLogEvents paginator with the full IAM credential chain (environment, instance role, assumed role). Supports multiple log groups per source so you can pull from a cluster of services in one pass.

Elasticsearch – scroll API with ascending sort, supports both API key and basic auth. Designed for environments where ES is the long-term log store.

All four implement the same Adapter interface: a single Fetch(ctx, since, until) method that returns a channel of events. The merge engine doesn’t know or care which adapter is feeding it.

Clock Skew

Distributed systems have distributed clocks. Even with NTP, services running in different regions, different containers, or on different VM hosts can disagree on the current time by hundreds of milliseconds. When you merge logs from these sources into a single stream, a naive timestamp sort produces interleaving that doesn’t reflect real causality.

SiftLog handles this with a timestamp_offset_ms per source – a signed integer that shifts every timestamp from that source before it enters the merge heap. The analogy is audio delay compensation: if you’re mixing audio from two microphones with different cable lengths, you don’t fix the latency by moving the mics, you add a delay to the faster one. Same principle here. You measure the systematic offset for each source (by comparing known-simultaneous events, or from your infrastructure’s monitoring) and configure it once. The merge sees corrected timestamps from that point forward.

Signal Detection

Once the event stream is globally ordered, three detectors run in series:

Anomaly detection compares error rates across a rolling baseline window (default: 5 minutes). The window is split in half – the older half establishes a baseline rate, the recent half is measured against it. If recent errors are arriving at 10x the baseline rate (configurable), the current event is flagged as an anomaly signal. This approach specifically avoids comparing against a fixed threshold, which fails when a service has a variable error floor during normal operation.
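The split-window comparison can be sketched like this, operating on per-interval error counts. The function name, the count-slice input, and the zero-baseline handling are assumptions for illustration:

```go
package main

import "fmt"

// splitWindowAnomaly splits a rolling window of error counts in half:
// the older half establishes a baseline rate, the recent half is
// measured against it at the given ratio (e.g. 10 for 10x).
func splitWindowAnomaly(errCounts []int, ratio float64) bool {
	n := len(errCounts)
	if n < 2 {
		return false // not enough data to compare halves
	}
	half := n / 2
	var baseline, recent float64
	for _, c := range errCounts[:half] {
		baseline += float64(c)
	}
	for _, c := range errCounts[half:] {
		recent += float64(c)
	}
	baseline /= float64(half)
	recent /= float64(n - half)
	if baseline == 0 {
		return recent > 0 // any errors against a clean baseline are anomalous
	}
	return recent >= ratio*baseline
}

func main() {
	quiet := []int{1, 0, 1, 1, 0, 2}   // stable error floor: no signal
	spike := []int{1, 0, 1, 9, 12, 15} // recent half far above baseline
	fmt.Println(splitWindowAnomaly(quiet, 10)) // false
	fmt.Println(splitWindowAnomaly(spike, 10)) // true
}
```

Note how the `quiet` case has a nonzero error floor yet stays unflagged; a fixed threshold would have to be tuned per service to achieve the same.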

Cascade detection identifies failure propagation across service boundaries. It uses trace ID correlation as the primary signal: if services A and B are both logging errors that share a trace ID within a configurable time window, B is likely failing because of A. When trace IDs aren’t available (not all stacks propagate them), it falls back to temporal correlation: errors from multiple services clustering within the same time window, ordered by first occurrence, treated as a probable cascade. The output shows the full chain – which service failed first, which followed, and how many errors each accumulated.

Silence detection catches the failure mode that error-rate monitors miss entirely: services that go quiet. A service that crashes or gets network-partitioned often stops logging rather than logging errors. SiftLog maintains a per-service baseline of logging activity. After a bootstrap period (to avoid false positives on startup), if a service’s event rate drops below a configurable threshold of its baseline, a silence signal is emitted. This catches the “payment service stopped responding but nobody noticed because there were no errors” class of incident.
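A sketch of the per-service baseline with a bootstrap period. The detector's state, the interval-based input, and the default values are assumptions, not SiftLog's actual implementation:

```go
package main

import "fmt"

// silenceDetector keeps a per-service baseline event rate and flags a
// service whose current rate falls below a fraction of that baseline.
type silenceDetector struct {
	baseline  map[string]float64
	seen      map[string]int
	bootstrap int     // intervals to observe before judging
	threshold float64 // e.g. 0.1 = flag below 10% of baseline
}

func newSilenceDetector() *silenceDetector {
	return &silenceDetector{
		baseline:  map[string]float64{},
		seen:      map[string]int{},
		bootstrap: 3,
		threshold: 0.1,
	}
}

// observe feeds one interval's event rate for a service; it returns
// true when the service has gone quiet relative to its own baseline.
func (d *silenceDetector) observe(service string, rate float64) bool {
	d.seen[service]++
	if d.seen[service] <= d.bootstrap {
		// Still bootstrapping: fold into a running mean, never signal.
		d.baseline[service] += (rate - d.baseline[service]) / float64(d.seen[service])
		return false
	}
	return d.baseline[service] > 0 && rate < d.threshold*d.baseline[service]
}

func main() {
	d := newSilenceDetector()
	for _, r := range []float64{100, 110, 90} { // bootstrap period
		d.observe("payments", r)
	}
	fmt.Println(d.observe("payments", 95)) // false: normal traffic
	fmt.Println(d.observe("payments", 2))  // true: went quiet
}
```

The bootstrap guard is what keeps a freshly-started query from flagging every service as silent in its first few seconds.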

Output Modes

The default output is a colored terminal display designed for human reading during an incident. Events stream in timestamp order with severity coloring, cascade chain annotations, and silence warnings surfaced inline. A --quiet flag suppresses all non-signal output – you see only anomalies, cascades, and silences, which is often all you need when you’re triaging fast. A --verbose flag goes the other direction, showing every event with full field data.

For programmatic use, --output json switches to newline-delimited JSON. Every event is emitted as a JSON object with a boolean is_signal field and a signal_type field when applicable. This makes it straightforward to pipe to jq for further filtering:

siftlog --output json app.log | jq 'select(.is_signal)'
siftlog --output json app.log | jq 'select(.signal_type == "cascade")'

The JSON format uses RFC3339Nano timestamps, omits empty fields, and is designed to be stable enough to pipe into alerting systems or store for post-incident review.

Design Decisions Worth Noting

A few choices in SiftLog’s design are intentional and worth understanding before you try to bend it in a different direction.

No plugin system. Adapters are compiled in. This means adding a new adapter requires a code change, not a config change. The tradeoff is deliberate: a plugin system adds surface area for runtime failures and makes the tool harder to distribute as a single binary. The adapter interface is clean enough that adding Datadog or Google Cloud Logging takes a few hours, not days, and the compiled result is a single static binary with no runtime dependencies.

Per-session baseline only. SiftLog doesn’t persist baseline data between runs. Each invocation starts fresh and bootstraps its own baseline from the event window. This makes the tool stateless and simple to run anywhere without setup. The cost is that very short queries (less than a few minutes of history) may have insufficient data to trigger anomaly detection. For the target use case – querying 15 minutes to an hour of recent history during an incident – this is rarely a problem.

Correlation-first, not query-first. SiftLog doesn’t have a query language. You don’t filter within SiftLog; you control the input window with --since and --until, and you filter the output with jq or shell tools. This is intentional. Building a query language inside the tool would duplicate work that already exists and works well in the source systems. SiftLog’s job is correlation and signal detection, not storage or retrieval.

What’s Coming

The tool was built to handle historical queries first, but the real use case is live incidents. Version 0.2.0 adds streaming mode: run SiftLog with --live and it tails your sources in real time, emitting correlated events as they arrive. The merge engine handles streaming naturally – it’s the same min-heap, with a flush timer added so a quiet source doesn’t stall output from busy ones. File sources gain tail -f semantics. For Loki and CloudWatch, the adapters simply don’t set an until bound, and the polling loop continues until you hit Ctrl-C.

After streaming, the roadmap includes Datadog and Google Cloud Logging adapters (v0.3.0) and a stable v1.0 once the tool has been battle-tested in production incident response.

Getting Started

SiftLog is open source under the MIT license. The repository is at github.com/mmediasoftwarelab/siftlog. With Go installed:

git clone https://github.com/mmediasoftwarelab/siftlog
cd siftlog
go build -o siftlog .
./siftlog testdata/sample.log

For a typical incident query against a Loki deployment, a siftlog.yaml config with your source credentials is all you need. The --quiet flag is the recommended starting point: run it once and see only what siftlog thinks is worth your attention. If the signal looks wrong, add --verbose and trace back through the full event stream.

The tool is opinionated by design. It makes specific claims about what’s signal and what isn’t, and it’s transparent about how it makes those claims. Whether that’s the right tool for a given incident depends on the incident – but for the class of problem it was built for – distributed cascade failures in systems that don’t propagate trace IDs consistently – it’s faster than anything else in the toolbox.
