The visitor analytics in Opterius are built on a simple idea: the data is already in your logs, we just have to read it. No tracking script, no client-side JavaScript, no third-party API calls. Here's exactly how it works under the hood.
## What Nginx already logs
Every time a visitor hits your site, Nginx writes one line to `/home/{user}/{domain}/logs/access.log`. The default "combined" log format produces lines like:

```text
203.0.113.42 - - [09/Apr/2026:14:23:11 +0000] "GET /blog/post-1 HTTP/2.0" 200 12345 "https://google.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X) Chrome/120.0"
```
That single line tells us:
| Field | Value | What we use it for |
|---|---|---|
| IP address | `203.0.113.42` | Country lookup, unique visitor count |
| Timestamp | `09/Apr/2026 14:23:11` | Time-series chart |
| Method | `GET` | Filtering |
| URL | `/blog/post-1` | Top pages |
| Status code | `200` | Status code distribution |
| Response size | `12345` | Bandwidth calculation |
| Referrer | `https://google.com/` | Top referrers |
| User-Agent | `Mozilla/...Chrome/120.0` | Browser, OS, bot detection |
Multiply that by every visitor and you get a complete record of your site's traffic — already on your server, waiting to be parsed.
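Parsing that line comes down to one regular expression. Here's a minimal Python sketch of a combined-format parser — the agent's actual pattern and field handling may differ:

```python
import re
from typing import Optional

# Nginx "combined" format: ip - user [time] "request" status bytes "referrer" "agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Return the parsed fields, or None for lines that don't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # A "-" size means no body was sent; normalize it to 0 bytes
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields
```

Feeding the sample line above through `parse_line` yields `url="/blog/post-1"`, `status="200"`, `size=12345`, and so on; anything that doesn't match returns `None`.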
## The agent worker
When the Opterius agent starts, it spawns a background worker that:
- Discovers all access logs by scanning `/home/*/*/logs/access.log` and `/home/*/*/logs/*/access.log` (the second pattern catches subdomain logs).
- Tracks an offset per log file, same pattern as the live log viewer. The worker remembers how many bytes it has already read so it never reprocesses the same line twice.
- Reads new log content every 60 seconds — only the bytes that were appended since the last read.
- Parses each line with a regex matching the Nginx combined format. Lines that don't match (corrupted or custom format) are silently skipped.
- Aggregates the results into hourly buckets in memory (more on this below).
- Flushes buckets to disk every 5 minutes as JSON files.
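The discovery step is just two glob patterns. A sketch, with the root directory parameterized for testing (the real worker presumably scans `/home` directly):

```python
import glob
import os

def discover_logs(root: str = "/home") -> list:
    """Find every domain and subdomain access log under the given root."""
    patterns = [
        os.path.join(root, "*", "*", "logs", "access.log"),       # domain logs
        os.path.join(root, "*", "*", "logs", "*", "access.log"),  # subdomain logs
    ]
    found = set()
    for pattern in patterns:
        found.update(glob.glob(pattern))
    return sorted(found)
```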
The worker handles log rotation automatically: if the log file shrinks (because logrotate just truncated it), the worker resets its offset to the start of the new file.
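The offset-plus-truncation logic is the crux of the read loop. A sketch, assuming offsets are kept in a plain dict keyed by path:

```python
import os

def read_new_lines(path: str, offsets: dict) -> list:
    """Read only the bytes appended since the last call, resetting on truncation."""
    size = os.path.getsize(path)
    offset = offsets.get(path, 0)
    if size < offset:          # file shrank: logrotate truncated it, start over
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    lines = chunk.split(b"\n")
    tail = lines.pop()         # incomplete trailing line (b"" if chunk ends in \n)
    offsets[path] = offset + len(chunk) - len(tail)  # re-read the tail next cycle
    return [l.decode("utf-8", errors="replace") for l in lines]
```

Rewinding the offset past a partial trailing line means a visit written mid-read is simply picked up on the next 60-second cycle instead of being half-parsed.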
## Bucket aggregation
The agent stores aggregated stats in per-hour buckets rather than individual visit records. Each bucket has:
```json
{
  "bucket_start": 1712664000,
  "visits": 487,
  "unique": 312,
  "bandwidth": 12598234,
  "bot_visits": 89,
  "pages": {"/blog/post-1": 234, "/about": 89, ...},
  "referrers": {"google.com": 145, "twitter.com": 67, ...},
  "countries": {"US": 234, "DE": 89, "RO": 56, ...},
  "browsers": {"Chrome": 312, "Safari": 145, ...},
  "os": {"Windows": 234, "macOS": 145, ...},
  "status": {"2xx": 421, "4xx": 23, "5xx": 2, ...}
}
```
This is roughly 5 KB per bucket per domain per hour. For a busy site with 100,000 visits per hour, the bucket size is about the same — because we're storing counts, not individual events. For 100 domains × 24 hours × 90 days that's roughly 1 GB of total storage, which is negligible.
The bucket files live at `/var/lib/opterius/analytics/{domain}/{YYYY-MM-DD}.json`. Each daily file contains all 24 hourly buckets for that day, keyed by hour.
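Folding a parsed hit into its hourly bucket is just counter increments. A sketch — the field names match the JSON above, but `empty_bucket` and `add_hit` are our own illustrative helpers:

```python
from collections import defaultdict

def empty_bucket(bucket_start: int) -> dict:
    return {
        "bucket_start": bucket_start, "visits": 0, "unique": 0,
        "bandwidth": 0, "bot_visits": 0,
        "pages": defaultdict(int), "referrers": defaultdict(int),
        "countries": defaultdict(int), "browsers": defaultdict(int),
        "os": defaultdict(int), "status": defaultdict(int),
    }

def add_hit(buckets: dict, ts: int, url: str, status: int, size: int,
            country: str, browser: str, is_bot: bool) -> None:
    hour = ts - ts % 3600                    # floor the timestamp to the hour
    b = buckets.setdefault(hour, empty_bucket(hour))
    b["visits"] += 1
    b["bandwidth"] += size
    b["status"]["%dxx" % (status // 100)] += 1
    if is_bot:
        b["bot_visits"] += 1
        return                               # bots stay out of the top lists
    b["pages"][url] += 1
    b["countries"][country] += 1
    b["browsers"][browser] += 1
```

The `unique` count isn't shown here: it needs a per-bucket set of IPs held in memory between flushes (only the final count reaches the JSON).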
## Top-N aggregation
For top URLs, referrers, and other categorical fields, the agent only keeps the top 50 values per bucket. When the in-memory bucket exceeds 500 distinct values, it's trimmed back to 100. This bounds memory usage on high-traffic sites with thousands of unique URLs while still capturing the most popular ones.
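The trim step is a sort-and-slice over one categorical map. A minimal sketch, using the thresholds from the text (`trim_top_n` is our name for it):

```python
def trim_top_n(counts: dict, limit: int = 500, keep: int = 100) -> dict:
    """When a categorical map grows past `limit` keys, keep only the `keep` largest."""
    if len(counts) <= limit:
        return counts
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    return dict(top)
```

Trimming to 100 in memory but writing only the top 50 at flush time leaves headroom, so a URL hovering around rank 50 isn't lost to a mid-hour trim.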
## Bot detection
The User-Agent string contains a clear bot signature for ~95% of bots. The agent matches against a regex of known patterns:
- Search engines: `googlebot`, `bingbot`, `yandex`, `baidu`, `duckduckbot`
- SEO tools: `ahrefsbot`, `semrushbot`, `mj12bot`
- Social: `facebookexternalhit`, `twitterbot`, `linkedinbot`, `whatsapp`
- Programmatic: `curl/`, `wget/`, `python-requests`, `go-http-client`, `java/`
- Generic: anything containing `bot`, `crawler`, `spider`, `scraper`
Bot visits are counted separately so they show up in the Bot Traffic card on the dashboard but don't pollute the top pages / countries / browsers stats. Sophisticated bots that fake a real browser User-Agent slip through this filter — that's expected.
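Compiled into one case-insensitive alternation, the pattern list above looks roughly like this (the agent's exact regex may differ):

```python
import re

BOT_RE = re.compile(
    r"googlebot|bingbot|yandex|baidu|duckduckbot"            # search engines
    r"|ahrefsbot|semrushbot|mj12bot"                         # SEO tools
    r"|facebookexternalhit|twitterbot|linkedinbot|whatsapp"  # social
    r"|curl/|wget/|python-requests|go-http-client|java/"     # programmatic
    r"|bot|crawler|spider|scraper",                          # generic catch-all
    re.IGNORECASE,
)

def is_bot(user_agent: str) -> bool:
    return bool(BOT_RE.search(user_agent))
```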
## Geolocation
The agent uses MaxMind GeoLite2 — a free downloadable database that maps IP ranges to countries. When the database is installed at `/var/lib/opterius/GeoLite2-Country.mmdb`, the agent loads it into memory at startup and does ~1 microsecond lookups per IP.
Without the database, the country field is left empty and visits are counted under "Unknown". The dashboard then shows a hint asking the admin to configure MaxMind in System Settings → Integrations.
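With the `geoip2` Python library, the lookup-with-fallback behaviour can be sketched like this. Note the real agent keeps one reader open for its lifetime rather than opening the database per lookup; `lookup_country` is our illustrative helper:

```python
import os

def lookup_country(ip: str,
                   db_path: str = "/var/lib/opterius/GeoLite2-Country.mmdb") -> str:
    """Return a two-letter country code, or "Unknown" when lookup isn't possible."""
    if not os.path.exists(db_path):
        return "Unknown"                # database not installed
    try:
        import geoip2.database          # optional dependency
    except ImportError:
        return "Unknown"
    with geoip2.database.Reader(db_path) as reader:
        try:
            return reader.country(ip).country.iso_code or "Unknown"
        except Exception:               # private IPs, addresses not in the DB
            return "Unknown"
```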
## Why hourly buckets and not real-time?
Two reasons:
- Storage efficiency: storing every visit individually would require 1000-10000× more disk space for nothing useful. You don't need to know that visit #4,532 came in at exactly 14:23:11.456 — you need to know that you had 487 visits in that hour.
- Query performance: aggregating 90 days of data when the dashboard loads is fast because we only have ~2160 buckets to read instead of millions of rows.
The "current" hour bucket is held in memory and merged with the on-disk buckets at query time, so the dashboard always shows up-to-the-minute data despite the 5-minute disk flush interval.
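One plausible reading of that merge: at query time the in-memory current-hour bucket simply supersedes any stale flushed copy of the same hour, since it is always a superset of what's on disk. Sketched:

```python
def query_buckets(disk_buckets: dict, live_bucket: dict) -> list:
    """On-disk hourly buckets plus the in-memory current hour; the live
    copy supersedes any stale flush of the same hour."""
    merged = dict(disk_buckets)                        # keyed by bucket_start
    merged[live_bucket["bucket_start"]] = live_bucket  # live copy wins
    return [merged[h] for h in sorted(merged)]
```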
## Pruning
A daily background job deletes JSON files older than 90 days. Disk usage stays bounded automatically.
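The prune job is a walk-and-delete over the analytics directory. A sketch, assuming file modification time is a good enough proxy for bucket age:

```python
import os
import time

def prune_analytics(root: str = "/var/lib/opterius/analytics",
                    max_age_days: int = 90) -> int:
    """Delete daily JSON files older than max_age_days; return how many were removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".json"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return removed
```

Parsing the `YYYY-MM-DD` filename instead of trusting mtime would be slightly more robust against files touched after the fact; either works for a bounded-disk guarantee.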