The visitor analytics in Opterius are built on a simple idea: the data is already in your logs, we just have to read it. No tracking script, no client-side JavaScript, no third-party API calls. Here's exactly how it works under the hood.
## What Nginx already logs
Every time a visitor hits your site, Nginx writes one line to `/home/{user}/{domain}/logs/access.log`. The default "combined" log format produces lines like:

```text
203.0.113.42 - - [09/Apr/2026:14:23:11 +0000] "GET /blog/post-1 HTTP/2.0" 200 12345 "https://google.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X) Chrome/120.0"
```
That single line tells us:
| Field | Value | What we use it for |
|---|---|---|
| IP address | `203.0.113.42` | Country lookup, unique visitor count |
| Timestamp | `09/Apr/2026 14:23:11` | Time-series chart |
| Method | `GET` | Filtering |
| URL | `/blog/post-1` | Top pages |
| Status code | `200` | Status code distribution |
| Response size | `12345` | Bandwidth calculation |
| Referrer | `https://google.com/` | Top referrers |
| User-Agent | `Mozilla/...Chrome/120.0` | Browser, OS, bot detection |
Multiply that by every visitor and you get a complete record of your site's traffic — already on your server, waiting to be parsed.
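Parsing that line comes down to one regular expression. Here's a minimal Python sketch of a combined-format parser — the agent's actual pattern and field handling may differ:

```python
import re
from typing import Optional

# Nginx "combined" format: ip - user [time] "request" status bytes "referrer" "agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Return the parsed fields, or None for lines that don't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # A "-" size means no body was sent; normalize it to 0 bytes
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields
```

Feeding the sample line above through `parse_line` yields `url="/blog/post-1"`, `status="200"`, `size=12345`, and so on; anything that doesn't match returns `None`.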
## The agent worker
When the Opterius agent starts, it spawns a background worker that:
- Discovers all access logs by scanning `/home/*/*/logs/access.log` and `/home/*/*/logs/*/access.log` (the second pattern catches subdomain logs).
- Tracks an offset per log file, same pattern as the live log viewer. The worker remembers how many bytes it has already read so it never reprocesses the same line twice.
- Reads new log content every 60 seconds — only the bytes that were appended since the last read.
- Parses each line with a regex matching the Nginx combined format. Lines that don't match (corrupted or custom format) are silently skipped.
- Aggregates the results into hourly buckets in memory (more on this below).
- Flushes buckets to disk every 5 minutes as JSON files.
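The discovery step is just two glob patterns. A sketch, with the root directory parameterized for testing (the real worker presumably scans `/home` directly):

```python
import glob
import os

def discover_logs(root: str = "/home") -> list:
    """Find every domain and subdomain access log under the given root."""
    patterns = [
        os.path.join(root, "*", "*", "logs", "access.log"),       # domain logs
        os.path.join(root, "*", "*", "logs", "*", "access.log"),  # subdomain logs
    ]
    found = set()
    for pattern in patterns:
        found.update(glob.glob(pattern))
    return sorted(found)
```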
The worker handles log rotation automatically: if the log file shrinks (because logrotate just truncated it), the worker resets its offset to the start of the new file.
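The offset-plus-truncation logic is the crux of the read loop. A sketch, assuming offsets are kept in a plain dict keyed by path:

```python
import os

def read_new_lines(path: str, offsets: dict) -> list:
    """Read only the bytes appended since the last call, resetting on truncation."""
    size = os.path.getsize(path)
    offset = offsets.get(path, 0)
    if size < offset:          # file shrank: logrotate truncated it, start over
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    lines = chunk.split(b"\n")
    tail = lines.pop()         # incomplete trailing line (b"" if chunk ends in \n)
    offsets[path] = offset + len(chunk) - len(tail)  # re-read the tail next cycle
    return [l.decode("utf-8", errors="replace") for l in lines]
```

Rewinding the offset past a partial trailing line means a visit written mid-read is simply picked up on the next 60-second cycle instead of being half-parsed.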
## Bucket aggregation
The agent stores aggregated stats in per-hour buckets rather than individual visit records. Each bucket has:
```json
{
  "bucket_start": 1712664000,
  "visits": 487,
  "unique": 312,
  "bandwidth": 12598234,
  "bot_visits": 89,
  "pages": {"/blog/post-1": 234, "/about": 89, ...},
  "referrers": {"google.com": 145, "twitter.com": 67, ...},
  "countries": {"US": 234, "DE": 89, "RO": 56, ...},
  "browsers": {"Chrome": 312, "Safari": 145, ...},
  "os": {"Windows": 234, "macOS": 145, ...},
  "status": {"2xx": 421, "4xx": 23, "5xx": 2, ...}
}
```
This is roughly 5 KB per bucket per domain per hour. For a busy site with 100,000 visits per hour, the bucket size is about the same — because we're storing counts, not individual events. For 100 domains × 24 hours × 90 days that's roughly 1 GB of total storage, which is negligible.
The bucket files live at `/var/lib/opterius/analytics/{domain}/{YYYY-MM-DD}.json`. Each daily file contains all 24 hourly buckets for that day, keyed by hour.
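Folding a parsed hit into its hourly bucket is just counter increments. A sketch — the field names match the JSON above, but `empty_bucket` and `add_hit` are our own illustrative helpers:

```python
from collections import defaultdict

def empty_bucket(bucket_start: int) -> dict:
    return {
        "bucket_start": bucket_start, "visits": 0, "unique": 0,
        "bandwidth": 0, "bot_visits": 0,
        "pages": defaultdict(int), "referrers": defaultdict(int),
        "countries": defaultdict(int), "browsers": defaultdict(int),
        "os": defaultdict(int), "status": defaultdict(int),
    }

def add_hit(buckets: dict, ts: int, url: str, status: int, size: int,
            country: str, browser: str, is_bot: bool) -> None:
    hour = ts - ts % 3600                    # floor the timestamp to the hour
    b = buckets.setdefault(hour, empty_bucket(hour))
    b["visits"] += 1
    b["bandwidth"] += size
    b["status"]["%dxx" % (status // 100)] += 1
    if is_bot:
        b["bot_visits"] += 1
        return                               # bots stay out of the top lists
    b["pages"][url] += 1
    b["countries"][country] += 1
    b["browsers"][browser] += 1
```

The `unique` count isn't shown here: it needs a per-bucket set of IPs held in memory between flushes (only the final count reaches the JSON).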
## Top-N aggregation
For top URLs, referrers, and other categorical fields, the agent only keeps the top 50 values per bucket. When the in-memory bucket exceeds 500 distinct values, it's trimmed back to 100. This bounds memory usage on high-traffic sites with thousands of unique URLs while still capturing the most popular ones.
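The trim step is a sort-and-slice over one categorical map. A minimal sketch, using the thresholds from the text (`trim_top_n` is our name for it):

```python
def trim_top_n(counts: dict, limit: int = 500, keep: int = 100) -> dict:
    """When a categorical map grows past `limit` keys, keep only the `keep` largest."""
    if len(counts) <= limit:
        return counts
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    return dict(top)
```

Trimming to 100 in memory but writing only the top 50 at flush time leaves headroom, so a URL hovering around rank 50 isn't lost to a mid-hour trim.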
## Bot detection
The User-Agent string contains a clear bot signature for ~95% of bots. The agent matches against a regex of known patterns:
- Search engines: `googlebot`, `bingbot`, `yandex`, `baidu`, `duckduckbot`
- SEO tools: `ahrefsbot`, `semrushbot`, `mj12bot`
- Social: `facebookexternalhit`, `twitterbot`, `linkedinbot`, `whatsapp`
- Programmatic: `curl/`, `wget/`, `python-requests`, `go-http-client`, `java/`
- Generic: anything containing `bot`, `crawler`, `spider`, `scraper`
Bot visits are counted separately so they show up in the Bot Traffic card on the dashboard but don't pollute the top pages / countries / browsers stats. Sophisticated bots that fake a real browser User-Agent slip through this filter — that's expected.
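Compiled into one case-insensitive alternation, the pattern list above looks roughly like this (the agent's exact regex may differ):

```python
import re

BOT_RE = re.compile(
    r"googlebot|bingbot|yandex|baidu|duckduckbot"            # search engines
    r"|ahrefsbot|semrushbot|mj12bot"                         # SEO tools
    r"|facebookexternalhit|twitterbot|linkedinbot|whatsapp"  # social
    r"|curl/|wget/|python-requests|go-http-client|java/"     # programmatic
    r"|bot|crawler|spider|scraper",                          # generic catch-all
    re.IGNORECASE,
)

def is_bot(user_agent: str) -> bool:
    return bool(BOT_RE.search(user_agent))
```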
## Geolocation
The agent uses MaxMind GeoLite2 — a free downloadable database that maps IP ranges to countries. When the database is installed at `/var/lib/opterius/GeoLite2-Country.mmdb`, the agent loads it into memory at startup and does ~1 microsecond lookups per IP.
Without the database, the country field is left empty and visits are counted under "Unknown". The dashboard then shows a hint asking the admin to configure MaxMind in System Settings → Integrations.
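With the `geoip2` Python library, the lookup-with-fallback behaviour can be sketched like this. Note the real agent keeps one reader open for its lifetime rather than opening the database per lookup; `lookup_country` is our illustrative helper:

```python
import os

def lookup_country(ip: str,
                   db_path: str = "/var/lib/opterius/GeoLite2-Country.mmdb") -> str:
    """Return a two-letter country code, or "Unknown" when lookup isn't possible."""
    if not os.path.exists(db_path):
        return "Unknown"                # database not installed
    try:
        import geoip2.database          # optional dependency
    except ImportError:
        return "Unknown"
    with geoip2.database.Reader(db_path) as reader:
        try:
            return reader.country(ip).country.iso_code or "Unknown"
        except Exception:               # private IPs, addresses not in the DB
            return "Unknown"
```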
## Why hourly buckets and not real-time?
Two reasons:
- Storage efficiency: storing every visit individually would require 1000-10000× more disk space for nothing useful. You don't need to know that visit #4,532 came in at exactly 14:23:11.456 — you need to know that you had 487 visits in that hour.
- Query performance: aggregating 90 days of data when the dashboard loads is fast because we only have ~2160 buckets to read instead of millions of rows.
The "current" hour bucket is held in memory and merged with the on-disk buckets at query time, so the dashboard always shows up-to-the-minute data despite the 5-minute disk flush interval.
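One plausible reading of that merge: at query time the in-memory current-hour bucket simply supersedes any stale flushed copy of the same hour, since it is always a superset of what's on disk. Sketched:

```python
def query_buckets(disk_buckets: dict, live_bucket: dict) -> list:
    """On-disk hourly buckets plus the in-memory current hour; the live
    copy supersedes any stale flush of the same hour."""
    merged = dict(disk_buckets)                        # keyed by bucket_start
    merged[live_bucket["bucket_start"]] = live_bucket  # live copy wins
    return [merged[h] for h in sorted(merged)]
```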
## Pruning
A daily background job deletes JSON files older than 90 days. Disk usage stays bounded automatically.
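The prune job is a walk-and-delete over the analytics directory. A sketch, assuming file modification time is a good enough proxy for bucket age:

```python
import os
import time

def prune_analytics(root: str = "/var/lib/opterius/analytics",
                    max_age_days: int = 90) -> int:
    """Delete daily JSON files older than max_age_days; return how many were removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".json"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return removed
```

Parsing the `YYYY-MM-DD` filename instead of trusting mtime would be slightly more robust against files touched after the fact; either works for a bounded-disk guarantee.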