This content originally appeared on DEV Community and was authored by CESAR NIKOLAS CAMAC MELENDEZ
Modern software systems are more distributed, dynamic, and complex than ever. Microservices, serverless functions, containers, and event-driven architectures make traditional monitoring insufficient.
To understand what is actually happening inside your application, you need observability.
This article explains the foundational pillars of observability, best practices, common tools, and includes a hands-on real-world example using Grafana + Prometheus for metrics and ELK Stack for logs.
What Is Observability?
Observability is the ability to understand the internal state of a system based solely on the data it produces, such as metrics, logs, and traces.
Unlike classical monitoring, which answers "Is the system up?", observability answers:
- Why is the system slow?
- Where is the latency coming from?
- Which dependency failed?
- What changed recently that caused errors?
The Three Pillars of Observability
1. Metrics
Numeric measurements collected over time.
Examples:
- CPU usage
- Request throughput
- Error rate
- Database latency
Tools: Prometheus, Grafana, Datadog Metrics, AWS CloudWatch Metrics.
2. Logs
Immutable records about events happening in the system.
Examples:
- Info logs
- Warnings
- Errors
- Audit logs
Tools: ELK Stack, Datadog Logs, Azure Log Analytics, Loki.
3. Traces
A trace follows a single request across distributed components.
Examples:
- Microservices call chain
- Latency across services
- Errors from downstream dependencies
Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray, New Relic.
Core Observability Practices
1. Use Structured Logging
Instead of plain text logs, use JSON.
{
  "timestamp": "2025-11-23T10:15:20Z",
  "level": "ERROR",
  "service": "orders-api",
  "message": "Payment gateway timeout",
  "orderId": 10422,
  "durationMs": 3200
}
Structured logs allow better filtering and search on platforms like ELK, Datadog, and CloudWatch.
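As a minimal sketch of producing such entries without any logging library, the hypothetical helper `formatLogEntry` below serializes one event per line as JSON. The function name and the sample values are assumptions made for illustration; a real service would more likely rely on a logger such as winston.

```javascript
// Hypothetical helper: builds one structured (JSON) log line.
// Field names mirror the example entry above; they are not a standard.
function formatLogEntry(level, service, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service,
    message,
    ...fields, // extra context, e.g. orderId or durationMs
  });
}

// One JSON object per line keeps the output easy to parse and index.
const line = formatLogEntry("ERROR", "orders-api", "Payment gateway timeout", {
  orderId: 10422,
  durationMs: 3200,
});
console.log(line);
```

Because every entry carries the same keys, a query such as "all ERROR entries for orders-api slower than 3000 ms" becomes a field filter rather than a text search.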
2. Define Standard Business Metrics
Examples:
- orders_created_total
- failed_logins_total
- payment_latency_seconds
Business KPIs help identify anomalies beyond pure infrastructure.
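To make the naming concrete, the sketch below keeps a hand-rolled in-memory counter and renders it in the Prometheus text exposition format. It is a stand-in for illustration only: the `BusinessCounter` class is a hypothetical name, and in practice a client library such as prom-client would do this work.

```javascript
// Hypothetical stand-in for a metrics library counter.
class BusinessCounter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.value = 0;
  }

  // Counters only go up; reject negative increments.
  inc(amount = 1) {
    if (amount < 0) throw new Error("counter cannot decrease");
    this.value += amount;
  }

  // Render in the Prometheus text exposition format.
  render() {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
      `${this.name} ${this.value}`,
    ].join("\n");
  }
}

const ordersCreated = new BusinessCounter("orders_created_total", "Total orders created");
ordersCreated.inc();
ordersCreated.inc();
console.log(ordersCreated.render());
```

The `_total` suffix signals a counter that only increases, which is why dashboards usually chart its rate of change rather than its raw value.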
3. Trace End-to-End Requests
Use OpenTelemetry to instrument services.
Traces give visibility across microservices.
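A trace can span services because each request carries its trace identity forward. The sketch below builds and parses a W3C Trace Context `traceparent` header using only Node.js built-ins. OpenTelemetry SDKs normally handle this propagation for you; the function names here are hypothetical and exist only to show the idea.

```javascript
const { randomBytes } = require("node:crypto");

// Build a traceparent header: version-traceId-parentSpanId-flags.
// Reuse the incoming traceId so every hop belongs to the same trace.
function createTraceparent(traceId = randomBytes(16).toString("hex")) {
  const spanId = randomBytes(8).toString("hex"); // new span per hop
  return `00-${traceId}-${spanId}-01`; // 01 = sampled
}

// Parse an incoming header; return null if it is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return { version, traceId, parentSpanId, flags };
}

// Service A starts a trace; service B continues it with a new span ID.
const outgoing = createTraceparent();
const incoming = parseTraceparent(outgoing);
const nextHop = createTraceparent(incoming.traceId);
console.log(outgoing);
console.log(nextHop);
```

Both headers share the same trace ID while the span ID changes per hop, which is what lets a tracing backend stitch the individual spans into one call chain.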
4. Set SLOs and Alert Policies
For example:
- SLO: 99% of HTTP requests < 150ms
- Alert: More than 5% errors in 5 minutes
Good alerts are meaningful, not noisy.
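The two example policies above can be sketched as plain functions over the requests seen in one time window. This is an illustration under assumptions: the sample data is invented, and treating only 5xx responses as errors is a choice made here, not something the policies themselves prescribe.

```javascript
// Each sample is one finished request: latency in ms and HTTP status.
// Returns the fraction of requests faster than the threshold.
function latencyCompliance(requests, thresholdMs) {
  if (requests.length === 0) return 1;
  const fast = requests.filter((r) => r.durationMs < thresholdMs).length;
  return fast / requests.length;
}

// Fraction of requests that ended in a server error (5xx).
function errorRate(requests) {
  if (requests.length === 0) return 0;
  const failed = requests.filter((r) => r.status >= 500).length;
  return failed / requests.length;
}

// Hypothetical sample: requests observed in the last 5 minutes.
const windowRequests = [
  { durationMs: 80, status: 200 },
  { durationMs: 120, status: 200 },
  { durationMs: 400, status: 500 },
  { durationMs: 95, status: 200 },
];

const sloMet = latencyCompliance(windowRequests, 150) >= 0.99;
const shouldAlert = errorRate(windowRequests) > 0.05;
console.log({ sloMet, shouldAlert });
```

In a real setup these checks would usually be expressed as queries and alert rules in the monitoring platform rather than in application code.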
5. Centralize All Telemetry Data
Use one platform for metrics + logs + traces.
Examples:
- Datadog
- New Relic
- Grafana Cloud
- AWS CloudWatch
Real-World Example: Observability With Prometheus + Grafana + ELK Stack
Below is a complete example using a simple Node.js API instrumented with Prometheus metrics and ELK logging.
Example Application (Node.js)
We will expose:
- A /metrics endpoint for Prometheus scraping
- Structured logs sent to Elasticsearch (ELK stack)
- A simple API endpoint
Step 1: Install Dependencies
npm install express prom-client winston winston-elasticsearch
Step 2: Node.js Code With Observability
// app.js
const express = require("express");
const client = require("prom-client");
const winston = require("winston");
const { ElasticsearchTransport } = require("winston-elasticsearch");

const app = express();
const register = new client.Registry();

// ----- PROMETHEUS METRICS -----
const httpRequestCounter = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"]
});
register.registerMetric(httpRequestCounter);

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency",
  buckets: [0.1, 0.3, 0.5, 1, 2]
});
register.registerMetric(httpRequestDuration);

client.collectDefaultMetrics({ register });

// ----- ELK LOGGING -----
const esTransportOpts = {
  level: "info",
  clientOpts: { node: "http://localhost:9200" }
};
const logger = winston.createLogger({
  transports: [new ElasticsearchTransport(esTransportOpts)]
});

// ----- NORMAL API ROUTE -----
app.get("/api/orders", async (req, res) => {
  const end = httpRequestDuration.startTimer();
  logger.info({
    message: "Fetching orders",
    service: "orders-api",
    environment: "production",
    timestamp: new Date().toISOString()
  });
  res.json({ orderId: 123, status: "OK" });
  httpRequestCounter.inc({
    method: "GET",
    route: "/api/orders",
    status: 200
  });
  end();
});

// ----- METRICS ENDPOINT FOR PROMETHEUS -----
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

// ----- START SERVER -----
app.listen(3000, () => console.log("API running on port 3000"));
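The route above records its own metrics and hard-codes the status label. One way to avoid repeating that in every handler is a small middleware that records metrics when the response finishes. This is a sketch under assumptions: `metricsMiddleware` is a hypothetical name, and the counter and histogram are passed in as arguments so the function can be exercised with stand-in objects instead of a running server.

```javascript
// Returns Express-style middleware that records one count and one
// latency observation per finished response.
function metricsMiddleware(counter, histogram) {
  return (req, res, next) => {
    const end = histogram.startTimer();
    res.on("finish", () => {
      counter.inc({
        method: req.method,
        // Prefer the route pattern over the raw URL to keep label
        // cardinality low; fall back to a fixed value if nothing matched.
        route: req.route ? req.route.path : "unmatched",
        status: res.statusCode,
      });
      end();
    });
    next();
  };
}

// Exercise the middleware with stand-in objects instead of a real server.
const { EventEmitter } = require("node:events");
const recorded = [];
const fakeCounter = { inc: (labels) => recorded.push(labels) };
const fakeHistogram = { startTimer: () => () => {} };

const fakeReq = { method: "GET", route: { path: "/api/orders" } };
const fakeRes = Object.assign(new EventEmitter(), { statusCode: 200 });

metricsMiddleware(fakeCounter, fakeHistogram)(fakeReq, fakeRes, () => {});
fakeRes.emit("finish");
console.log(recorded);
```

In the application it could be registered before the routes with `app.use(metricsMiddleware(httpRequestCounter, httpRequestDuration))`, after which the per-route `inc` and `startTimer` calls would no longer be needed.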
Step 3: Prometheus Scrape Configuration
Add to prometheus.yml:
scrape_configs:
  - job_name: "orders-api"
    static_configs:
      - targets: ["localhost:3000"]
Prometheus now collects:
- request count
- request latency
- default Node.js process metrics
Step 4: Grafana Dashboard
Once Prometheus is added as a data source in Grafana, you can plot:
- Requests per second
- Error rate
- Latency percentiles (P95, P99)
Create a panel and use this query:
rate(http_requests_total[5m])
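To build intuition for what this query returns, the sketch below computes a simplified per-second rate from a series of counter samples. It is an approximation for illustration: Prometheus additionally extrapolates to the window boundaries, and the function name and sample values are assumptions.

```javascript
// Simplified per-second rate of a counter over a window of samples.
// Samples are { t: seconds, v: counter value }, ordered by time.
function simpleRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset (e.g. process restart),
    // so count everything accumulated since the reset.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : 0;
}

// Hypothetical scrapes taken every 60 seconds.
const scrapes = [
  { t: 0, v: 100 },
  { t: 60, v: 130 },
  { t: 120, v: 160 },
  { t: 180, v: 190 },
  { t: 240, v: 220 },
];
console.log(simpleRate(scrapes)); // requests per second
```

The reset handling is why a rate over a counter stays meaningful across restarts, whereas plotting the raw counter would show a sudden drop.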
Another useful panel for latency:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
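A histogram stores only cumulative bucket counts, so the percentile is an estimate obtained by interpolating inside one bucket. The sketch below illustrates that idea with linear interpolation over buckets matching the ones configured in the application. The function name and the counts are invented for the example, and the code is a simplification of what Prometheus does, not its implementation.

```javascript
// Estimate a quantile (0 < q <= 1) from cumulative histogram buckets,
// using linear interpolation inside the bucket that contains the rank.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
// with at least one finite bucket and a final le = Infinity bucket.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical cumulative counts for the buckets [0.1, 0.3, 0.5, 1, 2].
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.3, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: 2, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, latencyBuckets)); // estimated P95 in seconds
```

The estimate can never be more precise than the bucket boundaries, so it helps to choose buckets that bracket the latency targets you care about.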
Step 5: ELK Stack for Logs
Logs go to Elasticsearch, and you can search them with Kibana.
Example search query (the exact field names depend on how the transport maps log entries into Elasticsearch documents):
service: "orders-api" AND level: "info"
Example rendered log:
{
  "message": "Fetching orders",
  "service": "orders-api",
  "environment": "production",
  "timestamp": "2025-11-23T10:20:34.123Z"
}
Final Thoughts
Modern systems demand more than simple health checks: they require full observability to identify, diagnose, and prevent problems in production.
By applying:
- structured logs
- custom metrics
- distributed tracing
- centralized dashboards
- SLO-driven alerting
you significantly increase reliability, and most importantly, you gain the ability to understand your system deeply.
CESAR NIKOLAS CAMAC MELENDEZ | Sciencx (2025-11-24T02:34:56+00:00) Observability Practices: A Practical Guide With Real-World Examples. Retrieved from https://www.scien.cx/2025/11/24/%f0%9f%94%8d-observability-practices-a-practical-guide-with-real-world-examples/