Introduction: The Network Monitoring Fragmentation Problem
In five years working with Enterprise accounts at Elastic, I have heard the same challenge again and again:
"We have several network monitoring tools, and we would love to correlate all of them into one platform."
For many organizations, the barrier to true correlation isn't a lack of data, but where that data lives. Frequently, we see SNMP metrics, flow data, and logs isolated in purpose-built silos or dashboards. Without a unified data store and a proper correlation engine, piecing together the full narrative — from a topology change to a performance degradation — becomes a manual, time-consuming puzzle.
When an incident happens, engineers become human correlation engines — manually jumping between systems, copying timestamps, cross-referencing device names, and trying to piece together what actually happened. A simple question like "Did this interface failure impact application performance?" requires querying multiple tools and mentally correlating the results.
The real cost isn't the tool licenses — it's the time lost during critical incidents.
This lab is my answer to a fundamental question: Can Elastic become the unified foundation that actually correlates network data?
More importantly, it demonstrates that Elastic is fully ready for network operations — capable of ingesting diverse telemetry and using AI to correlate relationships, identify root causes, and resolve issues in seconds instead of hours.
The Problem: Network Observability is Broken
Let me paint a typical scenario I encounter with enterprise network teams:
The Fragmented Reality:
- No single source of truth
- Manual correlation during incidents (15-30 minutes per event)
- Fragmented teams (network vs. platform engineers)
- Limited automation capabilities
- No AI-powered analysis
When a link goes down at 2 AM:
- Notice the alert - 2 minutes
- Log into monitoring tool to see the metric - 3 minutes
- Switch to traffic analyzer to check impact - 5 minutes
- Open log management to search for related messages - 10 minutes
- Manually correlate timestamps across systems - 8 minutes
- Create a ticket and copy context from multiple tools - 8 minutes
Time to initial diagnosis: 36 minutes
This workflow is expensive, error-prone, and doesn't scale.
The Vision: Elastic as a Unified Network Observability Platform
What if you could:
- Collect SNMP metrics, NetFlow, traps, and topology data in one platform
- Correlate network events with application performance automatically
- Generate executive dashboards without separate BI tools
- Use AI to analyze incidents in seconds, not hours
- Trigger alerting from network events
This is what this lab aims to demonstrate.
What I Built: A Production-Grade Network Simulation
To demonstrate how Elastic unifies network data, I needed a realistic environment that generates real-world telemetry. Enter Containerlab, a Docker-based tool for spinning up realistic network topologies from containerized network operating systems.
Lab Architecture
I simulated a Service Provider core network with:
- 7 FRR routers forming an OSPF Area 0 mesh
- 2 Ubuntu hosts for additional use cases
- 2 Layer 2 switches for access layer segmentation
- 3 telemetry collectors feeding Elastic Cloud
Total containers: 14
Deployment time: 12-15 minutes (fully automated)
Full deployment instructions and topology details are available in the GitHub repository README.
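To give a feel for how little it takes to define such a topology, here is a hedged sketch of a Containerlab topology file. The node names, image tags, and link layout below are illustrative only; the repository's actual `.clab.yml` is the source of truth.

```yaml
# Illustrative Containerlab topology sketch (not the repository's exact file).
name: ospf-core
topology:
  nodes:
    csr21:
      kind: linux
      image: frrouting/frr:latest   # FRR router participating in OSPF Area 0
    csr23:
      kind: linux
      image: frrouting/frr:latest
  links:
    # Each endpoint pair creates a virtual wire between container interfaces.
    - endpoints: ["csr21:eth1", "csr23:eth1"]
```

One `containerlab deploy -t <file>` later, the whole mesh is running as Docker containers, which is why the full 14-container lab comes up in minutes.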
The Three Telemetry Pipelines: Proving Multi-Source Correlation
What makes this lab production-ready is its hybrid observability approach — proving that Elastic can unify disparate network data sources.
| Pipeline | Data Type | Collection Method | Collector | Use Case |
|---|---|---|---|---|
| SNMP Metrics | Interface stats, system health, LLDP topology | Active polling | OTEL Collector | Capacity planning, trend analysis |
| NetFlow | Traffic flows | Push-based export | Elastic Agent | Top talkers, security investigation |
| SNMP Traps | Interface up/down events | Event-driven | Logstash | Real-time incident detection |
This unified architecture proves Elastic can replace multiple specialized network monitoring tools with a single platform.
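As one concrete example of these pipelines, the trap path can be served by Logstash's `snmptrap` input. The sketch below is a minimal, hedged configuration; the port, community string, output host, and index name are placeholders, not the lab's actual settings.

```conf
# Minimal Logstash sketch for the SNMP trap pipeline (illustrative values only).
input {
  snmptrap {
    port      => 1062       # non-privileged port; devices are configured to send traps here
    community => "public"
  }
}
output {
  elasticsearch {
    hosts => ["https://your-deployment.example.com:443"]   # placeholder endpoint
    index => "logs-snmp.trap-prod"
  }
}
```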
The Power of Correlation: One Platform, One Query
When a network incident occurs, you need to answer questions like:
- Which interface failed? (SNMP metrics)
- What traffic was affected? (NetFlow)
- What was the sequence of events? (SNMP traps)
- Which devices are downstream? (LLDP topology)
The Problem: many modern tools offer separate modules glued together, forcing users to navigate different interfaces for different sets of data.
The Reality: You still have to pivot. You see a spike in the Metrics module, but to see why, you have to open the Logs module and manually align the time picker. The data lives in different tables or backends, making true correlation impossible without human intervention.
The Elastic Difference: One Store, One Language, One AI
Elastic makes it simple. Whether it's an SNMP counter (metric), a NetFlow record (flow), or a Syslog message (log), it is all stored in a unified datastore powered by the Elasticsearch engine. This allows users to easily search across multiple datasets in a single query.
```esql
FROM logs-*
| WHERE host.name == "csr23" AND interface.name == "eth1"
```
Time required: 3 seconds
Furthermore, as you will see later, the AI Assistant makes the exact location of the data irrelevant to the user: you can ask a question without knowing which index or data stream holds the answer.
Data Transformation: From Cryptic OIDs to Actionable Intelligence
Raw SNMP traps are notoriously difficult to interpret at a glance. In our current lab setup, the data arrives looking like this:
```text
OID: 1.3.6.1.6.3.1.1.5.3
ifIndex: 2
ifDescr: eth1
```
While traditional Network Management Platforms (NMPs) handle OID translation natively, bringing that clarity into Elastic requires a specific configuration.
In this initial lab, we are intentionally working with this raw data to demonstrate how AI assistants can interpret these events even without pre-existing context.
However, the strategy for the next phase of this project is to implement Elasticsearch Ingest Pipelines. This will allow us to map raw OIDs to human-readable names. This step is crucial for bridging the gap between Network tools and Application Observability platforms, allowing network events to be instantly correlated with application errors and infrastructure logs.
The Target State
Once the pipeline is implemented in the next lab, we will transform that raw trap into searchable, meaningful data:
```json
{
  "event.action": "interface-down",
  "host.name": "csr23",
  "interface.name": "eth1",
  "interface.oper_status_text": "Link Down"
}
```
The result:
- Human-readable fields
- Searchable dimensions for filtering
- Context for automation rules and dashboards
- Correlation keys for joining with metrics and flows
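To make the transformation concrete before the full pipeline post, here is a minimal Python sketch of the OID-to-field mapping logic such a pipeline performs. The OID table and field names are illustrative (only the two standard linkDown/linkUp trap OIDs are shown), not the lab's complete mapping.

```python
# Sketch of the OID translation an ingest pipeline would perform.
# The mapping table below is illustrative, not the lab's full OID set.

TRAP_OIDS = {
    "1.3.6.1.6.3.1.1.5.3": ("interface-down", "Link Down"),  # standard linkDown trap
    "1.3.6.1.6.3.1.1.5.4": ("interface-up", "Link Up"),      # standard linkUp trap
}

def enrich_trap(raw: dict) -> dict:
    """Translate a raw SNMP trap document into human-readable, searchable fields."""
    action, status_text = TRAP_OIDS.get(raw["trap_oid"], ("unknown-trap", "Unknown"))
    return {
        "event.action": action,
        "host.name": raw["host"],
        "interface.name": raw["ifDescr"],
        "interface.oper_status_text": status_text,
    }

doc = enrich_trap({"trap_oid": "1.3.6.1.6.3.1.1.5.3", "host": "csr23", "ifDescr": "eth1"})
print(doc["event.action"])  # interface-down
```

In the real deployment this lookup would live in an Elasticsearch ingest pipeline (for example, a `script` or `enrich` processor) rather than client-side code, but the logic is the same.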
In our next blog post, we will walk through building the ingest pipeline that performs this transformation — step by step.
Intelligent Alerting: From Noise to Actionable Intelligence
Traditional network monitoring relies on simple threshold alerts — "interface down," "high CPU." These alerts flood your inbox but provide zero context about root cause, impact, or remediation.
The Lab's Approach: ES|QL + AI Assistant
1. Semantic Detection with ES|QL
Instead of generic threshold alerts, the lab uses ES|QL to detect specific event patterns:
```esql
FROM logs-snmp.trap-prod
| WHERE snmp.trap_oid == "1.3.6.1.6.3.1.1.5.3"
| KEEP @timestamp, host.name, interface.name, message
```
2. Automatic AI-Powered Investigation
When the alert triggers, it invokes the Observability AI Assistant with a structured investigation prompt that:
- Performs immediate triage (which device, which interface, when)
- Assesses OSPF impact and traffic rerouting
- Correlates with other recent failures
- Generates severity assessment and recommended actions
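The lab's exact prompt lives in the repository; the following is only a hedged sketch of the shape such a structured investigation prompt might take, covering the four steps above:

```text
You are a network operations analyst. An SNMP linkDown trap has triggered this alert.
1. Triage: identify the device, interface, and exact timestamp from the alert context.
2. OSPF impact: check for related neighbor-state changes and likely traffic rerouting.
3. Correlation: search recent logs for other failures on this device or its neighbors.
4. Output: a severity rating and a short list of recommended next actions.
```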
The Transformation
| Traditional Alerting | Intelligent Alerting (Elastic) |
|---|---|
| Email: "Interface down on csr23" | Structured analysis with device context |
| Manual investigation: 20-30 min | AI-automated investigation: 90 seconds |
| Engineer correlates across tools | Automatic cross-source correlation |
| No business impact assessment | Severity + recommended actions included |
Accelerating Incident Response with the Elastic AI Assistant
This is where the Elastic AI Assistant demonstrates its operational value — moving beyond passive data collection to actively interpret and explain network events in real time.
When an engineer views a trap document in Discover and asks:
"Explain this log message"
The AI Assistant provides comprehensive analysis including:
- What happened: Plain-language explanation of the SNMP trap
- Device context: Router role, interface purpose, network position
- Impact analysis: OSPF neighbor status, traffic rerouting assessment
- Root cause possibilities: Physical layer, link layer, administrative causes
- Recommended actions: Immediate steps, investigation queries, validation checks
- Severity assessment: Business and technical impact rating
Manual Triage vs. AI-Assisted Investigation
| Before | After (Elastic AI) |
|---|---|
| Google the OID → 5 min | Click "Explain this log" → 20 seconds |
| Open network diagram → 3 min | Topology context auto-provided |
| Query multiple tools → 10 min | Cross-source correlation instant |
| Assess business impact → 5 min | Impact analysis auto-generated |
| Total: ~28 minutes | Total: ~20 seconds |
The Value Proposition: One Platform, One Data Model, One AI
What This Lab Demonstrates
Elastic provides:
- One unified platform for metrics, logs, flows
- One data model (SemConv) for consistent correlation
- One search interface (Kibana) for all network data
- One AI assistant that understands all your network telemetry
- AI-powered alerting with automated investigation
Business Impact
Efficiency Gains:
- 85% reduction in MTTR (36 min → 5 min for initial diagnosis)
- 90% reduction in manual correlation time
- Junior engineers gain access to AI-powered expert analysis
Operational Benefits:
- Network engineers focus on strategy, not tool-switching
- Cross-functional collaboration in one platform
- Reduced tool sprawl and management overhead
Lessons Learned
After building this lab, several key insights emerged regarding how network data fits into the broader observability ecosystem:
1. Extending Observability to the Network
Elastic is already the gold standard for high-volume logs and application traces. This lab demonstrates that the same engine seamlessly handles network telemetry without needing a separate, siloed tool.
- Scale: The same architecture that ingests petabytes of application logs easily handles millions of interface counters.
- Structure: Native support for complex nested documents allows for rich SNMP trap data (variable bindings) without flattening or losing context.
- Speed: Real-time search applies equally to network events, enabling sub-second troubleshooting.
2. OpenTelemetry Semantic Conventions (SemConv) as the Universal Translator
The power isn't just in storing the data, but in standardizing it. By mapping SNMP and NetFlow to the OpenTelemetry Semantic Conventions (SemConv), network data finally speaks the same language as the rest of the stack.
- Unified Search: Query across firewall logs, server metrics, and switch telemetry in a single search bar.
- Instant Visualization: Pre-built dashboards work immediately because the field names are standardized.
- Cross-Domain Correlation: Easily correlates a spike in application latency with a specific interface saturation event.
3. AI Assistants Thrive on Context
While the AI in this lab was powerful on its own, the experiment highlighted a critical realization: an AI Assistant becomes exponentially more effective when coupled with a specific Knowledge Base.
Context is King: The AI delivers better root cause analysis when provided with rich metadata, such as device roles and topology maps. Without it, the advice remains generic.
Pro Tip (and What’s Next):
To get organization-specific advice rather than generic suggestions, you need to feed the AI your documentation.
- The Goal: Create a Knowledge Base containing device roles, network topology diagrams, and troubleshooting procedures.
- The Next Step: In my next blog post, I will demonstrate exactly how to do this — connecting a Knowledge Base to the AI Assistant to enable fully context-aware troubleshooting.
Conclusion: Completing the Observability Picture
Elastic is already widely recognized as the standard for Application and Security observability. The goal of this lab wasn't to ask if Elastic can handle networking, but to demonstrate the immense value of bringing network data into that existing ecosystem.
The verdict is clear: Elastic acts as that unified foundation. It effectively breaks down the silo between Network Engineering and the rest of IT.
This isn't just about consolidating dashboards or replacing legacy tools. It is about establishing the Elasticsearch AI Platform as the single source of truth where network telemetry sits right alongside application and infrastructure data.
By treating network data as a first-class citizen in the observability stack, we unlock automated correlation, AI-assisted investigation, and the speed required to resolve incidents before they impact the business. The capabilities are in place, and the foundation is solid — Elastic is ready to unify your network with the rest of your digital business.
Ready to Try It Yourself?
Check out github.com/DeBaker1974/Containerlab-OSPF
The repository includes:
- Complete deployment scripts (12-15 minute automated setup)
- Pre-configured telemetry pipelines
- Kibana dashboards
- Alert rules with AI Assistant integration
- Detailed README
Not ready to build? Try Elastic Serverless: Start a free 14-day trial and explore AI-powered observability with your own data.
Special thanks to the Containerlab and FRRouting communities for their incredible open-source tools, and to Sheriff Lawal (CCIE, CISSP), Sr. Manager, Solutions Architecture at Elastic, for mentoring on this project.