
Metrics


Metrics Overview

Agent-to-target probing measures end-to-end network connectivity. Agents send packets to targets and measure the response. Hops along the way forward the packets to the final destination. These measurements enable the viewer to quickly spot changes in trends and find the affected parts of the network.

Packets are sent once, and each hop forwards the packet to other hops on the path to that target. The cumulative time it takes for each hop to forward the packet to the next hop determines the latency between the agent and the target.

The following article describes the types of metrics provided by agents.


Latency, Jitter, Loss metrics


Latency

  • Measured in milliseconds, latency is the round trip time (RTT) that ICMP, HTTP, and UDP synthetic packet traffic takes between the agent and the target through the network.
  • In the case of HTTP, the reported values are TCP latency and jitter reported by the OS for the probing TCP connection.
  • Latency greater than 150 ms results in poor quality real-time communications (audio and video calls), slow application response, and network degradation that is noticeable to most end users
  • Applications depending on lower latency response times may become unusable during periods of high latency
  • The distance between the agent and target directly affects latency

Jitter

  • Jitter is the absolute value of the difference in round trip latency between two consecutive packets.
  • While jitter affects all traffic, high jitter is most noticeable to end users during audio or video calls
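
The jitter definition above can be illustrated with a minimal sketch. The function name `jitter_ms` and the sample RTT values are hypothetical and only serve to show the arithmetic; this is not the agent's actual implementation.

```python
def jitter_ms(rtts_ms):
    """Return the jitter for each pair of consecutive RTT samples.

    Jitter is the absolute difference in round trip latency between
    two consecutive packets, so N samples yield N - 1 jitter values.
    """
    return [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]


# Hypothetical RTT samples in milliseconds, in the order they were probed.
rtts = [20.0, 24.5, 21.0, 21.0]
print(jitter_ms(rtts))  # [4.5, 3.5, 0.0]
```

A single RTT sample produces no jitter value, because there is no previous packet to compare against.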

Loss

  • Loss is the percentage of ICMP or UDP packets lost in the alignment period selected.
  • The agent determines packet loss differently for ICMP and UDP
Methods for determining packet loss
Protocol | Timeout | Out of Order Packets
ICMP | If the probing interval is between 1 and 5 seconds, the timeout for determining packet loss is equal to the probing interval (1 to 5 seconds). If the probing interval is more than 5 seconds, the timeout for determining loss is 5 seconds. | Out of order packets are always considered as packet loss
UDP | Response not received within 5 seconds is considered loss | N/A: packet order is not considered for packet loss
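
The ICMP loss rules above can be sketched as follows. All function names are hypothetical, and the out-of-order test used here (a reply whose sequence number is lower than one already seen) is an assumption made for illustration; the source only states that out-of-order ICMP packets count as loss.

```python
def icmp_loss_timeout_s(probing_interval_s):
    """ICMP loss timeout: equal to the probing interval, capped at 5 seconds."""
    return min(probing_interval_s, 5.0)


def icmp_lost_sequences(sent, replies, probing_interval_s):
    """Return the sequence numbers counted as lost.

    sent:    sequence numbers of the probes that were sent
    replies: (sequence, rtt_seconds) tuples in arrival order
    """
    timeout = icmp_loss_timeout_s(probing_interval_s)
    lost = set(sent)
    highest_seen = -1
    for seq, rtt in replies:
        in_time = rtt <= timeout
        in_order = seq > highest_seen  # assumed definition of "in order"
        if in_time and in_order:
            lost.discard(seq)
        highest_seen = max(highest_seen, seq)
    return sorted(lost)


def loss_percent(sent_count, lost_count):
    """Loss as a percentage of the probes sent."""
    return 100.0 * lost_count / sent_count if sent_count else 0.0


# Hypothetical run with a 10-second probing interval (timeout is capped at 5 s).
sent = [0, 1, 2, 3]
replies = [(0, 0.02), (2, 0.03), (1, 0.9), (3, 6.0)]
lost = icmp_lost_sequences(sent, replies, probing_interval_s=10)
print(lost, loss_percent(len(sent), len(lost)))  # [1, 3] 50.0
```

In this run, reply 1 arrives after reply 2 and is treated as lost, and reply 3 exceeds the 5-second timeout.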

HTTP and speed test targets do not report packet loss. This changes how metrics are displayed in the Dashboard:

  • The loss plot is replaced by plots for HTTP metrics when only HTTP targets are selected
  • Selecting HTTP and ICMP or UDP targets displays both the loss and HTTP metrics plots
  • To ensure plot time alignment and to maintain line color consistency, all target labels appear on the left of each plot
  • No plot line is visible for the HTTP target labels in the loss plot
  • No plot line is visible for the ICMP target labels in the HTTP plots


HTTP Metrics

HTTP availability refers to consistent and reliable access to web resources through the Hypertext Transfer Protocol (HTTP). In the context of websites and web services, availability is a crucial aspect of providing a seamless user experience. Monitoring HTTP timing shows performance bottlenecks in client-to-server or server-to-server communications.

Service Experience Insights monitors the network quality from the agent to the host server but does not provide application monitoring.


HTTP Request Response Time

  • HTTP Request Response Time plotlines show each target’s total HTTP response time with tooltips that display the time for each step in the response.
    • Hover over any point in the plot to view the response code and request response time for each step of the synthetic HTTP traffic between the agent and target.
    • The tooltip shows the max value of metrics in milliseconds. The tooltip notes that the total HTTP Request Response time displayed will not match the plotline value unless “Max” is selected in the Aggregation Dropdown.
  • On-demand troubleshooting streams HTTP Request Response Time data in real time and uses different names for some steps.
Metrics Explained
Tool Tip Item | On-Demand Troubleshooting Item | Description
HTTP Request Response Time | Total response time | Total time for the complete HTTP request.
Connection Setup
DNS Lookup | DNS Lookup | Time spent performing the DNS lookup. DNS lookup resolves domain names to IP addresses. Every new domain requires a complete round trip to do the DNS lookup. There is no DNS lookup when the destination is already an IP address.
Initial Connection | TCP Connect | Time to establish a TCP connection between a source host and destination host. Connections must be properly established in a multi-step handshake process. TCP connection is managed by an operating system. If the underlying TCP connection cannot be established, the OS-wide TCP connection timeout will overrule the timeout config of our application.
SSL | TLS Handshake | Time to complete a TLS handshake. During the handshake process, endpoints exchange authentication and keys to establish or resume secure sessions. There is no TLS handshake with a non-HTTPS request.
Request/Response
Request Send | Request send | Time to send the first request.
Waiting For Response | TTFB | Time to first byte after the request is sent.
Content Download | Content download | Time spent receiving the response data. The size of the response data and the available network bandwidth determine its duration.
Response Code | Response Code | Response code returned in the response header.
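
To make the steps above concrete, the following is a rough sketch of how each phase of a single HTTP GET could be timed with Python's standard library. The function `time_http_get` and its dictionary keys are hypothetical names chosen to mirror the tooltip items; this is not how the agent itself is implemented, and it ignores redirects, proxies, and error handling.

```python
import ipaddress
import socket
import ssl
import time
from urllib.parse import urlsplit


def _elapsed_ms(start):
    return (time.perf_counter() - start) * 1000.0


def time_http_get(url, timeout=10.0):
    """Time each phase of one HTTP GET; all durations are in milliseconds."""
    parts = urlsplit(url)
    host = parts.hostname
    use_tls = parts.scheme == "https"
    port = parts.port or (443 if use_tls else 80)
    path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    timings = {}
    overall = time.perf_counter()

    # DNS Lookup: skipped when the destination is already an IP address.
    start = time.perf_counter()
    try:
        ipaddress.ip_address(host)
        address = host
        timings["dns_lookup"] = 0.0
    except ValueError:
        address = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4][0]
        timings["dns_lookup"] = _elapsed_ms(start)

    # Initial Connection: TCP handshake.
    start = time.perf_counter()
    sock = socket.create_connection((address, port), timeout=timeout)
    timings["initial_connection"] = _elapsed_ms(start)
    try:
        # SSL: TLS handshake, only for HTTPS requests.
        start = time.perf_counter()
        if use_tls:
            context = ssl.create_default_context()
            sock = context.wrap_socket(sock, server_hostname=host)
            timings["ssl"] = _elapsed_ms(start)
        else:
            timings["ssl"] = 0.0

        # Request Send.
        request = f"GET {path} HTTP/1.1\r\nHost: {host}:{port}\r\nConnection: close\r\n\r\n"
        start = time.perf_counter()
        sock.sendall(request.encode("ascii"))
        timings["request_send"] = _elapsed_ms(start)

        # Waiting For Response: time to first byte.
        start = time.perf_counter()
        first = sock.recv(4096)
        timings["waiting_for_response"] = _elapsed_ms(start)
        if not first:
            raise ConnectionError("server closed the connection without a response")

        # Content Download: read until the server closes the connection.
        start = time.perf_counter()
        received = len(first)
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            received += len(chunk)
        timings["content_download"] = _elapsed_ms(start)
    finally:
        sock.close()

    # Response Code: second token of the status line, e.g. "HTTP/1.1 200 OK".
    timings["response_code"] = int(first.split(b" ", 2)[1])
    timings["bytes_received"] = received
    timings["total"] = _elapsed_ms(overall)
    return timings
```

Calling `time_http_get("https://example.com/")` would return one value per step plus the total; the URL here is only a placeholder.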


HTTP Availability

  • HTTP probing does not return loss metrics; instead, this type of probing returns the percentage of packets reaching the target.
    • 100% HTTP availability shows that all synthetic HTTP traffic reached the target.
    • <100% HTTP availability indicates the level of HTTP traffic degradation.


Connectivity

  • The agents are considered online as long as the agent posts connectivity metrics to the controller every minute.
  • An agent unable to send metrics to the controller will show a connectivity issue. While running, the agent will continue sending probing traffic to reachable targets. This data is stored in the agent’s memory for 1 hour and will be added to the time series database when the agent reconnects to the controller.
  • If the agent is unable to reach the controller for more than 1 hour, probing will continue, and the buffered metrics that are older than 1 hour will be discarded.
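
The one-hour buffering behavior described above can be sketched with a bounded queue. The class `MetricBuffer` and its methods are hypothetical; the sketch assumes one buffered entry per minute, which matches the one-minute upload cadence but is not confirmed as the agent's internal data layout.

```python
from collections import deque


class MetricBuffer:
    """Keep at most the last 60 minutes of per-minute metrics in memory."""

    def __init__(self, max_minutes=60):
        # A deque with maxlen silently drops the oldest entry when full,
        # which models discarding buffered metrics older than one hour.
        self._minutes = deque(maxlen=max_minutes)

    def record_minute(self, minute_index, metrics):
        self._minutes.append((minute_index, metrics))

    def flush(self):
        """Return everything buffered (for upload on reconnect) and clear it."""
        batch = list(self._minutes)
        self._minutes.clear()
        return batch


# Hypothetical outage lasting 90 minutes: only the last 60 minutes survive.
buffer = MetricBuffer()
for minute in range(90):
    buffer.record_minute(minute, {"latency_ms": 20.0})

uploaded = buffer.flush()
print(len(uploaded), uploaded[0][0], uploaded[-1][0])  # 60 30 89
```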


Red indicates when static and cloud agents were disconnected


Gray indicates when mobile agents were disconnected. Gray is used for mobile agents because they are expected to be offline when the host PC is not in use



Path Discovery

Agents perform path discovery when the agent is added to a probing distribution with path discovery enabled.

Path probing actions are used to create path discovery visualizations of the hops between the agent and the target. The path discovery visualization shows the IP and latency of each hop and the carrier ASN in use at the time.

Automatic path probing is performed during each path discovery interval and when a change in target latency or IP is detected. Manual path discovery can be triggered with on-demand troubleshooting. Learn more about how agents perform path discovery and path probing in the Agents Overview article.

For agents in a path discovery probing distribution, clicking a target plotline opens the path visualization for the alignment period and time range selected.


Path Discovery Methodology

  • The IP header of every packet contains a field called TTL (Time to Live) that helps to prevent infinitely looping packets.
  • A router decrements the TTL value of an IP packet by 1 before forwarding it to the next hop.
  • When the TTL value reaches 0, the packet is discarded, and the router returns an ICMP TTL Exceeded message to the source IP of the packet.
  • Using this behavior, the agent discovers the routers between a host and a destination by sending packets with increasing TTL until the target host is reached, saving the source IP address of each incoming ICMP TTL Exceeded message.
  • The ICMP Error messages include the actual packet in the payload, so the sender is able to match the TTL Exceeded message to the actual packet that raised the error.
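
The TTL-based methodology above can be illustrated with a small simulation that needs no raw sockets or network access. The functions `probe` and `discover_path` are hypothetical, and the path is a made-up list of documentation-range IP addresses whose last entry stands for the target.

```python
def probe(path, ttl):
    """Simulate sending one packet with the given TTL along a fixed path.

    Each router decrements the TTL by 1. The router where the TTL reaches 0
    answers with ICMP TTL Exceeded; if the TTL is large enough, the target
    itself answers. Returns (responder_ip, reached_target).
    """
    if ttl < len(path):
        return path[ttl - 1], False  # TTL Exceeded from an intermediate router
    return path[-1], True            # reply from the target


def discover_path(path, max_ttl=32):
    """Send probes with increasing TTL until the target responds."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder, reached_target = probe(path, ttl)
        hops.append(responder)
        if reached_target:
            break
    return hops


# Hypothetical path: two routers followed by the target.
simulated_path = ["192.0.2.1", "198.51.100.1", "203.0.113.9"]
print(discover_path(simulated_path))
# ['192.0.2.1', '198.51.100.1', '203.0.113.9']
```

If `max_ttl` is smaller than the number of hops, the loop ends without reaching the target, which corresponds to a path trace that never discovers the last hop.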

Path Probing Concept


ICMP Path Probing

An ICMP echo request message is sent to the target host with incrementing TTL.

  • The response is typically matched to the request based on the fields ID and Sequence number in the ICMP header.
  • The execution stops when we reach the target hop (received ICMP Echo Reply message), or we reach the configured maximum TTL (typically 32 to 64).
  • Routers generally use only the source IP, destination IP, and protocol to route ICMP, so it is not possible to discover Equal-Cost Multi-Path (ECMP) routes using ICMP.
  • ECMP is a routing technique that allows traffic to be distributed across multiple paths of equal cost between the source and destination.
  • In traditional routing, traffic is sent over a single best path between two endpoints. But in ECMP, the network uses multiple equal-cost paths simultaneously to distribute the traffic. This increases the capacity and resiliency of the network and helps to avoid congestion.
  • ECMP can be used in many types of network topologies, including data center networks, enterprise networks, and service provider networks. It is supported by many routing protocols, including OSPF, IS-IS, and BGP.
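
As a sketch of how a response can be matched to a request through the ID and Sequence number fields, the following builds an ICMP echo request as defined in RFC 792 and checks a reply against it. The helper names are hypothetical, and actually sending the packet (which requires a raw socket and a per-probe TTL) is deliberately left out.

```python
import struct

ICMP_ECHO_REQUEST = 8
ICMP_ECHO_REPLY = 0


def internet_checksum(data):
    """16-bit one's complement checksum used by ICMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier, sequence, payload=b""):
    """Pack type, code, checksum, identifier, and sequence number."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload


def is_reply_to(icmp_message, identifier, sequence):
    """True if the message is an echo reply carrying the same ID and sequence."""
    msg_type, _code, _checksum, msg_id, msg_seq = struct.unpack("!BBHHH", icmp_message[:8])
    return msg_type == ICMP_ECHO_REPLY and msg_id == identifier and msg_seq == sequence


# Hypothetical probe: identifier 0x1234, one sequence number per TTL value.
packet = build_echo_request(identifier=0x1234, sequence=3, payload=b"probe")
print(len(packet), internet_checksum(packet))  # 13 0
```

A correctly checksummed message sums to zero when the checksum is recomputed over the whole packet, which is why the second printed value is 0.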


UDP Path Probing

  • UDP path probing is implemented by sending UDP packets to the target host and port number 33434 (IANA Reserved for traceroute).
  • On Windows, the tracert command is executed to obtain the path trace.
  • On Linux and Mac, to discover ECMP paths, the implementation changes the source port number for each subsequent attempt. By default, 15 attempts are made to discover ECMP paths, and the number of attempts is configurable through the API.
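
Why changing the source port can reveal ECMP paths is easier to see in a simulation. The sketch below assumes a router that picks a next hop by hashing the flow's 5-tuple; real routers use vendor-specific hash functions, so the SHA-256 hash, the function names, and the IP addresses are all illustrative assumptions.

```python
import hashlib


def ecmp_pick(next_hops, src_ip, dst_ip, protocol, src_port, dst_port):
    """Pick one of several equal-cost next hops from a hash of the 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


def discover_next_hops(next_hops, attempts=15, base_src_port=40000):
    """Vary the source port on each attempt and collect the next hops seen."""
    seen = set()
    for attempt in range(attempts):
        hop = ecmp_pick(
            next_hops,
            src_ip="192.0.2.10",
            dst_ip="203.0.113.9",
            protocol="udp",
            src_port=base_src_port + attempt,  # only this field changes
            dst_port=33434,
        )
        seen.add(hop)
    return seen


# Hypothetical router with two equal-cost next hops.
equal_cost_hops = ["198.51.100.1", "198.51.100.2"]
print(sorted(discover_next_hops(equal_cost_hops)))
```

Because the same 5-tuple always hashes to the same next hop, a single flow sees only one path; more attempts with different source ports raise the chance of observing every equal-cost path, but never guarantee it.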


TCP Path Probing

  • TCP path probing is implemented by sending a TCP SYN packet to the target host on the port where a known application is running (e.g., 80 or 443).
  • The target node responds with a SYN-ACK or an RST.
  • To discover ECMP paths, the implementation can change the source port for each subsequent attempt.
  • TCP path probing has a better chance of reaching the destination when the destination is a TCP application.


TCP Path Probing (Privileged)

  • Uses a raw socket to handcraft TCP packets.
  • Makes subsequent packets with increasing TTL look like TCP retransmissions to get through the firewall.
  • Uses the same sequence number and window size for all TCP packets in one path trace attempt.
  • Only supported on the Linux agent.

Configure firewalls to allow path probing

  • Firewalls typically block path probing traffic. Avoid this by opening destination and source firewalls.
  • Configure destination firewalls to allow:
    • Incoming ICMP echo request for ICMP path probing.
    • UDP port 33434 (and 33435 to 33655) for UDP path probing
  • Configure firewalls at the sources to allow:
    • Incoming ICMP control messages (such as TTL Exceeded and Destination Port Unreachable).
  • An increasing number of SaaS application endpoints block these messages, which prevents the path trace from discovering the last hop.
  • When the last hop is not discovered, it is impossible to guess its distance from the last discovered hop.


Speed Test

  • Speed tests are configurable in probing distributions, and agents configured for speed tests offer on-demand speed testing from the agent page in the dashboard.
  • Speed test targets are Managed Targets (agents) configured to receive speed test probing from agents. These speed test targets are typically installed in cloud environments nearest the service hosts (for example, installed in a client’s AWS region where a critical service is also hosted).
  • Speed test probing interval starts at 1 hour.
  • Test package size is 100 MB.
  • On-demand speed test data is also added to the time series and appears in the plots about two minutes after the on-demand speed test concludes.
  • Confirm that the agent has outbound access to speed test targets.
    • Speed Test uses HTTP CONNECT requests to upgrade the HTTP connection to WebSocket, and some HTTP proxy servers reject HTTP CONNECT requests to non-SSL ports by default.
    • If an HTTP proxy server is configured on the host OS or SEI agent and the speed test server is not SSL enabled, confirm that your HTTP proxy server allows HTTP CONNECT requests to non-SSL ports.

See the Data Usage Estimation to learn more about the bandwidth consumed by speed tests.


Example of on-demand Speed Tests

  • Note that an agent can be configured to send speed tests to multiple speed test targets


Device Metrics

  • Agents running on the host OS report device metrics.
    • Static Agents for SPEKTRA Edge and EdgeLQ OS
    • Mobile Agents for Windows and MacOS
  • Agents running in Docker containers and VMs are unable to capture CPU and Memory usage. Interface metrics are available if host network mode is used.
    • Static Agents for Docker
    • Cloud Agents
  • Available Metrics
    • CPU and Memory usage
    • Interface Loss Transmit (TX)
    • Interface Loss Receive (RX)
    • Interface Errors Transmit (TX)
    • Interface Errors Receive (RX)
Interface loss and errors
Metric | Description
Interface Loss TX | The number of outbound packets that were discarded even though no errors were detected that would prevent their being transmitted. One possible reason for discarding such a packet is to free up buffer space.
Interface Loss RX | The number of inbound packets that were discarded even though no errors were detected that would prevent their being deliverable to a higher-layer protocol. One possible reason for discarding such a packet is to free up buffer space.
Interface Errors TX | The number of outgoing packets that were discarded because of errors. One possible cause is a duplex mismatch.
Interface Errors RX | The number of incoming packets that were discarded because of errors. Possible causes include a duplex mismatch or a CRC mismatch.
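
For readers who want to cross-check these counters on a Linux host, the kernel exposes equivalent cumulative values in /proc/net/dev. The sketch below parses that file's text format; the function name and sample values are hypothetical, and the source does not state that the agent reads this file.

```python
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 1 2 0 0 0 0 2000 20 3 4 0 0 0 0
"""


def parse_proc_net_dev(text):
    """Map each interface to its drop and error counters.

    Each interface line has 8 receive fields followed by 8 transmit fields;
    in both groups, 'errs' is the third field and 'drop' is the fourth.
    """
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        counters[name.strip()] = {
            "errors_rx": fields[2],
            "loss_rx": fields[3],
            "errors_tx": fields[10],
            "loss_tx": fields[11],
        }
    return counters


print(parse_proc_net_dev(SAMPLE_PROC_NET_DEV)["eth0"])
# {'errors_rx': 1, 'loss_rx': 2, 'errors_tx': 3, 'loss_tx': 4}
```

These kernel counters accumulate since boot, so a per-interval value is the difference between two successive readings.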


WiFi Signal Strength

  • WiFi Signal Strength is only reported for agents connected to the controller through a WiFi access point.
  • WiFi SSID is captured to improve location discovery.


Data storage and retention

Agent data collection and aggregation

  • Agents collect probing data after each probing event. Sending an ICMP packet to a target is an example of a probing event.
  • Each probing event creates probing data points for each metric.
    • An ICMP agent-to-server probing event creates two data points: one for loss and one for round trip time (RTT), which is used to determine latency and jitter.
    • An HTTP Response Time probing event creates many more probing data points, such as DNS lookup time, SSL negotiation, and response code.
  • The probing interval is the time between probing events. Shorter probing intervals result in more probing events, and therefore more probing data, each minute.
Example: ICMP Agent-to-Server probing data points per metric each minute
Data points per metric | 60-second probing interval | 1-second probing interval
Loss | 1 | 60
RTT | 1 | 60
Test Values | 2 | 120
  • These data points are saved to the agent’s local memory.
  • Every 60 seconds, the data points are uploaded to the controller and aggregated into five metric values:
    • Mean: the average test value of all data points for the minute.
    • Median: the middle test value of all data points for the minute.
    • Max: the highest observed test value during the minute.
    • 95th percentile: the test value that separates the lowest 95% of data points from the highest 5% for the minute.
    • 99th percentile: the test value that separates the lowest 99% of data points from the highest 1% for the minute.
Example: Metric values for each minute of ICMP probing to a single target
Metric | Mean | Median | Max | 95th Percentile | 99th Percentile
Latency | 1 | 1 | 1 | 1 | 1
Jitter | 1 | 1 | 1 | 1 | 1
Loss | 1 | 1 | 1 | 1 | 1
Metric Values Per Minute | 3 | 3 | 3 | 3 | 3
  • Because probing data points for each sixty seconds of probing are aggregated into five metric values, shorter and more frequent probing does not result in more metric values.
  • While more frequent probing doesn’t create more metric values, it may make them more representative because each metric value is derived from more probing data points. Example:
    • 30-second probing intervals trigger two tests for each metric every minute: one at the start of the minute and one after 30 seconds.
    • The Max latency metric value is the higher of the two data points.
    • One-second probing intervals trigger 60 data points every minute, and the resulting Max latency metric value is more likely to capture a spike in latency.
Probing intervals are configurable by each metric type
Metric type | Shortest/Most Frequent | Longest/Least Frequent
ICMP | 1 second | 10 minutes
UDP | 100 ms | 10 minutes
HTTP | 30 seconds | 10 minutes
Speed Test | Defined hourly probing intervals: 1, 6, 12, or 24 hours
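
The per-minute aggregation described in this section can be sketched as follows. The function names are hypothetical, and the nearest-rank percentile method is an assumption; the source does not say which percentile method the controller uses.

```python
import math
import statistics


def data_points_per_minute(probing_interval_s):
    """Number of probing events per metric in one minute."""
    return round(60 / probing_interval_s)


def percentile(values, pct):
    """Nearest-rank percentile (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate_minute(data_points):
    """Collapse one minute of data points into five metric values."""
    return {
        "mean": statistics.fmean(data_points),
        "median": statistics.median(data_points),
        "max": max(data_points),
        "p95": percentile(data_points, 95),
        "p99": percentile(data_points, 99),
    }


# Hypothetical minute of 1-second ICMP probing: 59 steady RTTs and one spike.
rtts_ms = [20.0] * 59 + [180.0]
values = aggregate_minute(rtts_ms)
print(data_points_per_minute(1), len(values))  # 60 5
print(values["median"], values["p95"], values["p99"], values["max"])
# 20.0 20.0 180.0 180.0
```

Whatever the probing interval, each metric still yields five values per minute. In this example the single spike is visible in the max and 99th percentile values but not in the median or 95th percentile.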

Metric values in the time series database

  • Metric values are stored in the time series database in alignment periods.
  • For the first 14 days, metric values use a minute alignment period, which means the metric values for each minute can be viewed or downloaded with the dashboard or queried by the API.
  • Over time, one-minute metric values are collapsed into longer periods of time called alignment periods.
  • As the length of time since the data was saved increases, the length of the alignment period becomes longer.
  • Alignment periods enable Service Experience Insights to provide a performant time series database, rapid API data retrieval, and robust dashboard data visualizations in a scalable and affordable network monitoring and troubleshooting SaaS solution.
Alignment periods of metric values in the time series database
Age of data | Alignment Period
1 minute to 14 days | 1 minute
15 to 28 days | 3 minutes
29 to 42 days | 5 minutes
43 to 184 days | 1 hour
185 to 366 days | 3 hours
367 to 732 days | 6 hours
733 to 1463 days | 12 hours
≥ 1464 days | 1 day
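
The table above can be expressed as a simple lookup, for example when deciding which resolution to expect for a queried time range. The function `alignment_period` is a hypothetical helper, not part of the product API.

```python
# (maximum age in days, alignment period), taken from the table above.
ALIGNMENT_PERIODS = [
    (14, "1 minute"),
    (28, "3 minutes"),
    (42, "5 minutes"),
    (184, "1 hour"),
    (366, "3 hours"),
    (732, "6 hours"),
    (1463, "12 hours"),
]


def alignment_period(age_days):
    """Return the alignment period for data of the given age in days."""
    for max_age_days, period in ALIGNMENT_PERIODS:
        if age_days <= max_age_days:
            return period
    return "1 day"  # 1464 days and older


for age in (1, 15, 200, 1464):
    print(age, alignment_period(age))
# 1 1 minute
# 15 3 minutes
# 200 3 hours
# 1464 1 day
```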

Data retention

  • Agent local storage data
    • Locally saved data is discarded after it is uploaded to the controller.
    • The agent’s maximum data storage buffer is one hour.
    • If the agent is unable to reach the controller for more than one hour, tests continue, and the buffered data older than one hour are overwritten with each new minute of data. Only the last hour of probing data is stored locally.
    • When the agent reconnects to the controller, the last hour of probing data is uploaded to the time series database and erased from local memory.
  • Time series data
    • The length of time data is stored in the time series database is defined by the service description provided by the service provider and is not configurable by the end user.
    • The most common data retention time is 90 days. In this case, metric values older than 90 days are permanently deleted from the time series database and cannot be exported in bulk before deletion.
    • Data alignment periods are fixed and cannot be configured to shorter periods.

