
Metrics


Metrics Overview

Agent-to-target probing measures end-to-end network connectivity. Agents send packets to targets and measure the response. Hops along the way forward the packets to the final destination. This enables the viewer to quickly spot any changes in trends and find the affected parts of the network.

Packets are sent once, and each hop forwards the packet along the path to the target. The cumulative time it takes each hop to forward the packet to the next hop determines the latency between the agent and the target.

This article describes the types of metrics provided by agents.


Latency, Jitter, Loss metrics

Latency

  • Measured in milliseconds, latency is the round trip time (RTT) that ICMP, HTTP, and UDP synthetic packet traffic takes from the agent to the target and back through the network
  • In the case of HTTP, the reported values are TCP latency and jitter reported by the OS for the probing TCP connection.
  • Latency greater than 150 ms results in poor quality real-time communications (audio and video calls), slow application response, and network degradation that is noticeable to most end users
  • Applications depending on lower latency response times may become unusable during periods of high latency
  • The distance between the agent and target directly affects latency

Jitter

  • Jitter is the absolute value of the difference in round trip latency between two consecutive packets.
  • While jitter affects all traffic, high jitter is most noticeable to end users during audio or video calls
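
For example, the jitter series can be derived directly from consecutive round trip latency samples. The snippet below is a minimal illustration with made-up latency values, not the agent's implementation:

```python
# Round trip latencies (ms) from consecutive probes; values are made-up examples.
latencies_ms = [24.1, 25.3, 23.9, 31.0, 24.5]

# Jitter is the absolute difference in latency between consecutive probes.
jitter_ms = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
print([round(j, 1) for j in jitter_ms])   # [1.2, 1.4, 7.1, 6.5]
```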

Loss

  • Loss is the percentage of ICMP or UDP packets lost in the alignment period selected.
  • The agent determines packet loss differently for ICMP and UDP, as described below.

Methods for determining packet loss

ICMP
  • Timeout: if the probing interval is between 1 and 5 seconds, the timeout for determining packet loss is equal to the probing interval (1 to 5 seconds). If the probing interval is more than 5 seconds, the timeout is 5 seconds.
  • Out of order packets: always considered packet loss.

UDP
  • Timeout: a response not received within 5 seconds is considered loss.
  • Out of order packets: N/A; packet order is not considered for packet loss.
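
These rules can be summarized in a short sketch. The following Python snippet is purely illustrative (the function names and signatures are hypothetical, not the agent's code); it only encodes the loss-classification rules described above.

```python
# Hypothetical helpers that encode the loss rules above; not the agent's code.

def icmp_timeout(probing_interval_s: float) -> float:
    """ICMP loss timeout equals the probing interval, capped at 5 seconds."""
    return min(probing_interval_s, 5.0)

def is_lost(protocol: str, rtt_s, out_of_order: bool, probing_interval_s: float) -> bool:
    """Classify a single probe as lost. rtt_s is None when no response arrived."""
    if protocol == "ICMP":
        timeout = icmp_timeout(probing_interval_s)
        # Out of order ICMP responses are always counted as loss.
        return out_of_order or rtt_s is None or rtt_s > timeout
    if protocol == "UDP":
        # UDP uses a fixed 5-second timeout and ignores packet order.
        return rtt_s is None or rtt_s > 5.0
    raise ValueError(f"unsupported protocol: {protocol}")
```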

HTTP and speed test targets do not report packet loss. This changes how metrics are displayed in the Dashboard:

  • The loss plot is replaced by plots for HTTP metrics when only HTTP targets are selected
  • Selecting HTTP and ICMP or UDP targets displays both the loss and HTTP metrics plots
  • To ensure plot time alignment and to maintain line color consistency, all target labels appear on the left of each plot
  • No plot line is visible for the HTTP target labels in the loss plot
  • No plot line is visible for the ICMP target labels in the HTTP plots


HTTP Metrics

HTTP availability refers to consistent and reliable access to web resources through the Hypertext Transfer Protocol (HTTP). In the context of websites and web services, availability is a crucial aspect of providing a seamless user experience. Monitoring HTTP timing shows performance bottlenecks in client-to-server or server-to-server communications.

Service Experience Insights monitors the network quality from the agent to the host server but does not provide application monitoring.

HTTP Request Response Time

  • Hover over any point in the plot to view the response code and request response time for each step of the synthetic HTTP traffic between the agent and target
  • Metrics are shown in milliseconds
  • Metrics explained
    • DNS Lookup: Time spent performing the DNS lookup. DNS lookup resolves domain names to IP addresses. Every new domain requires a full round trip to do the DNS lookup. There is no DNS lookup when the destination is already an IP address.
    • TCP Connection: Time it took to establish a TCP connection between a source host and destination host. Connections must be properly established in a multi-step handshake process. TCP connection is managed by an operating system. If the underlying TCP connection cannot be established, the OS-wide TCP connection timeout will overrule the timeout config of our application.
    • TLS handshake: Time spent completing a TLS handshake. During the handshake process, endpoints exchange authentication and keys to establish or resume secure sessions. There is no TLS handshake without an HTTPS request.
    • Time to First Byte (TTFB): Time spent waiting for the initial response. This time captures the latency of a round trip to the server in addition to the time spent waiting for the server to process the request and deliver the response.
    • Content Transfer: Time spent receiving the response data. The size of the response data and the available network bandwidth determine its duration.
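
For readers who want to reproduce a similar breakdown outside the product, the sketch below shows one way to collect comparable phase timings with libcurl via pycurl. This is not the agent's implementation; the helper name and example URL are illustrative, and libcurl reports times in seconds rather than milliseconds.

```python
import pycurl
from io import BytesIO

def http_timings(url: str) -> dict:
    """Fetch a URL and return libcurl's cumulative phase timings (seconds)."""
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.perform()
    timings = {
        "dns_lookup": c.getinfo(pycurl.NAMELOOKUP_TIME),
        "tcp_connect": c.getinfo(pycurl.CONNECT_TIME),
        "tls_handshake": c.getinfo(pycurl.APPCONNECT_TIME),   # 0 for plain HTTP
        "ttfb": c.getinfo(pycurl.STARTTRANSFER_TIME),
        "total": c.getinfo(pycurl.TOTAL_TIME),
        "response_code": c.getinfo(pycurl.RESPONSE_CODE),
    }
    c.close()
    # libcurl timings are cumulative; the content transfer phase is total minus TTFB.
    timings["content_transfer"] = timings["total"] - timings["ttfb"]
    return timings

print(http_timings("https://example.com"))
```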

HTTP Availability

  • HTTP probing does not return loss metrics; instead, this type of probing returns the percentage of packets reaching the target
    • 100% HTTP availability shows that all synthetic HTTP traffic reached the target
    • <100% HTTP availability indicates the level of HTTP traffic degradation
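
As a simple illustration of the figure above (a hypothetical calculation, not product code), availability is the share of synthetic HTTP probes that reached the target during the period:

```python
def http_availability(successful_probes: int, total_probes: int) -> float:
    """Percentage of synthetic HTTP probes that reached the target."""
    if total_probes == 0:
        return 0.0
    return 100.0 * successful_probes / total_probes

print(http_availability(60, 60))   # 100.0 -> all probes reached the target
print(http_availability(57, 60))   # 95.0  -> some HTTP traffic degradation
```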


Connectivity

  • An agent is considered online as long as it posts connectivity metrics to the controller every minute.
  • An agent unable to send metrics to the controller will show a connectivity issue. While running, the agent will continue sending probing traffic to reachable targets. This data is stored in the agent’s memory for 1 hour and is added to the time series database when the agent reconnects to the controller.
  • If the agent is unable to reach the controller for more than 1 hour, probing will continue, and the buffered metrics that are older than 1 hour will be discarded.
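
A minimal sketch of this buffering behavior is shown below, assuming one entry of metric values per minute; the data structure and function names are hypothetical and only illustrate the one-hour rolling buffer described above.

```python
from collections import deque

# Hypothetical one-hour rolling buffer: 60 entries, one per minute. Once the
# buffer is full, each new minute overwrites the oldest buffered minute.
metrics_buffer = deque(maxlen=60)

def record_minute(minute_metrics: dict) -> None:
    """Store this minute's metric values locally while the controller is unreachable."""
    metrics_buffer.append(minute_metrics)

def flush_on_reconnect(upload) -> None:
    """On reconnect, upload the buffered metric values (at most one hour), then clear."""
    while metrics_buffer:
        upload(metrics_buffer.popleft())
```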

Red indicates when static and cloud agents were disconnected

Gray indicates when mobile agents were disconnected. Gray is used for mobile agents because they are expected to be offline when the host PC is not in use


Path Discovery

Path discovery is an optional setting in Probing Distributions

Once enabled, agents send synthetic path discovery traffic to probe all available paths or hops between the agent and the target

Click on a plot line to reveal the IP, ASN, ISP, and latency for each hop

Path Discovery Methodology

  • The IP header of every packet contains a field called TTL (Time to Live) that helps prevent infinitely looping packets.
  • A router decrements the TTL value of an IP packet by 1 before forwarding it to the next hop
  • When the TTL value reaches 0, the packet is discarded and the router returns an ICMP TTL Exceeded message to the source IP of the packet
  • Using this mechanism, the agent discovers the routers between a host and a destination by sending packets with increasing TTL until the target host is reached, saving the source IP address of each incoming ICMP TTL Exceeded message
  • The ICMP Error messages include the actual packet in the payload, so the sender is able to match the TTL Exceeded message to the actual packet that raised the error
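
The following sketch illustrates this TTL-based discovery using Scapy ICMP echo requests. It is an illustration of the general technique, not the agent's implementation; the target address is a documentation example and the script requires raw-socket (root/administrator) privileges.

```python
from scapy.all import IP, ICMP, sr1   # requires root/administrator privileges

def icmp_trace(target: str, max_ttl: int = 32):
    """Send ICMP Echo Requests with increasing TTL and record which hop answers."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        reply = sr1(IP(dst=target, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        if reply is None:
            hops.append((ttl, None))            # hop did not answer (filtered or lost)
            continue
        hops.append((ttl, reply.src))           # source of TTL Exceeded or Echo Reply
        if reply.haslayer(ICMP) and reply[ICMP].type == 0:
            break                               # ICMP Echo Reply -> target reached
    return hops

print(icmp_trace("192.0.2.1"))                  # example target (TEST-NET address)
```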

Traceroute Concept

ICMP Traceroute

ICMP echo request message is sent to the target host with incrementing TTL

  • The response is typically matched to the request based on the fields ID and Sequence number in the ICMP header.
  • The execution stops when we reach the target hop (received ICMP Echo Reply message), or we reach the configured maximum TTL (typically 32 to 64).
  • Routers typically use only the source IP, destination IP, and protocol to route ICMP, so it is not possible to discover Equal-Cost Multi-Path (ECMP) routes using ICMP.
  • ECMP is a routing technique that allows traffic to be distributed across multiple paths of equal cost between the source and destination.
  • In traditional routing, traffic is sent over a single best path between two endpoints. But in ECMP, the network uses multiple equal-cost paths simultaneously to distribute the traffic. This increases the capacity and resiliency of the network and helps to avoid congestion.
  • ECMP can be used in many types of network topologies, including data center networks, enterprise networks, and service provider networks. It is supported by many routing protocols, including OSPF, IS-IS, and BGP.

UDP Traceroute

  • UDP Traceroute is implemented by sending UDP packets to the target host and port number 33434 (IANA Reserved for traceroute).
  • On Windows, the tracert command is executed to obtain the path trace.
  • On Linux and Mac, to discover ECMP paths, the implementation changes the source port number for each subsequent attempt. By default, 15 attempts are made to discover the ECMP paths, and the number of attempts is configurable via the API.
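
A minimal sketch of this approach is shown below, again using Scapy as an illustration rather than the agent's actual implementation; the helper name, source port range, and example target are assumptions, and root privileges are required.

```python
import random
from scapy.all import IP, UDP, ICMP, sr1   # requires root privileges

def udp_trace(target: str, sport: int, max_ttl: int = 32):
    """UDP probes to port 33434 with a fixed source port for one trace attempt."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        probe = IP(dst=target, ttl=ttl) / UDP(sport=sport, dport=33434)
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            hops.append((ttl, None))
            continue
        hops.append((ttl, reply.src))
        if reply.haslayer(ICMP) and reply[ICMP].type == 3:
            break                              # Destination/Port Unreachable -> target reached
    return hops

# Varying the source port per attempt lets ECMP hashing select different equal-cost paths.
paths = [udp_trace("192.0.2.1", sport=random.randint(33000, 65000)) for _ in range(15)]
```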

TCP Traceroute

  • TCP Traceroute is implemented by sending a TCP SYN packet to the target host and a port number on which a known application is running (e.g., 80 or 443)
  • The target node responds with a SYN-ACK or an RST.
  • To discover ECMP paths, the implementation can change the source port for each subsequent attempt.
  • TCP Traceroute has a better chance of reaching the destination when the destination is a TCP application.
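
The sketch below shows the same idea with a TCP SYN probe toward a well-known port; as with the previous examples, it is an illustrative Scapy script (not the product's code), and the source port could be varied per attempt to explore ECMP paths.

```python
from scapy.all import IP, TCP, sr1   # requires root privileges

def tcp_trace(target: str, dport: int = 443, max_ttl: int = 32):
    """TCP SYN probes with increasing TTL toward a known application port."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        probe = IP(dst=target, ttl=ttl) / TCP(dport=dport, flags="S")
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            hops.append((ttl, None))
            continue
        hops.append((ttl, reply.src))
        if reply.haslayer(TCP):                # SYN-ACK or RST from the target itself
            break
    return hops

print(tcp_trace("192.0.2.1"))
```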

TCP Traceroute (Privileged)

  • Uses a raw socket to handcraft TCP packets.
  • Attempts to make the subsequent packets with increasing TTL look like TCP retransmissions in order to get through firewalls.
  • Uses the same sequence number and window size for all TCP packets in one path trace attempt.
  • Only supported by the Linux agent.

Configure firewalls to allow traceroute

  • Firewalls typically block traceroute probing traffic. Avoid this by opening the destination and source firewalls as follows:
  • Configure destination firewalls to allow:
    • Incoming ICMP echo request for ICMP traceroute.
    • UDP port 33434 (and 33435 to 33655) for UDP traceroute
  • Configure firewalls at the sources to allow:
    • Incoming ICMP control messages (such as TTL Exceeded, Destination Port Unreachable etc.)
  • An increasing number of SaaS application endpoints block these messages, preventing the path trace from discovering the last hop.
  • It is not possible to determine the distance to the last hop from the last discovered hop.


Device and WiFi metrics

  • Mobile agents return device and WiFi metrics data for their PC host
  • WiFi metrics and device data are captured from the PC host system’s data
  • The SSID is accessed for improved location discovery


Speed Test

  • Speed tests are configurable in probing distributions, and agents configured for speed tests also offer on-demand speed testing from the agent page in the dashboard.
  • Speed test targets are Managed Targets (agents) configured to receive speed test probing from agents. These speed test targets are typically installed in cloud environments nearest the service hosts (i.e. installed in a client’s AWS region where a critical service is also hosted).
  • The speed test probing interval starts at 1 hour.
  • The test package size is 100 MB.
  • On-demand speed test data is also added to the time series and appears in the plots (after about a two-minute delay from the time the on-demand speed test concluded).
  • Confirm the agent’s outbound access to Speed Test targets.
    • Speed Test uses HTTP CONNECT requests to upgrade the HTTP connection to WebSocket, and some HTTP proxy servers reject HTTP CONNECT requests to non-SSL ports by default.
    • If an HTTP proxy server is configured on the host OS or SEI agent AND the Speed Test server is not SSL enabled, confirm that your HTTP proxy server allows HTTP CONNECT requests to non-SSL ports.

See Bandwidth Estimation to learn more about the bandwidth consumed by speed tests.

Example of on-demand Speed Tests

  • Note that an agent can be configured to send speed tests to multiple speed test targets


Data storage and retention

Agent data collection and aggregation

  • Agents collect test values after each probing event.
    • The most common example is an agent-to-server probing event where an agent sends ICMP probing packets to a target and receives a response.
    • Each probing event creates test values for each observed metric.
    • An ICMP agent-to-server probing event creates two test values: one for loss and one for round trip time (RTT), which is used for latency and jitter.
    • HTTP Response Time creates many more test values, such as DNS lookup time, SSL negotiation, response code, etc.
  • The probing interval determines the frequency of probing events each minute, as shown in the following example.
Example: ICMP agent-to-server test values per metric each minute

  Metrics                                Probing interval: 60 seconds   Probing interval: 1 second
  Loss (test values per probing event)   1                              1
  RTT (test values per probing event)    1                              1
  Test values per minute                 2                              120
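
The arithmetic behind the table is simple; the sketch below (a hypothetical helper, not part of the product) reproduces the two columns:

```python
def test_values_per_minute(probing_interval_s: float, values_per_probe: int = 2) -> int:
    """Probes per minute times the test values each probe produces (2 for ICMP: loss and RTT)."""
    probes_per_minute = 60 / probing_interval_s
    return int(probes_per_minute * values_per_probe)

print(test_values_per_minute(60))   # 2   -> one probe per minute
print(test_values_per_minute(1))    # 120 -> sixty probes per minute
```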

Test values for each minute are saved to the agent’s local memory and aggregated into metrics values.

Every 60 seconds, the test results for each metric are aggregated into five metric values:

  • Max: the highest observed test value during the minute.
  • Mean: the average of all test values for the minute.
  • Median: the middle test value of all data points for the minute.
  • 95th percentile: the test value that separates the lowest 95% of the data points from the highest 5% for the minute.
  • 99th percentile: the test value that separates the lowest 99% of the data points from the highest 1% for the minute.
Example: Metric values for each minute of ICMP probing to a single target

  Metrics                    Max   Mean   Median   95th Percentile   99th Percentile
  Latency                    1     1      1        1                 1
  Jitter                     1     1      1        1                 1
  Loss                       1     1      1        1                 1
  Metric Values Per Minute   3     3      3        3                 3
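
A minimal sketch of this per-minute aggregation is shown below; the function name is hypothetical and simply collapses one minute of test values for a single metric into the five metric values listed above.

```python
import statistics

def aggregate_minute(test_values: list[float]) -> dict:
    """Collapse one minute of test values for a metric into five metric values."""
    ordered = sorted(test_values)
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    percentiles = statistics.quantiles(ordered, n=100)
    return {
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": percentiles[94],
        "p99": percentiles[98],
    }

# e.g. sixty RTT test values from one minute of 1-second probing
print(aggregate_minute([24.0 + (i % 7) for i in range(60)]))
```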
  • Because all test values for each metric are aggregated into five metric values, shorter and more frequent probing intervals create more test values but do not result in more metric values.
  • While more frequent probing doesn’t create more metric values, it does improve them, because each metric value is aggregated from more test values collected by more frequent tests. Example:
    • 30-second probing intervals trigger two tests for each metric every minute; one at the start of the minute and one at 30 seconds.
    • The Max latency metric value is the higher test value from those two tests.
    • One-second probing intervals trigger 60 tests every minute, and the resulting Max latency metric value is 30 times more likely to capture a spike in latency.
  • Probing intervals are configurable by each metric type.
Probing intervals supported by metric type

  Metric type   Shortest/Most Frequent   Longest/Least Frequent
  ICMP          1 second                 10 minutes
  UDP           100 ms                   10 minutes
  HTTP          30 seconds               10 minutes
  Speed Test    1 hour                   24 hours (defined intervals of 1, 6, 12, or 24 hours)

Saving metrics values to the time series database

  • The agent attempts to connect with the controller every 60 seconds
  • If the connection is successful, the metric values from the previous 60 seconds are uploaded and stored in the time series database.
  • If the agent cannot connect to the controller, metric values continue to be created each minute and are saved in the agent’s local storage.
  • When the agent reconnects to the controller, the last hour of metric values is uploaded to the time series database.

Metrics values in the time series database

  • Metric values are stored in the time series database in alignment periods
  • For the first 14 days, metric values use a minute alignment period which means the metric values for each minute can be viewed or downloaded with the dashboard or queried by the API.
  • Over time, one-minute metric values are collapsed into longer periods of time called alignment periods, ranging from three minutes to one day.
  • As the length of time since the data was saved increases, the length of the alignment period becomes longer.
  • Alignment periods enable Service Experience Insights to provide a performant time series database, rapid API data retrieval, and robust dashboard data visualizations in a scalable and affordable network monitoring and troubleshooting SaaS solution.
Alignment periods of metrics values in the time series database

  Age of data            Alignment Period
  1 minute to 14 days    1 minute
  15 to 28 days          3 minutes
  29 to 42 days          5 minutes
  43 to 184 days         1 hour
  185 to 366 days        3 hours
  367 to 732 days        6 hours
  733 to 1463 days       12 hours
  ≥ 1464 days            1 day
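
The table can be read as a simple lookup from data age to alignment period; the sketch below is a hypothetical illustration (not a product API) that mirrors it:

```python
# (maximum age in days, alignment period) pairs mirroring the table above
ALIGNMENT_PERIODS = [
    (14, "1 minute"),
    (28, "3 minutes"),
    (42, "5 minutes"),
    (184, "1 hour"),
    (366, "3 hours"),
    (732, "6 hours"),
    (1463, "12 hours"),
]

def alignment_period(age_days: int) -> str:
    """Return the alignment period used for metric values of a given age."""
    for max_age, period in ALIGNMENT_PERIODS:
        if age_days <= max_age:
            return period
    return "1 day"                      # 1464 days and older

print(alignment_period(10))             # "1 minute"
print(alignment_period(200))            # "3 hours"
```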

Data retention

  • Agent local storage data
    • Locally saved data is discarded after it is uploaded to the controller.
    • The agent’s maximum data storage buffer is one hour.
    • If the agent is unable to reach the controller for more than one hour, tests continue, and the buffered data older than one hour are overwritten with each new minute of data. Only the last hour of test data is stored locally.
    • When the agent reconnects to the controller, the last hour of test data is uploaded to the time series database and erased from local memory.
  • Time series data
    • The length of time data is stored in the time series database is defined by the service description provided by the service provider and is not configurable by the end user.
    • The most common data retention time is 90 days. After the retention period, metric values are permanently deleted from the time series database and cannot be exported in bulk before deletion.
    • Data alignment periods are fixed and cannot be configured to shorter periods.
Service Experience Insights - Updated: 2023-06-07 22:38-UTC