What Is a Microburst?
A microburst occurs when a large amount of burst data arrives within milliseconds. A typical microburst lasts 1 to 100 milliseconds, so the instantaneous burst rate can be tens or hundreds of times the average data rate, or can even exceed the port bandwidth.
The NMS or network performance monitoring software calculates the real-time network bandwidth at an interval of seconds to minutes. At this interval, the network traffic appears stable, as shown in Figure 1-1, and no network exception is observed. However, one second is a long time for a port that sends and receives packets at high speed. At a finer granularity (for example, milliseconds), more traffic bursts can be observed, and the traffic rate exhibits a sawtooth pattern, as shown in Figure 1-2. Sharp sawtooth peaks indicate microbursts.

Figure 1-1 Traffic statistics at a coarser granularity
Figure 1-2 Traffic statistics at a finer granularity

Causes of a Microburst
Microbursts occur on a network due to the following causes:
Impact and Generation Process of a Microburst
When a microburst exceeds the forwarding capability of a switch, the switch buffers the burst data for later transmission. If the switch does not have sufficient buffer space, the excess data is discarded, causing congestion and packet loss.

The following figure shows a typical millisecond-level microburst scenario. Assume that Port1 and Port2 each send 5 MB of data to Port3 at a line rate of 10 Gbps, for a total transmission rate of 20 Gbps. Port3 supports only 10 Gbps, half of the total rate, so it sends out only 5 MB of data and must buffer the other 5 MB for later transmission. However, the switch has only 1 MB of buffer space, so 4 MB of data is discarded due to insufficient buffer space. Ignoring overhead such as the inter-frame gap, preamble, frame check sequence, and packet headers, the microburst duration is 4 ms (5 MB/10 Gbps). On a real network, the number of ingress ports is usually greater than the number of egress ports, so even more buffer space is consumed and more packets are discarded due to congestion when microbursts occur.

How to Evaluate the Anti-Burst Capability of a Switch?
RFC 4445 defines the delay factor (DF), a key indicator for measuring the transmission quality of data flows. DF indicates the delay and jitter of service traffic: a greater DF value indicates higher jitter. When a microburst occurs on the network, the DF can be measured in milliseconds. For switches, the DF can be converted into the required buffer size.
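The arithmetic in this scenario can be checked with a short sketch. All figures come from the text above; overhead such as inter-frame gaps and headers is ignored, as in the original calculation:

```python
# Back-of-the-envelope check of the microburst scenario: two 10 Gbps
# ingress ports each burst 5 MB toward one 10 Gbps egress port on a
# switch with 1 MB of buffer space.

GBPS = 1e9            # bits per second
MB = 8e6              # 1 MB = 10^6 bytes = 8e6 bits

ingress_rate = 2 * 10 * GBPS   # Port1 + Port2 arriving concurrently
egress_rate = 10 * GBPS        # Port3 line rate
buffer_bits = 1 * MB           # available switch buffer
burst_per_port = 5 * MB        # each ingress port sends 5 MB

burst_duration = burst_per_port / (10 * GBPS)           # time for one port to send 5 MB
excess = (ingress_rate - egress_rate) * burst_duration  # bits that must be buffered
dropped = max(0.0, excess - buffer_bits)                # bits lost to buffer overflow

print(f"burst duration: {burst_duration * 1e3:.0f} ms")   # 4 ms
print(f"data to buffer: {excess / MB:.0f} MB")            # 5 MB
print(f"data dropped:   {dropped / MB:.0f} MB")           # 4 MB
```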
The conversion formula is as follows:

Where:
T: buffer required by the switch
DF: delay and jitter of service traffic
W: bandwidth of the egress port
U: bandwidth usage of the egress port
A: sum of the physical bandwidths of the server ports where bursts occur
B: rate of the egress port
C: sum of the rates of the server ports where no bursts occur

Based on the preceding formula, the larger a switch's buffer is relative to the required buffer T, the stronger its anti-burst capability.

Why Can't a Microburst Be Monitored Through an NMS or the Observing Port of a Device?
Customers are accustomed to using an NMS or an observing port to monitor port traffic or the maximum rate of outgoing packets. However, neither the NMS nor the observing port can capture microbursts.

Monitoring Port Traffic Through the NMS
The NMS monitors port traffic around the clock and determines the traffic trend on the network from the traffic curves of devices. These curves appear smooth on the NMS even when microbursts (and the resulting packet loss) have occurred. The NMS obtains device data in SNMP get mode, which has the following disadvantages:
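The exact conversion formula is not reproduced here, but a first-order sketch is possible: RFC 4445 describes DF as the size of a virtual buffer expressed in units of time at the nominal media rate, so the buffer needed to ride out a burst is roughly DF multiplied by the egress rate. The 4 ms and 10 Gbps figures reuse the earlier scenario; treat this as an approximation, not the document's formula:

```python
# Rough DF-to-buffer conversion: buffer bytes ~= DF (seconds) x egress
# rate (bits/s) / 8. This ignores the utilization and per-port terms
# (U, A, B, C) of the full formula.

def required_buffer_bytes(df_seconds: float, egress_bps: float) -> float:
    """Approximate buffer needed to absorb jitter of df_seconds."""
    return df_seconds * egress_bps / 8

buf = required_buffer_bytes(4e-3, 10e9)
print(f"~{buf / 1e6:.0f} MB of buffer for DF = 4 ms at 10 Gbps")  # ~5 MB
```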
Why can't devices support millisecond-level traffic statistics? Switches have a large number of ports, and when a switch collects packet statistics, the CPU must traverse all of them, which limits how quickly port statistics can be obtained. As a result, the CPU cannot traverse all ports fast enough to capture traffic information during a microburst. In addition, millisecond-level polling on all ports consumes substantial CPU resources, which may affect normal services on the switch. Therefore, the switch cannot capture microburst information.

Monitoring the Maximum Rate of Outgoing Packets on the Observing Port
When microbursts (packet loss) occur on the network, the maximum rate of outgoing packets observed on the observing port of a switch is far lower than the port rate, and the reported port usage can even be lower than 10%. The forwarding chip of the switch records only the total number of received packets, so a packet rate can only be calculated by dividing the total packet count over a period by the length of that period. The measurement accuracy therefore depends on the length of the measurement period. By default, Huawei CloudEngine series switches calculate the average values of the peak and instantaneous traffic rates over a 300s period. The measurement period used for calculating the peak rate is configurable, with a minimum value of 10s. That is, the statistics accuracy is at best 10s.

How to Detect a Microburst?
A microburst occurs when any of the following conditions is met:
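To see why coarse-grained counters miss microbursts, consider a hypothetical port that bursts at line rate for 1 ms out of every 100 ms. Averaged over a 10 s window, the utilization looks negligible even though the instantaneous rate repeatedly hits 100%:

```python
# Simulate 10 s of traffic in 1 ms samples: one full-line-rate burst
# every 100 ms, idle otherwise. The long-window average hides the bursts.

line_rate = 10e9                                  # 10 Gbps, bits per second
bits_per_ms = [line_rate / 1000 if ms % 100 == 0 else 0.0
               for ms in range(10_000)]           # 10 s of 1 ms samples

avg_bps = sum(bits_per_ms) / 10.0                 # rate over a 10 s window
peak_bps = max(bits_per_ms) * 1000                # best 1 ms window, scaled to bits/s

print(f"10 s average: {avg_bps / 1e9:.2f} Gbps ({avg_bps / line_rate:.0%} usage)")
print(f"1 ms peak:    {peak_bps / 1e9:.0f} Gbps (100% usage)")
```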
Currently, you can detect microbursts by using Telemetry, a third-party packet capture and analysis tool, or the discarded packet capture function.
The following describes these methods in detail.

Using Telemetry to Monitor Microbursts
In a narrow sense, Telemetry is a device feature; in a broad sense, it is a closed-loop automatic O&M system. It is divided into the device side and the OSS side, and consists of network devices, a collector, an analyzer, and a controller, which can be provided by Huawei or third parties. In the Huawei Telemetry system, the network devices are CloudEngine switches, the collector and analyzer are both iMaster NCE-FabricInsight, and the controller is iMaster NCE-Fabric. Telemetry can be used to detect microbursts on a network. Figure 1-7 shows the Telemetry system architecture and data processing flow.

Figure 1-7 Telemetry system architecture and data processing flow

The following details the operations and data processing flow of the Huawei Telemetry system.
Monitoring Microbursts Using Packet Capture and Traffic Analysis Tools
You can use packet capture and traffic analysis tools (such as Wireshark) to monitor microbursts on a network. The procedure is as follows:
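The analysis step amounts to binning captured frames into millisecond windows and flagging windows whose rate approaches line rate. A minimal sketch, using synthetic (timestamp, frame length) records standing in for data exported from a tool such as Wireshark:

```python
# Bin frame bytes into 1 ms windows and flag windows whose instantaneous
# rate exceeds 80% of the 10 Gbps line rate. The records are synthetic:
# a ~0.8 ms burst of 1500-byte frames plus light background traffic.

from collections import defaultdict

packets = [(0.0101 + i * 1e-6, 1500) for i in range(800)]   # dense burst near t=10 ms
packets += [(0.5 + i * 1e-3, 100) for i in range(100)]      # sparse background frames

LINE_RATE = 10e9
bins = defaultdict(int)
for ts, length in packets:
    bins[int(ts * 1000)] += length * 8       # bits accumulated per 1 ms bin

for ms, bits in sorted(bins.items()):
    rate = bits * 1000                       # scale a 1 ms bin to bits/s
    if rate > 0.8 * LINE_RATE:
        print(f"microburst at t={ms} ms: {rate / 1e9:.1f} Gbps")
```

The same idea is what Wireshark's I/O graph shows when its interval is reduced to 1 ms: bursts invisible at 1 s resolution become sharp spikes.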
Monitoring Microbursts Using the Discarded Packet Capture Function
You can configure the discarded packet capture function to detect microbursts. The procedure is as follows:
How to Mitigate Microbursts on a Data Center Network?
A data center network runs a variety of services, such as management, storage, big data, computing, and video services, as shown in Figure 1-12.

Figure 1-12 Typical data center networking and traffic paths
Traffic Analysis
Preventive Measures
The most fundamental preventive measure is to reduce microbursts by optimizing server traffic. On the network side, you can use the following methods to reduce the likelihood of microbursts and mitigate their impact. The following is an example of configurations on Huawei CloudEngine switches:
How Do I Determine Whether a Microburst Occurs on an Interface Where Packets Are Discarded and the Bandwidth Usage Is Low?

Question
How do I determine whether a microburst occurs on an interface where packets are discarded and the bandwidth usage is low?

Products and Versions Involved
Products involved: CE6851HI, CE6855HI, CE6856HI, CE6860EI, and CE6865EI
Versions involved: V200R005C10 and later versions

Answer
Common Misunderstandings About a Microburst

Misunderstanding 1: Why is no alarm reported for microbursts upon buffer exhaustion?
The CPU uses a polling mechanism to query the buffer usage. If it polled frequently enough to catch microbursts, the CPU would be overloaded, and the switch would respond slowly or not at all.

Misunderstanding 2: The rate and utilization of an interface are currently low, so microbursts rarely occur on this interface.
This is a misconception. There is no linear relationship between the average rate and the burst rate. A low interface rate or utilization does not mean the burst traffic rate is low.

Misunderstanding 3: Switches record the number of packets discarded due to congestion, so switches cause microbursts or packet loss.
This is also a misconception. Burst traffic is generated by service terminals; apart from a small number of protocol packets, switches do not generate traffic. However, a traffic burst may be aggravated on a switch. For example, if multiple interfaces send data to a single port concurrently and the oversubscription ratio is improper, the burst peak is exacerbated. The source of traffic bursts therefore needs to be located based on the networking.

Misunderstanding 4: Service traffic is random on servers, so servers do not generate heavy traffic bursts.
The NIC of a server sends service packets at the maximum rate supported by its physical interface. For example, a 10GE interface sends service packets at 10 Gbps, then waits for subsequent packets from the application layer after it finishes sending. The physical link therefore transmits packets at the link rate: it works for a period of time and is idle for a period of time. The overall average bandwidth usage may be low, for example, 20% to 30%, but at any instant the bandwidth usage is either 100% (while the server is sending) or 0%. Such an on/off pattern is a traffic burst.
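The on/off transmission pattern of a server NIC can be sketched as follows; the 25%/75% duty cycle is an illustrative assumption, not a figure from the text:

```python
# A server link is either sending at full line rate (100% usage) or idle
# (0% usage); only the duty cycle determines the long-term average, which
# is why a modest average utilization can still hide line-rate bursts.

def average_usage(busy_ms: float, idle_ms: float) -> float:
    """Long-term utilization of an on/off link with the given duty cycle."""
    return busy_ms / (busy_ms + idle_ms)

u = average_usage(busy_ms=25, idle_ms=75)
print("instantaneous usage: 100% while sending, 0% while idle")
print(f"long-term average:   {u:.0%}")   # 25%
```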