VVR replication bandwidth stays low on a 1 Gbps network with about 20 ms latency, even lower than on a 100 Mbps network
----- VVR stat check
vxprint -Pl|egrep -i '^rlink|^protocol'
vxrlink -g datadg stats -i5 rlk_drserver201_datadg_rvg
vrstat -g datadg
vxtune vol_rp_increment 16
https://iconnect-symwise.symantec.com/infocenter/index?page=content&id=TECH182489&locale=en_US
With vol_rp_increment and vol_rp_decrement set to 8 (the default):
1. On a 1 Gbps network with 20 ms one-way latency, VVR performance is as follows:
Bandwidth Utilization 16.20 Mbps.
Bandwidth Utilization 21.60 Mbps.
Bandwidth Utilization 19.80 Mbps.
Bandwidth Utilization 20.60 Mbps. << Unreasonably low BW utilization
Bandwidth Utilization 17.60 Mbps.
Bandwidth Utilization 15.40 Mbps.
Bandwidth Utilization 19.00 Mbps.
Network Statistics:
        Messages                 Errors                      Flow Control
        --------                 ------                      ------------
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
   105   53760        47       23       0       0     105    172000         1       67
Bandwidth Utilization 21.00 Mbps.
Thu Jan 19 14:37:18 2012
Replicated Data Set rvg0:
Data Status:
10.200.18.185: DCM contains 8855808 Kbytes.
Network Statistics:
        Messages                 Errors                      Flow Control
        --------                 ------                      ------------
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
   109   55808        52       14       0       0     109    173000         1       93
Bandwidth Utilization 21.80 Mbps.
2. On a 100 Mbps network with 20 ms latency, VVR performance is as follows:
Network Statistics:
        Messages                 Errors                      Flow Control
        --------                 ------                      ------------
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
   270  138240        60       42       0       0       0    731000         1       76
Bandwidth Utilization 54.00 Mbps.
Thu Jan 19 16:13:19 2012
Replicated Data Set rvg0:
Data Status:
10.200.18.185: DCM contains 9991936 Kbytes.
Network Statistics:
        Messages                 Errors                      Flow Control
        --------                 ------                      ------------
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
   308  157696        77       48       0       0       0    803000         1       94
Bandwidth Utilization 53.20 Mbps.
Bandwidth Utilization 54.00 Mbps.
Bandwidth Utilization 61.60 Mbps. << Better than 1 Gbps n/w
Bandwidth Utilization 65.40 Mbps.
Bandwidth Utilization 42.00 Mbps.
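The Blocks column and the Bandwidth Utilization figure in these samples can be cross-checked against each other. The sketch below is only a sanity check under three assumptions that the output itself does not state: a block is 512 bytes, each sample covers a 10-second interval, and "Mbps" means 2^20 bits per second. The function name `blocks_to_mbps` is hypothetical.

```python
BLOCK_BYTES = 512   # assumption: one block is a 512-byte sector
INTERVAL_S = 10     # assumption: each statistics sample covers 10 seconds
MEGABIT = 2 ** 20   # assumption: "Mbps" here means 2^20 bits per second


def blocks_to_mbps(blocks, interval_s=INTERVAL_S):
    """Convert a per-interval Blocks count into an average rate in Mbps."""
    bits = blocks * BLOCK_BYTES * 8
    return bits / MEGABIT / interval_s


# Blocks values taken from the 1 Gbps and 100 Mbps samples above.
for blocks in (53760, 138240):
    print(f"{blocks} blocks -> {blocks_to_mbps(blocks):.2f} Mbps")
```

Under these assumptions, 53760 blocks works out to 21.00 Mbps and 138240 blocks to 54.00 Mbps, which agrees with the Bandwidth Utilization lines reported next to those samples.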
Note that for UDP connections, the vol_rp_increment and vol_rp_decrement values control how VVR's flow control responds to network feedback. Higher values (32/32) mean that it takes 32 retries before VVR reduces its window size. The default for both tunables is 8. With this default setting, 'NW Bytes' stays unreasonably low for a 1 Gbps network.
Setting vol_rp_increment and vol_rp_decrement to 32 increases 'NW Bytes' and improves performance:
> vxtune vol_rp_increment 32
> vxtune vol_rp_decrement 32
The resulting bandwidth utilization on the 1 Gbps network:
Bandwidth Utilization 159.40 Mbps.
Bandwidth Utilization 174.20 Mbps.
Bandwidth Utilization 163.00 Mbps.
Bandwidth Utilization 170.60 Mbps.
Bandwidth Utilization 151.20 Mbps. << Better Performance
Bandwidth Utilization 121.80 Mbps.
Bandwidth Utilization 165.40 Mbps.
Bandwidth Utilization 187.60 Mbps.
Bandwidth Utilization 195.20 Mbps.
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
   981  502272        57      195       0       0       0   2409000         1       63
Bandwidth Utilization 196.20 Mbps.
Thu Jan 19 19:01:14 2012
Replicated Data Set rvg0:
Data Status:
10.200.18.185: DCM contains 4463360 Kbytes.
Network Statistics:
        Messages                 Errors                      Flow Control
        --------                 ------                      ------------
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
  1000  512000        56      203       0       0       0   2407000         1       72
Bandwidth Utilization 200.00 Mbps.
Thu Jan 19 19:01:24 2012
Replicated Data Set rvg0:
Data Status:
10.200.18.185: DCM contains 4232960 Kbytes.
Network Statistics:
        Messages                 Errors                      Flow Control
        --------                 ------                      ------------
     #  Blocks  RT(msec)  Timeout  Stream  Memory  Delays  NW Bytes  NW Delay  Timeout
SLESa0428
   819  419328        46      207       0       0       0   2389000         1       58
Bandwidth Utilization 163.80 Mbps.
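To compare the two 1 Gbps runs more directly, the "Bandwidth Utilization" lines can be extracted from captured output and averaged. The following is a minimal sketch, not part of any VVR tooling: the helper name `bandwidth_samples` is hypothetical, and the two sample strings simply reuse the standalone values listed above for the 8/8 and 32/32 settings.

```python
import re
from statistics import mean

# Matches lines such as "Bandwidth Utilization 16.20 Mbps."
PATTERN = re.compile(r"Bandwidth Utilization\s+([\d.]+)\s+Mbps")


def bandwidth_samples(text):
    """Return every Bandwidth Utilization value (in Mbps) found in text."""
    return [float(value) for value in PATTERN.findall(text)]


# 1 Gbps network, vol_rp_increment/vol_rp_decrement = 8
before = """\
Bandwidth Utilization 16.20 Mbps.
Bandwidth Utilization 21.60 Mbps.
Bandwidth Utilization 19.80 Mbps.
Bandwidth Utilization 20.60 Mbps.
Bandwidth Utilization 17.60 Mbps.
Bandwidth Utilization 15.40 Mbps.
Bandwidth Utilization 19.00 Mbps.
"""

# 1 Gbps network, vol_rp_increment/vol_rp_decrement = 32
after = """\
Bandwidth Utilization 159.40 Mbps.
Bandwidth Utilization 174.20 Mbps.
Bandwidth Utilization 163.00 Mbps.
Bandwidth Utilization 170.60 Mbps.
Bandwidth Utilization 151.20 Mbps.
Bandwidth Utilization 121.80 Mbps.
Bandwidth Utilization 165.40 Mbps.
Bandwidth Utilization 187.60 Mbps.
Bandwidth Utilization 195.20 Mbps.
"""

avg_before = mean(bandwidth_samples(before))
avg_after = mean(bandwidth_samples(after))
print(f"average with 8/8:   {avg_before:.1f} Mbps")
print(f"average with 32/32: {avg_after:.1f} Mbps")
print(f"improvement factor: {avg_after / avg_before:.1f}x")
```

The same function can be pointed at a saved capture of the statistics output instead of the inline strings.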
When UDP is the replication protocol, VVR uses its own network flow control. VVR increases or decreases the rate at which it sends data depending on the number of timeouts or memory errors it sees per second. If the number of errors is high, VVR decreases the sending rate to avoid network congestion. If there are few or no errors, VVR keeps increasing the sending rate by a fixed amount every second. On a lossy network, a large number of errors may occur, which prevents VVR from increasing the sending rate. However, these errors are not caused by network congestion, so VVR should continue to increase the sending rate.
Two tunables specify the error tolerance VVR uses: vol_rp_increment and vol_rp_decrement. VVR increases its sending rate if timeouts or memory errors per second are not more than the vol_rp_increment value. VVR decreases its sending rate if timeouts or memory errors per second are more than vol_rp_decrement. The default value of both tunables is 8. On a lossy network, the sending rate does not increase because the number of errors per second can exceed vol_rp_increment or vol_rp_decrement, and the sending rate can decrease further. This hurts replication performance. If RLINK statistics show a high number of errors and VVR is not using the available bandwidth, you may be able to improve replication performance by tuning vol_rp_increment and vol_rp_decrement to a higher value such as 16 or 32.
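The threshold rule described above can be illustrated with a toy model. This is a sketch of the decision logic as stated in the text, not VVR's actual implementation: the function names, the `step` and `backoff` parameters, and the error counts are all hypothetical, and the text does not say by how much the rate falls on a decrease.

```python
def next_rate(rate, errors_per_sec, increment=8, decrement=8,
              step=1.0, backoff=0.5):
    """Apply one per-second adjustment to the sending rate.

    increment/decrement stand in for vol_rp_increment/vol_rp_decrement.
    step and backoff are hypothetical illustration values.
    """
    if errors_per_sec > decrement:
        return rate * backoff      # too many errors: slow down
    if errors_per_sec <= increment:
        return rate + step         # tolerable error count: speed up
    return rate                    # between the two thresholds: hold


def simulate(errors, increment, decrement, start=10.0):
    """Run the adjustment once per entry in errors and return the final rate."""
    rate = start
    for errors_per_sec in errors:
        rate = next_rate(rate, errors_per_sec, increment, decrement)
    return rate


# Hypothetical lossy link: 9 to 15 errors every second, none due to congestion.
lossy = [12, 10, 15, 11, 9, 14]
print("thresholds 8/8:  ", simulate(lossy, 8, 8))
print("thresholds 32/32:", simulate(lossy, 32, 32))
```

With thresholds of 8, every second in this made-up trace counts as an overload and the modeled rate only shrinks. With thresholds of 32, the same trace is tolerated and the rate climbs, which is the behavior the tuning is meant to produce on a lossy but uncongested link.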
Since there are no memory errors in this case, the vol_rp_increment/vol_rp_decrement settings are dictated by Timeout errors.
Specifically,
1. vol_rp_increment should be larger than the number of Timeout errors seen per second. This allows VVR to increase the sending rate and improve bandwidth utilization. However, there is no point in increasing the sending rate if the timeouts are caused by network congestion, because a higher sending rate would only make the congestion worse. So this suggestion (keep vol_rp_increment larger than the Timeout errors per second) applies only if the timeouts occur primarily because the network is lossy.
2. By the same reasoning, vol_rp_decrement should be kept above the number of Timeout errors per second. Again, if the customer’s network is congested rather than lossy, increasing the tunable value may backfire.
In short, high values for these tunables may worsen performance if the network is too congested. If the network can handle the load, these tunables can be set to a higher value.
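The rule of thumb in the two points above (keep the tunable above the observed Timeout errors per second) can be written as a small helper. This is only a sketch: `pick_tunable` is a hypothetical name, the candidate values 8, 16, and 32 are the default and the two higher values mentioned in this note, and the sample rates are made up. It cannot tell a lossy network from a congested one, so that judgment still has to be made separately.

```python
def pick_tunable(timeouts_per_sec, candidates=(8, 16, 32)):
    """Return the smallest candidate strictly above the observed rate.

    Returns None when no candidate is high enough, which is a hint to
    investigate the network rather than keep raising the tunable.
    """
    for value in sorted(candidates):
        if value > timeouts_per_sec:
            return value
    return None


# Hypothetical observed Timeout errors per second.
for rate in (5, 12, 20, 40):
    print(f"{rate} timeouts/s -> {pick_tunable(rate)}")
```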
Tune vol_rp_increment and vol_rp_decrement to a higher value, such as:
vxtune vol_rp_increment 32
vxtune vol_rp_decrement 32
Applies To
VRTSvxvm 5.0MP4RP1HF3_VIS_SLES9