https://www.veritas.com/support/en_US/article.HOWTO45767
https://www.veritas.com/support/en_US/article.000085514
This document attempts to explain the chain of events performed in relation to Veritas DMP and the related Solaris SCSI drivers (sd/ssd).
The I/O failover process has many layers in Solaris. Each layer has timeout values and/or retry counts that must be exceeded before the layer above is notified of a problem.
Solaris SCSI Disk Driver (sd or ssd)
The I/O retry values for sd and ssd are different. They are determined by the following sets of parameters:
sd:
SD_RETRY_COUNT default value 5
SD_IO_TIME default value 60
Please note that SD_RETRY_COUNT does not apply to Solaris 10.
ssd:
SSD_RETRY_COUNT default value 3
SSD_IO_TIME default value 60
The initial request is followed by the number of retries configured for the SCSI disk driver.
By default, the disk driver issues the initial request and then retries the I/O operation up to 5 times for sd devices or 3 times for ssd devices. The initial request and each retry have a default timeout of 60 seconds.
If the I/O request does not complete due to a problem at the LUN/disk layer, the sd driver incurs a total delay of 360 seconds: 60 seconds for the initial SCSI device timeout, followed by an additional 300 seconds for the retries (5 * 60).
The same applies to the ssd driver for the I/O operation. The only major difference is the lower retry count of 3.
Therefore, it takes 240 seconds (60 seconds for the initial SCSI device timeout, followed by an additional 180 seconds for the retries, 3 * 60) before a failure is signaled to the VxDMP layer.
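The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in this section; the function name scsi_driver_delay is hypothetical and is not part of any Solaris or Veritas tool.

```python
def scsi_driver_delay(retry_count: int, io_time: int) -> int:
    """Worst-case seconds before sd/ssd signals a failure to the layer above.

    One initial request plus retry_count retries, each of which may
    wait the full io_time before timing out.
    """
    attempts = 1 + retry_count
    return attempts * io_time


# Default values quoted in this document.
sd_delay = scsi_driver_delay(retry_count=5, io_time=60)
ssd_delay = scsi_driver_delay(retry_count=3, io_time=60)

print(f"sd:  {sd_delay} seconds")   # 60 + 5 * 60
print(f"ssd: {ssd_delay} seconds")  # 60 + 3 * 60
```

With the defaults this gives 360 seconds for sd and 240 seconds for ssd, matching the figures above.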
Recommendations:
To reduce the potential delay before an I/O operation failure is reported, reduce the applicable “sd_io_time” or “ssd_io_time” attribute for the disk driver in question from 60 to 30 seconds.
It is important that the value is defined in hexadecimal and is correctly populated in the /etc/system file, so that the attribute setting persists across a system restart:
60 seconds would be:
set sd:sd_io_time=0x3C
set ssd:ssd_io_time=0x3C
30 seconds would be:
set sd:sd_io_time=0x1E
set ssd:ssd_io_time=0x1E
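As a quick check of the hexadecimal values, the following Python sketch converts a timeout in seconds into the two /etc/system lines shown above. The helper name etc_system_lines is hypothetical; it only formats text and does not modify any system file.

```python
def etc_system_lines(seconds: int) -> list:
    """Format the sd and ssd timeout settings for /etc/system.

    The value is rendered in uppercase hexadecimal with a 0x prefix,
    in the same style as the examples in this document.
    """
    value = f"0x{seconds:X}"
    return [
        f"set sd:sd_io_time={value}",
        f"set ssd:ssd_io_time={value}",
    ]


for timeout in (60, 30):
    print(f"{timeout} seconds would be:")
    for line in etc_system_lines(timeout):
        print(line)
```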
Host Bus Adaptor (HBA)
The number of retries and the wait between retries are vendor dependent.
Vendor specifications may contain particulars for their device. The maximum delay before an application is notified of an I/O failure will be as follows:
HBA delay *
[sd/ssd]_RETRY_COUNT *
[sd/ssd]_IO_TIME *
dmp_retry_count *
number of DMP paths *
number of pending I/O's
NOTE: The timeout values are set by the target drivers (sd/ssd) and passed to the HBA layer. The HBA layer implements the timeout mechanism for the I/O subsystem.
The newer HBA drivers leave the I/O retry mechanism to the target drivers (sd/ssd).
The only overhead is the HBA driver retry, which is minimal.
With the implementation of flag B_FAILFAST, the time taken to fail a submirror has been drastically reduced.
Loss of Fabric (Leadville stack)
In the event that the HBA driver reports a true path failure or loss of fabric up the I/O stack to the SCSI disk driver, the Leadville stack will potentially wait for a minimum overall SCSI I/O timeout of 110 seconds.
The 110 second timeframe is a combination of two default timeouts for the Leadville drivers, fp and fcp:
SSFCP_OFFLINE_DELAY (default 20)
FP_OFFLINE_TIMEOUT (default 90)
After the 110 seconds, the targets “disappear”, the LUNs are offlined, and a clean failure is reported.
In the event that a path failure or loss of fabric is not detected/reported up the I/O stack to the SCSI disk driver, the Leadville stack will incur an initial delay of 110 seconds, plus the sd or ssd total delay timeout.
If no offline/fabric problems are reported at all, the 110 second delay is not triggered. The fp and fcp drivers are unaware of any problem (they can still submit I/O successfully, so from their point of view all is well) and introduce no delay. The operation then awaits the outcome of the sd/ssd delays only.
FP_OFFLINE_TIMEOUT can be tuned by adding 'fp_offline_ticker' entry to /kernel/drv/fp.conf.
For example
fp_offline_ticker=50;
SSFCP_OFFLINE_DELAY can be tuned by adding 'fcp_offline_delay' entry to /kernel/drv/fcp.conf.
For example
fcp_offline_delay=10;
Examples
The Fibre Channel protocol notifies the server of any major fabric changes by issuing a registered state change notification (RSCN). This allows nodes to immediately gain knowledge about the fabric and react accordingly.
1) The fabric detects/reports link offline messages, and the RSCN suggests that the target has gone away:
110 second delay -> followed by offline of the LUN -> return failed I/O status (no waiting for 'timeouts')
2) There are no fabric offline related messages and no RSCN notifications suggesting that the device has gone away. The storage is not responding, although it can accept commands:
No 110 second delay, just the ssd timeout/retry mechanism described.
Therefore, reducing the timeout values from the default 60 seconds to 30 seconds could substantially shorten the overall time taken to report a failure.
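The two examples above can be modeled with a small Python sketch. It follows the Examples section only: a reported offline costs the combined fcp and fp delays, while an unreported problem costs the sd/ssd timeout and retry sequence. The function and parameter names are hypothetical, and real behavior also depends on the HBA driver.

```python
def failure_delay(fabric_reports_offline: bool,
                  retry_count: int = 3,
                  io_time: int = 60,
                  fcp_offline_delay: int = 20,
                  fp_offline_timeout: int = 90) -> int:
    """Approximate seconds before a failed I/O status is returned.

    Example 1: the fabric reports the target offline, so only the
    Leadville offline delays apply.
    Example 2: nothing is reported, so only the disk driver timeout
    and retries apply (ssd defaults are used here).
    """
    if fabric_reports_offline:
        return fcp_offline_delay + fp_offline_timeout
    return (1 + retry_count) * io_time


print(failure_delay(True))                 # defaults: 20 + 90
print(failure_delay(False))                # ssd defaults: 4 attempts of 60 s
print(failure_delay(False, io_time=30))    # ssd_io_time reduced to 30 s
```

With the tuned values from the fp.conf and fcp.conf examples (50 and 10), the first case would come to 60 seconds under the same assumptions.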
DMP tunable “dmp_scsi_timeout”
NOTE: The DMP tunable “dmp_scsi_timeout” is set to 30 seconds by default with Veritas Volume Manager (VxVM) 5.0 MP3. The value determines the timeout value for any SCSI command that is sent via DMP when using the SCSI bypass logic “dmp_fast_recovery”.
This DMP timeout value is used when DMP sends a SCSI inquiry probe to validate the health of a suspect path.
The “dmp_fast_recovery” activity is only performed once the SCSI disk drivers have timed out, thereby notifying DMP.
If the HBA does not receive a response for a SCSI command that it has sent to the device within the timeout period (in this case 30 seconds for DMP), the SCSI command is returned with a failure error code.
How DMP parameters respond to a HBA IO error (Solaris)
In the first stage, the lower layers (i.e. the sd/ssd driver) must update Veritas DMP before DMP can attempt to act on the response returned by the SCSI layer.
DMP always uses the SCSI driver interface for regular I/O. If the SCSI bypass functionality is enabled, DMP uses it only for I/O error analysis to obtain a faster response, and not for regular I/O.
When the DMP “dmp_fast_recovery” tunable is “on”, DMP will attempt to obtain SCSI error information directly from the HBA interface.
DMP uses the HBA interface, if supported, to obtain SCSI error information. This can potentially provide faster error recovery where DMP bypasses the SCSI disk driver (sd/ssd).
The default setting is on.
DMP ENCLOSURES
With DMP it is possible to define specific recovery options.
To limit the number of times that DMP attempts to retry sending an I/O request down a specific path for a given enclosure, it is possible to customize the “retrycount” for that enclosure.
# vxdmpadm setattr enclosure <enclosure-name> recoveryoption=fixedretry retrycount=3
Pre 5.1 SP1 Design:
The enclosure based “retrycount” local to that enclosure, specifies the number of retries to be attempted before DMP reschedules the I/O request on another available path, or fails the request altogether.
This value overrides the default DMP tunable “dmp_retry_count”, which is a global attribute for all remaining enclosures.
# vxdmpadm gettune dmp_retry_count
Tunable                        Current Value  Default Value
------------------------------ -------------  -------------
dmp_retry_count                5              5
Post 5.1 SP1 Design:
The enclosure based “retrycount” local to that enclosure specifies the number of retries DMP attempts when rescheduling the I/O request against an alternate available path (not a specific path) before failing the request altogether.
In some cases, it may make sense to reduce the global “dmp_retry_count” to “3”, as the ssd disk driver may be controlling all the attached storage.
In the event that a mix of sd and ssd drivers is controlling a specific set of enclosures, the “retrycount” can be defined per enclosure.
DMP TIMEBOUND “IOTIMEOUT”
The DMP timebound iotimeout value should always be greater than the I/O service time of the underlying operating system layers.
As an alternative to the enclosure based “retrycount”, it is possible to specify the overall amount of time DMP allows for handling an I/O request. If an I/O request does not succeed within the defined time, DMP fails the I/O request regardless of the number of retry attempts.
To specify an iotimeout for a given enclosure, type:
# vxdmpadm setattr enclosure <enclosure-name> recoveryoption=timebound iotimeout=160
The default value for the “iotimeout” is 300 seconds (5 minutes). For some applications such as Oracle, it may be desirable to set the “iotimeout” to a larger value.
Defining the “iotimeout” with a value of 160 seconds allows for the initial SCSI device timeout of 30 seconds, followed by “3” SCSI retry attempts of 30 seconds each (90 seconds), plus an HBA reset.
30 + 90 + 10 + (4 seconds) = 134 seconds
NOTE: DMP will have to wait for the lower layers to respond each time before retrying I/O down an alternate (enabled) path.
The first I/O attempt completes after 134 seconds. As DMP still has 26 seconds left, the I/O is retried down an alternate path. This attempt also fails, but only after the DMP iotimeout of 160 seconds has already passed.
DMP cannot fail the I/O until after 268 seconds due to the time required by the lower layers.
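The timeline above can be reproduced with a short Python sketch. It assumes that every path fails, that each attempt takes the same 134 seconds in the lower layers, and that an alternate path is always available. The function name timebound_failure is hypothetical and does not correspond to any DMP interface.

```python
def timebound_failure(per_attempt: int, iotimeout: int) -> tuple:
    """Return (attempts, seconds) at which DMP can fail the I/O.

    DMP only learns the outcome of an attempt when the lower layers
    return it, so it re-issues the I/O while time remains and fails
    the request once a returned attempt has passed iotimeout.
    """
    elapsed = 0
    attempts = 0
    while True:
        attempts += 1
        elapsed += per_attempt      # wait for the lower layers to respond
        if elapsed >= iotimeout:    # no time left to try another path
            return attempts, elapsed


# 30 + 90 + 10 + 4 seconds per attempt, as given in the text.
attempts, seconds = timebound_failure(per_attempt=134, iotimeout=160)
print(f"I/O failed after {attempts} attempts and {seconds} seconds")
```

Under these assumptions the first attempt returns at 134 seconds with 26 seconds remaining, and the second returns at 268 seconds, which is when DMP can finally fail the I/O.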
KEY POINTS:
DMP will fail the I/O only when the returned I/O reaches DMP.
Once the SCSI layer failure updates DMP, only then will DMP attempt to re-issue the failed I/O against an alternate available path, as long as time still permits before the defined iotimeout is reached or when the number of fixedretries has not been exhausted.
DMP cannot react to a hung I/O in the lower layers unless a response is given back to DMP.
