Description of the problem
The OSN3500 device #1600 Development Zone Aggregation and the device #1602 Cultural Road Aggregation form a two-fiber bidirectional multiplex section protection ring.
One night at 20:44:28, the N1SLQ16 board in slot 8 of #1600 Development Zone Aggregation-1 (Downtown Aggregation 10) reported COMMUN_FAIL (serial communication failure alarm).
At 20:46:09, the N1SLQ16 board in slot 8 reported BD_STATUS (board not in position alarm).
At 20:46:10, the network element reported the multiplex section protection switching alarms MS_APS_INDI_EX and APS_INDI. The protection switching then recovered and service was normal.
At 20:52:23, multiplex section protection switching occurred on the ring again.
At 20:52:29, the EGS4 board reported a TU-AIS alarm and service was interrupted. During this period, the GSCC board in slot 17 reported HARD_BAD (board hardware failure alarm), and the alarm parameter pointed to the N1SLQ16 board in slot 8.
At 21:16:57, after the multiplex section protection protocol was restarted, protection switching returned to normal and service gradually recovered.
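The intervals between the events above can be worked out with a short script. This is only an illustrative sketch: the event labels and the helper name `seconds_between` are made up for the example, and because service recovered gradually after the restart, the last interval is the time to the protocol restart, not the exact outage length.

```python
from datetime import datetime

# Timestamps copied from the description above; all fall on the same evening.
EVENTS = {
    "COMMUN_FAIL": "20:44:28",
    "BD_STATUS": "20:46:09",
    "first_switching": "20:46:10",
    "second_switching": "20:52:23",
    "TU-AIS_interruption": "20:52:29",
    "protocol_restart": "21:16:57",
}

def seconds_between(start: str, end: str) -> int:
    """Seconds from start to end, both given as HH:MM:SS on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Time from the first board alarm to the first protection switching.
print(seconds_between(EVENTS["COMMUN_FAIL"], EVENTS["first_switching"]))  # 102

# Time from the service interruption to the protocol restart.
outage = seconds_between(EVENTS["TU-AIS_interruption"], EVENTS["protocol_restart"])
minutes, seconds = divmod(outage, 60)
print(f"{outage} s ({minutes} min {seconds} s)")  # 1468 s (24 min 28 s)
```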
1. The N1SLQ16 board in slot 8 of #1600 Development Zone Aggregation-1 (Downtown Aggregation 10) reported COMMUN_FAIL (serial communication failure alarm).
2. The N1SLQ16 board in slot 8 of #1600 Development Zone Aggregation-1 (Downtown Aggregation 10) reported BD_STATUS (board not in position alarm).
3. The network element #1600 Development Zone Aggregation-1 (Downtown Aggregation 10) reported the multiplex section protection switching alarms MS_APS_INDI_EX and APS_INDI. The EGS4 board reported a TU-AIS alarm.
Handling Process
1. During the first switching, the N1SLQ16 board in slot 8 of #1600 Development Zone Aggregation reported COMMUN_FAIL and BD_STATUS. This caused an RLOS alarm on the opposite end, #1602, which triggered multiplex section protection switching on the ring.
2. During the second switching, service was interrupted over a large area. The protection switching protocol was restarted on the whole network and the K bytes of #1600 and #1602 were rechecked. After the check, the multiplex section protocol and the K-byte status were normal, the protection switching state was consistent at both ends, and protection switching returned to normal.
3. K-byte information was collected and analyzed. The cause of the failure was a hardware fault on the N1SLQ16 board in slot 8 of #1600 Development Zone Aggregation, which made the multiplex section protocol module of this network element send incorrect K bytes. The opposite end therefore did not receive the correct K bytes, the protocol modules at the two ends did not deliver the cross-connect pages according to the predetermined procedure, and the switching failed. The specific process is as follows:
A. When board 11 of #1602 detects that SF has cleared, it sends a "switching recovery request" to board 8 of the opposite end, #1600. When board 8 of #1600 receives the request, it confirms it and sends its own "switching recovery request" to #1602.
B. When board 11 of #1602 receives the "switching recovery request" from the opposite end, it confirms the request and sends a "switching idle state" command to board 8 of #1600. After receiving this command, board 8 of #1600 releases the switching state, delivers the cross-connect page, and the local end returns to the normal state.
C. At the same time, board 8 of #1600 also sends a "switching idle state" command to board 11 of #1602. After receiving it, board 11 of #1602 likewise releases the switching state at its local end and delivers the cross-connect page, and the switching state of the whole ring returns to normal.
However, at the time of the fault, the N1SLQ16 board in slot 8 momentarily reported that it was not in position, the RLOS on #1602 cleared momentarily, and the ring started to recover from switching.
D. Because of the hardware fault on the N1SLQ16 board, the "switching recovery request" it sent turned into a "switching idle state" command. As a result, #1602 changed directly from the switching state to the idle normal state while the opposite end, network element #1600, was still in the switching state. The switching states of the two ends were inconsistent, which led to the service interruption.
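The handshake in steps A to C and the fault in step D can be illustrated with a toy two-node simulation. This is a sketch under stated assumptions, not the equipment's actual protocol logic: the message labels `RECOVERY_REQUEST` and `IDLE_COMMAND`, the `Node` class, and the `simulate` function are hypothetical names, and the model assumes that a node sends the idle command only after it has itself received and confirmed a recovery request, which is one reading of steps A to C.

```python
from collections import deque

REQ = "RECOVERY_REQUEST"  # hypothetical label for the "switching recovery request"
IDLE = "IDLE_COMMAND"     # hypothetical label for the "switching idle state" command

class Node:
    def __init__(self, name):
        self.name = name
        self.state = "SWITCHING"        # both ends start in the switching state
        self.sent_request = False
        self.confirmed_request = False
        self.sent_idle = False

    def start_recovery(self):
        """Step A: SF has cleared, so this node opens the recovery handshake."""
        self.sent_request = True
        return REQ

    def receive(self, msg):
        """Process one message from the peer; return the reply, or None."""
        if msg == REQ:
            self.confirmed_request = True
            if not self.sent_request:   # step A: confirm and send own request
                self.sent_request = True
                return REQ
            self.sent_idle = True       # step B: request confirmed, send idle
            return IDLE
        if msg == IDLE:
            self.state = "IDLE"         # release switching at the local end
            # Step C: answer with idle only if mid-handshake (model assumption).
            if self.confirmed_request and not self.sent_idle:
                self.sent_idle = True
                return IDLE
        return None

def simulate(corrupt_1600_tx=False):
    """Return the final (state of #1600, state of #1602)."""
    n1600, n1602 = Node("#1600"), Node("#1602")
    queue = deque([(n1602, n1602.start_recovery())])
    while queue:
        sender, msg = queue.popleft()
        # Step D: the faulty board turns #1600's request into an idle command.
        if corrupt_1600_tx and sender is n1600 and msg == REQ:
            msg = IDLE
        receiver = n1600 if sender is n1602 else n1602
        reply = receiver.receive(msg)
        if reply is not None:
            queue.append((receiver, reply))
    return n1600.state, n1602.state

print(simulate())                      # ('IDLE', 'IDLE')
print(simulate(corrupt_1600_tx=True))  # ('SWITCHING', 'IDLE')
```

In the faulty run, #1602 goes idle without ever confirming a request, so under the model's assumption it never sends the idle command that #1600 is waiting for, and the two ends stay in different states as described in step D.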
Root Cause
N/A
Solution
1. Because protection switching had occurred but service was interrupted, the multiplex section protection protocol was suspected to be abnormal, and it was immediately restarted on the whole network. After the restart, protection switching returned to normal and service gradually recovered.
2. Because the N1SLQ16 board in slot 8 of #1600 Development Zone Aggregation had reported the abnormal alarms COMMUN_FAIL (serial communication failure alarm) and BD_STATUS (board not in position alarm), the board was replaced. After the replacement, the abnormal alarms and the protection switching alarm cleared, protection switching ended, and the multiplex section ring returned to the normal state.
3. In the early morning of the next day, multiplex section switching tests were performed. The switching was normal in repeated tests and service was not interrupted.
Suggestions and Summary
When multiplex section switching occurs and service is interrupted over a large area, consider abnormal K-byte transmission or an abnormal multiplex section protocol as the cause. You can decisively stop and restart the multiplex section protocol to reset it and restore service quickly, and then deal with the faulty boards.
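The rule of thumb above can be written down as a small triage helper. This is a hedged sketch only: the function name `triage`, the action labels, and the state strings are invented for illustration and do not correspond to any network management system command or API.

```python
def triage(switching_reported, service_interrupted, node_states):
    """Suggest a first action from the symptoms described in this case.

    node_states maps a network element name to its switching state,
    using the hypothetical labels "SWITCHING" and "IDLE".
    """
    states_differ = len(set(node_states.values())) > 1
    if switching_reported and service_interrupted:
        # Switching happened but did not protect the service: suspect the
        # K bytes or the protocol first, and fix the boards afterwards.
        reason = ("switching states differ between nodes" if states_differ
                  else "K-byte or protocol anomaly suspected")
        return "RESTART_MSP_PROTOCOL", reason
    if service_interrupted:
        return "CHECK_OTHER_FAULTS", "no protection switching was reported"
    return "NO_ACTION", "service is normal"

action, reason = triage(True, True, {"#1600": "SWITCHING", "#1602": "IDLE"})
print(action, "-", reason)
# RESTART_MSP_PROTOCOL - switching states differ between nodes
```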
The technical information in this chapter and the SDH equipment troubleshooting process were collected and organized by Shenzhen Optical Transmission Network Technology Company Limited (www.opticaltrans.com); please retain this notice.

