Description of the problem
On January 30, 2007, at 9:22 a.m., SSA1EGT2 board No. 9 of Network Element #126 (an OptiX 10GV2 device) at Site XX began reporting FCS_ERR alarms continuously and its service was interrupted. The service did not recover until 1:15 p.m., when the field engineer stopped and restarted the LCAS protocol, after which the fault disappeared.
Processing
1. After the failure occurred, R&D personnel obtained the site data with the help of the site engineers.
The data mainly included the network configuration (MO) data of the site, the operation log of the site network management system, the historical alarm and performance information of the site, and the negotiation information of the LCAS protocol components. Analysis of this information gave the following results:
(1) The site MO data was imported into the network management system. Analysis showed that MSP protection switching had occurred on the line board of network element #126 before the service interruption and FCS_ERR alarms were reported, and that the continuous FCS_ERR alarms appeared when the protection switching was restored.
(2) The historical alarm and performance data returned from the site show a large number of bit errors at the moment the protection switching was restored.
(3) The LCAS protocol negotiation information returned from the site shows no abnormality; the negotiation in the upper-layer software was normal.
2. The data returned from the site offered two leads: the service interruption caused by the MSP protection switching, and the FCS_ERR alarms that appeared in the field. The laboratory ran systematic tests on both.
(1) The interaction between MSP protection switching and LCAS was tested first. An environment matching the field network was built in the laboratory and MSP protection switching was triggered repeatedly. With relatively low optical power, bit errors occurred and produced FCS_ERR alarms. However, the field data showed bit errors only at the moments when the MSP protection switching occurred and was restored; after the switching was restored at about 10:00 there were no further bit errors. This is inconsistent with the laboratory simulation, in which bit errors were always present before FCS_ERR alarms were reported, so the direct switching test did not reproduce the problem.
(2) Because the direct MSP switching test in the laboratory could not reproduce the problem, the data experts were called together to discuss another approach.
The idea was to start from the FCS_ERR alarm reported in the field, analyze what could cause that alarm under field conditions, and then run targeted simulation tests:
Repeated simulations with the test software showed that when simulated field bit errors corrupted the SQ, the alarms reported were fully consistent with those seen in the field.
It was also found that if the SQ changes in a certain pattern under the influence of bit errors, the service does not recover after the bit errors disappear, FCS_ERR alarms are reported continuously, and the service returns to normal only after the LCAS protocol is stopped and restarted. Repeated testing finally identified the SQ change pattern:
(1) Because of bit errors, the SQ received by the EGT2 board at one end is greater than or equal to 0x3f.
(2) When the protection switching is restored, the SQ numbers of individual time slots are swapped because of bit errors.
(3) After the bit errors disappear, the correct SQ numbers are received normally, but the service can no longer recover and FCS_ERR alarms are reported.
The symptoms over the whole SQ change process are: the data board performance statistics are abnormal, the board keeps reporting FCS_ERR alarms, the service interruption does not recover, and after the LCAS protocol is stopped and restarted the service recovers and the alarms disappear.
Root cause
The procedure used in the laboratory to simulate the failure is as follows:
(1) An SQ of 0x3f is issued for time slots 1 and 2; the service is interrupted (equivalent to field bit errors forcing the SQ to all 1s).
(2) The SQ numbers of time slots 1 and 2 are exchanged; the service is still interrupted (equivalent to the SQ being swapped by bit errors).
(3) The correct SQ numbers are issued; the service is still interrupted (equivalent to the bit errors disappearing and the SQ returning to normal).
(4) The LCAS protocol is stopped and restarted; the service resumes.
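The four laboratory steps can be sketched as a small toy model. This is only an illustration under stated assumptions, not the board software: the class ToySink, its methods and the "latch once on restoration" rule are hypothetical names and simplifications chosen to mirror the behavior described in this report.

```python
UNUSED_SQ = 0x3F  # all-ones value of the 6-bit SQ field


class ToySink:
    """Hypothetical model of a sink whose configured SQ is not adapted."""

    def __init__(self, sq_by_slot):
        self.configured = dict(sq_by_slot)            # time slot -> configured SQ
        self.idle = {ts: False for ts in sq_by_slot}  # slot currently seen as IDLE

    def service_ok(self, received):
        slots = sorted(self.configured)
        # Assumption: bindings must carry SQ 0, 1, 2, ... in slot order.
        in_order = [self.configured[ts] for ts in slots] == list(range(len(slots)))
        matches = all(self.configured[ts] == received[ts] for ts in slots)
        return in_order and matches

    def receive(self, received):
        for ts, sq in received.items():
            if sq >= UNUSED_SQ:
                self.idle[ts] = True      # configured SQ keeps its previous value
            elif self.idle[ts]:
                self.idle[ts] = False
                self.configured[ts] = sq  # latched once when the slot is restored
            # otherwise: no adaptation, configured SQ is left untouched
        return self.service_ok(received)

    def restart_lcas(self, received):
        # Restart recalculates the configured SQ from what is received now.
        self.configured = dict(received)
        self.idle = {ts: False for ts in received}
        return self.service_ok(received)


def run_lab_sequence():
    sink = ToySink({1: 0, 2: 1})
    return [
        sink.receive({1: UNUSED_SQ, 2: UNUSED_SQ}),  # step (1): SQ forced to 0x3f
        sink.receive({1: 1, 2: 0}),                  # step (2): SQ swapped
        sink.receive({1: 0, 2: 1}),                  # step (3): correct SQ again
        sink.restart_lcas({1: 0, 2: 1}),             # step (4): LCAS restarted
    ]
```

In this model the service stays down through steps (1) to (3) and recovers only at step (4), which matches the sequence observed in the laboratory.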
Each step is analyzed against the field situation as follows:
(1) In the field, bit errors caused the SQ number to be rewritten as all 1s (0x3f for the EGT2 board, because 0x3f is the maximum value the board supports). The chip handles an SQ of 0x3f as follows:
The PMC5397 chip uses a register field of only 6 bits for the SQ number, and an SQ equal to 0x3f denotes an unused SQ number. (Each entry of the sequence RAM specifies four expected sequence numbers as explained in the ECBI register section. The four sequence numbers corresponds to four consecutive timeslots. For unused timeslots (including those for contiguous concatenation), the value must be set to 0x3F.)
With LCAS enabled (LCAS is always enabled on the chip), when an SQ of 0x3f is received the received CTRL field becomes 5 (the IDLE state, indicating that the time slot is not in use), and the SQ read out from the FPGA keeps its previous value.
Because the received CTRL is IDLE, the time slot configuration is issued once through the LCAS adaptation layer interface, and this includes the downlink SQ value. Since the downlink SQ is the previous value, it is not consistent with the SQ actually received.
Therefore, the service is interrupted and an FCS_ERR alarm is reported.
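A minimal sketch of the 0x3f handling described above. The function names are hypothetical, and discarding any bits above the 6-bit field is an assumption made for illustration; only the 6-bit width, the 0x3f "unused" value and CTRL = 5 for IDLE come from the text.

```python
SQ_BITS = 6                      # width of the SQ field, per the text
SQ_UNUSED = (1 << SQ_BITS) - 1   # 0x3F, the all-ones value
CTRL_IDLE = 5                    # CTRL value the text gives for the IDLE state


def decode_sq(raw_value):
    """Return (is_unused, sq), keeping only the low 6 bits (assumption)."""
    sq = raw_value & SQ_UNUSED
    return sq == SQ_UNUSED, sq


def held_sq_matches(previous_sq, raw_value):
    """True if the held (previous) downlink SQ equals the SQ just received."""
    _, sq = decode_sq(raw_value)
    return previous_sq == sq
```

For example, a field forced to all 1s decodes as unused, and a held SQ of 0 no longer matches it, which is the mismatch that interrupts the service in step (1).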
(2) When the MSP protection in the field is restored, the SQ numbers received by the EGT2 board at one end are swapped because of bit error interference.
In this case, the service is still interrupted for the following reasons:
When the SQ numbers of time slots 1 and 2 change from 0x3f to valid SQ numbers (the two SQs are swapped, but each is within the valid range), the time slots are restored and the downlink SQ configuration is issued again during the restoration. However, the SQ relationship is exchanged: the SQ of time slot 1, which should be 0, is now 1, and the SQ of time slot 2, which should be 1, is now 0. Because the SQs must be arranged in order when time slot bindings are issued to the PMC5397 chip, the issued time slot configuration is faulty and the service is still not restored.
(3) When the SQ is correctly restored, the service still cannot recover for the following reasons:
The key reason is that the PMC5397 does not support SQ adaptation in the downlink direction. Because the wrong SQ numbers were issued in step (2), the downlink SQ configured on the chip differs from the correct SQ now being received, and without SQ adaptation the service remains interrupted.
(4) After the LCAS protocol is stopped and restarted, the service resumes for the following reasons:
Restarting the LCAS protocol recalculates the downlink SQ, so the current correct SQ is reissued to the chip configuration register. The configured SQ and the received SQ are then consistent, and the service returns to normal.
Summary:
The analysis above locates the root cause of the incident: bit errors generated by the MSP protection switching caused the SQ numbers to be rewritten according to the pattern described above, and because the PMC5397 chip does not support SQ adaptation, after the bit errors disappeared and the SQ recovered, the downlink SQ configuration of the PMC5397 chip did not match the SQ actually received. This led to the service interruption and the continuous reporting of FCS_ERR alarms.
Solution
Emergency recovery measures:
None
Measures for a definitive solution:
It was learned that the EGT2 board of the OSN products had a similar problem caused by bit errors affecting the SQ. According to the OSN personnel, their EGT2 board also performs the SQ adjustment task when LCAS is enabled, whereas the SSA1EGT2 board performs it only when LCAS is not enabled. The root cause of this problem is likewise an SQ adaptation problem, so the OSN practice can be followed: perform the same downlink SQ adjustment when LCAS is enabled, so that when the configured downlink SQ is inconsistent with the SQ actually received, the service can be restored by adjusting the SQ.
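The idea of the SQ adjustment task can be sketched as follows. This is a hedged illustration, not the OSN or SSA1EGT2 implementation: the function name adjust_downlink_sq and the validity rule (received SQs must form the complete sequence 0..n-1) are assumptions introduced for the example.

```python
def adjust_downlink_sq(configured, received):
    """Return the downlink SQ map to configure after one adjustment pass.

    configured, received: dicts mapping time slot -> SQ number.
    Assumption: received SQs are trusted only when they form the complete
    sequence 0..n-1; otherwise the current configuration is kept.
    """
    valid = sorted(received.values()) == list(range(len(received)))
    if not valid:
        return dict(configured)   # e.g. SQ forced to 0x3f by bit errors
    if configured == received:
        return dict(configured)   # already consistent, nothing to adjust
    return dict(received)         # follow the SQ actually received
```

Run periodically, a task of this kind would bring a stale configuration such as {1: 1, 2: 0} back to {1: 0, 2: 1} once the correct SQ is received again, without restarting the LCAS protocol.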
The technical information and SDH equipment troubleshooting procedure in this chapter were collected and organized by Shenzhen Optical Transmission Network Technology Co.







