Problem Description
After running normally for more than two months on OSN1500 network elements, the SSN2EGS2 board repeatedly reports COMMU_FAIL, LP_SLM_VC12, and ETH_LOS alarms, and services are interrupted. Soft resets, hard resets, and reseating the board on the live network did not solve the problem.
Alarm information
COMMU_FAIL, LP_SLM_VC12, ETH_LOS.
Processing
1. Replace the board with an SSN1EGS4 or another data board. The SSN1EGS4 software includes a protection mechanism against protocol message floods, so the problem is solved after the replacement.
2. Upgrade the SSN2EGS2 board software to version 5.51 or later.
Root cause
Analysis of the board's black box shows that debugbuf.log contains a large number of board soft reset records. The board is soft-resetting continuously, and these resets are what make it report the COMMU_FAIL alarm repeatedly; the alarm indicates that the Ethernet communication channel between the main control board and the data board has been interrupted. Because the SSN2EGS2 is a special board, a soft reset and a hard reset have the same effect: both interrupt services. Therefore, when an EGS2 board on the live network resets repeatedly, services are interrupted repeatedly. The repeated LP_SLM_VC12 and ETH_LOS alarms are also caused by the repeated soft resets of the board.
The black box record in debugbuf.log shows that the board received a large number of protocol messages, which drove its CPU usage too high and triggered the resets. The version 5.50 software used by the SSN2EGS2 board has no protection mechanism against protocol message floods. When a broadcast storm generates a large number of protocol messages in the network, the CPU cannot carry the load and the board resets; the black box record also shows that the software watchdog (softdog) alarm was reported repeatedly. At the time of a softdog reset, the process that handles protocol messages accounted for as much as 47.63% of CPU resources. Because reseating the board does not remove the protocol message flood, the EGS2 board keeps resetting afterwards.
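The case does not describe how the protection mechanism in the fixed software works. As a rough illustration only, one common way to shield a CPU from a protocol message flood is a token-bucket rate limiter in front of the protocol handler. The sketch below is not the board's actual implementation; the names TokenBucket and simulate_storm and the rate and burst values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical limiter for protocol messages sent to the CPU."""
    rate_pps: float        # tokens added per second (assumed value, not from the case)
    burst: float           # maximum number of tokens the bucket can hold
    tokens: float = 0.0
    last_ts: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so a short burst is admitted.
        self.tokens = self.burst

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` may reach the CPU."""
        elapsed = max(0.0, now - self.last_ts)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last_ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # message is dropped before it consumes CPU time


def simulate_storm(bucket: TokenBucket, packets: int, duration_s: float) -> int:
    """Offer `packets` evenly spaced messages over `duration_s` seconds
    and return how many the limiter admits."""
    admitted = 0
    for i in range(packets):
        now = duration_s * i / packets
        if bucket.allow(now):
            admitted += 1
    return admitted


if __name__ == "__main__":
    storm = simulate_storm(TokenBucket(rate_pps=100.0, burst=20.0), 10_000, 1.0)
    normal = simulate_storm(TokenBucket(rate_pps=100.0, burst=20.0), 50, 1.0)
    print(f"storm: admitted {storm} of 10000; normal: admitted {normal} of 50")
```

With a limiter of this kind, the load that protocol handling places on the CPU is bounded by the configured rate rather than by the size of the storm, while normal protocol traffic below that rate passes untouched.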
Recommendation and Summary
None

