Description of the problem
The SSN2EGS2 board on a Huawei OSN3500 network element reported a COMMUN_FAIL alarm, and services were interrupted. Investigation showed that the SSN2EGS2 board had reset abnormally, which raised the COMMUN_FAIL alarm. Services recovered after the board was reset, and neither the COMMUN_FAIL alarm nor the abnormal board reset has recurred since.
The host version is 5.21.12.42.
The EGS2 board software version is 2.14.
Alarm Information
COMMUN_FAIL
Processing
Upgrade the OSN3500 SSN2EGS2 board to R6C02B014 (board software version 3.15) to resolve the problem completely.
Root cause
According to feedback from the site, the board reset and services were interrupted suddenly while services were running normally and no operation was being performed.
Analysis of the black-box data shows that, while Hello messages were being processed, the first two were handled successfully; on the third, an exception occurred when releasing memory, which caused the board to reset.
Log information:
bb1.log 2008-10-14 20:19:33 D:/3500prj/public/HardDrv/NP3454Drv/NP3454PktMng.cpp,813, Hardware operation failed ,(null)
An error occurred while processing the Hello message, so the memory had to be released.
bb1.log 2008-10-14 20:19:33 Error:0x70008, freeAddr=0x1cc8070, BufSize=0x90, dmm_intf.cpp, line:1439, ,
01cc80f0: 00000000 00000000 64656164 64656164
bb1.log 2008-10-14 20:19:33 Reset: File_NP3454PktMng.cpp, Line_859, Type_0xf0000010.
An error occurred while releasing the message memory; the release failed.
bb1.log 2008-10-14 20:19:33 Error:0x70008, freeAddr=0x1cc7380, BufSize=0x90, NP3454PktMng.cp, Line:669, ,
01cc7400: 00000000 04001200 0180c200 00000000
An error occurred while requesting message memory; the request failed.
bb1.log 2008-10-14 20:19:33 Error:0x70008, freeAddr=0x1cc7380, BufSize=0x90, dmm_intf.cpp, line:1439, ,
01cc7400: 00000000 04001200 0180c200 00000000
Analysis of the reset log indicates that the reset was caused by a memory request failure or an out-of-bounds memory write.
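The relationship between the freeAddr values and the memory dump lines in the log can be checked with a short script. The following is a minimal sketch in Python, not a vendor tool: the log lines are copied from the excerpt above with the "bb1.log" prefix and timestamp trimmed, and the helper names (parse_records, printable) are made up for illustration. It assumes that BufSize is the size of the block starting at freeAddr and that each dump line belongs to the Error line directly before it.

```python
import re

# Error and dump lines copied from the black-box excerpt (prefix and timestamp trimmed).
LOG_LINES = [
    "Error:0x70008, freeAddr=0x1cc8070, BufSize=0x90, dmm_intf.cpp, line:1439, ,",
    "01cc80f0: 00000000 00000000 64656164 64656164",
    "Error:0x70008, freeAddr=0x1cc7380, BufSize=0x90, NP3454PktMng.cp, Line:669, ,",
    "01cc7400: 00000000 04001200 0180c200 00000000",
]

ERROR_RE = re.compile(r"freeAddr=0x([0-9a-fA-F]+), BufSize=0x([0-9a-fA-F]+)")
DUMP_RE = re.compile(r"^([0-9a-fA-F]{8}):((?: [0-9a-fA-F]{8})+)$")


def parse_records(lines):
    """Pair each Error line with the memory dump line that follows it (assumed layout)."""
    records = []
    pending = None
    for line in lines:
        err = ERROR_RE.search(line)
        dump = DUMP_RE.match(line.strip())
        if err:
            pending = {
                "free_addr": int(err.group(1), 16),
                "buf_size": int(err.group(2), 16),
            }
        elif dump and pending is not None:
            pending["dump_addr"] = int(dump.group(1), 16)
            pending["dump_bytes"] = bytes.fromhex(dump.group(2).replace(" ", ""))
            records.append(pending)
            pending = None
    return records


def printable(data):
    """Render bytes as ASCII, replacing non-printable bytes with dots."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


records = parse_records(LOG_LINES)
for rec in records:
    offset = rec["dump_addr"] - rec["free_addr"]
    print(hex(rec["free_addr"]), hex(offset), printable(rec["dump_bytes"]))
```

In both records the dump starts 0x80 bytes after freeAddr, so under the stated assumption it covers the last 16 bytes of the 0x90-byte buffer. In the first record the final eight bytes decode to the ASCII text "deaddead". This may be a marker pattern written by the memory manager, but the source does not document such a pattern, so that reading is only a hypothesis.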
Analysis of the relevant code segment found the following:
When the protocol message is processed, there are three defects. First, the pointer is not initialized. Second, because of the memory leak, memory requests fail once the slice points and non-slice points are used up, but the failure is only recorded in the level-3 black box; no error is returned and the board is not restarted directly. Third, the condition that decides whether to release memory when sending fails contains a coding slip, so the memory is not released.
Because of the memory leak, once memory runs out, failed memory requests in other tasks cause a restart. In addition, if pSendMsgBuf is an uninitialized pointer with no memory allocated to it, releasing it later may also result in a repeated release and a restart.
Therefore, the cause of the problem is that, in special scenarios, memory that has been requested is not released, and the resulting memory leak causes the board to reset. A memory leak accumulates over a long period, so the problem disappears after the board is reset. This is a board quality issue.
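The leak mechanism described above can be illustrated with a toy model. The following Python sketch is not the board's code: BufferPool, handle_hello_leaky, handle_hello_fixed and failing_send are hypothetical names, and a small fixed-size pool stands in for the board's message memory. It shows a handler that skips the release when sending fails, and a corrected handler that starts from a known empty state, reports request failures, and releases the buffer exactly once on every path.

```python
class BufferPool:
    """Toy fixed-size pool standing in for the board's message memory (hypothetical)."""

    def __init__(self, size):
        self.free = list(range(size))
        self.in_use = set()

    def alloc(self):
        if not self.free:
            return None  # request failure: the pool is exhausted
        buf = self.free.pop()
        self.in_use.add(buf)
        return buf

    def release(self, buf):
        if buf not in self.in_use:
            # Covers both a repeated release and a release of a never-allocated buffer.
            raise RuntimeError("release of a buffer that is not allocated")
        self.in_use.remove(buf)
        self.free.append(buf)


def handle_hello_leaky(pool, send):
    buf = pool.alloc()
    if buf is None:
        return False
    if not send(buf):
        return False  # defect: the buffer is never released on this path
    pool.release(buf)
    return True


def handle_hello_fixed(pool, send):
    buf = None  # always start from a known "no buffer" state
    try:
        buf = pool.alloc()
        if buf is None:
            return False  # report the request failure to the caller
        return send(buf)
    finally:
        if buf is not None:
            pool.release(buf)  # released exactly once, whether or not sending succeeded


def failing_send(buf):
    """Simulates the special scenario in which sending always fails."""
    return False


def count_until_exhausted(handler, pool_size=4, attempts=100):
    """Return how many messages are handled before the pool has no free buffers, or None."""
    pool = BufferPool(pool_size)
    for n in range(attempts):
        handler(pool, failing_send)
        if not pool.free:
            return n + 1
    return None
```

With a four-buffer pool, count_until_exhausted(handle_hello_leaky) returns 4, while count_until_exhausted(handle_hello_fixed) returns None because every buffer is given back. Creating a new BufferPool plays the role of the board reset: the leaked buffers become available again, which matches the observation that the problem disappears after a reset and only returns after a long accumulation.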
The technical information in this chapter and the SDH equipment troubleshooting process were collected and organized by Shenzhen Optical Transmission Network Technology Co. (www.opticaltrans.com); please retain this notice when reproducing.







