Description of the problem
On a certain day, an OSN7500 (the gateway network element) was upgraded remotely. Activation failed, and the main control board reset repeatedly in the BIOS state; reseating the board had no effect. After the network cable was unplugged from the OSN7500, the main control board reset again and started successfully.
Device version: V100R8C02SPC200
The network management system reports an activation failure. Logging in through the command line gives the prompt "in BIOS state, please load NE software", and loading the software through the command line also fails.
Process
1. The main control board loaded the software successfully but activation failed, so the main control board itself may be faulty.
2. Other boards went online or offline while the main control board was activating, causing its reset to fail.
3. The main control board restarts repeatedly in the BIOS state, so the loaded software data may be abnormal.
4. The main control board reboots in the BIOS state, possibly because of a flood of DCN messages.
5. The DCN is too large, and a large number of ECC messages fill the message queue of the main control board.
Root Cause
The DCN is too large, and the mailbox communication message queue overflows.
Solution
1. After the activation failure, the indicators showed that the main control board was restarting repeatedly, so a board fault was suspected first. The board was replaced and the upgrade and activation were repeated, but the main control board still rebooted in the BIOS state. The board was then inserted into another 7500 device, where the upgrade succeeded, which ruled out a board fault.
2. A new main control board was inserted into the slot and only hard reset, without any upgrade. It still could not start normally, so a flood of DCN messages was suspected. When the network cable was unplugged from the 7500 device, the board reset successfully. Repeated tests showed that whenever the network cable was connected, the main control board could not reset normally, so the fault was judged to be DCN related.
3. The network cable was connected to a hub, the hub was connected to the OSN7500, and a PC was connected to the same hub to capture packets.
4. Analysis of the logs showed that the TransCMMsg records correspond to the reset records in the errlog under the BIOS logs.
The BB0 record "2013-04-17 [16:19:50] 0x000d 00 0x0008000b bioscm_CBiosCM. 0503 TransCMMsg" corresponds to the errlog record "2013-04-17 [16:20:05] 0x18 0xf0000010 0xf0000000b bioscm_CBiosCM. 0503 TransCMMsg -17 [16:20:05] 0x18 0xf0000010 0x00000001".
5. Further analysis with R&D of where the TransCMMsg error is printed showed that the following call to TransCMMsg failed, which led directly to a reset of the main control board:
dwRc = m_pAcc->TransCMMsg(vBuf, pLanHeader->wLength, 0 /*LanHeader.dwDstAddr*/,
(BYTE)(wChanPort + LAN_PORTBASE), byLanMode) ;
m_pISBoard->CheckErrCode(dwRc,VOS_ERR_LEVEL1,MODULE_ID_BIOSCM, "TransCMMsg") ;
6. In the BIOS, the mailbox communication message queue is initialized with a length of 128. Assuming the BIOS processes no messages, the function above returns failure once the queue already holds 128 messages and a new message arrives; the forwarding error then resets the board.
The packet capture taken while the problem occurred shows about 50 to 60 packets per second. If the BIOS handles no messages, the queue fills in only about 2 seconds. With the DCN closed, the problem did not recur, and the capture taken at that time shows just over 10 packets per second; under the same assumption, filling the queue takes more than 10 seconds.
Analysis of the data also showed that the ECC subnet scale in the existing network is 240, far larger than the 64 recommended for NG products. In such a large subnet, while the main control board of the gateway network element resets, the downstream network elements keep sending large volumes of Ethernet communication data to the ETH port of that board. If the message volume peaks at a moment when the board is busy with other start-up tasks, the message queue fills up and the board resets.
The second reproduction attempt showed no abnormal reset because the gateway of all network elements under the faulty network element had been set to unconfigured. This reduced the traffic between this network element and the others, so the board could keep up with the messages during start-up and started the host software normally without resetting.
7. Versions earlier than R10C03 do not include the robustness improvement, so the problem occurs in networks such as this one where the ECC is too large. The robustness fix for this problem is included in R10C03.
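The overflow mechanism and the fill-time estimates in step 6 can be sketched in C++ as follows. This is a minimal illustration, not the device's BIOS code: `MailboxQueue`, `Post`, `SecondsToFill` and `kMailboxDepth` are hypothetical names, and the model assumes, as the analysis does, that nothing is drained from the queue. Only the depth of 128 and the packet rates come from the case.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Queue depth taken from the analysis (128 entries). The name is hypothetical.
constexpr std::size_t kMailboxDepth = 128;

// Hypothetical stand-in for the BIOS mailbox queue: a fixed-depth FIFO
// that rejects new messages once it is full.
class MailboxQueue {
public:
    explicit MailboxQueue(std::size_t depth) : depth_(depth) {}

    // Returns false when the queue is already full. In the case described
    // here, that failure is what leads to the board reset.
    bool Post(int msg) {
        if (queue_.size() >= depth_) {
            return false;
        }
        queue_.push_back(msg);
        return true;
    }

    std::size_t Size() const { return queue_.size(); }

private:
    std::size_t depth_;
    std::deque<int> queue_;
};

// Seconds until a queue of `depth` entries is full when packets arrive at
// `packetsPerSecond` and no message is ever processed (the assumption
// used in the analysis).
double SecondsToFill(std::size_t depth, double packetsPerSecond) {
    assert(packetsPerSecond > 0.0);
    return static_cast<double>(depth) / packetsPerSecond;
}
```

Under these assumptions, `SecondsToFill(kMailboxDepth, 60.0)` gives roughly 2.1 seconds and `SecondsToFill(kMailboxDepth, 10.0)` gives 12.8 seconds, in line with the "about 2 seconds" and "more than 10 seconds" estimates in the analysis.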
Suggestions and Summary
1. The gateway network element carries too many non-gateway network elements, so the volume of ECC information is too large.
2. While the main control board is rebooting and in the BIOS state, the network management system still sends ECC communication packets to the network elements it manages at regular intervals, because it does not know what state the main control board is in.
3. Because the gateway network element carries too much ECC information, bursts of packets fill the BIOS message queue while the main control board restarts. The board resets to clear the queue, the queue fills again, and the board is caught in an endless reset loop and cannot start normally.
4. After the network cable is unplugged or the gateway attributes of the network element are changed, the ETH port no longer delivers ECC packets to the main control board. The BIOS then starts normally and calls the upper-layer software, and the board starts normally.
5. It is recommended that the DCN size not exceed 64.
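The reset loop summarized above, and why unplugging the cable breaks it, can be illustrated with a small C++ model. It is only a sketch under stated assumptions: `SimulateBoot`, `CountResets` and `BootResult` are hypothetical names, packets are assumed to arrive at a constant rate, and the length of the BIOS phase during which no messages are processed (`biosSeconds`) is an invented parameter, since the case does not state it.

```cpp
#include <cassert>
#include <cstddef>

// Outcome of one simulated start-up attempt.
struct BootResult {
    bool booted;         // true if the BIOS phase ended before the queue overflowed
    std::size_t queued;  // messages held in the queue when the attempt ended
};

// Models one BIOS start-up attempt, second by second.
// Assumptions (not from the source): no message is processed for
// `biosSeconds` seconds, and packets arrive at a constant rate.
BootResult SimulateBoot(std::size_t queueDepth, std::size_t packetsPerSecond,
                        std::size_t biosSeconds) {
    assert(queueDepth > 0);
    std::size_t queued = 0;
    for (std::size_t s = 0; s < biosSeconds; ++s) {
        queued += packetsPerSecond;
        if (queued > queueDepth) {
            // A message arrived while the queue was full: forwarding
            // fails and the board resets.
            return {false, queueDepth};
        }
    }
    return {true, queued};
}

// Repeats start-up attempts under unchanged traffic and returns the number
// of resets before a successful boot, capped at `maxAttempts`.
std::size_t CountResets(std::size_t queueDepth, std::size_t packetsPerSecond,
                        std::size_t biosSeconds, std::size_t maxAttempts) {
    std::size_t resets = 0;
    while (resets < maxAttempts &&
           !SimulateBoot(queueDepth, packetsPerSecond, biosSeconds).booted) {
        ++resets;  // the reset clears the queue, then the same traffic refills it
    }
    return resets;
}
```

With a queue depth of 128 and a hypothetical 5-second BIOS phase, 60 packets per second never lets the board boot (`CountResets` reaches its cap), while 0 packets per second (cable unplugged) or 10 packets per second (reduced ECC traffic) boots on the first attempt. Because the traffic does not change between attempts, the model cannot escape the loop on its own, which mirrors the observed behaviour.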
The technical information in this chapter and the SDH equipment troubleshooting process were collected and organized by Shenzhen Optical Transmission Network Technology Company Limited (www.opticaltrans.com); please retain this notice.







