Problem Description
While handling a COMMUN_FAIL alarm reported by the active GSCC board in slot 17 of a Huawei OSN 3500 on the live network, the local engineer deleted the alarm reversion database alminv.dbf on the active GSCC as the precaution notice directs. The notice then calls for a reset. Instead of issuing the reset command directly, the engineer followed routine practice: reset the standby GSCC first, perform an active/standby switchover, and then reset the original active GSCC. Afterwards the network element behaved almost as if it were resetting: Navigator could log in to the network element only intermittently, the NMS showed COMMUN_FAIL on the slot 17 GSCC and SYNC_FAIL on the slot 18 GSCC, and other commands issued on the network element side returned a host-busy error: failed! cmd:0x301a error:0x9127 NE IS BUSY : HBU
Precaution notice referenced: Precaution About Repeated Reset of the SCC board Due to Enabling of the Alarm Reversion Function for Boards on the Extended Subrack of the OptiX OSN 3500-20090727-A.doc
Alarm Information
COMMUN_FAIL;
SYNC_FAIL;
failed! cmd:0x301a error:0x9127 NE IS BUSY : HBU ;
Login to the network element fails intermittently.
Handling Process
Fault locating:
1. Log in to the network element and run :errlog to check the reset log. No repeated resets are recorded.
2. Run :cfg-get-phy repeatedly for 5 minutes. Both GSCC boards remain physically present the whole time, confirming that neither the active nor the standby GSCC is resetting repeatedly.
3. Query the alarm reversion database. The output, shown below, contains a large number of records with BID 0x33, indicating that the alarm reversion database was not cleared successfully.
#9-80:szhw [Kadawatha ][][2011-11-03 17:37:42+05:30]>
:dbms-query: "alminv.dbf",drdb
ALMINV.DBF
record num  BID  OPPORT  PATH  ALLOW
1           36   01      0034  01
2           36   01      0039  01
3           34   01      001c  01
4. The query :hbu-get-sync-enable returns success, indicating that synchronization is not disabled.
5. Querying the synchronization status with :hbu-get-backup-info shows that it switches repeatedly between 0x00000000 and 0x00000002 but never reaches 0x00000003.
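When the record listing in step 3 is long, it can be summarized mechanically. The following is a minimal Python sketch, assuming the :dbms-query output has been saved as text and that every data row consists of the five whitespace-separated fields seen in the listing (record number, BID, OPPORT, PATH, ALLOW). The function name count_bids and the row-matching rule are illustrative assumptions, not part of any Huawei tool.

```python
from collections import Counter


def count_bids(query_output: str) -> Counter:
    """Count ALMINV.DBF records per BID in saved :dbms-query output.

    Assumption: a data row has exactly five whitespace-separated fields
    and its first field is a decimal record number. All other lines
    (prompt, command echo, table header) are skipped.
    """
    counts = Counter()
    for line in query_output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].isdigit():
            counts[fields[1]] += 1  # BID is kept as the text shown in the output
    return counts


# Excerpt copied from the listing in step 3.
sample = """\
:dbms-query: "alminv.dbf",drdb
ALMINV.DBF
record num BID OPPORT PATH ALLOW
1 36 01 0034 01
2 36 01 0039 01
3 34 01 001c 01
"""

bid_counts = count_bids(sample)
print(bid_counts)
print("database cleared" if not bid_counts else "records remain")
```

An empty result would suggest that the database was cleared; any remaining records mean the clearing has to be repeated.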
Cause Analysis:
1. After the database on the slot 17 GSCC was cleared, the alarm reversion data on the standby GSCC in slot 18 still existed (for some reason it had not been updated in real time). During the active/standby switchover, the slot 17 GSCC therefore copied the data, including the alarm reversion database, from the new active GSCC in slot 18, and reported COMMUN_FAIL again after the switchover.
2. During the switchover, the database information on the two GSCC boards was inconsistent and conflicted (while the slot 17 GSCC was also reporting COMMUN_FAIL). As a result, the batch backup between the two boards could never complete, the system remained in a processing state, and the network element entered the busy state.
3. It is also possible that the alarm reversion database triggered the issue described in the precaution notice again.
Conclusion:
It is necessary to interrupt the batch backup, clear the alarm reversion database once more, and reset the board so that the active and standby GSCC boards synchronize. However, none of the relevant commands can be issued because the network element is busy. Every command returns: failed! cmd:0x301a error:0x9127 NE IS BUSY : HBU ;
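When command responses are collected in a log, the busy rejection can be picked out and its codes extracted. The following is a minimal Python sketch; the pattern is derived only from the single message quoted in this case, so the exact spacing rules, the BusyError type, and the function name parse_busy_error are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class BusyError(NamedTuple):
    cmd: int     # command code, e.g. 0x301a
    error: int   # error code, e.g. 0x9127
    module: str  # module named after "NE IS BUSY :", e.g. "HBU"


# Assumption: the message format matches the one example seen in this case.
_BUSY_RE = re.compile(
    r"failed!\s*cmd:(0x[0-9a-fA-F]+)\s+error:(0x[0-9a-fA-F]+)\s+"
    r"NE IS BUSY\s*:\s*(\w+)"
)


def parse_busy_error(response: str) -> Optional[BusyError]:
    """Return the parsed busy error, or None if the response is not one."""
    match = _BUSY_RE.search(response)
    if match is None:
        return None
    return BusyError(
        cmd=int(match.group(1), 16),
        error=int(match.group(2), 16),
        module=match.group(3),
    )


result = parse_busy_error("failed! cmd:0x301a error:0x9127 NE IS BUSY : HBU ;")
print(result)
```

Here the module field reads HBU, the same prefix as the hbu-* backup commands used in this case, which is consistent with the busy state coming from the batch backup.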
Action taken:
After consulting R&D, we used :sm-set-nebusy:0,0,0,0,none to release the network element busy state, cleared the alarm reversion database again as the precaution notice directs, soft-reset the active GSCC, and manually triggered a batch backup with :hbu-go-batch. The batch backup status was then checked and found normal, and the problem was solved.
Root Cause
1. The intermittent login failures may have been caused by the GSCC resetting repeatedly.
2. The slot 17 GSCC was still reporting COMMUN_FAIL, which means that the earlier deletion of the alarm reversion database alminv.dbf had not taken effect and alarm reversion data might still be present on the slot 17 GSCC.
3. The slot 18 GSCC reported SYNC_FAIL, indicating that synchronization between the active and standby GSCC boards, that is, the batch backup, had failed. Inconsistent data on the two boards may have caused a conflict that prevented synchronization from completing, which in turn put the network element into the busy state.
Suggestions and Summary
1. Active/standby switchover and database synchronization (backup)
Under normal conditions, data on the active and standby GSCC boards is synchronized in real time or backed up periodically: when a database on the active board (alarm data, for example) changes, the standby board reads and copies the data from the active board.
When a switchover command is issued, the mechanism is: switch over first, then trigger a batch backup in which the new standby board synchronizes from the new active board. In other words, the original active board reads data from the original standby board. If the data on the original standby board is incomplete or incorrect at that moment, problems occur and the network element database may end up with errors.
After consulting R&D, it was confirmed that the order of batch backup and switchover cannot be changed. The main reason is that a GSCC switchover can be a hard switchover that is not initiated by software, for example when a board is removed and reinserted or the GSCC hardware fails, so the switchover must come first and the batch backup second. In addition, active/standby synchronization is close to real time but cannot be strictly real time, so this risk does exist, although its probability is low.
Therefore, before an active/standby switchover, check the synchronization status with :hbu-get-backup-info and issue the switchover command only when the status is 0x00000003. The U2000 NMS currently has no interface for this query, and adding one is recommended.
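The pre-switchover check recommended above can be written as a simple gate over status readings. The following is a minimal Python sketch, assuming the status values have already been collected by running :hbu-get-backup-info several times and noting the synchronization status each time. Treating 0x00000003 as "synchronization complete" is inferred from this case, and the function names and the requirement of several consecutive readings are illustrative assumptions, not documented rules.

```python
from typing import Sequence

SYNC_COMPLETE = 0x00000003  # status this case requires before a switchover


def safe_to_switch(samples: Sequence[int], required_consecutive: int = 3) -> bool:
    """Return True only if the latest readings all equal SYNC_COMPLETE.

    Assumption: asking for several consecutive readings is an extra
    safety margin chosen for this sketch.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(samples) < required_consecutive:
        return False
    return all(s == SYNC_COMPLETE for s in samples[-required_consecutive:])


def is_oscillating(samples: Sequence[int]) -> bool:
    """True if the status keeps changing and never reaches SYNC_COMPLETE."""
    return SYNC_COMPLETE not in samples and len(set(samples)) > 1


# Readings like those observed during this fault: 0x0 and 0x2 alternating.
faulty = [0x00000000, 0x00000002, 0x00000000, 0x00000002]
# Hypothetical readings from a healthy network element.
healthy = [0x00000002, 0x00000003, 0x00000003, 0x00000003]

print(safe_to_switch(faulty), is_oscillating(faulty))
print(safe_to_switch(healthy), is_oscillating(healthy))
```

With the readings seen in this fault, the gate refuses the switchover and flags the oscillation, which is the point at which the batch backup should be investigated instead.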
2. About the precaution notice
It is recommended that the command procedure in the precaution notice be improved: add instructions for the active/standby switchover, or add a step that manually triggers a batch backup, so that this problem cannot occur after the database on the active GSCC is deleted. A switchover caused by a hardware fault or another problem during handling remains possible.
It is also recommended that the precaution handling procedure avoid the :reset command, because front-line engineers are generally reluctant to run it directly on a live network.
3. About the command for releasing the network element busy state
This command must not be used casually on live network elements. Consult R&D before using it; otherwise it may release a busy state that belongs to normal processing and trigger other uncontrollable problems.
For this problem, the following command should be tried first to release only the batch-backup busy state, rather than releasing all busy states directly:
:sm-set-nebusy:0,40,0x9127; // Release the batch-backup busy state.
The technical information and SDH troubleshooting procedures in this chapter were collected and organized by Shenzhen Optical Transmission Network Technology Co. (www.opticaltrans.com); please retain this notice when reprinting.

