Problem Description
On May 19, 2012, at 01:18:23, network element 2249-Zhongcao recorded a large number of abnormal alarms. According to the alarms, the boards in slots 2, 8, 9, 10, and 16 reported board-not-in-position alarms. At the same time, a large number of alarms were reported for power supply abnormality, component failure, bus failure, chip failure, and so on. The related alarm information is: powerfail, BUS_ERR
Processing
1. At 19:00 on May 19, 2012, we arrived on site to locate the problem. Analysis of the alarms pointed to a power supply abnormality. We queried the reset records of the equipment and found no abnormal reset at the equipment level; however, some boards had reset abnormally. The reset records of the boards are 04 and 10. Since it could not be confirmed whether a power failure was the cause, we collected data for 800 to analyze.
2. During fault localization, slot 9 reported a board-not-in-position alarm, slot 11 reported a serial port communication failure alarm, and slot 8 reported a power supply abnormality alarm. The other slots had no abnormal alarms. Each of the three boards was unplugged and reseated; the board in slot 11 returned to normal, while the other two remained faulty.
3. On May 20, 2012, the optical board in slot 8 and the cross board in slot 9 were replaced. After the replacement, the power abnormality alarm on slot 8 cleared, but on slot 9 the logical board was found to have been deleted, so the board could not start working normally.
4. At 03:00 on May 21, 2012, the cross board in slot 10 reported BUS_ERR alarm again.
Data analysis:
The slot 8 board reported a powerfail alarm. The alarm parameters show that the board's 3.3 V power supply was abnormal and that the backup 3.3 V supply was in use. After the slot 8 board was replaced on the live network, the powerfail alarm disappeared, indicating that the power module of the slot 8 board may be abnormal:
POWER_ABNORMAL MJ start 2012-05-19 02:18:41 None SA NEW_BOARD board=8;01 00 01 00 05 ;
The slot 9 cross board reported a not-in-position alarm. The frontline engineers who handled the board on site found that it could not be powered up and that its indicators did not light. The initial suspicion was that the power module had failed or the fuse had blown.
At 03:00 on May 21, the slot 10 cross board transiently reported type 1 and type 3 BUS_ERR alarms. A type 1 BUS_ERR alarm means bus LOS, and a type 3 BUS_ERR alarm means internal bus OOF. These two alarms indicate an anomaly in the cross chip of the slot 10 cross board:
BUS_ERR CR end 2012-05-21 03:00:49 2012-05-21 03:03:40 SA NEW_BOARD board=10;10 01 17 01 ff ;
BUS_ERR CR end 2012-05-21 03:03:53 2012-05-21 03:04:30 SA NEW_BOARD board=10;08 02 17 01 ff ;
BUS_ERR CR end 2012-05-21 03:03:53 2012-05-21 03:04:30 SA NEW_BOARD board=10;0b 01 17 01 ff ;
BUS_ERR CR end 2012-05-21 03:02:44 2012-05-21 03:04:50 SA NEW_BOARD board=10;07 07 02 03 ff ;
BUS_ERR CR end 2012-05-21 03:02:44 2012-05-21 03:04:50 SA NEW_BOARD board=10;07 08 02 03 ff ;
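The alarm records quoted in this section share one layout: alarm name, severity code, state, raise time, clear time (or None), two further fields, the board number, and a string of hex parameter bytes. As a rough illustration only, the Python sketch below parses such records and computes how long each cleared alarm lasted. The `AlarmRecord` class, its field names, and the regular expression are assumptions inferred from the sample lines, not a documented export format, and the meaning of the individual parameter bytes is deliberately left undecoded.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Layout inferred from the sample lines only (an assumption, not a vendor spec).
TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
RECORD_RE = re.compile(
    r"\b(?P<name>[A-Z_]+)\s+(?P<severity>[A-Z]{2})\s+(?P<state>start|end)\s+"
    rf"(?P<raised>{TIMESTAMP})\s+(?P<cleared>None|{TIMESTAMP})\s+"
    r"\S+\s+\S+\s+board=(?P<board>\d+);(?P<params>[0-9a-fA-F ]+?)\s*;"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class AlarmRecord:
    """One alarm line; field names are hypothetical labels for this sketch."""
    name: str
    severity: str
    state: str
    board: int
    raised: datetime
    cleared: Optional[datetime]
    params: List[int]  # raw parameter bytes, not interpreted here

    @property
    def duration_seconds(self) -> Optional[int]:
        """Seconds between raise and clear, or None if the alarm is still active."""
        if self.cleared is None:
            return None
        return int((self.cleared - self.raised).total_seconds())


def parse_alarm_records(text: str) -> List[AlarmRecord]:
    """Extract every record found in text, whether one per line or run together."""
    records = []
    for m in RECORD_RE.finditer(text):
        cleared = m["cleared"]
        records.append(AlarmRecord(
            name=m["name"],
            severity=m["severity"],
            state=m["state"],
            board=int(m["board"]),
            raised=datetime.strptime(m["raised"], TIME_FORMAT),
            cleared=None if cleared == "None" else datetime.strptime(cleared, TIME_FORMAT),
            params=[int(byte, 16) for byte in m["params"].split()],
        ))
    return records


# The records quoted in this section, copied verbatim.
SAMPLE = """
POWER_ABNORMAL MJ start 2012-05-19 02:18:41 None SA NEW_BOARD board=8;01 00 01 00 05 ;
BUS_ERR CR end 2012-05-21 03:00:49 2012-05-21 03:03:40 SA NEW_BOARD board=10;10 01 17 01 ff ;
BUS_ERR CR end 2012-05-21 03:03:53 2012-05-21 03:04:30 SA NEW_BOARD board=10;08 02 17 01 ff ;
BUS_ERR CR end 2012-05-21 03:03:53 2012-05-21 03:04:30 SA NEW_BOARD board=10;0b 01 17 01 ff ;
BUS_ERR CR end 2012-05-21 03:02:44 2012-05-21 03:04:50 SA NEW_BOARD board=10;07 07 02 03 ff ;
BUS_ERR CR end 2012-05-21 03:02:44 2012-05-21 03:04:50 SA NEW_BOARD board=10;07 08 02 03 ff ;
"""

for record in parse_alarm_records(SAMPLE):
    print(record.name, record.board, record.state, record.duration_seconds)
```

Run on the records above, the sketch shows that every BUS_ERR alarm on board 10 cleared within about three minutes of being raised (the longest lasted 171 seconds), which is consistent with the description of the alarms as transient.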
Root Cause
After the boards from slots 8, 9, and 10 were returned to R&D, they were powered up and analyzed in the lab:
1. Analysis of the slot 9 cross board: the board did not power up and all indicators were off. A multimeter test of the power module input voltage read 0 V. The fuse F519 in front of the power module was then tested and found to be blown. After F519 was replaced and the board was powered up again, the board started and ran normally. A 24-hour service test showed the board working normally with no service interruption.
2. Power-up analysis of the slot 10 cross board: the board powered up and started normally. Services were configured and monitored over a long period. The services were normal, there were no abnormal alarms and no type 1 or type 3 BUS_ERR alarms, and the fault did not recur.
3. Power-up analysis of the slot 8 board: the board powered up and started normally. Services were configured and monitored over a long period. The services were normal, there were no power module abnormality alarms, and the fault did not recur.
The faults on the slot 8 and slot 10 boards did not recur in the lab. What, then, caused the slot 8 power module output to shut down and the slot 10 high-level chip to malfunction on the live network at the time? And what caused the fuse on the slot 9 cross board to blow? Careful inspection of the three returned boards showed water stains on all three. There was a thunderstorm at the time of the failure, so the water stains on the board surfaces are most likely rainwater, and the rainwater caused the three boards to malfunction. The water stains are shown in the following figures:
1) Water damage on the slot 8 board is shown in the following figure; there is also water damage on the power module:
2) Water damage on the slot 9 cross board is shown in the following figure; there is water damage near the power module:
3) Water damage on the slot 10 cross board is shown in the following figure; there is water damage near the high-level chip:
Solution
The three boards at Guiyang Unicom failed because rainwater short-circuited them. Please inspect the equipment room environment so that the equipment does not get wet again.
The technical information in this chapter and the SDH equipment troubleshooting process were collected and organized by Shenzhen Optical Transmission Network Technology Co. (www.opticaltrans.com); please retain this notice. The company specializes in sales of Huawei SDH optical transmission equipment.







