Service Incident Report: 12-15 May (ed.) 2010, LHCOPN primary link ASGC-CERN

(Copied from a SIR with date 12-15 April)

Description

Atlas users experienced an unreasonably low data transfer rate from CERN to ASGC. The CERN network team also saw small packets being lost in ping tests. However, the ASGC network team identified no quality issue between Taipei and Amsterdam, and its small-packet ping tests and history graphs showed no packet loss from Taipei to Geneva. There was, however, significant packet loss in ping tests with larger packets from Amsterdam to Geneva.

Impact

Thousands of Atlas data processing jobs were queued, waiting for the network to deliver their data to ASGC at a usable rate.

Analysis

After segment-by-segment network tests were carried out, a low power issue was identified at the GEANT Frankfurt PoP.

A low power issue is a very tough network problem to identify. In this case, users suffered from unreasonably slow data transfers, but the ASGC network team saw no packet loss in small-packet ping tests. Because the ping tests and their history looked clean, the ASGC network team delayed checking the network by 48 hours. A simple and straightforward file transfer test scheme on the LHCOPN network, which engineers could run themselves, might help to verify this kind of problem in the future.
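As an illustration of what a simple test of this kind could look like, the sketch below sweeps several ICMP payload sizes and records the packet loss for each, so that loss which only appears with larger packets is not hidden by a clean small-packet result. It is a minimal sketch, not a tool used during this incident: it assumes a Linux host with the iputils `ping` command (`-c`, `-s`, `-M do`, `-q`), and the function names, the default count of 20 packets and the 1% threshold are all hypothetical choices.

```python
import re
import subprocess

# Matches the summary line printed by iputils ping, e.g. "0% packet loss".
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def build_ping_command(host, payload_bytes, count=20):
    """Build a Linux iputils ping command (assumed flags).

    -q      quiet, print only the summary
    -c N    send N echo requests
    -M do   prohibit fragmentation, so oversized packets fail visibly
    -s N    ICMP payload size in bytes
    """
    return ["ping", "-q", "-c", str(count), "-M", "do",
            "-s", str(payload_bytes), host]


def parse_loss_percent(ping_output):
    """Return the packet loss percentage, or None if no summary was found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def run_command(command):
    """Run a command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


def sweep_packet_sizes(host, payload_sizes, runner=run_command):
    """Ping host once per payload size; map each size to its loss percentage."""
    return {
        size: parse_loss_percent(runner(build_ping_command(host, size)))
        for size in payload_sizes
    }


def suspicious_sizes(results, threshold_percent=1.0):
    """List payload sizes whose loss is above the threshold or unknown."""
    return [
        size
        for size, loss in sorted(results.items())
        if loss is None or loss > threshold_percent
    ]
```

Running `sweep_packet_sizes` against the far-end router with a mix of small and large payloads, and then `suspicious_sizes` on the result, would flag a path on which only the larger packets are being dropped.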

13/5 was a public holiday in Geneva and Amsterdam, which explains the delay before the Surfnet NOC opened the ticket.

Timeline

All times in UTC
2010/05/12 07:44: jwhuang@twgridNOSPAMPLEASE.org found that the single-file transfer rate from CERN to ASGC was less than 50kbps. He warned that there might be a network issue.
2010/05/12 07:55: after checking the ASGC SmokePing server, chenws@twgridNOSPAMPLEASE.org found no significant packet loss in the SmokePing history and dismissed the network issue.
2010/05/12 15:09: neteng@cernNOSPAMPLEASE.ch is notified of an issue affecting all transfers from CERN to ASGC.
2010/05/12 16:25: after checks on the CERN routers, neteng@cernNOSPAMPLEASE.ch opens a ticket with Surfnet, the circuit provider.
2010/05/12 17:14: neteng@cernNOSPAMPLEASE.ch moves the ASGC traffic to the CERN-ASGC backup link. The transfers improve, but the bandwidth is limited to 1.8Gbps.
2010/05/13 10:50: jwhuang@twgridNOSPAMPLEASE.org confirmed again that there was no problem on the server pools and that there must be a network problem.
2010/05/13 12:04: chenws@twgridNOSPAMPLEASE.org found that the return path from CERN to ASGC was routed over the Chicago backup link. He also found that the 2Gbps peering link was fully occupied, and reported that the issue was caused by congestion. He changed the route from ASGC to CERN back to the 10G primary link, but the return path remained on the backup link.
2010/05/13 17:45: chenws@twgridNOSPAMPLEASE.org was informed that data communication between ASGC and CERN was down, and that CERN had changed the routing path on purpose to bypass the issue. He was asked to change the route back to the backup link. After the route change, jwhuang@twgridNOSPAMPLEASE.org reported that data transfers had recovered.
2010/05/13 17:45: After some data transfer tests with jwhuang@twgridNOSPAMPLEASE.org, Edoardo confirmed that staying on the backup path was the best solution, although it was delivering only 50KB per second at that moment. He also opened a ticket with Surfnet, which provides the link to the ASGC router in Amsterdam.
2010/05/13 18:02: chenws@twgridNOSPAMPLEASE.org confirmed that there was no quality issue between Taipei and Amsterdam. He also found significant packet loss in an 8973-byte ping test from router interface to router interface, and suspected an MTU issue between Amsterdam and CERN.
2010/05/13 18:17: Edoardo replied that the MTU setting was correct. His tests showed that normal ping tests with small packets between the router interfaces were also losing packets. He believed it was a link issue between Amsterdam and CERN and suggested waiting for the Surfnet response.
2010/05/14 07:30: After 12 hours without an update, chenws@twgridNOSPAMPLEASE.org sent an email to 'helpdesk@nicNOSPAMPLEASE.surnet.nl' and noc@netherlightNOSPAMPLEASE.net asking them to speed up the process.
2010/05/14 08:18: Surfnet opens the ticket and starts investigating. The ticket name is Netherlight-20100514-1.
2010/05/14 09:45: Peter Tavenier of Netherlight [Peter.Tavenier@sara.nl] replied that no errors were found on the Netherlight switches. He asked for the status of the Taipei-Amsterdam IPLC link to be checked.
2010/05/14 10:33: Peter Tavenier of Netherlight proposed attaching an iperf server between Amsterdam and CERN to run performance tests. At 10:46 chenws@twgridNOSPAMPLEASE.org confirmed that he would run the iperf test.
2010/05/14 11:13: chenws@twgridNOSPAMPLEASE.org provided IP information to Peter Tavenier. At 11:33 Peter replied that the circuit had been detached and that an iperf server had been set up facing the Amsterdam router.
2010/05/14 12:05: Peter of Netherlight and chenws@twgridNOSPAMPLEASE.org both confirmed that all ping tests with a packet size larger than 1504 failed: we were running into a maximum MTU of 1504 somewhere in Amsterdam.
2010/05/14 14:05: chenws@twgridNOSPAMPLEASE.org identified an MTU mismatch on the ForthTen switch behind the Amsterdam router. After the MTU mismatch was fixed, the iperf test and the large-packet ping test ran well between the Amsterdam router and the Netherlight iperf server.
2010/05/14 14:38: Peter Tavenier attached the iperf server facing Geneva to run the iperf test. He found that there was indeed a packet loss issue between Amsterdam and Geneva. His conclusion: a lot of packet loss, and the bigger the packet size, the worse it gets. He decided to call the GEANT service.
2010/05/14 15:36: According to the Netherlight ticket, GEANT found a low power issue in Frankfurt; an engineer is on the way, no ETR yet.
2010/05/14 22:54: GEANT reports that the problem has been fixed.
2010/05/14 23:05: ASGC checks the link and finds it OK.
2010/05/15 00:03: ASGC moves the traffic back to the primary link.
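
The 12:05 and 14:05 entries above come down to finding the largest packet that still crosses a segment. The following is a minimal sketch of that search, not the procedure actually used: `probe` is a hypothetical callable that returns True when a don't-fragment ping of the given size gets through, and the search assumes that every size below the limit passes and every size above it fails. The header constants are the standard 20-byte IPv4 header (without options) and 8-byte ICMP echo header.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def find_largest_passing_size(probe, low, high):
    """Binary search for the largest size in [low, high] that probe accepts.

    probe(size) is a hypothetical callable returning True when a packet of
    that size gets through. Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

A hard cutoff such as the 1504 limit in Amsterdam shows up clearly with this kind of pass/fail search. The Frankfurt low power fault behaved differently: loss grew gradually with packet size, so it needs loss percentages per size rather than a single pass/fail boundary.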

-- JamieShiers - 28-May-2010

Topic revision: r1 - 2010-05-28 - JamieShiers
 