Week of 051010
Open Actions from last week:
- FTS server not visible from the VO box -> deploy another one (Simone + Gavin).
DONE
- Fix the problem in the stager to remove the disk-only files (Olof)
DONE
- cleanup and re-lay files again (James)
DONE
- publish procedure for FTS to publish in BDII (James/Gavin)
DONE
- Remove IA32 nodes from wan cluster until problem solved (Olof)
NOT DONE
- Fix I/O Error on oplapro79 (Vlado)
DONE
- FTS support want to change behaviour of LHCb retry agent to fit their model better (Gavin)
DONE
- check possibility of using LCG Quattor WG components for LFC/DPM/... (Jan/Vlado/Sophie)
TO TEST
- Danielle and Olof to do a debug session
DONE
- Escalate problem with lost packets on loopback I/F to linux.support (Olof)
IN PROGRESS
On Call:
Monday:
Log: Problem with unosat VOMS boxes. Also mail from Lionel about problems with lxshare218d.
New Actions:
- Look at VOMS server issues (Jean-Philippe/Maarten)
DONE
- Check IPTABLES rules (Maarten/James)
DONE
- bundle clients for new release into LCG (Gavin/Louis)
Discussion:
- Olof will do a new release of castor this week to fix many bugs.
Tuesday:
Log: nothing. Rate peaked at 180MB/s yesterday
New Actions:
- LHCb found an issue with doing ESTOR on the castor2 gridftp server (with srmcp client). Ben to check
DEPLOYED
Discussion:
- VOMS issue is a known bug - fixed in 1.4.1 - workaround is to reboot the VOMS server every 12 hours
- I/O error on oplapro79 - this will be cleaned up by Vlado (requires manual intervention by FIO, not sysadmin, since we don't have hardware cover on the IA64 nodes)
- New version of castor scheduler today
- New security patch version of SLC3 - try to schedule a restart of the FTS and LFC nodes on Thursday to coordinate with the FTS upgrade to 1.3 QF23
- Vlado debugged the iptables issue. Will take the rpm off the nodes until the problem is fixed (running iptables --list will cause the problem again since the kernel module is reloaded)
Wednesday:
Log: Derek reported problems with gridftp truncation.
Actions:
- Vlado to continue the investigation into loopback traffic problem. Trying to tweak kernel buffers.
- Test new QF23 FTS clients (James)
- Meeting with Joel/LHCb in 2-1-34 at 4 to discuss staging.
DONE
Discussion:
- We should use wan-data.operations for problem reporting.
- Problem with agents in QF23 - wrong tag was given to the build team - need a rebuild.
- oplapro79 still out of production - problem with disk.
Thursday:
Log: nothing
Actions:
- All services - kernel upgrade to be scheduled.
Discussion:
- FTS (Gavin and Paolo): matching the configuration today. The configuration of the pilot node has to be adapted; a modification is needed for the new update. The next update of the pilot service is foreseen for probably next week, depending on how far the configuration fixes have progressed (Gavin and Paolo). The pilot server will be updated to V1.4.1, while the service node will be updated to the latest 1.3.
- CMS is working with an SE in Taiwan using srmcopy. The commands have to be optimized. This has been communicated to the site manager, and emails have been sent to CNAF and PIC asking them to follow the same procedure (Olof)
Friday:
Log: nothing reported. Problem with castor2 database - high load. Could be related to excess polling processes on SRM.
Actions:
- double monitor desc in lemon - we get no LCG_MON_GRIDFTP (Jan)
DONE
- FTS/LFC OS upgrade - will try to do at same time as the DB backend if that can happen next week (Andrea/Jan/...)
Discussion: