Computing Technical Design Report

3.14 Software Infrastructure

3.14.1 Introduction and Description

The ATLAS Software Infrastructure Team (SIT) is responsible for providing and maintaining the ATLAS software development environment. Although the SIT is part of the offline software community, it provides services to all communities that use the offline software. The SIT work includes:

Currently approximately 25 people, representing about 9 FTEs of effort, are actively working in the SIT. Recent estimates indicate that the SIT is about 5 FTEs short of the staffing level needed to perform its assigned tasks fully, and it is therefore unable to cover all of its functions as it would wish.

3.14.2 Code Management

3.14.2.1 Concurrent Version System (CVS)

ATLAS uses the Concurrent Version System (CVS) [3-40] for the source-code repository and as the low-level management tool for the software code base. CVS supports the decomposition of the code base into a hierarchically organized set of packages, each containing a related set of files. These files typically include the implementation and header files for C++ classes, together with documentation and configuration files. A package usually maps onto a concrete component of the Athena architecture (see footnote 2), or defines a grouping of such components. CVS provides mechanisms that let one or more developers check out a package or set of packages, modify, add, or remove files within it, and commit the changes back to the repository so that they are available to other developers. CVS allows multiple developers to make such changes simultaneously and, if there are no inconsistencies, merges them together as they are committed. If conflicts are detected, CVS identifies them and requires that one developer resolve them before the merge is completed. Any package or set of packages may be tagged so that the tagged version can be recovered at any time in the future.

CVS implements a secure set of protocols in order to allow for check-outs and commits, based on a source-code repository that is managed by CERN IT [3-41] on a cluster of servers using a shared file system for performance and redundancy.

3.14.2.2 Code Management Tool (CMT)

The Code Management Tool (CMT) [3-42] is used as the basis for building consistent sets of packages and versions into a so-called ATLAS release, and as the basis for supporting check-out and testing of packages during the software development process. Each package contains a configuration or requirements file that identifies which other packages it depends upon, the library, set of libraries and/or applications that it creates, and other configuration information that is used by packages that depend upon this one or that is necessary in order to establish a consistent run-time environment. Standard patterns and strategies support common operations such as compilation of a set of C++ implementation files into a library, and provide support for different operating system and compiler combinations.

CMT allows dependencies to be examined, and circular dependencies to be detected.
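The idea of circular-dependency detection can be illustrated with a short sketch. This is not CMT code and does not read real requirements files: it assumes the dependencies have already been extracted into a dictionary, and the package names in the example are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of package names, or None.

    `deps` maps a package name to the packages it uses (an assumed,
    pre-extracted form of the information held in requirements files).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {pkg: WHITE for pkg in deps}
    path: List[str] = []

    def visit(pkg: str) -> Optional[List[str]]:
        state[pkg] = GREY
        path.append(pkg)
        for dep in deps.get(pkg, []):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path, so the path closes a loop
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[pkg] = BLACK
        return None

    for pkg in list(deps):
        if state[pkg] == WHITE:
            found = visit(pkg)
            if found:
                return found
    return None


# Hypothetical package names, for illustration only.
acyclic = {"PkgC": ["PkgB", "PkgA"], "PkgB": ["PkgA"], "PkgA": []}
cyclic = {"PkgA": ["PkgB"], "PkgB": ["PkgC"], "PkgC": ["PkgA"]}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

A depth-first search is used here because it reports the actual chain of packages that forms the loop, which is what a developer needs in order to break it.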

3.14.2.3 Tag Collector

The Tag Collector is a web-based database application for release management. It allows code developers and librarians to select which versions (CVS tags) of each package are used to build a release. The first version was designed and implemented during the summer of 2001. The tool proved extremely successful, and a second, more powerful version known as Tag Collector II was introduced in the second quarter of 2005.

Both Tag Collector versions allow builds of ATLAS software to be prepared in a controlled fashion. The tool is interfaced with both CVS and CMT. Developers can interactively select the set of packages and their CVS tags to be included in a build, and the complete build commands are produced automatically. Other features include verification of the CMT requirements files of container packages, and direct links to the package documentation. Tag Collector II provides fine-grained management of user rights, some automated CVS tagging, and support for project-based builds, in which the total ATLAS offline software build is replaced by several smaller independent builds. It is anticipated that this feature will enable a more efficient release cycle. Tag Collector II is based on the AMI generic database management software [3-43]. All Tag Collector commands can be accessed through the AMI web service.
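As a rough illustration of how a selection of package tags can be turned into commands automatically, the sketch below maps a dictionary of package names and CVS tags onto standard `cvs checkout -r <tag> <module>` command lines. It does not reproduce the Tag Collector or the real build commands; the function name, package names and tag strings are hypothetical.

```python
from typing import Dict, List


def checkout_commands(selection: Dict[str, str]) -> List[str]:
    """Build one `cvs checkout` command line per selected package.

    `selection` maps a package name to the CVS tag chosen for the build.
    Packages are sorted so that the output is reproducible.
    """
    return [
        f"cvs checkout -r {tag} {package}"
        for package, tag in sorted(selection.items())
    ]


# Hypothetical selection, for illustration only.
selection = {"PkgB": "PkgB-00-02-01", "PkgA": "PkgA-00-01-00"}
for command in checkout_commands(selection):
    print(command)
```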

3.14.2.4 Dependency Checking

The checkreq tool performs several internal consistency checks within each software package. Its main purpose is to detect dependencies between packages that are either unnecessary or have been overlooked in the CMT requirements file for the package. Checkreq first does a limited analysis of the requirements file of the package to collect the names of the other packages referenced. It then does a limited analysis of the source and header files contained in the package to assemble a list of the files actually referenced there via include statements. Finally it checks whether the included files, translated to package names, match the packages listed in the requirements file, and issues appropriate warnings if they do not. Additional checks are performed on the set of packages within a release, e.g. on the consistency of package versions. The tool is implemented as a shell script and is maintained within the SIT. Checkreq is routinely run on each referenced package during the nightly build procedures.
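The comparison that checkreq makes can be sketched as follows. This is not the checkreq shell script; it is a simplified illustration that assumes every relevant include has the form `Package/Header.h`, so that the first path component is the package name, and that dependencies are declared on `use <Package> ...` lines of the requirements file. All package and file names are hypothetical.

```python
import re
from typing import Dict, Set, Tuple

# Matches lines such as: #include "Pkg/Header.h" or #include <Pkg/Header.h>
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def used_packages(sources: Dict[str, str]) -> Set[str]:
    """Collect package names from include paths of the form Package/Header.h."""
    used = set()
    for text in sources.values():
        for path in INCLUDE_RE.findall(text):
            if "/" in path:  # skip plain system headers such as <vector>
                used.add(path.split("/", 1)[0])
    return used


def declared_packages(requirements: str) -> Set[str]:
    """Collect package names from 'use <Package> ...' lines."""
    declared = set()
    for line in requirements.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "use":
            declared.add(words[1])
    return declared


def check_requirements(
    requirements: str, sources: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Return (overlooked, unnecessary) sets of package names."""
    declared = declared_packages(requirements)
    used = used_packages(sources)
    return used - declared, declared - used


# Hypothetical package contents, for illustration only.
requirements = "package MyPkg\nuse PkgA PkgA-00-01-00\nuse PkgB PkgB-00-02-01\n"
sources = {
    "MyAlg.cxx": '#include "PkgA/Tool.h"\n#include "PkgC/Event.h"\n#include <vector>\n'
}

overlooked, unnecessary = check_requirements(requirements, sources)
print("overlooked:", sorted(overlooked))
print("unnecessary:", sorted(unnecessary))
```

In this example `PkgC` is included but not declared, and `PkgB` is declared but never included, so each ends up in one of the two warning sets.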

3.14.3 External Packages

The ATLAS offline software depends upon a set of externally developed and supplied software packages, including event generators, simulation tools, and services and toolkits such as CLHEP [3-8]. In many cases the external packages depend upon each other, and it is important that a consistent set of package versions is used. CMT supports access to external packages through glue or interface packages, which identify the required package versions and provide access to the corresponding libraries and header files.

Since the offline software must operate in the high-level-trigger environment it is important that there is consistency in the package versions for external packages that are shared by both the offline and online/TDAQ systems. An explicit liaison procedure has been put into place to ensure such consistency.

The categories of external packages that are used by the ATLAS offline software are:

3.14.4 Platforms and Compilers

Since January 2005 the default Linux version at CERN has been Scientific Linux CERN 3 (SLC3) [3-44]. It is a CERN-customized Linux distribution built on top of a common base platform, Scientific Linux, which is in turn built from freely available Red Hat Enterprise Linux 3 sources by a joint Fermilab and CERN effort. SLC3 is built to integrate into the CERN computing environment but it is not a site-specific product. Between January and June 2005, SLC3 replaced Red Hat Linux 7.3 on essentially all machines in the CERN computer centre. The introduction of SLC3 required the software building team to expend considerable effort to overcome significant compatibility issues between the build procedure, the software, and SLC3.

In addition to supporting SLC3, ATLAS has made, and continues to make, some effort to support other platforms. The ATLAS software has been demonstrated to run acceptably on AMD Opteron machines in 32-bit mode. Work is under way both within LCG and ATLAS to support Opteron running in 64-bit mode. LCG previously provided support on the Intel Itanium (IA-64) for the LCG-written external packages used by ATLAS, but this effort is currently dormant because of the lack of acceptance of the Itanium processor. ATLAS has also put some effort into porting the ATLAS software to Macintosh OS X, but although there is considerable interest within the software developer community, lack of available manpower has limited progress on this port. The SIT is also actively working on ways to make it easier to build the ATLAS software and the associated externals on different Linux distributions (currently only the SLC3 variant of Red Hat Enterprise Linux 3.0 is officially supported). The plan is to collaborate with the various groups that need to run the software on clusters with other Linux distributions installed.

Distcc [3-45] is a fast, free distributed compiler that the SIT is investigating as a way to speed up the compilation of ATLAS software. The offline software now takes approximately one day to build on one of the best machines available at CERN. Distcc distributes builds of C, C++, Objective C and Objective C++ code across several machines on a network. When the software is properly configured, distcc generates the same results as a local build, is simple to install and use, and is much faster than a local compile. It does not require all machines to share a file system, to have synchronized clocks, or to have the same libraries or header files installed. ATLAS has dedicated access to five distcc server machines.

3.14.5 Releases and Release Strategy

A hierarchy of release builds is used to ensure rapid feedback of package integration problems and as testbeds for testing and validation. The hierarchy is:

3.14.5.1 Project Releases

Since their inception, ATLAS offline software releases have been made for the entire code base. This has put stress on the release-build hardware, and it also compromises the robustness of the software to changes in the core packages. A reorganization is therefore under way, with the goal of splitting the offline software into distinct projects, each project possibly depending upon other projects but having its own release cycle and support tools (e.g. Tag Collector, release coordinator). Packages within each project can only depend upon other packages within the same project or in a lower project.
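The dependency rule for project-based builds can be checked mechanically, as the following sketch suggests. It assumes, for simplicity, that the projects form a single ordered list from lowest to highest; the project names, package names and function name are hypothetical and do not represent the actual ATLAS decomposition.

```python
from typing import Dict, List, Tuple


def layering_violations(
    project_order: List[str],
    package_project: Dict[str, str],
    package_deps: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (package, dependency) pairs that point to a higher project.

    project_order lists the projects from lowest to highest (an assumed
    linear ordering); package_project maps each package to its project.
    """
    level = {name: index for index, name in enumerate(project_order)}
    violations = []
    for package, deps in package_deps.items():
        for dep in deps:
            if level[package_project[dep]] > level[package_project[package]]:
                violations.append((package, dep))
    return violations


# Hypothetical projects and packages, for illustration only.
order = ["LowProject", "MidProject", "HighProject"]
package_project = {
    "Kernel": "LowProject",
    "Tracks": "MidProject",
    "Plots": "HighProject",
}
package_deps = {
    "Kernel": ["Plots"],   # low project using a high project: not allowed
    "Tracks": ["Kernel"],  # allowed, depends on a lower project
    "Plots": ["Tracks"],   # allowed, depends on a lower project
}

print(layering_violations(order, package_project, package_deps))
```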

The tentative project decomposition is:

The assignment of packages to these projects is under way, with the goal that these project-based builds will be in production in September 2005.

3.14.5.2 Release Coordinator

The Release Coordinator has overall responsibility for the release builds, having a term of duty that extends over a single production release cycle (i.e. from one production release until the next one). They have ultimate authority over whether to delay a release or reject late submissions, etc.

The Release Coordinator makes sure the (developer and production) release deliverables are met, by staying in contact with the developers and coordinators involved. Given the importance of functionality of the nightly builds and developer releases for the success of a production release, the Release Coordinator regularly monitors the functionality of the nightly release to make sure that no serious problem remains unfixed. The Release Coordinator coordinates major changes to minimize disruption and ensures that developers are informed about major changes, and about the status/usability of developer releases.

3.14.6 Code Distribution

The architecture of the code distribution for the ATLAS software is based on a combination of the CMT code management tool and the Pacman [3-46] code packaging tool. A set of shell scripts that are part of the librarian toolkit perform the appropriate queries to the CMT knowledge base, which is provided by the package authors in the form of package requirements files, and construct a Pacman cache from the results. Users wishing to install the ATLAS software can simply use Pacman to download and install it on their machines. Although the system is in production (it has been used heavily for all recent data production activities), several elements are still evolving, since some aspects of the functionality, such as incremental building of the kits and increased traceability, are not yet completely implemented.

3.14.7 Quality Assurance and Quality Control (QA/QC)

The essential difference between QA and QC is that the former describes an approach to software development, whereas the latter implies that development is subject to a series of tests measuring the software quality. The QA/QC coordinators pursue a two-pronged approach providing support for voluntary peer reviews of a technical nature, and also support for a tool which checks the compliance of code with ATLAS C++ coding rules.

Until 2005, the number of technical reviews within ATLAS software was relatively small. In the first half of 2005, ATLAS software management adopted the QA/QC recommendations for review procedure in a series of 10 non-technical reviews. This experience has had a positive effect in that it has encouraged sub-groups to organize their own technical reviews, following the same procedure.

ATLAS has adopted the "RuleChecker" code parser [3-47]. The initial contract with the vendor was terminated before it was technically possible to run the tool over an entire ATLAS release. Since March 2005 this has become possible using the RTT tool, which gives a better view of the ensemble of the results. The tool was used with some success by the database group during their January 2005 documentation drive. Some RuleChecker parsing bugs have been exposed, and these are rapidly fixed by the developers. However, systematic use of the tool has also exposed weaknesses and inconsistencies in the coding rules, and any modification of the tool to accommodate changes to these rules will require a new contract with the vendor.

Quality Control is discussed in detail in Section 3.13, "Testing and Validation".

3.14.8 Documentation

At present the documentation of ATLAS software exists at multiple places and in various forms. The principal sites for access to documentation are:

An effort is under way to assess, upgrade and streamline the existing documentation [3-52]. Special attention is paid to documentation intended for users who are new to ATLAS. ATLAS is following the example of BaBar, which has had very good experience with workbook-style documentation for newcomers.

Recommendations for web pages and usage of Doxygen will ensure better control of correctness and timely maintenance of the information. Each page must list at least the responsible person and the date of the last significant update.

Documentation is a common effort. Once the initial layout and the definition of rules and templates are available, the contributions of experienced developers, and the review by experienced and new users must be encouraged. Communication and cooperation between the software and physics communities, and the developers and users is needed to keep the documentation up-to-date and useful, and to avoid duplication of effort.

3.14.9 User Support

ATLAS uses the Savannah [3-53] bug tracking system for reporting problems and for following the progress of "tasks". Problems can be posted to Savannah anonymously, or under a registered username if a user chooses to log in. Currently ATLAS has 36 bug tracking groups (projects) in Savannah. One of these (ATLAS Bugs) is a general group intended for users who do not know to which group their bug should be posted. The manager of a bug group assigns each bug to a member of the group, who then becomes responsible for following the bug and seeing that it is fixed. Every action within Savannah generates an automatic email to all of the people working on the bug, the group managers, and the person who submitted the bug (unless the submission was anonymous).

The "atlas-sw-help" mailing list was originally set up to reduce traffic on the "atlas-sw-developers" list. Membership was meant to be limited just to "experts" who would feel duty-bound to respond to requests, which would often be from beginning users. However, for various reasons this mechanism has not been entirely successful and the SIT is currently considering the future of this list. It seems likely that membership will be reset to a reduced number of people, and that a moderator will be put in place. Even though many people do not feel that a mailing list is the best solution for this kind of user support, all attempts to replace the mailing list mechanism by a forum or news group have failed.

Another aspect of user support is managing disk space requests and allocations, both on behalf of ATLAS users, and for ATLAS operations. A general support mailing list ("atlas-support") is available for these requests. Disk space management is particularly important for the ATLAS releases. A policy has been established to control the deletion of obsolete releases and to allow ample time for physicists to migrate to newer and more functional releases. Releases are archived prior to deletion.

3.14.10 HLT Coordination

Figure 3-15 shows the dependencies between the online, high-level trigger, offline, and external software. This tight coupling requires good coordination to ensure that consistent package versions are maintained. Representatives from the online and HLT communities are invited to the regularly scheduled SIT meetings, and to the ATLAS computing-management meetings to ensure good communication for this and other coordination issues.


Figure 3-15 Online, High-Level Trigger and Offline dependencies

 

3.14.11 Mailing Lists

Use of mailing lists is an essential part of the communication between ATLAS collaborators. Many lists are used, targeted at different user and developer communities. The CERN SIMBA2 mailing list management system is used to control access, both for subscribing to receive email notifications, and for authority to be able to post to specific lists. The lists themselves are archived and provide a web interface.


1. CVS, CMT and the Tag Collector are described in detail in Section 3.14.2, "Code Management".

2. The Athena architecture and framework are described in Section 3.3, "The Athena Framework".



4 July 2005 - WebMaster

Copyright © CERN 2005