



A History of Scientific Text Processing at CERN

Michel Goossens, IT/API


Abstract

Text processing has always had a special place in the computing environment at CERN and elsewhere. Processing documentation is an essential part of communication: to explain how to use (software and other) tools, describe procedures, and publish results. In fact, CERN has more often than not used ``state of the art'' standard tools. In this article I review the main scientific text processing systems that have been in use at CERN since the early 1970s. I will show how they evolved logically over time into the present situation, where the main processors can be optimally integrated via XML technologies.


First a word of caution. Text processing systems have evolved considerably over the last few decades. Electronic typewriters acquired memory and developed into a large set of incompatible word-processing systems on dedicated machines (e.g. Norsk Data, Wang, AES, Philips, IBM, Olivetti, and Nixdorf), all of which were used in various CERN services in the 1970s and 1980s. This overview does not discuss these dedicated systems, but limits itself to (scientific) text processors that were available on CERN's central computing facilities.

The early days: From the typewriter to Waterloo Script

The first issue of the Computer Newsletter (dated 15 February 1966) was produced on a classic typewriter. Later issues were put together by more complex means (no record remains of which system was used), with ``cut and paste'' techniques for including figures, tables, etc. playing an important role.

It has to be remembered that commonly available printers had very limited capabilities in the 1970s, with uppercase-only output being the norm. It is thus no surprise that text processing systems on general-use computers only began to appear when printers became more flexible.

The first text formatting program I found mentioned in the CNL (88, December 1973) was BARB (CERN Program Library entry Q500), which was a program ``TO EASE THE CHORE OF OPERATING AND UPDATING PROGRAM WRITEUPS''. It had text strings interspersed with format control and allowed uppercase-only texts with titles, subtitles, appendices, headings, justification (left, right, centered) and boldface. Column mode (an elementary way to represent tabular information) was also available.

BARB was superseded a year later by BARBA (BARB ASCII, CERN Program Library entry Q501), which allowed upper and lowercase letters, as well as underlining. The input to these programs was generally in ``card form'', with column 1 used for control, columns 2-72 for text and control parameters, and columns 73-80 for sequence information, which was not read by the processor but came in handy when one dropped a box of punched cards (in those days the preferred input medium for computer information).

Also in the mid-1970s, Horst von Eicken developed AUTHOR (CERN Program Library Q510), an interactive text processing system for the Control Data 6000/Cyber computer systems. Its character set was a superset of ASCII, including the Greek alphabet and some mathematical symbols, and it allowed for tabular input and other layout commands. Version 2 was released in 1978 (CNL 136, August 1978).

At about the same time, Tony Shave wrote SIMTEX (SIMple TEXt processor). SIMTEX was written in BCPL and hence ran on a fairly wide variety of computers (CDC 6000, IBM 370, Nord 10, PDP-11, VAX, HP 2100). The CERN-written documentation of BCPL and of the first generation of CERN microprocessor cross-software was prepared with SIMTEX.

In the meantime an IBM mainframe had been installed at CERN, and the PEO (Program Enquiry Office) decided to transfer all its documentation work to the IBM (CNL 132, April 1978). To do this they used a program to convert BARBA files into SCRIPT files, which could then be processed with Waterloo SCRIPT (e.g., using SYSPUB, a simplified set of SCRIPT macros). This was the beginning of the SCRIPT era at CERN; that formatter would remain the basis of most text processing work at CERN until the advent of personal workstations in the late 1980s.

The first laser printers

As already mentioned, the quality and functionality of general text processing systems is closely linked to the available output devices. It should therefore come as no surprise that the arrival at CERN in April 1979 (CNL 143) of the first laser printer, an IBM 3800, opened up a new realm of possibilities for higher-quality typesetting. Anders Berglund, who was to shape text processing at CERN in the 1980s, showed in a further article in September 1979 (CNL 147) how, with the SYSPUB macros, one could obtain accents using a ``EURO'' character set, as well as miscellaneous other characters for composing block diagrams (the first time that such complex output was possible without having to use a plotter).

More character sets became available at the beginning of 1980 (CNL 149), and they offered for the first time a choice between various type sizes (10, 12, and 15 characters per inch). Moreover, in the same issue of the CNL Berglund contributed an article ``SCRIPT as an Aid in Preparing Papers for Physics Results'', where he gave some hints on how to prepare publications for submission to physics journals. He described PHYSPAP, a macro set developed at CERN and based on Waterloo SCRIPT's SYSPAPER, which allowed high-quality output to be sent to the Photon photo-typesetter (subsequently replaced by a Compugraphics photo-typesetter) connected to a NORD 100 computer, and which was customised for styles similar to those of the Nuclear Physics journal. Many scientific symbols were available, and one-line equations could be typeset, e.g.,

E[(X-@m)@2] = &s'-@B.@I&S'@B.(x-@m)@2 f(x) dx = @s@2
would produce something like:
\[ E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \sigma^2 \]

One notes the frequent use of the @ as shorthand, and the & as functional operator. Typing high-energy physics processes was quite straightforward, e.g.,
@p@+p@AK@+@S@+
which gives:
\[ \pi^{+}\mathrm{p} \rightarrow \mathrm{K}^{+}\Sigma^{+} \]

For multi-level formulae one had to define tabular positions and construct the alignments by hand, e.g.,
.stmath
.tb 5 9 12
.tb set $
$@B$1$1
$@S$-$-
$n=1$n@2$m
.emath
for the following output:
\[ \sum_{n=1}^{\infty} \frac{1}{n^{2}} \frac{1}{m} \]

The availability of this system was the basis of a mini-revolution, since for the first time scientists could consider preparing their scientific papers themselves in a reliable way. Because this new application was so much better than BARBA, support for the latter was dropped in August 1980.

The next big event was the arrival of the first loose-sheet laser printer, the IBM 6670. It came with an extended set of Greek and mathematical symbols, including accented characters. This printer proved a huge improvement over what was available with the IBM 3800. The 6670 laser printer also offered for the first time proportionally-spaced fonts (previously only available on photo-typesetters). The latest SCRIPT installation introduced easy-to-remember shorthand for common entities. For instance, when using the proportionally-spaced accented font set 302, one could write:

..im SYSPAPER FONT=302
..ch /'e/&eacute.e/
Le monde est carr'e.
yielding: ``Le monde est carré.'' The first line in the SCRIPT source loads the SYSPAPER macro package with font set 302, while the second instructs SCRIPT to change all occurrences of 'e to &eacute.e, in which one recognises the precursor of the entity-reference notation of GML, and later SGML and XML.
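
For comparison, here is how the same idea is expressed today in XML, where the entity must be declared explicitly. This is a minimal sketch (the internal DTD subset and the sentence element are invented for the illustration; eacute itself is a standard entity name):

<?xml version="1.0"?>
<!DOCTYPE sentence [
  <!-- map the entity name to the Unicode code point for "é" -->
  <!ENTITY eacute "&#xE9;">
]>
<sentence>Le monde est carr&eacute;.</sentence>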

CERNPAPER

Over the years, users of SCRIPT at CERN had been using various macro packages, such as SYSPAPER, SYSPUB, and PHYSPAP, or had developed their own SCRIPT-based macro libraries (e.g., Gino De Bilio for the manuals of the DD-EE Group, Julius Zoll for his Hydra & Patchy documentation, Horst von Eicken for writing yellow reports documenting microprocessor cross-software). It therefore seemed appropriate to propose and develop a generic SCRIPT document processing macro system optimised for CERN use, and Eric van Herwijnen, working with Anders Berglund, released CERNPAPER in January 1984 (CNL 172). It provided a set of high-level macros for various layouts, such as letters, memoranda, technical notes, reports, minutes and agendas for meetings, papers for physics journals, writeups, and manuals. A user-friendly interface to Wylbur, an interactive line-oriented interface to IBM's MVS, made it easy to produce a skeleton job for each of the supported document types, thus eliminating the need to be a SCRIPT expert. In fact, most basic SCRIPT commands were made ``invisible'' and were replaced with generic commands, such as

.chapter
.point begin
.point xxxxx
.point yyyyy
.point end
.para
The above lines start a chapter, define a numbered list, and start a paragraph.

New versions of Waterloo SCRIPT were made available as soon as they were released. They included support for the newest photo-composition and laser-printing devices, which allowed for better font handling, negative skips, and overlaps. They also introduced spell-checking via the inclusion of dictionaries, improved hyphenation, more flexible superscript and subscript handling, and better error reporting.

At the beginning of 1985 an APA6670 (All Points Addressable) high-volume single-sheet printer was installed at the Computing Center. This printer was announced in CNL 178, and henceforth the Computer Newsletter was typeset on that printer, a quantum leap forward in quality from the mono-spaced fonts used for the earlier issues.

From GML to SGML

At the beginning of 1985 SCRIPT version 84.1 was installed. It introduced support for GML (Generalized Markup Language). But rather than use the native GML syntax, Berglund took the wise step of directly introducing the reference concrete syntax of SGML (Standard Generalized Markup Language), which was at that time in the final stages of becoming ISO standard 8879 [10].

SGML considers documents as tree structures, and a grammar can be defined for any given class of documents. SGML is therefore not a ``markup language'' in the same sense as SCRIPT or TeX: it defines only the syntax for creating an infinite variety of markup languages and is hence completely independent of the text formatter. Documents marked up with SGML can be interchanged between different systems and rendered on different installations.
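
As an illustration of what defining such a grammar means in practice, here is a minimal document type definition, written as a sketch in the XML flavour of the DTD syntax; the element names are invented for this example and are not taken from any CERN DTD:

<!-- a "minutes" document is a title, a date, and a body -->
<!ELEMENT minutes (title, date, body)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT date    (#PCDATA)>
<!-- the body consists of one or more paragraphs -->
<!ELEMENT body    (para+)>
<!ELEMENT para    (#PCDATA)>

A validating parser can then check that any document of class minutes contains exactly one title, one date, and a body of one or more paragraphs, independently of how the document will eventually be formatted.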

Berglund published the first edition of the CERN SGML User's Guide in October 1986. It offered a rich tag set for preparing the same kind of documents as those proposed by CERNPAPER, as well as for foils. Here is an example of the markup:

<!DOCTYPE sgmlmins>
<GDOC SEC="Secondary title">
<TITLEP>
<TITLE>Minutes of today's meeting
<DATE>Whenever
<DISTRIB>
<DIST>Present were
Me
Some Others
</DISTRIB>
</TITLEP>
<BODY>
<P>Nous avons discut&eacute; :
<OL>
<LI>bla bla
<LI>More bla bla
</OL>
<CLOSE>
<TYPIST>MG/xyz
<RECORDER>MG
</CLOSE>
</GDOC>
We see the use of SGML's concrete syntax, i.e., < and > for starting and ending element tags (GML uses : and . respectively). We have an attribute (SEC="...") on line 2, while special characters (&eacute; yielding the accented ``é'' on line 13) are entered using entity references. Note that I typed all tag names in uppercase, but this is not necessary in SGML, since in most systems element and attribute names are case-insensitive. Entity names, however, are case-sensitive. In XML (see later), element and attribute names, as well as entities, are always case-sensitive.
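
The practical consequence is easy to show. A fragment like

<TITLE>Minutes of today's meeting</title>

is acceptable to an SGML parser using the reference concrete syntax, because TITLE and title name the same element, whereas an XML parser rejects it as not well-formed: in XML every end tag must match its start tag character for character.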

The introduction of LaTeX

In the late 1980s various Unixes and VAX/VMS became popular at CERN (and elsewhere) and the need for a text processing system that ran on all systems became ever more important.

Physicists and engineers who visited the United States of America, especially SLAC, told us with great enthusiasm about TeX, a publicly available text processing system that D.E. Knuth of Stanford University had been working on with his students since 1977. It consists of two main components, TeX [6] and METAFONT [8]. About the aim of his project, Knuth wrote in the foreword of The TeXbook [6]: ``TeX [is] a new typesetting system intended for the creation of beautiful books -- and especially for books that contain a lot of mathematics. By preparing a manuscript in TeX format, you will be telling a computer exactly how the manuscript is to be transformed into pages whose typographic quality is comparable to that of the world's finest printers''.

TeX's popularity with thousands of scientists is mainly due to the ease with which any kind of writing can be turned into various document classes, such as articles, reports, proposals, and books, in a way that is completely under the control of the writer through a rich set of formatting commands.

By its very conception, TeX is particularly useful when the document contains mathematical formulae that have to be rendered with high typographic precision. Moreover the program, originally written in an enriched dialect of Pascal but now distributed in C, can be compiled on almost any operating system, so that it runs on a wide range of computer platforms, from micros to mainframes. It behaves 100% identically on all machines, a fact of extreme importance to the scientific and technical communities. Related to this portability is TeX's output device independence: a document can be rendered on anything from a CRT screen or a medium-resolution dot-matrix or laser printer to a professional high-resolution photo-typesetter.

Because of these qualities, and since it is available in the public domain, TeX has become the de facto standard text processing system in many academic departments and research laboratories. It has also been adopted by members of the professional publishing world as a printing engine. In his foreword to ``TeX and METAFONT, New Directions in Typesetting'' [9], Gordon Bell wrote that ``Don Knuth's Tau Epsilon Chi (TeX) is potentially the most significant invention in typesetting in this century. It introduces a standard language in computer typography and in terms of importance could rank near the introduction of the Gutenberg press''.

At the beginning of the 1980s, Leslie Lamport started work on LaTeX [7], a document preparation system based on the TeX formatter. The system adds a level of abstraction to the plain TeX commands and lets the user concentrate on the structure of the document rather than on formatting details. A few high-level commands allow the user to compose most documents easily. Users do not have to bother about typographic details, which are left to the document designer, who provides style files for every application.
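
To give a flavour of this level of abstraction, here is a minimal sketch of a complete LaTeX document, using the standard article class and re-using the variance formula shown earlier in SCRIPT notation; all typographic decisions are left to the style:

\documentclass{article}
\begin{document}
% the style decides fonts, numbering, and spacing for the heading
\section{A simple formula}
The variance of a continuous random variable is
\[ E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \sigma^2 \]
\end{document}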

TeX was first officially introduced at CERN in September 1987 (CNL 189), running on the central VAX service. System support was provided by Alexander Samarin, who developed a set of integration tools. However, on the VM/CMS system the ``recommended'' system remained SGML. A TeX service on VM/CMS was announced in May 1988 (CNL 191), supported by Jurgen de Jonghe of the TP Section of the newly created OCS Group in MIS Division. In September 1989 SGMLTEX, an application for supporting SGML on VAX/VMS, was announced (CNL 196).

Defining a text processing policy

The need to define a global text processing policy for CERN became ever clearer, and in 1989 a proposal (CERN/DD/89-25) was presented to the Meddle Committee and approved. The proposed policy was presented during a CERN-wide seminar on ``Text processing at CERN'' in December 1989.

A summary, detailing which text processing systems were supported at that time, was published in CNL 198:

  • LaTeX on all platforms (IBM, VAX, PC, Mac, Unix workstations);
  • SGML/BookMaster on IBM (based on IBM's DCF SCRIPT interpreter);
  • Microsoft Word on PC and Mac;
  • Interleaf (later FrameMaker, http://www.adobe.com/products/framemaker) on workstations (later also on PC and Mac);
  • Waterloo-based SGML, SCRIPT, and CERNPAPER were to be phased out.

The move from SGML based on Waterloo SCRIPT to IBM's BookMaster suite was made because the latter was a more professional and standards-compliant product, including a very flexible high-level style language for defining document layout. CERN customisation of the IBM styles was available via the BOOKIE exec. The math formatter was SMFF, a variant of EQN, the formula processor originally developed for troff on Unix systems.

As of issue 198, the CNL was produced with SGML/BookMaster and started featuring a regular section on text processing. For instance CNL 199 (June 1990) contained a long article about SGML by Eric van Herwijnen (who, after Anders Berglund had left for ISO, had become responsible for text processing developments at CERN), and an introduction to the PostScript language (by M.G.). CNL 202 (June 1991) contained an overview article describing the various text processing systems in use at CERN, recommendations for typesetting rules for writing scientific documents (still very relevant today, see http://home.cern.ch/goossens/typorules/typorules.html), and an explanation of the importance of SGML in the publishing world.

CNL 203 contained further information on how best to prepare one's documents for typesetting, and proposed a set of entity names for elementary particles. At the back it included a questionnaire about the text processing needs of the user community (use of systems and macro packages, need for multiple input languages, training, and requirements for included graphics material).

During this period the documentation for the software packages supported by the User Support (later Application Software) Group (e.g., HBOOK, PAW, CERNLIB, GEANT) was translated from Waterloo SCRIPT/SGML into LaTeX and made available as printable PostScript files (CNL 205).

The creation of the Web and HTML

With mainframes being abandoned and most development, administrative, and production work moving to Unix and Microsoft Windows workstations, there was for a few years a three-pronged approach to producing documents at CERN: LaTeX for physics documents (all systems, with first-line support by Michèle Jouhet's team in Michael Draper's ETT/DH Group), FrameMaker for large technical manuals and reports (all systems, with support until the end of 2000 by Mario Ruggier, and presently by Johan Karlsson in IT/API), and Microsoft Word (or, more generally, Microsoft's Office Suite) for administrative work (and some technical work in the engineering sector) on PC and Mac (with support also in ETT/DH).

With the help of a consultant, Sebastian Rahtz, in 1992/93 we installed at CERN a reference system containing all the latest LaTeX developments. This work became the basis of the TeX Live CDROM, now a world reference among TeX distributions, as well as of three books [3,4,5] on the use of LaTeX. Articles about LaTeX developments appeared in almost every CNL from 203 to 225, and from CNL 206 onwards (until today!) the printed version of the CNL has been produced with LaTeX (with one exception, see below).

In the early 1990s a major event happened at CERN: Tim Berners-Lee and collaborators developed the basics of what was to become the Web. In those days (1992-1993) Tim B.-L. was sitting just a few offices down the corridor from where we (M.G. and S.R.) were working, and already at the beginning of 1993 we had translated, with Tim's active help, some LaTeX documents into HTML (we started with the HBOOK manual), first using a home-made ad hoc set of LaTeX macros, later with LaTeX2HTML. A first article on HTML appeared in September 1993 (with a nice picture of an HTML page displayed in Xmosaic, the first generally available X Window browser and the precursor of Netscape), clearly showing that HTML was in use at CERN well before the rest of the world, which became aware of the Web mostly after the ``Woodstock of the Web'', the First World Wide Web Conference, organised at CERN on 25-27 May 1994 (a more detailed history of the Web can be found in [1,2]).

The Web is essentially based on a successful triad: the HTTP protocol, the URL uniform addressing scheme, and the HTML language. The syntax of HTML was inspired by that of the SGML system we had been running at CERN since the mid-1980s. However, the first versions of HTML were presentation-directed, and it was not until version 3 of HTML that a formally correct DTD of the language was published.

By a decision of the CERN Council in December 1994, CERN officially left Web development to the World Wide Web Consortium (W3C, http://www.w3c.org), which had been set up a few months earlier. INRIA (France) and MIT (USA), later joined by Keio University (Japan), were to co-host the W3C and coordinate further Web-related activities.

As far as the CNL is concerned, there was a move from LaTeX source to HTML source in mid-1996 (CNL 223), and one issue (CNL 224) was produced with the ``Print'' menu option of the browser. However, several readers protested about the ``significant loss of markup, structure and readability'' (a judgement the CNL editors shared), and hence it was decided to transform the HTML back to LaTeX to prepare the printed version with a higher typographic quality.

Coming full circle

The Web became so popular that browser vendors started competing to offer specific extensions to the HTML language to attract users wanting to publicise their products. As a result, various mutually incompatible dialects of HTML appeared. Moreover, to really benefit from the Web and the various applications being developed for it, the XML initiative was launched in late 1997. Jon Bosak, who published his seminal article ``XML, Java, and the future of the Web'' (http://www.xml.com/pub/a/w3j/s3.bosak.html) around that time, was one of its main promoters. All this culminated in the publication of the XML W3C Recommendation (first edition http://www.w3.org/TR/1998/REC-xml-19980210, second edition http://www.w3.org/TR/2000/WD-xml-2e-20000814), which defines the XML language in a formal way.

XML is truly and explicitly international in that it espouses Unicode as its basic character set. Unicode (http://www.unicode.org) provides an unambiguous code set, originally conceived as a fixed 16-bit (2-octet) encoding; in fact 17 planes, each containing 65,536 code points, are now defined. The 65,536 characters of plane 0 cover ASCII plus most characters needed for writing the major living languages. The other sixteen planes provide for the inclusion of ancient languages and more specialised mathematical and other characters, plus over 100,000 code points for private-use areas.

Recently, many XML-based applications have been developed, and XML seems to have taken over the Internet world. XML is a lightweight version of SGML that can be parsed by a relatively simple program (publicly available XML parsers exist in almost every computer or scripting language). XML-based standards exist for document navigation and manipulation (XPath, http://www.w3.org/TR/xpath, XSLT, http://www.w3.org/Style/XSL/, and XML Query, http://www.w3.org/XML/Query), schema definition (XML Schema, http://www.w3.org/XML/Schema), formatting and presentation (XSL-FO, http://www.w3.org/TR/xsl/, and CSS, Cascading Style Sheets, http://www.w3.org/Style/CSS/), scalable vector graphics (SVG, http://www.w3.org/Graphics/SVG), mathematics (MathML, http://www.w3.org/Math/), and generalized links (XLink/XPointer, http://www.w3.org/XML/Linking); an XML version of HTML, XHTML (http://www.w3.org/MarkUp/), is being worked on.
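
To give a taste of how these pieces fit together, the following complete XSLT stylesheet is a minimal sketch (the para element is borrowed from DocBook, discussed below) that turns every para element of a source document into an HTML p element, leaving the rest to XSLT's built-in rules:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- the match attribute is an XPath pattern -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>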

Looking back over a period of almost thirty years, one observes that CERN started with specific markup (BARB and SCRIPT-based languages) and went via a more generic approach (CERNPAPER and SGML) to a portable mixture of presentation and structure (LaTeX). HTML started off as essentially presentational (for making nice-looking and attractive Web pages), but it was soon realized that presentation (style) and structure should be independent. Hence there was a return to the SGML approach, pioneered in the early 1980s, in the form of XML, the SGML-lite of the Web. Table 1 compares the syntax of the various text processors used at CERN from the 1980s until now.

Table 1: A comparison of the syntax of various text processors
Description | CERNPAPER | CERN SGML (HTML) | LaTeX | nroff/troff (mm macros) | XML/DocBook (MathML)

Document sectioning commands
Level 0 | .part | <H0> | \part | not available | <part>
Level 1 | .chapter | <H1> | \chapter | .NH1/.H1 | <chapter>
Level 2 | .section | <H2> | \section | .NH2/.H2 | <sect1>
Level 3 | .subsection | <H3> | \subsection | .NH3/.H3 | <sect2>
Level 4 | .subsub | <H4> | \subsubsection | .NH4/.H4 | <sect3>
New paragraph | .para | <P> | \par | .PP/.P n | <para>

Highlighted and other special text
Normal | normal text | normal text | normal text or \textrm{text} | normal text or .R | normal text
Emphasis | .highl1 ... .ehighl1 | <HP1> ... </HP1> (<EM> ... </EM>) | \emph{...} | .I ... | <emphasis>...</emphasis>
Quotation | .quote begin ... .quote end | <Q> ... </Q> | \begin{quote} ... \end{quote} | .QS ... .QE | <quote> ... </quote>
Footnote | .footnote ... .footend | <FN> ... </FN> | \footnote{...} | .FS ... .FE | <footnote> ... </footnote>

Lists (successive input lines within a cell are separated by slashes)
Ordered | .point begin / .point / .point end | <OL> / <LI> xxx / <EOL> | \begin{enumerate} / \item ... / \end{enumerate} | .AL / .LI xxx / .LE | <orderedlist> / <listitem> ... / </orderedlist>
Unordered | .bullet / .bullet ... / .bullet end | <UL> / <LI> xxx / <EUL> | \begin{itemize} / \item ... / \end{itemize} | .BL / .LI xxx / .LE | <itemizedlist> / <listitem> ... / </itemizedlist>
Description (glossary) | .glossary begin / .glossary xxx / ... / .glossary end | <DL> / <DT> xxx / <DD> ... / <EDL> | \begin{description} / \item[xxx] ... / \end{description} | .VL / .LI xxx / ... / .LE | <variablelist> / <varlistentry> / <term>...</term> / <listitem>...</listitem> / </varlistentry> / </variablelist>

How to type mathematics and other special characters
Greek (e.g. $\alpha$) | @a | &alpha; | $\alpha$ | \(*a | &alpha; (&#x03B1;)
Math symbol (e.g. $\approx$) | &approx. | &ap; | $\approx$ | \(ap | &approx; (&#x2248;)
Inline math | special math chars | special math chars | $math text$ or \(math text\) | # math text # | <inlineequation> <math>MathML code</math> </inlineequation>
Display math | .stmath ... .emath | special math chars | \[math text\] | .EQ math text .EN | <informalequation> <math>MathML code</math> </informalequation>
Superscript | &S'text. | <SUP>text</SUP> | $\sp{math text}$ | # sup {math text} # | <superscript>text</superscript>
Subscript | &s'text. | <SUB>text</SUB> | $\sb{math text}$ | # sub {math text} # | <subscript>text</subscript>
Accents and umlauts | &acute.e or &trema.u | &eacute; or &uuml; | \'e or \"u | e\*' or u\*: | &eacute; (&#x00E9;) or &uuml; (&#x00FC;)
Non-breaking space | \ (set by command ``.blank \'') | &nbsp; | ~ | \  or \(space) | &#x00A0;

Specific formatting and page markup
Line break | .break | fall through to formatter | \newline | .br | processing instruction for formatter
Page break | .page | fall through to formatter | \newpage | .bp | processing instruction for formatter

XML and LaTeX, perfect twins for the 21st century?

Most Web-related activities are currently centered on the use of XML for data exchange and document production (Elliotte Rusty Harold's ``Cafe con Leche XML News and Resources'' at http://www.ibiblio.org/xml/ is a good source of up-to-date information). Microsoft, Oracle, IBM, and Sun, as well as open-source projects such as Perl, Python, the GNU Project, and GNOME, have all included XML in their basic strategic plans. Hundreds of dedicated data vocabularies (DTDs), each dealing with a particular subject area, have been proposed (see Robin Cover's ``XML Pages'' at http://www.oasis-open.org/cover/xml.html).

For scientific and computer documentation, DocBook (http://www.docbook.org) markup has been in use for many years. The DocBook DTD or schema contains hundreds of elements to mark up clearly and explicitly the different components of an electronic document (book, article, reference guide, etc.), not only displaying its hierarchical structure but also indicating the semantic meaning of the various elements. Moreover, the structure of the DTD is optimised for customisation, making it relatively straightforward to add or eliminate certain elements or attributes, to change the content model for certain structural groups, or to restrict the values that given attributes can take.

A short example of a document marked up in DocBook and including some math follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article SYSTEM "mybook.dtd"[
<!ENTITY DB "<application>DocBook</application>">
]>
<article>
<articleinfo>
<title>Docbook with a few formulae</title>
<author><firstname>Michel</firstname> 
        <surname>Goossens</surname>
</author>
<pubdate>Wednesday, 18 March 2001</pubdate>
<abstract>
<para>
This XML document is marked up according to 
the &DB; schema. It shows a few elements of 
the &DB; vocabulary, as well as a couple of
examples of mathematical expressions where 
we used MathML markup.
</para>
</abstract>
</articleinfo>
<section>
<title>A MathML example</title>
<para>
A MathML formula can be typeset inline, as here
<inlineequation>
<math>
 <mrow>
  <msup><mi>&#x03C0;</mi><mo>+</mo></msup>
  <mi>p</mi>
  <mo>&#x2192;</mo>
  <msup><mi>K</mi><mo>+</mo></msup>
  <msup><mi>&#x03A3;</mi><mo>+</mo></msup>
 </mrow>
 </math>
</inlineequation>, a simple particle reaction.
</para>

<para>
A mathematical equation can also be typeset in 
display mode using &DB;'s 
<sgmltag class="element">informalequation</sgmltag> 
element, as is shown in the following example 
containing a summation expression:
</para>

<informalequation>
<math>
 <mrow>
  <msubsup><mo>&#x2211;</mo>
           <mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow>
           <mrow><mi>&#x221E;</mi></mrow>
  </msubsup> 
  <mfrac><mn>1</mn><msup><mi>n</mi><mn>2</mn></msup></mfrac>
  <mfrac><mn>1</mn><mi>m</mi></mfrac> 
 </mrow>
 <mtext>.</mtext>
</math>
</informalequation>
</section>
</article>
The DTD mybook.dtd (see below) is loaded on line 2, and in the internal subset (line 3) an entity DB is defined as a shorthand for ``DocBook'' (marked up as an application element). Then the article information is set up, specifying title, author, abstract, etc. The body of the text shows the markup for two formulae in MathML. The first corresponds to the physics process and the second to the summation equation shown as examples earlier in this article. Note the Unicode character entity references for denoting non-ASCII characters, in particular &#x03C0; ($\pi$), &#x03A3; ($\Sigma$), and &#x2192; ($\rightarrow$) in the first formula, and &#x2211; ($\sum$), and &#x221E; ($\infty$) in the second formula.

The DTD mybook.dtd, which follows, declares how to combine MathML's <math> ``master'' element with the DocBook DTD (lines 1 and 2), while the following lines declare parameter entities for the MathML and DocBook DTDs and load them (last two lines), so that their elements are available to the XML parser. The attribute definition for the math element on line 5 fixes the namespace for that element and its descendants.

<!ENTITY % equation.content "(math+)">
<!ENTITY % inlineequation.content "(math+)">
<!ENTITY % mathml SYSTEM 
   "/opt/XML/cdrom/dtd/mathml/mathml2.dtd">
<!ATTLIST math xmlns CDATA #FIXED 
   "http://www.w3.org/1998/Math/MathML">
<!ENTITY % docbook SYSTEM
   "/opt/XML/cdrom/www.nwalsh.com/docbook/xml/docbookx.dtd">
<!-- load MathML and docbook -->
%mathml;
%docbook;

The XML document described above was first transformed with an XSLT processor into XSL-FO formatting elements with the help of Norman Walsh's XSLT stylesheets (http://nwalsh.com/docbook/xsl/). These formatting objects were then interpreted and typeset with Sebastian Rahtz' PassiveTeX (http://users.ox.ac.uk/~rahtz/passivetex/), and the result is shown in Figure 1.

Figure 1: A simple DocBook document typeset with LaTeX
[image: dbexa.eps]

Various other ways exist to transform XML documents into viewable form with the help of XSLT stylesheets (via HTML, or directly with CSS in XML-aware browsers). XSLT-to-XSL-FO transformations and dedicated applications (such as the Apache Project's FOP, http://xml.apache.org/fop/, or PassiveTeX, see above) can be used to obtain PDF or PostScript.
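
To give an idea of what such a transformation looks like, here is a minimal sketch of an XSLT stylesheet generating XSL-FO (loosely modelled on what full stylesheets like Norman Walsh's do for DocBook, but greatly simplified):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <!-- render each DocBook para as a typeset block -->
  <xsl:template match="para">
    <fo:block space-after="6pt"><xsl:apply-templates/></fo:block>
  </xsl:template>
</xsl:stylesheet>

A formatter such as FOP or PassiveTeX then interprets the resulting fo:block elements and typesets them into PDF or PostScript pages.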

More generally, as already explained in CNL 2000-002 (http://ref.cern.ch/CERN/CNL/2000/002/xml-strategy), XML can be considered the central element in an integrated Internet strategy, where documents (and data) are stored in an ``online'' electronic repository in XML, from which they can be transformed into various formats. We reproduce here the figure from that article (Figure 2), where we emphasise the complementary roles of XML (at the top right), which can be manipulated by a whole set of mostly publicly available standard tools, and LaTeX (at the top left), which is available for high-quality typesetting purposes, where precision and clarity are desired.

Figure 2: An XML strategy for the Web
[image: xml4web]
As browsers become more powerful in the way they can display mathematics (with MathML) and simple two-dimensional graphics (with SVG), using plug-ins or native code, it is expected that in the not too distant future (end 2001?) one will be able to cut and paste a formula, or the data in a graph, from one application (e.g., a browser) into another (e.g., a Java or C++ program) using XML technology.

CERN is presently collaborating with other academic and commercial partners in an IST project funded by the European Union: TIPS (Tools for Innovative Publishing in Science). The aim of TIPS (http://tips.sissa.it) is to develop a set of user-friendly and advanced tools and services, organised in an open system, to support research information production, management, access, and use in a coherent manner. These ideas are being tested on articles of the electronic Journal of High Energy Physics (http://jhep.sissa.it). In particular, the articles are transformed from LaTeX into XML to see how this generic format can optimally support the activities of document writing, reviewing, publishing, searching, disseminating, and reading, as well as communication among members of the research community.

Bibliography

[1] Tim Berners-Lee. Weaving the Web. Orion Business Books, London, 1999.
[2] James Gillies and Robert Cailliau. How the Web was Born. Oxford University Press, Oxford, UK, 2000.
[3] Michel Goossens, Frank Mittelbach, and Alexander Samarin. The LaTeX Companion. Addison-Wesley, Reading, 1994.
[4] Michel Goossens and Sebastian Rahtz, with Frank Mittelbach. The LaTeX Graphics Companion. Addison-Wesley, Reading, 1997.
[5] Michel Goossens and Sebastian Rahtz. The LaTeX Web Companion. Addison-Wesley, Reading, 1999.
[6] D.E. Knuth. The TeXbook. Addison-Wesley, Reading, 1990.
[7] Leslie Lamport. LaTeX, A Document Preparation System. Addison-Wesley, Reading, 1986.
[8] D.E. Knuth. The METAFONTbook. Addison-Wesley, Reading, 1990.
[9] D.E. Knuth. TeX and METAFONT, New Directions in Typesetting. The American Mathematical Society and Digital Press, Stanford, 1979.
[10] International Organization for Standardization. SGML, The Standard Generalised Markup Language. ISO 8879, ISO, Geneva, 1986.

Acknowledgements

I would like to thank Anders Berglund, Julian Blake, Ian McLaren, Miguel Marquina, Jutta Megies, Harry Renshall, David Stungo, David Williams, and Roger Woolnough for sharing information or documents with me while I was preparing this article.


About the author(s): Michel Goossens is a CERN authority on LaTeX, XML, and electronic document publishing techniques in general. He has written several articles and books on these subjects.


For matters related to this article please contact the author.
Cnl.Editor@cern.ch


CERN-CNL-2001-001
Vol. XXXVI, issue no 1


Copyright © CERN 2001 -- European Organization for Nuclear Research