



A History of Scientific Text Processing at CERN

Michel Goossens, IT/API


Abstract

Text processing has always had a special place in the computing environment at CERN and elsewhere. Processing documentation is an essential part of communication: to explain how to use (software and other) tools, describe procedures, and publish results. In fact, CERN has more often than not used ``state of the art'' standard tools. In this article I review the main scientific text processing systems that have been in use at CERN since the early 1970s. I will show how they evolved logically over time into the present situation, where the main processors can be optimally integrated via XML technologies.


First a word of caution. Text processing systems have evolved considerably over the last few decades. Electronic typewriters acquired memory and developed into a large set of incompatible word-processing systems on dedicated machines (e.g. Norsk Data, Wang, AES, Philips, IBM, Olivetti, and Nixdorf), all of which were used in various CERN services in the 1970s and 1980s. This overview does not discuss these dedicated systems, but limits itself to (scientific) text processors that were available on CERN's central computing facilities.

The early days: From the typewriter to Waterloo Script

The first issue of the Computer Newsletter (dated 15 February 1966) was produced on a classic typewriter. Later issues were put together by more complex means (no record remains of which system was used), with ``cut and paste'' techniques for including figures, tables, etc. playing an important role.

It has to be remembered that commonly available printers had very limited capabilities in the 1970s, with uppercase-only output being the norm. It is thus no surprise that text processing systems on general-use computers only began to appear when printers became more flexible.

The first text formatting program I found mentioned in the CNL (88, December 1973) was BARB (CERN Program Library entry Q500), which was a program ``TO EASE THE CHORE OF OPERATING AND UPDATING PROGRAM WRITEUPS''. It had text strings interspersed with format control and allowed uppercase-only texts with titles, subtitles, appendices, headings, justification (left, right, centered) and boldface. Column mode (an elementary way to represent tabular information) was also available.

BARB was superseded a year later by BARBA (BARB ASCII, CERN Program Library entry Q501), which allowed upper and lowercase letters, as well as underlining. The input to these programs was generally in ``card form'', with column 1 used for control, columns 2-72 for text and control parameters, and columns 73-80 for sequence information, which was not read by the processor but came in handy when one dropped a box of punched cards (in those days the preferred input medium for computer information).

Also in the mid-1970s, Horst von Eicken developed AUTHOR (CERN Program Library Q510), an interactive text processing system for the Control Data 6000/Cyber computer systems. Its character set was a superset of ASCII, including the Greek alphabet and some mathematical symbols, and it allowed for tabular input and other layout commands. Version 2 was released in 1978 (CNL 136, August 1978).

At about the same time, Tony Shave wrote SIMTEX (SIMple TEXt processor). SIMTEX was written in BCPL and hence ran on a fairly wide variety of computers (CDC 6000, IBM 370, Nord 10, PDP-11, VAX, HP 2100). The CERN-written documentation of BCPL and of the first generation of CERN microprocessor cross-software was prepared with SIMTEX.

In the meantime an IBM mainframe had been installed at CERN, and the PEO (Program Enquiry Office) decided to transfer all its documentation work to the IBM (CNL 132, April 1978). To do this they used a program to convert BARBA files into SCRIPT files, which could then be processed with Waterloo SCRIPT (e.g., using SYSPUB, a simplified set of SCRIPT macros). This was the beginning of the SCRIPT era at CERN; that formatter would remain the basis of most text processing work at CERN until the advent of personal workstations in the late 1980s.

The first laser printers

As already mentioned, the quality and functionality of general text processing systems is closely linked to the available output devices. It should therefore come as no surprise that the arrival at CERN in April 1979 (CNL 143) of the first laser printer, an IBM 3800, opened up a new realm of possibilities for higher-quality typesetting. Anders Berglund, who was to shape text processing at CERN in the 1980s, showed in a further article in September 1979 (CNL 147) how, with the SYSPUB macros, one could obtain accents using a ``EURO'' character set, as well as miscellaneous other characters for composing block diagrams (the first time that such complex output was possible without having to use a plotter).

More character sets became available at the beginning of 1980 (CNL 149), and they offered for the first time a choice between various type sizes (10, 12, and 15 characters per inch). Moreover, in the same issue of the CNL Berglund contributed an article ``SCRIPT as an Aid in Preparing Papers for Physics Results'', where he gave some hints on how to prepare publications for submission to physics journals. He described PHYSPAP, a macro set developed at CERN and based on Waterloo SCRIPT's SYSPAPER, which allowed high-quality output to be sent to the Photon photo-typesetter (subsequently replaced by a Compugraphics photo-typesetter) connected to a NORD 100 computer, and which was customised for styles similar to those of the Nuclear Physics journal. Many scientific symbols were available, and one-line equations could be typeset, e.g.,

E[(X-@m)@2] = &s'-@B.@I&S'@B.(x-@m)@2 f(x) dx = @s@2
would produce something like:
\[ E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \sigma^2 \]

One notes the frequent use of the @ as shorthand, and the & as functional operator. Typing high-energy physics processes was quite straightforward, e.g.,
@p@+p@AK@+@S@+
which gives:
\[ \pi^{+}\mathrm{p} \rightarrow \mathrm{K}^{+}\Sigma^{+} \]

For multi-level formulae one had to define tabular positions and construct the alignments by hand, e.g.,
.stmath
.tb 5 9 12
.tb set $
$@B$1$1
$@S$-$-
$n=1$n@2$m
.emath
for the following output:
\[ \sum_{n=1}^{\infty} \frac{1}{n^{2}} \frac{1}{m} \]

The availability of this system was the basis of a mini-revolution, since for the first time scientists could consider preparing their scientific papers themselves in a reliable way. Because this new application was so much better than BARBA, support for the latter was dropped in August 1980.

The next big event was the arrival of the first loose-sheet laser printer, the IBM 6670. It came with an extended set of Greek and mathematical symbols, including accented characters. This printer proved a huge improvement over what was available with the IBM 3800. The 6670 laser printer also offered for the first time proportionally-spaced fonts (previously only available on photo-typesetters). The latest SCRIPT installation introduced easy-to-remember shorthand for common entities. For instance, when using the proportionally-spaced accented font set 302, one could write:

..im SYSPAPER FONT=302
..ch /'e/&eacute.e/
Le monde est carr'e.
yielding: ``Le monde est carré.'' The first line in the SCRIPT source loads the SYSPAPER macro package with font set 302, while the second instructs SCRIPT to change all occurrences of 'e to &eacute.e, in which one recognises the precursor of the entity-reference notation of GML, and later SGML and XML.
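
For comparison, here is how the same idea is expressed today in XML, where the entity must be declared explicitly. This is a minimal sketch (the internal DTD subset and the sentence element are invented for the illustration; eacute itself is a standard entity name):

<?xml version="1.0"?>
<!DOCTYPE sentence [
  <!-- map the entity name to the Unicode code point for "é" -->
  <!ENTITY eacute "&#xE9;">
]>
<sentence>Le monde est carr&eacute;.</sentence>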

CERNPAPER

Over the years, users of SCRIPT at CERN had been using various macro packages, such as SYSPAPER, SYSPUB, and PHYSPAP, or had developed their own SCRIPT-based macro libraries (e.g., Gino De Bilio for the manuals of the DD-EE Group, Julius Zoll for his Hydra & Patchy documentation, Horst von Eicken for writing yellow reports documenting microprocessor cross-software). It therefore seemed appropriate to propose and develop a generic SCRIPT document processing macro system optimised for CERN use, and Eric van Herwijnen, working with Anders Berglund, released CERNPAPER in January 1984 (CNL 172). It provided a set of high-level macros for various layouts, such as letters, memoranda, technical notes, reports, minutes and agendas for meetings, papers for physics journals, writeups, and manuals. A user-friendly interface to Wylbur, an interactive line-oriented interface to IBM's MVS, made it easy to produce a skeleton job for each of the supported document types, thus eliminating the need to be a SCRIPT expert. In fact, most basic SCRIPT commands were made ``invisible'' and were replaced with generic commands, such as

.chapter
.point begin
.point xxxxx
.point yyyyy
.point end
.para
The above lines start a chapter, define a numbered list, and start a paragraph.

New versions of Waterloo SCRIPT were made available as soon as they were released. They included support for the newest photo-composition and laser-printing devices, which allowed for better font handling, negative skips, and overlaps. They also introduced spell-checking via the inclusion of dictionaries, improved hyphenation, more flexible superscript and subscript handling, and better error reporting.

At the beginning of 1985 an APA6670 (All Points Addressable) high-volume single-sheet printer was installed at the Computing Center. This printer was announced in CNL 178, and henceforth the Computer Newsletter was typeset on that printer, a quantum leap forward in quality from the mono-spaced fonts used for the earlier issues.

From GML to SGML

At the beginning of 1985 SCRIPT version 84.1 was installed. It introduced support for GML (Generalized Markup Language). But rather than use the native GML syntax, Berglund took the wise step of directly introducing the reference concrete syntax of SGML (Standard Generalized Markup Language), which was at that time in the final stages of becoming ISO standard 8879 [10].

SGML considers documents as tree structures, and a grammar can be defined for any given class of documents. SGML is therefore not a ``markup language'' in the same sense as SCRIPT or TeX: it defines only the syntax for creating an infinite variety of markup languages and is hence completely independent of the text formatter. Documents marked up with SGML can be interchanged between different systems and rendered on different installations.
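
As an illustration of what defining such a grammar means in practice, here is a minimal document type definition, written as a sketch in the XML flavour of the DTD syntax; the element names are invented for this example and are not taken from any CERN DTD:

<!-- a "minutes" document is a title, a date, and a body -->
<!ELEMENT minutes (title, date, body)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT date    (#PCDATA)>
<!-- the body consists of one or more paragraphs -->
<!ELEMENT body    (para+)>
<!ELEMENT para    (#PCDATA)>

A validating parser can then check that any document of class minutes contains exactly one title, one date, and a body of one or more paragraphs, independently of how the document will eventually be formatted.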

Berglund published the first edition of the CERN SGML User's Guide in October 1986. It offered a rich tag set for preparing the same kind of documents as those proposed by CERNPAPER, as well as for foils. Here is an example of the markup:

<!DOCTYPE sgmlmins>
<GDOC SEC="Secondary title">
<TITLEP>
<TITLE>Minutes of today's meeting
<DATE>Whenever
<DISTRIB>
<DIST>Present were
Me
Some Others
</DISTRIB>
</TITLEP>
<BODY>
<P>Nous avons discut&eacute; :
<OL>
<LI>bla bla
<LI>More bla bla
</OL>
<CLOSE>
<TYPIST>MG/xyz
<RECORDER>MG
</CLOSE>
</GDOC>
We see the use of SGML's concrete syntax, i.e., < and > for starting and ending element tags (GML uses : and . respectively). We have an attribute (SEC="...") on line 2, while special characters (&eacute; yielding the accented ``é'' on line 13) are entered using entity references. Note that I typed all tag names in uppercase, but this is not necessary in SGML, since in most systems element and attribute names are case-insensitive. Entity names, however, are case-sensitive. In XML (see later), element and attribute names, as well as entities, are always case-sensitive.
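
The practical consequence is easy to show. A fragment like

<TITLE>Minutes of today's meeting</title>

is acceptable to an SGML parser using the reference concrete syntax, because TITLE and title name the same element, whereas an XML parser rejects it as not well-formed: in XML every end tag must match its start tag character for character.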

The introduction of LaTeX

In the late 1980s various Unixes and VAX/VMS became popular at CERN (and elsewhere) and the need for a text processing system that ran on all systems became ever more important.

Physicists and engineers who visited the United States of America, especially SLAC, told us with great enthusiasm about TeX, a publicly available text processing system that D.E. Knuth of Stanford University had been working on with his students since 1977. It consists of two main components, TeX [6] and METAFONT [8]. About the aim of his project, Knuth wrote in the foreword of The TeXbook [6]: ``TeX [is] a new typesetting system intended for the creation of beautiful books -- and especially for books that contain a lot of mathematics. By preparing a manuscript in TeX format, you will be telling a computer exactly how the manuscript is to be transformed into pages whose typographic quality is comparable to that of the world's finest printers''.

TeX's popularity with thousands of scientists is mainly due to the ease with which any kind of writing can be turned into various document classes, such as articles, reports, proposals, and books, in a way that is completely under the control of the writer through a rich set of formatting commands.

By its very conception, TeX is particularly useful when the document contains mathematical formulae that have to be rendered with high typographic precision. Moreover the program, originally written in an enriched dialect of Pascal but now distributed in C, can be compiled on almost any operating system, so that it runs on a wide range of computer platforms, from micros to mainframes. It behaves 100% identically on all machines, a fact of extreme importance to the scientific and technical communities. Related to this portability is TeX's output device independence: a document can be rendered on anything from a CRT screen or a medium-resolution dot-matrix or laser printer to a professional high-resolution photo-typesetter.

Because of these qualities, and since it is available in the public domain, TeX has become the de facto standard text processing system in many academic departments and research laboratories. It has also been adopted by members of the professional publishing world as a printing engine. In his foreword to ``TeX and METAFONT, New Directions in Typesetting'' [9], Gordon Bell wrote that ``Don Knuth's Tau Epsilon Chi (TeX) is potentially the most significant invention in typesetting in this century. It introduces a standard language in computer typography and in terms of importance could rank near the introduction of the Gutenberg press''.

At the beginning of the 1980s, Leslie Lamport started work on LaTeX [7], a document preparation system based on the TeX formatter. The system adds a level of abstraction to the plain TeX commands and lets the user concentrate on the structure of the document rather than on formatting details. A few high-level commands allow the user to compose most documents easily. Users do not have to bother about typographic details, which are left to the document designer, who provides style files for every application.
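
To give a flavour of this level of abstraction, here is a minimal sketch of a complete LaTeX document, using the standard article class and re-using the variance formula shown earlier in SCRIPT notation; all typographic decisions are left to the style:

\documentclass{article}
\begin{document}
% the style decides fonts, numbering, and spacing for the heading
\section{A simple formula}
The variance of a continuous random variable is
\[ E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \sigma^2 \]
\end{document}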

TeX was first officially introduced at CERN in September 1987 (CNL 189), running on the central VAX service. System support was provided by Alexander Samarin, who developed a set of integration tools. However, on the VM/CMS system the ``recommended'' system remained SGML. A TeX service on VM/CMS was announced in May 1988 (CNL 191), supported by Jurgen de Jonghe of the TP Section of the newly created OCS Group in MIS Division. In September 1989 SGMLTEX, an application for supporting SGML on VAX/VMS, was announced (CNL 196).

Defining a text processing policy

The need to define a global text processing policy for CERN became ever clearer, and in 1989 a proposal (CERN/DD/89-25) was presented to the Meddle Committee and approved. The proposed policy was presented during a CERN-wide seminar on ``Text processing at CERN'' in December 1989.

A summary, detailing which text processing systems were supported at that time, was published in CNL 198:

  • LaTeX on all platforms (IBM, VAX, PC, Mac, Unix workstations);
  • SGML/BookMaster on IBM (based on IBM's DCF SCRIPT interpreter);
  • Microsoft Word on PC and Mac;
  • Interleaf (later FrameMaker, http://www.adobe.com/products/framemaker) on workstations (later also on PC and Mac);
  • Waterloo-based SGML, SCRIPT, and CERNPAPER were to be phased out.

The move from SGML based on Waterloo SCRIPT to IBM's BookMaster suite was made because the latter was a more professional and standards-compliant product, including a very flexible high-level style language for defining document layout. CERN customisation of the IBM styles was available via the BOOKIE exec. The math formatter was SMFF, a variant of EQN, the formula processor originally developed for troff on Unix systems.

As of issue 198, the CNL was produced with SGML/BookMaster and started featuring a regular section on text processing. For instance CNL 199 (June 1990) contained a long article about SGML by Eric van Herwijnen (who, after Anders Berglund had left for ISO, had become responsible for text processing developments at CERN), and an introduction to the PostScript language (by M.G.). CNL 202 (June 1991) contained an overview article describing the various text processing systems in use at CERN, recommendations for typesetting rules for writing scientific documents (still very relevant today, see http://home.cern.ch/goossens/typorules/typorules.html), and an explanation of the importance of SGML in the publishing world.

CNL 203 contained further information on how best to prepare one's documents for typesetting, and proposed a set of entity names for elementary particles. At the back it included a questionnaire about the text processing needs of the user community (use of systems and macro packages, need for multiple input languages, training, and requirements for included graphics material).

During this period the documentation for the software packages supported by the User Support (later Application Software) Group (e.g., HBOOK, PAW, CERNLIB, GEANT) was translated from Waterloo SCRIPT/SGML into LaTeX and made available as printable PostScript files (CNL 205).

The creation of the Web and HTML

With mainframes being abandoned and most development, administrative, and production work moving to Unix and Microsoft Windows workstations, there was for a few years a three-pronged approach to producing documents at CERN: LaTeX for physics documents (all systems, with first-line support by Michèle Jouhet's team in Michael Draper's ETT/DH Group), FrameMaker for large technical manuals and reports (all systems, with support until the end of 2000 by Mario Ruggier, and presently by Johan Karlsson in IT/API), and Microsoft Word (or, more generally, Microsoft's Office Suite) for administrative work (and some technical work in the engineering sector) on PC and Mac (with support also in ETT/DH).

With the help of a consultant, Sebastian Rahtz, in 1992/93 we installed at CERN a reference system containing all the latest LaTeX developments. This work became the basis of the TeX Live CDROM, now a world reference among TeX distributions, as well as of three books [3,4,5] on the use of LaTeX. Articles about LaTeX developments appeared in almost every CNL from 203 to 225, and from CNL 206 onwards (until today!) the printed version of the CNL has been produced with LaTeX (with one exception, see below).

In the early 1990s a major event happened at CERN: Tim Berners-Lee and collaborators developed the basics of what was to become the Web. In those days (1992-1993) Tim B.-L. was sitting just a few offices down the corridor from where we (M.G. and S.R.) were working, and already at the beginning of 1993 we had translated, with Tim's active help, some LaTeX documents into HTML (we started with the HBOOK manual), first using a home-made ad hoc set of LaTeX macros, later with LaTeX2HTML. A first article on HTML appeared in September 1993 (with a nice picture of an HTML page displayed in Xmosaic, the first generally available X Window browser and the precursor of Netscape), clearly showing that HTML was in use at CERN well before the rest of the world, which became aware of the Web mostly after the ``Woodstock of the Web'', the First World Wide Web Conference, organised at CERN on 25-27 May 1994 (a more detailed history of the Web can be found in [1,2]).

The Web is essentially based on a successful triad: the HTTP protocol, the URL uniform addressing scheme, and the HTML language. The syntax of HTML was inspired by that of the SGML system we had been running at CERN since the mid-1980s. However, the first versions of HTML were presentation-directed, and it was not until version 3 of HTML that a formally correct DTD of the language was published.

By a decision of the CERN Council in December 1994, CERN officially left Web development to the World Wide Web Consortium (W3C, http://www.w3c.org), which had been set up a few months earlier. INRIA (France) and MIT (USA), later joined by Keio University (Japan), were to co-host the W3C and coordinate further Web-related activities.

As far as the CNL is concerned, there was a move from LaTeX source to HTML source in mid-1996 (CNL 223), and one issue (CNL 224) was produced with the ``Print'' menu option of the browser. However, several readers protested about the ``significant loss of markup, structure and readability'' (a judgement the CNL editors shared), and hence it was decided to transform the HTML back to LaTeX to prepare the printed version with a higher typographic quality.

Coming full circle

The Web became so popular that browser vendors started competing to offer specific extensions to the HTML language to attract users wanting to publicise their products. As a result, various mutually incompatible dialects of HTML appeared. Moreover, to really benefit from the Web and the various applications being developed for it, the XML initiative was launched in late 1997. Jon Bosak, who published his seminal article ``XML, Java, and the future of the Web'' (http://www.xml.com/pub/a/w3j/s3.bosak.html) around that time, was one of its main promoters. All this culminated in the publication of the XML W3C Recommendation (first edition http://www.w3.org/TR/1998/REC-xml-19980210, second edition http://www.w3.org/TR/2000/WD-xml-2e-20000814), which defines the XML language in a formal way.

XML is truly and explicitly international in that it espouses Unicode as its basic character set. Unicode (http://www.unicode.org) provides an unambiguous code set, originally conceived as a fixed 16-bit (2-octet) encoding; in fact 17 planes, each containing 65,536 code points, are now defined. The 65,536 characters of plane 0 cover ASCII plus most characters needed for writing the major living languages. The other sixteen planes provide for the inclusion of ancient languages and more specialised mathematical and other characters, plus over 100,000 code points for private-use areas.

Recently, many XML-based applications have been developed, and XML seems to have taken over the Internet world. XML is a lightweight version of SGML that can be parsed by a relatively simple program (publicly available XML parsers exist in almost every computer or scripting language). XML-based standards exist for document navigation and manipulation (XPath, http://www.w3.org/TR/xpath, XSLT, http://www.w3.org/Style/XSL/, and XML Query, http://www.w3.org/XML/Query), schema definition (XML Schema, http://www.w3.org/XML/Schema), formatting and presentation (XSL-FO, http://www.w3.org/TR/xsl/, and CSS, Cascading Style Sheets, http://www.w3.org/Style/CSS/), scalable vector graphics (SVG, http://www.w3.org/Graphics/SVG), mathematics (MathML, http://www.w3.org/Math/), and generalized links (XLink/XPointer, http://www.w3.org/XML/Linking); an XML version of HTML, XHTML (http://www.w3.org/MarkUp/), is being worked on.
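
To give a taste of how these pieces fit together, the following complete XSLT stylesheet is a minimal sketch (the para element is borrowed from DocBook, discussed below) that turns every para element of a source document into an HTML p element, leaving the rest to XSLT's built-in rules:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- the match attribute is an XPath pattern -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>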

Looking back over a period of almost thirty years, one observes that CERN started with specific markup (BARB and SCRIPT-based languages) and went via a more generic approach (CERNPAPER and SGML) to a portable mixture of presentation and structure (LaTeX). HTML started off as essentially presentational (for making nice-looking and attractive Web pages), but it was soon realized that presentation (style) and structure should be independent. Hence there was a return to the SGML approach, pioneered in the early 1980s, in the form of XML, the SGML-lite of the Web. Table 1 compares the syntax of the various text processors used at CERN from the 1980s until now.

Table 1: A comparison of the syntax of various text processors
Description | CERNPAPER | CERN SGML (HTML) | LaTeX | nroff/troff (mm macros) | XML/DocBook (MathML)

Document sectioning commands
Level 0 | .part | <H0> | \part | not available | <part>
Level 1 | .chapter | <H1> | \chapter | .NH1/.H1 | <chapter>
Level 2 | .section | <H2> | \section | .NH2/.H2 | <sect1>
Level 3 | .subsection | <H3> | \subsection | .NH3/.H3 | <sect2>
Level 4 | .subsub | <H4> | \subsubsection | .NH4/.H4 | <sect3>
New paragraph | .para | <P> | \par | .PP/.P n | <para>

Highlighted and other special text
Normal | normal text | normal text | normal text or \textrm{text} | normal text or .R | normal text
Emphasis | .highl1 ... .ehighl1 | <HP1> ... </HP1> (<EM> ... </EM>) | \emph{...} | .I ... | <emphasis>...</emphasis>
Quotation | .quote begin ... .quote end | <Q> ... </Q> | \begin{quote} ... \end{quote} | .QS ... .QE | <quote> ... </quote>
Footnote | .footnote ... .footend | <FN> ... </FN> | \footnote{...} | .FS ... .FE | <footnote> ... </footnote>

Lists (successive input lines within a cell are separated by slashes)
Ordered | .point begin / .point / .point end | <OL> / <LI> xxx / <EOL> | \begin{enumerate} / \item ... / \end{enumerate} | .AL / .LI xxx / .LE | <orderedlist> / <listitem> ... / </orderedlist>
Unordered | .bullet / .bullet ... / .bullet end | <UL> / <LI> xxx / <EUL> | \begin{itemize} / \item ... / \end{itemize} | .BL / .LI xxx / .LE | <itemizedlist> / <listitem> ... / </itemizedlist>
Description (glossary) | .glossary begin / .glossary xxx / ... / .glossary end | <DL> / <DT> xxx / <DD> ... / <EDL> | \begin{description} / \item[xxx] ... / \end{description} | .VL / .LI xxx / ... / .LE | <variablelist> / <varlistentry> / <term>...</term> / <listitem>...</listitem> / </varlistentry> / </variablelist>

How to type mathematics and other special characters
Greek (e.g. $\alpha$) | @a | &alpha; | $\alpha$ | \(*a | &alpha; (&#x03B1;)
Math symbol (e.g. $\approx$) | &approx. | &ap; | $\approx$ | \(ap | &approx; (&#x2248;)
Inline math | special math chars | special math chars | $math text$ or \(math text\) | # math text # | <inlineequation> <math>MathML code</math> </inlineequation>
Display math | .stmath ... .emath | special math chars | \[math text\] | .EQ math text .EN | <informalequation> <math>MathML code</math> </informalequation>
Superscript | &S'text. | <SUP>text</SUP> | $\sp{math text}$ | # sup {math text} # | <superscript>text</superscript>
Subscript | &s'text. | <SUB>text</SUB> | $\sb{math text}$ | # sub {math text} # | <subscript>text</subscript>
Accents and umlauts | &acute.e or &trema.u | &eacute; or &uuml; | \'e or \"u | e\*' or u\*: | &eacute; (&#x00E9;) or &uuml; (&#x00FC;)
Non-breaking space | \ (set by command ``.blank \'') | &nbsp; | ~ | \  or \(space) | &#x00A0;

Specific formatting and page markup
Line break | .break | fall through to formatter | \newline | .br | processing instruction for formatter
Page break | .page | fall through to formatter | \newpage | .bp | processing instruction for formatter

XML and LaTeX, perfect twins for the 21st century?

Most Web-related activities are currently centered on the use of XML for data exchange and document production (Elliotte Rusty Harold's ``Cafe con Leche XML News and Resources'' at http://www.ibiblio.org/xml/ is a good source of up-to-date information). Microsoft, Oracle, IBM, and Sun, as well as open-source projects such as Perl, Python, the GNU Project, and GNOME, have all included XML in their basic strategic plans. Hundreds of dedicated data vocabularies (DTDs), each dealing with a particular subject area, have been proposed (see Robin Cover's ``XML Pages'' at http://www.oasis-open.org/cover/xml.html).

For scientific and computer documentation, DocBook (http://www.docbook.org) markup has been in use for many years. The DocBook DTD or schema contains hundreds of elements to mark up clearly and explicitly the different components of an electronic document (book, article, reference guide, etc.), not only displaying its hierarchical structure but also indicating the semantic meaning of the various elements. Moreover, the structure of the DTD is optimised for customisation, making it relatively straightforward to add or eliminate certain elements or attributes, to change the content model for certain structural groups, or to restrict the values that given attributes can take.

A short example of a document marked up in DocBook and including some math follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE article SYSTEM "mybook.dtd"[
<!ENTITY DB "<application>DocBook</application>">
]>
<article>
<articleinfo>
<title>Docbook with a few formulae</title>
<author><firstname>Michel</firstname> 
        <surname>Goossens</surname>
</author>
<pubdate>Wednesday, 18 March 2001</pubdate>
<abstract>
<para>
This XML document is marked up according to 
the &DB; schema. It shows a few elements of 
the &DB; vocabulary, as well as a couple of
examples of mathematical expressions where 
we used MathML markup.
</para>
</abstract>
</articleinfo>
<section>
<title>A MathML example</title>
<para>
A MathML formula can be typeset inline, as here
<inlineequation>
<math>
 <mrow>
  <msup><mi>&#x03C0;</mi><mo>+</mo></msup>
  <mi>p</mi>
  <mo>&#x2192;</mo>
  <msup><mi>K</mi><mo>+</mo></msup>
  <msup><mi>&#x03A3;</mi><mo>+</mo></msup>
 </mrow>
 </math>
</inlineequation>, a simple particle reaction.
</para>

<para>
A mathematical equation can also be typeset in 
display mode using &DB;'s 
<sgmltag class="element">informalequation</sgmltag> 
element, as is shown in the following example 
containing a summation expression:
</para>

<informalequation>
<math>
 <mrow>
  <msubsup><mo>&#x2211;</mo>
           <mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow>
           <mrow><mi>&#x221E;</mi></mrow>
  </msubsup> 
  <mfrac><mn>1</mn><msup><mi>n</mi><mn>2</mn></msup></mfrac>
  <mfrac><mn>1</mn><mi>m</mi></mfrac> 
 </mrow>
 <mtext>.</mtext>
</math>
</informalequation>
</section>
</article>
The DTD mybook.dtd (see below) is loaded on line 2, and in the internal subset (line 3) an entity DB is defined as a shorthand for ``DocBook'' (marked up as an application element). Then the article information is set up, specifying title, author, abstract, etc. The body of the text shows the markup for two formulae in MathML. The first corresponds to the physics process and the second to the summation equation shown as examples earlier in this article. Note the Unicode character entity references for denoting non-ASCII characters, in particular &#x03C0; ($\pi$), &#x03A3; ($\Sigma$), and &#x2192; ($\rightarrow$) in the first formula, and &#x2211; ($\sum$), and &#x221E; ($\infty$) in the second formula.

The DTD mybook.dtd, which follows, declares how to combine MathML's <math> ``master'' element with the DocBook DTD (lines 1 and 2), while the following lines declare parameter entities for the MathML and DocBook DTDs and load them (last two lines), so that their elements are available to the XML parser. The attribute definition for the math element on line 5 fixes the namespace for that element and its descendants.

<!ENTITY % equation.content "(math+)">
<!ENTITY % inlineequation.content "(math+)">
<!ENTITY % mathml SYSTEM 
   "/opt/XML/cdrom/dtd/mathml/mathml2.dtd">
<!ATTLIST math xmlns CDATA #FIXED 
   "http://www.w3.org/1998/Math/MathML">
<!ENTITY % docbook SYSTEM
   "/opt/XML/cdrom/www.nwalsh.com/docbook/xml/docbookx.dtd">
<!-- load MathML and docbook -->
%mathml;
%docbook;

The XML document described above was first transformed with an XSLT processor into XSL-FO formatting elements with the help of Norman Walsh's XSLT stylesheets (http://nwalsh.com/docbook/xsl/). These formatting objects were then interpreted and typeset with Sebastian Rahtz' PassiveTeX (http://users.ox.ac.uk/~rahtz/passivetex/), and the result is shown in Figure 1.

Figure 1: A simple DocBook document typeset with LaTeX
[image: dbexa.eps]

Various other ways exist to transform XML documents into viewable form with the help of XSLT stylesheets (via HTML, or directly with CSS in XML-aware browsers). XSLT-to-XSL-FO transformations and dedicated applications (such as the Apache Project's FOP, http://xml.apache.org/fop/, or PassiveTeX, see above) can be used to obtain PDF or PostScript.
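
To give an idea of what such a transformation looks like, here is a minimal sketch of an XSLT stylesheet generating XSL-FO (loosely modelled on what full stylesheets like Norman Walsh's do for DocBook, but greatly simplified):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <!-- render each DocBook para as a typeset block -->
  <xsl:template match="para">
    <fo:block space-after="6pt"><xsl:apply-templates/></fo:block>
  </xsl:template>
</xsl:stylesheet>

A formatter such as FOP or PassiveTeX then interprets the resulting fo:block elements and typesets them into PDF or PostScript pages.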

More generally, as already explained in CNL 2000-002 (http://ref.cern.ch/CERN/CNL/2000/002/xml-strategy), XML can be considered the central element in an integrated Internet strategy, where documents (and data) are stored in an ``online'' electronic repository in XML, from which they can be transformed into various formats. We reproduce here the figure from that article (Figure 2), where we emphasise the complementary roles of XML (at the top right), which can be manipulated by a whole set of mostly publicly available standard tools, and LaTeX (at the top left), which is available for high-quality typesetting purposes, where precision and clarity are desired.

Figure 2: An XML strategy for the Web
[image: xml4web]
As browsers become more powerful in the way they can display mathematics (with MathML) and simple two-dimensional graphics (with SVG), using plug-ins or native code, it is expected that in the not too distant future (end 2001?) one will be able to cut and paste a formula, or the data in a graph, from one application (e.g., a browser) into another (e.g., a Java or C++ program) using XML technology.

CERN is presently collaborating with other academic and commercial partners in an IST project funded by the European Union: TIPS (Tools for Innovative Publishing in Science). The aim of TIPS (http://tips.sissa.it) is to develop a set of user-friendly and advanced tools and services, organised in an open system, to support research information production, management, access, and use in a coherent manner. These ideas are being tested on articles of the electronic Journal of High Energy Physics (http://jhep.sissa.it). In particular, the articles are transformed from LaTeX into XML to see how this generic format can optimally support the activities of document writing, reviewing, publishing, searching, disseminating, and reading, as well as communication among members of the research community.

Bibliography

[1] Tim Berners-Lee. Weaving the Web. Orion Business Books, London, 1999.
[2] James Gillies and Robert Cailliau. How the Web was Born. Oxford University Press, Oxford, UK, 2000.
[3] Michel Goossens, Frank Mittelbach, and Alexander Samarin. The LaTeX Companion. Addison-Wesley, Reading, 1994.
[4] Michel Goossens and Sebastian Rahtz, with Frank Mittelbach. The LaTeX Graphics Companion. Addison-Wesley, Reading, 1997.
[5] Michel Goossens and Sebastian Rahtz. The LaTeX Web Companion. Addison-Wesley, Reading, 1999.
[6] D.E. Knuth. The TeXbook. Addison-Wesley, Reading, 1990.
[7] Leslie Lamport. LaTeX, A Document Preparation System. Addison-Wesley, Reading, 1986.
[8] D.E. Knuth. The METAFONTbook. Addison-Wesley, Reading, 1990.
[9] D.E. Knuth. TeX and METAFONT, New Directions in Typesetting. The American Mathematical Society and Digital Press, Stanford, 1979.
[10] International Organization for Standardization. SGML, The Standard Generalised Markup Language. ISO 8879, ISO, Geneva, 1986.

Acknowledgements

I would like to thank Anders Berglund, Julian Blake, Ian McLaren, Miguel Marquina, Jutta Megies, Harry Renshall, David Stungo, David Williams, and Roger Woolnough for sharing information or documents with me while I was preparing this article.


About the author(s): Michel Goossens is a CERN authority on LaTeX, XML, and electronic document publishing techniques in general. He has written several articles and books on these subjects.


For matters related to this article please contact the author.
Cnl.Editor@cern.ch


CERN-CNL-2001-001
Vol. XXXVI, issue no 1


Copyright © CERN 2001 -- European Organization for Nuclear Research