Skip to main content
\(\require{cancel}\newcommand{\definiteintegral}[4]{\int_{#1}^{#2}\,#3\,d#4} \newcommand{\myequation}[2]{#1\amp =#2} \newcommand{\indefiniteintegral}[2]{\int#1\,d#2} \newcommand{\testingescapedpercent}{ \% } \newcommand{\lt}{<} \newcommand{\gt}{>} \newcommand{\amp}{&} \)

Section18Internationalization

Supporting a multitude of possible characters, across many languages and across many output formats can be a challenge. One of our goals is to make this easier for authors. Fortunately, the Unicode standard has led to improvements from the 7-bit ASCII standard of old.

Unicode Characters for HTML Output

First, we discuss HTML output. If you include Unicode characters in your PreTeXt source, they should survive just fine en route to a web browser or e-reader. Here are the caveats for HTML output:

  • So that you can continue to get the best results with print and PDF output, use available empty elements for special characters, even if targeting HTML output, before resorting to a Unicode character. For example, use <times /> for a small “multiplication sign” in text before resorting to the Unicode character U+00D7.

  • How you actually enter Unicode characters into your source file is dependent on your editor and operating system, and is therefore outside the scope of our documentation. You can cut/paste characters and text from the source of our examples for initial testing and experimentation.

  • Always, always identify your source as having Unicode characters by including the incantation <?xml version="1.0" encoding="UTF-8" ?> as the first line of your source file. (You may be able to accurately cut/paste this text here. But if the copy has non-standard characters in it, go back to the top of this source file for a copy.)

  • Alan Wood’s Unicode Resources has a plethora of samples of various groups of Unicode characters. If you, or your readers, are “missing” characters in a web browser, this is a good place to start testing the local setup.

Characters in , PDF, print

The situation for is much more complicated, since pre-dates Unicode's widespread adoption.

This sample article is intended to work well, out-of-the-box, for authors just starting with PreTeXt. So we only include here examples that we know are likely to convert to PDF without any errors. For more extensive examples and experiments, we provide the sample document examples/fonts/fonts-and-characters.xml, so be aware of that example as you look to see what is possible.

Similarly, you should be able to process this sample article successfully with various engines. We test regularly with pdflatex and xelatex and provide online sample PDF output of this document processed by pdflatex. In principle, you should be able to use latex (to produce a DVI), and possibly other (unsupported) engines, such as lualatex.

Once you get beyond the Latin alphabet, with accents common in Western Europe and the Western Hemisphere, you will almost assuredly need to restrict your attention to producing PDF output with the xelatex engine. This is discussed and tested in examples/fonts/fonts-and-characters.xml.

Basic Latin, U+0000U+007F

Unicode uses multiple 8-bit bytes to represent characters, and these are typically expressed in hexadecimal (base 16) notation. Using just a single byte, we can get 256 values, and the first 128 (hex 00 to 7F) are the “usual” Latin characters with some values used as control codes. These 95 characters are the most basic, and should all render using pdflatex or xelatex with no special setup (and HTML). U+0000 to U+001F are control codes and not used here. U+007F is also a control code and so is excluded, while U+0020 is a space, so appears invisible in the table. In the source we have also replaced reserved characters by their PreTeXt equivalent empty elements.

0 1 2 3 4 5 6 7 8 9 A B C D E F
002_ ! " # $ % & ' ( ) * + , - . /
003_ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
004_ @ A B C D E F G H I J K L M N O
005_ P Q R S T U V W X Y Z [ \ ] ^ _
006_ ` a b c d e f g h i j k l m n o
007_ p q r s t u v w x y z { | } ~
Table18.1Basic Latin, Regular
Latin-1 Supplement, U+0080U+00FF

Now we are interested in the next 128 possible bytes, (hex 80 to FF). The first 32 are again control codes and U+00A0 is a non-breaking space, so is invisible, while U+00AD is a soft hyphen (which we have not implemented and so is excluded). We have taken care to see that the remainder will render using pdflatex or xelatex with no special setup (and HTML).

0 1 2 3 4 5 6 7 8 9 A B C D E F
00A_   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
00B_ ° ± ² ³ ´ µ · ¸ ¹ º » ¼ ½ ¾ ¿
00C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
00D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
00E_ à á â ã ä å æ ç è é ê ë ì í î ï
00F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Table18.2Latin-1 Supplement, Regular
Monospace, Basic Latin and Latin-1 Supplement, U+0000U+00FF

A monospace font is critical for samples of keyboard input and to distinguish exact technical input from running commentary. We list here all of the reasonable characters from the first 256 Unicode code points. (We skip the same 65 control characters from above, and the soft hyphen.) These should all render fine in HTML and when processed with xelatex, however our focus with this sample article for PDF output is the capabilities when processed with pdflatex. First, characters from U+0000U+007F.

0 1 2 3 4 5 6 7 8 9 A B C D E F
002_ ! " # $ % & ' ( ) * + , - . /
003_ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
004_ @ A B C D E F G H I J K L M N O
005_ P Q R S T U V W X Y Z [ \ ] ^ _
006_ ` a b c d e f g h i j k l m n o
007_ p q r s t u v w x y z { | } ~
Table18.3Basic Latin, Monospace

Note that the single and double quotes are upright and dumb, not curly and smart: " ' " ' " '. The zero is distinguished from the capital “oh”: 0 O 0 O 0 O. And the numeral one is slightly different from the lower-case “ell”: 1 l 1 l 1 l. The hyphen should be short and not expanded into some other kind of dash: - - -. These characters should all cut/paste out of a PDF into a text editor with no conversion to other characters.

Now the remaining characters from U+0080U+00FF. Inline code (the c tag) and the program tag are implemented in via the listing package and these characters require ad-hoc replacements for processing by pdflatex. (You can see the replacements in the preamble of the source for this document.) The replacement mechanism provided by the listing package will cause the characters below to produce a compilation error if processed by pdflatex and in a table cell in certain situations (which we have avoided in the table below). The only workaround in this case is to switch to xelatex.

0 1 2 3 4 5 6 7 8 9 A B C D E F
00A_ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
00B_ ° ± ² ³ ´ µ · ¸ ¹ º » ¼ ½ ¾ ¿
00C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
00D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
00E_ à á â ã ä å æ ç è é ê ë ì í î ï
00F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Table18.4Latin-1 Supplement, Monospace

The pre tag is implemented in with the fancyvrb package. You can compare results here with the table above, lines here are rows above.

  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

The console tag is also implemented with fancyvrb, with adjustments for the input lines. It will not look like it, but these are 8 such inputs, with similar results to above, but now bolded.

¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Again, more examples and more thorough explanations can be found in the sample: examples/fonts/fonts-and-characters.xml. Be aware that the nature of the more advanced sample is that it will likely produce many errors when processed with pdflatex. Adding -interaction batchmode or -interaction nonstopmode to the pdflatex command-line will sometimes be less painless than acknowledging each error. The more advanced sample will perform well when processed with xelatex.