Text File Encoding in Open Petra

From OpenPetra Wiki
Revision as of 20:43, 27 May 2016 by Alanjfpaterson (talk | contribs) (Created page with "== Background == === Character Sets === Open Petra runs under the .NET Framework and all text in the Open Petra program is stored in 'Unicode' - as is all text in the databas...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Background

Character Sets

Open Petra runs under the .NET Framework and all text in the Open Petra program is stored in 'Unicode' - as is all text in the database itself. Unicode text is how you would imagine sensible text is managed inside a computer. Literally every character that you can imagine ever existing is assigned a unique number in a huge table. (Unicode refers to characters as 'code points' because sometimes a writeable entity is more than a character. But for our purposes a character is good enough). So if you know the number, you know the character and vice versa. No number ever refers to more than one character.

But things were not always so simple. In the early days of computers memory and disk space were expensive and people tried to make do with a limited number of characters. The earliest computers (which were built upon the workings of teletype and other radio transmissions) could only display about 95 different letters and punctuation marks. All of these were so called Latin characters - A to Z and 0 to 9. Of course quite quickly in an international world there was a demand for some other characters but there was a lot to be gained by restricting the total number of individual characters a particular computer used to a maximum of 255. So a system grew up which allowed for many different 'code pages' of characters. So, for example, there was a 'Western Europe' code page and a 'Greek' code page. these were the same for the original first 127 characters but the remaining 128 were different.

This is an extract from the Unicode Consortium web site.

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 8.0 provides codes for 120,672 characters from the world's alphabets, ideograph sets, and symbol collections.

File Formats

The choice of character set is only the beginning of the problem! Because the next stage is how to transfer a text document from one person to another using a computer file. Unless you are going to be a hermit with your computer you are going to need to 'export' or 'import' a text document so how is the stream of bytes that is a file to be turned back into a stream of characters?

ANSI Format

Once again, in the early days, things were simple. Computers read and wrote bytes which conveniently were the right size for storing our 255 characters (some of which were graphical so that you could display boxes and shading). So a stream of characters mapped straight to a stream of bytes - and that was the file. There was nothing to say which code page had been used for the characters themselves. Mostly that did not matter because you were only sharing files with friends in the same country as you, so nobody needed to bother with code pages. This kind of file format is referred to as ANSI , although it actually never was sponsored by the American National Standards Institute.

This is the file format that was used by Petra.

Unicode Format

When you have the capacity to have millions of characters uniquely represented you are not going to manage with a single byte - potentially you will need 4 bytes for every character. But that will be very wasteful for all the commonest characters which would only need one or two bytes. So the Unicode Consortium recognises three basic file encodings: UTF-8, UTF-16 and UTF-32. The numbers give you a clue that 1, 2 and 4 bytes are involved but in a flexible manner. Here again is the official explanation:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.

UTF-8 is the native import/export format for files from OpenPetra but as we shall see in the next section, Open Petra is capable of working with many different file formats.

More Background

There is a very good article about Text and Encoding that you can find here