Text File Encoding in Open Petra

From OpenPetra Wiki
Jump to navigation Jump to search

Background

Character Sets

Open Petra runs under the .NET Framework and all text in the Open Petra program is stored in 'Unicode' - as is all text in the database itself. Unicode text is how you would imagine sensible text is managed inside a computer. Literally every character that you can imagine ever existing is assigned a unique number in a huge table. (Unicode refers to characters as 'code points' because sometimes a writeable entity is more than a character. But for our purposes a character is good enough). So if you know the number, you know the character and vice versa. No number ever refers to more than one character.

But things were not always so simple. In the early days of computers memory and disk space were expensive and people tried to make do with a limited number of characters. The earliest computers (which were built upon the workings of teletype and other radio transmissions) could only display about 95 different letters and punctuation marks. All of these were so called Latin characters - A to Z and 0 to 9. Of course quite quickly in an international world there was a demand for some other characters but there was a lot to be gained by restricting the total number of individual characters a particular computer used to a maximum of 255. So a system grew up which allowed for many different 'code pages' of characters. So, for example, there was a 'Western Europe' code page and a 'Greek' code page. these were the same for the original first 127 characters but the remaining 128 were different.

This is an extract from the Unicode Consortium web site.

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 8.0 provides codes for 120,672 characters from the world's alphabets, ideograph sets, and symbol collections.

File Formats

The choice of character set is only the beginning of the problem! Because the next stage is how to transfer a text document from one person to another using a computer file. Unless you are going to be a hermit with your computer you are going to need to 'export' or 'import' a text document so how is the stream of bytes that is a file to be turned back into a stream of characters?

ANSI Format

Once again, in the early days, things were simple. Computers read and wrote bytes which conveniently were the right size for storing our 255 characters (some of which were graphical so that you could display boxes and shading). So a stream of characters mapped straight to a stream of bytes - and that was the file. There was nothing to say which code page had been used for the characters themselves. Mostly that did not matter because you were only sharing files with friends in the same country as you, so nobody needed to bother with code pages. This kind of file format is referred to as ANSI , although it actually never was sponsored by the American National Standards Institute.

This is the file format that was used by Petra.

Unicode Format

When you have the capacity to have millions of characters uniquely represented you are not going to manage with a single byte - potentially you will need 4 bytes for every character. But that will be very wasteful for all the commonest characters which would only need one or two bytes. So the Unicode Consortium recognises three basic file encodings: UTF-8, UTF-16 and UTF-32. The numbers give you a clue that 1, 2 and 4 bytes are involved but in a flexible manner. Here again is the official explanation:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.

UTF-8 is the native import/export format for files from Open Petra but as we shall see in the next section, Open Petra is capable of working with many different file formats.

Byte Order Marks (BOM)

If there are all these different file formats, wouldn't it be nice if all the different formats could 'self-identify'? If there was some way that you could tell when you read a file how it was encoded? Well it's too late to do that with the old ANSI files but Unicode files can optionally be saved with a three-byte 'header' or 'preamble' that identifies the Unicode format precisely. If the software that reads the file checks these first three bytes it then knows how to read the rest of the file. But, of course, the downside of starting with a BOM is that old software that was written before Unicode became popular can get very confused and sometimes fail to open the file altogether. For many UTF-8 format files this is a pity because the old software would be able to read much of the file quite happily. So BOM's are by no means universal and most recommendations come out in favour of not using them if there is any chance that the file will need to be read by 'legacy' software.

More Background

There is a very good article by Joel Spolsky about Text and Encoding that you can find here

Text Encoding in Open Petra

Exporting

When Open Petra creates a text file it uses the UTF-8 format with a Byte Order Mark for files that are to be read by Open Petra and an ANSI file in the user's own machine codepage for a file that is destined for Petra.

Importing

There are two types of file that Open Petra may import:

  • a 'CSV' file where the columns of data are separated by a character such as a comma, semi-colon or tab (or another character of your choice). OpenPetra uses CSV files for importing both partner data and finance data.
  • an 'EXT' file which is a file that was exported from Petra or Open Petra. Such files are typically used for importing Partner data and, although human readable text, conform to a specific format

When Open Petra opens one of these files it examines the file content and attempts to work out which type of text encoding has been used. As explained below, it may be able to do this unambiguously if the file originated from Open Petra, or the encoding may be ambiguous and the user will need to select an encoding from a list of possible options.

Supported Code Pages

In principle OpenPetra supports all of the hundreds of possible code pages that .NET supports. However, in order to make the GUI simpler and more manageable the GUI supports a limited number of code pages in its drop-down lists - the same ones that Petra supported. These are

  • 1250 Central European
  • 1251 Cyrillic
  • 1252 Western European
  • 1253 Greek
  • 1254 Turkish
  • 1257 Baltic
  • 932 Japanese
  • 949 Korean
  • 950 Traditional Chinese

In addition OpenPetra always supports the client's own machine code page, if it is not one of those listed above. So a user in Vietnam would also see code page 1258 included.

Auto-Detection of Encoding

  • If the file has a BOM the encoding is known immediately and the result is unambiguous
  • If the file does not have a BOM
    • If the file has no characters at all whose byte value is above 127 (0&F hex) then for our purposes the encoding is not important because all the possible UTF and ANSI options will be interpreted as containing the same text.
    • If the file does contain some characters above 127 then
      • We can find out if the file is not a Unicode file but unfortunately if it is not-not a Unicode file we have not proved conclusively that it is one. If there are some tell tale signs of byte sequences that look like Unicode then the more of these we find the more likely it is that it is one of the Unicode formats. So deciding that a file is, say, UTF-8 becomes a matter of statistics. The actual statistical limits used depend on what the content of the file refers to. A piece of prose will have different statistics from something like OpenPetra's files that are used for a specific purpose. We have decided that once a file contains more than 2 'good' UTF code points we are happy to say that that is statistically significant. To put that another way we are saying that it is very unlikely that our text would contain more than two instances of a consecutive pair of the equivalent special ANSI characters that made textual sense.
      • Unfortunately, if an ANSI file may contain Chinese, Korean or Japanese characters then we cannot be confident about any of the Unicode formats - although we can be confident that it is not Unicode if we find a byte pattern that cannot be Unicode.

So how likely is it that we could encounter a UTF-8 character and mistake it for a pair of ANSI characters? Well for that to happen with western European languages you would have to imagine a scenario where one of the characters in the first list below was followed by one of the characters from the second list. People can think of a few ways but I doubt there are many in OpenPetra data.

First list:

ÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß

Second list:

¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿  and non-break space

If you are importing a Finance or Partner CSV file, Open Petra always displays a 'preview' window so that you can see how the content will be read. You will need to specify the column separator, the format for dates and fractional numbers and choose the text encoding from an auto-detected list of one or more possibilities. As the client modifies any of these settings the preview display responds to the changes. Once the client is happy with the content the import can proceed. An important difference which was introduced in OpenPetra at the same time as the auto-detection of encoding, is that the preview window now shows the content of the whole file and not just the first few lines. This means that the actual text imported will be exactly the text displayed in this window.

If you are importing an EXT file a preview window will only be displayed if the text encoding is ambiguous. Furthermore, the preview window will not show the whole file content but only those lines that contain ambiguous characters. If the preview window is displayed the user selects one of the available encodings to make sure that the text looks correct.

It was noted above that ANSI files containing Asian characters make the auto-detection of the file encoding significantly harder. For that reason OpenPetra has a User Preference named 'Files imported from Petra may contain Asian characters'. By default this is not ticked. The result is that Open Petra does not need to allow for the possibility of, for example, ANSI Japanese. As a result the auto-detection is much more unambiguous.

It is worth pointing out that both the CSV and EXT Preview windows are included in the list of OpenPetra screens whose location and size are 'remembered'. So you can feel free to resize both of these preview windows to something that is almost full screen and OP will helpfully display them in the future in the same place.

Testing All the Encoding Possibilities

It will be clear from all the above that Open Petra has been written to support a wide variety of importable text files. These may have been sourced from Petra, OpenPetra or indeed spreadsheets or document editors that may themselves be able to save their content in a wide variety of formats. In addition to that, the way that OpenPetra auto-detects the encoding depends on the content of the file and whether the 'Asian characters' tick box has been ticked on the User Preferences screen.

So we have created a large number of different import files that are stored in the source code repository that can be used to view the different ways that the client software responds to different inputs. These files have been 'zipped up' in a single file: doc/TextEncoding.zip. If you want to explore the issue of text encoding you should extract all these files to a location on your hard drive.

Explanation of File Names

Once you have extracted the files you will see that there are three folders, each of which have two sub-folders.

The three folders are

  • ANSI
  • UTF8_No_BOM
  • UTF8_With_BOM

The files in the ANSI folder have been saved using an ANSI code page - usually Western European but in one case using Greek The files in the UTF8_No_BOM folder have been saved using UTF-8 format but with no byte order mark. Again this includes both western European and Greek files. The files in the UTF8_With_BOM folder have been saved in UTF-8 format with a byte order mark at the start of the file.

Within the folders the files have names that follow this scheme:

  • Files containing the word 'Vanilla' have content with no 'funny characters' at all. This means that any of the encodings that we support would produce exactly the same file content. So any encoding choice is ok - the imported text will be the same whatever. In the absence of a BOM most software applications default to assuming the file was saved as ANSI with the client user's machine codepage.
  • Files containing the word 'Ambiguous' contain just one or two code points that could be UTF-8, but equally could be ANSI. In the absence of a BOM the user is going to have to confirm the encoding, but we will initially suggest that the encoding is UTF-8
  • Files containing the word 'Unambiguous' contain more than two examples of a UTF-8 character

Please note that these names refer more to the content than the outcome. An 'ambiguous' file that has a BOM is unambiguously UTF-8! So it becomes interesting to see the difference in the way that OpenPetra handles the two cases.

The content of the Vanilla, Ambiguous and Unambiguous files in the various folders is the same. The difference is in the encoding and the BOM.

Setting Up a Client System to Test The Files

The file contents are anonymised so you should only work with a database that you are happy can contain invalid data. I like to use a DEMO database whose content can be reset several times during testing. Remember that similar files contain the same data saved different ways.

Most of the time you do not need to actually proceed all the way to importing the data. Once you have seen a preview of the data there is always a 'Cancel' option to halt the import. Even when importing partners using an EXT file that is unambiguous you can click 'Start Import' and see the first partner's data and then click 'Stop Import'.

You should also bear in mind that you will get different behaviour depending on whether the 'Asian characters' checkbox is ticked in your User Preferences. I suggest that you start by leaving it un-ticked and then go through the tests again with it ticked in order to see the differences.

Using the Files

The tables below show what you should expect to see when using each of the files.

CSV ANSI Files

The content of these files is all based on ANSI Code Pages.

File Name Description Expected Result
Vanilla ANSI.csv This file consists entirely of simple Latin characters This file could have been saved in any of our supported formats but we adopt the convention that it was probably saved in the client's own code page. The drop down list contains all the possible code pages but the user's own page is pre-selected. Changing to a different code page would not change the imported result.
Ambiguous ANSI.csv This file contains one character with an accent saved as ANSI. It will not have any valid UTF code points. This file is only ambiguous to the extent that, while it cannot be UTF, it could be any of the standard code pages.
Unambiguous ANSI.csv This file contains several ANSI characters with an accent, but no characters that can be UTF-8 This file could have been saved in any standard code page.
Unambiguous Greek.csv This file contains a string of characters corresponding to the first part of the Greek alphabet. Once again it contains multi-byte code points that cannot be UTF-8. The content will only display correctly if you choose 'Greek' from the drop-down list of encodings.

CSV UTF-8 With No BOM Files

The content of these files has been saved as UTF-8 but there is no BOM

File Name Description Expected Result
Vanilla UTF8.csv This file consists entirely of simple Latin characters Since this file has no BOM and contains no special characters it could have been saved in any of our supported formats but we adopt the convention that it was probably saved in the client's own code page as an ANSI file. The drop down list contains all the possible code pages but the user's own page is pre-selected. Changing to a different code page does not change the imported result. Although the file was technically saved as UTF-8 we do not show that as an option because it is redundant.
Ambiguous UTF8.csv This file contains one accented character saved as UTF-8. But this could equally be interpreted as two consecutive special ANSI characters. Since we cannot be completely confident of the correct interpretation, we show both the UTF-8 option as well as all our standard code pages. However, given that in this case two consecutive ANSI characters are less likely, we pre-select UTF-8 as the likely option in the drop-down list. You can see the effect of the different code pages by trying all the options.
Unambiguous UTF8.csv This file contains sufficient good UTF-8 accented characters for us to be confident that this is indeed the correct format, even in the absence of a BOM. We can display the content and there is no need to offer the ANSI code pages as an alternative.
Unambiguous Greek.csv This file contains the same content as the ANSI one above - a string of characters corresponding to the first part of the Greek alphabet - but this time they are all encoded in UTF-8 format. There are so many 'good' UTF-8 code points that the result is unambiguous. We can display the content and there is no need to offer the ANSI code pages as an alternative.

CSV UTF-8 With BOM Files

The content of these files has been saved as UTF-8. A BOM is included so all the files are explicitly declared as UTF-8

All the files, which contain exactly the same content as their No-BOM equivalents are opened as UTF-8 and there is no need to offer the client any other option.

EXT ANSI Files

File Name Description Expected Result
Vanilla ANSI.ext This file consists entirely of simple Latin characters This file could have been saved in any of our supported formats but would always have the same content. Thus we do not show the Preview Dialog.
Ambiguous ANSI.ext This file contains one character with an accent saved as ANSI. It will not have any valid UTF code points. This file is only ambiguous to the extent that, while it cannot be UTF, it could be any of the standard code pages. We show the Preview Dialog.
Unambiguous ANSI.ext This file contains several ANSI characters with an accent, but no characters that can be UTF-8 This file could have been saved in any standard code page. We show the Preview Dialog.
Unambiguous Greek.ext This file contains a string of characters corresponding to the first part of the Greek alphabet The content will only display correctly if you choose 'Greek' from the drop-down list of encodings. We show the Preview Dialog.

EXT UTF-8 With No BOM Files

The content of these files has been saved as UTF-8 but there is no BOM

File Name Description Expected Result
Vanilla UTF8.ext This file consists entirely of simple Latin characters Since this file has no BOM and contains no special characters it could have been saved in any of our supported formats but would always have the same content. Thus we do not show the Preview Dialog.
Ambiguous UTF8.ext This file contains one accented character saved as UTF-8. But this could equally be interpreted as two consecutive special ANSI characters. Since we cannot be completely confident of the correct interpretation, we show the Preview Dialog with both the UTF-8 option as well as all our standard code pages. However, given that in this case two consecutive ANSI characters are less likely, we pre-select UTF-8 as the likely option in the drop-down list. You can see the effect of the different code pages by trying all the options.
Unambiguous UTF8.ext This file contains sufficient good UTF-8 accented characters for us to be confident that this is indeed the correct format, even in the absence of a BOM. There is no need to display the Preview Dialog.
Unambiguous Greek.ext This file contains the same content as the ANSI one above - a string of characters corresponding to the first part of the Greek alphabet - but this time they are all encoded in UTF-8 format and there are so many of them that the result is unambiguous. There is no need to display the Preview Dialog.

EXT UTF-8 With BOM Files

The content of these files has been saved as UTF-8. A BOM is included so all the files are explicitly declared as UTF-8

All the files, which contain exactly the same content as their No-BOM equivalents are opened as UTF-8 and there is no need to show the Preview Dialog.

Finance CSV Files

There are similar CSV files for Finance: Vanilla, Ambiguous and Unambiguous. They have been included for completeness but their behaviour is identical to the Partner equivalents.