Text File Encoding in Open Petra: Difference between revisions

From OpenPetra Wiki
Jump to navigation Jump to search
No edit summary
Line 44: Line 44:


=== Exporting ===
=== Exporting ===
When Open Petra creates a text file it uses the UTF-8 format with a Byte Order Mark for files that are to be read by Open Petra and an ANSI file in the user's own machine cod epage for a file that is destined for Petra.
When Open Petra creates a text file it uses the UTF-8 format with a Byte Order Mark for files that are to be read by Open Petra and an ANSI file in the user's own machine codepage for a file that is destined for Petra.


=== Importing ===
=== Importing ===
There are two types of file that Open Petra may import:
There are two types of file that Open Petra may import:
* a 'CSV' file where the columns of data are separated by a character such as a comma, semi-colon or tab (or another character of your choice)
* a 'CSV' file where the columns of data are separated by a character such as a comma, semi-colon or tab (or another character of your choice).  OpenPetra uses CSV files for importing both partner data and finance data.
* an 'EXT' file which is a file that was exported from Petra or Open Petra.  Such files are typically used for importing Partner data and, although human readable text, conform to a specific format
* an 'EXT' file which is a file that was exported from Petra or Open Petra.  Such files are typically used for importing Partner data and, although human readable text, conform to a specific format


Line 56: Line 56:
* If the file has a BOM the encoding is known immediately and the result is unambiguous
* If the file has a BOM the encoding is known immediately and the result is unambiguous
* If the file does not have a BOM
* If the file does not have a BOM
** If the file has no characters at all whose byte value is above 127 (0&F hex) then for our purposes the encoding is not important because all the possible options will be interpreted as containing the same text.
** If the file has no characters at all whose byte value is above 127 (0&F hex) then for our purposes the encoding is not important because all the possible UTF and ANSI options will be interpreted as containing the same text.
** If the file '''does''' contain some characters above 127 then
** If the file '''does''' contain some characters above 127 then
*** We can find out if the file is '''not''' a Unicode file but unfortunately if it is '''not-not''' a Unicode file we have not proved conclusively that it '''is''' one.  If there are some tell tale signs of byte sequences that look like Unicode then the more of these we find the more likely it is that it is one of the Unicode formats.  So deciding that a file is, say, UTF-8 becomes a matter of statistics.
*** We can find out if the file is '''not''' a Unicode file but unfortunately if it is '''not-not''' a Unicode file we have not proved conclusively that it '''is''' one.  If there are some tell tale signs of byte sequences that look like Unicode then the more of these we find the more likely it is that it is one of the Unicode formats.  So deciding that a file is, say, UTF-8 becomes a matter of statistics.
*** Unfortunately, if an ANSI file may contain Chinese, Korean or Japanese characters then we cannot be confident about any of the Unicode formats - although we can be confident that it is '''not''' Unicode if we find a byte pattern that cannot be Unicode.
*** Unfortunately, if an ANSI file may contain Chinese, Korean or Japanese characters then we cannot be confident about any of the Unicode formats - although we can be confident that it is '''not''' Unicode if we find a byte pattern that cannot be Unicode.


If you are importing a CSV file, Open Petra always displays a 'preview' window so that you can see how the content will be read.  You will need to specify the column separator, the format for dates and fractional numbers and choose the text encoding from an auto-detected list of one or more possibilities.  As the client modifies any of these settings the preview display responds to the changes.  Once the client is happy with the content the import can proceed.
If you are importing a Finance or Partner CSV file, Open Petra always displays a 'preview' window so that you can see how the content will be read.  You will need to specify the column separator, the format for dates and fractional numbers and choose the text encoding from an auto-detected list of one or more possibilities.  As the client modifies any of these settings the preview display responds to the changes.  Once the client is happy with the content the import can proceed.  ''The actual text imported will be exactly the text displayed in this window''.


If you are importing an EXT file a preview window will only be displayed if the text encoding is ambiguous.  Furthermore, the preview window will not show the whole file content but only those lines that contain ambiguous characters.  If the preview window is displayed the user selects one of the available encodings to make sure that the text looks correct.
If you are importing an EXT file a preview window will only be displayed if the text encoding is ambiguous.  Furthermore, the preview window will not show the whole file content but only those lines that contain ambiguous characters.  If the preview window is displayed the user selects one of the available encodings to make sure that the text looks correct.


It was noted above that ANSI files containing Asian characters make the auto-detection of the file encoding significantly harder.  For that reason Open Petra has a User Preference named 'Files imported from Petra may contain Asian characters'.  By default this is '''not''' ticked.  The result is that Open Petra does not need to allow for the possibility of, for example, ANSI Japanese.  As a result the auto-detection is much more unambiguous.
It was noted above that ANSI files containing Asian characters make the auto-detection of the file encoding significantly harder.  For that reason OpenPetra has a User Preference named 'Files imported from Petra may contain Asian characters'.  By default this is '''not''' ticked.  The result is that Open Petra does not need to allow for the possibility of, for example, ANSI Japanese.  As a result the auto-detection is much more unambiguous.
 
== Testing All the Encoding Possibilities ==
It will be clear from all the above that Open Petra has been written to support a wide variety of importable text files.  These may have been sourced from Petra, OpenPetra or indeed spreadsheets or document editors that may themselves be able to save their content in a wide variety of formats.  In addition to that, the way that OpenPetra auto-detects the encoding depends on the content of the file and whether the 'Asian characters' tick box has been ticked on the User Preferences screen.
 
So we have created a large number of different import files that are stored in the source code repository that can be used to view the different ways that the client software responds to different inputs.  These files have been 'zipped up' in a single file: doc/TextEncoding.zip.  If you want to explore the issue of text encoding you should extract all these files to a location on your hard drive.
 
=== Explanation of File Names ===
Once you have extracted the files you will see that there are three folders, each of which have two sub-folders.
 
The three folders are
* ANSI
* UTF8_No_BOM
* UTF8_With_BOM
The files in the ANSI folder have been saved using an ANSI code page - usually Western European but in one case using Greek
The files in the UTF8_No_BOM folder have been using UTF-8 format but with no byte order mark
The files in the UTF8_With_BOM folder have been saved in UTF-8 format with a byte order mark at the start of the file
 
Within the folders the files have names that follow this scheme:
* Files containing the word 'Vanilla' have content with no 'funny characters' at all.  This means that any of the encodings that we support would produce ''exactly'' the same file content.  So any encoding choice is ok - the imported text will be the same whatever.  In the absence of a BOM most software applications default to assuming the file was saved as ANSI with the client user's machine codepage.
* Files containing the word 'Ambiguous' contain just one or two code points that could be UTF-8, but equally could be ANSI.  In the absence of a BOM the user is going to have to confirm the encoding, but we will initially suggest that the encoding is UTF-8
* Files containing the word 'Unambiguous' contain more than two examples of a UTF-8 character

Revision as of 09:10, 29 June 2016

Background

Character Sets

Open Petra runs under the .NET Framework and all text in the Open Petra program is stored in 'Unicode' - as is all text in the database itself. Unicode text is how you would imagine sensible text is managed inside a computer. Literally every character that you can imagine ever existing is assigned a unique number in a huge table. (Unicode refers to characters as 'code points' because sometimes a writeable entity is more than a character. But for our purposes a character is good enough). So if you know the number, you know the character and vice versa. No number ever refers to more than one character.

But things were not always so simple. In the early days of computers memory and disk space were expensive and people tried to make do with a limited number of characters. The earliest computers (which were built upon the workings of teletype and other radio transmissions) could only display about 95 different letters and punctuation marks. All of these were so called Latin characters - A to Z and 0 to 9. Of course quite quickly in an international world there was a demand for some other characters but there was a lot to be gained by restricting the total number of individual characters a particular computer used to a maximum of 255. So a system grew up which allowed for many different 'code pages' of characters. So, for example, there was a 'Western Europe' code page and a 'Greek' code page. these were the same for the original first 127 characters but the remaining 128 were different.

This is an extract from the Unicode Consortium web site.

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 8.0 provides codes for 120,672 characters from the world's alphabets, ideograph sets, and symbol collections.

File Formats

The choice of character set is only the beginning of the problem! Because the next stage is how to transfer a text document from one person to another using a computer file. Unless you are going to be a hermit with your computer you are going to need to 'export' or 'import' a text document so how is the stream of bytes that is a file to be turned back into a stream of characters?

ANSI Format

Once again, in the early days, things were simple. Computers read and wrote bytes which conveniently were the right size for storing our 255 characters (some of which were graphical so that you could display boxes and shading). So a stream of characters mapped straight to a stream of bytes - and that was the file. There was nothing to say which code page had been used for the characters themselves. Mostly that did not matter because you were only sharing files with friends in the same country as you, so nobody needed to bother with code pages. This kind of file format is referred to as ANSI , although it actually never was sponsored by the American National Standards Institute.

This is the file format that was used by Petra.

Unicode Format

When you have the capacity to have millions of characters uniquely represented you are not going to manage with a single byte - potentially you will need 4 bytes for every character. But that will be very wasteful for all the commonest characters which would only need one or two bytes. So the Unicode Consortium recognises three basic file encodings: UTF-8, UTF-16 and UTF-32. The numbers give you a clue that 1, 2 and 4 bytes are involved but in a flexible manner. Here again is the official explanation:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.

UTF-8 is the native import/export format for files from Open Petra but as we shall see in the next section, Open Petra is capable of working with many different file formats.

Byte Order Marks

If there are all these different file formats, wouldn't it be nice if all the different formats could 'self-identify'? If there was some way that you could tell when you read a file how it was encoded? Well it's too late to do that with the old ANSI files but Unicode files can optionally be saved with a three-byte 'header' or 'preamble' that identifies the Unicode format precisely. If the software that reads the file checks these first three bytes it then knows how to read the rest of the file. But, of course, the downside of starting with a BOM is that old software that was written before Unicode became popular can get very confused and sometimes fail to open the file altogether. For many UTF-8 format files this is a pity because the old software would be able to read much of the file quite happily. So BOM's are by no means universal and most recommendations come out in favour of not using them if there is any chance that the file will need to be read by 'legacy' software.

More Background

There is a very good article by Joel Spolsky about Text and Encoding that you can find here

Text Encoding in Open Petra

Exporting

When Open Petra creates a text file it uses the UTF-8 format with a Byte Order Mark for files that are to be read by Open Petra and an ANSI file in the user's own machine codepage for a file that is destined for Petra.

Importing

There are two types of file that Open Petra may import:

  • a 'CSV' file where the columns of data are separated by a character such as a comma, semi-colon or tab (or another character of your choice). OpenPetra uses CSV files for importing both partner data and finance data.
  • an 'EXT' file which is a file that was exported from Petra or Open Petra. Such files are typically used for importing Partner data and, although human readable text, conform to a specific format

When Open Petra opens one of these files it examines the file content and attempts to work out which type of text encoding has been used. As explained below, it may be able to do this unambiguously if the file originated from Open Petra, or the encoding may be ambiguous and the user will need to select an encoding from a list of possible options.

Auto-Detection of Encoding

  • If the file has a BOM the encoding is known immediately and the result is unambiguous
  • If the file does not have a BOM
    • If the file has no characters at all whose byte value is above 127 (0&F hex) then for our purposes the encoding is not important because all the possible UTF and ANSI options will be interpreted as containing the same text.
    • If the file does contain some characters above 127 then
      • We can find out if the file is not a Unicode file but unfortunately if it is not-not a Unicode file we have not proved conclusively that it is one. If there are some tell tale signs of byte sequences that look like Unicode then the more of these we find the more likely it is that it is one of the Unicode formats. So deciding that a file is, say, UTF-8 becomes a matter of statistics.
      • Unfortunately, if an ANSI file may contain Chinese, Korean or Japanese characters then we cannot be confident about any of the Unicode formats - although we can be confident that it is not Unicode if we find a byte pattern that cannot be Unicode.

If you are importing a Finance or Partner CSV file, Open Petra always displays a 'preview' window so that you can see how the content will be read. You will need to specify the column separator, the format for dates and fractional numbers and choose the text encoding from an auto-detected list of one or more possibilities. As the client modifies any of these settings the preview display responds to the changes. Once the client is happy with the content the import can proceed. The actual text imported will be exactly the text displayed in this window.

If you are importing an EXT file a preview window will only be displayed if the text encoding is ambiguous. Furthermore, the preview window will not show the whole file content but only those lines that contain ambiguous characters. If the preview window is displayed the user selects one of the available encodings to make sure that the text looks correct.

It was noted above that ANSI files containing Asian characters make the auto-detection of the file encoding significantly harder. For that reason OpenPetra has a User Preference named 'Files imported from Petra may contain Asian characters'. By default this is not ticked. The result is that Open Petra does not need to allow for the possibility of, for example, ANSI Japanese. As a result the auto-detection is much more unambiguous.

Testing All the Encoding Possibilities

It will be clear from all the above that Open Petra has been written to support a wide variety of importable text files. These may have been sourced from Petra, OpenPetra or indeed spreadsheets or document editors that may themselves be able to save their content in a wide variety of formats. In addition to that, the way that OpenPetra auto-detects the encoding depends on the content of the file and whether the 'Asian characters' tick box has been ticked on the User Preferences screen.

So we have created a large number of different import files that are stored in the source code repository that can be used to view the different ways that the client software responds to different inputs. These files have been 'zipped up' in a single file: doc/TextEncoding.zip. If you want to explore the issue of text encoding you should extract all these files to a location on your hard drive.

Explanation of File Names

Once you have extracted the files you will see that there are three folders, each of which have two sub-folders.

The three folders are

  • ANSI
  • UTF8_No_BOM
  • UTF8_With_BOM

The files in the ANSI folder have been saved using an ANSI code page - usually Western European but in one case using Greek The files in the UTF8_No_BOM folder have been using UTF-8 format but with no byte order mark The files in the UTF8_With_BOM folder have been saved in UTF-8 format with a byte order mark at the start of the file

Within the folders the files have names that follow this scheme:

  • Files containing the word 'Vanilla' have content with no 'funny characters' at all. This means that any of the encodings that we support would produce exactly the same file content. So any encoding choice is ok - the imported text will be the same whatever. In the absence of a BOM most software applications default to assuming the file was saved as ANSI with the client user's machine codepage.
  • Files containing the word 'Ambiguous' contain just one or two code points that could be UTF-8, but equally could be ANSI. In the absence of a BOM the user is going to have to confirm the encoding, but we will initially suggest that the encoding is UTF-8
  • Files containing the word 'Unambiguous' contain more than two examples of a UTF-8 character