Text File Encoding in Open Petra

From OpenPetra Wiki
Revision as of 10:19, 31 May 2016

Background

Character Sets

Open Petra runs under the .NET Framework and all text in the Open Petra program is stored in 'Unicode' - as is all text in the database itself. Unicode text is how you would imagine sensible text is managed inside a computer. Literally every character that you can imagine ever existing is assigned a unique number in a huge table. (Unicode refers to characters as 'code points' because sometimes a writeable entity is more than a character. But for our purposes a character is good enough). So if you know the number, you know the character and vice versa. No number ever refers to more than one character.

But things were not always so simple. In the early days of computers, memory and disk space were expensive and people tried to make do with a limited number of characters. The earliest computers (which were built upon the workings of teletype and other radio transmissions) could only display about 95 different letters and punctuation marks. All of these were so-called Latin characters - A to Z and 0 to 9. Of course, in an international world there was quickly a demand for other characters, but there was a lot to be gained by restricting the total number of individual characters a particular computer used to a maximum of 256, the number of values that fit in a single byte. So a system grew up which allowed for many different 'code pages' of characters. For example, there was a 'Western Europe' code page and a 'Greek' code page. These were the same for the original first 128 characters (values 0 to 127) but the remaining 128 were different.
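To make the code page idea concrete, here is a small sketch in Python (used purely for illustration; it is not part of Open Petra). It decodes the same two bytes with two Windows code pages, cp1252 and cp1253, which are assumed here as stand-ins for the 'Western Europe' and 'Greek' code pages mentioned above.

```python
# The same bytes read through two different code pages.
raw = bytes([0x41, 0xE1])  # 0x41 is in the shared range, 0xE1 is not

western = raw.decode("cp1252")  # Windows Western Europe code page
greek = raw.decode("cp1253")    # Windows Greek code page

# The first character is identical in both, the second one differs.
print(western)
print(greek)
```

The file itself carries no hint about which of the two readings was intended.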

This is an extract from the Unicode Consortium web site.

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique numeric value and name.

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 8.0 provides codes for 120,672 characters from the world's alphabets, ideograph sets, and symbol collections.

File Formats

The choice of character set is only the beginning of the problem, because the next stage is how to transfer a text document from one person to another using a computer file. Unless you are going to be a hermit with your computer you are going to need to 'export' or 'import' a text document, so how is the stream of bytes that is a file to be turned back into a stream of characters?

ANSI Format

Once again, in the early days, things were simple. Computers read and wrote bytes, which conveniently were the right size for storing our 256 characters (some of which were graphical so that you could display boxes and shading). So a stream of characters mapped straight to a stream of bytes - and that was the file. There was nothing to say which code page had been used for the characters themselves. Mostly that did not matter because you were only sharing files with friends in the same country as you, so nobody needed to bother with code pages. This kind of file format is referred to as ANSI, although it was never actually sponsored by the American National Standards Institute.

This is the file format that was used by Petra.

Unicode Format

When you have the capacity to have millions of characters uniquely represented you are not going to manage with a single byte - potentially you will need 4 bytes for every character. But that will be very wasteful for all the commonest characters which would only need one or two bytes. So the Unicode Consortium recognises three basic file encodings: UTF-8, UTF-16 and UTF-32. The numbers give you a clue that 1, 2 and 4 bytes are involved but in a flexible manner. Here again is the official explanation:

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.

UTF-8 is the native import/export format for files from Open Petra but as we shall see in the next section, Open Petra is capable of working with many different file formats.
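The 'flexible' byte counts can be checked with a short Python sketch (Python is used only for illustration here). The sample characters are an arbitrary choice, and the "-le" codec variants are used so that no Byte Order Mark is counted.

```python
# Number of bytes each encoding form needs for a few sample characters.
samples = ["A", "\u00f1", "\u20ac", "\U0001F600"]  # A, n with tilde, euro sign, an emoji

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    print(repr(ch), encoded_sizes(ch))
```

UTF-8 grows from one to four bytes, UTF-16 uses two or four, and UTF-32 always uses four.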

Byte Order Marks

If there are all these different file formats, wouldn't it be nice if all the different formats could 'self-identify'? If there was some way that you could tell when you read a file how it was encoded? Well, it's too late to do that with the old ANSI files, but Unicode files can optionally be saved with a short 'header' or 'preamble', called a Byte Order Mark (BOM), that identifies the Unicode format precisely. The mark is three bytes long for UTF-8, two for UTF-16 and four for UTF-32. If the software that reads the file checks these first few bytes it then knows how to read the rest of the file. But, of course, the downside of starting with a BOM is that old software that was written before Unicode became popular can get very confused and sometimes fail to open the file altogether. For many UTF-8 format files this is a pity because the old software would be able to read much of the file quite happily. So BOMs are by no means universal and most recommendations come out in favour of not using them if there is any chance that the file will need to be read by 'legacy' software.
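As an illustration of how reading software can check for a BOM, here is a minimal Python sketch. The function name `detect_bom` is hypothetical and this is not Open Petra's own code. The BOM constants come from Python's standard `codecs` module. Note that the length of the mark depends on the encoding.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the same two bytes
# as the UTF-16 LE mark, so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(raw):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0

print(detect_bom(codecs.BOM_UTF8 + b"hello"))
print(detect_bom(b"hello"))
```

A caller would skip `bom_length` bytes and decode the remainder with the returned encoding.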

More Background

There is a very good article by Joel Spolsky about Text and Encoding that you can find here

Text Encoding in Open Petra

Exporting

When Open Petra creates a text file it uses the UTF-8 format with a Byte Order Mark for files that are to be read by Open Petra, and an ANSI file in the user's own machine code page for a file that is destined for Petra.
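The two export cases can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `export_text`, the default code page cp1252 and the use of '?' for characters that the code page cannot hold are all hypothetical, not taken from Open Petra.

```python
def export_text(text, for_petra, ansi_code_page="cp1252"):
    """Encode text for export.

    for_petra=False: UTF-8 preceded by a Byte Order Mark.
    for_petra=True: single-byte ANSI in the given code page (an assumed
    default here; the real choice is the user's own machine code page).
    """
    if for_petra:
        # Characters missing from the code page cannot be stored;
        # replacing them with '?' is an assumption of this sketch.
        return text.encode(ansi_code_page, errors="replace")
    return text.encode("utf-8-sig")  # "utf-8-sig" writes the BOM first

print(export_text("caf\u00e9", for_petra=False))
print(export_text("caf\u00e9", for_petra=True))
```

The ANSI branch shows why such an export can lose information: a character outside the chosen code page has no byte to represent it.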

Importing

There are two types of file that Open Petra may import:

  • a 'CSV' file where the columns of data are separated by a character such as a comma, semi-colon or tab (or another character of your choice)
  • an 'EXT' file which is a file that was exported from Petra or Open Petra. Such files are typically used for importing Partner data and, although human readable text, conform to a specific format

When Open Petra opens one of these files it examines the file content and attempts to work out which type of text encoding has been used. As explained below, it may be able to do this unambiguously if the file originated from Open Petra, or the encoding may be ambiguous and the user will need to select an encoding from a list of possible options.

Auto-Detection of Encoding

  • If the file has a BOM the encoding is known immediately and the result is unambiguous
  • If the file does not have a BOM
    • If the file has no bytes at all whose value is above 127 (7F hex) then for our purposes the encoding is not important because all the possible options will be interpreted as containing the same text.
    • If the file does contain some characters above 127 then
      • We can sometimes prove that the file is not a Unicode file, but unfortunately failing to prove this does not show conclusively that it is one. If there are some tell-tale signs of byte sequences that look like Unicode then the more of these we find the more likely it is that it is one of the Unicode formats. So deciding that a file is, say, UTF-8 becomes a matter of statistics.
      • Unfortunately, if an ANSI file may contain Chinese, Korean or Japanese characters then we cannot be confident about any of the Unicode formats - although we can be confident that it is not Unicode if we find a byte pattern that cannot be Unicode.
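The decision steps in the list above can be sketched in Python. This is a simplified illustration, not Open Petra's actual detection code: the function name `classify` and its return labels are hypothetical, and for files without a BOM only UTF-8 is tested, not UTF-16 or UTF-32.

```python
import codecs

def classify(raw):
    """Return a rough verdict on how raw bytes are encoded."""
    boms = [codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE, codecs.BOM_UTF8,
            codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE]
    if any(raw.startswith(b) for b in boms):
        return "bom"            # unambiguous
    if all(b < 0x80 for b in raw):
        return "ascii"          # every candidate gives the same text
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf8"       # proven: some byte pattern is illegal
    return "probably-utf8"      # consistent with UTF-8, but not proven

print(classify("caf\u00e9".encode("utf-8")))
print(classify("caf\u00e9".encode("cp1252")))
```

The last label is deliberately cautious: bytes that form valid UTF-8 are usually also valid in a single-byte code page, where they simply display as different characters.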

If you are importing a CSV file, Open Petra always displays a 'preview' window so that you can see how the content will be read. You will need to specify the column separator, the format for dates and fractional numbers, and choose the text encoding from an auto-detected list of one or more possibilities. As you modify any of these settings the preview display responds to the changes. Once you are happy with the content the import can proceed.
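The effect of the preview settings can be imitated with a few lines of Python. This is a hypothetical sketch, not the Open Petra preview itself: the function `preview_csv`, its parameters and the sample data are invented for illustration, and date and number formats are left out.

```python
import csv
import io

def preview_csv(raw, encoding, separator, max_rows=5):
    """Decode raw bytes and split the first few rows into columns."""
    # Undecodable bytes are shown as U+FFFD so the problem is visible.
    text = raw.decode(encoding, errors="replace")
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) == max_rows:
            break
    return rows

raw = "Name;Town\nJos\u00e9;M\u00e1laga\n".encode("cp1252")
print(preview_csv(raw, "cp1252", ";"))  # correct settings
print(preview_csv(raw, "utf-8", ";"))   # wrong encoding: accents are lost
print(preview_csv(raw, "cp1252", ","))  # wrong separator: a single column
```

Changing one setting at a time and looking at the result is exactly the kind of check that a preview makes possible before any data is imported.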

If you are importing an EXT file, a preview window will only be displayed if the text encoding is ambiguous. Furthermore, the preview window will not show the whole file content but only those lines that contain ambiguous characters. If the preview window is displayed, you select the available encoding that makes the text look correct.

It was noted above that ANSI files containing Asian characters make the auto-detection of the file encoding significantly harder. For that reason Open Petra has a User Preference named 'Files imported from Petra may contain Asian characters'. By default this is not ticked. The result is that Open Petra does not need to allow for the possibility of, for example, ANSI Japanese. As a result the auto-detection is much less ambiguous.
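The benefit of this preference can be illustrated with a Python sketch. Everything here is an assumption made for illustration: the function `candidate_encodings`, the parameter `may_contain_asian` and the shortlist of code pages (cp1252, cp932, cp936, cp949) are hypothetical and are not Open Petra's actual list.

```python
WESTERN = ["cp1252"]                 # assumed single-byte code page
ASIAN = ["cp932", "cp936", "cp949"]  # Japanese, Simplified Chinese, Korean

def candidate_encodings(raw, may_contain_asian=False):
    """List the encodings (from an assumed shortlist) that can decode raw."""
    names = ["utf-8"] + WESTERN + (ASIAN if may_contain_asian else [])
    result = []
    for name in names:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out
        result.append(name)
    return result

raw = "\u65e5\u672c".encode("cp932")  # the word for Japan, as ANSI Japanese
print(candidate_encodings(raw))                          # preference not ticked
print(candidate_encodings(raw, may_contain_asian=True))  # preference ticked
```

With the preference unticked the multi-byte Asian code pages are never offered, so fewer candidates survive and the choice is more often unambiguous.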