
Talk:Byte order mark

From Wikipedia, the free encyclopedia

Misc


Detailed discussion of BOM does not add to understanding of endianness, and BOM can be taken as a separate concept, so I've moved it back to its own article.

It really was messy in the endianness article, especially as BOM has its own category links, external links, and the like.

--Pengo 00:52, 27 Oct 2004 (UTC)

Some of these edits seem rather dodgy to me.

used-->misused: You claim that using the BOM to mark text as being in a UTF format is misuse, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section, "Byte Order Mark (BOM)" heading) states that the byte sequence may be used to indicate both byte order and character set.

"contrary to its definition": You claim that use of the BOM in UTF-8 is contrary to its definition, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section, "Byte Order Mark (BOM)" heading) suggests otherwise.

FF FE 00 00-->00 00 FF FE (already reverted): Encoding the code point FEFF in little-endian UTF-32 would give FF FE 00 00, as in the original, not 00 00 FF FE as your edit states. Furthermore, the table that was there before your edit exactly corresponds to the information given in http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section, "Byte Order Mark (BOM)" heading).

Unless I see good justification for these edits I will be reverting the two that I have not already reverted. Plugwash 16:13, 24 Dec 2004 (UTC)
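For reference, the byte sequences under dispute can be checked with Python's standard codecs (a quick sanity check against the table, not a substitute for the cited Unicode text):

```python
# Encode U+FEFF (the BOM character) in each UTF-16/UTF-32 form
# and inspect the raw bytes.
bom = "\ufeff"

assert bom.encode("utf-16-be") == b"\xfe\xff"
assert bom.encode("utf-16-le") == b"\xff\xfe"
assert bom.encode("utf-32-be") == b"\x00\x00\xfe\xff"
# Little-endian UTF-32 gives FF FE 00 00, as the original table had it,
# not 00 00 FF FE.
assert bom.encode("utf-32-le") == b"\xff\xfe\x00\x00"
```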

It is now two days since you made the edits and you have not responded. Furthermore, I find you to be a very new contributor who has got into trouble elsewhere and made few other edits. I am therefore reverting the rest of the edits you made to this page. Plugwash 02:23, 27 Dec 2004 (UTC)

Concerning UTF-16 big endian vs little endian


I have noticed that, when given invalid input, the Python interpreter reverses the byte order of UTF-16 big endian and little endian compared to what is actually in the Unicode standard. When Python's codecs module is used to read UTF-8 text in from a file and write UTF-16 text out to another file, and the original UTF-8 file begins with the non-character U+FFFE (encoded as EF BF BE), the non-character is accepted as if it were the byte order mark U+FEFF and the resulting UTF-16 file has the opposite byte order of what was requested. I observed this on multiple platforms and Python versions.

The point is, if you are having trouble with the byte order of UTF-16 text, check your libraries/tools for problems, and verify everything using hexadecimal viewers. You may find incorrect assumptions are being made in your tools or libraries.

Canistota (talk) 14:47, 12 March 2009 (UTC)[reply]
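The byte-level facts behind Canistota's report can be checked in any Python version (a sketch; whether a particular codec build actually mistakes U+FFFE for a BOM has to be tested against that build):

```python
# U+FFFE is a noncharacter; its UTF-8 encoding differs from the
# BOM's EF BB BF by a single byte.
assert "\ufffe".encode("utf-8") == b"\xef\xbf\xbe"  # noncharacter
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"  # byte order mark

# A conforming UTF-8 decoder passes U+FFFE through unchanged;
# treating it as a byte order mark, as described above, is a bug.
assert b"\xef\xbf\xbe".decode("utf-8") == "\ufffe"
```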

Canistota: It's not only python, the description of UTF-16LE and UTF-16BE is reversed/wrong at this page. The UTF16-LE BOM is \xfe\xff resp. "\376\377", the UTF16-BE BOM is \xff\xfe resp. "\377\376" if read bytewise. This can be observed with every tool accepting BOMs or iconv, but in the meantime there are several tools which took the reverse wikipedia BOM, thus have it wrong. — Preceding unsigned comment added by ReiniUrban (talkcontribs) 13:27, 27 October 2016 (UTC)[reply]

Byte Order Mark in UTF-8


Does anyone know why Windows software likes to put a BOM at the front of UTF-8 files? Isn't it true that the order is unambiguous, and thus it does nothing for any endianness problems? Is it simply a way of flagging a file as containing UTF-8 instead of ASCII? -R. S. Shaw 23:38, 5 Jun 2005 (UTC)

Yeah, it's simply used to mark the file as being UTF-8 rather than the system's legacy encoding. Plugwash 00:25, 6 Jun 2005 (UTC)
Whenever you save a file as UTF-8 in Windows Notepad, the UTF-8 BOM is prepended to it. You can use a different editor (a non–Unicode-aware editor or a hex editor) to remove the BOM. If the file contains one or more legal UTF-8 sequences, and only legal UTF-8 sequences, then removing the BOM will have no effect on the file—it’ll still be UTF-8. If the file contains only ASCII and you remove the BOM, Notepad will flag it as ANSI (8-bit codepage mode). If the file contains a BOM and you insert an illegal sequence into it (like a single FF byte in the middle of the text, or C2 E4, etc), then the file will stay intact, but if it hasn’t got a BOM and you insert such a sequence, it’ll revert to ANSI, and legal UTF-8 sequences too will be viewed in Notepad according to the current Windows ANSI codepage semantics (for example CF 80 as Ï€ instead of π if you’re on a US WinXP). --Shlomital 22:33, 2005 Jun 11 (UTC)
On Czech WinXP it works the same. Notepad marks it with BOM for easier recognition of the encoding, but does not require it. It is an unexpectedly tolerant approach.
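Shlomital's π example above can be reproduced without Notepad; the same two bytes simply decode differently under each codec (cp1252 standing in here for the US "ANSI" codepage):

```python
raw = b"\xcf\x80"  # UTF-8 encoding of U+03C0 (GREEK SMALL LETTER PI)

assert raw.decode("utf-8") == "\u03c0"         # read as UTF-8: "π"
assert raw.decode("cp1252") == "\u00cf\u20ac"  # read as ANSI: "Ï€"
```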

Why was the byte sequence EF BB BF chosen to be the mark?


Is there a reason? Or did someone just pick it by chance? —Preceding unsigned comment added by 117.104.188.16 (talk) 10:03, 25 January 2011 (UTC)[reply]

That is U+FEFF (the value of the BOM character) in the UTF-8 encoding. It is what a translator from UTF-16 to UTF-8 that was completely unaware of the BOM would produce by translating the BOM character. Spitzak (talk) 19:37, 25 January 2011 (UTC)[reply]
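Spitzak's explanation is easy to verify: EF BB BF is nothing more than U+FEFF pushed through a UTF-8 encoder, and it round-trips back.

```python
# The UTF-8 BOM is just the UTF-8 encoding of U+FEFF.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"

# Decoding the three bytes yields the BOM character again.
assert b"\xef\xbb\xbf".decode("utf-8") == "\ufeff"
```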

Why is this a problem?


as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages

All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark. Shinobu (talk) 10:18, 20 November 2007 (UTC)[reply]

True, though I could see that doing more harm than good. Imagine you wrote your script on your desktop and it ran fine, but when you put it on the production server an invisible character stopped it from running. Plugwash (talk) 10:22, 20 November 2007 (UTC)[reply]
That assumes that the "free software" is of varied quality, not following a standard. That may be true. However the context for the quote was biased to support this situation. Tedickey (talk) 11:18, 20 November 2007 (UTC)[reply]
"All those tools are free software or have free software equivalents" — no, not proprietary Unixes, and yes they are still around. -- intgr [talk] 11:27, 20 November 2007 (UTC)[reply]
The de-facto standard is for tools (including such core OS components as the binary loader) to recognise a script by the first two bytes of a file being "#!". If some versions of some tools start ignoring a preceding BOM but others don't (free software DOES NOT mean you can force your changes on your distro maker or server host) then IMO there is likely to be far more confusion than if scripts with a BOM universally fail (which afaict is the status quo). Plugwash (talk) 12:57, 20 November 2007 (UTC)[reply]
uh - no. No one's presented any evidence of scripts which would be ambiguous if someone provided a loader which handles BOM. Tedickey (talk) 13:10, 20 November 2007 (UTC)[reply]
I think the real question for Unix shell scripts is, what is the native character encoding that /bin/sh supports? Can you have a shell variable "$STRAßE"? An environment variable of the same name? What about Chinese? My bet is that the Unix shells only support ASCII text, in which case a byte order mark is inappropriate. After all, the kernel is looking for the bytes 23 21, not the characters "#!". Canistota (talk) 23:28, 12 March 2009 (UTC)[reply]
Shell scripts support non-ASCII characters just fine (for instance in string literals - variable names may be more optimistic). The encoding is LC_CTYPE. But this is irrelevant to the recognition of the #! sequence, which is not performed by the shell in any case. Ewx (talk) 08:59, 13 March 2009 (UTC)[reply]
Python and Perl also support the UTF-8 encoding well, including with a BOM, although the shebang does not. — Preceding unsigned comment added by 84.97.14.22 (talk) 16:28, 21 July 2012 (UTC)[reply]
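Canistota's point that the loader matches raw bytes rather than characters can be sketched as follows (`has_shebang` is a made-up helper mimicking the kernel's check, not a real API):

```python
def has_shebang(first_bytes: bytes) -> bool:
    # The exec loader looks for the literal bytes 23 21 ("#!")
    # at offset 0; anything before them, including a BOM, defeats it.
    return first_bytes[:2] == b"#!"

plain = b"#!/bin/sh\necho hello\n"
bommed = b"\xef\xbb\xbf" + plain  # same script saved with a UTF-8 BOM

assert has_shebang(plain)
assert not has_shebang(bommed)
```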

"All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark." – In addition to what User:Plugwash writes above, I do not believe you can convince even a large minority of Unix users that placing a piece of crippled, limited character-encoding metadata into general files is a good idea. Although I only read about it just now, BOM for UTF-8 strikes me as an unusually stupid idea. The section on BOM in RFC 3629 illustrates some reasons why; it is full of heuristics and language that you rarely see in RFCs ("without a good reason", "only when really necessary", "an attempt at diminishing this uncertainty").

Should I interpret the article as if Windows Notepad is the only widely spread software which actually creates UTF-8 BOMs? It would make sense; Microsoft do not care about plain text editing – they are more into "one application, one proprietary file format" – and they have historically not cared about the usefulness of Notepad.

JöG (talk) 09:13, 29 March 2008 (UTC)[reply]

OK, now I see the article says "Quite a lot of Windows software (including Windows Notepad)". But it would be interesting to know if popular, serious text editors on Windows (emacs, vim, UltraEdit and popular Windows-specific editors) do this by default. JöG (talk) 09:18, 29 March 2008 (UTC)[reply]

You named two ports to Windows and one native. That's a rather small and unrepresentative example. There are many Windows editors. Btw, the comment regarding interprocess communication is unnecessary, since it adds no factual information. Take a look at Windows PowerShell, which has to be doing this transparently. Tedickey (talk) 11:01, 29 March 2008 (UTC)[reply]
* UltraEdit: If you select "UTF8" when you save, it adds the BOM without giving you a choice in the matter.
* Vim for Windows: It doesn't give the option to save as UTF8 and does not add a BOM, but when it opens a BOM'ed file it retains the BOM when saving. -- leuce (talk) 20:58, 30 March 2009 (UTC)[reply]
Vim's a port (and it doesn't recognize some of the Windows text formats). By the way, there are probably hundreds of applications to discuss in this manner. Tedickey (talk) 21:05, 30 March 2009 (UTC)[reply]
I agree -- I merely tested these two because they were mentioned by someone previously, plus the three mentioned below. -- leuce (talk) 14:44, 31 March 2009 (UTC)[reply]

In response to JöG's post, here are some Windows programs and whether they add a BOM to UTF8 or not.

  • Akelpad: Gives user a choice, but BOM is suggested by default.
  • MS Word XP: Adds BOM, gives no option not to add BOM. If you open a BOM'ed UTF-8 file in MS Word, it autodetects the encoding as UTF-8; if you open a non-BOM'ed file in MS Word, it makes a guess based on the characters it contains, but if all characters are present in the ANSI scheme, it will save such a file as ANSI, not UTF-8.
  • OpenOffice.org 3.0: Adds BOM, gives no option not to add BOM.

--leuce (talk) 13:11, 29 March 2009 (UTC)[reply]

Too technical!


OK, I understand everything in the article, since I'm a unicodopath, but the intro should say:

  • Unicode is a computer encoding of all languages' characters (in principle),
  • The byte order mark is designed so that a computer that reads it can guess (with a reasonable probability) that the text data is probably Unicode, and
  • Guess what kind of Unicode encoding, since there are many - the article already says that, I just wanted to stress that it should.

The intro is a bit too technical for being an intro. The current text qualifies as a technical description intended for me and you, not any outsider. The missing nouns that should be in the intro are: computer, data coding, natural languages. L8R. Said: Rursus 10:15, 25 April 2008 (UTC)[reply]

I think this set of recommendations is met or eliminated in the current article's text. The explanation that Unicode intends to capture all human languages belongs in the Unicode article (and it's there, and there's a link over there in the first sentence here). The notion that the BOM has the purpose of identifying Unicode (rather than some other encoding entirely) is not, so far as I can see, justified by the primary references, and is significantly undermined by the fact that BOM is in all contexts optional. The "which Unicode encoding" part is, as acknowledged, already captured. Jackrepenning (talk) 22:56, 6 August 2010 (UTC)[reply]

How to remove it


There should be a section on this page discussing how to remove it. The only reason 99% of people would ever come to this page is because they are trying to remove this ugly little thing from a web page they are developing. The 1% of people who come because they are interested in it may be getting what they want but not the rest of us. —Preceding unsigned comment added by Tjayrush (talkcontribs) 16:44, 6 February 2009 (UTC)[reply]


There is a nice, easy-to-use piece of software called bomstrip that makes removing this thing quick work on Linux. I didn't want to edit the page directly but perhaps an interested party can. —Preceding unsigned comment added by Tjayrush (talkcontribs) 18:08, 6 February 2009 (UTC)[reply]

Added remove script to Unwanted BOMs section. In Linux: 1. Search for files containing a BOM by running this command: grep -rl $'\xEF\xBB\xBF' 2. For each file from the search results above, run:

  a. vi <filename from search result>
  b. from inside vi type the command (including the ":" sign)    :set nobomb  
  c. save and exit  :wq  — Preceding unsigned comment added by Drormik (talkcontribs) 13:45, 10 August 2012 (UTC)[reply] 

To be exact: These commands are not for vi but for vim (which is the most popular vi clone). A non-vim implementation of vi (e.g. ex-vi, nvi, ...) most likely will not have an option "nobomb". --Meillo (talk) 18:53, 31 January 2021 (UTC)[reply]
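For setups without vim, the same cleanup is a few lines of Python (a sketch in the spirit of bomstrip; `strip_bom` is a name invented here, not an existing tool):

```python
def strip_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file at `path`, in place.

    Returns True if a BOM was found and removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(b"\xef\xbb\xbf"):
        return False
    with open(path, "wb") as f:
        f.write(data[3:])  # drop the three BOM bytes, keep the rest
    return True
```

Combined with the grep command shown above (grep -rl $'\xEF\xBB\xBF'), this handles a whole tree without opening each file in an editor.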

Whether the Unicode standard recommends the UTF-8 BOM or not


The text is "Use of a BOM is neither required nor recommended for UTF-8" (and this already appears in the cite!). That seems like a pretty clear "not recommended" to me - "neither fish nor fowl" means "not fish and not fowl", it doesn't mean "not fish and not specifically fowl". Ewx (talk) 07:47, 31 March 2009 (UTC)[reply]

And then it goes on to say that applications still must expect that it'll happen. May as well address the complete sentence, rather than construe a (reasonably) carefully worded comment into a completely negative recommendation. Tedickey (talk) 10:07, 31 March 2009 (UTC)[reply]
The Wikipedia text already points out that it may be encountered, and a recommendation not to *use* it doesn't contradict that at all. Ewx (talk) 13:54, 31 March 2009 (UTC)[reply]
But that is the point -- the Unicode standard does not contain a recommendation not to use it. -- leuce (talk) 14:37, 31 March 2009 (UTC)[reply]
Yes it does! The text in the standard is "Use of a BOM is neither required NOR RECOMMENDED for UTF-8" (emphasis mine). That is not an absence of a recommendation to use it and it is certainly not an absence of a recommendation not to use it; it is a straightforward and clear recommendation not to use it. Ewx (talk) 07:49, 1 April 2009 (UTC)[reply]
Indeed (the emphasis is yours). Use the complete sentence, or find another source which supports your viewpoint. Tedickey (talk) 10:56, 1 April 2009 (UTC)[reply]
This is completely ridiculous. The text is right there. It says it's not recommended. Ewx (talk) 08:07, 2 April 2009 (UTC)[reply]
Well, I suspect this is a sticky point. I have searched chapters 2 and 16 of the Unicode standard for references to BOM, byte order mark and UTF-8, and in my opinion the reference under discussion here is the only instance in the standard that speaks even remotely negatively about a UTF-8 BOM. In all other cases where the UTF-8 BOM is mentioned or discussed, it is mentioned as a matter of course in an informational, neutral tone, without making any value judgements or any indication that the UTF-8 BOM is deprecated. My personal take on this reference is that people who want to implement the Unicode standard might wonder why the Unicode standard keeps making reference to the UTF-8 BOM (also in chapter 16) as if it were a valid construct, and they might come away with the impression that the Unicode consortium actually recommends using a UTF-8 BOM even though it is not required. -- leuce (talk) 15:34, 1 April 2009 (UTC)[reply]
Having read comments by some of the people involved (in the topic itself...), my impression is that the statement is a compromise between two viewpoints, neither of which dominated in writing the source we're discussing. Tedickey (talk) 16:31, 1 April 2009 (UTC)[reply]
The phrase "X does not recommend Y" can have two meanings. It can mean that X recommends many things, but that Y is not one of the things that X recommends. Or, it can mean that X makes a recommendation *against* Y. The Unicode article does not recommend against a BOM... it simply does not make a recommendation in favour of it. My gripe is that the wiki article before I edited it did create the impression that the Unicode standard recommends against the use of a BOM. Even if one quotes directly from the Unicode standard, if quoted in a different context it can certainly give a slightly different impression of what the standard intends to say. -- leuce (talk) 14:37, 31 March 2009 (UTC)[reply]
Agree. And (for instance), if you consult some of the secondary sources, it's easy to come up with one that is wholly in favor of one or another viewpoint. (Some are completely absurd, but I see those reflected on this page ;-) Tedickey (talk) 10:06, 2 April 2009 (UTC)[reply]
A recent flurry of edits has opened this can of worms again, and the text has grown decidedly text-booky and verbose. I’ve reverted to the state pre-edits. Firstly, we cannot interpret the Unicode standard for it. The text comes straight from the source. The reader is going to have to decide for “himself” what that means. There is no other authoritative source and therefore we are not allowed to interpret it for the reader. The cited mailing list thread is not authoritative; it is just one of hundreds of discussions all over the Web on the topic, each coming to its own conclusions. Secondly, it makes no sense to prognosticate at length over the reliability or unreliability of the UTF-8 BOM as a signal for UTF-8 encoding. Go find some reliable reference if you feel something definitive needs to be said about it. The article is fine as it is, particularly since these observations about the unreliability of the UTF-8 BOM apply equally well to the UTF-16 BOMs. A file of unknown provenance can never, with 100% confidence, be stated to be in any encoding whatever, or even to be text even though it might be the collected works of Shakespeare in 7-bit ASCII. The best you can state completely confidently is that the content is not in some particular encoding due to a violation of the encoding’s standards. Strebe (talk) 19:57, 14 July 2012 (UTC)[reply]

May I make this edit?


Current:

While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may nonetheless be encountered, and it is explicitly allowed by the Unicode standard[1], the Unicode standard does not specifically recommend its usage[2]. It only identifies a file as UTF-8 and does not state anything about byte order.[3]

When I read these two sentences, it almost sounds as if the Unicode standard identifies a file as UTF-8 :-) That second sentence doesn't really fit anymore. Besides, it repeats what has been explained elsewhere. I suggest we remove it or move it somewhere else in the article. -- leuce (talk) 14:55, 31 March 2009 (UTC)[reply]

Article needs an example


This article should include an example of a byte-order-mark. DMahalko (talk) 23:56, 15 June 2009 (UTC)[reply]

The article currently presents the definition and its encodings, including how it looks when rendered naively in various ways. David Spector (talk) 14:21, 28 March 2013 (UTC)[reply]

Why the dash in byte-order?


The Unicode specification reads "byte order mark", not "byte-order mark". Why was this article's name changed? On the face of it, this article title is wrong. Strebe (talk) 04:01, 28 July 2009 (UTC)[reply]

Proper English would dictate the use of the hyphen. See http://en.wikipedia.org/wiki/English_compound#Hyphenated_compound_adjectives - Blueguy 65.0.223.146 (talk) 00:25, 7 August 2009 (UTC)[reply]
This article is about something that has a name. The name, by the body that coined it, is "byte order mark". It is not encyclopædic to "correct" established terminology; that is editorializing. This article's title is wrong. Strebe (talk) 09:05, 8 August 2009 (UTC)[reply]
Wikipedia rules tell to name articles as the thing is called on the street and in life, not as it's called in the dictionary or how it should be called; Strebe is right. 88.148.214.15 (talk) 20:35, 12 October 2009 (UTC)[reply]
The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was moved.  Skomorokh, barbarian  11:07, 27 October 2009 (UTC)[reply]


Byte-order mark → Byte order mark — Cannot move back to old name without administrator intervention. Strebe (talk) 09:57, 18 October 2009 (UTC)[reply]

The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Which text editors add a BOM to the beginning of text files?


"Some text editing software in a UTF-8 environment on MS Windows adds a BOM to the beginning of text files." Which ones? Tisane (talk) 02:57, 24 February 2010 (UTC)[reply]

Probably a long list (Visual Studio .NET for instance) Tedickey (talk) 09:20, 24 February 2010 (UTC)[reply]

The BOM will make a batch file not executable on Windows…


I removed this completely misleading remark of October 28:[4]. First, it is not impossible to remove the BOM even in Windows, so the conclusion about so-called "ANSI" has no grounds. Second, user:BIL correctly stated that the native encoding for .bat is CP437 but forgot to mention that non-Western Windows localisations actually use different OEM codepages (see below a sample with code page 866); in any case this matter is quite off-topic and irrelevant, though. And, most important, .BATs starting with the BOM do execute:

T:\>test.bat

T:\>я╗┐echo ╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П
'я╗┐echo' is not recognized as an internal or external command,
operable program or batch file.

T:\>ver

Microsoft Windows XP [Version 5.1.2600]

T:\>

The test.bat file contains:

echo ляляляляляля
ver/* UTF-8 */

Incnis Mrsi (talk) 18:15, 6 February 2011 (UTC)[reply]

Huh? Your example shows EXACTLY the problem: the BOM is not removed but is considered part of the "echo" command and therefore the .bat file fails to work. Spitzak (talk) 19:22, 6 February 2011 (UTC)[reply]

I see "the problem", but such .BATs do execute, contrary to the statement quoted in the topic. As there is an error with the first line, there is, obviously, an easy workaround: skip the first line, say, leave it empty. This is all WP:OR, just like the deleted speculations. So I see no reason to keep BIL’s controversial OR in Wikipedia. Incnis Mrsi (talk) 20:36, 6 February 2011 (UTC)[reply]
If you leave the first line in the bat file empty, and save it as UTF-8, there will still be a BOM there, which will cause an error message, but the bat file will be executable. What I wanted to describe is that the Windows command prompt and bat files do not recognise BOM or Unicode. There might be a workaround, but still.--BIL (talk) 21:34, 6 February 2011 (UTC)[reply]
I agree that you have an extremely literal interpretation of the word "execute". Yes, for almost any text file, the program the text file is for will start running, will open the file, and will actually read bytes from it, and only fail when it fails to interpret the line as the user of the text editor intended. By this criterion ALL programs "work with a BOM". However, that is a pretty useless definition. Spitzak (talk) 21:23, 7 February 2011 (UTC)[reply]

Please, Spitzak and Strebe: do tango it into a good text here at Talk. If you cannot solve it here, it will not be a good text in the article page for sure. I really would like to read a good article on this. -DePiep (talk) 20:40, 8 February 2011 (UTC)[reply]

The rationale of this edit is wrong: Without the BOM it would NOT be "the wrong encoding"
The character encoding is declared as part of the text file contents only if there is a BOM and only within Unicode environments. If there is not a BOM, or if the environment is not Unicode, then the character encoding is determined externally. You cannot claim that a file sent to the DOS command line is UTF-8, since, by definition, the file is DOS 437. It does not matter how the file was constructed or what its history was or whether it contains a BOM; when you sent it to the DOS command line, you implicitly declared that it was Code page 437, which is not a Unicode environment. If that is not what you intend, then you simply sent the command line the wrong file. Strebe (talk) 00:04, 9 February 2011 (UTC)[reply]
I shortened the text and wrote that batch files do not support Unicode and therefore not the BOM. Note that echo does not support Unicode, for example writing echo From Genève to Zürich in a batch file gives From Gen├¿ve to Z├╝rich, and Unicode file names do not work either.--BIL (talk) 10:34, 9 February 2011 (UTC)[reply]
Saying "the text has an encoding" shows that you completely do not understand why the BOM is not being recommended by some. Without the BOM, a UTF-8 file containing only ASCII letters is identical to an ASCII file. So it is simultaneously in UTF-8 encoding and also in ASCII encoding and DOS 437 encoding and ISO-8859-1 encoding and CP1252 encoding. The entire design of UTF-8 was to allow this, to eliminate the need to identify and transmit encodings. However it is defeated by the addition of the BOM, which makes the file no longer be in those encodings, all for a completely invisible letter that programs now have to detect and add to their input syntax just so they can skip it! And don't print that bull about "batch files are DOS 437"; if that were the problem, the batch file would produce a "this is in the wrong encoding" error, not complain about the inability to find a command that happens to be equal to the first ASCII command with the three bytes of the BOM added to the start. In reality, batch files are streams of bytes, and the byte values that happen to match the ASCII space and CR and LF and a few other values have some meaning. This is not an "encoding" at all. Spitzak (talk) 20:31, 9 February 2011 (UTC)[reply]
You might consider calming down and perhaps finding some soothing hobbies. You have no idea what I understand and do not understand, and I really am not interested in these sorts of petty pissing matches or discussing who’s stupid. I’m interested in improving Wikipedia. I can’t imagine anyone else is interested in such flaming, either.
We agree that a BOM is not recommended— after all, we must agree because that is what Unicode states. I have no disagreement with the first half of your diatribe. You might reconsider your rant about DOS 437, on the other hand. It is not the job of text processing systems in non-Unicode environments to recognize Unicode conventions. The Unicode Consortium recognizes this and takes pains to make sure no one thinks they’re imposing Unicode on everyone, especially systems that existed before Unicode. Batch files existed long before Unicode. It cannot be batch processing systems’ responsibility to declare that the encoding is wrong because they don’t even know that it’s “wrong”. It’s NOT wrong; by using the file as a batch command, you have imposed DOS 437 semantics onto the file. Therefore your assertion that batch file processing ought to produce a “This is the wrong encoding” error is nonsense. What you are calling a BOM is not a BOM in a batch file; it is a sequence of three characters: the “intersection” glyph from set theory, and two box-corner symbols. Just because Unicode came along does not deprive DOS 437 (or any other encoding) of its upper ASCII register, which you seem to be arguing for by claiming it’s “not an encoding”.
The important thing here is that the declaration of the encoding system is not part of the file’s content; it is externally imposed. A BOM has specific meaning within the Unicode environment. It does not outside of it. Batch files are outside the Unicode environment. It really is that simple. Strebe (talk) 01:00, 10 February 2011 (UTC)[reply]
It is obviously a waste of time trying to explain this. Basically, though: if a program takes some bytes in a buffer and puts them on a device that interprets them according to encoding X, then that buffer is in encoding X. It does not matter if that program does not understand encoding X or that it was written decades before encoding X existed. The bytes are in that encoding because they are interpreted as though they are in that encoding. Anyway, I am going to delete the Windows batch file comment because adding "it is in DOS 437" makes the argument completely nonsensical. Spitzak (talk) 19:36, 10 February 2011 (UTC)[reply]
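The disagreement above has a concrete core: an ASCII-only byte sequence is valid in every ASCII-superset encoding at once, and a prepended BOM is what breaks that (a sketch, with cp437 standing in for the DOS codepage):

```python
line = b"echo hello"

# The same bytes decode identically under every ASCII superset.
for codec in ("ascii", "utf-8", "cp437", "latin-1", "cp1252"):
    assert line.decode(codec) == "echo hello"

# Prepend a UTF-8 BOM and, read as DOS 437, the first word is no
# longer "echo" -- hence cmd.exe's "not recognized as a command"
# message rather than any complaint about encoding.
bommed = b"\xef\xbb\xbf" + line
assert bommed.decode("cp437").split()[0] != "echo"
```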

Dubious claim in "Representations of byte order marks by encoding" section


The GB-18030 section has the following claim: "[132] and [149] are unmapped ISO-8859-1 characters". But my understanding is that these characters aren't unmapped even in ISO-8859, but are C1 control characters; 0-31 is the C0 control area and 128-159 the C1 control area. This is why the mapping by Windows-X of higher Unicode characters to the latter range can cause problems.

I think this section needs to be edited by a knowledgeable person. — 93.97.40.177 (talk) 07:00, 16 June 2011 (UTC)[reply]

You are confusing the character values produced after decoding with the bytes that are in the encoding. 132 is the value of one of the bytes in the GB18030 encoding of the BOM. It and three other bytes decode into the value 0xFEFF. Spitzak (talk) 19:09, 16 June 2011 (UTC)[reply]
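Python's gb18030 codec confirms Spitzak's point: the four bytes 84 31 95 33 are the encoded form, and together they decode to the single code point U+FEFF.

```python
gb_bom = b"\x84\x31\x95\x33"  # GB18030 four-byte form of U+FEFF

# The four bytes decode to the single BOM code point, and round-trip.
assert gb_bom.decode("gb18030") == "\ufeff"
assert "\ufeff".encode("gb18030") == gb_bom
```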

From which version do text editors recognize UTF-8 without a BOM at the beginning of text files?


From which version do text editors recognize, or not recognize, UTF-8 without a BOM at the beginning of text files? Because when all text editors recognize UTF-8 without a BOM, the BOM will not be necessary anymore... — Preceding unsigned comment added by 86.75.236.140 (talk) 10:09, 30 June 2012 (UTC)[reply]


«One reason the UTF-8 BOM is not recommended is that pieces of software without Unicode support may accept UTF-8 bytes at certain points inside a text but not at the start of a text.» The formulation of this sentence looks strange and illogical from my point of view: if a piece of software does not support UTF-8, the presence of a BOM helps to indicate that this software is not compatible with UTF-8. — Preceding unsigned comment added by 86.75.236.140 (talk) 10:12, 30 June 2012 (UTC)[reply]

The rest of the paragraph explains it. Strebe (talk) 18:59, 30 June 2012 (UTC)[reply]
I understood the sentence just now: the intent is to say not to use the BOM, for backward compatibility with legacy software which accepts 8-bit text regardless of encoding.
It seems to me very specific, although I understand such a specific case can be considered by Wikipedia. I assume that in 2012 there are very few pieces of software which have this issue.
I am not sure the case of a compiler is a good example. To be verifiable, the name and version of an assumed-incompatible compiler (for instance PHP 5) should be given as example/reference. For the two compilers I searched for, I understand that this issue is solved and a BOM can be used:
  • A seven years old compiler (Visual C++ 2005)[Notes 1].
  • Another compiler example with gcc fortran five years ago, which considered it as a corrected bug [Notes 2]
So I would prefer a sentence which states BOM is for fully unicode compatible software and for old software (from the XXth century ;-) ), BOM should be avoided. Althought the Unicode position might say the same in a more neutral way[Notes 3] might be better.
Above all, explanation should be simplified to be easily understandable.
To be more neutral, Wikipedia should not also focus on POSIX position but also consider Unicode and Microsoft one.
  1. ^ MSDN states Visual Studio 2005 requires the BOM for code with identifiers, macros, literals and comments in Unicode [1]
  2. ^ The GCC Bugzilla database states that not handling the BOM in the compiler is a bug, which has been corrected [2]
  3. ^ The Unicode FAQ regarding the BOM includes the question «Q: How I should deal with BOMs?», which is answered by a four-case distinction [3]
The article reports Unicode’s guidelines. As stated, the Unicode standard permits the BOM but does not require or recommend it. The sentence that starts, “One reason the UTF-8 BOM is not recommended” does not imply that the Unicode standard recommends against using a BOM. It merely means that the Unicode standard does not recommend using a BOM for UTF-8 and gives an example of why Unicode’s recommendation was formulated the way it is. The Unicode caution may become less and less relevant over time, but the original reasons, one of which appears as that example, are immutable historical fact. By the way, I think your optimism about widespread Unicode compatibility is misplaced: many, many third-party applications have no concept of a UTF-8 BOM, and some truly ancient code continues to be used and relied on now and into the indefinite future because no one will make the investment to overhaul it. But our opinions do not matter here. The article is supposed to be about verifiable facts. Strebe (talk) 20:14, 1 July 2012 (UTC)[reply]
Your explanation here might be clearer than the article, as it is concise and gives a historical rationale. Now I understand the “not recommended”, whose meaning was not trivial, as a caution addressed mainly to the user/data provider: for me it would be clearer to say that a user is not recommended to store text with a BOM, that text encoded with a BOM might not be compatible with old/ancient/legacy programs limited to reading only ASCII, and that for compatibility it is preferable that a program reading text files be improved to handle the BOM correctly when present. In particular, I believe I have read debates justifying not fixing incompatibility bugs with obscure reasons such as “it is not recommended to”. Until now I did not understand!
In my opinion: now, Unicode is everywhere, from the Internet to Linux distributions. Incompatible software will be corrected or be less and less used until abandoned, even if it is a question of years, as with Microsoft Windows, where the DOS box contains legacy software such as DIR, which gives file sizes in the CP850 encoding! But our opinions do not matter here.

w3c + existing software to strip BOM

Note:

For the W3C, “For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else.” ( http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#decode-and-encode ) — I read nothing about this in the article.
I also note there is some software to help developers deal with the BOM, which is not considered by the article, such as:
These look like good additions to the article. Feel free to add them. I think “the… BOM… is considered more authoritative than anything else” is obvious (i.e., if it’s not more authoritative in some instance, then someone has wrecked the byte stream somehow!) or irrelevant (because the byte stream is wrecked and therefore has no coherent encoding), but perhaps there are situations I am not thinking of that W3C has. At least it is a verifiable statement, which the article needs more of. Strebe (talk) 16:41, 2 July 2012 (UTC)[reply]

Bush hid the facts

I understand that the BOM is also a means of avoiding the Bush hid the facts bug. — Preceding unsigned comment added by 77.199.96.98 (talk) 18:55, 9 July 2012 (UTC)[reply]

Requiring a BOM would eliminate this bug. The Bush hid the facts bug occurred when an ASCII file without newline looked like UTF-16 without BOM. The BOM is not required even for UTF-16 for reasons written in the article.--BIL (talk) 09:02, 28 May 2014 (UTC)[reply]
Requiring a BOM in UTF-8 would actually encourage such bugs, not fix them, by encouraging software to check for strange encodings first. Pattern recognition works best when you check the patterns that are *least* likely to occur by accident first (this would be UTF-8 first, and ASCII if there are no bytes with the high bit set). UTF-16 could be recognized (and endianness determined) by looking for a large number of null bytes that do not occur in pairs, but this pattern is a bit more likely in random data than the UTF-8 or ASCII patterns, so it should be checked third. Requiring a BOM in 16-bit text has the same problems as requiring it in UTF-8, though it would fix this particular example. Spitzak (talk) 02:17, 29 May 2014 (UTC)[reply]
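For illustration, the checking order described in the comment above (strictest patterns first) might be sketched as follows. This is a hypothetical Python sketch: the function name and the "unknown-legacy" label are invented, and the UTF-16 test is simplified to counting NUL bytes by position.

```python
def guess_encoding(data: bytes) -> str:
    """Hypothetical sketch of a least-likely-pattern-first detector.

    Not the API of any real detector; for illustration only.
    """
    # NUL bytes are vanishingly rare in 8-bit text but pervasive in
    # UTF-16 text of mostly-Latin script, so look at them first: the
    # byte position holding the NULs reveals the endianness.
    if 0 in data and len(data) % 2 == 0:
        nul_odd = data[1::2].count(0)
        nul_even = data[0::2].count(0)
        if nul_odd > nul_even:
            return "utf-16-le"   # low byte first: NULs in odd positions
        if nul_even > nul_odd:
            return "utf-16-be"
    # No byte with the high bit set: plain ASCII (also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ascii"
    # Strict UTF-8 validation: random non-UTF-8 data almost never passes.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-legacy"  # some legacy 8-bit encoding
```

The strict `decode("utf-8")` call is equivalent to the "look for byte sequences that are *not* UTF-8" test described in this thread: it rejects on the first invalid sequence.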

UTF-8 BOM recommendation

User:Karl432 alerted me on my Talk page that the statement clarifying the Unicode Standard's neutrality with regard to use of a UTF-8 BOM was made by a senior Unicode Consortium member (Technical Vice President, Emeritus). Presumably this is reliable “enough” to cite.

Karl432 also elaborates in his edits, which I deleted: “It is to be noted that the presence of the byte sequence representing an UTF-8 encoded BOM at the start of a text stream or file can be interpreted as a hint that a text stream or file might be encoded as UTF-8, but not as a proof, as such a byte sequence may have other unrelating meanings unless such can be excluded by other knowledge of the context.” I deleted this because (besides being verbose and unencyclopedic verbiage) the same comment applies to any BOM, not just UTF-8. Strebe (talk) 23:53, 15 July 2012 (UTC)[reply]

RFC 5198

When the BOM is used in files: RFC 5198 (an RFC concerning protocols) states that Net-Unicode forbids BOM usage. — Preceding unsigned comment added by 86.75.160.141 (talk) 20:46, 20 November 2012 (UTC)[reply]

How do you interpret RFC 5198 that way? The injunction against BOMs has nothing to do with files. It has to do with transmission of text strings. Strebe (talk) 21:15, 20 November 2012 (UTC)[reply]
Okay; but this article is about the byte order mark in general and not only in files. 86.76.39.126 (talk) 22:39, 24 November 2012 (UTC).[reply]

Difficulty of detecting UTF-8 without BOM

It is not "trivial" to detect whether a file is encoded in UTF-8. Easier than other encodings, yes, but it requires reading through a whole file, looking for characters that distinctly look like UTF-8-encoded characters, and finding enough of them to make a determination that the file is indeed UTF-8. It depends on the definition of "trivial," but I doubt that it meets it. Moreover, if a file contains just one UTF-8 character, the algorithm may fail. Furthermore, what if the file is corrupt and has some invalid characters? The algorithm must not be too quick to bail. Finally, if using the popular ICU library, the detector is for whatever reason very slow. — Preceding unsigned comment added by 173.169.194.3 (talk) 20:49, 27 May 2014 (UTC)[reply]

You are not seeing the solution. You don't look for UTF-8 encoded characters, you look for sequences of bytes that are *not* UTF-8 encoded characters. The *vast* majority of sequences that contain a byte with the high bit set are not valid UTF-8 and it is easy to detect them. For instance a lone byte with the high bit set is not UTF-8. There is no need to read the entire file, and certainly no need to see if the UTF-8 characters make any sense. Checking even the first byte with the high bit set is enough to establish this with such a high degree of certainty that it is very difficult to contrive an example of even one actual word in any language that will fail (I think there is a known German word that, if capitalized and encoded in ISO-8859-1, will produce a valid UTF-8 byte stream, but this is the only example anybody has come up with).
You are right that errors in the encoding would cause a strict version of this to say it is not UTF-8. However I recommend that coding detection be done on-the-fly: at each byte with the high bit set, it checks to see if it is UTF-8. If it is it uses it as UTF-8. Otherwise it can do a legacy conversion based on local pattern matching. This will fix multiple encodings pasted together, which no encoding-detection or BOM scheme will handle.Spitzak (talk) 23:43, 27 May 2014 (UTC)[reply]
That library is slow because it is written incorrectly. It is trying to pattern-match a vast number of legacy encodings before it ever gets around to UTF-8. The extremely fast and reliable not-UTF-8 test should be run first. However, due to the historical addition of UTF-8 after other encodings, they tend not to be written this way. Also the use of the BOM has the perverse effect of making people write incorrect encoding detectors, as the lack of the BOM triggers legacy detection rather than just causing it to check the next high-bit byte for valid encoding. Spitzak (talk) 23:43, 27 May 2014 (UTC)[reply]
The text of the article is now incorrect. While Spitzak is correct that it is normally easy to detect that a file is not UTF-8, he fallaciously claims that it is easy to detect that it is UTF-8. That’s nonsense, especially if the file is short. A correct UTF-8 file could also be a correct file in any number of legacy encodings in the general case. Strebe (talk) 03:51, 28 May 2014 (UTC)[reply]
You are failing to understand. A random sequence of bytes is *extremely* unlikely to be valid UTF-8. This means that if you encounter a string that is valid UTF-8, it is extremely likely it *is* UTF-8, since the odds of encountering a string in an alternative encoding that happens to be valid UTF-8 is very low. For a more specific example, if an ISO-8859-1 string was to be misinterpreted as UTF-8 the only 8-bit characters it could contain are *pairs*, where the first character is an upper-case accented letter (the range 0xC2..0xDF), and the second character is a punctuation mark or C1 control (the range 0x80..0xBF). It is nearly impossible to make a string in any language that makes sense and actually contains such a sequence. I recommend you try to figure one out (try the JP multi-byte encodings and UTF-16 and others, too) and come back here if you can actually find a readable counter-example of more than one word, then you can say this does not work.Spitzak (talk) 02:07, 29 May 2014 (UTC)[reply]
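The narrowness of the pair range described above can be checked mechanically. The following quick Python sketch enumerates every pair of bytes that both have the high bit set and counts how many form a valid 2-byte UTF-8 sequence:

```python
# Count the byte pairs, both bytes with the high bit set, that form a
# valid 2-byte UTF-8 sequence (lead 0xC2-0xDF, trail 0x80-0xBF).
valid = 0
for hi in range(0x80, 0x100):
    for lo in range(0x80, 0x100):
        try:
            bytes([hi, lo]).decode("utf-8")
            valid += 1
        except UnicodeDecodeError:
            pass

assert valid == (0xDF - 0xC2 + 1) * (0xBF - 0x80 + 1)  # 30 leads * 64 trails = 1920
print(valid / (128 * 128))  # ~11.7% of high-bit pairs
print(valid / (256 * 256))  # ~2.9% of all possible byte pairs
```

So only 1,920 of the 16,384 high-bit pairs (and under 3% of all 65,536 possible pairs) survive, which is the subset being discussed.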
I’m not interested in your proclamations of people not understanding. Knock off that idiocy. You have no idea what goes on in other people’s head. Your logic is fallacious. Text files represent a lot more than just real language. They represent all sorts of non-linguistic information. A valid UTF-8 file is also a valid file in many other encodings. “Valid” doesn’t mean human-readable. Strebe (talk) 03:39, 29 May 2014 (UTC)[reply]
If we find a file containing valid UTF-8 data (including more than 5 characters outside the 0–0x7F range), then we should be able to assume that it is intended to be UTF-8 even if there is no BOM. The only example I've seen where valid UTF-8 was intended to be Latin-1 was an example of how text can be misinterpreted if UTF-8 data is thought to be Latin-1. So even if valid UTF-8 data can legally be in other encodings (every file is a legal ISO-8859-x file), in reality it will be unlikely, so being without a BOM and testing for UTF-8 first would do no harm in practice. --BIL (talk) 16:58, 29 May 2014 (UTC)[reply]
You’re just repeating Spitzak, so I guess I’ll repeat myself. You can tell with 100% certainty that a file is not UTF-8 fairly easily in almost all cases because the file violates UTF-8 syntax. You cannot ever tell with 100% certainty that a file is UTF-8 because there is no file that violates all other encodings but adheres to UTF-8. You can increase your confidence the larger the file is and the more non-ASCII bytes are in it. But that heuristic yields low confidence in files that are dominated by ASCII while having just a few bytes above 0x7f. For Web pages, sure, generally you can detect with strong confidence (not 100%, but strong). If the document is a human language, sure, generally it’s pretty clear (not 100%, but strong). Otherwise, no, sometimes it’s not. The article needs to quit talking in certainties and superlatives expressing this imaginary certainty, and it needs to quit failing to distinguish between syntactical certainty and mere circumstantial evidence. Strebe (talk) 05:20, 30 May 2014 (UTC)[reply]
You are still failing to understand. This is a statistical logic problem that often confuses people. Let's vastly overestimate the chances of a random sequence of bytes being valid UTF-8 as 1/1,000,000. This means that if you take the set of all possible strings in this other encoding, 1/1,000,000 of that set will be valid UTF-8. Now pretend there are 500,000 different strings actually being used in the world in this non-UTF-8 encoding. This predicts that there is 1/2 of a string that will be confused, i.e. quite possibly none. Now if you actually extend this to real-world numbers, with actual odds of UTF-8 and string lengths of about 100 characters, the chances are astronomically small. In fact they are so small that I see no need to examine more than the first one or two 8-bit bytes and use that to assume the rest.
You are correct that if there are only 7-bit bytes in there, it may be one of the ancient 7-bit non-ASCII encodings and not UTF-8/ASCII. This problem though exists whether or not you consider UTF-8 first.Spitzak (talk) 20:59, 30 May 2014 (UTC)[reply]
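The geometric fall-off argued above can be illustrated with a quick Monte Carlo sketch (hypothetical code; the function name, trial count, and seed are arbitrary choices for illustration):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only

def fraction_valid(n: int, trials: int = 50_000) -> float:
    """Fraction of uniformly random n-byte strings that decode as UTF-8."""
    hits = 0
    for _ in range(trials):
        data = bytes(random.randrange(256) for _ in range(n))
        try:
            data.decode("utf-8")
            hits += 1
        except UnicodeDecodeError:
            pass
    return hits / trials

# The fraction shrinks geometrically with length, e.g. roughly 0.28 at
# n=2 and already under 1% by n=8 for fully random bytes.
for n in (2, 4, 8):
    print(n, fraction_valid(n))
```

Real text in a legacy encoding is not uniformly random, so this only bounds the intuition; still, the geometric decay is the point both sides are debating.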
1. This conversation does not belong here. It’s WP:OR, and what’s in the article is WP:OR. That is one reason I am finished with this discussion. Because it’s WP:OR, I’m going to delete large swaths of what’s in the article unless it gets cleaned up in a way that meets the consensus of the editors of this article and some semblance of Wikipedia policy. If someone thinks they have something to say on the topic in the article, it had either better not be controversial, or it had better be cited by referring to a WP:RELIABLE source.
2. Spitzak’s proclamations that people don’t understand are a violation of WP:CIV, and his insistence on doing this has made it pointless to engage in productive discourse. That is another reason I am finished with this discussion.
3. Spitzak wishes to talk statistics, which means he’s already agreeing with me that the “is UTF-8” check can only be statistical. He invents some number while ignoring the huge volume of files in existence that are mostly 7-bit ASCII with just a few 8-bit bytes in them. As a simple example, the UTF-8 string, “We don't know what ☔~ means” may just as well be “We don't know what 笘梅 means” in Shift-JIS or something else in the myriad other encodings out there. Files of mostly English text with just a few symbols in them are not rare. Spitzak appears to wish for them not to exist. They exist. So. Given that we are at an impasse, that is a third reason I am finished. After waiting a reasonable time for the article to get cleaned up or cited, I’ll simply do it myself, removing all the WP:OR. The article will be much shorter. Strebe (talk) 04:23, 31 May 2014 (UTC)[reply]
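The Shift-JIS reading in the example above can be reproduced directly. A Python sketch (it relies only on Python's strict shift_jis codec and makes no claim about exactly which kanji appear):

```python
# The UTF-8 bytes of a mostly-ASCII string with a couple of symbols can
# also be a structurally valid Shift-JIS stream, as the example claims.
s = "We don't know what ☔~ means"
raw = s.encode("utf-8")            # ☔ becomes E2 98 94; ~ is 7E
as_sjis = raw.decode("shift_jis")  # decodes without error...
assert as_sjis != s                # ...but to different characters
print(as_sjis)
```

The bytes E2 98 and 94 7E each land on assigned JIS X 0208 kanji, so a strict Shift-JIS decoder accepts the whole stream even though it was written as UTF-8.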
I don't see the problem. It is in general hard to determine the encoding if it is not explicitly given. Why then shall we worry that on rare occasions something is detected as UTF-8 when it is not, when other encoding-determination methods don't work either? I admit that there are advantages to having a BOM on UTF-8, but there should be the option of being without it, for example to edit source code for compilers and interpreters (such as Unix shell scripts) that don't understand UTF-8. Always adding or requiring a BOM would be sub-optimization (optimizing one aspect of a problem in a way that causes problems for other aspects). --BIL (talk) 09:28, 31 May 2014 (UTC)[reply]
I don’t advocate a BOM. I advocate that the article be correct and WP:VERIFIABLE. Strebe (talk) 03:55, 1 June 2014 (UTC)[reply]
I'm sorry, but I am going to continue to state that you don't understand. You blatantly state an irrelevant fact: "Files of mostly English text with just a few symbols in them are not rare." The fact that you say this shows that you are not comprehending the problem or solution. The real fact is: files containing symbols ONLY IN PAIRS AND ONLY IN A 2.6% SUBSET OF ALL POSSIBLE PAIRS are rare! Please understand this and don't screw up Wikipedia with your incorrect assumptions, as they are as much original research as anything else. Spitzak (talk) 00:18, 1 June 2014 (UTC)[reply]

Some interesting (and mostly correct) comments here: [[6]]. They actually underestimated the chance that a 3-byte sequence will collide with UTF-8, although they also did a poor job of extrapolating this to actual files (which are longer than 3 bytes and are in encodings that have their own patterns that make UTF-8 collisions less likely), I tried to add a comment to fix this.

The main problem is that there are bad detectors that basically say "if the start is not the UTF-8 BOM" then NEVER consider UTF-8 again, or defer it until after a lot of other, much less reliable pattern detectors are tried.

A correct detector should run the UTF-8 validation first: if it passes and the text contains an 8-bit byte, it is so highly likely to be UTF-8 that the answer should be considered settled. Note that this is redundant with the BOM detection (since the BOM would trigger this test), so that should not be done (also because BOM detection fails on the common mistake of concatenation, where a BOM ends up before non-UTF-8 text).

These bad detectors are probably the biggest impediment to implementation of Unicode, because they encourage programs to "default" to non-unicode.Spitzak (talk) 19:16, 2 June 2014 (UTC)[reply]

Charset detectors

It may be useful to add examples of broken charset detectors that cause people to think the BOM is necessary to identify UTF-8. A broken detector is one that does not test for UTF-8 first using pattern matching, returning UTF-8 if and only if that succeeds. This works because valid UTF-8 is a vanishingly small subset of all possible byte sequences, so a pattern match is an extremely reliable detector.

The most obvious broken detector is the Bush hid the facts one in Windows, which tests for UTF-16LE first, then apparently checks for one-byte "legacy code page" encodings. It is not clear if it *ever* checks for UTF-8 other than making sure the first three bytes are the UTF-8 encoding of BOM.

The Mozilla example was cited as one that does it correctly. Referenced document [7] seems to indicate this, though it is not clear. The paragraph on coding patterns sort-of identifies it: an invalid sequence immediately says this is *not* UTF-8. However it also requires a "threshold" of multi-byte characters to exist. This may just mean a threshold at which it aborts without checking the entire text, which is ok. However the text seems to imply that a certain number of multibyte characters must be found for a positive result. This threshold should be one, or zero if the only 7-bit character set it can return is ASCII (because ASCII is UTF-8, and therefore if there are no 8-bit characters the file is UTF-8). The document is also not very clear on whether passing this test forces UTF-8 to be detected, or whether other tests could somehow "weigh" higher; that would also be broken.

Spitzak (talk) 20:00, 22 April 2015 (UTC)[reply]

Thanks for your efforts, Spitzak. The edit you’ve made has a lot of problems, and I’ve reverted it again but have left the citation for Mozilla Universal Charset Detector. I see now what you mean by “counterexample of…”, and so we agree on this point, but the verbiage is so convoluted that I read it the opposite of how you meant it. So:
  • Parenthetical statements are discouraged because they’re syntactically weak and generally indicate content that’s not relevant or that should be explained elsewhere.
  • Next, your edit states, “…many algorithms first try to detect legacy encodings, which are complicated, error-prone, or slow…” Lexically, the sentence is stating that legacy encodings are complicated, error-prone, and slow, when I assume you mean the implementations of the algorithms.
  • Your text seems to give two counterexamples, but says “a counterexample”.
  • Your text again uses exaggerating adjectives such as “trivial” while downplaying the problem of ambiguity in results.
  • Your text states, “complicated, error-prone, or slow, thus making UTF-8 be accidentally detected as another encoding”. Error-prone is the only one of those three conditions that would cause “accidental detections”, making the sentence again awkward in any case and incorrect by some readings. Meanwhile you have no reference for this claim of accidental detections. (The entire section is poorly referenced, but as long as the content is not controversial, it might as well stay.)
Meanwhile, the original text is accurate and reasonable. Your justification for reverting was, “Try to restore the fact that bad code is the reason for the BOM.” This rationale is unsupported opinion—and indeed, unsupportable. Strebe (talk) 06:40, 23 April 2015 (UTC)[reply]
It is obviously impossible to fix the misconceptions. I will just leave it, I give up. However I seriously believe that an error chance of 1/20^n (where n is the number of bytes in the document) does NOT have a "problem of ambiguity in the results", we are talking about chances that quickly reach astronomical proportions. I also believe an algorithm that does a pattern matching that can be described by a regexp is "trivial", when the alternatives actually have to know information about human languages. But it is impossible to correct this because people are convinced that the BOM is there for a reason, so incorrect information will remain and propagate in wikipedia. What a shame.Spitzak (talk) 15:38, 24 April 2015 (UTC)[reply]
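For illustration, the kind of regexp-describable pattern match referred to above might look like the well-known table of well-formed UTF-8 byte sequences (RFC 3629) transcribed as a byte regexp. This is a standalone sketch, not code from any detector discussed here; the names are invented:

```python
import re

# Well-formed UTF-8 byte sequences (RFC 3629 syntax), as a byte regexp.
UTF8_PATTERN = re.compile(rb"""\A(?:
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general case
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general case
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
)*\Z""", re.X)

def looks_like_utf8(data: bytes) -> bool:
    """True if every byte sequence in data is well-formed UTF-8."""
    return UTF8_PATTERN.match(data) is not None
```

Each alternative is selected deterministically by its lead byte, so the match runs in linear time; rejecting on the first invalid sequence is exactly the "not UTF-8" test described in this thread.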
The BOM has existed since 1.0, before there was any “bad code”, so your assertion about the purpose of the BOM is false. Strebe (talk) 01:20, 25 April 2015 (UTC)[reply]
My assertion is about the use of the BOM to identify UTF-8. That was not its purpose back then; its purpose was to determine the byte order of UCS-2. The "bad code" dates from the 1960's if you include the first attempts to determine character sets of raw data, though until multibyte encodings showed up there were no illegal byte sequences, so code detection could only be done the bad way. Spitzak (talk) 03:26, 29 April 2015 (UTC)[reply]

Latest edit disagreements

User:Spitzak reverted this edit with the explanation, No, or it would not be called "BYTE ORDER MARK!!!!" It is NOT for identfying UTF-8 no matter how much you wish otherwise. His reason is spurious; the edit does not state or imply that the BOM is for identifying UTF-8. He then makes his own edit. I reverted his edits, explaining that his rationale for reverting mine was flawed. Then he chastises me with, “Let's try this. Please don't revert huge amounts of work without explanation”, which is a curious complaint given his earlier behavior.

User:Spitzak’s new edits make the article worse in the following ways:

  • It removes the concise, easily digested list of purposes for the BOM and replaces it with dense prose that breaks up the uses of the BOM into two paragraphs.
This one buries the original reason in the middle and repeats the "identify Unicode" reason twice as items 1 and 3. Not only that, it is wrong. Even Microsoft will recognize UTF-16 without the BOM, and I kind of doubt much software recognizes UTF-32 using it (as it will look like UTF-16 starting with a BOM and a NUL).
Therefore there are two reasons: identify byte order in UTF-16 and UTF-32, and as a marker to distinguish UTF-8 from legacy 8-bit encodings.
Identifying the stream as Unicode is not the same as determining byte order. In fact they are completely distinct in the abstract. It’s fine to put the endiannness at the top if its placement in the middle bothers you.
  • It now states, “Because Unicode can be encoded as 16-bit or 32-bit integers…”, deleting the important fact that Unicode can also be encoded as 8-bit integers.
8-bit integers are not a reason for the BOM, therefore it certainly should not say "because it can be encoded in 8 bit units"!
It doesn’t say that at all. All mentioning 8-bit does is give an exhaustive list for people who would otherwise misunderstand that only 16- and 32-bit Unicode exist.
  • It fails to make clear immediately that programs consume the BOM (as opposed to humans), which is one of the useful pieces of information User:Stevage’s recent edit was getting at.
I have no idea what you are talking about. The character when correctly rendered is invisible so of course it is looked at by programs and not humans. Not sure what you mean by "consumed" but (see next item) even you agree that the character should not be removed by anything.
The character when correctly rendered is invisible: You know that. I know that. The person reading the article doesn’t necessarily know that. Why do you think User:Stevage added that fact? Again, when you write an article you need to give reasonable context for the human who comes to learn something.
  • It deletes a reason given in the citation for why the standard does not recommend removing a BOM when present.
I believe it is vital that programs not remove it, so I don't think I deleted that.
But you did. You deleted, “and so that code that relies on it continues to work.”
  • It replaces “Not using” with “not requiring” in “Not requiring a BOM lets the text be backward-compatible…” which is nonsensical. It is the use, not the requirement, that interacts with software.
I agree with you
  • It injects “…or 8-bit ASCII-based character sets” into “Often, a file encoded in UTF-8 is compatible with software designed for ASCII or 8-bit ASCII-based character sets…”, which is simply wrong without a lot of caveats and explanation.
It is *more* likely to work with software designed for 8-bit character sets (ASCII-only code sometimes thought it was not important to preserve the high bit) so this statement certainly is not wrong. However it probably is not necessary to mention it as true ASCII-only software is long obsolete.
It is difficult to grasp what you mean here, but perhaps you mean, “Often, a file encoded in UTF-8 is compatible with software that assumes 8-bit characters but that does not base processing around some specific encoding”. In that case, I agree with you.
  • It yet again injects massive personal advocacy with It is actually reliable to detect UTF-8 in a byte stream without relying on the BOM. The vast majority of random byte sequences are not valid UTF-8, therefore a file in another encoding can very likely to be proven to not be UTF-8 because it will contain at least one byte sequence that is not valid UTF-8. This has the (counter-intuitive) implication that a file that contains only valid UTF-8 is equally likely to be UTF-8. Over the course of several years User:Spitzak has repeatedly used unencyclopædic, superlative verbiage to enthuse over the detectability of UTF-8. This is not acceptable practice. Meanwhile, these claims are never cited, and the citation connected to this new edit and remaining from the previous state of the article does not support User:Spitzak’s claims.
Best one I can find is [8]. This seems to be much more accurate and gives the probability of 0.87739479563671×0.56471777839234^n (~ 1/(1.14*1.77^n)), somewhat better than the values I have estimated of 1/2^(n-2) for small n (mine appears to be wrong for large n but both are huge then). That is for fully random sequences, I have generally worked with a value of twice that because I am assuming that you first search for a byte with the high bit set and then test starting from there. Using his approximation it looks like testing 28 bytes will give you a more reliable test than checking if the first 3 bytes are the BOM (since that has a chance of being wrong of 1/256^3). This really should be in the UTF-8 page, too, it is a common question.
A random byte array is only one minor factor in making such a claim. Real text isn’t random bytes. To be clear: I agree (as I have stated several times here over the years) that false positives in identifying UTF-8 are rare when done correctly. However, I have no citation for that claim, and my own experience and calculations are irrelevant.
  • It deletes the important, and cited, observation that Microsoft’s infrastructure requires a BOM for UTF-8.
"Microsoft software will not recognize UTF-8 unless it starts with a BOM or contains only ASCII characters" is right there!
My mistake. The rearrangement confused me.
  • It’s broken syntactically in any number of ways, from periods on both sides of references multiple times to poorly structured sentences with multiple, confusing conjunctions.
Most likely you are right.

Meanwhile it provides nothing over what was already there. I am reverting this edit. Strebe (talk) 02:09, 7 May 2015 (UTC)[reply]

It is obviously hopeless for me to stop this propagation of misinformation. Oh well.Spitzak (talk) 02:11, 8 May 2015 (UTC)[reply]
I’m constantly bemused by your assertion that “misinformation” is being propagated. That’s simply not true. The article does not advocate a BOM for UTF-8. It recognizes that UTF-8 can be detected with good chance of success. We’re on the same side here, but for some reason, over the course of years, you have repeatedly claimed the article “misleads” people, apparently because it doesn’t express your belief that UTF-8 can be detected perfectly. What is the problem here? I have never gone on record because my opinion ought to be irrelevant, but the fact is that I do not approve of using a BOM with UTF-8. My only agenda here is to provide verifiably correct information in the article. Strebe (talk) 05:59, 12 May 2015 (UTC)[reply]
The text has ALWAYS misled by conflating the ability to decide if a string is UTF-8 with the ability to figure out which non-UTF-8 encoding is being used, when discussing pattern recognition.
  • It conflates no such thing.
QUOTE DIRECTLY FROM CURRENT ARTICLE WITH IMPORTANT WORDS IN BOLD: "without a BOM, heuristic analysis is required to determine what character encoding a file is using. Many extant algorithms for distinguishing legacy encodings are complicated, error-prone, or slow". Why are "legacy encodings" not mentioned as a step you have to do when there is no BOM? That is artificially burdening pattern recognition with a step that either it does not have to do, or that you have to do when you use the BOM. This is incorrect.
  • I still cannot figure out what your objection is or what you believe you’re saying or even what you believe the article text is saying. Whether UTF-8 or a legacy encoding, a heuristic is required for detection, just as the text states. The text states legacy encodings are complicated. It states UT8-8 is simpler. Why is any of that controversial?
The need to figure out which non-UTF-8 is in use when there is no BOM is however ignored. This artificially makes pattern recognition seem less reliable in comparison.
  • It makes seem no such thing.
SEE ABOVE QUOTE. Why does it mention that you have to figure out "legacy encodings"???? If you insist that a BOM start UTF-8, then you can say that text that does not start with a BOM is "not UTF-8". You can do EXACTLY THE SAME THING with pattern matching, you can say a piece of text is "not UTF-8". Now when it is "not UTF-8" it is in *some* "legacy encoding" and maybe you have to figure out which. BUT THIS IS THE SAME PROBLEM FOR BOTH!!!!!
  • Why does it mention that you have to figure out "legacy encodings"???? It mentions no such thing. You have simply chosen to read it that way because of your multi-year crusade about this. The text says no such thing. It means no such thing. It compares the complexity of legacy encoding detection to the simplicity of UTF-8 detection. That’s all it does. I would think you would approve, but you do not seem to be able to read the text objectively.
The new intro calls the BOM a "high confidence" indicator. This implies it is better than other methods,
  • It implies no such thing.
Holy crap. Let's quote DIRECTLY FROM THE ARTICLE: "That the text stream is Unicode, to a high level of confidence". Really it says that exactly!!!! So 1/256^3 is a "high level of confidence", while pattern matching the string, which has a lower chance of being wrong if there are 7 or more non-ASCII characters, is not????
  • Yet again, you do not seem to be able to read what the article says, but instead inject all sorts of meaning and implication where there is none and cannot reasonably be interpreted to have any. There is no comparison whatever to UTF-8 detection without a BOM. None. There is none intended. It is just an independent, obvious, uncontroversial statement of fact. No one else reading the article would read it the way you have because there is no context that suggests any such thing. Your history with this article is the only thing that leads you to inject these ridiculous interpretations, in my assessment. Please step back, calm down, and read what is there instead of seeing all these monsters in the dark closet.
which is false.
It also now claims the BOM is being used to identify all types of Unicode, also false, Microsoft does not require the BOM on UTF-16.
  • It’s not false; it’s not clear why you think it must be false, and invoking Microsoft is a non sequitur. Just because programs can identify Unicode without a BOM does not mean programs do not use the BOM, when present, to identify Unicode. That happens all over the place. It happens in the Mozilla charset detector;[1] it happens all over Microsoft’s infrastructure;[2] browsers compliant with HTML5 are required to do so: In HTML5 browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages;[3] and it happens all over the place in practice, as you can see if you perform an Internet search on “first check for unicode bom”. If a BOM is present, nobody then goes to see if the text is actually Windows-1252 or anything else. People would be silly to ignore the BOM hint if present.
Please learn the difference between "BOM implies Unicode" (which I have no problem with and what you are describing here), and "no BOM implies this is NOT Unicode". The second statement is where the difference is. You can make Microsoft software read UTF-16 without a BOM. You CANNOT make Microsoft software read UTF-8 without a BOM. THERE IS A DIFFERENCE!!!!!!
  • Please learn the difference between "BOM implies Unicode" (which I have no problem with and what you are describing here), and "no BOM implies this is NOT Unicode". The text says no such thing. There is no justifiable interpretation by which you can make it read or imply that “no BOM implies this is not Unicode.” That text does not exist. You have, again, imagined what is not there. Stop it. The text states that if the BOM is there, it can be used to infer Unicode—which you seem to agree with. It states that if there is no BOM, the fact that it is Unicode can be determined through heuristics—which you also agree with. That’s all it says. Strebe (talk) 01:29, 15 May 2015 (UTC)[reply]
I now think a possible fix is to link this with Magic_number_(programming)#Magic_numbers_in_files, to point out what it actually is. That article has good points about the positives and negatives of using it for this. Spitzak (talk) 19:41, 12 May 2015 (UTC)[reply]
Can we please be done with this? The article is correct and verifiable. Nobody is being misled. No bad practices are taught. Strebe (talk) 02:02, 13 May 2015 (UTC)[reply]
No. This article is horribly misleading and causing dangerous incorrect software to be written. This has to be fixed. Spitzak (talk) 23:10, 14 May 2015 (UTC)[reply]
I have reverted your unjustified deletions and the problematic control characters that don’t render correctly. I’m probably going to put the shebang comment back in, but perhaps you have a better explanation than you put into your edit comment; I will wait for that. Strebe (talk) 01:37, 15 May 2015 (UTC)[reply]

References

"legacy encodings" objection

[edit]

Probably should start a new section about this. I am VERY opposed to the current wording and will try to explain why.

strebe: I still cannot figure out what your objection is or what you believe you’re saying or even what you believe the article text is saying. Whether UTF-8 or a legacy encoding, a heuristic is required for detection, just as the text states. The text states legacy encodings are complicated. It states UTF-8 is simpler. Why is any of that controversial?

The disputed sentence: heuristic analysis is required to determine what character encoding a file is using. Many extant algorithms for distinguishing legacy encodings are complicated, error-prone, or slow

I will try VERY HARD to explain my objection. Can you please read this carefully and post any question you have:

Imagine there is a function that checks if the BOM is at the start of the file. It is called hasUTF8BOM().

Imagine a second function that does pattern matching to determine if a file contains only valid UTF8, and it is called matchUTF8(). It is in no way "error-prone" (it is in fact a good deal more reliable than relying on the odds that the first 3 bytes happen to not be the BOM in non-UTF8). You can argue about whether it is "complicated" or "slow" but imho it is neither of them when compared to the next function.

Let's imagine a third function which examines a text string, called getLegacyEncoding(). This returns whichever of the encodings you are interested in, other than UTF8, the text appears to be in. I think it is fair to describe this function as "complicated, error-prone, or slow" and in fact that is exactly what the text is referring to.

Okay, let's write a program that uses these functions and returns true/false as to whether you think the file is UTF-8:

Version 1:

  isUTF8(f): return hasUTF8BOM(f)

Version 2:

  isUTF8(f): return matchUTF8(f)

Wait a second! Neither version calls getLegacyEncoding()! So why is there some text talking about that when discussing using version 2????

Okay, maybe your concern is that you do need to figure out the legacy encoding. Let's try some new functions that figure out the encoding using the above:

Version 1:

  getEncoding(f): return hasUTF8BOM(f) ? UTF8 : getLegacyEncoding(f)

Version 2:

  getEncoding(f): return matchUTF8(f) ? UTF8 : getLegacyEncoding(f)

Oops! You need to call getLegacyEncoding() in both of them!

Do you understand? The complexity of distinguishing legacy encodings is irrelevant when choosing whether to use the BOM or not. Both cases need or can ignore it equally. Therefore mentioning how hard that is in the context of not using a BOM is misleading.
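To make the comparison concrete, here is a minimal Python sketch of the three hypothetical functions named above (hasUTF8BOM, matchUTF8, and getLegacyEncoding are names invented for this argument, not real library calls; the legacy detector is only a stub):

```python
def has_utf8_bom(data: bytes) -> bool:
    """Version 1 check: the data starts with the UTF-8 encoding of U+FEFF."""
    return data.startswith(b'\xef\xbb\xbf')

def match_utf8(data: bytes) -> bool:
    """Version 2 check: the bytes form valid UTF-8 (pattern matching)."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

def get_legacy_encoding(data: bytes) -> str:
    """Stub for the 'complicated, error-prone, or slow' legacy detector."""
    return 'windows-1252'  # placeholder guess, not a real heuristic

def get_encoding_v1(data: bytes) -> str:
    # BOM-based: the legacy detector still runs whenever there is no BOM.
    return 'utf-8' if has_utf8_bom(data) else get_legacy_encoding(data)

def get_encoding_v2(data: bytes) -> str:
    # Pattern-based: the legacy detector runs in exactly the same case.
    return 'utf-8' if match_utf8(data) else get_legacy_encoding(data)
```

Both versions fall through to get_legacy_encoding() on the same branch, which is the point being made: the cost of distinguishing legacy encodings is identical whichever UTF-8 test is used.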

Spitzak (talk) 01:16, 21 May 2015 (UTC)[reply]

There isn’t anything wrong with your logic. Again, it’s about your interpretation of the article’s meaning and intent. The complexity of distinguishing legacy encodings is relevant because many programmers will choose a library that does not look specifically for UTF-8 indicators first but instead acts as a general detector for character sets. Not only that, even if the programmer does the right thing “herself” for detecting encoding, a choice she makes about encoding affects systems that she has no control over: that text will end up in places where her choice of encoding suffers because the downstream system uses a slow, buggy character encoding detector—or is a Microsoft product. The article helpfully points out that UTF-8 is easier to detect, which you could read as a hint that the programmer should try to detect UTF-8 first if the problem domain is likely to be Unicode. But you seem to be so obsessed with the horrifying thought that someone might treat UTF-8 as a peer to other encodings that you interpret the article as, “If you don’t use a BOM then you’re stuck dealing with the messy legacy encoding problem.” Well, in a sense you are stuck with such problems, because you have no control over how others will handle your text. Meanwhile if you choose to use a BOM instead, your text just works with Microsoft products, and presumably any character encoding detection system will recognize the encoding reliably as well. But the article emphasizes neither BOM nor BOM-less. It is neutral and merely points out a few facts about the consequences of choice. As it should. Strebe (talk) 08:45, 22 May 2015 (UTC)[reply]

Huh?

[edit]

Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).
The italicized sentence here is not at all clear to me. Is it saying that a program expecting a UTF-8 file will display a UTF-16 file as garbage outside of the parts of the file that are ASCII only? Why doesn't it just say that? I first interpreted it as saying the program would display a file beginning with a UTF-16 BOM as garbage because of the BOM, which is what it might be taken to mean from the context, and which is obviously silly because of the way UTF-8 resyncs. Also, saying that the display of an ASCII-only file in this case is fairly readable is a bit of a stretch; for instance, is the display of a highlighted NUL between each pair of characters fairly readable?
In fact, if this sentence is talking about how a program expecting UTF-8 displays UTF-16 in general, why is it in the article on BOM anyway without some clarification about what this has to do with BOM? 2601:646:8D01:8A90:29F4:44B8:5515:B1EE (talk) 07:25, 3 October 2015 (UTC)[reply]

The italicized text is just some random observation by some random editor and doesn't seem to have anything to do with BOM. Feel free to fix it. Strebe (talk) 07:47, 3 October 2015 (UTC)[reply]

UTF-32LE matching UTF-16LE

[edit]

I added a mention that the UTF-32LE BOM is the same byte pattern as a UTF-16LE BOM followed by a null (0) character. I thought this was interesting and indicates an example where blindly obeying the bit patterns does not work. Somebody else thought to add a lot of text which I think amounts to "text starting with null is very uncommon so this is not a problem". He seems to think it is because null is a string terminator in null-terminated strings, but actually that makes a leading null *MORE* common, since zero-length strings are probably by far the most common. The real reason this is not a problem is that UTF-32 is so very rare that there is no reason to test for it.
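The ambiguity is easy to demonstrate with Python's built-in codecs: the four bytes FF FE 00 00 are a complete UTF-32LE BOM, but they are equally a UTF-16LE BOM followed by an encoded U+0000:

```python
ambiguous = b'\xff\xfe\x00\x00'

# The BOM-aware UTF-16 codec sees FF FE (little-endian BOM) + 00 00 (a NUL).
assert ambiguous.decode('utf-16') == '\x00'

# The BOM-aware UTF-32 codec sees the same four bytes as just the BOM.
assert ambiguous.decode('utf-32') == ''
```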

I guess I'll just delete this, as the tiny anecdote has inflated to a mess of unreadable text. Spitzak (talk) 19:09, 27 January 2016 (UTC)[reply]

Repeated deletion of math showing chances of misidentifying another encoding as UTF-8

[edit]

This has been deleted as OR but it is actually based on several answers from Stack Exchange, which is not allowed as a source. It also has provable math statements in it. Strebe does not seem to understand it, making two mistakes: a 1/15 chance of an error is not an 85% chance of it being correct, it is 93.3%. And that is for ONE character; the chance of finding N valid multibyte characters in a row is (1/15)^N, which quickly becomes astronomically small. For instance the chance of finding 7 UTF-8 characters without first finding an invalid sequence is 1/170,859,375. This can be compared to the 1/16,777,216 chance of the first three bytes being the BOM, a chance that is assumed to be zero by defenders of the BOM method.
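The comparison above can be checked with exact arithmetic, taking the roughly 1/15 per-character figure derived elsewhere on this page as given:

```python
from fractions import Fraction

# Approximate chance that a random byte with the high bit set begins
# a valid multibyte UTF-8 sequence (derived in a later section).
per_char = Fraction(1, 15)

# Chance of 7 valid multibyte characters in a row in random data.
seven_chars = per_char ** 7            # 1/170,859,375

# Chance that 3 random bytes happen to equal the UTF-8 BOM EF BB BF.
bom_collision = Fraction(1, 256 ** 3)  # 1/16,777,216

# Misreading random data as 7 UTF-8 characters is the rarer event.
assert seven_chars < bom_collision
```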

This question of odds is asked quite a few times on the internet, though there are a lot of incorrect answers. I thought it would be useful to provide some kind of answer. Spitzak (talk) 22:44, 29 June 2017 (UTC)[reply]

'Consuming'

[edit]

A tiny issue of language. My amendment [9] was reverted with the comment 'As in, to “ingest” and serially “use up” the incoming stream. This is standard terminology.' As a native English speaker, I am confident my edit was an improvement. As a reader, I found the word 'consuming' for what a program does with an input string both jarring and jargon. It lacks clarity, and is not consistent with English usage. Detection of character encoding and byte order does not 'ingest', 'consume' or 'use up' anything. If this were standard terminology, why is there no common method consume()? Instead we have read() and open(), the standard analogy being with a book. That analogy is surely clearer to a reader. (Conversely 'emit' may be used occasionally in a computing context as a substitute for 'print', but not 'excrete' or 'egest'.) I'd not seen the word 'consume' in any Unicode documentation I've been reading recently. The first language of English Wikipedia is English. I'll reinstate the correction once and once only. --Cedderstk 06:43, 29 October 2018 (UTC)[reply]

In standard computer terms, reading implies only the transfer of data, e.g. from a file or network into memory. Consuming involves interpreting the data. Of course it is possible to do both in parallel, but the reading part is independent of the type of data (for example it would be the same whether it was text, an image, etc.) and then consuming depends on the type of data. If multiple types of data are accepted, the first step in consuming the data is often to identify the type of the data, and the next step would be to process it depending on the type that was identified. The article is really talking about this step of identifying the type of data, which is often the first step in consuming the data after it is read. The byte order mark tells it that it is text (and not, say, an image) and also the specific encoding of the text, so that it can then be processed/parsed. I didn't revert but I don't think "reading" or "accessing" is an improvement; both of those terms refer to transfer of raw data that occurs before it is consumed or interpreted, but the byte order mark is about interpreting the data that was read. -LiberatorG (talk) 15:44, 29 October 2018 (UTC)[reply]
How about "interpreting" then? "consuming" does have the problem that (at least for many readers) it implies the destruction of the original data. Spitzak (talk) 17:16, 29 October 2018 (UTC)[reply]
I have no issue with "interpreting". -LiberatorG (talk) 18:01, 29 October 2018 (UTC)[reply]

Odds of UTF-8 being in random byte stream

[edit]

There are 128 bytes with the high bit set, and the following bytes can have 256 values each. Therefore there are 128×256^(N−1) N-byte sequences starting with a byte with the high bit set.

If there are M characters encoded in UTF-8 using N bytes, the chance of a byte with the high bit set starting a valid N-byte character is M / (128×256^(N−1)).

  • For N=2, M is 0x800 - 0x80 = 0x780.
  • For N=3, M is 0x10000 - 0x800 = 0xF800 (I am allowing surrogate halves)
  • For N=4, M is 0x110000 - 0x10000 = 0x100000.

As the valid sequences are disjoint the odds of finding the different lengths can simply be added.

0x780/(128×256) + 0xF800/(128×256×256) + 0x100000/(128×256×256×256) = 0.0586 + 0.0075 + 0.00049 = 0.06665

This is really close to 1/15. In fact if you do this in integers it is 143130624/2147483648 which reduces to 273/4096, and 15*273 is 4095. Spitzak (talk) 23:42, 26 April 2019 (UTC)[reply]
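The arithmetic above is easy to verify exactly, for example with Python's fractions module:

```python
from fractions import Fraction

# Number of characters whose UTF-8 encoding is exactly N bytes long
# (overlong encodings excluded; surrogate halves allowed, as above).
M = {2: 0x800 - 0x80, 3: 0x10000 - 0x800, 4: 0x110000 - 0x10000}

# Sum over N of M / (128 * 256^(N-1)): the chance that a random byte
# with the high bit set begins a valid N-byte UTF-8 sequence.
p = sum(Fraction(M[n], 128 * 256 ** (n - 1)) for n in (2, 3, 4))

assert p == Fraction(273, 4096)  # = 143130624/2147483648, about 1/15
```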

I don’t understand your notation. In binary:
  • N = 2, M = 110xxxxx (and therefore 32 values)
  • N = 3, M = 1110xxxx (and therefore 16 values)
  • N = 4, M = 11110xxx (and therefore 8 values)
The hex numbers are how many characters are encoded with N bytes. For N=2 for instance the number of characters it could encode is 0x800 (2^11), but 0x80 of these are overlong encodings of 1-byte characters, so the total is 0x800 - 0x80 = 0x780.
Since you already stipulated the top bit being set, that’s (32+16+8)/128 probability of a random byte being a legitimate first byte of a multibyte UTF-8 character. However, for those to be legitimate, the following byte must be:
  • M = 10xxxxxx (and therefore 64/256 values)
And this is not an independent probability, since the preceding byte’s legitimacy as a first byte depends on it. Byte 3 and Byte 4 have similar requirements to Byte 2, but the results are dominated by Bytes 1 and 2 and so I didn’t bother with them. Also, not all of the xxx bits result in real characters, but again, most of that space is populated in the two-byte space, and so I ignore it.
And this is why these sorts of factoids need to be cited, not just generated. We shouldn’t even be having this conversation. Strebe (talk) 05:26, 27 April 2019 (UTC)[reply]
You are wrong. The byte stream is *RANDOM*. This means the value of byte 1 is independent of the value of byte 0, completely the opposite of what you said. You also added 5 invalid lead bytes. So a quick approximation like you are doing is (30+16+5)/128 * 64/256, which is close to 1/10, this is an overestimate as it counts many invalid 3 and 4 byte sequences. An underestimate would be to say that only 2-byte leads are valid, which gives a value close to 1/17. My math is in fact correct and gives a value near 1/15.
I don't think any citation is needed for obvious math. However what I wanted a citation for is some analysis for *real* text, or even real binary data, which is not random. I believe the odds of a byte having the high bit set is significantly less than 1/2 in real text, but at this point I do think a citation is needed.
I am unclear where you are getting your impression that the second byte somehow magically has greater odds of being correct, though I think the underlying misconception is why people insist on the BOM and can do such muddled thinking about UTF-8. It seems there is a chain of thought whereby just the existence of UTF-8 in the Universe somehow forces patterns that are invalid UTF-8 to either be physically impossible or (in your thinking) less likely than valid patterns, perhaps as some kind of quantum physics effect? Besides simple errors like yours of over-estimating the chances that a random stream will be valid UTF-8, this thinking has caused the far more serious problem of designs that assume turning an array of bytes into "Unicode" and then back again is a lossless operation, and thus storing strings as "Unicode" rather than UTF-8 internally is acceptable. Spitzak (talk) 20:02, 27 April 2019 (UTC)[reply]
There is a reason Wikipedia does not permit WP:OR, so stop trying to engage in WP:OR and stop telling people what is going on in their own heads and fantasizing that you are the only person in the world who understands these things. I’m not interested. 1:10 versus 1:15 is meaningless, and only serves to demonstrate your obsession with this topic that has incited, over the span of many years, a long history of poor edits under the misconception that someone might construe the article to be advocating using the BOM. It doesn’t; it never has; and it’s just boggling that you cannot move on with your life. The world doesn’t work anything like your rant implies. Strebe (talk) 21:45, 27 April 2019 (UTC)[reply]

"the text stream's encoding is Unicode"

[edit]

Unicode is not an encoding. It is a charset and provides several encodings. Thus, it should either read "the text stream's character set is Unicode" or "the text stream has a Unicode encoding". --Meillo (talk) 18:56, 31 January 2021 (UTC)[reply]

I don’t think anyone is confused by the present verbiage, which is merely shorthand for “the text stream’s encoding is one of Unicode’s encoded forms”. Unicode isn’t a “charset”, either; Unicode is a standard that defines a character repertoire, a code point for each abstract character in that repertoire, several encoded forms for the set of code points, and a lot of rules and recommendations. I don’t think either of your proposed changes is satisfactory. In the first case, Unicode does not use the term “character set” because the term is not well defined. In the second case, almost any text stream “has a Unicode encoding”; i.e., can be represented in a Unicode encoding by some transformation. Strebe (talk) 00:40, 1 February 2021 (UTC)[reply]
So, in the end, the presence of a BOM only really tells that there (most likely) is some Unicode stuff involved? ;-) I really would like to get rid of the words "the encoding is Unicode". There is so much confusion in the whole topic, and wordings like this one add to it. I like the long form (“the text stream’s encoding is one of Unicode’s encoded forms”) much more, as it provides more clarity. But actually, is this really what the bullet point wants to point out? In relation to the third bullet point ("Which Unicode character encoding is used."), the second one maybe should rather be: "The fact that the text stream is Unicode, to a high level of confidence;" (i.e. omitting "'s encoding"). --Meillo (talk) 03:07, 1 February 2021 (UTC)[reply]
I’m fine with “the text stream is Unicode”. Yes, the BOM just says Unicode is probably involved. Strebe (talk) 04:50, 1 February 2021 (UTC)[reply]