Talk:Extended ASCII

Comment

Shouldn't the Unicode input methods have a page of their own? Or should they be made an item in Unicode? — Hhielscher


moved it to Unicode, as (Unicode in use).(Input methods). --Mac-arena the Bored Zo 15:21, 2004 Dec 28 (UTC)

520256644 identified as vandalism

Compatibility with UTF-8

The final sentence of this article states:

A computer language that supports Extended ASCII can also support UTF-8 without any changes; this was a major factor in UTF-8's popularity.


Which, I suppose, is technically true. A few years back I had the unpleasant task of converting a large enterprise system from ISO 8859-1 to UTF-8 (although in practice it was Windows 1252, since those additional characters were present despite the database being declared as 8859-1). It was difficult, time-consuming, and expensive.

It is true that none of the computer languages needed to be changed. C, C++, C#, Java, Javascript, ASP, SQL, Pl-SQL, etc. needed no modifications. But every single "extended" character that took up one byte in 8859 needed two bytes in UTF-8, causing all sorts of sizing issues. The sentence above could be very misleading - I can imagine one of my managers (who has never coded anything themselves) reading this and thinking the systems are backward compatible. Perhaps if there were a cite for it we could expand or clarify the sentence. As it stands, the best thing would be just to remove it.

Mr. Swordfish (talk) 13:16, 18 September 2023 (UTC)[reply]

Yes, I agree, just delete it.
TBH, I came very close to deleting the whole section as I can see no redeeming features. For now, I've tagged it as WP:OR, but unless someone does a major cleanup and sourcing job on it real soon, off with its head. --𝕁𝕄𝔽 (talk) 13:46, 18 September 2023 (UTC)[reply]
Thanks. I'm new to this page so I didn't want to charge in and make sweeping changes without asking first. I've removed the sentence in question.
As for the rest of the section, I don't think it adds much, and if your C or C++ code uses any fixed-length character arrays, there's a lot of maintenance coding to deal with single-byte characters turning into multi-byte characters when converting from 8859 to UTF-8, in sharp contrast to the assertion of little extra programming effort.
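To make the sizing issue concrete, here's a minimal C sketch (the strings are illustrative, not from the system I worked on): a buffer sized for Latin-1 text no longer holds the same text once it's re-encoded as UTF-8.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "déjà" is 4 characters: 4 bytes in ISO 8859-1 but 6 bytes in
           UTF-8, because 'é' and 'à' each become a two-byte sequence. */
        const char latin1[] = "d\xE9j\xE0";          /* 4 bytes + NUL */
        const char utf8[]   = "d\xC3\xA9j\xC3\xA0";  /* 6 bytes + NUL */
        char field[5];                 /* sized for the Latin-1 form */

        printf("latin1: %zu bytes, utf8: %zu bytes\n",
               strlen(latin1), strlen(utf8));        /* prints 4 and 6 */
        /* strcpy(field, utf8) would overflow 'field' by two bytes. */
        (void)field;
        return 0;
    }

Every fixed-size char array, database column, and length check built around the one-byte assumption has to be audited the same way.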
I'd say just delete the section. Mr. Swordfish (talk) 15:21, 18 September 2023 (UTC)[reply]
That said, a short treatment of how 8859 is basically not compatible with UTF-8 might be worth including if someone wants to write it. Mr. Swordfish (talk) 15:24, 18 September 2023 (UTC)[reply]
As you stated, computer languages did not have to change. This is a big deal, making switching to UTF-8 from extended ascii much easier than other possible switching. Also, even at that time there was lots of software that only dealt with character strings, not individual characters, and that also needed no changes. Spitzak (talk) 00:39, 19 September 2023 (UTC)[reply]
The first question that comes to mind is "what does it mean for a computer language to "support" extended ASCII or UTF-8?" Does it mean:
  • extended-ASCII comments will not be rejected by programs that process that language?
  • character string constants in the language can contain extended ASCII, and octets in the string that aren't ASCII characters will be inserted into the string as is?
  • identifiers in the language can contain extended ASCII characters?
  • the language's support for character strings handles strings containing extended-ASCII characters?
  • Something else? Guy Harris (talk) 00:58, 19 September 2023 (UTC)[reply]
It means that strings can contain all byte values with the high bit set, and printing the string prints the same byte with the high bit set that is in the source code. Spitzak (talk) 04:28, 19 September 2023 (UTC)[reply]
Does it also mean that, for example, if the language offers a "convert string to lower case" operation (either as a library routine or as something defined in the language's grammar), it will properly convert strings if the encoding is known, and that other string-processing operations deal with all supported encodings, including multi-byte ones? If not, then you don't get full support for non-ASCII text for free.
(And there's the separate question of whether the compiler, if it indicates errors in the source code with, for example, a ^ or characters pointing to the error, correctly understands that, even with a fixed-width character display, there isn't a one-to-one correspondence between octets and character positions.) Guy Harris (talk) 20:23, 19 September 2023 (UTC)[reply]
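A small C illustration of Guy's point, assuming the default "C" locale (the function name is made up): byte-wise case conversion is "support" only in the pass-through sense, because tolower() leaves the bytes of a multi-byte UTF-8 character untouched.

    #include <ctype.h>
    #include <stdio.h>

    /* Byte-wise lowercasing: correct for ASCII, but in the "C" locale
       tolower() leaves bytes >= 0x80 alone, so multi-byte UTF-8
       characters are silently not converted. */
    static void bytewise_tolower(char *s) {
        for (; *s; s++)
            *s = (char)tolower((unsigned char)*s);
    }

    int main(void) {
        char a[] = "HELLO";
        char b[] = "\xC3\x89T\xC3\x89";   /* UTF-8 for "ÉTÉ" */
        bytewise_tolower(a);
        bytewise_tolower(b);
        printf("%s\n", a);   /* "hello" */
        printf("%s\n", b);   /* "ÉtÉ": only the ASCII 'T' changed */
        return 0;
    }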
>...switching to UTF-8 from extended ascii much easier than other possible switching...
Could you elaborate on what you mean by "other possible switching"?
Was anybody still using EBCDIC or any of the other proprietary character sets from the sixties by the time UTF-8 came along?
ASCII -> UTF-8 conversion is trivial since ASCII is identical to UTF-8 as long as only ASCII characters are used. 8859 -> UTF-8 is not trivial. Or easy. Mr. Swordfish (talk) 21:37, 19 September 2023 (UTC)[reply]
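For what it's worth, the byte-level re-encoding itself is mechanical; a hypothetical helper in C shows why every high-bit byte doubles in size, which is exactly where the sizing pain comes from.

    #include <stddef.h>

    /* Hypothetical helper: re-encode ISO 8859-1 as UTF-8. Latin-1 byte
       values equal Unicode code points U+0000..U+00FF, so each byte
       >= 0x80 becomes two UTF-8 bytes (lead byte 0xC2 or 0xC3 plus a
       continuation byte); 'out' must be up to twice the size of 'in'.
       Returns the number of bytes written. */
    size_t latin1_to_utf8(const unsigned char *in, size_t n,
                          unsigned char *out) {
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
            if (in[i] < 0x80) {
                out[j++] = in[i];                  /* ASCII passes through */
            } else {
                out[j++] = 0xC0 | (in[i] >> 6);    /* 0xC2 or 0xC3 */
                out[j++] = 0x80 | (in[i] & 0x3F);  /* continuation byte */
            }
        }
        return j;
    }

The loop is the trivial part; the non-trivial part is everything around it - the buffers, columns, and declared encodings that assumed one byte per character.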
The only thing that's "easy" is code that handles "extended ASCII" in the sense of "strings are a combination of ASCII characters and arbitrary uninterpreted bytes with the 8th bit set". Once you care what those 8th-bit-set bytes represent, you're dealing with the encoding, and you have to worry about the n in ISO 8859-n, at minimum. Maybe the locale makes that work if you're not doing anything too fancy. And if you have to worry about multi-byte character encodings, dealing with the encoding gets harder, as in "going from single-byte encodings to UTF-8 isn't trivial". Guy Harris (talk) 22:46, 19 September 2023 (UTC)[reply]
This isn't rocket science. Printf "works" in UTF-8 because it only looks for '%' characters in the string, which have the exact same byte value in both ASCII and UTF-8, and otherwise prints all the other bytes unchanged. If the thing it is printing on understands UTF-8, then UTF-8 in the printf string will be interpreted correctly. Obviously any code that actually cares about which non-ASCII characters are in use will need to be changed, but the VAST MAJORITY of code does not care and does not need to be changed! Spitzak (talk) 22:59, 19 September 2023 (UTC)[reply]
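A sketch of why the printf case is safe (the example string is hypothetical): '%' is 0x25, an ASCII byte, and every byte of a multi-byte UTF-8 sequence has its high bit set, so a format scanner looking only for '%' can never land inside a non-ASCII character.

    #include <stdio.h>

    int main(void) {
        /* '%' is 0x25. All bytes of a multi-byte UTF-8 sequence are
           >= 0x80, so scanning for '%' never splits or misreads a
           non-ASCII character; the other bytes pass through unchanged. */
        const char *s = "caf\xC3\xA9";   /* UTF-8 for "café" */
        printf("order: %s (%d cups)\n", s, 2);
        return 0;
    }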
Right. Not rocket science. All you have to do is examine every character field in your thousands of database tables, look at how many bytes are allocated, look at the several million records that use those fields and see which ones are not going to fit anymore when you convert to a multi-byte character set.
Then, look at all the code, which might include Java, C, C++, C#, T-SQL, etc and make sure that there are no assumptions about string lengths that will blow up when the strings get longer due to multi-byte encoding.
And if you have web forms, or some other UI that exposes textboxen of a fixed length, they might need to be updated too.
Nothing hard, just time consuming and tedious.
All that said, the section in question seems to be WP:OR so let's either get some cites or nuke the section. Mr. Swordfish (talk) 23:21, 19 September 2023 (UTC)[reply]
It's not changing the encoding. The input is UTF-8 and the output is UTF-8, and it does not change size. If you are measuring the string as anything other than bytes then you have much more serious problems than dealing with encodings. Spitzak (talk) 23:39, 19 September 2023 (UTC)[reply]
Obviously if you change the encoding you need to change all the string constants to the new encoding. However at least you don't have to change the compiler, which is the whole point of this section! And if your code is such that changing the length of a string constant will cause it to not work, well all I can say is that I am sorry about your lack of programming skills. Spitzak (talk) 23:40, 19 September 2023 (UTC)[reply]
Instead of insulting my "lack of programming skills" you might try finding some sourcing for this section. Per Wikipedia policy, unsourced material gets removed. Mr. Swordfish (talk) 14:29, 20 September 2023 (UTC)[reply]

"Usage in computer-readable languages"

Ignoring for a moment the OR tag and the absence of any sourcing, the section "Usage in computer-readable languages" is a mess. I attempted to clean it up but on reflection concluded that it would be wiser to get a text teased out here in talk space first. This is as far as I've got with the first para:

For programming languages and document languages such as C and HTML, the principle of extended ASCII is important, since it enables many different encodings, and therefore many human languages, to be supported with little extra programming effort in the software that interprets the computer-readable files. Software can rely on all of the original ASCII standard bytes (first 128 bytes, codes 0x00 to 0x7F) to have the same meaning in all variants of extended ASCII; conversely, it must not assume or assign any meaning to the bytes with the high bit set (second 128 bytes, codes 0x80 to 0xFF), allowing them only in free-form text such as string constants and comments.

This is still not clear. A document written in Cyrillic, for example, will use many bytes from the top 128 (as well as a few from the base ASCII set). HTML syntax is in English, not just ASCII. This text is just confusing to the general reader; either it needs a lot of work or it should just be deleted.

The second para is barely literate:

Before extended ASCII became widely supported, lots of software would mangle non-ASCII text, most often by removing the high bit. Supporting extended ASCII forced the compilers to be fixed to preserve the bytes in the source unchanged. This has been a benefit for Unicode, as it is relatively easy to support UTF-8 in the same software.

What a mess.

  • "Lots of software"? "mangle non-ASCII text"?? (I think this means "would fail to process bytes in the 0x80 to 0xFF range".)
  • "Supporting extended ASCII forced the compilers to be fixed to preserve the bytes in the source unchanged." Nonsense: a compiler processes code written according to the language specification, which has standardised instructions. Syntax using any incorrect syntax is flagged as erroneous. Sorry, but this just reads as meaningless waffle. Does it have any redeeming features?

I propose that we delete the whole section and not waste any more time on it. --𝕁𝕄𝔽 (talk) 16:46, 20 September 2023 (UTC)[reply]

Deleted, with the fact that extended ascii did help UTF-8 moved to the intro section. Spitzak (talk) 19:01, 20 September 2023 (UTC)[reply]

Windows 1252 and the popularity thereof

I made some changes to the CP-1252 section, renaming it to comport with the main article's name, and removed some unsourced claims, including the assertion that it was once the most common character set used on the internet.

Was it? My recollection from the early 90s is that "in the beginning, there was ASCII", then sites started using 8859-1, and anybody with a Windows machine would happily input characters that were in 1252 but not 8859, and almost all software would happily pass along those bytes, with the UI sometimes knowing what to do with them and sometimes not.

Then the HTML 5 standard came out, which told the browsers "if a page declares 8859, treat it as if it were 1252", and a lot of those UI issues went away.

But did 8859/1252 ever become dominant? The evolution was ASCII -> 8859/1252 -> UTF-8, but it's unclear whether the middle set ever became more popular than both the others for a time. I was there for it, but didn't take notes and don't remember, and even if I did we'd still need sourcing.

If anybody knows, I'm curious. That said, historical facts like that would probably belong in the main article not this one. Good catch by Guy Harris. Mr. Swordfish (talk) 21:37, 20 September 2023 (UTC)[reply]

Just noticed the final sentence of the previous section:
ISO 8859-1 is the common 8-bit character encoding used by the X Window System, and most Internet standards used it before Unicode.
Do we have a cite for this? Seems related... Mr. Swordfish (talk) 21:42, 20 September 2023 (UTC)[reply]
Looking at the UTF-8 article, one of the sources is [1] which includes this handy chart: [2].
The rise of 8859/1252 coincided with the decline of ASCII and the rise of UTF-8. But at no time did 8859/1252 exceed either one. Looks like the three were about equal in 2008, with UTF-8 taking over afterwards. Mr. Swordfish (talk) 23:57, 22 September 2023 (UTC)[reply]

This page goes back to 2012:

https://w3techs.com/technologies/history_overview/character_encoding/ms/y

It shows that 1252 was used by about 19% of websites in 2012 (8859-1 and us-ascii were treated as 1252 per the HTML 5 standard). HTML 5 came out four years prior to this. I would assume that the share of single-byte implementations was higher then, but that's conjecture. We can probably say something like "Extended ASCII in the form of ISO 8859-1 and Windows-1252 was once common on the world wide web, but has been replaced by UTF-8 in almost all websites." Mr. Swordfish (talk) 02:38, 21 September 2023 (UTC)[reply]

All of them were replaced with UTF-8 eventually; there is nothing special about CP1252 here and it should not be mentioned. It is still the most-used character set after UTF-8. Also, the reference says that 8859-1 should be treated as CP1252 and says nothing about UTF-8, another reason not to mention UTF-8. I'm not sure if there is any real assumption that 8859-1 is used by X; a lot of X was designed long before it existed, so that seems a bit doubtful. Spitzak (talk) 00:26, 21 September 2023 (UTC)[reply]

There's nothing in the X Window system article about character sets that I can find. Seems to me that if it was important (or true) it would be covered by that article.
The next section (on Windows-1252) makes it clear that the 1252 extension of 8859-1 became the most widely used extended ASCII, so I don't think it's necessary to make the vague and unsupported assertion "most Internet standards used it before Unicode." I'm going to delete the sentence.
The third paragraph of the intro states that Unicode or UTF-8 has replaced 8859/1252 so we probably don't need to repeat that here. Mr. Swordfish (talk) 14:09, 22 September 2023 (UTC)[reply]
Actually it does look like X used 8859-1. https://www.cl.cam.ac.uk/~mgk25/ucs/keysymdef.h Not sure if this is very important though. Spitzak (talk) 17:14, 22 September 2023 (UTC)[reply]
It might be important enough to include in the X Window system article. I don't think it's important enough to include it here. Mr. Swordfish (talk) 18:44, 22 September 2023 (UTC)[reply]