Unicode Frequently Asked Questions

Private-Use Characters

Q: What are private-use characters?

Private-use characters are code points whose interpretation is not specified by a character encoding standard and whose use and interpretation may be determined by private agreement among cooperating users. Private-use characters are sometimes also referred to as user-defined characters (UDC) or vendor-defined characters (VDC).

Q: What are the ranges for private-use characters in Unicode?

There are three ranges of private-use characters in the standard. The main range in the BMP is U+E000..U+F8FF, containing 6,400 private-use characters. That range is often referred to as the Private Use Area (PUA). But there are also two large ranges of supplementary private-use characters, consisting of most of the code points on Planes 15 and 16: U+F0000..U+FFFFD and U+100000..U+10FFFD. Together those ranges allocate another 131,068 private-use characters. Altogether, then, there are 137,468 private-use characters in Unicode.
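
The arithmetic behind those counts can be checked with a short sketch. The names PRIVATE_USE_RANGES, is_private_use, and range_sizes are illustrative assumptions, not part of any Unicode API; only the range boundaries come from the answer above.

```python
# The three private-use ranges listed above (inclusive bounds).
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # supplementary private use, Plane 15
    (0x100000, 0x10FFFD),  # supplementary private use, Plane 16
]


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in one of the private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def range_sizes() -> list:
    """Return the number of code points in each private-use range."""
    return [hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES]


sizes = range_sizes()
print(sizes)       # [6400, 65534, 65534]
print(sum(sizes))  # 137468
```

The last two code points of Planes 15 and 16 are excluded from the ranges because they are noncharacters, which is why each supplementary range holds 65,534 code points rather than 65,536.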

Q: Why are there so many private-use characters in Unicode?

Unicode is a very large and inclusive character set, containing many more standardized characters than any of the legacy character encodings. Most users have little need for private-use characters, because the characters they need are already present in the standard.

However, some implementations, particularly those interoperating with East Asian legacy data, originally anticipated needing large numbers of private-use characters to enable round-trip conversion to private-use definitions in that data. In most cases, 6,400 private-use characters is more than enough, but there can be occasions when 6,400 does not suffice. Allocating a large number of private-use characters has the additional benefit of allowing implementations to choose ranges for their private-use characters that are less likely to conflict with ranges used by others.

The allocation of two entire additional planes full of private-use characters ensures that even the most extravagant implementation of private-use character definitions can be fully accommodated by Unicode.

Q: Will the number of private-use characters in Unicode ever change?

No. The set of private-use characters is formally immutable. This is guaranteed by a Unicode Stability Policy.

Q: What legacy character encodings also have private-use characters?

Private-use characters are most commonly used in East Asia, particularly in Japan, China, and Korea, to extend the available characters in various standards and vendor character sets. Typically, such characters have been used to add Han characters not included in the standard repertoire of the character set. Such non-standard Han character extensions are often referred to as "gaiji" in Japanese contexts.

Q: What is the purpose of private-use characters?

Private-use characters are used for interoperability with legacy CJK encodings. They can also be used for characters that may never get standard encodings, such as characters in a constructed artificial script (ConScript) which has no general community of use. Or a particular implementation may need to use private-use characters for specific internal purposes. Private-use characters are also useful for testing implementations of scripts or other sets of characters which may be proposed for encoding in a future version of Unicode.

Q: How can private-use characters be input?

Some input method editors (IME) allow customizations whereby an input sequence and resulting private-use character can be added to their internal dictionaries.

Q: How are private-use characters displayed?

With common font technologies such as OpenType and AAT, private-use characters can be added to fonts for display.

Q: What happens if definitions of private-use characters conflict?

The same code points in the PUA may be given different meanings in different contexts, since they are, after all, defined by users and are not standardized. For example, if text comes from a legacy NEC encoding in Japan, the same code point in the PUA may mean something entirely different if interpreted on a legacy Fujitsu machine, even though both systems would share the same private-use code points. For each given interpretation of a private-use character one would have to pick the appropriate IME, user dictionary and fonts to work with it.

Q: What about properties for private-use characters?

One should not expect the rest of an operating system to override the character properties for private-use characters, since private use characters can have different meanings, depending on how they originated. In terms of line breaking, case conversions, and other textual processes, private-use characters will typically be treated by the operating system as otherwise undistinguished letters (or ideographs) with no uppercase/lowercase distinctions.

Q: What does "private agreement among cooperating parties" mean?

A "private agreement" simply refers to the fact that agreement about the interpretation of some set of private-use characters is done privately, outside the context of the standard. The Unicode Standard does not specify any particular interpretation for any private-use character. There is no implication that a private agreement necessarily has any contractual or other legal status—it is simply an agreement between two or more parties about how a particular set of private-use characters should be interpreted.

Q: How would I define a private agreement?

One can share, or even publish, documentation containing particular assignments for private-use characters, their glyphs, and other relevant information about their interpretation. One can then ask others to use those private-use characters as documented. One can create appropriate fonts and IMEs, or request that others do so.

Noncharacters

Q: What are noncharacters?

A "noncharacter" is a code point that is permanently reserved in the Unicode Standard for internal use and that will never be assigned to an abstract character.

Q: How did noncharacters get that weird name?

Noncharacters are in a sense a kind of private-use character, because they are reserved for internal (private) use. However, that internal use is intended as a "super" private use, not normally interchanged with other users. Their allocation status in Unicode differs from that of ordinary private-use characters. They are considered unassigned to any abstract character, and they share the General_Category value Cn (Unassigned) with unassigned reserved code points in the standard. In this sense they are "less a character" than most characters in Unicode, and the moniker "noncharacter" seemed appropriate to the UTC to express that unique aspect of their identity.

In Unicode 1.0 the code points U+FFFE and U+FFFF were annotated in the code charts as "Not character codes" and instead of having actual names were labeled "NOT A CHARACTER". The term "noncharacter" in later versions of the standard evolved from those early annotations and labels.

Q: How many noncharacters does Unicode have?

Exactly 66.

Q: Which code points are noncharacters?

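The 66 noncharacters are allocated as follows: a contiguous range of 32 noncharacters, U+FDD0..U+FDEF, in the BMP; the last two code points of the BMP, U+FFFE and U+FFFF; and the last two code points of each of the 16 supplementary planes, from U+1FFFE and U+1FFFF through U+10FFFE and U+10FFFF.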

For convenient reference, the following table summarizes all of the noncharacters, showing their representations in UTF-32, UTF-16, and UTF-8. (In this table, "#" stands for either the hex digit "E" or "F".)

UTF-32     UTF-16       UTF-8
0000FDD0   FDD0         EF B7 90
...
0000FDEF   FDEF         EF B7 AF
0000FFF#   FFF#         EF BF B#
0001FFF#   D83F DFF#    F0 9F BF B#
0002FFF#   D87F DFF#    F0 AF BF B#
0003FFF#   D8BF DFF#    F0 BF BF B#
0004FFF#   D8FF DFF#    F1 8F BF B#
...
000FFFF#   DBBF DFF#    F3 BF BF B#
0010FFF#   DBFF DFF#    F4 8F BF B#
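
The rows of the table can be regenerated with a short sketch. The helper names noncharacters, is_noncharacter, and table_row are assumptions made for this illustration; the encodings come from Python's built-in codecs, which encode noncharacters like any other scalar value.

```python
def noncharacters() -> list:
    """Return all noncharacter code points in ascending order."""
    cps = list(range(0xFDD0, 0xFDF0))       # U+FDD0..U+FDEF
    for plane in range(17):                 # planes 0 through 16
        base = plane << 16
        cps += [base | 0xFFFE, base | 0xFFFF]
    return sorted(cps)


def is_noncharacter(cp: int) -> bool:
    """Test a code point arithmetically instead of by table lookup."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def table_row(cp: int) -> str:
    """Format one row: UTF-32 value, UTF-16 code units, UTF-8 bytes."""
    ch = chr(cp)
    utf32 = f"{cp:08X}"
    utf16 = ch.encode("utf-16-be").hex(" ", 2).upper()
    utf8 = ch.encode("utf-8").hex(" ").upper()
    return f"{utf32}  {utf16}  {utf8}"


print(len(noncharacters()))   # 66
print(table_row(0x1FFFE))     # 0001FFFE  D83F DFFE  F0 9F BF BE
```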

Q: Why are 32 of the noncharacters located in a block of Arabic characters?

The allocation of the range of noncharacters U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A block was mostly a matter of efficiency in the use of reserved code points in the rather fully-allocated BMP. The Arabic Presentation Forms-A block had a contiguous range of 32 unassigned code points, but as of 2001, when the need for more BMP noncharacters became apparent, it was already clear to the UTC that the encoding of many more Arabic presentation forms similar to those already in the Arabic Presentation Forms-A block would not be useful to anyone. Rather than designate an entirely new block for noncharacters, the unassigned range U+FDD0..U+FDEF was designated for them, instead.

Note that the range U+FDD0..U+FDEF for noncharacters is another example of why it is never safe to simply assume from the name of a block in the Unicode Standard that you know exactly what kinds of characters it contains. The identity of any character is determined by its actual properties in the Unicode Character Database. The noncharacter code points in the range U+FDD0..U+FDEF share none of their properties with other characters in the Arabic Presentation Forms-A block; they certainly are not Arabic script characters, for example.

Q: Will the set of noncharacters in Unicode ever change?

No. The set of noncharacters is formally immutable. This is guaranteed by a Unicode Stability Policy.

Q: Are noncharacters intended for interchange?

No. They are intended explicitly for internal use. For example, they might be used internally as a particular kind of object placeholder in a string. Or they might be used in a collation tailoring as a target for a weighting that comes between weights for "real" characters of different scripts, thus simplifying the support of "alphabetic index" implementations.

Q: Are noncharacters prohibited in interchange?

This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of "noncharacter" in the standard has always indicated that noncharacters "should never be interchanged." That led some people to assume that the definition actually meant "shall not be interchanged" and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended. The choice of the word "should" in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.

Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase "and that should never be interchanged" from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification starting with Unicode 7.0.

Q: Are noncharacters invalid in Unicode strings and UTFs?

Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.
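
As a minimal sketch of that requirement, the following converts a string containing several noncharacters from UTF-8 to UTF-16 to UTF-32 and checks that nothing was lost. SAMPLE and transcode are names invented for this example; the behavior shown is that of Python's standard codecs.

```python
# A sample string mixing ordinary letters with four noncharacters.
SAMPLE = "a\uFDD0b\uFFFE\U0001FFFFc\U0010FFFF"


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert encoded bytes from one UTF to another without filtering."""
    return data.decode(src).encode(dst)


utf8 = SAMPLE.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-be")
restored = utf32.decode("utf-32-be")

print(restored == SAMPLE)  # True
```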

Q: So how should libraries and tools handle noncharacters?

Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.

Q: If my application makes specific, internal use of a noncharacter, what should I do with input text?

In cases where the input text cannot be guaranteed to use the same interpretation for the noncharacter as your program does, and the presence of that noncharacter would cause internal problems, it is best practice to replace that particular noncharacter on input by U+FFFD. Of course, such behavior should be clearly documented, so that external clients know what to expect.
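
A minimal sketch of that practice follows. It assumes a hypothetical program that reserves U+FDD0 as an internal placeholder; the choice of U+FDD0 and the names INTERNAL_PLACEHOLDER and sanitize_input are illustrative only.

```python
# Hypothetical: this program uses U+FDD0 internally as an object placeholder.
INTERNAL_PLACEHOLDER = "\uFDD0"
REPLACEMENT_CHARACTER = "\uFFFD"


def sanitize_input(text: str) -> str:
    """Replace only the noncharacter this program reserves for itself.

    Other noncharacters are passed through untouched, because this
    program has no conflicting internal use for them.
    """
    return text.replace(INTERNAL_PLACEHOLDER, REPLACEMENT_CHARACTER)


print(sanitize_input("x\uFDD0y\uFFFF") == "x\uFFFDy\uFFFF")  # True
```

Replacing one code point with one code point keeps string lengths and offsets stable, which deletion would not.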

Q: What should I do if downstream clients depend on noncharacters being passed through by my module?

In such a case, your module may need to use a more complicated mechanism to preserve noncharacters for pass through, while not interfering with their specific internal use. This behavior will prevent your downstream clients from breaking, at the cost of making your processing marginally more complex. However, because of this additional complexity, if you anticipate that a future version of your module may not pass through one or more noncharacters, it is best practice to document the reservation of those values from the start. In that way, any downstream client using your module can have clearly specified expectations regarding which noncharacter values your module may replace.

Q: Can failing to replace noncharacters with U+FFFD lead to problems?

If your implementation has no conflicting internal definition and use for the particular noncharacter in question, it is usually harmless to just leave noncharacters in the text stream. They definitely will not be displayable and might break up text units or create other "funny" effects in text, but these results are typically the same as could be expected for an uninterpreted private-use character or even a normal assigned character for which no display glyph is available.

Q: Can noncharacters simply be deleted from input text?

No. Doing so can lead to security problems. For more information, see Unicode Technical Report #36, Unicode Security Considerations.
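
One way such a problem can arise is sketched below. The scenario is an assumption made for illustration: a toy filter named is_blocked runs first, and a later processing step deletes noncharacters. Deletion joins the text on either side and re-creates a token the filter was meant to catch, whereas replacement by U+FFFD does not.

```python
NONCHARACTER = "\uFFFF"


def is_blocked(text: str) -> bool:
    """Toy filter (hypothetical): reject text containing '<script'."""
    return "<script" in text.lower()


def delete_noncharacters(text: str) -> str:
    """Unsafe: removing the code point joins the text on either side."""
    return text.replace(NONCHARACTER, "")


def replace_noncharacters(text: str) -> str:
    """Safer: substitute U+FFFD so the surrounding text stays separated."""
    return text.replace(NONCHARACTER, "\uFFFD")


payload = "<scr\uFFFFipt>"
print(is_blocked(payload))                          # False: the filter lets it pass
print(is_blocked(delete_noncharacters(payload)))    # True: deletion rebuilt "<script"
print(is_blocked(replace_noncharacters(payload)))   # False
```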

Q: Can you summarize the basic differences between private-use characters and noncharacters?

Private-use characters do not have any meanings assigned by the Unicode Standard, but are intended to be interchanged among cooperating parties who share conventions about what the private-use characters mean. Typically, sharing those conventions means that there will also be some kind of public documentation about such use: for example, a website listing a table of interpretations for certain ranges of private-use characters. As an example, see the ConScript Unicode Registry (a private group unaffiliated with the Unicode Consortium), which has extensive tables listing private-use character definitions for various unencoded scripts. Or such public documentation might consist of the specification of all the glyphs in a font distributed for the purpose of displaying certain ranges of private-use characters. Of course, a group of cooperating users who have a private agreement about the interpretation of some private-use characters is under no obligation to publish the details of their agreement.

Noncharacters also do not have any meanings assigned by the Unicode Standard, but unlike private-use characters, they are intended only for internal use, and are not intended for interchange. Typically, there will be no public documentation available about their use in particular instances, and fonts typically do not have glyphs for them.

Noncharacters and private-use characters also differ significantly in their default Unicode character property values.

Code Point Type   Use Type               Properties
noncharacter      private, internal      gc=Cn, bc=BN, eaw=N
private use       private, interchange   gc=Co, bc=L, eaw=A
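
The General_Category difference in the table can be observed with Python's standard unicodedata module. This sketch checks only gc; the bc and eaw values are left out because a library may not report default property values for unassigned code points. The helper name general_category is an assumption for this example.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Return the General_Category value Python reports for a code point."""
    return unicodedata.category(chr(cp))


print(general_category(0xFDD0))    # Cn (noncharacter)
print(general_category(0x10FFFF))  # Cn (noncharacter)
print(general_category(0xE000))    # Co (private use)
print(general_category(0xF0000))   # Co (private use)
```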

Sentinels

Q: What is a sentinel?

A sentinel is a special numeric value typically used to signal an edge condition of some sort. For text, in particular, sentinels are values stored with text but which are not interpreted as part of the text, and which indicate some special status. For example, a null byte is used as a sentinel in C strings to mark the end of the string.

Q: Is it safe to use a noncharacter as an end-of-string sentinel?

It is not recommended. The use of any Unicode code point U+0000..U+10FFFF as a sentinel value (such as "end of text" in APIs) can cause problems when that code point actually occurs in the text. It is preferable to use a true out-of-range value, for example -1. This is parallel to the use of -1 as the sentinel end-of-file (EOF) value in the standard C library, and is easy and fast to test for in code with a (result < 0) check. Alternatively, a clearly out-of-range positive value such as 0x7FFFFFFF could also be used as a sentinel value.
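
A minimal sketch of the out-of-range approach follows. The class CodePointReader and the constant END_OF_TEXT are hypothetical names for this illustration. Because the sentinel is negative, text containing U+0000 or U+FFFF is read without any ambiguity.

```python
END_OF_TEXT = -1  # out of range: no code point is negative


class CodePointReader:
    """Hand out code points one at a time, then END_OF_TEXT."""

    def __init__(self, text: str):
        self._text = text
        self._pos = 0

    def next(self) -> int:
        """Return the next code point, or END_OF_TEXT when exhausted."""
        if self._pos >= len(self._text):
            return END_OF_TEXT
        cp = ord(self._text[self._pos])
        self._pos += 1
        return cp


def read_all(text: str) -> list:
    """Collect every code point, stopping on the (result < 0) check."""
    reader = CodePointReader(text)
    out = []
    while (cp := reader.next()) >= 0:
        out.append(cp)
    return out


print(read_all("a\uFFFF\x00b") == [0x61, 0xFFFF, 0x00, 0x62])  # True
```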

Q: How about using NULL as an end-of-string sentinel?

When using UTF-8 in C strings, implementations follow the same conventions they would for any legacy 8-bit character encoding in C strings. The byte 0x00 marks the end of the string, consistent with the C standard. Because the byte 0x00 in UTF-8 also represents U+0000 NULL, a UTF-8 C string cannot have a NULL in its contents. This is precisely the same issue as for using C strings with ASCII. In fact, an ASCII C string is formally indistinguishable from a UTF-8 C string with the same character content.

It is also quite common for implementations which handle both UTF-8 and UTF-16 data to implement 16-bit string handling analogously to C strings, using 0x0000 as a 16-bit sentinel to indicate end of string for a 16-bit Unicode string. The rationale for this approach and the associated problems completely parallel those for UTF-8 C strings.
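
The UTF-8 case can be observed from Python through the standard ctypes module, which exposes both the raw buffer and the C-string view of it. The function name c_string_views is an assumption for this sketch.

```python
import ctypes


def c_string_views(text: str):
    """Return (all stored bytes, bytes visible to C string functions)."""
    data = text.encode("utf-8")              # U+0000 becomes the single byte 0x00
    buf = ctypes.create_string_buffer(data)  # copies data, appends a terminating 0x00
    return buf.raw, buf.value                # .value stops at the first 0x00


raw, visible = c_string_views("ab\u0000cd")
print(raw)      # b'ab\x00cd\x00'
print(visible)  # b'ab'
```

Everything after the embedded U+0000 is still in memory but is invisible to code that follows the C string convention.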

Q: The Unicode Standard talks about U+FEFF BYTE ORDER MARK (BOM) being a signature. Is that the same as a sentinel?

No. A signature is a defined sequence of bytes used to identify an object. In the case of Unicode text, certain encoding schemes use specific initial byte sequences to identify the byte order of a Unicode text stream. See the BOM FAQ entries for more details.

Q: Why is the byte-swapped BOM (U+FFFE) a noncharacter?

U+FFFE was designated as a noncharacter to make it unlikely that normal, interchanged text would begin with U+FFFE. The occurrence of U+FFFE as the initial character as part of text has the potential to confuse applications testing for the two initial signature bytes <FE FF ...> or <FF FE ...> of a byte stream labeled as using the UTF-16 encoding scheme. That can interfere with checking for the presence of a BOM which would indicate big-endian or little-endian order.
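
The confusion can be sketched with a simple signature check. The function detect_utf16_byte_order is a hypothetical helper, not a standard API; it looks only at the first two bytes, as described above.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess byte order from the two initial signature bytes, if any."""
    if data[:2] == b"\xFE\xFF":
        return "big-endian"
    if data[:2] == b"\xFF\xFE":
        return "little-endian"
    return "unknown"


# Text that starts with a real BOM is identified correctly.
print(detect_utf16_byte_order("\uFEFFhi".encode("utf-16-be")))  # big-endian
print(detect_utf16_byte_order("\uFEFFhi".encode("utf-16-le")))  # little-endian

# Big-endian text that begins with U+FFFE and has no BOM is misread.
print(detect_utf16_byte_order("\uFFFEhi".encode("utf-16-be")))  # little-endian
```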

Q: Are U+FFFE and U+FFFF illegal in Unicode?

U+FFFE and U+FFFF are noncharacters just like the other 64 noncharacters in the standard. Implementers should be aware of the fact that all noncharacters in the standard are also valid in Unicode strings, must be converted between UTFs, and may be encountered in Unicode strings.

Q: Is it possible to use U+FFFF or U+10FFFF as sentinels?

Because U+FFFF and U+10FFFF are noncharacters, nothing would prohibit a privately-defined internal use of either of them as a sentinel: both have interesting numerical properties which render them likely choices for internal use as sentinels. However, such use is problematic in the same way that use of any valid character as a sentinel can be problematic. They are valid elements of Unicode strings and they may be encountered in Unicode data, not necessarily used with the same interpretation as for one's own sentinel use.

Q: How did the current status of U+FFFE and U+FFFF evolve?

In the days of Unicode 1.0 [1991], when the standard was still architected as a pure 16-bit character encoding, before the invention of UTF-16 and supplementary characters, U+FFFE and U+FFFF did have an unusual status. The code charts were printed omitting the last two code points altogether, and in the names list, the code points U+FFFE and U+FFFF were labeled "NOT A CHARACTER". They were also annotated with notes like, "the value FFFF is guaranteed not to be a Unicode character at all". Section 2.3, p. 14 of Unicode 1.0 contains the statement, "U+FFFE and U+FFFF are reserved and should not be transmitted or stored," so it is clear that Unicode 1.0 intended that those values would not occur in Unicode strings. The block description for the Specials Block in Unicode 1.0 contained the following information:

U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value, and should be taken as a signal that Unicode characters should be byte-swapped before interpretation. U+FFFE should only be interpreted as an incorrectly byte-swapped version of U+FEFF.

U+FFFF. The 16-bit unsigned hexadecimal value U+FFFF is not a Unicode character value, and can be used by an application as a [sic] error code or other non-character value. The specific interpretation of U+FFFF is not defined by the Unicode standard, so it can be viewed as a kind of private-use non-character.

It should be apparent that U+FFFF in Unicode 1.0 was the prototype for what later became noncharacters in the standard—both in terms of how it was labeled and how its function was described.

Unicode 2.0 [1996] formally changed the architecture of Unicode, as a result of the merger with ISO/IEC 10646-1:1993 and the introduction of UTF-16 and UTF-8 (both dating from Unicode 1.1 times [1993]). However, both Unicode 2.0 and Unicode 3.0 effectively were still 16-bit standards, because no characters had been encoded beyond the BMP, and because implementations were still mostly treating the standard as a de facto fixed-width 16-bit encoding.

The conformance wording about U+FFFE and U+FFFF changed somewhat in Unicode 2.0, but these were still the only two code points with this unique status, and there were no other "noncharacters" in the standard. The code charts switched to the current convention of showing what we now know as "noncharacters" with black cells in the code charts, rather than omitting the code points altogether. The names list annotations were unchanged from Unicode 1.0, and the Specials Block description text was essentially unchanged as well. Unicode 3.0 introduced the term "noncharacter" to describe U+FFFE and U+FFFF, not as a formal definition, but simply as a subhead in the text.

The Chapter 2 language in Unicode 2.0 dropped the explicit prohibition against transmission or storage of U+FFFE and U+FFFF, but instead added the language, "U+FFFF is reserved for private program use as a sentinel or other signal." That statement effectively blessed existing practice for Unicode 2.0 (and 3.0), where 16-bit implementations were taking advantage of the fact that the very last code point in the BMP was reserved and conveniently could also be interpreted as a (signed) 16-bit value of -1, to use it as a sentinel value in some string processing.
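
The numerical coincidence mentioned there is easy to verify. The helper as_signed_16 is an illustrative name; it reinterprets the same 16 bits as a signed integer using the standard struct module.

```python
import struct


def as_signed_16(value: int) -> int:
    """Reinterpret an unsigned 16-bit value as a signed 16-bit integer."""
    return struct.unpack("<h", struct.pack("<H", value))[0]


print(as_signed_16(0xFFFF))  # -1
print(as_signed_16(0xFFFE))  # -2
```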

Unicode 3.0 [1999] formalized the definition of "transformations", now more widely referred to as UTFs. And there was one very important addition to the text which makes it clear that U+FFFE and U+FFFF still had a special status and were not considered "valid" Unicode characters. Chapter 3, p. 46 included the language:

To ensure that round-trip transcoding is possible, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE₁₆, FFFF₁₆, and unpaired surrogates.

That initial formulation of UTF mapping was erroneous. A lot of work was done to correct and clarify the concepts of encoding forms and UTF mapping in the versions immediately following Unicode 3.0, to correct various defects in the specification.

Unicode 3.1 [2001] was the watershed for the development of noncharacters in the standard. Unicode 3.1 was the first version to add supplementary characters to the standard. As a result, it also had to come to grips with the fact that ISO/IEC 10646-2:2001 had reserved the last two code points for every plane as "not a character", despite the fact that their code point values shared nothing with the rationale for reserving U+FFFE and U+FFFF when the entire codespace was just 16 bits.

The Unicode 3.1 text formally defined noncharacters, and also designated the code point range U+FDD0..U+FDEF as noncharacters, resulting in the 66 noncharacters defined in the standard.

Unicode 4.0 [2003] finally corrected the statement about mapping noncharacters and surrogate code points:

To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.

That correction results in the current situation for Unicode, where noncharacters are valid Unicode scalar values, are valid in Unicode strings, and must be mapped through UTFs, whereas surrogate code points are not valid Unicode scalar values, are not valid in UTFs, and cannot be mapped through UTFs.
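
Python's built-in codecs happen to follow the same split, which makes the contrast easy to see. This sketch shows the behavior of one implementation, not a conformance test; can_encode is a name made up for the example.

```python
def can_encode(cp: int, encoding: str = "utf-8") -> bool:
    """Report whether Python's codec will map this code point through a UTF."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(0xFFFF))    # True: a noncharacter is a Unicode scalar value
print(can_encode(0x10FFFE))  # True
print(can_encode(0xD800))    # False: a surrogate code point is not
```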

Unicode 4.0 also added an entire new informative section about noncharacters, which recommended the use of U+FFFF and U+10FFFF "for internal purposes as sentinels." That new text also stated that "[noncharacters] are forbidden for use in open interchange of Unicode text data," a claim which was stronger than the formal definition. And it made a contrast between noncharacters and "valid character value[s]", implying that noncharacters were not valid. Of course, noncharacters could not be interpreted in open interchange, but the text in this section had not really caught up with the implications of the change of wording in the conformance requirements for UTFs. The text still echoed the sense of "invalid" associated with noncharacters in Unicode 3.0.

Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.
