UTF and 朱
05-02-2012, 08:14 AM (This post was last modified: 05-02-2012 08:19 AM by RobertShiplett.)
In my Henshall book and applet, Kanji 1346, 朱, means vermilion or cinnabar red. So how is that character represented on the web?

In Curl we can write the 朱 character as '\u6731' (four hex digits, the UTF-16 form) or as '\U00006731' (eight hex digits, the UTF-32 form).

You can test these as

{text \u6731 \U00006731 } || note that no single quotes are needed in the {text } macro

inside any example code area in the Curl docs viewer to see the lovely character twice.

The HTML entity (in decimal) is & #26417; (26417 decimal is equal to the x6731 below)
The HTML entity (in hexadecimal) is & #x6731; (without the space between & and # in both cases)
But the URL escape code is %E6%9C%B1
because the UTF-8 code sequence is 0xE6 0x9C 0xB1 (which is the E6-9C-B1 above)
and our \u6731 is the UTF-16BE 0x6731 (written big-endian here, but your architecture may store words little-endian ...)
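
The forms listed above can be cross-checked with a short script. This is only a sketch in Python rather than Curl (the variable names are mine, not from any Curl API), using the standard library's `urllib.parse.quote` for the URL escape:

```python
from urllib.parse import quote

ch = "\u6731"   # Python accepts the same \u escape used in the Curl example
cp = ord(ch)    # the Unicode code point as an integer

decimal_entity = f"&#{cp};"     # decimal numeric reference
hex_entity = f"&#x{cp:X};"      # hexadecimal numeric reference
url_escape = quote(ch)          # percent-encodes the UTF-8 bytes
utf8_hex = ch.encode("utf-8").hex()
utf16be_hex = ch.encode("utf-16-be").hex()
utf16le_hex = ch.encode("utf-16-le").hex()

print(decimal_entity)   # &#26417;
print(hex_entity)       # &#x6731;
print(url_escape)       # %E6%9C%B1
print(utf8_hex)         # e69cb1
print(utf16be_hex)      # 6731
print(utf16le_hex)      # 3167
```

The last two lines show the same 16-bit value in both byte orders, which is the big-endian versus little-endian point made above.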

The CJK characters are among those whose UTF-8 lead byte has the top bits 1110, that is, a byte from hex E0 through EF.

Our xE6 is binary 11100110. The three leading 1 bits, followed by a 0, say that 3 bytes form this codepoint.
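
Reading the sequence length off the lead byte can be sketched like this, again in Python rather than Curl. The function name `utf8_sequence_length` is hypothetical, and the check looks only at the bit pattern; it does not reject lead bytes such as 0xC0 or 0xF5 that a strict decoder would refuse:

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its lead byte."""
    if lead & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:   # 1110xxxx, i.e. E0 through EF
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, never a lead byte
    raise ValueError(f"0x{lead:02X} is not a UTF-8 lead byte")

print(utf8_sequence_length(0xE6))   # 3
```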

What we see here is xE6 and then the 2 bytes 9C and B1.

UTF itself stands for Unicode Transformation Format. The transform rule says that every 1110bbbb byte must be followed by two 10bbbbbb bytes.

Ours is followed by 9C or 10011100, which is in turn followed by B1 or 10110001.

The UTF algorithm for the UTF-8 to UTF-16BE transform converts E6-9C-B1 only and always to x6731, and vice versa for the reverse transformation
(and again, I say BE only because I am writing the hex big-to-small just as we write 2012 or The Answer (which is, of course, 0042 in the end!)).
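
The transform for our three bytes can be done by hand with a few masks and shifts. This is a minimal sketch in Python, not Curl; `decode_three_byte` is a hypothetical helper that handles only the three-byte form and does not check for overlong encodings or surrogates:

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode one 3-byte UTF-8 sequence (1110zzzz 10yyyyyy 10xxxxxx)."""
    if b0 & 0xF0 != 0xE0:
        raise ValueError("lead byte is not of the form 1110zzzz")
    for b in (b1, b2):
        if b & 0xC0 != 0x80:
            raise ValueError("continuation byte is not of the form 10bbbbbb")
    # Keep 4 payload bits from the lead byte and 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

code_point = decode_three_byte(0xE6, 0x9C, 0xB1)
utf16be = code_point.to_bytes(2, "big")   # valid here because the value fits in 16 bits

print(hex(code_point))   # 0x6731
print(utf16be.hex())     # 6731
```

The payload bits are 0110, 011100 and 110001, which concatenate to 0110 0111 0011 0001, or x6731.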

One of the things that the Curl documents say about CharEncoding is this:

The mapping between ucs2-big-endian bytes and utf8 bytes is as follows:

utf32 -> utf8
00000000 0xxxxxxx -> 0xxxxxxx
00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
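
The four rows of that table can be written out as one function. This is a sketch in Python, not the Curl implementation; `encode_utf8` is a hypothetical name, and the function does not reject the surrogate range, which a strict encoder would:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point following the four rows of the table above."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point above 0x10FFFF")

print(encode_utf8(0x6731).hex())   # e69cb1
```

Our 朱 falls in the third row, which is why it takes three bytes.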

Elsewhere the Curl docs state:

[...] each character has an associated Unicode value, which must be in one of the following ranges:

• 0x0000 - 0xD7FF

• 0xE000 - 0x10FFFF

Values outside of these ranges are not legal values for char.
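
The two ranges quoted above amount to a simple check. The gap between them, 0xD800 through 0xDFFF, is the block Unicode reserves for UTF-16 surrogates. A sketch in Python, with a hypothetical function name that is not part of any Curl API:

```python
def is_legal_char_value(value: int) -> bool:
    """True if value lies in one of the two ranges quoted from the Curl docs."""
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF

print(is_legal_char_value(0x6731))    # True
print(is_legal_char_value(0xD800))    # False
print(is_legal_char_value(0x110000))  # False
```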

Robert Shiplett, Curlr
Fredericton NB
