Explanation of UFT-8 and Unicode--打印文章

打印本文

关闭窗口

Explanation of UFT-8 and Unicode

作者：武汉SEO闵涛文章来源：敏韬网点击数2646 更新时间：2009/4/25 0:44:56 文章录入：mintao 责任编辑：mintao

ars);
        System.out.println("Point 1 : " + str);
        System.out.println("   UTF-8 - UTF-8      "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("   ISO-8859-1 - UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
        System.out.println();

        chars = new char[]{''''\uE840''''};
        str = new String(chars);
        System.out.println("Point 2 : " + str);
        //just a sample you can use this method to verify more characters
        System.out.println("   No less than 7F      " + getHexString(str));

        chars = new char[]{''''\u2260''''};
        str = new String(chars);
        //just a sample you can use this method to verify more characters
        System.out.println("Point 3 : " + str);
        System.out.println("   Range of 1st Byte      " + getHexString(str));
    }

    public static String getHexString(String num) throws Exception {
        StringBuffer sb = new StringBuffer();
        //You must specify UTF-8 here, else it will use the defaul encoding
        //which depends on your enviroment
        byte[] bytes = num.getBytes("UTF-8");
        for (int i = 0; i < bytes.length; i++) {
            sb.append(Integer.toHexString((bytes[i] >= 0 ?
                    bytes[i] : 256 + bytes[i])).toUpperCase() + " ");
        }
        return sb.toString();
    }
}
---------------------------------------------------------------------------------
Pinciple of presenting a unicode use UTF-8:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

How to use the principle above?

Sample:
The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

Explain :

A:1010

9:1001

principle 2 : 00000080 < 00A9 < 000007FF

from low to high

1. There 6 x in the low bit we cut last 6 bit from - 10101001(A9) which is 101001

2.There 5 x in the high bit. we cut the rest 2 bit of A9 which is 10 and extend it to 5 bit with three 0 which is 00010

complete the low byte with 10. ----> (10) combine (101001) -> 10101001

complete the high byte with 110, ---> (110) combine (00010) -> 11000010

the Result is

11000010 10101001 = 0xC2 0xA9

you can also verify the following unicode with principle 3 use the way above:

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0

Reference:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#unicode

上一页 [1] [2]

打印本文

关闭窗口