Explanation of UTF-8 and Unicode
Author: 武汉SEO闵涛  Source: 敏韬网  Views: 2317  Updated: 2009/4/25 0:44:56  Entered by: mintao  Editor: mintao
ars);
        System.out.println("Point 1 : " + str);
        System.out.println("   UTF-8 - ISO-8859-1 "
                + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println("   ISO-8859-1 - UTF-8 "
                + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
        System.out.println();

        chars = new char[]{'\uE840'};
        str = new String(chars);
        System.out.println("Point 2 : " + str);
        // Just a sample; you can use this method to verify more characters.
        System.out.println("   No less than 7F      " + getHexString(str));

        chars = new char[]{'\u2260'};
        str = new String(chars);
        // Just a sample; you can use this method to verify more characters.
        System.out.println("Point 3 : " + str);
        System.out.println("   Range of 1st Byte      " + getHexString(str));
    }

    public static String getHexString(String num) throws Exception {
        StringBuilder sb = new StringBuilder();
        // Specify UTF-8 explicitly; otherwise getBytes() uses the default
        // encoding, which depends on your environment.
        byte[] bytes = num.getBytes("UTF-8");
        for (int i = 0; i < bytes.length; i++) {
            // Java bytes are signed; mask with 0xFF to get the value 0-255.
            sb.append(Integer.toHexString(bytes[i] & 0xFF).toUpperCase()).append(' ');
        }
        return sb.toString();
    }
}

---------------------------------------------------------------------------------
Principle for representing a Unicode code point in UTF-8:

U-00000000 - U-0000007F:  0xxxxxxx 
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx 
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx 
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 
U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 
U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 
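
The table can be read as a simple range lookup. The following is a minimal sketch of that lookup, not code from the article: the class name `Utf8LengthSketch` and the method `sequenceLength` are made up for illustration. Note that the 5-byte and 6-byte rows belong to the original UTF-8 scheme; current UTF-8 (RFC 3629) stops at U+10FFFF, so only the first four rows are used in practice.

```java
// Illustrative sketch only; the names here are hypothetical.
final class Utf8LengthSketch {

    // Returns the number of bytes the table above assigns to a code point.
    static int sequenceLength(int codePoint) {
        if (codePoint < 0) {
            throw new IllegalArgumentException("negative code point");
        }
        if (codePoint <= 0x7F) return 1;       // 0xxxxxxx
        if (codePoint <= 0x7FF) return 2;      // 110xxxxx 10xxxxxx
        if (codePoint <= 0xFFFF) return 3;     // 1110xxxx + 2 continuation bytes
        if (codePoint <= 0x1FFFFF) return 4;   // 11110xxx + 3 continuation bytes
        if (codePoint <= 0x3FFFFFF) return 5;  // 111110xx + 4 (original scheme only)
        return 6;                              // 1111110x + 5 (original scheme only)
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xA9, 0x2260, 0xE840, 0x10000};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, sequenceLength(cp));
        }
    }
}
```

The number of leading 1 bits in the first byte equals the sequence length, which is how a decoder can tell where each character ends.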

How do you apply the principle above?

Sample:
The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

    11000010 10101001 = 0xC2 0xA9

Explanation:

A = 1010

9 = 1001

Principle 2 (the second row of the table) applies, because 00000080 <= 00A9 <= 000007FF.

Fill in the x bits from low to high:

1. The low byte has 6 x bits. Take the last 6 bits of 10101001 (A9), which are 101001.

2. The high byte has 5 x bits. Take the remaining 2 bits of A9, which are 10, and pad them on the left with three 0s to make 5 bits: 00010.

Prefix the low byte with 10: (10) + (101001) -> 10101001

Prefix the high byte with 110: (110) + (00010) -> 11000010

The result is

11000010 10101001 = 0xC2 0xA9
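
The two-byte steps above can be sketched with shifts and masks. This is an illustration under assumptions rather than the article's own code: the class `Utf8TwoByteSketch` and the method `encodeTwoByte` are hypothetical names, and the result is cross-checked against the JDK's own UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; the names here are hypothetical.
final class Utf8TwoByteSketch {

    // Encodes a code point in the range U+0080..U+07FF as 110xxxxx 10xxxxxx.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        int low6 = codePoint & 0x3F;          // last 6 bits: 101001 for U+00A9
        int high5 = (codePoint >> 6) & 0x1F;  // remaining bits, zero-padded: 00010
        return new byte[] {
            (byte) (0xC0 | high5),            // 110 + 00010  = 11000010 (0xC2)
            (byte) (0x80 | low6)              // 10  + 101001 = 10101001 (0xA9)
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0x00A9);
        byte[] library = "\u00A9".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", manual[0] & 0xFF, manual[1] & 0xFF); // C2 A9
        System.out.println(Arrays.equals(manual, library));                   // true
    }
}
```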

You can verify the following character against principle 3 in the same way:

U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx 

The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0
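
The same approach extends to three bytes by splitting the 16 bits into groups of 4, 6 and 6. The sketch below is illustrative only: `Utf8ThreeByteSketch`, `encodeThreeByte` and `toHex` are hypothetical names, not part of the article's listing. It also encodes U+E840, the character used in "Point 2" of the listing above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; the names here are hypothetical.
final class Utf8ThreeByteSketch {

    // Encodes a code point in U+0800..U+FFFF as 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            // Surrogates are not characters and must not be encoded in UTF-8.
            throw new IllegalArgumentException("surrogate code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),          // 1110 + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),  // 10 + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))          // 10 + low 6 bits
        };
    }

    // Formats bytes as two-digit uppercase hex separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeThreeByte(0x2260))); // E2 89 A0
        System.out.println(toHex(encodeThreeByte(0xE840))); // EE A1 80
        byte[] library = "\u2260".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encodeThreeByte(0x2260), library)); // true
    }
}
```

For U+2260 the three groups are 0010, 001001 and 100000, which gives 11100010 10001001 10100000 once the prefixes are added.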

Reference:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#unicode

