ars);
        System.out.println("Point 1 : " + str);
        System.out.println(" UTF-8 - UTF-8 " + new String(str.getBytes("UTF-8"), "ISO-8859-1"));
        System.out.println(" ISO-8859-1 - UTF-8 " + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
        System.out.println();

        chars = new char[]{'\uE840'};
        str = new String(chars);
        System.out.println("Point 2 : " + str);
        // Just a sample; you can use this method to verify more characters.
        System.out.println(" No less than 7F " + getHexString(str));

        chars = new char[]{'\u2260'};
        str = new String(chars);
        // Just a sample; you can use this method to verify more characters.
        System.out.println("Point 3 : " + str);
        System.out.println(" Range of 1st Byte " + getHexString(str));
    }

    public static String getHexString(String num) throws Exception {
        StringBuffer sb = new StringBuffer();
        // You must specify UTF-8 here; otherwise the default encoding is used,
        // which depends on your environment.
        byte[] bytes = num.getBytes("UTF-8");
        for (int i = 0; i < bytes.length; i++) {
            // Convert the signed byte to its unsigned value (0..255) before printing it as hex.
            sb.append(Integer.toHexString((bytes[i] >= 0 ? bytes[i] : 256 + bytes[i])).toUpperCase() + " ");
        }
        return sb.toString();
    }
}
---------------------------------------------------------------------------------
Principle of representing a Unicode character in UTF-8:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

How to use the principle above?

Sample: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as 11000010 10101001 = 0xC2 0xA9

Explanation: A = 1010, 9 = 1001
Principle 2 applies, because 00000080 < 00A9 < 000007FF.

Working from the low byte to the high byte:
1. There are 6 x's in the low byte, so we cut the last 6 bits from 10101001 (A9), which gives 101001.
2. There are 5 x's in the high byte, so we take the remaining 2 bits of A9, which are 10, and pad them on the left with three 0s to make 5 bits, which gives 00010.

Complete the low byte with the prefix 10: (10) combined with (101001) -> 10101001
Complete the high byte with the prefix 110: (110) combined with (00010) -> 11000010

The result is 11000010 10101001 = 0xC2 0xA9

You can verify the following character against principle 3 in the same way:
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as: 11100010 10001001 10100000 = 0xE2 0x89 0xA0

Reference: http://www.cl.cam.ac.uk/~mgk25/unicode.html#unicode
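The bit-cutting steps above can also be written as a few shifts and masks. The following is only a minimal sketch, not part of the original sample: the class name Utf8Sketch and the methods encode and toHex are hypothetical names chosen for illustration. It covers only the first four rows of the table (up to U+10FFFF, the range Java code points can hold), and it does not treat surrogate values (U+D800 to U+DFFF) specially. Each hand-made result is compared with what the JDK itself produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, written only to illustrate the table above.
public class Utf8Sketch {

    // Encodes one code point by hand, following the first four rows of the table.
    // Assumption: surrogate values (U+D800..U+DFFF) are not passed in.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + cp);
        }
        if (cp <= 0x7F) {
            // Row 1: 0xxxxxxx
            return new byte[]{(byte) cp};
        }
        if (cp <= 0x7FF) {
            // Row 2: 110xxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xC0 | (cp >> 6)),          // prefix 110 + high 5 bits
                (byte) (0x80 | (cp & 0x3F))         // prefix 10 + low 6 bits
            };
        }
        if (cp <= 0xFFFF) {
            // Row 3: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[]{
                (byte) (0xE0 | (cp >> 12)),         // prefix 1110 + high 4 bits
                (byte) (0x80 | ((cp >> 6) & 0x3F)), // prefix 10 + middle 6 bits
                (byte) (0x80 | (cp & 0x3F))         // prefix 10 + low 6 bits
            };
        }
        // Row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[]{
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Formats bytes as upper-case, two-digit hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x00A9, 0x2260, 0xE840};
        for (int cp : samples) {
            byte[] manual = encode(cp);
            // Let the JDK encode the same character for comparison.
            byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(String.format("U+%04X -> %s (matches JDK: %b)",
                    cp, toHex(manual), Arrays.equals(manual, library)));
        }
    }
}
```

For U+00A9 and U+2260 this prints C2 A9 and E2 89 A0, the same bytes as in the worked examples above. Passing StandardCharsets.UTF_8 instead of the string "UTF-8" also avoids the checked UnsupportedEncodingException.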