Explanation of UTF-8 and Unicode

Author: Min Tao (闵涛)  Source: Min Tao's Study Notes  Views: 1820  Updated: 2009/4/25 0:44:56

What is Unicode?

  A mapping between characters and indices (code points). A code point is written as U+xxxx, where xxxx is its hexadecimal value.
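
As a small illustration of this mapping, the sketch below prints the index (code point) of a few characters in U+xxxx notation. The class name CodePointDemo, the helper notation, and the sample characters are chosen here for illustration only; they are not part of the original notes.

```java
class CodePointDemo {
    // Format a code point in the conventional U+xxxx notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as Unicode escapes so the source stays pure ASCII.
        String sample = "A\u00F2\u4E2D";
        for (int i = 0; i < sample.length(); ) {
            int cp = sample.codePointAt(i);
            System.out.println(notation(cp));
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
    }
}
```

Run as is, this should print U+0041, U+00F2 and U+4E2D, one per line.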

Confused about Unicode and UTF-8?
    Unicode is a standard character set; UTF-8 is one of its encodings (implementations), alongside UCS-2, UCS-4 and so forth, and it has become the standard way of encoding. Note one thing, though: for English (ASCII) characters the two coincide, that is, the UTF-8 byte has the same value as the code point:

U-00000000 - U-0000007F:  0xxxxxxx

    For many people, programmers especially, U-00000000 - U-0000007F is enough for daily use (the 26 English letters and some symbols), so to them there is no difference between the character set standard (Unicode) and the encoding standard (UTF-8). When they talk to you about it, you may get confused.

Why UTF-8?
    You may ask: why not use UCS-4 or UCS-2? Do people just like 8 more (in Cantonese, 8 sounds like "getting rich")?
    The answer is no. Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings in these encodings can contain, as parts of many wide characters, bytes such as '\0' or '/', which have a special meaning in filenames and other C library function parameters.

(An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.)
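
The byte-level problem can be made visible with a short sketch. It assumes Java's UTF-16BE charset as a stand-in for big-endian UCS-2 (the two agree for BMP characters); the class name WideEncodingDemo, the hex helper and the sample character U+012F are illustrative choices, not from the original notes.

```java
import java.nio.charset.StandardCharsets;

class WideEncodingDemo {
    // Render bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII text in a 16-bit encoding: a 0x00 byte precedes every ASCII byte.
        System.out.println(hex("ab".getBytes(StandardCharsets.UTF_16BE)));     // 00 61 00 62
        // U+012F is not '/', yet its 16-bit form contains the byte 0x2F ('/').
        System.out.println(hex("\u012F".getBytes(StandardCharsets.UTF_16BE))); // 01 2F
        // The same character in UTF-8 uses only bytes with the high bit set.
        System.out.println(hex("\u012F".getBytes(StandardCharsets.UTF_8)));    // C4 AF
    }
}
```

A C function that treats 0x00 as a string terminator or 0x2F as a path separator would misread the first two outputs, but not the third.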

    In addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. (In UTF-8, U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.)

------------Prove that ASCII and UTF-8 are the same---------
package unicode;

public class CharTest {
    public static void main(String[] args) throws Exception {
        char[] chars = new char[]{'\u007F'};
        String str = new String(chars);
        System.out.println("within 0000 - 007F : " + str);
        // For a character whose code point is below U+0080, encoding with
        // ISO-8859-1 or UTF-8 makes no difference; they are compatible.
        System.out.println("   UTF-8 - UTF-8      " + new String(str.getBytes("UTF-8"), "UTF-8"));
        System.out.println("   ISO-8859-1 - UTF-8 " + new String(str.getBytes("ISO-8859-1"), "UTF-8"));

        chars = new char[]{'\u00F2'};
        str = new String(chars);
        // The principle above does not apply to characters above U+007F.
        System.out.println("out of 0000 - 007F : " + str);
        System.out.println("   UTF-8 - UTF-8      " + new String(str.getBytes("UTF-8"), "UTF-8"));
        System.out.println("   ISO-8859-1 - UTF-8 " + new String(str.getBytes("ISO-8859-1"), "UTF-8"));
    }
}
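
To see why the last round trip fails, it may help to look at the raw bytes. The following is a minimal sketch; the class name RoundTripBytesDemo and its helpers hex and decodeAsUtf8 are hypothetical names introduced here, and it uses the StandardCharsets constants rather than charset name strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripBytesDemo {
    // Render bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode with the given charset, then decode the resulting bytes as UTF-8.
    static String decodeAsUtf8(String s, Charset encodeWith) {
        return new String(s.getBytes(encodeWith), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u00F2";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 B2
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // F2
        // The lone byte F2 is not a complete UTF-8 sequence, so decoding it
        // as UTF-8 cannot give back the original character.
        System.out.println(decodeAsUtf8(s, StandardCharsets.ISO_8859_1).equals(s)); // false
        System.out.println(decodeAsUtf8(s, StandardCharsets.UTF_8).equals(s));      // true
    }
}
```

For U+007F both encodings produce the single byte 7F, which is why the first half of CharTest round-trips either way.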

How long is a UTF-8 encoding?
    Theoretically, it can be up to 6 bytes (the original design allows this; RFC 3629 later limited UTF-8 to at most 4 bytes), but in practice 3 bytes are enough for us, since no BMP character takes more than 3. (The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP).)
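
The length boundaries can be checked directly. This is a rough sketch using Java's built-in UTF-8 encoder; the class name Utf8LengthDemo, the helper utf8Length and the sample code points are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the UTF-8 encoding of a single code point occupies.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Boundary values: last 1-byte, first and last 2-byte, first 3-byte,
        // a late BMP character, and the first code point beyond the BMP.
        int[] samples = {0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFD, 0x10000};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %d byte(s)", cp, utf8Length(cp)));
        }
    }
}
```

The samples should come out as 1, 2, 2, 3, 3 and 4 bytes respectively, so everything inside the BMP fits in three bytes.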

Important UTF-8 features:
  1. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  2. All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
  3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. (Author's note: this needs further investigation; I can't fully explain it yet.)
  4. All possible 2^31 UCS codes can be encoded.
  5. UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are only up to three bytes long.
  6. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
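
One way to read feature 3: because continuation bytes always fall in 0x80 to 0xBF and lead bytes never do, a reader dropped at an arbitrary byte offset can find the next character boundary by skipping continuation bytes. The sketch below illustrates that idea only; the class name Utf8ResyncDemo and the helpers isContinuation and nextCharStart are hypothetical names, not an established API.

```java
import java.nio.charset.StandardCharsets;

class Utf8ResyncDemo {
    // A continuation byte has the bit pattern 10xxxxxx (0x80 to 0xBF).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // From an arbitrary offset, skip continuation bytes until the next character start.
    static int nextCharStart(byte[] data, int offset) {
        int i = offset;
        while (i < data.length && isContinuation(data[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // "a", U+00F2 (2 bytes), U+4E2D (3 bytes), "b" -> 61 C3 B2 E4 B8 AD 62
        byte[] data = "a\u00F2\u4E2Db".getBytes(StandardCharsets.UTF_8);
        // Offset 4 points into the middle of the three-byte sequence.
        int start = nextCharStart(data, 4);
        System.out.println(start); // 6
        System.out.println(new String(data, start, data.length - start, StandardCharsets.UTF_8)); // b
    }
}
```

No state from earlier bytes is needed to recover, which is what "stateless" means here: at most one damaged character is lost.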
------------Prove the features (1, 2, 3)-----------------
package unicode;

public class UTF8Features {
    public static void main(String[] args) throws Exception {
        // Why not write non-ASCII characters directly in the source?
        // Because how the source file is read depends on your system's
        // encoding, which may not be UTF-8 as you imagine.
        char[] chars = new char[]{'\u007F'};
        String str = new String(ch
