Network Working Group                                       D. Goldsmith
Request for Comments: 2152                          Apple Computer, Inc.
Obsoletes: RFC 1642                                             M. Davis
Category: Informational                                   Taligent, Inc.
                                                                May 1997

                                 UTF-7
              A Mail-Safe Transformation Format of Unicode

Status of this Memo

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

Abstract

   The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
   amended) jointly define a character set (hereafter referred to as
   Unicode) which encompasses most of the world's writing systems.
   However, Internet mail (STD 11, RFC 822) currently supports only 7-
   bit US ASCII as a character set. MIME (RFC 2045 through 2049) extends
   Internet mail to support different media types and character sets,
   and thus could support Unicode in mail messages. MIME neither defines
   Unicode as a permitted character set nor specifies how it would be
   encoded, although it does provide for the registration of additional
   character sets over time.

   This document describes a transformation format of Unicode that
   contains only 7-bit ASCII octets and is intended to be readable by
   humans in the limiting case that the document consists of characters
   from the US-ASCII repertoire. It also specifies how this
   transformation format is used in the context of MIME and RFC 1641,
   "Using Unicode with MIME".

Motivation

   Although other transformation formats of Unicode exist and could
   conceivably be used in this context (most notably UTF-8, also known
   as UTF-2 or UTF-FSS), they suffer the disadvantage that they use
   octets in the range decimal 128 through 255 to encode Unicode
   characters outside the US-ASCII range. Thus, in the context of mail,
   those octets must themselves be encoded. This requires putting text
   through two successive encoding processes, and leads to a significant
   expansion of characters outside the US-ASCII range, putting non-
   English speakers at a disadvantage. For example, using UTF-8 together
   with the Quoted-Printable content transfer encoding of MIME
   represents US-ASCII characters in one octet, but other characters may
   require up to nine octets.

Overview

   UTF-7 encodes Unicode characters as US-ASCII octets, together with
   shift sequences to encode characters outside that range. For this
   purpose, one of the characters in the US-ASCII repertoire is reserved
   for use as a shift character.

   Many mail gateways and systems cannot handle the entire US-ASCII
   character set (those based on EBCDIC, for example), and so UTF-7
   contains provisions for encoding characters within US-ASCII in a way
   that all mail systems can accommodate.

   UTF-7 should normally be used only in the context of 7 bit
   transports, such as mail. In other contexts, straight Unicode or
   UTF-8 is preferred.

   See RFC 1641, "Using Unicode with MIME" for the overall specification
   on usage of Unicode transformation formats with MIME.

Definitions

   First, the definition of Unicode:

      The 16 bit character set Unicode is defined by "The Unicode
      Standard, Version 2.0". This character set is identical with the
      character repertoire and coding of the international standard
      ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
      Subset=300; Implementation Level=3, including the first 7
      amendments to 10646 plus editorial corrections.

      Note. Unicode 2.0 further specifies the use and interaction of
      these character codes beyond the ISO standard. However, any valid
      10646 sequence is a valid Unicode sequence, and vice versa;
      Unicode supplies interpretations of sequences on which the ISO
      standard is silent as to interpretation.

   Next, some handy definitions of US-ASCII character subsets:

      Set D (directly encoded characters) consists of the following
      characters (derived from RFC 1521, Appendix B, which no longer
      appears in RFC 2045): the upper and lower case letters A through Z
      and a through z, the 10 digits 0-9, and the following nine special
      characters (note that "+" and "=" are omitted):

               Character   ASCII & Unicode Value (decimal)
                  '           39
                  (           40
                  )           41
                  ,           44
                  -           45
                  .           46
                  /           47
                  :           58
                  ?           63

      Set O (optional direct characters) consists of the following
      characters (note that "\" and "~" are omitted):

               Character   ASCII & Unicode Value (decimal)
                  !           33
                  "           34
                  #           35
                  $           36
                  %           37
                  &           38
                  *           42
                  ;           59
                  <           60
                  =           61
                  >           62
                  @           64
                  [           91
                  ]           93
                  ^           94
                  _           95
                  `           96
                  {           123
                  |           124
                  }           125

   Rationale. The characters "\" and "~" are omitted because they are
   often redefined in variants of ASCII.

   Set B (Modified Base 64) is the set of characters in the Base64
   alphabet defined in RFC 2045, excluding the pad character "="
   (decimal value 61).

   Rationale. The pad character = is excluded because UTF-7 is designed
   for use within header fields as set forth in RFC 2047. Since the only
   readable encoding in RFC 2047 is "Q" (based on RFC 2045's Quoted-
   Printable), the "=" character is not available for use (without a lot
   of escape sequences). This was very unfortunate but unavoidable. The
   "=" character could otherwise have been used as the UTF-7 escape
   character as well (rather than using "+").

   Note that all characters in US-ASCII have the same value in Unicode
   when zero-extended to 16 bits.
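
   The three character sets defined above can be written down as a
   minimal Python sketch. The names SET_D, SET_O, SET_B and classify
   are illustrative only and do not come from this memo.

```python
import string

# Set D: characters that may always be encoded directly (note: no "+" or "=").
SET_D = frozenset(string.ascii_letters + string.digits + "'(),-./:?")

# Set O: optional direct characters (note: no "\" and no "~").
SET_O = frozenset('!"#$%&*;<=>@[]^_`{|}')

# Set B: the Base64 alphabet of RFC 2045 without the "=" pad character.
SET_B = frozenset(string.ascii_letters + string.digits + "+/")


def classify(ch: str) -> str:
    """Report whether a single character is in Set D, Set O, or neither."""
    if ch in SET_D:
        return "D"
    if ch in SET_O:
        return "O"
    return "neither"
```

   Under this sketch, classify("+") and classify("~") both report
   "neither", which is why those characters must be handled through
   the shifted encoding described below.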

UTF-7 Definition

   A UTF-7 stream represents 16-bit Unicode characters using 7-bit US-
   ASCII octets as follows:

      Rule 1: (direct encoding) Unicode characters in set D above may be
      encoded directly as their ASCII equivalents. Unicode characters in
      Set O may optionally be encoded directly as their ASCII
      equivalents, bearing in mind that many of these characters are
      illegal in header fields, or may not pass correctly through some
      mail gateways.

      Rule 2: (Unicode shifted encoding) Any Unicode character sequence
      may be encoded using a sequence of characters in set B, when
      preceded by the shift character "+" (US-ASCII character value
      decimal 43). The "+" signals that subsequent octets are to be
      interpreted as elements of the Modified Base64 alphabet until a
      character not in that alphabet is encountered. Such characters
      include control characters such as carriage returns and line
      feeds; thus, a Unicode shifted sequence always terminates at the
      end of a line. As a special case, if the sequence terminates with
      the character "-" (US-ASCII decimal 45) then that character is
      absorbed; other terminating characters are not absorbed and are
      processed normally.

      Note that if the first character after the shifted sequence is "-"
      then an extra "-" must be present to terminate the shifted
      sequence so that the actual "-" is not itself absorbed.

      Rationale. A terminating character is necessary for cases where
      the next character after the Modified Base64 sequence is part of
      character set B or is itself the terminating character. It can
      also enhance readability by delimiting encoded sequences.

      Also as a special case, the sequence "+-" may be used to encode
      the character "+". A "+" character followed immediately by any
      character other than members of set B or "-" is an ill-formed
      sequence.
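
      The shift handling of Rule 2 might be sketched as a small
      tokenizer that separates directly encoded text from shifted
      runs. This is one possible reading, not a reference
      implementation; the function name split_utf7 and the token
      labels "direct" and "shifted" are assumptions made for
      illustration. The Modified Base64 payload is left undecoded.

```python
import string

# Modified Base64 alphabet (set B): Base64 without the "=" pad character.
SET_B = frozenset(string.ascii_letters + string.digits + "+/")


def split_utf7(stream: str) -> list:
    """Split a UTF-7 string into ("direct", text) and ("shifted", run) tokens."""
    tokens = []
    direct = []

    def flush_direct():
        # Emit any pending directly encoded characters as one token.
        if direct:
            tokens.append(("direct", "".join(direct)))
            direct.clear()

    i, n = 0, len(stream)
    while i < n:
        ch = stream[i]
        if ch != "+":
            direct.append(ch)
            i += 1
            continue
        if i + 1 < n and stream[i + 1] == "-":
            # Special case: "+-" stands for a literal "+".
            direct.append("+")
            i += 2
            continue
        j = i + 1
        while j < n and stream[j] in SET_B:
            j += 1
        if j == i + 1:
            raise ValueError("ill-formed: '+' not followed by set B or '-'")
        flush_direct()
        tokens.append(("shifted", stream[i + 1:j]))
        # A terminating "-" is absorbed; any other terminator is kept.
        i = j + 1 if j < n and stream[j] == "-" else j
    flush_direct()
    return tokens
```

      For the string "Hi Mom -+Jjo--!" this sketch yields a direct
      token "Hi Mom -", a shifted token "Jjo", and a direct token
      "-!", showing that only the first "-" after the run is absorbed.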

      Unicode is encoded using Modified Base64 by first converting
      Unicode 16-bit quantities to an octet stream (with the most
      significant octet first). Surrogate pairs (UTF-16) are converted
      by treating each half of the pair as a separate 16 bit quantity
      (i.e., no special treatment). Text with an odd number of octets is
      ill-formed. ISO 10646 characters outside the range addressable via
      surrogate pairs cannot be encoded.

      Rationale. ISO/IEC 10646-1:1993(E) specifies that when characters
      in the UCS-2 form are serialized as octets, the most significant
      octet appears first.  This is also in keeping with common network
      practice of choosing a canonical format for transmission.

      Rationale. The policy for code point allocation within ISO 10646
      and Unicode is that the repertoires be kept synchronized. No code
      points will be allocated in ISO 10646 outside the range
      addressable by surrogate pairs.

      Next, the octet stream is encoded by applying the Base64 content
      transfer encoding algorithm as defined in RFC 2045, modified to
      omit the "=" pad character. Instead, when encoding, zero bits are
      added to pad to a Base64 character boundary. When decoding, any
      bits at the end of the Modified Base64 sequence that do not
      constitute a complete 16-bit Unicode character are discarded. If
      such discarded bits are non-zero the sequence is ill-formed.
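
      A minimal sketch of the Modified Base64 step for a single
      shifted run follows. The helper names encode_run and decode_run
      are hypothetical. The encoder leans on Python's standard base64
      module and strips the "=" padding; the decoder accumulates bits
      by hand so that the check on discarded bits is explicit.

```python
import base64

B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_run(text: str) -> str:
    """Encode characters as Modified Base64 (without the leading '+')."""
    # 16-bit quantities, most significant octet first; surrogate pairs
    # come out as two separate 16-bit values.
    octets = text.encode("utf-16-be")
    # Standard Base64 already pads the final character with zero bits;
    # only the "=" pad characters have to be removed.
    return base64.b64encode(octets).rstrip(b"=").decode("ascii")


def decode_run(run: str) -> str:
    """Decode a Modified Base64 run back into Unicode characters."""
    bits = 0      # pending bits not yet emitted as a 16-bit value
    nbits = 0     # number of pending bits
    units = bytearray()
    for ch in run:
        bits = (bits << 6) | B64_ALPHABET.index(ch)
        nbits += 6
        if nbits >= 16:
            nbits -= 16
            units += ((bits >> nbits) & 0xFFFF).to_bytes(2, "big")
            bits &= (1 << nbits) - 1
    if bits:
        # Leftover bits are discarded, but they must all be zero.
        raise ValueError("ill-formed: non-zero bits discarded")
    return units.decode("utf-16-be")
```

      With these helpers, encode_run applied to hexadecimal 2262,0391
      gives "ImIDkQ", and decode_run rejects "ImIDkR" because its
      final discarded bit is non-zero.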

      Rationale. The pad character "=" is not used when encoding
      Modified Base64 because of the conflict with its use as an escape
      character for the Q content transfer encoding in RFC 2047 header
      fields, as mentioned above.

      Rule 3: The space (decimal 32), tab (decimal 9), carriage return
      (decimal 13), and line feed (decimal 10) characters may be
      directly represented by their ASCII equivalents. However, note
      that MIME content transfer encodings have rules concerning the use
      of such characters. Usage that does not conform to the
      restrictions of RFC 822, for example, would have to be encoded
      using MIME content transfer encodings other than 7bit or 8bit,
      such as quoted-printable, binary, or base64.

   Given this set of rules, Unicode characters which may be encoded via
   rules 1 or 3 take one octet per character, and other Unicode
   characters are encoded on average with 2 2/3 octets per character
   plus one octet to switch into Modified Base64 and an optional octet
   to switch out.

      Example. The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>."
      (hexadecimal 0041,2262,0391,002E) may be encoded as follows:

            A+ImIDkQ.

      Example. The Unicode sequence "Hi Mom -<WHITE SMILING FACE>-!"
      (hexadecimal 0048, 0069, 0020, 004D, 006F, 006D, 0020, 002D, 263A,
       002D, 0021) may be encoded as follows:

            Hi Mom -+Jjo--!

      Example. The Unicode sequence representing the Han characters for
      the Japanese word "nihongo" (hexadecimal 65E5,672C,8A9E) may be
      encoded as follows:

            +ZeVnLIqe-
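
      The three examples above can be reproduced with a short encoder
      sketch that applies Rules 1 through 3. The function name
      utf7_encode and its use_set_o flag are assumptions for
      illustration. The rules permit other valid outputs; this sketch
      adds the terminating "-" only at the end of the text or where
      the next character would otherwise be misread, and it always
      writes a literal "+" as "+-".

```python
import base64
import string

SET_D = frozenset(string.ascii_letters + string.digits + "'(),-./:?")
SET_O = frozenset('!"#$%&*;<=>@[]^_`{|}')
SET_B = frozenset(string.ascii_letters + string.digits + "+/")
RULE_3 = frozenset(" \t\r\n")  # space, tab, carriage return, line feed


def utf7_encode(text: str, use_set_o: bool = True) -> str:
    """Encode text as UTF-7; use_set_o selects direct encoding of Set O."""
    direct = SET_D | RULE_3 | (SET_O if use_set_o else frozenset())
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "+":
            out.append("+-")          # special case for the shift character
            i += 1
        elif ch in direct:
            out.append(ch)            # Rules 1 and 3: direct encoding
            i += 1
        else:
            # Rule 2: gather a run of characters that need shifting.
            j = i
            while j < len(text) and text[j] != "+" and text[j] not in direct:
                j += 1
            octets = text[i:j].encode("utf-16-be")
            run = base64.b64encode(octets).rstrip(b"=").decode("ascii")
            out.append("+" + run)
            # Terminate explicitly when the next character would be
            # taken as part of the run, or is itself "-".
            if j == len(text) or text[j] == "-" or text[j] in SET_B:
                out.append("-")
            i = j
    return "".join(out)
```

      Calling utf7_encode with use_set_o=False shifts the Set O
      characters as well, which corresponds to the more conservative
      form that is safer for header fields and mail gateways.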

Use of Character Set UTF-7 Within MIME

   Character set UTF-7 is safe for mail transmission and therefore may
   be used with any content transfer encoding in MIME (except where line
   length and line break restrictions are violated). Specifically, the 7
   bit encoding for bodies and the Q encoding for headers are both
   acceptable. The MIME character set tag is UTF-7. This signifies any
   version of Unicode equal to or greater than 2.0.

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (hexadecimal 0048,
      0069, 0020, 004D, 006F, 006D, 0020, 263A, 0021).

      Content-Type: text/plain; charset=UTF-7

      Hi Mom +Jjo-!

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence representing the Han characters for the Japanese
      word "nihongo" (hexadecimal 65E5,672C,8A9E).

      Content-Type: text/plain; charset=UTF-7

      +ZeVnLIqe-

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (hexadecimal
      0041,2262,0391,002E).

      Content-Type: text/plain; charset=utf-7

      A+ImIDkQ.

      Example. Here is a text portion of a MIME message containing the
      Unicode sequence "Item 3 is <POUND SIGN>1."  (hexadecimal 0049,
      0074, 0065, 006D, 0020, 0033, 0020, 0069, 0073, 0020, 00A3, 0031,
      002E).

      Content-Type: text/plain; charset=UTF-7

      Item 3 is +AKM-1.
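
      As an illustration of how a receiving agent might read such a
      body, the following sketch parses the last example with the
      email package from Python's standard library, which resolves
      the charset parameter through Python's built-in "utf-7" codec.
      The constant RAW_MESSAGE and the function read_utf7_body are
      names chosen for this example only.

```python
from email import message_from_string, policy

# The last example above, written as a raw message with a 7bit body.
RAW_MESSAGE = (
    "Content-Type: text/plain; charset=UTF-7\n"
    "\n"
    "Item 3 is +AKM-1.\n"
)


def read_utf7_body(raw: str) -> str:
    """Parse a raw message and return its text body as Unicode."""
    msg = message_from_string(raw, policy=policy.default)
    # get_content() decodes the payload using the declared charset.
    return msg.get_content()
```

      Because the charset name is matched case-insensitively, the
      same sketch also applies to the example tagged charset=utf-7.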

   Note that to achieve the best interoperability with systems that may
   not support Unicode or MIME, when preparing text for mail
   transmission line breaks should follow Internet conventions. This
   means that lines should be short and terminated with the proper SMTP
   CRLF sequence. Unicode LINE SEPARATOR (hexadecimal 2028) and
   PARAGRAPH SEPARATOR (hexadecimal 2029) should be converted to SMTP
   line breaks. Ideally, this would be handled transparently by a
   Unicode-aware user agent.

   This preparation is not absolutely necessary, since UTF-7 and the
   appropriate MIME content transfer encoding can handle text that does
   not follow Internet conventions, but readability by systems without
   Unicode or MIME will be impaired. See RFC 2045 for a discussion of
   mail interoperability issues.

   Lines should never be broken in the middle of a UTF-7 shifted
   sequence, since such sequences may not cross line breaks. Therefore,
   UTF-7 encoding should take place after line breaking. If a line
   containing a shifted sequence is too long after encoding, a MIME
   content transfer encoding such as Quoted Printable can be used to
   encode the text. Another possibility is to perform line breaking and
   UTF-7 encoding at the same time, so that lines containing shifted
   sequences already conform to length restrictions.
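
   The second possibility, breaking lines and encoding at the same
   time, could be sketched as follows. This is a simplified greedy
   word wrap, not a complete mail formatter: it assumes words are
   separated by white space, relies on Python's built-in "utf-7"
   codec for the encoding itself, and does not split a single word
   whose encoded form exceeds the limit. The name wrap_utf7 and the
   default limit of 76 are assumptions for illustration.

```python
def wrap_utf7(text: str, limit: int = 76) -> list:
    """Break text into lines whose UTF-7 encoded length fits the limit."""
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Measure the encoded length, since shifted runs expand the text.
        if current and len(candidate.encode("utf-7")) > limit:
            lines.append(current.encode("utf-7").decode("ascii"))
            current = word
        else:
            current = candidate
    if current:
        lines.append(current.encode("utf-7").decode("ascii"))
    return lines
```

   Each line is encoded on its own, so every shifted sequence starts
   and ends within a single line and each line can be decoded
   independently of its neighbors.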

Discussion

   In this section we will motivate the introduction of UTF-7 as opposed
   to the alternative of using the existing transformation formats of
   Unicode (e.g., UTF-8) with MIME's content transfer encodings. Before
   discussing this, it will be useful to list some assumptions about
   character frequency within typical natural language text strings that
   we use to estimate typical storage requirements:

   1. Most Western European languages use roughly 7/8 of their letters
      from US-ASCII and 1/8 from Latin 1 (ISO-8859-1).

   2. Most non-Roman alphabet-based languages (e.g., Greek) use about
      1/6 of their letters from ASCII (since white space is in the 7-bit
      area) and the rest from their alphabets.

   3. East Asian ideographic-based languages (including Japanese) use
      essentially all of their characters from the Han or CJK syllabary
      area.

   4. Non-directly encoded punctuation characters do not occur
      frequently enough to affect the results.

   Notice that current 8 bit standards, such as ISO-8859-x, require use
   of a content transfer encoding. For comparison with the subsequent
   discussion, the costs break down as follows (note that many of these
   figures are approximate since they depend on the exact composition of
   the text):

   8859-x in Base64

      Text type          Average octets/character
      All                      1.33

   8859-x in Quoted Printable

      Text type          Average octets/character
      US-ASCII                 1
      Western European         1.25
      Other                    2.67

   Note also that Unicode encoded in Base64 takes a constant 2.67 octets
   per character. For purposes of comparison, we will look at UTF-8 in
   Base64 and Quoted Printable, and UTF-7. Also note that fixed overhead
   for long strings is relative to 1/n, where n is the encoded string
   length in octets.

   UTF-8 in Base64

      Text type          Average octets/character
      US-ASCII                 1.33
      Western European         1.5
      Some Alphabetics         2.44
      All others               4

   UTF-8 in Quoted Printable

      Text type          Average octets/character
      US-ASCII                 1
      Western European         1.63
      Some Alphabetics         5.17
      All others               7-9

   UTF-7

      Text type          Average octets/character
      Most US-ASCII            1
      Western European         1.5
      All others               2.67+2/n
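
   Most of the figures in the tables above can be recomputed from
   assumptions 1 and 2. The sketch below is our reading of that
   arithmetic and is not part of the original analysis. It assumes
   that Base64 expands every 3 octets to 4, that Quoted-Printable
   spends 3 octets on each octet above 127, and that the non-ASCII
   characters in these two text types take 2 octets in UTF-8. All
   function and variable names are hypothetical.

```python
from fractions import Fraction

BASE64_EXPANSION = Fraction(4, 3)  # 3 input octets become 4 characters
QP_ESCAPED = 3                     # "=XX" for each octet above 127

# Share of characters drawn from US-ASCII (assumptions 1 and 2).
ASCII_SHARE = {
    "Western European": Fraction(7, 8),
    "Some Alphabetics": Fraction(1, 6),
}


def average(ascii_share, ascii_cost, other_cost):
    """Weighted average cost in octets per character."""
    return ascii_share * ascii_cost + (1 - ascii_share) * other_cost


def iso8859_in_qp(kind):
    # One octet per character; non-ASCII octets are escaped.
    return average(ASCII_SHARE[kind], 1, QP_ESCAPED)


def utf8_in_base64(kind):
    # Two UTF-8 octets per non-ASCII character, then Base64 expansion.
    return average(ASCII_SHARE[kind], 1, 2) * BASE64_EXPANSION


def utf8_in_qp(kind):
    # Two UTF-8 octets per non-ASCII character, each escaped.
    return average(ASCII_SHARE[kind], 1, 2 * QP_ESCAPED)


def utf7_shifted(n):
    # 16 bits at 6 bits per character, plus "+" and "-" spread over n.
    return Fraction(8, 3) + Fraction(2, n)
```

   For example, float(utf8_in_base64("Some Alphabetics")) is about
   2.44 and float(utf8_in_qp("Some Alphabetics")) is about 5.17,
   matching the corresponding table rows.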

   We feel that the UTF-8 in Quoted Printable option is not viable due
   to the very large expansion of all text except Western European. This
   would only be viable in texts consisting of large expanses of US-
   ASCII or Latin characters with occasional other characters
   interspersed. We would prefer to introduce one encoding that works
   reasonably well for all users.

   We also feel that UTF-8 in Base64 has high expansion for non-
   Western-European users, and is less desirable because it cannot be
   read directly, even when the content is largely US-ASCII. The base
   encoding of UTF-7 gives competitive results and is readable for ASCII
   text.

   UTF-7 gives results competitive with ISO-8859-x, with access to all
   of the Unicode character set. We believe this justifies the
   introduction of a new transformation format of Unicode.

   As an alternative to use of UTF-7, it might be possible to intermix
   Unicode characters with other character sets using an existing MIME
   mechanism, the multipart/mixed content type, ignoring for the moment
   the issues with line breaks (thanks to Nathaniel Borenstein for
   suggesting this). For instance (repeating an earlier example):

      Content-type: multipart/mixed; boundary=foo
      Content-Disposition: inline

      --foo
      Content-type: text/plain; charset=us-ascii

      Hi Mom
      --foo
      Content-type: text/plain; charset=UNICODE-2-0
      Content-transfer-encoding: base64

      Jjo=
      --foo
      Content-type: text/plain; charset=us-ascii

      !
      --foo--

   Theoretically, this removes the need for UTF-7 in message bodies
   (multipart may not be used in header fields). However, we feel that
   as use of the Unicode character set becomes more widespread,
   intermittent use of specialized Unicode characters (such as dingbats
   and mathematical symbols) will occur, and that text will also
   typically include small snippets from other scripts, such as
   Cyrillic, Greek, or East Asian languages (anything in the Roman
   script is already handled adequately by existing MIME character
   sets). Although the multipart technique works well for large chunks
   of text in alternating character sets, we feel it does not adequately
   support the kinds of uses just discussed, and so we still believe the
   introduction of UTF-7 is justified.

Summary

   The UTF-7 encoding allows Unicode characters to be encoded within the
   US-ASCII 7 bit character set. It is most effective for Unicode
   sequences which contain relatively long strings of US-ASCII
   characters interspersed with either single Unicode characters or
   strings of Unicode characters, as it allows the US-ASCII portions to
   be read on systems without direct Unicode support.

   UTF-7 should only be used with 7 bit transports such as mail. In
   other contexts, use of straight Unicode or UTF-8 is preferred.

Acknowledgements

   Many thanks to the following people for their contributions,
   comments, and suggestions. If we have omitted anyone it was through
   oversight and not intentionally.

         Glenn Adams
         Harald T. Alvestrand
         Nathaniel Borenstein
         Lee Collins
         Jim Conklin
         Dave Crocker
         Steve Dorner
         Dana S. Emery
         Ned Freed
         Kari E. Hurtta
         John H. Jenkins
         John C. Klensin
         Valdis Kletnieks
         Keith Moore
         Masataka Ohta
         Einar Stefferud
         Erik M. van der Poel

Appendix A -- Examples

   Here is a longer example, taken from a document originally in Big5
   code. It has been condensed for brevity. There are two versions: the
   first uses optional characters from set O (and so may not pass
   through some mail gateways), and the second does not.

   Content-type: text/plain; charset=utf-7

   Below is the full Chinese text of the Analects (+itaKng-).

   The sources for the text are:

   "The sayings of Confucius," James R. Ware, trans.  +U/BTFw-:
   +ZYeB9FH6ckh5Pg-, 1980.  (Chinese text with English translation)

   +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-:  +Ti1XC2b4Xpc-, 1990.

   "The Chinese Classics with a Translation, Critical and Exegetical
   Notes, Prolegomena, and Copius Indexes," James Legge, trans., Taipei:
   Southern Materials Center Publishing, Inc., 1991.  (Chinese text with
   English translation)

   Big Five and GB versions of the text are being made available
   separately.

   Neither the Big Five nor GB contain all the characters used in this
   text.  Missing characters have been indicated using their Unicode/ISO
   10646 code points.  "U+-" followed by four hexadecimal digits
   indicates a Unicode/10646 code (e.g., U+-9F08).  There is no good
   solution to the problem of the small size of the Big Five/GB
   character sets; this represents the solution I find personally most
   satisfactory.

   (omitted...)

   I have tried to minimize this problem by using variant characters
   where they were available and the character actually in the text was
   not.  Only variants listed as such in the +XrdxmVtXUXg- were used.

   (omitted...)

   John H. Jenkins +TpVPXGBG- jenkins@apple.com 5 January 1993
   (omitted...)
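
   As a rough illustration of how the shifted runs in the version above
   (for example "+itaKng-") map back to Unicode, the following Python
   sketch decodes a single run by hand. It assumes a well-formed run of
   the form "+...-": it restores the base64 padding that UTF-7 omits,
   decodes the base64, and reads the octets as 16-bit values with the
   most significant octet first. The function name decode_shifted_run
   is hypothetical and is not defined by this memo.

```python
import base64


def decode_shifted_run(run: str) -> str:
    """Decode one UTF-7 shifted run of the form '+...-' (illustrative only)."""
    if len(run) < 2 or not (run.startswith("+") and run.endswith("-")):
        raise ValueError("expected a run of the form '+...-'")
    body = run[1:-1]
    if not body:
        return "+"  # the special sequence '+-' stands for a literal '+'
    # UTF-7 omits the base64 '=' padding, so restore it before decoding.
    padded = body + "=" * (-len(body) % 4)
    raw = base64.b64decode(padded)
    # The decoded octets are 16-bit values, most significant octet first.
    return raw.decode("utf-16-be")


# Code points carried by the run that follows "Analects" in the example.
print([hex(ord(c)) for c in decode_shifted_run("+itaKng-")])
# prints ['0x8ad6', '0x8a9e']
```

   A complete decoder would also have to handle runs that end without an
   explicit "-" and reject ill-formed sequences; this sketch does
   neither.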

   Content-type: text/plain; charset=utf-7

   Below is the full Chinese text of the Analects (+itaKng-).

   The sources for the text are:

   +ACI-The sayings of Confucius,+ACI- James R. Ware, trans.  +U/BTFw-:
   +ZYeB9FH6ckh5Pg-, 1980.  (Chinese text with English translation)

   +Vttm+E6UfZM-, +W4tRQ066bOg-, +UxdOrA-:  +Ti1XC2b4Xpc-, 1990.

   +ACI-The Chinese Classics with a Translation, Critical and Exegetical
   Notes, Prolegomena, and Copious Indexes,+ACI- James Legge, trans.,
   Taipei:  Southern Materials Center Publishing, Inc., 1991.  (Chinese
   text with English translation)

   Big Five and GB versions of the text are being made available
   separately.

   Neither the Big Five nor GB contain all the characters used in this
   text.  Missing characters have been indicated using their Unicode/ISO
   10646 code points.  +ACI-U+-+ACI- followed by four hexadecimal digits
   indicates a Unicode/10646 code (e.g., U+-9F08).  There is no good
   solution to the problem of the small size of the Big Five/GB
   character sets+ADs- this represents the solution I find personally
   most satisfactory.

   (omitted...)

   I have tried to minimize this problem by using variant characters
   where they were available and the character actually in the text was
   not.  Only variants listed as such in the +XrdxmVtXUXg- were used.

   (omitted...)

   John H. Jenkins +TpVPXGBG- jenkins+AEA-apple.com 5 January 1993
   (omitted...)
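
   One way to check that the two versions above carry the same text is
   to decode a line from each and compare the results. The sketch below
   does this with the "utf-7" codec in Python's standard library; the
   choice of Python and the variable names are assumptions made for
   illustration and are not part of this memo.

```python
# One line from each version of the message above, as 7-bit octets.
with_set_o = b'"The sayings of Confucius," James R. Ware, trans.  +U/BTFw-:'
without_set_o = (
    b"+ACI-The sayings of Confucius,+ACI- James R. Ware, trans.  +U/BTFw-:"
)

text_a = with_set_o.decode("utf-7")
text_b = without_set_o.decode("utf-7")

# Both encodings decode to the same Unicode string.
assert text_a == text_b

# Other set O characters that the second version encodes rather than
# sending directly.
assert b"jenkins+AEA-apple.com".decode("utf-7") == "jenkins@apple.com"
assert b"sets+ADs- this".decode("utf-7") == "sets; this"
assert b"+ACI-U+-+ACI-".decode("utf-7") == '"U+"'
```

   The second form is longer, but it uses only characters that are
   expected to survive mail gateways, which is the trade-off the two
   versions are meant to show.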

Security Considerations

   Security issues are not discussed in this memo.

References

[UNICODE 2.0]  "The Unicode Standard, Version 2.0", The Unicode
               Consortium, Addison-Wesley, 1996. ISBN 0-201-48345-9.

[ISO 10646]    ISO/IEC 10646-1:1993(E) Information Technology--Universal
               Multiple-octet Coded Character Set (UCS). See also
               amendments 1 through 7, plus editorial corrections.

[RFC-1641]     Goldsmith, D., and M. Davis, "Using Unicode with MIME",
               RFC 1641, Taligent, Inc., July 1994.

[US-ASCII]     Coded Character Set--7-bit American Standard Code for
               Information Interchange, ANSI X3.4-1986.

[ISO-8859]     Information Processing -- 8-bit Single-Byte Coded Graphic
               Character Sets -- Part 1: Latin Alphabet No. 1, ISO
               8859-1:1987.  Part 2: Latin alphabet No.  2, ISO 8859-2,
               1987.  Part 3: Latin alphabet No. 3, ISO 8859-3, 1988.
               Part 4: Latin alphabet No.  4, ISO 8859-4, 1988.  Part 5:
               Latin/Cyrillic alphabet, ISO 8859-5, 1988.  Part 6:
               Latin/Arabic alphabet, ISO 8859-6, 1987.  Part 7:
               Latin/Greek alphabet, ISO 8859-7, 1987.  Part 8:
               Latin/Hebrew alphabet, ISO 8859-8, 1988.  Part 9: Latin
               alphabet No. 5, ISO 8859-9, 1990.

[RFC822]       Crocker, D., "Standard for the Format of ARPA Internet
               Text Messages", STD 11, RFC 822, UDEL, August 1982.

[MIME]         Borenstein N., N. Freed, K. Moore, J. Klensin, and J.
               Postel, "MIME (Multipurpose Internet Mail Extensions)
               Parts One through Five", RFC 2045, 2046, 2047, 2048, and
               2049, November 1996.

Authors' Addresses

   David Goldsmith
   Apple Computer, Inc.
   2 Infinite Loop, MS: 302-2IS
   Cupertino, CA 95014

   Phone: 408-974-1957
   Fax: 408-862-4566
   EMail: goldsmith@apple.com

   Mark Davis
   Taligent, Inc.
   10201 N. DeAnza Blvd.
   Cupertino, CA 95014-2233

   Phone: 408-777-5116
   Fax: 408-777-5081
   EMail: mark_davis@taligent.com

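-- The following is not part of the translation --
Copyright notice for the translation

The copyright of this translation is held by 小林雅弘 (bko).

Copyright (C) bko 2000

However, this does not mean that bko guarantees the accuracy of this
translation, nor that bko guarantees any outcome of using it. Use it at
your own risk.
Japanese and English do not, of course, correspond one to one. When
translating standards and specifications of this kind, it is probably
necessary to consider the document as a whole and assign the same
Japanese term to the same English word so that the meaning stays
strict. This translation does none of that. It was translated in the
same way as any ordinary piece of writing.
As a result, some parts may be easier to understand, but rigor is
clearly lacking. Anyone who uses this translation is advised to check
it against the original text.
If mistranslations are found, the text may be revised without notice,
so reproduction without permission is not allowed.

First edition: 2000/6/12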