  • Degree thesis

國際化域名解析系統相關問題之研究

Internationalized Domain Name Resolution System and Its Localization

Advisors: 賴飛羆, 何建明

Abstract


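As technology advances and the Internet becomes widespread, allowing users to access the Internet in their native languages has become an important research topic. Traditionally, hosts could be named using only the English letters (A-Z, a-z), digits (0-9), and hyphen (-) of the ASCII character set. This subset of ASCII is commonly called the LDH character set. In 1999, the Asia Pacific Networking Group (APNG), in cooperation with the Singapore Broadcasting Authority (SBA), adopted a technology called iDNS to promote Chinese domain name service in Singapore. In the same year, several Internet companies tried to introduce the technology into Taiwan. However, no clear technical standard governed multilingual domain names, and incompatible technologies could cause users' systems to malfunction and even affect other users. The use of Chinese domain names was also likely to have a new impact on Chinese-speaking communities, as in the rush to grab Chinese domain names in 1999-2000.

In March 2000, the Internet Engineering Task Force (IETF), the Internet standards body, formed the Internationalized Domain Name Working Group (IDN WG) to discuss technical specifications for internationalized domain names. In May of the same year, CNNIC, TWNIC, HKNIC, and MONIC, the domain name authorities of mainland China, Taiwan, Hong Kong, and Macau, established the Chinese Domain Name Consortium (CDNC) in Beijing. The CDNC is responsible for evaluating the various Chinese domain name implementation schemes, formulating technical standards and registration and administration rules for Chinese domain names, coordinating the operation of Chinese domain names across the relevant countries and regions, and engaging in exchange and cooperation with relevant international organizations.

In March 2003, the IETF published three RFCs as the technical specification for internationalized domain names, generally known as the IDNA specification. Its main idea is to expand the character set available to domain names from the original LDH set to Unicode, and to use an encoding rule called ASCII Compatible Encoding (ACE) to convert Unicode domain name strings into strings that the existing domain name system accepts. IDNA requires an application to convert an internationalized domain name to ACE before handing it to the resolver for resolution. This requirement removes the need to upgrade the existing domain name system to support internationalized domain names. However, every application must integrate an IDNA-compliant module before it can use internationalized domain names. We also note that many multilingual documents carry no encoding tag; the internationalized domain names in these documents cannot be converted to the correct Unicode domain names and therefore cannot be converted to the correct ACE strings. As a result, some internationalized domain names cannot be resolved correctly.

We believe that, in the early stage of internationalized domain name service, users should not be forced to upgrade all of their existing applications. While participating in the standardization process, we noticed that quite a few Internet applications allow non-LDH strings in domain name slots; for example, users can enter Chinese domain names in Microsoft's Internet Explorer web browser. This finding led us to consider providing internationalized domain name resolution in multiple encodings on the domain name server side until IDNA-compliant applications become widespread, so that users can use internationalized domain names in existing Internet software. We propose a server-side resolution architecture called the Octopus domain name resolution system. In this architecture, internationalized domain names are stored on domain name servers in ACE form. A server-side proxy domain name server named Octopus converts non-ACE internationalized domain names into ACE and then queries the real domain name server. With this architecture, users need not upgrade existing Internet software, and domain name service administrators need neither modify the domain name server software to provide resolution in multiple encodings nor store internationalized domain names in multiple encodings in the zone file.

We built a Chinese domain name service based on the Octopus domain name resolution system and further investigated the problems that can arise when Chinese domain names are used in existing software. Some problems arise because the Chinese character encodings themselves are incompatible with certain modules of the application; others arise because Chinese domain names carry no encoding tag, so the software misidentifies the encoding, which in turn affects its operation. Some occur on application servers and others in client applications. We clarify the cause of each problem, classify the problems, and study both solutions and schemes that are feasible at the present stage. This experience is an important reference for software vendors that will implement IDNA-compliant software.

Beyond solving the technical problems of internationalized domain name resolution, we also consider the impact of Chinese domain names on Chinese-speaking communities. Domain names are widely regarded as having commercial value, much like trademarks. Chinese domain names can be expected to carry even more value than English domain names in Chinese-speaking communities. The accompanying problems, such as cybersquatting and infringement disputes over similar domain names, will inevitably bring a new wave of impact to these communities.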
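Just as English has uppercase and lowercase letters, Chinese has traditional and simplified characters, such as 『臺』 and 『台』, or 『灣』 and 『湾』. The correspondence between traditional and simplified characters is one kind of variant relationship. If two Chinese characters have the same pronunciation, meaning, and usage, we call them variants of each other, as with 『清』 and 『淸』, or 『真』 and 『眞』. Clearly, 『清真寺』 and 『淸眞寺』 should be treated as the same domain name. Likewise, 『臺灣大學』, 『台灣大學』, and 『台湾大学』 should be regarded as the same domain name. We call equivalent domain names produced by substituting variant characters variant domain names. However, because the combinations of variants are very complex, the IDN WG considered such language-specific issues to be matters that domain name authorities should resolve through registration and administration rather than through network protocols. In April 2004, through the efforts of character experts and information engineering experts convened by domain name authorities in the Chinese, Japanese, and Korean regions, the IETF published RFC 3743, commonly known as the JET Guidelines, which recommends that domain name authorities use the IDL package as the unit of registration. Each authority may define its own variant table or adopt a table recommended by another organization. When a domain name is registered, its variant domain names are generated from the variant table. The variant domain names and the registered domain name belong to one indivisible IDL package. The package mechanism reduces problems such as cybersquatting and the registration of variant domain names to defraud users.

However, the JET Guidelines do not specify how variant domain names are to be resolved. For a small number of variant domain names, we can enumerate them one by one with the existing alias mechanism of the domain name system. Some Chinese domain names, however, have a very large number of variants. For example, the name 【中】【華华】【民】【國国囯】【經经経】【濟济済】【部】【標标】【準准】【檢检検】【驗验験】【局】 has 1×2×1×3×3×3×1×2×2×3×3×1 = 1,944 variant domain names. While deliberating on a variant table dedicated to Chinese domain names, we also designed a protocol for resolving variant domain names. Using the variant table as a reference, we construct an indexing function whose main goal is to assign the same variant index to all variant domain names of one domain name. When a domain name is registered, we also place in the domain name system a variant domain name record, keyed by the variant index, that points to the domain name actually registered. When an application finds that a domain name cannot be resolved, it can try to query the variant domain name records and thereby find the registered domain name.

In this research on internationalized domain name resolution systems, we extended the functionality of the existing domain name system. We first provided internationalized domain name resolution in multiple encodings and then, addressing the needs of Chinese-speaking communities, provided resolution of Chinese variant domain names. While extending the system, we also took backward compatibility and the reuse of existing software into account. This experience offers a useful reference for software engineering research on extending system functionality. Our results also provide an important reference for the future internationalization, standardization, and Chinese localization of URIs and other naming services.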
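The ACE conversion that IDNA requires of applications can be tried with Python's built-in `idna` codec, which implements the 2003 IDNA specification (nameprep followed by Punycode). This is only an illustrative sketch; the helper names `to_ace` and `from_ace` are assumptions introduced here and do not come from the thesis.

```python
# Illustrative sketch only: helper names are hypothetical, not from the thesis.
# Python's built-in "idna" codec applies nameprep and Punycode label by label.


def to_ace(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII Compatible Encoding."""
    return domain.encode("idna").decode("ascii")


def from_ace(ace: str) -> str:
    """Convert an ACE domain name back to Unicode."""
    return ace.encode("ascii").decode("idna")


print(to_ace("bücher.example"))      # xn--bcher-kva.example
print(to_ace("臺灣大學.tw"))          # an "xn--" label followed by ".tw"
print(from_ace(to_ace("臺灣大學.tw")))
```

Plain LDH names pass through unchanged, which is why ACE strings can be stored on unmodified domain name servers.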

Keywords

Localization, Han character variants, Internationalized Domain Names

Parallel Abstract


In recent years, many attempts have been made to lower the linguistic barriers for non-native English speakers wishing to access the Internet. However, traditional Internet domain names are restricted to ASCII letters, digits, and hyphens, abbreviated as LDH. In 1999, Internationalized Domain Names (IDNs) were introduced to allow an individual or organization to register a domain name in any major language, from Chinese to Russian to Arabic. In March 2003, the IETF published three RFC (Request for Comments) documents, referred to as IDNA, nameprep, and punycode, as the Internet standard for IDNs. These documents specify a name-preparation process for converting a Unicode IDN to an ASCII Compatible Encoding (ACE) string. Once an IDN is registered in an IDN registry, the registry stores the ACE string in the domain name server. When an IDNA-aware application looks for a host using its IDN, the application converts the IDN to an ACE string so that the current DNS can resolve the ACE string into the host's IP address. However, some domain name strings embedded in multilingual content do not have any charset encoding tag, so they cannot be appropriately converted to the corresponding Unicode IDNs and thus to the ACE strings. Although IDNA can use the current DNS without modifying domain name servers and resolvers, it does require that an IDNA-compliant module be integrated into every Internet application in order to process IDNs properly. Through our participation in IDN-related activities, we observed that many Internet applications allow the use of non-ASCII characters in domain name slots. This motivated us to design an IDN server proxy architecture that provides IDN resolution in multiple encodings. In this architecture, ACE IDNs are stored in the domain name servers; hence, traditional domain name servers can be used without modification.
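The conversion step at the heart of such a server-side proxy might look like the sketch below. It is not the thesis's implementation: the function name, the candidate charset list, and the trial order are all assumptions made for illustration. It also makes the untagged-encoding problem visible, because the same bytes can decode successfully under more than one charset and the first guess that succeeds wins.

```python
# Hypothetical sketch of a proxy's conversion step; not the thesis's code.
# ASSUMPTION: the candidate charsets and their trial order are illustrative.
CANDIDATE_CHARSETS = ("utf-8", "big5", "gb2312")


def query_name_to_ace(raw: bytes, charsets=CANDIDATE_CHARSETS) -> str:
    """Return the ACE form of a query name received as raw bytes."""
    if all(byte < 0x80 for byte in raw):
        # Pure ASCII: an LDH name or an ACE string already; pass it through.
        return raw.decode("ascii")
    for charset in charsets:
        try:
            # Decode with the guessed charset, then apply IDNA 2003 ToASCII.
            return raw.decode(charset).encode("idna").decode("ascii")
        except UnicodeError:
            continue  # wrong guess or not a valid IDN; try the next charset
    raise ValueError("query name is not decodable in any candidate charset")


# A Big5-encoded query and a UTF-8-encoded query yield the same ACE name.
print(query_name_to_ace("大學.tw".encode("big5")))
print(query_name_to_ace("大學.tw".encode("utf-8")))
```

A GB2312-encoded name would need `charsets=("gb2312",)` here, since its bytes may also be valid Big5; a real proxy would need a better disambiguation policy than first match.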
An IDN server proxy, called Octopus, is employed on the domain name server side to assist the servers by providing non-ACE IDN resolution. On receipt of a DNS query packet, Octopus converts the non-ACE IDN to ACE. The ACE string is then forwarded to backend domain name servers (where the traditional domain names and ACE IDNs are stored) for further processing. Based on the design and implementation of Octopus, we initiated a Chinese Domain Name (CDN) trial service to further investigate the interoperability of Internet applications when CDNs are used. We studied several types of errors that cause unsuccessful WWW access via IDNs, such as improper web server configuration and generic multilingual text processing errors. Solutions were then developed, including the use of an IDN-aware web redirection server. While Internet services can be significantly improved by introducing IDNs, the use of characters that have similar appearances and/or meanings has the potential to cause confusion. The introduction of IDNs has raised serious consumer concerns about the likelihood of widespread user confusion and new opportunities for cybersquatting. IDNA does not address linguistic issues, such as Han character variants. Two Han characters are said to be variants of each other if they have the same meaning and are pronounced the same. A variant IDN derived from an IDN by replacing some characters with their variants should match the original IDN. In April 2004, the IETF published RFC 3743, referred to as the JET Guidelines, for the registration and administration of Chinese, Japanese, and Korean IDNs. The JET Guidelines suggest that zone administrators model the concept of equivalent IDLs (Internationalized Domain Labels) as an atomic IDL package based on zone-specific Language Variant Table (LVT) mechanisms. However, the Guidelines do not address various technical implementation issues. For example, an issue of scalability arises when the number of variant IDLs is large.
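The scalability issue can be made concrete with the example name from the abstract above, 中華民國經濟部標準檢驗局, whose per-character variant sets multiply to 1,944 labels. The short sketch below counts and enumerates them; the function names are introduced here for illustration only.

```python
from itertools import product
from math import prod

# Variant sets per character position, copied from the example in the abstract.
POSITIONS = ["中", "華华", "民", "國国囯", "經经経", "濟济済",
             "部", "標标", "準准", "檢检検", "驗验験", "局"]


def count_variants(positions) -> int:
    """Number of variant labels: the product of the variant-set sizes."""
    return prod(len(variants) for variants in positions)


def enumerate_variants(positions):
    """Yield every variant label, one character choice per position."""
    for choice in product(*positions):
        yield "".join(choice)


print(count_variants(POSITIONS))            # 1944
print(next(enumerate_variants(POSITIONS)))  # 中華民國經濟部標準檢驗局
```

Listing each of these as a separate alias record is what becomes impractical, which motivates an index shared by all variants.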
We propose a resolution protocol that resolves the variant IDLs in an IDL package into its registered IDL with the help of a small number of VarIdx RRs (resource records). In this process, each VarIdx RR uses a variant expression to enumerate some of the variant IDLs. An indexing function is designed to give the same variant index to the variant IDLs enumerated by a variant expression. This allows Internet applications to use one of the variant IDLs to look up the VarIdx RRs and find the registered IDL. We have studied different indexing functions. Experimental results show that, although individual zones may have their own rules about permitted characters and the variant relationships of these characters, an indexing function does exist for global use. We set up a redirection service that enables users to access the WWW via variant IDNs. The domain name servers are configured to return the IP address of the redirection server to the client when the queried domain name is not registered. The user request is then sent to the redirection server, which computes the variant index of the unregistered label and looks up the VarIdx RRs. If the right VarIdx RR is located, the server redirects the user request to the new URL by replacing the variant IDL with the registered IDL. Experimental results show that our resolution protocol successfully enables Internet access via variant IDNs. In this research, we first extend the functionality of the current DNS by providing IDN resolution in multiple encodings, and then extend it further by providing variant IDN resolution. Our study also suggests useful practices for software vendors developing IDNA-compliant Internet applications. While extending the functionality of DNS, we retain backward compatibility and reuse existing software as much as possible. Our study provides a useful reference for software engineers extending the functionality of a widely deployed system.
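The lookup performed by the redirection server can be sketched roughly as follows. Everything here is a simplifying assumption: the abstract does not give the actual indexing function or the VarIdx RR format, so this sketch maps each character to the first member of a toy variant set and uses a plain dictionary in place of VarIdx RRs with variant expressions.

```python
# ASSUMPTION: a toy variant table; real tables are zone-specific (RFC 3743).
VARIANT_SETS = ["臺台", "灣湾", "學学"]

# Map every character to the first member of its variant set.
CANONICAL = {ch: group[0] for group in VARIANT_SETS for ch in group}


def variant_index(label: str) -> str:
    """Hypothetical indexing function: variant labels share one index."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)


# Stand-in for VarIdx RRs: variant index -> registered IDL.
REGISTERED = "台灣大學"
VARIDX_RECORDS = {variant_index(REGISTERED): REGISTERED}


def rewrite_host(host: str, records=VARIDX_RECORDS) -> str:
    """Replace any variant label in host with its registered IDL."""
    return ".".join(
        records.get(variant_index(label), label) for label in host.split(".")
    )


print(rewrite_host("台湾大学.tw"))  # 台灣大學.tw
```

Labels with no matching record, such as `tw`, are left untouched, so the rewrite only affects names that belong to a registered package.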

References


[99] X. Lee, N. W. Hsu, X. Deng, E. Chen, H. Zhang, and G. Sun, "Traditional and Simplified Chinese Conversion," November 2001.
[91] M. Duerst, "Internationalized Domain Names in URIs," Internet Draft, July 2002.
[65] J. Jung, A. Berger, and H. Balakrishnan, "Modeling TTL-based Internet Caches," International Conference on Computer Communications, San Francisco CA, March 2003.
[105] HKNIC - Hong Kong Network Information Center. http://www.hkdnr.net.hk/.
[12] K. Konishi, K. Huang, H. Qian, and Y. Ko, "Joint Engineering Team (JET) Guidelines for Internationalized Domain Names (IDN) Registration and Administration for Chinese, Japanese, and Korean," RFC 3743, April 2004.
