Unicode Encoding Awareness Through Unified String Adapter

A Google Summer of Code 2011 Project Proposal for the Boost C++ Libraries

1. Abstract

2. Personal Details

2.1 Availability

3. Background Information

3.1 Education Background

3.2 Programming Background

3.3 Motivation

3.4 Skills Rating

4. Research

4.1 Background Research

4.2 Observation

5. Proposal

5.1 Project Description

5.2 Objective

5.3 Project Milestone

1. Abstract

The Unicode String Adapter is a template class that wraps any string type to give the string encoding semantics. The adapter enforces encoding correctness through type safety when strings are passed between libraries, and enables transparent conversion between wrapped string types that use different encodings.

2. Personal Details

Name: Chen Ruo Fei

University: National University of Singapore

Course: Computer Engineering

Degree: B.Sc.

Email: crf@hypershell.org

Homepage: http://crf.scriptmatrix.net

2.1 Availability

The semester break starts soon and lasts until August, so I will not be occupied with coursework during this period. I will not be travelling overseas, nor will I take a part-time job. My availability might be slightly reduced from mid-July, as I will need to prepare for the new semester and for my final year project.

I have already begun discussions with my prospective mentor, Chad Nelson. I will begin coding once the application is accepted on 25 April. For the time being, I am focusing on the design of the library, as there are many aspects to consider.

3. Background Information

3.1 Education Background

Although I chose Computer Engineering as my major when I entered university, it was only recently that I found my true interest to be in Computer Science. I currently plan to apply to change my major, even though I will be in my fourth and final year and it is quite late to do so. I have taken several CS courses as extra modules, and a number of other CS modules are common prerequisites for both majors. I joined the Special Programme in Computing during my second year and was active in the special interest group for programming languages.

3.2 Programming Background

I started teaching myself programming in 2003, learning PHP and Perl for general web programming. I first encountered object-oriented programming when I learned Java in 2006. Since then, I have learned many other programming languages and CS topics during my university studies. I started using Python a year ago to develop web applications with Django, and it was only half a year ago that I really dived into C++.

I am particularly interested in server-side web programming and in designing new programming languages. I have spent most of my time working out design patterns and better ways to build highly scalable, highly modular websites. My interest is in finding new ways of developing web applications, as I am not satisfied with the way web applications are written today. I am currently working on a personal project to build a web server in C++ with a CSP-like concurrency model similar to Kamaelia.

I chose to contribute to Boost partly because of my recent involvement in C++. I chose C++ over C for my project because C++ lets me use object-oriented techniques while remaining low-level enough to manipulate raw memory easily. Given my interest in programming languages, I am more interested in improving C++ for the benefit of others than in building applications myself. Boost has in many ways had a significant influence on the development of C++, as can be seen in TR1 and C++0x. For that reason, I hope that my contribution can in future be adopted as part of the C++ standard, by first becoming a useful library in Boost.

3.3 Motivation

I am interested in introducing Unicode string types into C++ because, even now, it is surprisingly hard to represent a Unicode string. Unicode has become increasingly common, and encoding mismatches have caused numerous bugs simply because of wrong assumptions. As a C++ developer who also builds other libraries, I feel the need for such Unicode string types.

After the GSoC project ends, I hope to continue working on this project, ideally after receiving good feedback from the Boost community. As this topic is quite controversial and I lack experience, the library might need to be rebuilt to suit practical uses. However, I am willing to continue working on it whatever the outcome of GSoC, provided I receive support from the community.

3.4 Skills Rating

The following is my self-assessment of my C++ skills, with 0 being no experience and 5 being expert:

C++: 4

C++ Standard Library: 4

Boost C++ Libraries: 3

Subversion: 3

Git: 4

I use vim as my C++ IDE, and I have basic knowledge of Doxygen.

4. Research

4.1 Background Research

The suggestion to create a Unicode string type was first raised on the Boost mailing list in January 2011[1], and it then received intense discussion that spanned across several threads[2][3]. In brief, the heated debates were mainly about different ways to ensure consistency between the encoding that library code expects when it accepts strings and the actual encoding of the strings passed by users. The problem arises because a small minority of developers use std::string with encodings other than UTF-8, and the implicit assumption that std::string is UTF-8 encoded brings inconsistency and causes numerous bugs that are outside the scope of the Boost library.

From the discussion, several proposals were made to resolve the inconsistencies in std::string encoding:

  1. Create new classes that wrap std::string, std::u16string, and std::u32string, one for each encoding, and ensure encoding correctness simply through C++ type safety. The classes are tentatively called the utf*_t classes (a name that many disliked). Proposed by Chad Nelson, with a working prototype available.
  2. Continue to strongly enforce the assumption that all std::strings are UTF-8 encoded. Deprecate other encodings in std::string, or make them hard to use.
  3. Reinvent std::string by introducing a new string class. The new string class is proposed to be immutable and to delegate encoding awareness to templated view<> classes/functions that wrap the underlying string.

Unfortunately, the discussions ended without a meaningful conclusion, because several groups held strong opinions on different ways to solve the problem and could not agree with each other. The community questioned the need for yet another string class, as many other Unicode string classes are available but have failed to be adopted as the standard string class. With std::string already being the sole standard that is widely used, it is believed that a new string class designed to replace std::string will probably not succeed.

Even though the subject was controversial enough to halt the discussion, the length of the discussion and the active participation show the importance of this topic and the need to solve the problem. The motivation of this project is to identify the main points in the discussions and find a general solution for them.

4.2 Observation

Careful observation shows that there is a flaw in the arguments of the previous discussions. Most of the discussion was actually about the feasibility of creating new string classes rather than about encoding awareness itself. It was believed that the only way to create encoding awareness is to create a new string class that happens to contain encoding information, and that such a string class is fundamentally incompatible with the existing string classes.

However, strings and encodings are two different concepts that deserve separate abstractions. People oppose encoding-aware strings because a string is supposed to be a dumb container that carries raw bytes and does not care about the meaning of those bytes. Encoding, on the other hand, works one layer above the string to make sure that the raw bytes have a consistent meaning.

As a result, the Unicode string problem can be solved by simply introducing a string adapter class that wraps existing string classes. The string adapter class uses the decorator design pattern to bring encoding awareness to existing string classes instead of replacing them, so that the two complement each other. While the string class focuses on manipulating raw bytes, such as string creation and concatenation, the string adapter serves as an interface specification that lets library writers make sure that the encoding of the strings provided is consistent with the encoding intended.
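The decorator idea can be illustrated with a minimal sketch. The names encoded_string, utf8_tag, latin1_tag, and byte_length are hypothetical and chosen only for illustration; they are not part of any existing library or of the final design.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical encoding tags; the names are illustrative only.
struct utf8_tag {};
struct latin1_tag {};

// A thin decorator: it stores an existing string type unchanged and
// attaches an encoding tag to it at the type level.
template <typename StringT, typename EncodingTag>
class encoded_string {
public:
    explicit encoded_string(StringT raw) : raw_(std::move(raw)) {}

    // The wrapped string stays accessible for byte-level manipulation.
    const StringT& raw() const { return raw_; }

private:
    StringT raw_;
};

typedef encoded_string<std::string, utf8_tag> utf8_string;
typedef encoded_string<std::string, latin1_tag> latin1_string;

// A library function that states its expected encoding in its signature.
std::size_t byte_length(const utf8_string& s) { return s.raw().size(); }

// Both types wrap std::string, yet they cannot be mixed up by accident.
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "differently tagged strings do not convert implicitly");
```

In this sketch, passing a latin1_string to byte_length is a compile-time error, whereas two plain std::string values with different encodings would be indistinguishable to the compiler.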

Fortunately, the string adapter pattern already exists in existing libraries, such as the Oak Circle Unicode Toolkit[4], Boost.Text[5], and Boost.Filesystem[6].

4.2.1 Oak Circle Unicode Toolkit

Chad Nelson’s Oak Circle Unicode classes have the following signature:

class utf8_t  : public specialized_string_t< utf8_t,  std::basic_string<char> >

class utf16_t : public specialized_string_t< utf16_t, std::basic_string<char16_t> >

class utf32_t : public specialized_string_t< utf32_t, std::basic_string<char32_t> >

where char16_t and char32_t are custom typedefs for 16-bit and 32-bit character types when not compiling as C++0x.

Notice that the classes are all derived from a template called specialized_string_t, which has a generic interface for accessing the underlying string. This makes it possible to add Unicode encoding semantics to any string class that only handles raw bytes, by creating new template instances following the pattern specialized_string_t<ClassName, RawStringContainerClass>.
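The derivation pattern above is an instance of the curiously recurring template pattern (CRTP). The following is a minimal sketch of how such a base might look. It is not the Oak Circle implementation: the names specialized_string, my_utf8, and my_utf32 and all member functions are assumptions made for illustration, and C++11 character types are assumed to be available.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical CRTP base, loosely modelled on the pattern described above.
// Member names are assumptions, not the toolkit's real interface.
template <typename Derived, typename RawString>
class specialized_string {
public:
    typedef RawString raw_type;

    specialized_string() {}
    explicit specialized_string(RawString raw) : raw_(std::move(raw)) {}

    // Generic access to the underlying raw container.
    const RawString& raw() const { return raw_; }

    // Appending only accepts the same derived type, so strings with
    // different encodings cannot be concatenated by accident.
    Derived& operator+=(const Derived& other) {
        raw_ += other.raw();
        return static_cast<Derived&>(*this);
    }

private:
    RawString raw_;
};

// Each encoding becomes a distinct type over its raw container.
class my_utf8 : public specialized_string<my_utf8, std::string> {
    typedef specialized_string<my_utf8, std::string> base;
public:
    my_utf8() {}
    explicit my_utf8(std::string raw) : base(std::move(raw)) {}
};

class my_utf32 : public specialized_string<my_utf32, std::u32string> {
    typedef specialized_string<my_utf32, std::u32string> base;
public:
    my_utf32() {}
    explicit my_utf32(std::u32string raw) : base(std::move(raw)) {}
};
```

Because operator+= takes a const Derived&, an expression that appends a my_utf32 to a my_utf8 does not compile in this sketch, while the shared base keeps the raw-container access code in one place.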

4.2.2 Boost.Text

Similarly, Anders Dalvander’s Boost.Text has a basic_text template with encoding as the template parameter:

template <typename encoding>

class basic_text;

The class basic_text actually uses std::basic_string<typename encoding_type::codeunit_type> as its underlying container. The class could be generalized further by adding the string type as a template parameter, which would allow it to wrap different string types.
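The generalization suggested above might look like the following sketch. It is not Boost.Text code: generalized_text, utf8_encoding, utf16_encoding, and the member names are hypothetical, and only the nested codeunit_type name is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical encoding descriptors exposing their code unit type.
struct utf8_encoding  { typedef char     codeunit_type; };
struct utf16_encoding { typedef char16_t codeunit_type; };

// The container defaults to std::basic_string over the encoding's code
// unit type, but any container with a matching value_type can be used.
template <typename Encoding,
          typename StringT =
              std::basic_string<typename Encoding::codeunit_type> >
class generalized_text {
public:
    typedef typename Encoding::codeunit_type codeunit_type;

    explicit generalized_text(const StringT& s) : storage_(s) {}

    std::size_t codeunit_count() const { return storage_.size(); }
    const StringT& storage() const { return storage_; }

private:
    // Reject containers whose elements do not match the encoding.
    static_assert(
        std::is_same<typename StringT::value_type, codeunit_type>::value,
        "container element type must match the encoding's code unit type");

    StringT storage_;
};
```

With this shape, generalized_text<utf8_encoding> behaves like the single-parameter form, while generalized_text<utf16_encoding, std::vector<char16_t> > wraps a different container without changing the encoding logic.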

4.2.3 boost::string (Boost.Chain)

This pattern is also somewhat similar to the view<> concept mentioned by Dean Michael Berris in the boost::string discussion. Dean's view concept has the signature class view<Encoding> and wraps the proposed boost::string as its underlying container. Notice that the view template can be generalized to wrap other strings, such as std::string, by adding one template parameter to make it class view<Encoding, StringT>. In the boost::string discussion, it was also generally agreed that a string class should really be just a dumb container that stores raw bytes and does not care about the meaning of those bytes. This is why even the newly proposed boost::string class (now called Boost.Chain) does not attempt to add Unicode semantics. Instead, the view<> class is used one level above boost::string to add encoding semantics to the raw string container.
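To show what "one level above" can mean in practice, here is a minimal sketch of a two-parameter view over a raw byte container. Only the shape view<Encoding, StringT> comes from the discussion above; the utf8_encoding tag and the code_point_count member are assumptions for illustration, and the sketch assumes well-formed UTF-8 input without validating it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct utf8_encoding {};  // hypothetical encoding tag

// Primary template is left undefined; only encodings that provide a
// specialization can be viewed.
template <typename Encoding, typename StringT>
class view;

template <typename StringT>
class view<utf8_encoding, StringT> {
public:
    // Non-owning: the viewed string must outlive the view.
    explicit view(const StringT& s) : s_(&s) {}

    // Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
    // Assumes well-formed UTF-8; no validation is performed.
    std::size_t code_point_count() const {
        std::size_t n = 0;
        for (typename StringT::const_iterator it = s_->begin();
             it != s_->end(); ++it) {
            if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80) {
                ++n;
            }
        }
        return n;
    }

private:
    const StringT* s_;
};
```

The raw container still reports its size in bytes, while the view answers the encoding-level question. For the four-byte string "a\xE2\x82\xAC" (the letter a followed by the euro sign), the view reports two code points.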

4.2.4 Boost.Filesystem

This pattern can also be seen in Boost.Filesystem, which uses a special class to represent a path rather than the raw std::basic_string<> variants. The path class has the following signature:

template <class StringT, class PathTraits>

class basic_path;

where StringT is the type of the internal raw string container, and PathTraits contains two conversion functions that know how to convert one type of external (incoming) string into the type of its underlying string container. This allows Boost.Filesystem's developers to choose a consistent internal string format, such as the 16-bit wchar_t, while still being able to compare it against other string formats, such as the 8-bit char.

There is, however, one limitation in the basic_path design: the path traits can only convert between two string types, rather than from an arbitrary external string type to the one internal string type. For example, if the developer chose path traits that convert between 8-bit and 16-bit character strings, it is not possible to also convert a 32-bit character string into that path type.
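One way to lift this restriction is to let the traits offer one overload per external string type. The sketch below is not Boost.Filesystem code: utf8_internal_traits, any_source_path, append_utf8, and native are hypothetical names, 8-bit input is assumed to be UTF-8 already, and ill-formed input is not validated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends one code point to a UTF-8 string. Sketch only: surrogate
// values and values above U+10FFFF are not rejected.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical traits: one internal type, one overload per external type.
struct utf8_internal_traits {
    typedef std::string internal_type;

    // 8-bit input is assumed to be UTF-8 already.
    static internal_type to_internal(const std::string& s) { return s; }

    static internal_type to_internal(const std::u32string& s) {
        std::string out;
        for (std::size_t i = 0; i < s.size(); ++i) append_utf8(out, s[i]);
        return out;
    }

    static internal_type to_internal(const std::u16string& s) {
        std::string out;
        for (std::size_t i = 0; i < s.size(); ++i) {
            char32_t cp = s[i];
            // Combine a surrogate pair into a single code point.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size() &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
                ++i;
            }
            append_utf8(out, cp);
        }
        return out;
    }
};

// A path-like class that accepts any external type the traits support.
template <typename Traits>
class any_source_path {
public:
    template <typename ExternalString>
    explicit any_source_path(const ExternalString& s)
        : internal_(Traits::to_internal(s)) {}

    const typename Traits::internal_type& native() const { return internal_; }

private:
    typename Traits::internal_type internal_;
};
```

Supporting a further external string type then means adding one more to_internal overload, without touching the path class itself.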

5. Proposal

5.1 Project Description

In this project I will create a Unicode string adapter template that can wrap any existing or future string class to make it encoding-aware.

The Unicode string adapter class has the following signature:

template < typename StringT, typename StringTraits = …,
           typename EncodingTraits = …, typename Policy = … >
class unicode_string_adapter;

where

The template provides the following benefits against using a raw string type that is specified in the StringT template parameter:

5.2 Objective

5.2.1 Primary Objectives

The project aims to achieve the following primary objectives during the 12-week project timeline:

  1. To build a working unicode_string_adapter template that provides efficient Unicode code point manipulation and conversion between different string types.
  2. To build a specialized implementation of the string traits for std::basic_string<>, and make sure that the default string adapters with UTF-8, UTF-16, and UTF-32 encodings work correctly with the std::basic_string<> variants.
  3. To build basic Unicode string processing utilities that work on unicode_string_adapter. I might consider improving the Boost String Algorithms Library[11] to make sure it works correctly on Unicode strings. New string utilities will also be built, such as a Unicode-aware hash map class that has Unicode strings as keys.
  4. To provide use cases by modifying the interfaces of existing libraries, such as Boost.Filesystem, in forks so that they use the string adapter instead of raw strings. The results will then be analysed to show the number of lines of code saved and the potential bugs that can be prevented.
  5. To provide documentation and tutorials on using the string adapter efficiently.
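As a rough illustration of the hash map utility mentioned in objective 3, the sketch below keys a std::unordered_map on a stand-in for a UTF-8 adapter. The names utf8_key, utf8_key_hash, and utf8_counter are hypothetical, and the sketch compares and hashes raw code units only, so it performs no Unicode normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a UTF-8 string adapter used as a map key.
class utf8_key {
public:
    explicit utf8_key(std::string raw) : raw_(std::move(raw)) {}
    const std::string& raw() const { return raw_; }

    // Equality on code units: canonically equivalent strings in
    // different normalization forms compare as different keys.
    friend bool operator==(const utf8_key& a, const utf8_key& b) {
        return a.raw_ == b.raw_;
    }

private:
    std::string raw_;
};

// Hashes the underlying bytes, which is consistent with operator==.
struct utf8_key_hash {
    std::size_t operator()(const utf8_key& k) const {
        return std::hash<std::string>()(k.raw());
    }
};

typedef std::unordered_map<utf8_key, int, utf8_key_hash> utf8_counter;
```

Because the key type fixes the encoding, keys in different encodings cannot end up in the same map. A fully Unicode-aware version would additionally have to decide how to treat canonically equivalent strings, for example a precomposed "é" versus "e" followed by a combining accent, which this sketch treats as two distinct keys.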

5.2.2 Optional Objectives

The project also has the following optional objectives, to be pursued if the primary objectives are finished early or if time allows:

  1. To provide string traits implementations for other string types outside of Boost, such as QString. The string libraries in these projects will be studied.
  2. To improve the Boost.Unicode library, which powers the encoding backend of unicode_string_adapter.

5.3 Project Milestone

Present - 25 April

Design the structure of the template and the classes involved. Resolve the various design issues.

26 April - 9 May

Start coding a read-only implementation for UTF-8 encoding in std::string.

10 May - 16 May

First draft review on the Boost mailing list. Receive feedback and discuss design issues.

17 May - 23 May

Discussion ends. Change the existing code according to the feedback, possibly requiring a complete rewrite.

24 May - 30 May

Second draft review. Final decisions have to be made on major design issues, as they cannot be changed later on.

31 May - 13 June

Make changes according to the final design decisions. Start working on mutating functions and the UTF-16/32 encodings.

14 June - 20 June

Third draft review with almost working code.

21 June - 27 June

Make changes according to feedback, mostly fixing code that follows bad programming practices.

28 June - 4 July

Start working on the Unicode string utilities library while continuing to work on the existing code base.

5 July - 12 July

Fourth draft review together with the string utilities library.

13 July - 16 July

GSoC midterm break.

17 July - 25 July

Work on use cases by forking C++ libraries and changing their interfaces.

26 July - 8 August

Prepare documentation and follow the procedures to submit the library for official Boost review.

9 August - 16 August

Boost review starts. Receive feedback on improvements to be made after the GSoC period ends.


[1] Always treat std::strings as UTF-8?, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62

[2] What will string handling in C++ look like in the future, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda

[3] [string] proposal, Boost Mailing List Discussion. http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0

[4] The Oak Circle C++ (Unicode) Toolkit, by Chad Nelson. http://www.oakcircle.com/toolkit.html

[5] Boost.Text, by Anders Dalvander. http://www.dalvander.com/boost_text/

[6] Boost Filesystem Library. http://www.boost.org/doc/libs/1_41_0/libs/filesystem/doc/index.htm

[7] QString for Qt. http://doc.qt.nokia.com/latest/qstring.html

[8] wxString. http://docs.wxwidgets.org/stable/wx_wxstring.html

[9] SGI rope. http://www.sgi.com/tech/stl/Rope.html

[10] ICU Unicode String. http://icu-project.org/apiref/icu4c/classUnicodeString.html

[11] Boost String Algorithms Library. http://www.boost.org/doc/libs/1_46_1/doc/html/string_algo.html