A Google Summer of Code 2011 Project Proposal for the Boost C++ Libraries
The Unicode String Adapter is a template class that wraps any string type to give it encoding semantics. The adapter ensures encoding correctness through type safety when strings are passed between libraries, and enables transparent conversion between wrapped string types that use different encodings.
Name: Chen Ruo Fei
University: National University of Singapore
Course: Computer Engineering
Degree: B. Sc
My university's semester break starts soon and lasts until August, so I will not be occupied with coursework during this period. I will not be travelling overseas or taking a part-time job. My availability may be slightly reduced from mid-July, as I will need to prepare for the start of the new semester and for my final year project.
I have already begun discussions with my prospective mentor, Chad Nelson. I will begin coding once the application is accepted on 25 April. For the time being, I am focusing mainly on the design of the library, as there are many aspects to consider.
Even though I chose Computer Engineering as my major when I entered university, it was only recently that I found my true interest is in Computer Science. I currently plan to apply to change my major, even though I will be in my fourth and final year and it is quite late to do so. I have taken several CS courses as extra modules, and a number of other CS modules are common prerequisites for both majors. I joined the Special Programme in Computing during my second year and was active in the special interest group for programming languages.
I started teaching myself programming in 2003, when I learned PHP and Perl for general web programming. I first learned about object-oriented programming when I learned Java in 2006. After that, I learned many other programming languages and CS topics throughout my university life. I started using Python a year ago to develop web applications with Django, and it was only half a year ago that I really dived into C++.
I am particularly interested in server-side web programming and in designing new programming languages. I spend most of my time working out design patterns and better ways to build highly scalable and highly modular websites. My interest is in finding new ways of developing web applications, as I am not satisfied with the way web applications are written today. Currently I am working on a personal project to build a web server in C++ with a CSP-like concurrency model similar to Kamaelia.
Part of the reason I chose to contribute to Boost is my recent involvement with C++. I chose C++ over C for my project because C++ lets me use object-oriented techniques while remaining low-level enough to manipulate raw memory easily. Given my interest in programming languages, I am more interested in improving C++ for the benefit of others than in building applications myself. Boost has had significant influence on the development of C++, as can be seen in TR1 and C++0x. I therefore hope that my contributions can one day be adopted into the C++ standard, by first building useful libraries in Boost.
I am interested in introducing Unicode string types into C++ because, even now, it is surprisingly hard to represent a Unicode string. Unicode has become increasingly common, and encoding mismatches have caused numerous bugs that stem simply from wrong assumptions. As a C++ developer who also builds libraries, I feel the need for such Unicode string types.
After the GSoC project ends, I hope to continue working on this project, provided it receives good feedback from the Boost community. Because the topic is quite controversial and I lack experience, the library may need to be rebuilt to suit practical uses. However, I am willing to continue working on it regardless of the GSoC outcome, as long as I have the community's support.
The following is a self-assessment of my C++ skills, with 0 being no experience and 5 being expert:
C++ Standard Library
Boost C++ Libraries
I use vim as my C++ IDE, and I have basic knowledge of Doxygen.
The suggestion to create a Unicode string type was first raised on the Boost mailing list in January 2011, and it prompted intense discussion spanning several threads. In brief, the heated debate centred on different ways to ensure consistency between the encoding that library code expects when it accepts strings and the actual encoding of the strings users pass in. The problem arises because a small minority of developers use std::string with encodings other than UTF-8, so an implicit assumption that std::string is UTF-8 introduces inconsistency and causes numerous bugs that are outside the scope of the Boost libraries.
From the discussion, several proposals have been made to solve the inconsistencies of std::string encoding:
Unfortunately, the discussions ended without a meaningful conclusion, because several groups held strong opinions on different ways to solve the problem and could not agree with each other. The community questioned the need for yet another string class, given that many other Unicode string classes exist but none has been adopted as the standard string class. With std::string already the sole, widely used standard, it is believed that a new string class designed to replace std::string will probably not succeed.
Even though the subject was controversial enough to halt the discussion, its length and the active participation show the importance of this topic and the need to solve the problem. The motivation of this project is to identify the main points of the discussions and find a general solution for them.
A careful look shows a flaw in the arguments of the previous discussions. Most of the debate was actually about the feasibility of creating new string classes rather than about encoding awareness itself. It was assumed that the only way to achieve encoding awareness is to create a new string class that carries encoding information, and that such a class is fundamentally incompatible with the existing string classes.
However, strings and encodings are two different concepts that deserve separate abstractions. People oppose encoding-aware strings because a string is supposed to be a dumb container that carries raw bytes and does not care about the meaning of those bytes. Encoding, on the other hand, works one layer above the string to make sure that the raw bytes have a consistent meaning.
As a result, the Unicode string problem can be solved simply by introducing a string adapter class that wraps existing string classes. The string adapter uses the decorator design pattern to bring encoding awareness to existing string classes instead of replacing them, so the two complement each other. While the string class focuses on manipulating raw bytes, such as string creation and concatenation, the string adapter serves as an interface specification that lets library writers make sure the encoding of the strings provided is consistent with the encoding they intend.
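The separation described above can be illustrated with a minimal sketch. It is not the proposed library's design: the names `encoded_string`, `utf8_tag`, `latin1_tag`, and `byte_length` are hypothetical, chosen only to show how wrapping an unchanged raw string type lets the type system catch an encoding mismatch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags; the names are illustrative only.
struct utf8_tag {};
struct latin1_tag {};

// Minimal adapter: stores any raw string type unchanged and records
// the encoding of its contents in the adapter's own type.
template <typename StringT, typename EncodingTag>
class encoded_string {
public:
    explicit encoded_string(StringT raw) : raw_(std::move(raw)) {}
    const StringT& raw() const { return raw_; }  // the undecorated container
private:
    StringT raw_;
};

// A library function states the encoding it expects in its signature.
inline std::size_t byte_length(const encoded_string<std::string, utf8_tag>& s) {
    return s.raw().size();
}

inline std::size_t example_usage() {
    // "café" encoded as UTF-8: 'c', 'a', 'f', 0xC3, 0xA9
    encoded_string<std::string, utf8_tag> name(std::string("caf\xC3\xA9"));
    // encoded_string<std::string, latin1_tag> other(std::string("caf\xE9"));
    // byte_length(other);  // would not compile: the mismatch is a type error
    return byte_length(name);
}
```

The raw std::string keeps its usual role as a byte container, while the wrapper only adds a compile-time statement about what those bytes mean.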
4.2.1 Oak Circle Unicode Toolkit
Chad Nelson’s Oak Circle Unicode classes have the following signature:
class utf8_t : public specialized_string_t< utf8_t, std::basic_string<char> >
class utf16_t : public specialized_string_t< utf16_t,
class utf32_t : public specialized_string_t< utf32_t,
where char16_t and char32_t are custom typedefs for 16-bit and 32-bit character types when not compiling as C++0x.
Notice that the classes are all derived from a template called specialized_string_t, which provides a generic interface for accessing the underlying string. This makes it possible to add Unicode encoding semantics to any string class that handles only raw bytes, by creating new template instances following the pattern specialized_string_t<ClassName, RawStringContainerClass>.
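The derivation pattern above is an instance of the curiously recurring template pattern. The following sketch shows how such a base might preserve the derived type across operations. It is an assumption-laden illustration, not the Oak Circle toolkit's actual code: `specialized_string_sketch`, `my_utf8_sketch`, `my_utf32_sketch`, and their members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical CRTP base, loosely modelled on the pattern described above.
template <typename Derived, typename RawStringT>
class specialized_string_sketch {
public:
    explicit specialized_string_sketch(const RawStringT& raw) : raw_(raw) {}

    const RawStringT& raw() const { return raw_; }
    std::size_t code_units() const { return raw_.size(); }

    // Concatenation returns the derived type, so the result keeps
    // the same encoding-specific class as its operands.
    Derived operator+(const Derived& other) const {
        return Derived(raw_ + other.raw());
    }

private:
    RawStringT raw_;
};

// Each encoding-specific class names itself and its raw container.
class my_utf8_sketch
    : public specialized_string_sketch<my_utf8_sketch, std::string> {
public:
    explicit my_utf8_sketch(const std::string& raw)
        : specialized_string_sketch<my_utf8_sketch, std::string>(raw) {}
};

class my_utf32_sketch
    : public specialized_string_sketch<my_utf32_sketch, std::u32string> {
public:
    explicit my_utf32_sketch(const std::u32string& raw)
        : specialized_string_sketch<my_utf32_sketch, std::u32string>(raw) {}
};
```

Because `operator+` only accepts the same derived type, an attempt to concatenate `my_utf8_sketch` with `my_utf32_sketch` fails to compile in this sketch.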
Similarly, Anders Dalvander’s Boost.Text has a basic_text template with encoding as the template parameter:
template <typename encoding>
The class basic_text uses std::basic_string<typename encoding_type::codeunit_type> as its underlying container. It is possible to generalize the class further by adding the string type as a template parameter, which would allow it to wrap different string types.
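One way to picture that generalization is a second template parameter whose default reproduces the current behaviour. The sketch below is hypothetical and does not reflect Boost.Text's real interface: `basic_text_sketch` and the two encoding structs are invented here, and only the `codeunit_type` member name is taken from the description above.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical encoding descriptions exposing a codeunit_type member.
struct utf8_encoding_sketch  { typedef char     codeunit_type; };
struct utf16_encoding_sketch { typedef char16_t codeunit_type; };

// The second parameter defaults to the std::basic_string instantiation
// for the encoding's code unit type, but may be any other container.
template <typename Encoding,
          typename StringT = std::basic_string<typename Encoding::codeunit_type> >
class basic_text_sketch {
public:
    typedef Encoding encoding_type;
    typedef StringT  string_type;

    explicit basic_text_sketch(const StringT& s) : str_(s) {}
    const StringT& str() const { return str_; }

private:
    StringT str_;
};

inline bool example_usage() {
    // Default container: std::string for a char-based encoding.
    basic_text_sketch<utf8_encoding_sketch> a(std::string("abc"));
    // Custom container: the same encoding over a std::vector<char>.
    std::vector<char> buf(3, 'x');
    basic_text_sketch<utf8_encoding_sketch, std::vector<char> > b(buf);
    return a.str().size() == 3 && b.str().size() == 3;
}
```

With the default argument in place, existing uses that name only the encoding would keep their meaning under this assumption.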
4.2.3 boost::string (Boost.Chain)
This pattern is somewhat similar to the view<> concept mentioned by Dean Michael Berris in the boost::string discussion. Dean's view concept has the signature class view<Encoding> and wraps the proposed boost::string as its underlying container. Notice that the view template can be generalized to wrap other strings, such as std::string, by adding one template parameter to make it class view<Encoding, StringT>. The boost::string discussion also generally agreed that a string class should be just a dumb container that stores raw bytes and does not care about the meaning of those bytes. This is why even the newly proposed boost::string class (now called Boost.Chain) does not attempt to add Unicode semantics. Instead, the view<> class is used one level above boost::string to add encoding semantics to the raw string container.
This pattern can also be seen in Boost.Filesystem, which uses a special class to represent a path rather than the raw std::basic_string<> variants. The path class has the following signature:
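A rough sketch of the generalized two-parameter view may help make the layering concrete. It assumes a non-owning view and a hypothetical encoding policy with a single `count_code_points` function; none of these names come from the boost::string discussion itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding policy: interprets raw bytes as UTF-8.
struct utf8_policy_sketch {
    // Counts code points by skipping continuation bytes (10xxxxxx).
    // Assumes the input is valid UTF-8; no validation is performed.
    template <typename StringT>
    static std::size_t count_code_points(const StringT& s) {
        std::size_t n = 0;
        for (typename StringT::const_iterator it = s.begin(); it != s.end(); ++it) {
            if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80) ++n;
        }
        return n;
    }
};

// Non-owning view: the raw container stays a dumb byte store, and the
// view adds encoding-aware operations one level above it.
template <typename Encoding, typename StringT>
class view_sketch {
public:
    explicit view_sketch(const StringT& s) : str_(&s) {}
    std::size_t length() const { return Encoding::count_code_points(*str_); }
    const StringT& raw() const { return *str_; }
private:
    const StringT* str_;  // the referenced string must outlive the view
};

inline std::size_t example_usage() {
    const std::string bytes("caf\xC3\xA9");  // 5 bytes, 4 code points
    view_sketch<utf8_policy_sketch, std::string> v(bytes);
    return v.length();
}
```

The raw string reports five bytes through `size()`, while the view reports four code points, which is the distinction between the container layer and the encoding layer.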
template <class StringT, class PathTraits>
where StringT is the type of the internal raw string container, and PathTraits contains two conversion functions that know how to convert one type of external (incoming) string into the type of its underlying string container. This allows Boost.Filesystem's developers to choose a consistent internal string format, such as the 16-bit wchar_t, while still being able to compare it against another string format, such as the 8-bit char.
There is, however, one limitation in the basic_path design: the path traits can only convert between two string types, rather than from arbitrary external string types to one internal string type. For example, if the developer chose path traits that convert between 8-bit and 16-bit character strings, it is not possible to also convert a 32-bit character string into that path type.
In this project I will create a Unicode string adapter template that can wrap any existing or future string class to provide encoding awareness for its strings.
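To illustrate how that restriction could be lifted, the sketch below selects the conversion by the external string's character type instead of fixing a single pair of types. Everything here is hypothetical (`to_utf8_sketch`, `path_sketch`, `native`), it assumes UTF-8 as the internal format, and it performs no validation of surrogates or out-of-range values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical conversion traits: one specialization per external
// code unit type. The primary template is deliberately left undefined.
template <typename CharT> struct to_utf8_sketch;

template <> struct to_utf8_sketch<char> {
    // Assumes the incoming 8-bit string is already UTF-8.
    static std::string convert(const std::string& in) { return in; }
};

template <> struct to_utf8_sketch<char32_t> {
    // Encodes each UTF-32 code unit as 1 to 4 UTF-8 bytes (no validation).
    static std::string convert(const std::u32string& in) {
        std::string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            const unsigned long c = in[i];
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else if (c < 0x800) {
                out += static_cast<char>(0xC0 | (c >> 6));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                out += static_cast<char>(0xE0 | (c >> 12));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (c >> 18));
                out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return out;
    }
};

// One internal format, any external string type that has a traits
// specialization available.
class path_sketch {
public:
    template <typename CharT>
    explicit path_sketch(const std::basic_string<CharT>& external)
        : internal_(to_utf8_sketch<CharT>::convert(external)) {}
    const std::string& native() const { return internal_; }
private:
    std::string internal_;
};
```

Supporting another external type under this scheme means adding one more specialization, without touching `path_sketch` itself.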
The Unicode string adapter class has the following signature:
template < typename StringT, typename StringTraits = …,
typename EncodingTraits = …, typename Policy = … >
The template provides the following benefits over directly using the raw string type specified in the StringT template parameter:
5.2.1 Primary Objectives
In this project, I aim to achieve the following primary objectives during the 12-week project timeline:
5.2.2 Optional Objectives
The project also has the following optional objectives, to be pursued if the primary work finishes early or time allows:
Present - 25 April
Design the structure of the template and classes involved. Solve the various design issues.
26 April - 9 May
Start coding a read-only implementation for UTF-8 encoding in std::string.
10 May - 16 May
First draft review on the Boost mailing list. Receive feedback and discuss design issues.
17 May - 23 May
Discussion ends. Change the existing code according to feedback, possibly with a complete rewrite.
24 May - 30 May
Second draft review. Final decisions must be made on major design issues, as they cannot be changed later on.
31 May - 13 June
Make changes according to the final design decision. Start working on mutable functions and UTF-16/32 encoding.
14 June - 20 June
Third draft review with almost working code.
21 June - 27 June
Make changes according to feedback, working mostly on fixing code that shows bad programming practices.
28 June - 4 July
Start working on the Unicode string utilities library while continuing to work on the existing code base.
5 July - 12 July
Fourth draft review together with the string utilities library.
13 July - 16 July
GSoC midterm break.
17 July - 25 July
Work on use cases by forking C++ libraries and changing their interfaces.
26 July - 8 August
Prepare documentation and the procedures to submit for official Boost review.
9 August - 16 August
Boost review starts. Receive feedback on improvements to be made after the GSoC period ends.
Always treat std::strings as UTF-8?, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62
What will string handling in C++ look like in the future, Boost Mailing List Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda
[string] proposal, Boost Mailing List Discussion. http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0
Boost Filesystem Library. http://www.boost.org/doc/libs/1_41_0/libs/filesystem/doc/index.htm
ICU Unicode String. http://icu-project.org/apiref/icu4c/classUnicodeString.html
Boost String Algorithms Library. http://www.boost.org/doc/libs/1_46_1/doc/html/string_algo.html