Go 1.2 encoding.TextMarshaler and TextUnmarshaler

Russ Cox

July 2013

Abstract

For Go 1.2, we propose to add a new package “encoding” defining interfaces recognized by multiple encoders. In the long term, the “encoding” package should be similar in scope to “io”: primarily interfaces, importing few other packages.

The initial content of the package is two interfaces, TextMarshaler and TextUnmarshaler, implemented by objects that can convert to and from a UTF-8 text representation.

Discussion on golang-dev.

Background

(A separate proposal defines Marshaler and Unmarshaler for encoding/xml. Those interfaces cannot be implemented without importing “encoding/xml”.)

More generally, each time a new encoding is defined, the package defines its own Marshaler and Unmarshaler interfaces that must then be implemented by objects for which the default encoding is inappropriate. This process does not scale: time.Time already implements marshaling and unmarshaling methods for “encoding/gob” and “encoding/json”. Even if it could define methods for “encoding/xml” without the import problem, it is far from clear that we’d want to keep adding methods. We have also resisted adding marshaling and unmarshaling methods to other common types such as net.IP.


A common pair of interfaces agreed upon by many encoders provides a solution to both problems: the interfaces can be implemented without any required imports, and one pair of methods suffices to make types encodable in multiple encodings.

Proposal

Define a new package, imported as “encoding”, that defines TextMarshaler and TextUnmarshaler interfaces:

// Package encoding defines interfaces understood
// by multiple encoders.
package encoding

// A TextMarshaler is the interface implemented by
// an object that can marshal itself as UTF-8 text.
// The result must be valid UTF-8.
type TextMarshaler interface {
        MarshalText() (text []byte, err error)
}

// A TextUnmarshaler is the interface implemented by
// an object that can unmarshal a UTF-8 text description
// of itself.
type TextUnmarshaler interface {
        UnmarshalText(text []byte) error
}

Then, change both “encoding/json” and “encoding/xml” to recognize these interfaces if an object does not implement the encoding-specific marshaler/unmarshaler methods. In JSON, the serialization is a string value; in XML, the serialization is either an XML attribute or an XML element containing character data, depending on the presence of the “,attr” field tag.

In package “time”, define Time.MarshalText and (*Time).UnmarshalText, using the same RFC3339 encoding currently used for Time.MarshalJSON and (*Time).UnmarshalJSON.

In package “net”, define IP.MarshalText and (*IP).UnmarshalText.

For example, if IP.MarshalText returns the usual string form of an IP address, then the value net.IP{192, 168, 0, 10} will marshal as the text “192.168.0.10”, which the JSON encoder would frame as the JSON string value "192.168.0.10" and the XML encoder would frame as the XML element <IP>192.168.0.10</IP>. (Or, if the struct field holding the IP address had the field tag `xml:",attr"`, it would turn into an XML attribute on the outer element.)

Discussion

It is important to consider backwards compatibility of existing encodings, and having a TextMarshaler raises the question of having a BinaryMarshaler.

time.Time

Today, time.Time encodes to XML using a special case in “encoding/xml”. If we preserve the visible behavior, we can make time.Time encode using these interfaces and remove the special case from “encoding/xml”. Similarly, time.Time encodes to JSON using MarshalJSON and UnmarshalJSON methods. We cannot remove these methods during Go 1, but if we make MarshalText and UnmarshalText use the same text form currently used by JSON, we can remove the JSON-specific methods in Go 2. (We did plan ahead enough that “encoding/json” and “encoding/xml” use the same text encoding for times; otherwise these two constraints would be in conflict.)

net.IP

Today, net.IP encodes to XML as if it were bytes of text, being coerced to UTF-8 in the process. So net.IP{192,168,0,10} encodes to the Go string "\"<IP>\xc0\xa8\uFFFD&#xA;</IP>\"". The coercion of the NUL byte means that Unmarshal cannot reproduce the input.

Equally bizarre, net.IP encodes to JSON as if it were binary data, which JSON turns into base64. So net.IP{192,168,0,10} encodes to the Go string "\"wKgACg==\"". In the other direction, json.Unmarshal refuses to unmarshal into a net.IP in Go 1 and Go 1.1. In all cases, the doc comments make no promise that named []byte types encode or decode at all.

Because the current behavior is undocumented and nearly useless, we should change it to produce useful output. Specifically, by defining the MarshalText and UnmarshalText methods on net.IP, we will end up with <IP>192.168.0.10</IP> for XML and "192.168.0.10" for JSON.
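The base64 behavior described above for named []byte types can be reproduced with a type that has no marshaling methods of its own. In this minimal sketch, rawIP and jsonOf are hypothetical names; rawIP stands in for net.IP as it behaved before gaining any text methods.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawIP is a hypothetical named []byte type with no marshaling
// methods, so the JSON encoder treats it as binary data.
type rawIP []byte

// jsonOf returns the JSON encoding of v as a string.
func jsonOf(v interface{}) (string, error) {
	b, err := json.Marshal(v)
	return string(b), err
}

func main() {
	s, err := jsonOf(rawIP{192, 168, 0, 10})
	fmt.Println(s, err) // the four bytes appear as base64: "wKgACg=="
}
```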

I am not sure if there are other common types we should be discussing. An argument could be made for the various net.TCPAddr, net.UDPAddr, and so on, but the counter-argument is that there are too many of them.

BinaryMarshaler

It may be appropriate to define BinaryMarshaler and BinaryUnmarshaler, either now or later. These could be used by “encoding/gob” and other binary encodings. Because there is only one possible use today (just gob), it does not seem worth doing now.

The possibility of BinaryMarshaler is an argument for gob not to recognize TextMarshaler now. Eventually it would probably want to check for interfaces in the order gob.Marshaler, encoding.BinaryMarshaler, and encoding.TextMarshaler. It is only safe to add to the end of the preference list: BinaryMarshaler cannot be inserted in the middle if we add TextMarshaler now.
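The preference order can be pictured as a type switch in which the first matching case wins. The sketch below is only an illustration of that ordering, not gob's implementation: choose, binaryMarshaler, and the three test types are hypothetical, and the gob-specific interface is taken to be the one the standard library names gob.GobEncoder.

```go
package main

import (
	"encoding"
	"encoding/gob"
	"fmt"
)

// binaryMarshaler is a hypothetical stand-in for the BinaryMarshaler
// interface discussed above.
type binaryMarshaler interface {
	MarshalBinary() (data []byte, err error)
}

// choose reports which interface an encoder following the preference
// order would use for v. Cases are tried top to bottom, so inserting
// a new case above an existing one changes the result for any type
// that implements both.
func choose(v interface{}) string {
	switch v.(type) {
	case gob.GobEncoder:
		return "gob"
	case binaryMarshaler:
		return "binary"
	case encoding.TextMarshaler:
		return "text"
	}
	return "default"
}

// textOnly implements only the text method.
type textOnly struct{}

func (textOnly) MarshalText() ([]byte, error) { return []byte("t"), nil }

// binaryAndText implements the binary method and inherits the text one.
type binaryAndText struct{ textOnly }

func (binaryAndText) MarshalBinary() ([]byte, error) { return []byte{1}, nil }

// allThree additionally implements the gob-specific method.
type allThree struct{ binaryAndText }

func (allThree) GobEncode() ([]byte, error) { return []byte{2}, nil }

func main() {
	fmt.Println(choose(allThree{}))
	fmt.Println(choose(binaryAndText{}))
	fmt.Println(choose(textOnly{}))
	fmt.Println(choose(42))
}
```

A type like binaryAndText shows the compatibility concern: if the text case existed first and the binary case were later inserted above it, the encoded form of that type would silently change.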

As with TextMarshaler, if time.Time’s BinaryMarshaler form matches the one it uses today in its gob-specific methods, we can drop the gob-specific methods in Go 2.