Go 1.2 xml.Marshaler and Unmarshaler

Russ Cox

July 2013

Abstract

For Go 1.2, we propose to add methods that values can implement to define custom marshaling or unmarshaling for encoding/xml, analogous to the Marshaler and Unmarshaler interfaces in encoding/json, encoding/gob, and code.google.com/p/goprotobuf/proto.

See also the related proposal for encoding.TextMarshaler and encoding.TextUnmarshaler.

Discussion on golang-dev.

Background

Package encoding/xml used to have a Marshaler interface, but it had significant problems and was removed before Go 1. Among the problems, the Marshaler was responsible for choosing the outer XML tag, but it did not have enough information to make the right choice, and there was no way to do custom marshaling for attribute values (as opposed to XML elements).

We had hoped to define both Marshaler and Unmarshaler for Go 1.1, but it did not happen. We’d like to define them for Go 1.2.

XML differs from Gob, JSON, and Protocol Buffers in important ways. First, values can appear in a few different contexts: as attributes in a start-element tag (<tag attr="value">), as character data in a simple element (<tag>value</tag>), or as a sequence of possibly nested elements (<tag><x>value1</x><y>value2</y></tag>). Second, XML is significantly more difficult to parse and to generate by hand, both in syntax and semantics (such as knowing which name space definitions are in effect at a given point in the tree).

Because basic lexing of XML is so difficult, the encoding/xml package already provides a Decoder with methods to retrieve the stream one token at a time.

Any Marshaler and Unmarshaler interface definitions must be able to support the multiple contexts and must provide help for the task being requested. For example, during Marshal it should be possible to emit the XML as a sequence of tokens instead of being asked to return a []byte, and similarly during Unmarshal it should be possible to parse the XML as a sequence of tokens. It makes sense to provide this functionality by making an Encoder or Decoder available.

The current API includes these types and methods (and more):

package xml

type CharData []byte

type Comment []byte

type StartElement struct {
        Name Name
        Attr []Attr
}

type EndElement struct {
        Name Name
}

type Decoder

func (*Decoder) Decode(v interface{}) error

func (*Decoder) DecodeElement(v interface{}, start *StartElement) error

func (*Decoder) Token() (t Token, err error)

The Decoder’s Decode method corresponds to the top-level Unmarshal operation. DecodeElement is like Decode but passes in the opening element tag, which has already been parsed (the rest of the element is read from the decoder stream). Clients that do not want to use Decode or DecodeElement can get lower-level access to the individual tokens of the stream by calling the Token method. For our purposes, a Token is a CharData, Comment, *StartElement, or *EndElement.
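As a rough illustration of the token-level access described above, the following sketch walks a small document with Token. It is written against the encoding/xml package as released, where Token yields StartElement and EndElement values rather than pointers, so the type switch matches on values. The helper name tokenKinds and the sample input are ours, chosen only for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenKinds (a hypothetical helper) reads the input one token at a time
// and returns a short description of each token, in stream order.
func tokenKinds(input string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(input))
	var kinds []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return kinds, nil // end of input
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			kinds = append(kinds, "start:"+t.Name.Local)
		case xml.EndElement:
			kinds = append(kinds, "end:"+t.Name.Local)
		case xml.CharData:
			kinds = append(kinds, "chardata:"+string(t))
		case xml.Comment:
			kinds = append(kinds, "comment:"+string(t))
		}
	}
}

func main() {
	kinds, err := tokenKinds(`<tag><x>value1</x><!--note--></tag>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, k := range kinds {
		fmt.Println(k)
	}
}
```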

type Encoder

func (*Encoder) Encode(v interface{}) error

The Encoder’s Encode method corresponds to the top-level Marshal operation. Encoder does not provide any other relevant methods.


Proposal

First, extend Encoder by adding these methods, which provide operations for writing XML similar to what the Decoder provides for reading it:

func (*Encoder) EncodeElement(v interface{}, start *StartElement) error

func (*Encoder) EncodeToken(t Token) error

That is, EncodeElement is like Encode but allows specifying the outermost start tag (and, by implication, the end tag) for the element. EncodeToken does not operate on elements but on tokens: it writes a CharData, Comment, *StartElement, or *EndElement to the generated XML. (If EncodeToken is misused, such as by writing an EndElement that does not match the most recently opened StartElement, it will return an error, and the encoder will return an error from any future operations too.)

Second, add a helper to *StartElement to produce the corresponding *EndElement:

func (*StartElement) End() *EndElement
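To make the intended use of EncodeToken and End concrete, here is a minimal sketch that writes one element as a stream of tokens. It is written against the encoding/xml package as released, which differs slightly from the signatures above: EncodeToken accepts StartElement and EndElement values, End is defined on the StartElement value, and buffered output must be flushed with Flush. The function names encodeGreeting and mismatchedEnd are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// encodeGreeting writes <greeting lang="en">hello</greeting> token by token.
func encodeGreeting() (string, error) {
	var buf bytes.Buffer
	e := xml.NewEncoder(&buf)

	start := xml.StartElement{
		Name: xml.Name{Local: "greeting"},
		Attr: []xml.Attr{{Name: xml.Name{Local: "lang"}, Value: "en"}},
	}
	if err := e.EncodeToken(start); err != nil {
		return "", err
	}
	if err := e.EncodeToken(xml.CharData("hello")); err != nil {
		return "", err
	}
	// End derives the matching end tag from the start tag.
	if err := e.EncodeToken(start.End()); err != nil {
		return "", err
	}
	// EncodeToken buffers its output; Flush pushes it to buf.
	if err := e.Flush(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// mismatchedEnd shows the misuse case: closing <a> with </b> is reported
// as an error by EncodeToken.
func mismatchedEnd() error {
	var buf bytes.Buffer
	e := xml.NewEncoder(&buf)
	if err := e.EncodeToken(xml.StartElement{Name: xml.Name{Local: "a"}}); err != nil {
		return err
	}
	return e.EncodeToken(xml.EndElement{Name: xml.Name{Local: "b"}})
}

func main() {
	out, err := encodeGreeting()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	fmt.Println("mismatched end tag:", mismatchedEnd())
}
```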

Now, with those extensions in place, define Marshaler:

type Marshaler interface {
        MarshalXML(e *Encoder, start *StartElement) error
        MarshalXMLAttr(name Name) (Attr, error)
}

A value implementing Marshaler must implement both the MarshalXML method, for marshaling as a full XML element, and the MarshalXMLAttr method, for marshaling as an XML attribute. If only one context is expected, the other method can return an appropriate error.

MarshalXMLAttr is the easier method and is for marshaling a value as an attribute (because the struct field has a “,attr” tag). MarshalXMLAttr is passed the expected attribute name (an xml.Name) and returns the attribute name/value pair (an xml.Attr). It may return an xml.Attr containing a different name than was passed in, although that will mean the result of Marshal will not be inverted by Unmarshal. It may also return the zero Attr to signal that the attribute should be omitted from the output.
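A minimal sketch of the attribute case follows. It assumes the encoding/xml package as released, where MarshalXMLAttr keeps the signature given here but lives in a separate MarshalerAttr interface, so a type need not also define MarshalXML. The Celsius and Reading types, and the convention that temperatures below absolute zero mean "unset", are invented for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
)

// Celsius is a hypothetical type that controls its own attribute encoding.
type Celsius float64

// MarshalXMLAttr formats the value with one decimal place and a "C" suffix.
// Values below absolute zero are treated as unset (an assumption of this
// sketch) and return the zero Attr, which omits the attribute.
func (c Celsius) MarshalXMLAttr(name xml.Name) (xml.Attr, error) {
	if c < -273.15 {
		return xml.Attr{}, nil
	}
	value := strconv.FormatFloat(float64(c), 'f', 1, 64) + "C"
	return xml.Attr{Name: name, Value: value}, nil
}

// Reading marshals Temp as an attribute because of the ",attr" tag.
type Reading struct {
	XMLName xml.Name `xml:"reading"`
	Temp    Celsius  `xml:"temp,attr"`
}

// marshalReading is a small helper that returns the marshaled form as a string.
func marshalReading(c Celsius) (string, error) {
	out, err := xml.Marshal(Reading{Temp: c})
	return string(out), err
}

func main() {
	for _, c := range []Celsius{21.5, -300} {
		out, err := marshalReading(c)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(out)
	}
}
```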

MarshalXML is for marshaling a value as a full XML element. It is passed the suggested start tag for the element and is expected to write the marshaled element directly to the Encoder. MarshalXML might construct a suitable alternate structure and invoke EncodeElement, passing start. It might also edit the StartElement first, perhaps adding attributes or even changing the tag name. (The same caveat about Unmarshal applies.) But instead of using EncodeElement, MarshalXML might also choose to generate the XML as a stream, calling EncodeToken repeatedly with CharData, Comment, *StartElement, and *EndElement tokens.
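The two styles of MarshalXML can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: it targets the encoding/xml package as released, where MarshalXML and EncodeElement receive the StartElement by value rather than by pointer, and the Point, Pair, and Doc types are hypothetical. Point edits the start tag and delegates to EncodeElement with an alternate representation (a string); Pair emits its element as a stream of tokens.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Point marshals as a single element whose character data is "X,Y".
type Point struct{ X, Y int }

// MarshalXML adds an attribute to the suggested start tag, then hands an
// alternate representation of the value to EncodeElement.
func (p Point) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
	start.Attr = append(start.Attr, xml.Attr{Name: xml.Name{Local: "unit"}, Value: "px"})
	return e.EncodeElement(fmt.Sprintf("%d,%d", p.X, p.Y), start)
}

// Pair marshals as a single element whose character data is "Key=Value".
type Pair struct{ Key, Value string }

// MarshalXML writes the element one token at a time.
func (p Pair) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
	if err := e.EncodeToken(start); err != nil {
		return err
	}
	if err := e.EncodeToken(xml.CharData(p.Key + "=" + p.Value)); err != nil {
		return err
	}
	return e.EncodeToken(start.End())
}

// Doc supplies the suggested start tags through its field tags.
type Doc struct {
	XMLName xml.Name `xml:"doc"`
	Origin  Point    `xml:"origin"`
	Setting Pair     `xml:"setting"`
}

// marshalDoc is a small helper that returns the marshaled form as a string.
func marshalDoc(d Doc) (string, error) {
	out, err := xml.Marshal(d)
	return string(out), err
}

func main() {
	out, err := marshalDoc(Doc{Origin: Point{1, 2}, Setting: Pair{"a", "b"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```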

Unmarshaler is similar:

type Unmarshaler interface {
        UnmarshalXML(d *Decoder, start *StartElement) error
        UnmarshalXMLAttr(attr Attr) error
}

UnmarshalXMLAttr is straightforward: it parses the string value in attr, and it may use the name as well.
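For illustration, a sketch of the attribute case on the decoding side. It assumes the encoding/xml package as released, where UnmarshalXMLAttr has this signature but belongs to a separate UnmarshalerAttr interface. The Celsius and Reading types and the "C" suffix convention are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
	"strings"
)

// Celsius is a hypothetical type that parses its own attribute value.
type Celsius float64

// UnmarshalXMLAttr parses values such as "21.5C". It uses the attribute
// name only to produce a more helpful error message.
func (c *Celsius) UnmarshalXMLAttr(attr xml.Attr) error {
	s := strings.TrimSuffix(attr.Value, "C")
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return fmt.Errorf("attribute %s: %v", attr.Name.Local, err)
	}
	*c = Celsius(f)
	return nil
}

// Reading unmarshals Temp from an attribute because of the ",attr" tag.
type Reading struct {
	XMLName xml.Name `xml:"reading"`
	Temp    Celsius  `xml:"temp,attr"`
}

// parseReading is a small helper that returns the decoded temperature.
func parseReading(input string) (Celsius, error) {
	var r Reading
	err := xml.Unmarshal([]byte(input), &r)
	return r.Temp, err
}

func main() {
	temp, err := parseReading(`<reading temp="21.5C"/>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(float64(temp))
}
```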

UnmarshalXML unmarshals by reading a full element from the Decoder. The element has already begun with start. The Decoder will refuse reads beyond the corresponding end tag. UnmarshalXML might read the full element by using DecodeElement into a suitable alternate structure, or it can call Token to process the stream one token at a time, or it might start out reading tokens and then use DecodeElement for specific sub-pieces.
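Both reading strategies can be sketched briefly. The example assumes the encoding/xml package as released, where UnmarshalXML receives the StartElement by value while DecodeElement still takes a pointer. The Point, Words, and Doc types are hypothetical: Point decodes the whole element into an alternate structure (a string) with DecodeElement, and Words processes the stream with Token, tracking nesting depth so that it stops at the end tag matching start.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Point is stored in XML as character data of the form "X,Y".
type Point struct{ X, Y int }

// UnmarshalXML reads the full element into a string, then parses it.
func (p *Point) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	var text string
	if err := d.DecodeElement(&text, &start); err != nil {
		return err
	}
	_, err := fmt.Sscanf(text, "%d,%d", &p.X, &p.Y)
	return err
}

// Words collects all non-blank character data inside its element.
type Words struct{ List []string }

// UnmarshalXML reads one token at a time until the element that began
// with start has been closed.
func (w *Words) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	depth := 1 // the start tag has already been consumed
	for depth > 0 {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			depth++
		case xml.EndElement:
			depth--
		case xml.CharData:
			if s := strings.TrimSpace(string(t)); s != "" {
				w.List = append(w.List, s)
			}
		}
	}
	return nil
}

// Doc routes child elements to the custom unmarshalers via field tags.
type Doc struct {
	Origin Point `xml:"origin"`
	Terms  Words `xml:"words"`
}

// parseDoc is a small helper around xml.Unmarshal.
func parseDoc(input string) (Doc, error) {
	var doc Doc
	err := xml.Unmarshal([]byte(input), &doc)
	return doc, err
}

func main() {
	doc, err := parseDoc(`<doc><origin>3,4</origin><words><w>a</w><w>b</w></words></doc>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(doc.Origin.X, doc.Origin.Y, doc.Terms.List)
}
```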

Discussion

It is important that these interfaces provide full access to the encoding and decoding state, which includes implicit state like the current name space bindings. An interface that did not provide an Encoder/Decoder would need to pass that state in some other form, and then the methods would need to be able to use the state directly or else use it to construct a fresh Encoder/Decoder. However, passing an Encoder/Decoder comes with a significant disadvantage: the interfaces cannot be implemented without importing encoding/xml.

The import requirement is a problem for low-level packages like time: we would like time.Time to implement xml.Marshaler and xml.Unmarshaler, just as it implements the corresponding interfaces for gob and json, but it cannot do so without importing encoding/xml.

To support time.Time and perhaps other types, we propose an additional, more generic interface: TextMarshaler and TextUnmarshaler.
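As a rough sketch of how the more generic interface avoids the import problem, the Level type below defines only MarshalText and UnmarshalText, neither of which mentions any encoding/xml type. The method signatures are those of encoding.TextMarshaler and encoding.TextUnmarshaler in the released standard library, and the released encoding/xml uses them for both attributes and elements; the details of the interfaces themselves belong to the related proposal. The Level and Config types and the level names are hypothetical. The file imports encoding/xml only to drive the demonstration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Level is a hypothetical low-level type. Its methods depend only on
// []byte and error, so its package would not need to import encoding/xml.
type Level int

// MarshalText returns the textual name of the level.
func (l Level) MarshalText() ([]byte, error) {
	switch l {
	case 1:
		return []byte("low"), nil
	case 2:
		return []byte("mid"), nil
	case 3:
		return []byte("high"), nil
	}
	return nil, fmt.Errorf("unknown level %d", int(l))
}

// UnmarshalText parses the textual name of the level.
func (l *Level) UnmarshalText(text []byte) error {
	switch string(text) {
	case "low":
		*l = 1
	case "mid":
		*l = 2
	case "high":
		*l = 3
	default:
		return fmt.Errorf("unknown level %q", text)
	}
	return nil
}

// Config uses Level both as an attribute and as an element.
type Config struct {
	XMLName xml.Name `xml:"config"`
	Min     Level    `xml:"min,attr"`
	Max     Level    `xml:"max"`
}

// roundTrip marshals c and unmarshals the result into a fresh Config.
func roundTrip(c Config) (string, Config, error) {
	out, err := xml.Marshal(c)
	if err != nil {
		return "", Config{}, err
	}
	var back Config
	err = xml.Unmarshal(out, &back)
	return string(out), back, err
}

func main() {
	out, back, err := roundTrip(Config{Min: 1, Max: 3})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	fmt.Println(int(back.Min), int(back.Max))
}
```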