pgsql-4987

Version:

8.4.0/8.3

Bug Link:

http://postgresql.1045698.n5.nabble.com/BUG-4987-registered-in-pg-type-by-a-wrong-table-name-td2130801.html

Symptom:

Encoding problem. When creating a table with long table name of exactly 63 bytes and last charcter is a multi-byte character, client will encounter an encoding error:

ERROR:  invalid byte sequence for encoding "UTF8": 0xe381

when client is examining the table name.

The table name is stored as pg_type, which is an internal data structure (A composite type that is automatically created for each table in the database, to represent the row structure of the table) with a wrong table name. And then, if the client would like to query the value of this pg_type (postgres allows client to inspect this internal variable value), and when client and database encoding are not the same, there will be an encoding error. The correct behavior should be ok for client to see the pg_type even if encoding are different.

How it is diagnosed:

We reproduced the failure.

How to reproduce:

/************************************************

1. long table name's (exactly 63 bytes) last charcter is NOT muli-byte

charcter

*************************************************/

postgres=# select typname from pg_type where typname like '%abcdefghi%'; 

typname

---------

(0 row)

Right now we haven’t created the table yet!.

postgres=# create table abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_123(test int);

CREATE TABLE

postgres=# select typname from pg_type where typname like '%abcdefghi%';

typname

-----------------------------------------------------------------

 abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_123

 _abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_12 

(2 row)

This is correct result!

postgres=#

===============================

From the result above, we can see when create table,2 record registered in pg_type(table type and enum type?).

if table names is 63byte, enum typname names is named by the following rules:

 - "_" is applied first.

 - The last 1 byte is removed.

/********************************************************

*  2. long table name's (exactly 63 byte) last charcter is muli-byte charcter

*******************************************************/

postgres=# create table

abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_あ(test int); 

//(last charcter is JAPANESE HIRAGANA. UTF8 code is "0xE38182")

CREATE TABLE

postgres=# \encoding

UTF8

postgres=# select typname from pg_type where typname like '%abcdefghi%'; 

typname

----------------------------------------------------------------

 abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_縺

 _abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_abcdefghi_

(2 row)

This is still correct, since the encoding of client is still the same as server’s

postgres=# \encoding SJIS  (Now we set a different encoding!)

postgres=# \encoding

SJIS

postgres=# select typname from pg_type where typname like '%abcdefghi%'; 

ERROR:  invalid byte sequence for encoding "UTF8": 0xe381

HINT:  This error can also happen if the byte sequence does not match the

encoding expected by the server, which is controlled by "client_encoding". 

===============================

Root Cause:

In making array type name, there’s an identifier truncation which ensures the name is within 64 bytes (including \0). But it uses strlcpy which is a customized strncpy that always terminate the string with NULL. But strlcpy’s truncation is encoding-unaware, just byte level. This will result into an invalid UTF-8 character if the last character is a multi-bytes utf-8 character -- after the strlcpy call, the last byte of the 3-byte character would be remove, leaving the first 2 bytes still remaining, forming an invalid character according to utf8 standard.

src/backend/catalog/pg_type.c

/*

 * makeArrayTypeName

 *          - given a base type name, make an array type name for it

 *

 * the caller is responsible for pfreeing the result

 */

char *

makeArrayTypeName(const char *typeName, Oid typeNamespace)

{

        char           *arr;

        int                        i;

        Relation        pg_type_desc;

        /*

         * The idea is to prepend underscores as needed until we make a name that

         * doesn't collide with anything...

         */

        arr = palloc(NAMEDATALEN); // NAMEDATALEN is 64
+         int namelen = strlen(typeName);
//namelen is 63

        pg_type_desc = heap_open(TypeRelationId, AccessShareLock);

        for (i = 1; i < NAMEDATALEN - 1; i++)

        {

                arr[i - 1] = '_';

-                strlcpy(arr + i, typeName, NAMEDATALEN - i); //Always NUL terminates! NAMEDATALEN == 64, i == 1, so the length to copy is 63. However, this ‘strlcpy’ function will set the last character to NULL no matter what, so in this case, after the copy, arr[63] will be set to NULL. This results in the following truncate_identifier will do nothing.

-                truncate_identifier(arr, strlen(arr), false);//strlen(arr) is 63, truncate won’t do anything!

+                if (i + namelen < NAMEDATALEN)
+                         strcpy(arr + i, typeName);
+                else
+                 {
+                        memcpy(arr + i, typeName, NAMEDATALEN - i);
// don’t need to be null terminated.
+                        truncate_identifier(arr, NAMEDATALEN, false); // truncate_identifier would do utf-8-aware string truncation!
+                }

                ….

        }

        …

        return arr;

}

Where the error message is printed?

It is printed when postgres is trying to perform string conversion and determines the character is not utf-8 conforming.

/*

 * Verify mbstr to make sure that it is validly encoded in the specified

 * encoding.

 */

int pg_verify_mbstr_len(int encoding, const char *mbstr, int len, bool noError)

{

     … ...

                l = (*mbverify) ((const unsigned char *) mbstr, len);

                if (l < 0)

                {

                        ...

                    // The Error occurs here, that when doing the encoding conversion, the truncated character no longer conforms utf8.

                        report_invalid_encoding(encoding, mbstr, len);

                }

                ...

        }

}

void

report_invalid_encoding(int encoding, const char *mbstr, int len)

{

        ...

        ereport(ERROR,

                        (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),

                         errmsg("invalid byte sequence for encoding \"%s\": 0x%s",

                                        pg_enc2name_tbl[encoding].name,

                                        buf),

                  errhint("This error can also happen if the byte sequence does not "

                        "match the encoding expected by the server, which is controlled "

                                  "by \"client_encoding\".")));

}

Is there any log message?

Yes.

Can ErrLog insert an error message?

Yes. Postgres is checking for valid input, and we can learn this rule by learning the frequent logging pattern:

Everytime when the return value of: mbverify is less than 0, we should report error!