pgsql-5075

Version:

8.4.0

Bug Link:

http://postgresql.1045698.n5.nabble.com/BUG-5075-Text-Search-parser-does-not-identify-xml-tag-when-attribute-name-s-contains-underscore-td2126048.html

Symptom:

When an XML tag contains an attribute whose name includes an underscore, ts_debug (described below) does not recognize it as an XML tag and tokenizes it as plain text instead.

ts_debug displays information about every token of a document, as produced by the parser and processed by the configured dictionaries.

How it is diagnosed:

Reproduced!

How to reproduce:

$ select * from ts_debug('english', '<img width="182" height="120"

align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>');

   alias   |       description        | token  |  dictionaries  |  dictionary  | lexemes  

-----------+--------------------------+--------+----------------+--------------+----------

 blank     | Space symbols            | <      | {}             |              |

 asciiword | Word, all ASCII          | img    | {english_stem} | english_stem | {img}

 blank     | Space symbols            |        | {}             |              |

 asciiword | Word, all ASCII          | width  | {english_stem} | english_stem | {width}

 blank     | Space symbols            | ="     | {}             |              |

 uint      | Unsigned integer         | 182    | {simple}       | simple       | {182}

 blank     | Space symbols            | "      | {}             |              |

 asciiword | Word, all ASCII          | height | {english_stem} | english_stem | {height}

 blank     | Space symbols            | ="     | {}             |              |

 uint      | Unsigned integer         | 120    | {simple}       | simple       | {120}

 blank     | Space symbols            | "      | {}             |              |

                                      :                                          

 asciiword | Word, all ASCII          | align  | {english_stem} | english_stem | {align}

 blank     | Space symbols            | ="     | {}             |              |

 asciiword | Word, all ASCII          | right  | {english_stem} | english_stem | {right}

 blank     | Space symbols            | "      | {}             |              |

 asciiword | Word, all ASCII          | style  | {english_stem} | english_stem | {style}

 blank     | Space symbols            | ="     | {}             |              |

 asciiword | Word, all ASCII          | margin | {english_stem} | english_stem | {margin}

 blank     | Space symbols            | :      | {}             |              |

 numword   | Word, letters and digits | 0px    | {simple}       | simple       | {0px}

 blank     | Space symbols            |        | {}             |              |

 numword   | Word, letters and digits | 0px    | {simple}       | simple       | {0px}

 blank     | Space symbols            |        | {}             |              |

 numword   | Word, letters and digits | 5px    | {simple}       | simple       | {5px}

 blank     | Space symbols            |        | {}             |              |

 numword   | Word, letters and digits | 5px    | {simple}       | simple       | {5px}

 blank     | Space symbols            | ;"     | {}             |              |

 asciiword | Word, all ASCII          | test   | {english_stem} | english_stem | {test}

 blank     | Space symbols            | _      | {}             |              |

 asciiword | Word, all ASCII          | aa     | {english_stem} | english_stem | {aa}

 blank     | Space symbols            | ="     | {}             |              |

 uint      | Unsigned integer         | 26461  | {simple}       | simple       | {26461}

 blank     | Space symbols            | "      | {}             |              |

 blank     | Space symbols            | />     | {}             |              |

(34 rows)

-- If we remove the underscore from the attribute name test_aa, the whole tag is recognized as a single XML tag token, which is the correct result

$ select * from ts_debug('english', '<img width="182" height="120"

align="right" style="margin: 0px 0px 5px 5px;" testaa="26461"/>');

 alias | description |                              token                              | dictionaries | dictionary | lexemes

-------+-------------+-----------------------------------------------------------------+--------------+------------+---------

 tag   | XML tag     | <img width="182" height="120"                                   | {}           |            |

                     : align="right" style="margin: 0px 0px 5px 5px;" testaa="26461"/>                              

(1 row)

Root Cause:

TParserGet does the parsing. It works as a state machine: in each state, the current character is matched against that state's action table. While the parser is trying to recognize an XML tag (parsing state TPS_InTag) and encounters an underscore, no entry in the table of valid in-tag symbols matches. The parser therefore concludes that the input violates the XML grammar and stops trying to parse it as an XML tag. The text is then treated as plain text and split into English words and delimiters.

The patch simply adds “_” to the TPS_InTag action table.

backend/tsearch/wparser_def.c

static const TParserStateActionItem actionTPS_InTag[] = {

        {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},

        {p_iseqC, '>', A_NEXT, TPS_InTagEnd, 0, SpecialTags},

        {p_iseqC, '\'', A_NEXT, TPS_InTagEscapeK, 0, NULL},

        {p_iseqC, '"', A_NEXT, TPS_InTagEscapeKK, 0, NULL},

        {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},

        {p_isdigit, 0, A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '=', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '-', A_NEXT, TPS_Null, 0, NULL},

    +         {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '#', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '/', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, ':', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '.', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '&', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '?', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '%', A_NEXT, TPS_Null, 0, NULL},

        {p_iseqC, '~', A_NEXT, TPS_Null, 0, NULL},

        {p_isspace, 0, A_NEXT, TPS_Null, 0, SpecialTags},

        {NULL, 0, A_POP, TPS_Null, 0, NULL}

};

static bool TParserGet(TParser *prs) {

   const TParserStateActionItem *item = NULL;

        …

/* prs is a TParser, a structure containing the string to be parsed and

its position information:

                     typedef struct TParser

                     {

                       // string and position information

                       char       *str;            // multibyte string

                       int         lenstr;         // length of mbstring

                       TParserPosition *state;     // state->posbyte contains the position

                       … …

                     } TParser;

         */

while (prs->state->posbyte <= prs->lenstr) {

                /* This while loop parses the string one character at a time,

                   so the current character is (prs->str)[prs->state->posbyte] */

 

                item = Actions[prs->state->state].action;

                // when the state is TPS_InTag, item points to the first entry of actionTPS_InTag...

...

                /* find action by character class */

                /* The while loop below iterates through the actionTPS_InTag

                   table. When the current character is “_”, no entry matches,

                   so the loop ends on the final sentinel item

                   {NULL, 0, A_POP, TPS_Null, 0, NULL} (item->isclass == NULL).

                   Its action is A_POP, which discards the TPS_InTag state, so

                   the text is not recognized as an XML tag.

                   With the patch, the new underscore entry matches ‘_’ and its

                   A_NEXT action tells the parser to keep parsing the XML tag.

                */

                while (item->isclass)

                {

                        prs->c = item->c;

                        if (item->isclass(prs) != 0)

                        /* item->isclass is a test function such as ‘p_isEOF’ or

                           ‘p_iseqC’, as named in each entry of actionTPS_InTag

                           above. p_iseqC, for example, essentially tests whether

                           prs->c == (prs->str)[prs->state->posbyte] */

                                break;

                        item++;

                }

                ...

                /* do various actions by flags */

                if (item->flags & A_POP)

                {        /* pop stored state in stack */

                        /* An underscore falls through to the default sentinel

                           item, whose action is A_POP, so the TPS_InTag state

                           is destroyed.

                        */

                        TParserPosition *ptr = prs->state->prev;

                        pfree(prs->state);

                        prs->state = ptr;

                        Assert(prs->state);

                }

                else if (item->flags & A_PUSH)

                {        /* push (store) state in stack */

                        prs->state->pushedAtAction = item;        /* remember where we push */

                        prs->state = newTParserPosition(prs->state);

                }

                else if (item->flags & A_CLEAR)

                {        

                        … ...

                }

                

                … ...        

        }

        return (item && (item->flags & A_BINGO)) ? true : false;

}

In essence, this function uses a while loop (the while (item->isclass) loop above) to search the action table for an entry that matches the current character. This suggests that a log message at the ‘break’ point of such a loop, where a match is found, would be a good idea.
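The lookup can be illustrated with a minimal, self-contained sketch. It is not the PostgreSQL code: the type ActionItem, the flags ACT_NEXT and ACT_POP, the predicates, and the two shortened tables are hypothetical stand-ins, chosen only to show how a character with no matching entry ends up on the sentinel.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the parser's action flags. */
enum { ACT_NEXT = 1, ACT_POP = 2 };

/* Hypothetical table entry: a predicate, an optional character, an action. */
typedef struct {
    int (*isclass)(int cur, int want); /* NULL marks the sentinel entry */
    int want;                          /* character to compare against, if any */
    int flags;                         /* action taken when the entry is chosen */
} ActionItem;

static int is_eq(int cur, int want) { return cur == want; }
static int is_letter(int cur, int want) { (void) want; return isalpha((unsigned char) cur) != 0; }
static int is_digit(int cur, int want) { (void) want; return isdigit((unsigned char) cur) != 0; }

/* Shortened in-tag table without an underscore entry (before the patch). */
static const ActionItem in_tag_before[] = {
    {is_letter, 0, ACT_NEXT},
    {is_digit, 0, ACT_NEXT},
    {is_eq, '=', ACT_NEXT},
    {is_eq, '-', ACT_NEXT},
    {NULL, 0, ACT_POP}
};

/* Same table with the underscore entry added (after the patch). */
static const ActionItem in_tag_after[] = {
    {is_letter, 0, ACT_NEXT},
    {is_digit, 0, ACT_NEXT},
    {is_eq, '=', ACT_NEXT},
    {is_eq, '-', ACT_NEXT},
    {is_eq, '_', ACT_NEXT},
    {NULL, 0, ACT_POP}
};

/* Walk the table until an entry matches; if none does, the loop stops on
 * the sentinel and its flags are returned. */
static int lookup_flags(const ActionItem *item, int cur)
{
    assert(item != NULL);
    while (item->isclass) {
        if (item->isclass(cur, item->want))
            break;
        item++;
    }
    return item->flags;
}
```

With these tables, lookup_flags(in_tag_before, '_') returns ACT_POP because the search reaches the sentinel, while lookup_flags(in_tag_after, '_') returns ACT_NEXT.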

Failure type:

Wrong result

Is there any log message?

No

Can ErrLog insert a log message?

Yes, using the search pattern:

foreach (item within a list) {

   if (A == B)

      break; <-- record at this break.

}
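The pattern above could look roughly like this in C. This is a sketch under assumptions, not ErrLog output or PostgreSQL code: the Rule type, the rules table, the find_rule function, and the log line format are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry: a predicate plus a label used only for logging. */
typedef struct {
    int (*matches)(int cur, int want); /* NULL marks the sentinel entry */
    int want;
    const char *label;
} Rule;

static int eq(int cur, int want) { return cur == want; }

static const Rule rules[] = {
    {eq, '=', "equals sign"},
    {eq, '-', "hyphen"},
    {NULL, 0, "sentinel"}
};

/* Search the table for an entry matching cur. A record is written at the
 * break point only; *records counts how many were written. If nothing
 * matches, the sentinel is returned and no record is produced. */
static const Rule *find_rule(const Rule *table, int cur, FILE *log, int *records)
{
    const Rule *r = table;

    assert(table != NULL && records != NULL);
    while (r->matches) {
        if (r->matches(cur, r->want)) {
            /* record at this break: which entry matched which character */
            if (log)
                fprintf(log, "matched '%c' at entry %ld (%s)\n",
                        cur, (long) (r - table), r->label);
            (*records)++;
            break;
        }
        r++;
    }
    return r;
}
```

In this sketch a character with no matching entry, such as '_', never reaches the break, so it leaves no record. The missing record next to the records for the neighbouring characters is what would point to the table as the place to look.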