Do you have an idea or suggestion based on your experience with Azure Database for PostgreSQL?

Need test_parser extension

We are using tsvector for full-text search, and we want to use the test_parser extension.

Although the test_parser documentation says "It doesn't do anything especially useful", it is actually very useful when you need a custom tokenizer.

test_parser is available on Amazon RDS.

22 votes

Tomohisa Ota shared this idea

  • Saloni Sonpal commented

    Thanks for your feedback, @Tomohisa, and thanks also for sharing the details, @Ken. The need is clear, even though the community moved this module out of the contrib package after version 9.4. We at Azure Database for PostgreSQL are considering your request and have marked this item as 'Under Review'. We will keep you posted on status updates.

  • Ken Ueda commented

    In reality, the test_parser extension is critical to anyone who wants to run full-text search in languages that do not separate words with spaces, which covers much of Asia (1.6B people, or 21% of the world's population, counting only China, Japan and Korea). A managed PostgreSQL service is supposed to make developers more efficient, but the lack of the test_parser extension forces us to either develop our own parser or resort to another cloud solution (obviously, developers do not want to build their own parser when one is available on AWS). I should also add that full-text search will become even more critical in the coming years for AI solutions focusing on natural language and for other big data solutions.

    Let me elaborate a little more on what the problem is. When tokenizing strings in Japanese, some fullwidth characters get thrown away if you use the 'simple' configuration (a Japanese language parser is not available by default). For example, even though I want to treat "ジョン・スミス" as one word, if I run SELECT to_tsvector('simple', 'ジョン・スミス') the result is split into two words.
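
To make the split described above concrete, here is a minimal Python sketch of the two tokenizing behaviours. It is only an illustration under stated assumptions: the function names are hypothetical, neither function is PostgreSQL's actual parser, and the whitespace-only rule is a rough stand-in for what test_parser is documented to do (recognize words separated by white space).

```python
import re


def split_on_punctuation(text):
    """Hypothetical stand-in for a parser that discards punctuation.

    Assumption: a "word" is any run of letters or digits, so a separator
    such as the katakana middle dot is dropped and splits the name.
    """
    return re.findall(r"[^\W_]+", text)


def split_on_whitespace(text):
    """Hypothetical stand-in for test_parser-style tokenizing.

    Assumption: only whitespace separates words; everything else,
    including the middle dot, stays inside the token.
    """
    return text.split()


name = "ジョン・スミス"
print(split_on_punctuation(name))  # the name comes back as two tokens
print(split_on_whitespace(name))   # the name comes back as one token
```

With whitespace as the only separator, the application can pre-tokenize text with its own tokenizer, join the tokens with spaces, and have each token stored as-is.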

    As an analogy to help you (assuming an English reader) understand the situation, imagine how painful the world would be if the system only gave you "Micro" and "soft" when what you mean is "Microsoft". Then "Microsoft Azure Rocks!" becomes "Micro soft Azure Rocks!", and now I cannot tell whether Azure rocks or whether it's a weak, shaky service. As a developer, I need "Microsoft" to be treated as "Microsoft", and not as "Micro" and "soft".

    I don't know whether AWS offers the test_parser extension out of the box because it understands this reality, but I would really like to see Azure offer this extension too, so that we can rock with Azure together.
