XWiki Solr component

Solr Search on XWiki in Action.

The Solr component indexes all the pages inside the Sandbox. To keep it simple, I have indexed pages in English, French and Spanish.

I used the setup below to explain how the Solr component works.

Pages:

Test page 1 (Default with XE in Sandbox)

A couple of test strings added to the body.

Test page 2 (Default with XE in Sandbox)
Test page 3 (Default with XE in Sandbox)
Test page 4

Test page in English
Test page in French
Test page in Spanish

A few more pages with “test” in the body but not in the page title. Other random pages contain l'arbre (the tree) to show that French text is parsed correctly (http://jira.xwiki.org/browse/XWIKI-6226).

Fields:

Title: [ title_en, title_fr, title_es ] - using Solr dynamic fields *_en, *_fr, *_es
Full text: [ ft_en, ft_fr, ft_es ]
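As a minimal sketch of how the per-language field names above could be derived, the Python snippet below maps a language code to its title and full-text fields. The helper names (fields_for, to_solr_document) and the "id" key are assumptions made for illustration; they are not part of the XWiki Solr component.

```python
# Languages indexed in this setup (from the post: English, French, Spanish).
SUPPORTED_LANGUAGES = ("en", "fr", "es")


def fields_for(language):
    """Return the dynamic field names (*_en, *_fr, *_es) for one language."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: " + language)
    return {"title": "title_" + language, "fulltext": "ft_" + language}


def to_solr_document(doc_id, language, title, body):
    """Build a dict shaped like a Solr document (hypothetical helper).

    The "id" key is an assumption; the post does not show the real schema.
    """
    names = fields_for(language)
    return {"id": doc_id, names["title"]: title, names["fulltext"]: body}


doc = to_solr_document("Sandbox.TestPage1", "fr", "Test page 1", "l'arbre")
print(doc)
```

Keeping one field per language lets each field use its own analyzer, which is what makes the language-specific stop-word handling described next possible.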

1. The apostrophe ( ' ) is now treated as a separator (http://jira.xwiki.org/browse/XWIKI-6226).

Searching for arbre and l’arbre returns the same set of results: the apostrophe splits l’arbre into l and arbre, and the l is then dropped as a stop word.

Searching with l’arbre
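The behaviour can be imitated outside Solr with a few lines of Python. This is only a sketch of the idea (split on apostrophes, then drop elided articles as stop words), not the analyzer chain Solr actually runs; the stop-word set below is an assumption, since the configured list is not shown here.

```python
import re

# Assumption: a tiny stop-word set of elided French forms, for illustration.
ELISION_STOP_WORDS = {"l", "d", "j", "qu"}


def analyze(text):
    """Lowercase, split on whitespace and apostrophes, drop stop words."""
    # Both the straight (') and the curly (’) apostrophe act as separators.
    tokens = re.split(r"[\s'’]+", text.lower())
    return [t for t in tokens if t and t not in ELISION_STOP_WORDS]


# Both queries reduce to the same single token, so they match the same pages.
print(analyze("l’arbre"))
print(analyze("arbre"))
```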

2. Customizing the relevancy score using field boosts.

qf=title_en^1.0 ft_en^1.0 - Equal weights.

Test Page1 - 1.36 -> Has ‘Test’ in title_en and ft_en.

Test Page4 - 1.33 -> Has ‘Test’ in title_en only

Test Page2 - 1.33 -> Has ‘Test’ in title_en only

Test Page3 - 1.33 -> Has ‘Test’ in title_en only

Tree of gods - 1.09 -> Has ‘Test’ in ft_en, very small document.

WebHome - 0.32 -> Has ‘Test’ in ft_en, but this is a large document. Under tf*idf scoring, the term frequency is normalized by document length and comes out small, hence the lower overall score.
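To make the length effect concrete, here is a simplified sketch modelled on Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and the length norm is 1/sqrt(number of terms in the field). Whether this exact similarity was in use here is an assumption, and the numbers below are invented for illustration; they are not the scores listed above.

```python
import math


def field_weight(term_freq, field_length, idf=1.0):
    """Simplified per-field weight: sqrt(tf) * idf * 1/sqrt(length).

    Query norm, coordination factor and boosts are left out on purpose.
    """
    tf = math.sqrt(term_freq)
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * idf * length_norm


# Hypothetical documents: a short page with one hit, a long page with two.
small_doc = field_weight(term_freq=1, field_length=10)
large_doc = field_weight(term_freq=2, field_length=1000)
print(small_doc, large_doc)
```

Even with twice as many occurrences, the long page ends up with the smaller weight, which is the pattern seen with WebHome versus Tree of gods.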

Scores computed by the eDisMax handler are shown in brackets.

qf=title_en^1.0 ft_en^4.0 - making the full text field more relevant.

Tree of gods - 1.09 -> Has ‘Test’ in ft_en, very small document. Scores for matches in title_en drop, since that field's relative weight is reduced.

qf=title_en^3.0 title_fr^1.2 ft_en^1.0 ft_fr^1.2

The English title field (title_en) has the most weight.

Note: In the 3rd and 4th screenshots, the title is in English for both the French and the English pages, but the content is in the respective language.
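The qf values used in these experiments can be assembled programmatically. The sketch below builds the request parameters for an eDisMax query with Python's standard library; defType, q, qf and fl are standard Solr request parameters, while the helper names and the search term are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def build_qf(boosts):
    """Join field boosts into a qf string such as 'title_en^3.0 ft_en^1.0'."""
    return " ".join(
        "{}^{}".format(field, boost) for field, boost in boosts.items()
    )


def edismax_params(query, boosts):
    """Request parameters for an eDisMax query (hypothetical helper)."""
    return {
        "defType": "edismax",
        "q": query,
        "qf": build_qf(boosts),
        "fl": "*,score",  # return the score alongside the stored fields
    }


boosts = {"title_en": 3.0, "title_fr": 1.2, "ft_en": 1.0, "ft_fr": 1.2}
params = edismax_params("test", boosts)
print(params["qf"])
print(urlencode(params))
```

Changing a single number in the boosts dictionary is enough to rerun the comparison with a different weighting.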