Lucene Domain Index

This project was originally sponsored by Lending Club, an online social lending network where people can borrow and lend money among themselves based upon their affinities and/or social connections.
The project is under Apache V2 License:
Copyright 2004 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
  1. Lucene Domain Index
  2. 1. Introduction
    1. 1.1 What is Lucene
    2. 1.2 What is Lucene Domain Index
    3. 1.3 Why do I use Lucene Domain Index?
  3. 2. Install
    1. 2.1 Requirements
    2. 2.2 Install binary distributions
      1. 2.2.1 11g Binary Distribution
      2. 2.2.2 10g Binary Distribution
    3. 2.3 Install Instructions to compile from sources
      1. 2.3.1 Generating Maven's artifacts
    4. 2.4 Optimizations
      1. 2.4.1 Using NCOMP on 10g
      2. 2.4.2 Using JIT on 11g
    5. 2.5 Testing Lucene Domain Index
      1. Required grants for regular Oracle users
    6. 3. Examples
      1. IMPORTANT: Before start using Lucene Domain Index grant this to any Oracle user rather than LUCENE:
      2. 3.1 Create a Lucene Domain Index
      3. 3.1.1 Single column index
      4. IMPORTANT: Lucene Domain Index name can not be larger than 21 characters.
      5. 3.1.2 Multiple columns
      6. 3.1.3 Multiple tables
      7. 3.1.4 Padding and formatting
      8. 3.1.5 Functional columns
      9. 3.1.6 Create OnLine index
      10. 3.1.7 Populate Index
      11. 3.2 Alter
      12. 3.2 Rebuild
      13. 3.2.1 Manual
      14. 3.2.2 On Line
      15. 3.3 Drop
      16. 3.4 Querying
      17. 3.4.1 Simple columns
      18. 3.4.2 Multiple columns
      19. 3.4.3 Pagination
      20. 3.4.4 Sort
      21. 3.4.5 Count Hits Function
      22. 3.4.6 First Rows Optimizer Hint
      23. 3.4.6 Highlighting
      24. 3.4.7 Highlighting using pipeline table functions
      25. 3.4.8 More like this functionality
      26. 3.4.9 Facets
    7. 3.5 Synchronize
    8. 3.6 Optimize
    9. 3.6 XMLDB Export
    10. 3.7 Exporting/Importing functional index with exp/imp Oracle tools
    11. 4. Locking and Performance
      1. 4.1 Lock used by Lucene Domain Index
      2. 4.2 Performance tips
      3. 4.2.1 Index Writer parameters
      4. 4.2.2 Auto Tune Memory functionality
      5. 4.2.3 Keep Index on RAM
      6. 4.2.4 Compare your execution plan
  4. 5 Know caveats
  5. Appendixes
    1. A. Parameter reference and syntax
      1. A.1 Lucene Index Writer parameters
      2. A.1.1 MergeFactor
      3. A.1.2 MaxBufferedDocs
      4. A.1.3  MaxMergeDocs
      5. A.1.4 MaxBufferedDeleteTerms
      6. A.1.5 UseCompoundFile
      7. A.2 Analyzer parameters
      8. A.2.1 Analyzer
      9. A.2.2 Stemmer
      10. A.2.3 PerFieldAnalyzer
      11. A.3 User Data Store parameters
      12. A.3.1 ExtraCols
      13. A.3.2 ExtraTabs
      14. A.3.3 WhereCondition
      15. A.3.4 UserDataStore
      16. A.3.2 FormatCols
      17. A.4 General parameters
      18. A.4.1 SyncMode
      19. A.4.2 AutoTuneMemory
      20. A.4.3 LobStorageParameters
      21. A.4.4 LogLevel
      22. A.4.5 CachedRowIdSize
      23. A.5 Query parameters
      24. A.5.1 DefaultColumn
      25. A.5.2 DefaultOperator
      26. A.5.3 NormalizeScore
      27. A.5.4 PreserveDocIdOrder
      28. A.6 Highlight parameters
      29. A.6.1 Formatter
      30. A.6.2 MaxNumFragmentsRequired
      31. A.6.3 FragmentSize
      32. A.6.4 FragmentSeparator
    2. B Lucene Domain Index Storage
    3. C JUnit test suites explained
      1. C.1 DBTestCase base class
      2. C.2 TestDBIndex
      3. C.3 TestDBIndexAddDoc
      4. C.4 TestDBIndexDelDoc
      5. C.5 TestDBIndexParallel
      6. C.6 TestDBIndexSearchDoc
      7. C.7 TestQueryHits
    4. D Functions, operators and utilities
      1. D.1 lcontains ancillary operator
      2. D.2 lscore ancillary operator
      3. D.3 lhighlight ancillary operator
      4. D.4 phighlight pipeline table function
      5. D.5 rhighlight pipeline table function
      6. D.6 MoreLike.this function
      7. D.7 lfacets aggregate function
    5. E Project Change Log
      1. 2.9.0.1.0 Initial release based on Lucene 2.9.0 core base
      2. 2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09)
      3. 2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09)
      4. 2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09)
      5. 2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)
      6. 2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08)
      7. 2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08)
      8. 2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07)
      9. 2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07)
      10. 2.2.0.1.1 (second release, 27/Sep/07 05:39 AM)
      11. 2.2.0.1.0 (first release synchronized with lucene 2.2.0, 14/Sep/07 06:44 AM)
      12. 2.0.0.1.3 (third release, 09/Jan/07 11:40 AM)
      13. 2.0.0.1.2 (second release, 20/Dec/06 02:03 PM)
      14. 2.0.0.1.1 (first release, 28/Nov/06 01:04 PM)
      15. 2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM)

1. Introduction

1.1 What is Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Apache Lucene is an open source project available for free download.
If Lucene is a pure Java framework why not use it inside Oracle Database JVM environment?

1.2 What is Lucene Domain Index

Lucene Domain Index is full integration of Lucene project running inside the Oracle database using Oracle JVM.

Oracle provides a full featured JVM inside your Oracle Database compliant with JDK 1.4 in 10g release and 1.5 in 11g.

OJVMDirectory is a replacement for Lucene's file system storage by a BLOB based storage, the name is related to the class which overrides (Directory.java), here a simple list of points take into account to choose this storage:
  • Using traditional File System for storing the inverted index is not a good option for some users, you don't have commit or rollback behavior, backup, etc.
  • Using BLOB for storing the inverted index running Lucene outside the Oracle database has a bad performance because there are a lot of network round trips and data marshaling.
  • Indexing relational data stores such as tables with VARCHAR2, CLOB or XMLType with Lucene running outside the database has the same problem as the previous point.
  • The JVM included inside the Oracle database can scale up to 10.000+ concurrent sessions without memory leaks or deadlock and all the operations on tables are in the same memory space!!
More on this, Oracle provides a Data Cartridge API (ODCI), also called Extensible Indexing mechanism because you can write your own Domain Index and integrate it with the Oracle Engine and optimizer.
There are some important points integrating Lucene by using ODCI:
  • Changes on rows are automatically notified to Lucene, now these changes are en-queued using Oracle AQ. User can control if these changes are applied OnLine or Deferred.
  • Oracle optimizer can choose a proper execution plan if there is a Domain Index created.
  • You can mix lcontains(),lhighlight() and lscore() operators in your queries.

1.3 Why do I use Lucene Domain Index?

Oracle include a full feature Enterprise Engine named Oracle Text made in C and fully integrated to the Oracle Engine, but:

on Oracle Text you can not:

  • control which functionality will be included into next release.
  • easily customize it for your needs.
  • index Index Organized Tables (IOT)
  • index joined tables
  • index unlimited extra columns
  • easily highlight text
  • index NCLOB and NVARCHAR data types
on 10g you can not:
  • index multiple columns in a same index
  • sort and filter by using indexed columns at index level
on 11g you can not:
  • filter by / sort by on columns of timestamp with TZ, commonly
    used in XDB because is the official data type for xsd:date type
using Lucene inside Oracle:
  • usually indexes are small because Lucene Domain Index do not store any column, except the rowid, inside Lucene inverted index structure. Using a rowid Oracle can lookup any column value faster than retrieve it from Lucene inverted index.
  • Support padding for Text columns
  • Support formatting (rounding/padding) for Number and Date/Time columns
  • You can create index on-line even in a standard edition databases (feature available en EE for Text)
  • Extending DefaultUserDataStore class an application can implement any data type mapping, specially BLOB which in common cases have non standard encoding
  • An experimental native REST WS can be used to query the index
  • Lucene inverted index is transactional, if a SQL operation is rolled back, the index will be consistent too, avoiding phantom reads or negative hits (rows which should be included as hit but was not included in Lucene index)
  • is a ready to use uptodate solution for any programming language, for example Ruby, .Net, Phyton or PHP.
  • an elegant solution for highlighting text use pipeline table functions.
  • a high level abstraction layer for Lucene IR library, developers only deal with SQL
  • transparent compression and encryption of Lucene storage if you enable Oracle Transparent Data Encryption and Secure File compression

2. Install

2.1 Requirements

  • JDeveloper 11g only if you want to edit Java code.
  • Ant 1.7.0
  • Sun JDK 1.5.0_05/1.4.2 ($ORACLE_HOME/jdk directory works fine as Java Home for compiling on 10g and 11g)
  • Linux/Windows Database Oracle 10g 10.2/11g production

2.2 Install binary distributions

Binary distributions are available at SourceForge.net and provides a very straightforward installation.

2.2.1 11g Binary Distribution

Edit your ~build.properties file with your Database values:
db.str=test
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4


db.str is your SQLNet connect string for your target database, check first with tnsping
This is an example environment setting before installing on 11g database
MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/11.1.0.6.0/db_1
ORACLE_SID=test
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG
Upload, install and test your code into the database
# ant install-ojvm
# ant test-domain-index


For Oracle 11g you can perform a post-installation step:
# ant jit-lucene-classes
This target force to translate all Lucene, Snowball and OJVMDirectory classes to assembler.
Instead of waiting that the database compile it by detecting most used classes or method.

2.2.2 10g Binary Distribution

First edit your ~/build.properties with something like this:
db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install

javac.debug=true
javac.source=1.4
javac.target=1.4


db.str property is a SQLNet connect string for the target database.
ORACLE_HOME environment setting is required and properly configured to an Oracle 10g database layout, finally execute ant without arguments.
Here an example of environment setting on 10g database
MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/10.2.0/db_1
ORACLE_SID=orcl
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG
If you are re-installing Oracle Lucene OJVM integration first drop any Lucene Domain Index not installed at Lucene's schema.
Default target will drop first Lucene schema if exists, additionaly (Recommended for production system) you can run "ant ncomp-ojvm" which translates all Lucene classes to C using JAccelerator, for example:
# ant ncomp-ojvm
# ant test-domain-index

2.3 Install Instructions to compile from sources

- Unpack or checkout Lucene sources.
- Checkout OJVM sources, by now only Anonymous CVS access is provided you can download from Source Forge servers with:

cd /tmp

cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login

cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm

- Copy to $LUCENE_ROOT/contrib

# cd $LUCENE_ROOT/contrib

# cp -rp /tmp/ojvm .

- Edit $LUCENE_ROOT/common-build.xml adding a target for creating a jar file with test sources.

<target name="jar-test" depends="compile-test">

    <jar destfile="${build.dir}/${final.name}-test.jar" basedir="${build.dir}/classes/test" excludes="**/*.java"/>

</target>

- Also edit above file at the target name test adding db.usr, db.pwd and db.str properties as System properties to be available for Lucene Domain Index JUnit suites.
  <target name="test" depends="compile-test" description="Runs unit tests">
    <fail unless="junit.present">
      ##################################################################
      JUnit not found.
      Please make sure junit.jar is in ANT_HOME/lib, or made available
      to Ant using other mechanisms like -lib or CLASSPATH.
      ##################################################################
    </fail>
............
      <!-- contrib/ojvm uses these system properties to connect to the target database  -->
      <sysproperty key="db.str" value="${db.str}"/>
      <sysproperty key="db.usr" value="${db.usr}"/>
      <sysproperty key="db.pwd" value="${db.pwd}"/>
............
    <delete file="${build.dir}/test/junitfailed.flag" />
   
  </target>

- (OPTIONAL) Update Lucene's BufferedIndexInput.BUFFER_SIZE according to your db_block_size init.ora parameter.
Before compile and upload Lucene core library you can change org.apache.lucene.store.BufferedIndexInput.BUFFER_SIZE constant to the value of your db_block_size init parameter, this change will improve reading performance by using same block size as the physical block size that your database use.
- Compile OJVM Directory sources and tests, these targets automatically copies all Lucene Domain Index required libraries from your $ORACLE_HOME and Internet. Starting with OJVM 2.4.0.1.x build.xml file automatically compiles all Lucene contrib modules dependency.

# cd $LUCENE_ROOT/contrib/ojvm

# ant jar-core

# ant jar-test

- Edit your ~/build.properties file with your Database values:

db.str=orcl

db.usr=LUCENE

db.pwd=LUCENE

dba.usr=sys

dba.pwd=change_on_install

javac.debug=true

javac.source=1.4

javac.target=1.4

db.str is your SQLNet connect string for your target database, check first with tnsping utility, also note that for 11g database user and password are case sensitive, so leave LUCENE in uppercase.
- Upload your code to the database

# ant install-ojvm


2.3.1 Generating Maven's artifacts

You can generate Lucene and OJVM Directory Maven's artifacts following previous one steps, then execute:
# ant generate-maven-artifacts

2.4 Optimizations

2.4.1 Using NCOMP on 10g

Is strongly recommended before going in production that install Oracle Lucene Domain Index NCOMPed in 10g databases. NCOMP automatically translate Lucene and OJVMDirectory Java code to assembler and finally install it as dynamic link library (.so/.dll) in your Oracle home. To do this simply execute this Ant task instead of install-ojvm target:
# ant ncomp-ojvm

2.4.2 Using JIT on 11g

First verify that your database parameter java_jit_enabled is TRUE. Oracle 11g includes a JIT technology which automatically translates most used Java methods to assembler. If you want to pre-compile all Lucene Java code to assembler and not wait for Oracle database detects common used code you can execute this target:
ant jit-lucene-classes
ant jit-oracle-classes

2.5 Testing Lucene Domain Index

Required grants for regular Oracle users

IMPORTANT: Before start using Lucene Domain Index grant this to any Oracle user rather than LUCENE:
-- connected as sysdba
begin
  dbms_java.grant_permission('SCOTT','SYS:java.util.logging.LoggingPermission', 'control', '' );
  commit;
end;
/



Lucene Domain Index have two kinds of test suites to check that everything is OK after installation.
First test suite which can be launched using Ant is pure SQL and use SQLPlus to work, to launch it simply execute:
[mochoa@mochoa ojvm]$ ant test-domain-index
Buildfile: build.xml
     [echo] Building ojvm...

test-domain-index:
     [exec]
     [exec] SQL*Plus: Release 11.1.0.6.0 - Production on Wed Dec 5 17:43:24 2007
     [exec]
     [exec] Copyright (c) 1982, 2007, Oracle.  All rights reserved.
     [exec]
     [exec]
     [exec] Connected to:
     [exec] Oracle Database 11g Release 11.1.0.6.0 - Production
     [exec]
     [exec]
     [exec] Table dropped.
     [exec]
     [exec]
     [exec] Table created.
     [exec]
     [exec] SQL> Disconnected from Oracle Database 11g Release 11.1.0.6.0 - Production
     [echo] See output at ../../build/testLuceneDomainIndex.txt
Except for the test which uses test_source_small table which outputs his log at the .trc files other will outputs his log information at ../../build/testLuceneDomainIndex.txt file.


Second test suite is a set of JUnit tests to simulate middle tier environments, it also use a connection pool. To start these suites run:
[mochoa@mochoa ojvm]$ ant -Ddb.usr=scott -Ddb.pwd=tiger -Ddb.str=test "-Djunit.includes=**/AllTests.java" ojvm-test
Buildfile: build.xml
     [echo] Building ojvm...

ojvm-test:
[echoproperties] #Ant properties
[echoproperties] #Wed Dec 05 17:56:30 ART 2007
.........
common.test:
    [junit] Testsuite: org.apache.lucene.index.TestDBIndex
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.883 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
 &nbsp;&nbsp; [junit] Inserted rows: 40 total bytes inserted: 421 avg text length: 10
    [junit] Index synced: IT1 elapsed time: 249 ms.
    [junit] Avg Sync time: 6
    [junit] Index optimized: IT1 elapsed time: 46 ms.
    [junit] Avg Optimize time: 1
    [junit] Row deleted 41, from: 10 to: 50 elapsed time: 2005 ms. Avg time: 48 ms.
    [junit] Index droped: IT1
    [junit] Table droped: T1
    [junit] ------------- ---------------- ---------------
.............
    [junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.158 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] iteration from: 13775 to: 13785
    [junit] Step time: 1291 ms.
    [junit] iteration from: 13785 to: 13795
    [junit] Step time: 157 ms.
    [junit] iteration from: 13795 to: 13805
    [junit] Step time: 144 ms.
    [junit] iteration from: 13805 to: 13815
    [junit] Step time: 147 ms.
    [junit] iteration from: 13815 to: 13825
    [junit] Step time: 145 ms.
    [junit] iteration from: 13825 to: 13835
    [junit] Step time: 147 ms.
    [junit] iteration from: 13835 to: 13845
    [junit] Step time: 145 ms.
    [junit] iteration from: 13845 to: 13855
    [junit] Step time: 150 ms.
    [junit] iteration from: 13855 to: 13865
    [junit] Step time: 278 ms.
    [junit] iteration from: 13865 to: 13875
    [junit] Step time: 146 ms.
    [junit] Elapsed time: 3159
    [junit] Hits: 18387
    [junit] Elapsed time: 653
    [junit] ------------- ---------------- ---------------
    [junit] Testsuite: org.apache.lucene.indexer.TestTableIndexer
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.685 sec
    [junit]
   [delete] Deleting: /u01/src/lucene-2.2.0/build/contrib/ojvm/test/junitfailed.flag

BUILD SUCCESSFUL
Total time: 8 minutes 22 seconds

org.apache.lucene.indexer.TestQueryHits use a table which is very big to create and destroy it at setup() and tearDown() methods. Before run this test create the table with:

create table test_source_big as (select * from all_source);

and his index on 10g with:

create index source_big_lidx on test_source_big(text)
         indextype is lucene.LuceneIndex       parameters('AutoTuneMemory:true;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');
Or in 11g with:
create index source_big_lidx on test_source_big(text)
         indextype is lucene.LuceneIndex
         parameters('AutoTuneMemory:true;FormatCols:line(0000);ExtraCols:line "line";Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;LobStorageParameters:PCTVERSION 0 DISABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

If you want to execute only one or some specific test override Ant's property junit.includes, for example:

ant -Ddb.usr=scott -Ddb.pwd=tiger -Ddb.str=test "-Djunit.includes=**/TestDBIndex.java" ojvm-test

Note that this argument is enclosed by "" to prevent Unix shell replacement.


3. Examples

IMPORTANT: Before start using Lucene Domain Index grant this to any Oracle user rather than LUCENE:

-- connected as sysdba
begin
  dbms_java.grant_permission('SCOTT','SYS:java.util.logging.LoggingPermission', 'control', '' );
  commit;
end;
/

3.1 Create a Lucene Domain Index

3.1.1 Single column index

Table example:
create table t1 (
f1 number,
f2 varchar2(200),
f3 varchar2(200),
f4 number unique);
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer');
Create a domain index on table t1 column f2 using SimpleAnalyzer as Lucene Analyzer.
After this DDL command is executed two new tables, one AQ queue and one index are at user's schema, named IT1$.T, IT1$QT, IT1$Q and IT1$DI respectively.
Other example but instead using some Lucene Analyzer, using Snoball Stemmer.
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English');
Create a domain index on table t1 column f2 using English Stemmer.

IMPORTANT: Lucene Domain Index name can not be larger than 21 characters.

This limiis due a limitation in Oracle DBMS AQ table name. Every Lucene Domain Index have associated a AQ queue named idx_name$Q and his queue table  idx_name$QT.

3.1.2 Multiple columns

For previous one table example you can also index extra columns passing the information as parameter to the index due Oracle 10g do not support Domain Index with compound columns, here example:
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;ExtraCols:F1 "f1"');
Creating an index with ExtraCols parameter cause that Lucene index both columns, master column f2 indexed as F2 and F1 indexed as "f1", as you can see below, at query section examples, lcontains() operator provides Lucene's Query Parser Syntax which have functionality for selecting multiples field using f1:text for example. Using  ExtraCols parameter imply that create  index operator performs a full scan on table t1 with a syntax like: SELECT ROWID,F2,F1 "f1"  FROM T1.
Because ODCI Api will not detect changes on other columns than the master, you need to create a trigger that fire an update on the master column when a change on ExtraCols list is detected. Here an example:
CREATE OR REPLACE TRIGGER L$IT1
BEFORE UPDATE OF f1 ON t1
FOR EACH ROW
BEGIN
      :new.f2 := :new.f2;
END;
/
Any changes on f1 also will force to change f2, then ODCI will notify Lucene that an specific rowid was updated, Lucene Domain Index based on his parameter definition will update the inverted index to reflect changes in both columns.

3.1.3 Multiple tables

Lucene Domain Index supports indexing in multiples column and multiples tables which can be joined in a natural form, it means defining a list of tables with ExtraTabs parameter, and a where condition with WhereCondition parameter.  Here an example:
create table t2 (
        f4 number primary key,
        f5 VARCHAR2(200));
create table t1 (
        f1 number,
        f2 VARCHAR2(4000),
        f3 number,
        CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
           REFERENCES t2(f4) ON DELETE cascade);
You can index both tables using t1 as master index definition with:
create index it1 on t1(f3) indextype is lucene.LuceneIndex
          parameters('ExtraCols:L$MT.f2 "f2",t2.f5 "f5";ExtraTabs:t2;WhereCondition:L$MT.f3=t2.f4');
Note that tables t1 and t2 are joined directly by a foreign key, so t2 could be consider as a satellite table of t1. With this set of parameters when ODCI Api detects a change on it1 master column (f3) a select like this is executed:
select L$MT.f3,L$MT.f2 "f2",t2.f5 "f5" from t1 L$MT,t2 where L$MT.rowid=? and L$MT.f3=t2.f4;
Bold parts of the query are injected by Lucene Domain Index implementation and italics parts are extracted from ExtraCols and ExtraTabs parameters. The table alias L$MT is automatically added by Lucene Domain Index to the master table, this alias is important to create complex joins with Object Tables which uses existsNode or extracValue operator, that functionality was added starting with 2.9.0.1.0 release.
With the above scenario a trigger for getting Lucene Index synced with changes in any columns defined at ExtraCols parameter is a bit complex, it requires a combination of two triggers:
CREATE OR REPLACE TRIGGER L$IT1
BEFORE UPDATE OF f2 ON t1
FOR EACH ROW
BEGIN
      :new.f3 := :new.f3;
END;
/
CREATE OR REPLACE TRIGGER LT$IT1
    AFTER UPDATE OF f5 ON t2
    FOR EACH ROW
    DECLARE
        ridlist sys.ODCIRidList;
    BEGIN
        SELECT ROWID
           BULK COLLECT INTO ridlist
           FROM T1 WHERE F3=:NEW.f4;
        LuceneDomainIndex.enqueueChange(USER||'.IT1',ridlist,'update');
    END;
/
First trigger is similar to the previous example, second trigger at the satellite table looks for all rowid at the master table who have references to satellite row, then it uses LuceneDomainIndex.enqueueChange procedure to notify Lucene Domain Index changes. sys.ODCIRidList is an special ODCI structure to hold a group of rowid.

3.1.4 Padding and formatting

Lucene Domain Index can be customized with a parameter named UserDataStore, it defines which class is responsible for creating Lucene documents. Lucene documents are a list of Field one for each column indexed plus and extra field named rowid stored compressed and untokenized. By default UserDataStore is defined to org.apache.lucene.indexer.DefaultUserDataStore.

Default User Data Store supports left padding for NUMBER or FLOAT columns, and left character padding for VARCHAR2 or CHAR columns. To define padding, FormatCols parameter at create or alter index DDL command can be used. Here an example:

create table t1 (f1 number primary key, f2 varchar2(200), f3 number(4,2)) ORGANIZATION INDEX;
insert into t1 values (1, 'ravi', 3.46);
insert into t1 values (3, 'murthy', 15.87);
commit;

create index it1 on t1(f2) indextype is lucene.LuceneIndex
  parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');
Above example shows that for F2 column all values will be automatically padded to 15 character (z) and F3 column using 00.00, then these rows will be indexed as Lucene documents:
Document<stored/compressed,indexed<rowid:*BAEAPBQCwQL+> indexed,tokenized<F2:zzzzzzzzzzzravi> indexed<F3:03.46>>
Document<stored/compressed,indexed<rowid:*BAEAPBQCwQT+> indexed,tokenized<F2:zzzzzzzzzmurthy> indexed<F3:15.87>>
For columns based on Oracle XMLType, FormatCols parameter can be used to define an XPath expression which controls a subset of XML nodes to be indexed.
create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));
commit;

create index it1 on t1(f1) indextype is lucene.LuceneIndex
  parameters('ExtraCols:F2;FormatCols:F1(000),F2(/emp/name)');
Above rows will be indexed as:
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA> indexed,tokenized<F1:001> indexed,tokenized<F2:ravi >>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB> indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >>
For columns of type VARCHAR/CHAR/CLOB and special string NOT_ANALYZED, NOT_ANALYZED_STORED, ANALYZED_WITH_OFFSETSANALYZED_WITH_POSITIONS and ANALYZED_WITH_POSITIONS_OFFSETS can be used as format, this constant tells to User Data Store class that a field will be stored or not as indexed untokenized, untokenized fields can be used then as sort field. Here an example:
create table emails (
 emailFrom VARCHAR2(256),
 emailTo VARCHAR2(256),
 subject VARCHAR2(4000),
 emailDate DATE,
 bodyText CLOB)
/

create index emailbodyText on emails(bodyText) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo"');

-- required to Sort by subject
alter index emailbodyText parameters('LogLevel:INFO;FormatCols:subject(NOT_ANALYZED),emailFrom(NOT_ANALYZED),emailTo(NOT_ANALYZED)');

The translation rules to Lucene Fields is:

  • NOT_ANALYZED - Field(name, value, Field.Store.NO, Field.Index.NOT_ANALYZED)
  • NOT_ANALYZED_STORED - Field(name, value, Field.Store.YES, Field.Index.NOT_ANALYZED)
  • ANALYZED - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO)
  • ANALYZED_WITH_VECTORS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES)
  • ANALYZED_WITH_OFFSETS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_OFFSETS)
  • ANALYZED_WITH_POSITIONS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS)
  • ANALYZED_WITH_POSITIONS_OFFSETS - Field(name, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS)
If no Format value is associated to a column of type VARCHAR/CHAR/CLOB it is converted to Lucene Filed as Field(name, val, Field.Store.NO, Field.Index.ANALYZED)


3.1.5 Functional columns

ExtraCols parameter also have a possibility to define functional column for Lucene Index which means, any SQL function valid in a select section is allowed. For example, using above table definition:
create index it1 on t1(f1) indextype is lucene.LuceneIndex
  parameters('ExtraCols:F2,extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),F2(/emp/name),id(00)');
Create this set of Lucene documents indexed:
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA> indexed,tokenized<F1:001> indexed,tokenized<F2:ravi > indexed,tokenized<id:01>>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB> indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >
indexed,tokenized<id:03>>
Note that a virtual column was defined and indexed as "id", this column then is available at lcontains operator.

3.1.6 Create OnLine index

If you put SyncMode:OnLine during create index DDL operation it will cause that Lucene Domain Index enqueues all rowids of the master table for indexing in batchs of BatchCount rows (default is 115). Immediately that the command returns the index is ready and a PLSQL AQ Callback will populate Lucene Index structure in background. For example:
create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('SyncMode:OnLine;LogLevel:WARNING;Stemmer:Spanish;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

3.1.7 Populate Index

During create index DDL statement using PopulateIndex:false causes that Lucene Index structure is created empty and the Domain Index is ready. Then you can call to alter index rebuild DDL statement to populate it. Here an example:
create index it1 on t1(f2) indextype is lucene.LuceneIndex
  parameters('PopulateIndex:false;LogLevel:ALL;IncludeMasterColumn:false;ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),id(00)');

-- A this point it1 index is ready but not populated
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;
-- Populate Index
alter index it1 rebuild parameters('Analyzer:org.apache.lucene.analysis.WhitespaceAnalyzer');
-- query, returns one row
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;

3.2 Alter

SQL DDL alter index command can be used with Lucene Domain Index to change any parameter after index creation time. Lucene Domain Index parameters are a simple list of name:value pairs stored into Lucene OJVMDirectory storage. If you want to remove any parameter from the storage pre-pending ~ in a parameter name is used.
Here some examples of alter index:
alter index it1
  parameters('MaxBufferedDocs:500;AutoTuneMemory:false');
Change Lucene Index Writer parameter MaxBufferedDocs to 500 and disable Auto Tune Memory functionality.
alter index it1
  parameters('MaxBufferedDocs:500;AutoTuneMemory:false;SyncMode:OnLine');
Similar to the previous one example but enabling SyncMode to OnLine.
alter index it1 parameters('~SyncMode:OnLine');
Disable SyncMode from the above example, you can get similar functionality setting SyncMode:Deferred which is the default value for SyncMode.

3.2 Rebuild

SQL DDL alter index allow you to rebuild an index from scratch, this is useful when Lucene Domain Index is damaged, corrupted or you need to change some parameter which is necessary to be applied to existing rows already indexed, for example Lucene Analyzer parameter.

3.2.1 Manual

Manual index rebuild is a typical way, here an example:
alter index it1 rebuild
  parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;MaxBufferedDocs:500;AutoTuneMemory:false);
Above example shows how to change Lucene Index Analyzer, if you change your index Analyzer its necessary to rebuild the complete index because you should not query an index with an analyzer different from the index time.

3.2.2 On Line

Alter index rebuild will not return up to the complete operation is finished. Rebuild On Line is a functionality for Oracle Index available in enterprise edition databases, but with a little trick you can rebuild Lucene Domain Index On Line too.
If you are working with SyncMode:Deferred you need to change to SyncMode:OnLine, then you can rebuild the index by using:
alter index it1 rebuild parameters('SyncMode:OnLine;MergeFactor:100;BatchCount:1000');
commit; -- notify change to AQ Callback
Rebuild command enqueues batchs of 1000 rowids of the master table (it1) for addition to Lucene Index structure then Lucene Domain Index AQ Callback will process these messages using background database process and automatically commit changes when it finish.

3.3 Drop

Dropping Lucene Index do not differs from dropping any other index. Just call:
drop index it1;
This operation implies drop Lucene Domain Index table, for above example IT1$T, and an AQ queue IT1$Q with his storage table IT1$QT. If the index is configured with SyncMode:OnLine, first the AQ Callback is disabled.
If something is wrong during index drop command you can add "force" at the end of the command to be sure that System's views will not have any reference to the index.

3.4 Querying

Lucene Domain Index define a new SQL operator named lcontains() with his ancillary operators lscore() and lhighlight(), his functionality is similar to Oracle Text contains and score operators.
Next example shows operator functionality and parameters.

3.4.1 Simple columns

For the table and index defined into 3.1.4/3.1.5 section a simple usage of lcontains and lscore is:
SQL> select lscore(1),f2 from t1 where lcontains(f1, 'F2:ravi',1) > 0;

 LSCORE(1)
----------
F2
--------------------------------------------------------------------------------------------------------------------------------------------
         1
<emp id="1">
  <name>ravi</name>
</emp>
SQL>
First parameter of lcontains operator is the column which have attached Lucene Domain Index, this is the master column of the index and is a default field for Query Parser syntax..
Second parameter is Lucene Query Parser syntax string, above table example have defined Lucene Domain Index at f1 columns, so its not default field for the query, with this definition to query for an string inside F2 column its necessary to explicit defined "F2:".
If you want to use lscore its necessary to specify as third argument in lcontains, a correlation id, it this example "1", this correlation id then match with lscore(1) to associate the ancillary operator to a proper lcontains.
If you are querying for the master column of the index you can simply omits column qualifier, for example:
SQL> select lscore(1) from t1 where lcontains(f1,'001',1)>0;

LSCORE(1)
----------
         1
SQL>
lcontains() operator must always be compared with >0.

3.4.2 Multiple columns

Query Parser Syntax supports many logical operator and term modifier, you can combine any of them with each column indexed. Here a practical example using table and index from section 3.1.4/3.1.5
SQL> select f1,lscore(1) sc,extractValue(f2,'/emp/@id') id from t1
  2  where lcontains(f1, '003 OR (F2:ravi AND id:01)',1)>0;

F1         SC                ID
---------- ---------------- -----------------------------
1            .577350259                                    1
3            .288675129                                    3
Note that first row match against column F2:ravi and functional column id:01, second row match with F1 equal to 003 (remember F1 qualifier its not necessary because is the master column of the index defined in 3.1.5)

3.4.3 Pagination

lcontains operator have an extension to Query Parser syntax to include in-line pagination information at Lucene Domain Index Hits result.
You can select an specific window (pagination) of your query injecting a Query Parser like range inside lcontains() operator. For example:
select /*+ DOMAIN_INDEX_SORT */ ... where lcontains(col,'rownum:[20 TO 40] AND word1',1)>0 order by lscore(1) DESC;
Lucene Domain Index implementation extracts automatically pagination information rownum:[n TO m] AND from the beginning of the query syntax and return to the Oracle optimizer a subset of 20 rowids. This extension provides a lot of performance improvement by eliminating the needs of using Oracle's Top-N syntax which need to collect all rowids and then filter to calculate the window.
Due inline pagination is an home made extension to Query Parser syntax have two limitations or know caveats:
  • rownum:[n TO m] AND must be at the beginning of Query Parser string and as is, we simple use string position of rownum and AND reserved keywords to extract start and stop index of the window.
  • pagination is concatenated using AND boolean operator, but its not strictly and AND operator because is a simple substring splitting operation, it means a priority for grouping AND OR are no applied to the first part, rownum:[n TO m] AND xx OR bb should be evaluated as (rownum:[n TO m] AND xx) OR bb rather than this Lucene will search xx OR bb and ODCI will extract n TO m window.
Note: DOMAIN_INDEX_SORT optimizer hint is required for Sorting, see next section.

3.4.4 Sort

Lucene provides sort over the result of a particular query, Lucene Domain Index provides sorting by using an extra argument at lcontains() operator. Here examples of sorting using emails table created in section 3.1.4:
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  2  where lcontains(bodytext,'security','subject:ASC',1)>0;

SUBJECT
--------------------------------------------------------------------------------------------------------------------------------------------
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
lucene injection

Elapsed: 00:00:00.04
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  2  where lcontains(bodytext,'security','subject:DESC',1)>0;

SUBJECT
--------------------------------------------------------------------------------------------------------------------------------------------
lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection

Elapsed: 00:00:00.17
Sort parameter syntax is a coma separated string of field[:ORDER[:TYPE]] values, fields included in sorting spec should be NOT_ANALYZED or NOT_ANALYZED_STORED, see FormatCols argument above. ORDER can be ASC or DESC, default value is ASC. TYPE can be string, float or int, starting with Lucene 2.9.0 default value is string.
Using above table a little complicated sort spec is:
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2  where lcontains(bodytext,'security','subject:ASC:string,emailFrom:DESC:string',1)>0;

SUBJECT                                      EMAILFROM
----------------------------------------- -------------------------------------------------------------------------
Re: lucene injection                        lucenelist2005@danielnaber.de
Re: lucene injection                        lucenelist2005@danielnaber.de
Re: lucene injection                        erik@ehatchersolutions.com
Re: lucene injection                        codeshepherd@gmail.com
lucene injection                              codeshepherd@gmail.com

Elapsed: 00:00:00.06
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2  where lcontains(bodytext,'security','subject:ASC:string,emailFrom:ASC:string',1)>0;

SUBJECT                                        EMAILFROM
------------------------------------------ ------------------------------------------------------------------------
Re: lucene injection                         codeshepherd@gmail.com
Re: lucene injection                         erik@ehatchersolutions.com
Re: lucene injection                         lucenelist2005@danielnaber.de
Re: lucene injection                         lucenelist2005@danielnaber.de
lucene injection                              codeshepherd@gmail.com

Elapsed: 00:00:00.05
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2  where lcontains(bodytext,'security',1)>0;

SUBJECT                                         EMAILFROM
------------------------------------------- ----------------------------------------------------------------------
lucene injection                                codeshepherd@gmail.com
Re: lucene injection                          erik@ehatchersolutions.com
Re: lucene injection                          codeshepherd@gmail.com
Re: lucene injection                          lucenelist2005@danielnaber.de
Re: lucene injection                          lucenelist2005@danielnaber.de

Elapsed: 00:00:00.09
Latest query doesn't include sort so it sorted by score. An abbreviated syntax for sort string is ASC or DESC which means sort by score ascending or descending, this short format is equivalent to use order by syntax with lscore operator, for example:
SQL> SELECT lscore(1),subject FROM emails
  2  where lcontains(bodytext,'security',1)>0;

 LSCORE(1)    SUBJECT
--------------   ------------------------------------------------------------------------------------------------------
.241440386    lucene injection
.22763218      Re: lucene injection
.199178159    Re: lucene injection
.140840232    Re: lucene injection
.140840232    Re: lucene injection

Elapsed: 00:00:00.10
SQL> SELECT lscore(1),subject FROM emails
  2  where lcontains(bodytext,'security',1)>0 order by lscore(1) asc;

 LSCORE(1)    SUBJECT
-------------    -----------------------------------------------------------------------------------------------------
.140840232    Re: lucene injection
.140840232    Re: lucene injection
.199178159    Re: lucene injection
.22763218      Re: lucene injection
.241440386    lucene injection

Elapsed: 00:00:00.11
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
  2  where lcontains(bodytext,'security','subject:DESC',1)>0;

 LSCORE(1)    SUBJECT
-------------    ----------------------------------------------------------------------------------------------------
.241440386    lucene injection
.22763218      Re: lucene injection
.199178159    Re: lucene injection
.140840232    Re: lucene injection
.140840232    Re: lucene injection

Elapsed: 00:00:00.07
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
  2  where lcontains(bodytext,'security','subject:ASC',1)>0;

 LSCORE(1)     SUBJECT
-------------     -------------------------------------------------------------------------------------------------
.140840232     Re: lucene injection
.140840232     Re: lucene injection
.199178159     Re: lucene injection
.22763218       Re: lucene injection
.241440386     lucene injection

Elapsed: 00:00:00.07
First example uses default sort by score descend, second example uses order by syntax overriding default sort and change it to score ascend, the other ones are equivalent but using lcontains sort argument string.
Note that if you are using lcontains sort string, you has to add DOMAIN_INDEX_SORT optimizer hint, this hint tells Oracle optimizer that the order of the rows will be dictated by Lucene Domain Index.
The usage of lscore(anc_id) in conjunction with lcontains(column,query,sort_str,anc_id) make not sense and produce an extra overhead on the score computation which can be avoided, it means if you are querying Lucene Domain Index and want to get the result ordered by other columns rather than the relevance order why to compute it, AVOID lscore() function in the select list and you will get a query faster. For example:
SQL> SELECT /*+ DOMAIN_INDEX_SORT BAD */ lscore(1),subject FROM emails
2  where lcontains(bodytext,'security','subject:ASC',1)>0;
 LSCORE(1)     SUBJECT
-------------     -------------------------------------------------------------------------------------------------
.140840232     Re: lucene injection
.140840232     Re: lucene injection
.199178159     Re: lucene injection
.22763218       Re: lucene injection
.241440386     lucene injection
Elapsed: 00:00:00.07

SQL> SELECT /*+ DOMAIN_INDEX_SORT GOOD */ subject FROM emails
2  where lcontains(bodytext,'security','subject:ASC')>0;
 SUBJECT
 -------------------------------------------------------------------------------------------------
 Re: lucene injection
 Re: lucene injection
 Re: lucene injection
 Re: lucene injection
 lucene injection
Elapsed: 00:00:00.02

3.4.5 Count Hits Function

Count hits function is a Lucene Domain Index optimization to replace SQL count(*) functionality. This is extremely fast because there is no rowid information passed from Lucene Data Cartridge to Oracle Engine to count matching rows. Here an example:
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits  from dual;

      HITS
----------
         5

Elapsed: 00:00:00.02
First argument of count hits function is an string with Lucene Domain Index syntax (IDX_NAME), second argument is Query Parser syntax string equals to second argument of lcontains function, optionally you can use a three argument version of countHits function to use index in another schemas, first argument is the schema, second argument is the index name and last one is the Query Parser syntax string. After a count hits function call you can use a select with lcontains function, if count hits query matchs with lcontains query, lcontains will have a cached information for returning matching rowids. Following some examples of count hits an his correlated query using caching results:
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits  from dual;

      HITS
----------
         5

Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
  2  where lcontains(bodytext,'security',1)>0;

EMAILFROM
--------------------------------------------------------------------------------------------------------------------------------------------
codeshepherd@gmail.com
erik@ehatchersolutions.com
codeshepherd@gmail.com
lucenelist2005@danielnaber.de
lucenelist2005@danielnaber.de

Elapsed: 00:00:00.08
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') from dual;

LUCENEDOMAININDEX.COUNTHITS('EMAILBODYTEXT','SECURITY')
------------------------------------------------------------------------------
                                                                             5

Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
  2  where lcontains(bodytext,'security','emailFrom:ASC',1)>0;

EMAILFROM
--------------------------------------------------------------------------------------------------------------------------------------------
codeshepherd@gmail.com
codeshepherd@gmail.com
erik@ehatchersolutions.com
lucenelist2005@danielnaber.de
lucenelist2005@danielnaber.de

Elapsed: 00:00:00.04
In both queries lcontains found a cached hits structure evaluated by count hits function. Lucene Domain Index stores cached hits information, to localize it uses a key compounded by sort_string(QueryParser.toString()) so both arguments of count hits and lcontains should match to re-use a cached hits structure. For last query example the string emailFrom:(security) is used as key.

3.4.6 First Rows Optimizer Hint

Starting with 2.4.0.1.0 release we have replaced deprecated Lucene Hits class by TopDocs class. If you use FIRST_ROWS optimizer hint in conjuction with lcontains inline pagination Lucene Domain Index will call to TopDocs to get only the first M hits. For example:
SQL> select /*+ FIRST_ROWS DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title')
         from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

FIRST_ROWS and rownum:[1 TO 10] tells to Lucene Domain Index that performs a Lucene Query for the first 10 hits only. Next query with rownum:[10 TO 20] will have most of the Lucene structures cached in memory such as the Searcher and the ROWID<->Lucene DocID association, but it will re-query Lucene Index to get first 20 Hits (1..20), this cache miss behavior of Hits could be interpreted as bad solution but is extremely useful if 90% of query only visits the first page of the hits, typical behavior on Internet Search.
In the other hand if you omits FIRST_ROWS optimizer hint, Oracle by default switch to ALL_ROWS mode which means, if you are using pagination  (rownum:[n TO m]) with m greater than 2000, Lucene Domain Index will fetch m first hits, but if m is lower than 2000, Lucene Domain Index will try to fetch  by default 2000 hits. The magic number 2000 is due Oracle ODCI API calls to ODCIFetch routine in batch of 2000 rowids.
If FIRST_ROWS and in-line pagination are not included in query, Lucene Domain Index works in batch of 2000 hits causing several cache miss in a full scan mode. For example given a query:
SQL> select count(*) from pages where lcontains(object_value,'musica tango rock')>0;
causes that Lucene Domain Index fetch the first 2000 hits, finally with the information that the hits length is 2736 it re-fetch (cache miss) the 2736 hits. Obviously you can use LuceneDomainIndex.countHits() function to count hits faster than the previous query.

3.4.6 Highlighting

lhighlight ancillary operator works as lscore but returning a VARCHAR2 text with the words highlighted during the evaluation of lcontains function, the tag used to remark matching words is not customizable yet and is <B>, also the fragment separator and the maximum number of fragments are constant (... and 4, respectively). Starting with 2.4.1.1.0 release it have parameters customizable through alter index ... parameters() DDL command to change. Highlighting example:
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
  2  FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

TXT                                                                                                                    SC                SUBJECT
On Dec 21, 2006, at 4:56 AM, Deepan wrote:> I am bothered about <B>security</B>  .27477634      Re: lucene injection
problems with lucene. Is it vulnerable to> any kind of injection like <B>mysql</
B> injection? many times the query from> user is passed to lucene for search wit
hout validating.Rest easy.  There are no known <B>security</B> issues with Lucen
e, and ithas even undergone a recent static code analysis by Fortify (see theluc
ene-dev e-mail list
Highlighting only works with columns of type VARCHAR2, CLOB and XMLType. You can perform highlighting operation even if your master columns is not indexed/stored, for example for an index created with:

create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:INFO;
Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
ExtraCols:extractValue(object_value,''/page/title'') "title",
extractValue(object_value,''/page/revision/comment'') "comment",
extract(object_value,''/page/revision/text/text()'') "text",
extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
FormatCols:revisionDate(day);IncludeMasterColumn:false;
LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
SQL> select /*+ DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title') from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

<page xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <title><B>Música</B> de Argentina... [[Latinoamérica|latinoamericanos]] con más desarrollo en su [[<B>música</B>]].

Se encuentra una gran... argentinos, un instrumento tradicional andino]]
Aún se mantiene la <B>música</B> de los [[Indígenas_en_Argentina... de grandes corrientes de [[inmigración|inmigrantes]] europeos, la <B>música</B> argentina se enriqueció

Música de Argentina
 musical emparentado con la [[habanera]] y el [[<B>tango</B> (<B>música</B>)|<B>tango</B>]].

==Diferencias con el <B>tango</B>==

Aunque tanto la milonga como el <B>tango</B> están en [[compás]] de 2/4, las 8 [[semicorchea]]s de la milonga están distribuidas en 3 + 3 + 2 en cambio el <B>tango</B> posee un ritmo más «cuadrado». Las letras...]] criticó en algún momento el <B>tango</B> y prefirió la milonga, que no trasmite la melancolía

Milonga (género musical)
Index creation DDL have IncludeMasterColumn:false, which means the whole XMLType representation of the Spanish Wikipedia page dump is not indexed only the virtual columns title, comment, text and revisionDate are processed by Lucene, but TextHighlight Java function attached to lhighlight operator receives the XMLType from RDBMS engine, so it call to Lucene Highlighter class with the whole XMLType object (note that page titles are in bold only to separate rows at the output).
Parameters supported by highlighting functions are:
  • Formatter, a valid class name which implements Lucene Interface Formatter and with a constructor with no arguments, default value org.apache.lucene.search.highlight.SimpleHTMLFormatter.
  • MaxNumFragmentsRequired, number of text fragments returned by Highlight function, default value 4.
  • FragmentSize, size of each fragment returned, default value 100.
  • FragmentSeparator, String used as fragment separator, default is "...". Note that you can not use ";" or ":" as fragment separator because are used as parameter and value delimiters into alter index ... parameters(..) DDL statement.
There is no customization allowed by passing constructor arguments to Formatter class, but you can easily creates your own Formatter which call to SimpleHTMLFormatter with arguments, your Formatter will look like:
create or replace and compile java source named "org.apache.lucene.search.highlight.MyHTMLFormatter" as
package org.apache.lucene.search.highlight;

public class MyHTMLFormatter extends SimpleHTMLFormatter {
  public MyHTMLFormatter() {
    super("<span class=\"myhighlightclass\">","</span>");
  }
}
/
show errors

alter index emailbodyText
parameters('Formatter:org.apache.lucene.search.highlight.MyHTMLFormatter;MaxNumFragmentsRequired:3;FragmentSeparator:...;FragmentSize:50');

3.4.7 Highlighting using pipeline table functions

phighlight and rhighlight provides a more general usage patterns of Lucene highlighting functionality. phighlight receives an SQL query as string and performs highlighting in set of user defined columns on the query result. rhighlight receives a SYS_REFCURSOR argument and performs highlighting in a set of user defined query columns, unlike phighlight, rhighlight requires that the user defined a return type of the query, usually a TABLE OF collection, because with a SYS_REFCURSOR argument there is no option to know at compilation time the return type of the query.
Both phighlight and rhighlight support highlighting parameters defined during create index or alter index DDL statements, see 3.4.6 section for more information.
Here two examples of highlighting features by using pipeline table functions, table emails is the example table/index of the section 3.1.4:
SELECT * FROM
TABLE(phighlight(
        'EMAILBODYTEXT',
        'lucene OR mysql',
        'SUBJECT,BODYTEXT',
        'select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
         from eMails e where lcontains(bodytext,''security OR mysql'',''subject:ASC'',1)>0'
    ));
SELECT * FROM
TABLE(rhighlight(
        'EMAILBODYTEXT',
        'lucene OR mysql',
        'SUBJECT,BODYTEXT',
        'EMAILRSET',
        CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
        from eMails e where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0)
    ));

First three arguments of both pipeline function are equals, the Lucene Domain Index used, the Lucene Query Syntax argument (should match with lcontains argument) and finally the columns of the query which will be highlighted.
Last arguments are, for phighlight is a VARCHAR2 type with SQL query to be executed by DBMS_SQL package, note the double single quote used as escape character sequence to encode SQL single quotes char.
For rhighlight two arguments are required the type returned by the cursor, in this example is EMAILRSET defined as:
CREATE TYPE EMAILR AS OBJECT
(
 sc NUMBER,
 emailFrom VARCHAR2(256),
 emailTo VARCHAR2(256),
 subject VARCHAR2(4000),
 emailDate DATE,
 bodyText CLOB
);

CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;
Note that EMAILR is record which holds all columns of table EMAILS plus the score returned by lscore() function, then EMAILRSET is simple collection type TABLE OF EMAILR which is required type for CURSOR value.
And finally last argument is of CURSOR type which means any SQL query.

3.4.8 More like this functionality

More like this Lucene Functionality is wrapped using function:
MoreLike.this(index_name IN VARCHAR2,
                   x IN ROWID,
                   f IN NUMBER DEFAULT 1,
                   t IN NUMBER DEFAULT 10,
                   minTermFreq IN NUMBER DEFAULT 2,
                   minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist
where index_name could be owner,index_name pair as in other Lucene Domain Index procedures. A typically uses case is:
select rowid,lscore(1),text from test_source_big where lcontains(text,'"procedure java"~10',1)>0 order by lscore(1) desc;
AAAOaPAAEAAAAnnABV 1.00000003 procedure (C, Java or PL/SQL), optionally qualified
AAAOaPAAEAAAA0aAAV  .84852819  STATIC PROCEDURE refreshParameterCache as LANGUAGE JAVA NAME
.........
declare
  ridlist sys.odciridlist;
begin
  ridlist := MoreLike.this(index_name=>'SOURCE_BIG_LIDX',x=>'AAAOaPAAEAAAAnnABV',minTermFreq=>1);
  FOR i IN (select rowid,text from test_source_big where rowid in (select * from table(ridlist_table(ridlist)))) LOOP
    dbms_output.put_line('rowid: '||i.rowid||' text: '||i.text);
  END LOOP;
end;
/
rowid: AAAOaPAAEAAAAhLAAc text:   --  after issuing insert, update, delete or anonymous PL/SQL calls
rowid: AAAOaPAAEAAAAjrAAo  text:   -- QUALIFIED_SQL_NAME
rowid: AAAOaPAAEAAAAk5AAe text:   --    ORA-06502: PL/SQL: numeric or value error: character string buffer
.....
rowid: AAAOaPAAEAAAAtXAAb text:   --  The name of the Java class, PL/SQL package or object type implementing
Note that the anonymous PL/SQL block gets the first ROWID returned from the first query as pivot, then expands the result set with other rows which also includes terms like "procedure (C, Java or PL/SQL), optionally qualified", "C" is not take into account due was eliminate as stop word.
Refers to the Appendix D.6 for a full explanation of each parameter.

3.4.9 Facets

Starting with Lucene Domain Index 2.4.1.1.0, Lucene Facets functionality is available through an SQL aggregate function lfacets():
lfacets(index_name_and_categories IN VARCHAR2
) RETURN LUCENE.agg_tbl
where index_name_and_categories is encoded string with the Lucene Index name and categories, aggregated function only accepts one scalar value as argument so we need to encode the index and categories in a coma separated list, for example:
SQL> select lfacets('SOURCE_BIG_LIDX,TEXT:procedure,TEXT:java') from dual;
Using the index created on the example of section 2.5 Testing Lucene Domain Index index name can be SCHEMA.IDX_NAME sintax, categories can be one or two and are expressed in Lucene Query Syntax, in the above example TEXT is the index column procedure is the main category and java the sub category.
Creating a table with categories and linking the rows with parent is an option to automatically generate facets, for example:
create table source_categories (
 cat_code    number(4),
 cat_name    varchar2(256),
 cat_parent  number(4),
 CONSTRAINT PK_SOURCE_CATEGORIES PRIMARY KEY (cat_code),
 CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
      REFERENCES source_categories (cat_code)
);
insert into source_categories values (1,'TEXT:procedure',null);
insert into source_categories values (2,'TEXT:function',null);
...
insert into source_categories values (6,'TEXT:java',1);
insert into source_categories values (7,'TEXT:(pl sql)',1);
insert into source_categories values (8,'TEXT:wrapped',1);
...
insert into source_categories values (21,'line:[1 TO 1000]',1);
insert into source_categories values (22,'line:[1001 TO 2000]',1);
insert into source_categories values (23,'line:[2001 TO 3000]',1);

Now we can query above table calling to lfacets with the category and sub category:
SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||
   case level when 1 then cat_name
   ELSE  PRIOR cat_name||','|| cat_name
   END
   )), cat_code,level
   FROM source_categories
   start with cat_parent is null
   CONNECT BY PRIOR cat_code = cat_parent
   group by cat_code,level;

LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASELEVELWHEN1THENCAT_NAMEELSEPRIORCAT_NAME||','||CAT_NAMEEND))    CAT_CODE    LEVEL
TEXT:procedure(5116)                                                                                                                                                                  1          1
TEXT:function(5574)                                                                                                                                                                      2          1
TEXT:trigger(96)                                                                                                                                                                            3          1
TEXT:package(860)                                                                                                                                                                       4          1
TEXT:(object type)(5140)                                                                                                                                                              5          1  
TEXT:procedure,TEXT:java(9)                                                                                                                                                        6          2
.....
TEXT:procedure,line:[1 TO 1000](3)                                                                                                                                             21          2
TEXT:procedure,line:[1001 TO 2000](615)                                                                                                                                   22          2
...
SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||          
   case level when 1 then cat_name                      
   ELSE  PRIOR cat_name||','|| cat_name                 
   END                                                  
   )), cat_parent                                       
   FROM source_categories                               
   start with cat_parent is null                        
   CONNECT BY PRIOR cat_code = cat_parent               
   group by cat_parent;                                 

LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASELEVELWHEN1THENCAT_NAMEELSEPRIORCAT_NAME||','||CAT_NAMEEND))       CAT_PARENT
-----------------------------------------------------------------------------------------------------------------------------------------       ---------------
TEXT:procedure,TEXT:java(11),TEXT:procedure,TEXT:(pl sql)(70),TEXT:procedure,line:[1 TO 1000](3),TEXT:procedure,TEXT:wrapped(21),TEXT:proced
ure,line:[1001 TO 2000](675),TEXT:procedure,line:[3001 TO 4000](105),TEXT:procedure,line:[4001 TO 5000](10),TEXT:procedure,line:[2001 TO 300
0](199)    1
TEXT:function,TEXT:java(22),TEXT:function,TEXT:wrapped(85),TEXT:function,line:[1 TO 1000](0),TEXT:function,TEXT:(pl sql)(87),TEXT:function,l
ine:[1001 TO 2000](835),TEXT:function,line:[3001 TO 4000](21),TEXT:function,line:[4001 TO 5000](0),TEXT:function,line:[2001 TO 3000](338)    2
TEXT:trigger,TEXT:java(1),TEXT:trigger,line:[1 TO 1000](0),TEXT:trigger,TEXT:wrapped(0),TEXT:trigger,TEXT:(pl sql)(1),TEXT:trigger,line:[100
1 TO 2000](33),TEXT:trigger,line:[3001 TO 4000](0),TEXT:trigger,line:[4001 TO 5000](0),TEXT:trigger,line:[2001 TO 3000](0)    3
TEXT:package,TEXT:java(7),TEXT:package,line:[1 TO 1000](0),TEXT:package,TEXT:(pl sql)(25),TEXT:package,TEXT:wrapped(137),TEXT:package,line:[
1001 TO 2000](54),TEXT:package,line:[3001 TO 4000](5),TEXT:package,line:[4001 TO 5000](0),TEXT:package,line:[2001 TO 3000](5)    4
TEXT:(object type),TEXT:java(56),TEXT:(object type),TEXT:(pl sql)(106),TEXT:(object type),line:[1 TO 1000](1),TEXT:(object type),TEXT:wrappe
d(76),TEXT:(object type),line:[1001 TO 2000](441),TEXT:(object type),line:[4001 TO 5000](0),TEXT:(object type),line:[3001 TO 4000](28),TEXT:
(object type),line:[2001 TO 3000](119)    5
TEXT:procedure(5574),TEXT:(object type)(5584),TEXT:package(868),TEXT:trigger(114),TEXT:function(6167)

6 rows selected.

Note that we are using ljoin() function which convert agg_tbl type to a coma separated string plus his cardinality. First row do not have a sub category because parent column is null, so 5116 is a number of rows which includes the text procedure, last row showed included a category and sub category, TEXT:procedure,line:[1001 TO 2000] implies the bit AND intersection between the set of rows which includes procedure against a set of rows which match with line[1001 TO 2000], the group by cat_code causes that the oracle ODCI API call first to calculate the bit set for procedure and iterate over all his sub categories, java, pl sql, wrapped, doing the bit AND intersections, this is fast and once the facets is computed is stored as Filter in Lucene Domain Index memory structures.
When a number of rows or the amount of categories is big we can use a materialized view to work as cache of the facets computation. For example:
CREATE MATERIALIZED VIEW source_facets
AS
select ljoin(lfacets('SOURCE_BIG_LIDX,'||
   case level when 1 then cat_name
   ELSE  PRIOR cat_name||','|| cat_name
   END
   )), cat_code,level
   FROM source_categories
   start with cat_parent is null
   CONNECT BY PRIOR cat_code = cat_parent
   group by cat_code,level;
Now source_facets materialized view can be queried as any other table and his access will be too fast. The materialized view then can be refreshed by the application at an specific point in time.

3.4.10 Terms

Starting with Lucene Domain Index 2.9.1.1.0, a pipeline table function high_freq_terms():
FUNCTION high_freq_terms(index_name VARCHAR2,
                       term_name 
VARCHAR2,
                       num_terms 
NUMBER) RETURN term_info_set
is available for getting the Top-N used terms on the index or in a particular field. term_info_set is defined as:
TYPE term_info AS OBJECT (
  term       
VARCHAR2(4000),
  docFreq 
NUMBER(10)
);
TYPE term_info_set AS TABLE OF term_info;
You can query your index using:
select * from table(high_freq_terms('SOURCE_BIG_LIDX','TEXT',10)) order by docFreq desc;
select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX','line',100));



3.5 Synchronize

Working with SyncMode:Deferred you has to manually synchronize your index, it means update Lucene Domain Index structure applying pending changes such as insert and update. Deletes operations are always applied due ODCI Api do not accept rowid of deleted rows.
Here an example:
begin
  LuceneDomainIndex.sync(USER||'.IT1');
  commit; -- release locks
end;
LuceneDomainIndex.sync procedure requires an argument of type VARCHAR2 with the index object name, index object name are usually capitalized and have the syntax SCHEMA_OWNER.IDX_NAME.
Synchronize operation could raise an exception if some rows being indexed are locked for update, in that case you have release  first locked rows and  re-sync the index.
An exclusive lock at Lucene Index storage is obtained during index synchronization, so you has to commit or rollback the connection immediately after this operation to release exclusive lock.
Since Lucene Domain Index 2.4.0.1.0 you can use LuceneDomainIndex.sync('IT1') or LuceneDomainIndex.sync(USER,'IT1'), both procedure are equivalent.
Note: Due a limitation on SYS.ODCIRidList() array you can enqueue more than 32767 additions or deletions, an update is counted as one deletion plus one addition by Lucene implementation code. This limitation will be removed in future releases of Lucene Domain Index.

3.6 Optimize

Optionally you can optimize Lucene Index storage, for doing that execute:
begin
  LuceneDomainIndex.optimize(USER||'.IT1');
  commit; -- release locks
end;
Like sync operation this procedure get an exclusive lock at Lucene Index storage table and perform an optimization of Lucene Index merging multiples segment in new one for example. You can still performing select operation (read-only) using Lucene Domain Index during optimization time, Oracle concurrency system (redo logs) provides you this functionality, once you perform a commit operation any other concurrent session will automatically see index changes.

3.6 XMLDB Export

You can perform an XMLDB Export operation for your Lucene Domain Index, this operation provides an easy way to get Lucene Domain Index information available as Lucene File option. Once the operation is done you can get all the files from you database using WebDAV explorer or FTP. For example:
begin
  LuceneDomainIndex.xdbExport('IT1');
  commit; -- makes change visible to Ftp or WebDAV
end;
Your index will be visible at /public/lucene/SCOTT.IT1 directory for example.
Once you copy this files into the file system you can open it with any Lucene compatible application like Luke. Here some screen shots of Luke using Lucene Domain Index export information.


3.7 Exporting/Importing functional index with exp/imp Oracle tools

You can perform an Oracle exp operation for your Lucene Domain Index. Oracle exp tool performs by default functional index for every table being exported during the backup process. As I mention early Lucene Domain Index creates a table named IDX_NAME$T which have Lucene file storage replaced by BLOB, also a DBMS AQ is created during the index creation time, this queue is associated to a table IDX_NAME$QT, both tables have a flag marked as SECONDARY, which means that you can not export these tables alone, but they are automatically includes when Lucene Domain Index is included into the export.
During import operation Oracle re-create the index using a create index ... parameters('your lucene parameters') DML statement, all Lucene Domain Index parameters are included except for the parameter PopulateIndex which always is stored as false into Oracle System's views. This parameter is altered intentional by Lucene Domain Index because if its set to true, during import operation Lucene Domain Index will try to re-create the Lucene Index structure instead of using the information restored into IDX_NAME$T table.
Alternative to XMLDB Export or Oracle exp tool you can also exports your Lucene Domain Index storage using a create table as ... DML statement. For example:
SQL> create table SOURCE_BIG_LIDX$T$BK as (select * from SOURCE_BIG_LIDX$T);
... you can export now using exp tool SOURCE_BIG_LIDX$T$BK because is regular table ...
-bash-3.2$  exp                                                       

Export: Release 10.2.0.3.0 - Production on Fri Mar 27 02:46:18 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.


Username: scott/tiger

Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production
Enter array fetch buffer size: 4096 >                            

Export file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp

(2)U(sers), or (3)T(ables): (2)U > 3

Export table data (yes/no): yes > yes

Compress extents (yes/no): yes >

Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)       

About to export specified tables via Conventional Path ...
Table(T) or Partition(T:P) to be exported: (RETURN to quit) > SOURCE_BIG_LIDX$T$BK

. . exporting table           SOURCE_BIG_LIDX$T$BK         19 rows exported
Table(T) or Partition(T:P) to be exported: (RETURN to quit) >              

Export terminated successfully without warnings.
.... Now you can drop your index and re-create again without populating it ....
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

  COUNT(*)
----------
      6167

SQL> drop index SOURCE_BIG_LIDX;

Index dropped.

SQL> create index source_big_lidx on test_source_big(text)
  2  indextype is lucene.LuceneIndex                      
  3  parameters('PopulateIndex:false;AutoTuneMemory:true;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');

Index created.

SQL> drop table SOURCE_BIG_LIDX$T$BK;

Table dropped.
....  Restore your .dmp now and check again if your index returns a correct result ....
-bash-3.2$ imp scott/tiger                                                  

Import: Release 10.2.0.3.0 - Production on Fri Mar 27 02:49:40 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.


Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production

Import file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp

Enter insert buffer size (minimum is 8192) 30720>

Export file created by EXPORT:V10.02.01 via conventional path
import done in US7ASCII character set and AL16UTF16 NCHAR character set
import server uses AL32UTF8 character set (possible charset conversion)
List contents of import file only (yes/no): no >                       

Ignore create error due to object existence (yes/no): no >

Import grants (yes/no): yes >

Import table data (yes/no): yes >

Import entire export file (yes/no): no > yes

. importing SCOTT's objects into SCOTT
. importing SCOTT's objects into SCOTT
. . importing table         "SOURCE_BIG_LIDX$T$BK"         19 rows imported
Import terminated successfully without warnings.                          
.... Check first that your index do not have information and populate them with Lucene Index information ....
SQL> conn scott/tiger
Connected.           
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

  COUNT(*)
----------
         0

SQL> truncate table SOURCE_BIG_LIDX$T;

Table truncated.

SQL> insert into SOURCE_BIG_LIDX$T (select * from SOURCE_BIG_LIDX$T$BK);

19 rows created.
SQL> exit
..... and connect again to refresh Lucene Domain Index in memory structures ....
SQL> conn scott/tiger
Connected.
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

  COUNT(*)
----------
      6167

As you can see the Lucene Domain Index structure can be export alone without exporting the master table, this is useful when you are upgrading Lucene Domain Index that requires that all index need to be dropped first and you don't want to re-create a very big index.

4. Locking and Performance

4.1 Lock used by Lucene Domain Index


Operation
Base Table (row/table)
Index Table (SCHEMA.IDX$T)
Queue Table (SCHEMA.IDX$QT)
Insert
X/RX (1)
NONE
NONE
Update
X/RX
NONE
NONE
Delete
X/RX X/RS(updateCount)|X/RX (3)
NONE
Manually Sync
X/RS (2)
X/RS(updateCount)|X/RX
DBMS_AQ.BLOCKED (4)
Automatically Sync
X/RS X/RS(updateCount)|X/RX DBMS_AQ.BLOCKED
Optimize
NONE
X/RS(updateCount)|X/RX NONE
  1. X = Row exclusive lock at the row being inserted, RX = Table row exclusive lock.
  2. X = Row exclusive lock at the row being indexed, RS = Table row share lock. A select ... for update no wait is performed at all rows being added to Lucene Index.
  3. X/RS is performed at the row where name='updateCount', this is writer lock semaphore of Lucene Index and provide serialize write operations. X/RX is performed at many rows of this table because Lucene is created and deleting many files.
  4. To perform massive dequeue operations at DBMS AQ queue Sync scan this queue with DBMS_AQ.BLOCKED option.


4.2 Performance tips

4.2.1 Index Writer parameters

Lucene Index Writer class uses several parameters to control index structure. Lucene Domain Index pass to Index Writer several parameters such as MergeFactor, MaxBufferedDocs among others.
As best practice if you want to index thousands of rows you can override default Lucene parameters for other which speed up indexing time. With create index or alter index rebuild you can set MergeFactor to 100 and MaxBufferedDocs to 4000.
This parameters increase index performance but then DML operations at the base table will batch small set of rows, so after DDL commands change MergeFactor to 2 and MaxBufferedDocs to 100. A good place to start knowing these parameters behavior is the Wiki page Improving Indexing Speed.

4.2.2 Auto Tune Memory functionality

Lucene Domain Index have a parameter called AutoTuneMemory a true value means that for Index Writer operations it will try to use up to 90% of the Java Pool Size configured at the Oracle SGA to adjust how many documents are buffered (MaxBufferedDocs) before call IndexWritter.flush().
With AutoTuneMemory:true MaxBufferedDocs its not required, its calculated using free ram at the SGA, but you has to set MergeFactor.
Due Java Pool Size is global parameter the rule is not valid if you want to create many index with parallel connexions, two connections will try to use 90% of the SGA, so one of them will ran out of memory.

4.2.3 Keep Index on RAM

OJVMDirectory replaces Lucene file system storage by a table storage with BLOBs. For every Lucene Domain Index created there is a new table which stores every Lucene file as a row with a BLOB column, see section 6 for more detail, using similar strategy as Oracle Text you can keep this table in RAM. Unlike Oracle Text which uses multiples tables for storing the inverted index, Lucene Domain Index use one table, execute this DDL command to keep Lucene Index on RAM:
create index source_small_lidx on test_source_small(text)
indextype is lucene.LuceneIndex
parameters('FormatCols:line(0000);ExtraCols:line "line";Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:100');

alter index source_small_lidx parameters('MergeFactor:2');

alter table source_small_lidx$t storage (buffer_pool keep) modify lob (data) (storage (buffer_pool keep));
During Index creation use AutoTuneMemory:true (default value) and a MergeFactor high because many rows will be indexed at this time. Then change MergeFactor to 2 to work better after each DML/sync operation. Finally change OJVMDirectory storage table and LOB to keep them in RAM.
Be sure that your SGA has a enough RAM to keep it. To know how big your index you can query the table:
SQL> select sum(file_size) from source_small_lidx$t where deleted='N';

SUM(FILE_SIZE)
--------------
        147444
Finally as Tom Kyte say, tkprof, tkprof, .... ;)
You can see Lucene Domain Index IO operations with an "alter session set events '10046 trace name context forever, level 12'; then you can find operations at Lucene Domain Index table SCHEMA.IDX_NAME$T. Using TKPROF information you can alter table and lob storage parameters manually.

4.2.4 Compare your execution plan

To be sure that your Lucene Domain Index is properly used compare your executions plans and try to avoid non necessary filter by or sort order by predicates by using in-line sort or multiples field Query Parser conditions.
Here examples of sorting using emails table created in section 3.1.4:
SQL> explain plan for
  2  SELECT subject FROM emails where lcontains(bodytext,'security',1)>0
  3  order by subject ASC;

Explained.

Elapsed: 00:00:00.58
SQL> set echo off

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------
Plan hash value: 1542204867
Id
Operation
Name
Rows
Bytes
Cost (%CPU) Time
0
SELECT STATEMENT
1
4016 3 (34) 00:00:01
1
 SORT ORDER BY
1
4016 3 (34) 00:00:01
2
  TABLE ACCESS BY INDEX ROWID EMAILS 1
4016 2  (0) 00:00:01
* 3
   DOMAIN INDEX EMAILBODYTEXT



Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security',1)>0)

Above execution plan tells that you are using Lucene Domain Index but you can get a better optimizer plan by using lcontains sort:
SQL> explain plan for
  2  SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  3  where lcontains(bodytext,'security','subject:ASC',1)>0;

Explained.

Elapsed: 00:00:00.01
SQL> set echo off

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------
Plan hash value: 1450245214
Id
Operation
Name
Rows
Bytes
Cost (%CPU) Time
0
SELECT STATEMENT
1
4016 2 (0) 00:00:01
1
 TABLE ACCESS BY INDEX ROWID EMAILS 1
4016 2 (0) 00:00:01
* 2
   DOMAIN INDEX EMAILBODYTEXT



Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security','subject:ASC',1)>0)

Here we have a better optimizer plan and lower cost.

5 Know caveats

  1. Lucene Domain Index uses Java Util Logging API it means that a grant is required to create and operate any index:
    dbms_java.grant_permission( 'USER_NAME', 'SYS:java.util.logging.LoggingPermission', 'control', '' )
  2. SyncMode:OnLine should be reserved only for index which a number of update/insert/delete operation are too small compared to select operations, because each message process requires almost open an IndexWriter/IndexReader on the associated Lucene Index by a background process, except for bulk collect operation or "insert into ... select ... from" which are processed in batch off 150 rows. Tables with many insert/update operations by seconds should use LuceneDomainIndex.sync(idx) procedure called by DBMS_JOB periodically or by the application.
  3. Syntax for Inline pagination is only supported at the beginning of the Query, it means that if you want to perform pagination lcontains() query syntax must start with "rownum:[n TO m] AND" note that this syntax is case sensitive. Also this extraction is performed by splitting the query by position and does not take into account grouping operator, so this query "rownum:[1 TO 10] AND word1 OR word2" will be passed to Lucene's Query Parser as "word1 OR word2" which is not semantically the original one if you look precedence operator. We can try to modify Query Parser class in a future to solve this semantic issues.
  4. Since October 25 column name are case sensitive in ExtraCols and FormatCols parameters using traditional SQL behavior, it means that for this DDL index creation:
    create index it1 on t1(f2) indextype is lucene.LuceneIndex
    parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');
    You can use ExtraCols with f3 or F3 but FormatCols should be F3 because f3 is returned by the SQL select operation as F3 during the table full scan, also Lucene Index will have a document with a Field F3 instead of f3. If you want to use f3 as is you can re-write DDL index creation with:
    create index it1 on t1(f2) indextype is lucene.LuceneIndex
    parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),f3(00.00);ExtraCols:F3 "f3"');
    With this sentence Lucene will create documents with two field F2 and f3, F2 is uppercase because is the master column of the index and his passed as "F2" by ODCI API but, due is the default Field of the query, you can omit his name at lcontains syntax, F3 now is lowercase and will be indexed as a Field "f3".
  5. Since November Index parameters are pre-cached in memory for faster response. Due isolation behaviour of Oracle JVM sessions, if you call to alter index or re-create a new one in another session you need to close all SQL session that are already pre-load an index parameter storage.
    Calling to LuceneDomainIndex.getParameter('owner.index_name','parameter_name') you can see the values of any parameter passed to the ODCI API either by calling create index or alter index.
    Otherwise you can call to LuceneDomainIndex.refreshParameterCache stored procedure.
  6. If you re-install Lucene Domain Index without previous deleting existing indexes you can manually drop resources associated to and old index. For example:
    SQL> drop index source_big_lidx force;
    Index dropped.
    SQL> select table_name from tabs;

    TABLE_NAME
    ------------------------------
    DEPT
    EMP
    BONUS
    SALGRADE
    SOURCE_BIG_LIDX$QT
    DR$SOURCE_BIG_IDX$I
    DR$SOURCE_BIG_IDX$R
    SOURCE_BIG_LIDX$T
    TEST_SOURCE_BIG
    DR$SOURCE_BIG_IDX$N
    DR$SOURCE_BIG_IDX$K

    11 rows selected.
    SQL> drop table SOURCE_BIG_LIDX$T;

    Table dropped.
    SQL> conn / as sysdba
    connected.
    SQL>exec DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q')
    BEGIN DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q'); END;

    *
    ERROR at line 1:
    ORA-01403: no data found
    ORA-06512: at "SYS.DBMS_AQADM_SYS", line 3359
    ORA-06512: at "SYS.DBMS_AQADM", line 167
    ORA-06512: at line 1

    SQL> exec DBMS_AQADM.DROP_QUEUE_TABLE(queue_table  => 'SCOTT.SOURCE_BIG_LIDX$QT', force=>true);

    PL/SQL procedure successfully completed.

    SQL> exit
    Note that "drop index ... force" will de-register Lucene Domain Index from Oracle's system views, then Lucene Domain Index storage's table is manually dropped, finally connected as SYS Lucene Domain Index AQ's table is dropped.
  7. Oracle 11g have a know bug "6445561 - ORA-00600 [26599] [62] DUE TO INCORRECT PERSISTENCE OF BY INVOKER PIN" please apply patch number p6445561_111060_LINUX.zip available at Metalink, this bug affects select count(*) with a large results.

Appendixes

A. Parameter reference and syntax

Lucene Domain Index accept several parameters which can be passed using create index or alter index DDL commands. This parameters are divided into four categories, Index Writer, Analyzer, User Data Store and General parameters.

A.1 Lucene Index Writer parameters

This section covers Lucene Index Writer parameters for more information about this parameter see Lucene docs and Wiki.

A.1.1 MergeFactor

Determines how often segment indices are merged by addDocument(). If you are creating a new index over a table with thousands of rows a value of 100 to 500 is good value.

A.1.2 MaxBufferedDocs

Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. This value can cause an out of memory exception you provide a value larger than user space available. A typical SGA configuration can accept values of 4000 or 5000 depending how big are your rows being indexed. If you are not sure of how megabytes can consume your rows you can use AutoTuneMemory:true parameter which is a default value, so you choose true MaxBufferedDocs will be ignored and Lucene Domain Index will try to uso 90% of Oracle Java Pool Size value.

A.1.3  MaxMergeDocs

Determines the largest number of documents ever merged by addDocument().

A.1.4 MaxBufferedDeleteTerms

Determines the minimal number of delete terms required before the buffered in-memory delete terms are applied and flushed.

A.1.5 UseCompoundFile

Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. This is done regardless of what directory is in use. By default Lucene Domain Index do not use compound file format because its not affected by max open file descriptors.

A.2 Analyzer parameters

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.

Analyzer, PerFieldAnalyzer or Stemmer parameter affects indexing and query expressions, so if you want to change this parameter on a exists index you to must rebuild it, the priority of these three parameters is first check for the Stemmer if its not present check for PerFieldAnalyzer if its not present checks for Analyzer parameter, finally if none of them are defined will use SimpleAnalyzer.

A.2.1 Analyzer

This parameter is fully qualified Java class name which extends org.apache.lucene.analysis.Analyzer. For example:

  • BrazilianAnalyzer
  • ChineseAnalyzer
  • CJKAnalyzer
  • CzechAnalyzer
  • DutchAnalyzer
  • FrenchAnalyzer
  • GermanAnalyzer
  • GreekAnalyzer
  • KeywordAnalyzer
  • PatternAnalyzer
  • RussianAnalyzer
  • SimpleAnalyzer
  • StandardAnalyzer
  • StopAnalyzer
  • ThaiAnalyzer
  • WhitespaceAnalyzer
See Lucene Java Docs for more details. A default analyzer is SimpleAnalyzer.

A.2.2 Stemmer

Stemmer is another kind of analyzer which divides words, stop words and another term related object based on an specific language. Stemmer parameter use Snowball Analyzer, possible values for Stemmer parameter using Lucene 2.2.0 distribution are:
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • German2
  • Italian
  • Kp
  • Lovins
  • Norwegian
  • Porter
  • Portuguese
  • Russian
  • Spanish
  • Swedish
Stemmer parameter override Analyzer parameter.

A.2.3 PerFieldAnalyzer

PerFieldAnalyzer is a wrapper of other analyzers which provides an independent analyzer for each column being indexed, see PerFieldAnalyzerWrapper class in Lucene documentation. Each column could have his own analyzer which extends org.apache.lucene.analysis.Analyzer. If a column is not in the list StandardAnalyzer will be used as default. For example:
create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));

create index it1 on t1(f2) indextype is lucene.LuceneIndex
   parameters('IncludeMasterColumn:false;
   ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";
   FormatCols:F1(000),id(00)');

alter index it1 rebuild parameters('PerFieldAnalyzer:F1(org.apache.lucene.analysis.KeywordAnalyzer),id(org.apache.lucene.analysis.KeywordAnalyzer)');
In the above example four columns are being indexed by Lucene Domain Index rowid (added by default) using KeywordAnalyzer, F1 and id (added by ExtraCols parameter) using KeywordAnalyzer too, and finally name which is not included into PerFieldParameter and then using StandardAnalyzer.


A.3 User Data Store parameters

Lucene Domain Index implements a User Data Store functionality, this functionality provides many parameters to control which column will be included into a Lucene Document which is inserted into the index.
and First three parameters are used to choose which columns will added to the index in addition to the master column. Oracle Domain Index are bound to a single column, this is a limitation with Oracle 10g version. To avoid this problem passing ExtraCols, ExtraTabsWhereCondition you can easily build a set of new column from the master table and others. Basically a select DML statement is built using these parameters. To clarify this Lucene Domain Index will performs a query like:
Full table scan (create index statement):
SELECT rowid,MasterTable.MasterColumn,ExtraCols FROM MasterTable,ExtraTabs where WhereCondition;

Find a particular rowid (insert,update operations):
SELECT MasterTable.MasterColumn,ExtraCols FROM MasterTable,ExtraTabs where MasterTable.rowid=:rowid AND WhereCondition;

Text in italic are injected by Lucene Domain Index and text in bold are user defined.

A.3.1 ExtraCols

A coma separated list of columns of the Master table of table being indexed or the tables defined into ExtraTabs parameter. Note that  if you don't define columns alias column name are capitalized by default on Oracle databases. For example 'ExtraCols:F2 "f2",T2.F3 "f3"' note that you can omit master table name if there is no collisions

A.3.2 ExtraTabs

A coma separated list of table name and alias for this tables. For example 'ExtraTabs:T2 aliasT2,T3 aliasT3'. Note that ODCI API only will detect changes at index master column, to notify changes based on ExtraCols list you need to attach triggers, see section examples above for more detail.

A.3.3 WhereCondition

An SQL where condition used to join index's master table with ExtraTabs tables. For example: 'WhereCondition:T1.f1=T2.f2(+) AND T1.F1=aliasT3.f3'. Be careful to produce a correct join condition to guaranty single row result; multiple or zero row result based on the master table values are not allowed.
Note: Up to Lucene Domain Index 2.9.0, if you use a WhereCondition which have an OR operator put this where condition enclosed with () because the precedence of the OR over the AND operator makes that some queries returns more rows that the correct behavior, for example instead of:
WhereCondition:T1.F1='AA' OR T1.F1='BB'
put:
WhereCondition:(T1.F1='AA' OR T1.F1='BB')
this workaround fix some problems when working in OnLine mode. Starting with 2.9.1 version this extra () are not required.

A.3.4 UserDataStore

This is a fully Java Class name which implements org.apache.lucene.indexer.UserDataStore interface, you can create your own Data Store class implementing this interface. By default Lucene Domain Index provides an implementation which covers most of the typical scenarios, this class is org.apache.lucene.indexer.DefaultUserDataStore and use FormatCols parameter to create Lucene Fields.

A.3.2 FormatCols

A coma separated list of column(format) strings interpreted by User Data Store class to control how an specific database column will be transformed in a Lucene Field. For example you can choose padding, un-tokenized values and so on.
Supported formats by Default Data Store class are:
  • Number padding for numeric columns using java.text.DecimalFormat class syntax, default is 0000000000.
  • Date rounding for timestamp and date columns using org.apache.lucene.document.DateTools, default is day.
  • Character left padding for VARCHAR2 or CHAR columns using org.apache.lucene.util.StringUtils class (leftPad method), default is no left char padding. Any char can be used for left padding.
  • XPath expression for XMLType columns, this XPath string will be passed to XMLType.extract("format","") method, the result of the XPath extraction will be a new XMLType object over getStringVal() will executed. If you want to perform more user defined XMLType to Field extraction extend DefaultUserDataStore class or use virtual column indexing.
  • For columns of type VARCHAR2 or CHAR you can use an special string NOT_ANALYZED or NOT_ANALYZED_STORED as format which tell to Default User Data Store class that this column will be indexed but un-tokenized, this is useful with columns which will be used for sorting.

A.4 General parameters

This set of parameters are Lucene Domain Index specific parameters.

A.4.1 SyncMode

SyncMode tells to Lucene Domain Index which strategy is used to update the index. SyncMode:Deferred (default) left to the application when the index is synced either by calling LuceneDomainIndex.sync procedure after a set of changes pending or by DBMS_SCHEDULER process at an specific time. With SyncMode:Deferred update and insert operations are queued using DBMS_AQ package. Delete operations are never enqueued because require an update on Lucene Index to not return rowid of deleted rows.
SyncMode:OnLine is implemented by using DBMS_AQ PLSQL callback, so immediately after a commit operation which involves insert or update rows a parallel process dbms_j* is automatically started by DBMS_AQ package to applied pending changes. SyncMode:OnLine should be reserved for index which update, insert or delete operations are much lower than select, AQ callbacks can not handle very well exceptions during sync time, for example when a row being index is locked by another session, so some changes can be lost with this scenario.

A.4.2 AutoTuneMemory

AutoTuneMemory:true (default) overrides MaxBufferedDocs parameter, it defines dynamically MaxBufferedDocs based on how much memory is reported by OracleRuntime.getJavaPoolSize() method.
After each document is added to the index it calls to writer.ramSizeInBytes() and test that is not over a 90% of the ram free.
This parameter works in most of the common cases, but you can get a Java out of memory error in multiuser environments because Java Pool Size is common parameter for all the sessions. If you get an exception during index creation time set AutoTuneMemory:false and adjust MaxBufferedDocs to a value which not raise an out of memory exception.

A.4.3 LobStorageParameters

Lucene Domain Index uses a BLOB column named "data" for storing Lucene Inverted index files. You can control any LOB storage parameter with this parameter during index creation time, his default value is 'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CACHE READS NOLOGGING' for 11g databases you can use a better optimize storage by using newest Secure LOB parameter, for example: 'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING'

A.4.4 LogLevel

Lucene Domain Index uses JDK Java Util Logging package, LogLevel parameter is any of the string defined by Level.parse() method, for example: LogLevel:ALL. By default logging level is defined to WARNING.
Lucene Domain Index uses:
  • SEVERE for non recoverable error conditions
  • FINER for debugging purpose such as ODCI API arguments
  • INFO for checking index operations such as value being indexed
  • WARNING for error messages which are reported as ERROR through ODCI API
  • CONFIG to see user parameters changed by users
Logging information is sent by default to Oracle .trc files, but you can redirect this output using dbms_java.set_output procedure for example.
If you are not sure which field and how these fields are added to the index change LogLevel to INFO and check for lines starting with: "INFO: Document<"
exiting and throwing methods does not print messages also with log level defined to ALL. This is because logging level used by these methods are controlled by ConsoleHandler level.
To get these methods work copy logging.properties file from your JAVA_HOME/jre/lib to ORACLE_HOME/javavm/lib directory and edit the line which includes level property:
    # Limit the message that are printed on the console to INFO and above.
    java.util.logging.ConsoleHandler.level = ALL
    java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
Then shutdown and startup your Oracle database.

A.4.5 CachedRowIdSize

CachedRowIdSize is used by an LRU cached used to maintain the association between Lucene Doc ID and a particular Oracle ROWID. For very big table using an array to store this association can consume a lot of SGA RAM, starting with Lucene Domain Index 2.9.0.1.0 only 10.000 ROWID are stored in this cache, tables with high frequency of updates can use this LRU small due every caused that LRU is completed flushed, but tables with low frequency of updates/deletes can get a lot of performance improvement by using larger LRU cached size.

A.5 Query parameters

This set of parameters which affects QueryParser and search functionality.

A.5.1 DefaultColumn

DefaultColumn defines which columns is used as default column in QueryParser syntax, if this parameter is not set master column of the index is used, this name is a Lucene Field name. Here an example:
create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";FormatCols:revisionDate(day);IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

Note the correlation between DefaultColumn and ExtraCols. ExtraCols defines a Lucene Field named "text" with a value calculated by the SQL expression extract(object_value,''/page/revision/text/text()''), then you can use a Lucene Field text as default Field in QueryParser syntax.

A.5.2 DefaultOperator

DefaultOperator defines which Boolean operator is used in QueryParser syntax, if this parameter is not set OR operator is his default value.

A.5.3 NormalizeScore

NormalizeScore is used during Lucene Index scan to know if they need to track the maximum score, the maximum score then used to normalize the result of lscore() operator to return only values between 0 to 1. If you don't need a normalized range of the score you can avoid this computation and your query will be fast. Note that a not normalized score not implied that the document are not in order of relevance.

A.5.4 PreserveDocIdOrder

PreserveDocIdOrder is an internal parameter which is used by Lucene in some kind of operator, if you don't need that result preserve Lucene Doc ID in order rather than the relevance, you can put this value to false (default) and some operator will be fast.

A.6 Highlight parameters

This set of parameters which affects lhighlight, phighlight and rhighlight functionality.

A.6.1 Formatter

Formatter defines a valid class name which implements Lucene Interface Formatter and with a constructor with no arguments, default value org.apache.lucene.search.highlight.SimpleHTMLFormatter.

A.6.2 MaxNumFragmentsRequired

MaxNumFragmentsRequired defines a number of text fragments returned by Highlight function, default value is 4.

A.6.3 FragmentSize

FragmentSize defines the size of each fragment returned in characters of each fragment, default value is 100.

A.6.4 FragmentSeparator

FragmentSeparator defines a String used as fragment separator, default value is "...". Note that you can not use ";" or ":" as fragment separator because are used as parameter and value delimiters into alter index ... parameters(..) DDL statement.

B Lucene Domain Index Storage

OJVMDirectory class creates a set of Oracle objects to represent Lucene Inverted Index and Domain Index functionality. First it creates a table named IDX_NAME$T (IDX_NAME is your Lucene Domain Index used at create index DDL statement) with this structure:
Name
Null?
Type
NAME
NOT NULL
VARCHAR2(30)
LAST_MODIFIED

TIMESTAMP(6)
FILE_SIZE

NUMBER(38)
DATA

BLOB
DELETED

CHAR(1)

Also have and index based on IDX_NAME$T.DELETED column to speedy up purge operations.
To enqueue operation at the index it defines a DBMS_AQ Queue IDX_NAME$Q with his storage table IDX_NAME$QT.
IDX_NAME$Q queue have payload defined as LUCENE_MSG_TYP object. This object type is defined as:
Name
Null?
Type
RIDLIST

SYS.ODCIRIDLIST
OPERATION

VARCHAR2(32)

SYS.ODCIRIDLIST is an special structure defined by ODCI API to hold a list of rowid changed by an DML operation. OPERATION is one of insert, delete, update, rebuild or optimize reserved keyword. rebuild and optimize operations are used with SyncMode:OnLine to perform these tasks automatically using a background process.

C JUnit test suites explained

C.1 DBTestCase base class

This is base class for most of the test suites includes.
It provides a connection pool using OracleDataSource with a minimum of two ready to use connection and growing to 5, after this it will wait up to 20 seconds for free connection. This connection pool is created at the class constructor.
Utility methods provided by this class, each method use is own SQLConnection, so they are autonomous transactions:
  • createTable(), create a test table as follow, (T1 is a constant value defined as TABLE):
create table T1 (
  f1 number primary key,
  f2 varchar2(200),
  f3 varchar2(200),
  f4 number)

  • dropTable(), drop table created above.
  • createIndex(), add a Lucene Domain Index to previous one created table as follow, (LogLevel,Analyzer,MergeFactor,ExtraCols and FormatCols are customizable at class level, after index creation MergeFactor is reduced to 2):
create index IT1 on T1(f2) indextype is lucene.LuceneIndex
  parameters('LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;ExtraCols:F1;FormatCols:F1(0000)')

  • dropIndex(), drop previous one index.
  • int insertRows(int startIndex, int endIndex), insert a set of rows at above table with F1 column varying from startIndex to endIndex. F2 column is an english text representation of F1, F4 is F1*10 and F3 is an english text representation of F1*10.
    Return a number of rows inserted. If there are problems such as primary key violation it rollback the transaction.
  • int deleteRows(int startIndex, int endIndex), delete a set of rows where F1 between startIndex and endIndex.
    Return a number of rows deleted. If there are problems rollback the transaction.
    Note that deleting rows automatically update Lucene Index.
  • int updateRows(int startIndex, int endIndex), update F2 column with his own value to fire ODCI update method on each row between startIndex and endIndex.
    Return a number of rows updated.
  • findRows(int n), find rows which F2 match again a text representation of n using lcontains operator. It only test for a result having 0 or more rows.
  • long syncIndex(), perform a sync operation at Lucene Domain Index applying pending changes (inserts, updates). If there are errors, usually caused by another transaction having an exclusive lock in a row being indexed, it rollback the operation. Next successful sync will apply pending changes of failed operations.
    Return a long value with the amount of milliseconds spent during sync.
  • long optimizeIndex(), perform an optimize operation at Lucene Domain Index merging segments in a new one. If there are errors, usually caused by another transaction having an exclusive lock on the index, it rollback the operation.
    Return a long value with the amount of milliseconds spent during optimize.

C.2 TestDBIndex

Simple test which create a table his index and performs insertions, sync, optimize and deletions, finally drop index and table. His output look like:
    [junit] Testsuite: org.apache.lucene.index.TestDBIndex
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.836 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 40 total char inserted: 415 avg text length: 10
    [junit] Index synced: IT1 elapsed time: 265 ms.
    [junit] Avg Sync time: 6
    [junit] Index optimized: IT1 elapsed time: 40 ms.
    [junit] Avg Optimize time: 1
    [junit] Row deleted 40, from: 10 to: 49 elapsed time: 1303 ms. Avg time: 32 ms.
    [junit] Index droped: IT1
    [junit] Table droped: T1

C.3 TestDBIndexAddDoc

Performs several insertions and sync, starting with 10 rows, then 90 and so on, ending with 3.000 insertions using insertRow method of DBTestCase base class. After each batch of insertions calls to syncIndex method calculating average time of sync method for each row inserted. His output look like:
    [junit] Testsuite: org.apache.lucene.index.TestDBIndexAddDoc
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 64.696 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Index synced: IT1 elapsed time: 126 ms.
    [junit] Inserted rows: 10 total char inserted: 49 avg text length: 4
    [junit] Index synced: IT1 elapsed time: 142 ms.
    [junit] Avg Sync time: 14
    [junit] Inserted rows: 90 total char inserted: 988 avg text length: 10
    [junit] Index synced: IT1 elapsed time: 374 ms.
    [junit] Avg Sync time: 4
    [junit] Inserted rows: 400 total char inserted: 9201 avg text length: 23
    [junit] Index synced: IT1 elapsed time: 1276 ms.
    [junit] Avg Sync time: 3
    [junit] Inserted rows: 500 total char inserted: 11726 avg text length: 23
    [junit] Index synced: IT1 elapsed time: 1601 ms.
    [junit] Avg Sync time: 3
    [junit] Inserted rows: 1000 total char inserted: 35950 avg text length: 35
    [junit] Index synced: IT1 elapsed time: 4675 ms.
    [junit] Avg Sync time: 4
    [junit] Inserted rows: 3000 total char inserted: 110851 avg text length: 36
    [junit] Index synced: IT1 elapsed time: 25480 ms.
    [junit] Avg Sync time: 8
    [junit] Index droped: IT1
    [junit] Table droped: T1

C.4 TestDBIndexDelDoc

At setup method this test case a create a table and fill it with 500 rows. Then performs deletions batch of 10, 90 and 400 rows each calculating average time for each row deleted. His output look like:
    [junit] Testsuite: org.apache.lucene.index.TestDBIndexDelDoc
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 20.543 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 500 total char inserted: 10238 avg text length: 20
    [junit] Index synced: IT1 elapsed time: 1643 ms.
    [junit] Row deleted 10, from: 1 to: 10 elapsed time: 356 ms. Avg time: 35 ms.
    [junit] Row deleted 90, from: 11 to: 100 elapsed time: 2535 ms. Avg time: 28 ms.
    [junit] Row deleted 400, from: 101 to: 500 elapsed time: 11526 ms. Avg time: 28 ms.
    [junit] Index droped: IT1
    [junit] Table droped: T1

C.5 TestDBIndexParallel

This is more complex test case to check concurrent access to Lucene Domain Index. To do this creates several threads, some for simulating batch insertions of 10 rows, others for simulating batch deletions of 10 rows, another for simulating batch updates of 10 rows and finally many threads searching for rows each 0.5 seconds.
By default creates 3 threads for each kind of operations and each thread perform:
  • 20 inserts
  • 5 deletes
  • 5 update
  • 100 search
Each thread takes his own connection from the connection pool and do his job, if fastSync constant is true after each successful insert and update it calls to syncIndex method to update Lucene Index, if fastSync is false another thread is started performing sync index each 1 second.
It end when all threads (inserts, deletes, updates) finish.
Here some part of his output:
    [junit] Testsuite: org.apache.lucene.index.TestDBIndexParallel
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 97.7 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] FastSync: true
    [junit] Deleter 1 deleting at block 70
    [junit] Updater 1 updating at block 70
    [junit] Inserter 2 inserting at block 90
    [junit] No Row deleted at: 70 to: 79 elapsed time: 131 ms.
    [junit] No Row updated at: 70 to: 79 elapsed time: 12 ms.
    [junit] Searcher 2 searching row 30
    [junit] Searcher 1 searching row 77
    [junit] Not Found rows with: thirty  elapsed time: 211 ms.
    [junit] Not Found rows with: seventy-seven  elapsed time: 170 ms.
    [junit] Inserted rows: 10 total char inserted: 115 avg text length: 11
    [junit] Searcher 2 searching row 62
    [junit] Searcher 0 searching row 63
    [junit] Searcher 1 searching row 49
    [junit] Not Found rows with: sixty-two  elapsed time: 64 ms.
    [junit] Index synced: IT1 elapsed time: 283 ms.
    [junit] Not Found rows with: sixty-three  elapsed time: 215 ms.
    [junit] Searcher 2 searching row 74
    [junit] Not Found rows with: seventy-four  elapsed time: 39 ms.
    [junit] Not Found rows with: forty-nine  elapsed time: 137 ms.
    [junit] Searcher 1 searching row 95
    [junit] Searcher 2 searching row 46
    [junit] Found rows with: ninety-five  elapsed time: 103 ms.
....
    [junit] Updater 2 updating at block 20
    [junit] No Row updated at: 20 to: 29 elapsed time: 3 ms.
    [junit] Inserted rows: 10 total char inserted: 80 avg text length: 8
    [junit] Searcher 0 searching row 97
    [junit] Found rows with: ninety-seven  elapsed time: 60 ms.
    [junit] Index synced: IT1 elapsed time: 147 ms.
.....
    [junit] Searcher 2 searching row 39
    [junit] Searcher 1 searching row 84
    [junit] Not Found rows with: thirty-nine  elapsed time: 33 ms.
    [junit] Not Found rows with: eighty-four  elapsed time: 38 ms.
    [junit] Updater 0 updating at block 90
    [junit] Row updated 10, from: 90 to: 99 elapsed time: 16 ms. Avg time: 1 ms.
    [junit] Index synced: IT1 elapsed time: 162 ms.
......
    [junit] Inserted rows: 10 total char inserted: 125 avg text length: 12
    [junit] Searcher 0 searching row 57
    [junit] Searcher 1 searching row 28
    [junit] Deleter 1 deleting at block 80
    [junit] Searcher 2 searching row 64
    [junit] No Row deleted at: 80 to: 89 elapsed time: 58 ms.
    [junit] Not Found rows with: twenty-eight  elapsed time: 112 ms.
    [junit] Not Found rows with: fifty-seven  elapsed time: 155 ms.
    [junit] Index synced: IT1 elapsed time: 242 ms.
    [junit] Searcher 0 searching row 98
    [junit] Found rows with: ninety-eight  elapsed time: 72 ms.
    [junit] Not Found rows with: sixty-four  elapsed time: 175 ms.
    [junit] Searcher 0 searching row 27
    [junit] Not Found rows with: twenty-seven  elapsed time: 75 ms.
    [junit] Searcher 1 searching row 5
    [junit] Deleter 2 deleting at block 50
    [junit] Searcher 2 searching row 84
    [junit] Not Found rows with: eighty-four  elapsed time: 20 ms.
    [junit] Updater 2 updating at block 10
    [junit] No Row deleted at: 50 to: 59 elapsed time: 28 ms.
    [junit] Row updated 10, from: 10 to: 19 elapsed time: 36 ms. Avg time: 3 ms.
    [junit] Found rows with: five  elapsed time: 216 ms.
.................
    [junit] Inserter 1 inserting at block 50
    [junit] Found rows at: 50 position, ignoring insertions
    [junit] Index droped: IT1
    [junit] Table droped: T1

C.6 TestDBIndexSearchDoc

This test check some special features of lcontains operator such as in-line pagination, sort by and filter by expressions.

First create a table with 200 rows and then query them, his output look like:

    [junit] Testsuite: org.apache.lucene.index.TestDBIndexSearchDoc
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 14.001 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
    [junit] Index synced: IT1 elapsed time: 746 ms.
    [junit] testFilterAll()
    [junit] Excecution time: 129 ms.
    [junit] 120 Score: 0.9606395 str: one hundred twenty
    [junit] 119 Score: 0.25453204 str: one hundred nineteen
    [junit] 118 Score: 0.25453204 str: one hundred eighteen
    [junit] 117 Score: 0.25453204 str: one hundred seventeen
    [junit] 116 Score: 0.25453204 str: one hundred sixteen
    [junit] 115 Score: 0.25453204 str: one hundred fifteen
    [junit] 114 Score: 0.25453204 str: one hundred fourteen
    [junit] 113 Score: 0.25453204 str: one hundred thirteen
    [junit] 112 Score: 0.25453204 str: one hundred twelve
    [junit] 111 Score: 0.25453204 str: one hundred eleven
    [junit] Index droped: IT1
    [junit] Table droped: T1
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
    [junit] Index synced: IT1 elapsed time: 721 ms.
    [junit] testFilterBy()
    [junit] Excecution time: 162 ms.
    [junit] 103 Score: 1.0 str: one hundred three
    [junit] 120 Score: 0.9606395 str: one hundred twenty
    [junit] 101 Score: 0.28600293 str: one hundred one
    [junit] 100 Score: 0.27352643 str: one hundred
....
    [junit] 115 Score: 0.25453204 str: one hundred fifteen
    [junit] 116 Score: 0.25453204 str: one hundred sixteen
    [junit] Index droped: IT1
    [junit] Table droped: T1
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
    [junit] Index synced: IT1 elapsed time: 751 ms.
    [junit] testFilterByOrderBy()
    [junit] Excecution time: 138 ms.
    [junit] 120 Score: 0.9606395 str: one hundred twenty
    [junit] 119 Score: 0.25453204 str: one hundred nineteen
....
    [junit] 103 Score: 1.0 str: one hundred three
    [junit] 102 Score: 0.25453204 str: one hundred two
    [junit] 101 Score: 0.28600293 str: one hundred one
    [junit] 100 Score: 0.27352643 str: one hundred
    [junit] Index droped: IT1
    [junit] Table droped: T1
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
    [junit] Index synced: IT1 elapsed time: 761 ms.
    [junit] testPagination()
    [junit] Excecution time: 193 ms.
    [junit] 117 Score: 0.03489425 str: one hundred seventeen
    [junit] 118 Score: 0.03489425 str: one hundred eighteen
....
    [junit] 132 Score: 0.03489425 str: one hundred thirty-two
    [junit] 134 Score: 0.03489425 str: one hundred thirty-four
    [junit] Index droped: IT1
    [junit] Table droped: T1
    [junit] Table created: T1
    [junit] Index created: IT1
    [junit] Index altered: IT1
    [junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
    [junit] Index synced: IT1 elapsed time: 743 ms.
&nbsp;   [junit] testCountHits()
    [junit] Excecution time: 53 ms.
    [junit] Hits: 126
    [junit] Index droped: IT1
    [junit] Table droped: T1

C.7 TestQueryHits

This test is not autonomous because requires an additional step to run. Before run it create a table and his Lucene Index with:
create table test_source_big as (select * from all_source);
create index source_big_lidx on test_source_big(text)
         indextype is lucene.LuceneIndex
         parameters('AutoTuneMemory:true;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');
For 11g databases you can create a best optimize Lucene Index using some new Secure LOB features:
create index source_big_lidx on test_source_big(text)
         indextype is lucene.LuceneIndex
    parameters('FormatCols:line(0000);ExtraCols:line "line";Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

On 10g running it as SCOTT, TEST_SOURCE_BIG table will have 220731 rows using a typical installation based on database templates.
Using above table two test checks performance with a query which returns 18387 hits, once call to LuceneDomainIndex.countHits function and another iterate over the result in pages of ten rows, typical scenario of web applications. His output look like:
    [junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.656 sec
    [junit]
    [junit] ------------- Standard Output ---------------
    [junit] iteration from: 13775 to: 13785
    [junit] Step time: 791 ms.
    [junit] iteration from: 13785 to: 13795
    [junit] Step time: 49 ms.
    [junit] iteration from: 13795 to: 13805
    [junit] Step time: 40 ms.
    [junit] iteration from: 13805 to: 13815
    [junit] Step time: 44 ms.
    [junit] iteration from: 13815 to: 13825
    [junit] Step time: 40 ms.
    [junit] iteration from: 13825 to: 13835
    [junit] Step time: 42 ms.
    [junit] iteration from: 13835 to: 13845
    [junit] Step time: 41 ms.
    [junit] iteration from: 13845 to: 13855
    [junit] Step time: 50 ms.
    [junit] iteration from: 13855 to: 13865
    [junit] Step time: 41 ms.
    [junit] iteration from: 13865 to: 13875
    [junit] Step time: 41 ms.
    [junit] Elapsed time: 1877
    [junit] Hits: 18387
    [junit] Elapsed time: 564
Note that first iteration took more time because it includes parsing time and caching, also to simulate a real word web application an SQLConnection is take and returned to the pool on each iteration.


D Functions, operators and utilities

D.1 lcontains ancillary operator

lcontains operator is similar to Oracle Text score operator, but differs in query argument and support another one argument to define in-line sorting.
Syntax
LCONTAINS(
         [schema.]column,
         text_query VARCHAR2
         [,sort       VARCHAR2]
         [,label       NUMBER])
RETURN NUMBER;
[schema.]column
Specify the Lucene text column to be searched on. This column must have a Lucene Domain Index associated with it.
text_query
Specify a Lucene Query Parser syntax argument. In addition to Lucene Query Parser syntax, Lucene Domain Index support in-line pagination at lcontains, to do that this query must start with rownum[nn TO mm] AND where nn and mm are rownum values of the result query which will be returned, in Oracle syntax rownum start with 1, and this boundary are inclusive which means that for 20 to 30 we get 11 rows.
Follwing and excerpt of Lucene Query Parser Syntax.
Terms
A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.
A Single Term is a single word such as "test" or "hello".
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
Multiple terms can be combined together with Boolean operators to form a more complex query (see below).
Note: The analyzer used to create the index will be used on the terms and phrases in the query string. So it is important to choose an analyzer that will not interfere with the terms used in the query string.
Fields
Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. The field names and default field is implementation specific.
You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
As an example, let's assume a Lucene index contains two fields, title and text and text is the default field. If you want to find the document entitled "The Right Way" which contains the text "don't go this way", you can enter:
title:"The Right Way" AND text:go
or
title:"Do it right" AND right
Since text is the default field, the field indicator is not required.
Note: The field is only valid for the term that it directly precedes, so the query
title:Do it right
Will only find "Do" in the title field. It will find "it" and "right" in the default field (in this case the text field).
Term Modifiers
Lucene supports modifying query terms to provide a wide range of searching options.
Wildcard Searches
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:
te?t
Multiple character wildcard searches looks for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
test*
You can also use the wildcard searches in the middle of a term.
te*t
Note: You cannot use a * or ? symbol as the first character of a search.
Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched. For example:
roam~0.8
The default that is used if the parameter is not given is 0.5.
Proximity Searches
Lucene supports finding words are a within a specific distance away. To do a proximity search use the tilde, "~", symbol at the end of a Phrase. For example to search for a "apache" and "jakarta" within 10 words of each other in a document use the search:
"jakarta apache"~10
Range Searches
Range Queries allow one to match documents whose field(s) values are between the lower and upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically.
mod_date:[20020101 TO 20030101]
This will find documents whose mod_date fields have values between 20020101 and 20030101, inclusive. Note that Range Queries are not reserved for date fields. You could also use range queries with non-date fields:
title:{Aida TO Carmen}
This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.
Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
Boosting a Term
Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.
Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for
jakarta apache
and you want the term "jakarta" to be more relevant boost it using the ^ symbol along with the boost factor next to the term. You would type:
jakarta^4 apache
This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:
"jakarta apache"^4 "Apache Lucene"
By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2)
Boolean Operators
Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-" as Boolean operators(Note: Boolean operators must be ALL CAPS).
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exist in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.
To search for documents that contain either "jakarta apache" or just "jakarta" use the query:
"jakarta apache" jakarta
or
"jakarta apache" OR jakarta
AND
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND.
To search for documents that contain "jakarta apache" and "Apache Lucene" use the query:
"jakarta apache" AND "Apache Lucene"
+
The "+" or required operator requires that the term after the "+" symbol exist somewhere in a the field of a single document.
To search for documents that must contain "jakarta" and may contain "lucene" use the query:
+jakarta lucene
NOT
The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:
"jakarta apache" NOT "Apache Lucene"
Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:
NOT "jakarta apache"
-
The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:
"jakarta apache" -"Apache Lucene"
Grouping
Lucene supports using parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.
To search for either "jakarta" or "apache" and "website" use the query:
(jakarta OR apache) AND website
This eliminates any confusion and makes sure you that website must exist and either term jakarta or apache may exist.
Field Grouping
Lucene supports using parentheses to group multiple clauses to a single field.
To search for a title that contains both the word "return" and the phrase "pink panther" use the query:
title:(+return +"pink panther")
Escaping Special Characters
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape these character use the \ before the character. For example to search for (1+1):2 use the query:
\(1\+1\)\:2
sort
Sort string is with syntax sortField1[[:(ASC|DESC)]:[type]] for example revisionDate:DESC:string, ASC or DESC is optional as type which is either string, int or float. Multimples fields can be used for sorting, sort string must be separated by , for example revisionDate:DESC:string,title:ASC.
If you don't include sort argument at lcontains operator, a Lucene natural order which is score descending will be used. For any other field ASC is the default sort order.
label
Is an string used in conjuntion with lscore operator to identified which is the lcontains operators is used for each lscore.

D.2 lscore ancillary operator

Use the LSCORE operator in a SELECT statement to return the score values produced by a LCONTAINS query. The LSCORE operator can be used in a SELECT, ORDER BY, or GROUP BY clause.
Syntax
LSCORE(label NUMBER)
label
Specify a number to identify the score produced by the query. Use this number to identify the LCONTAINS clause which returns this score.
Example
SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
where lcontains(bodytext,'security',1)>0;

D.3 lhighlight ancillary operator

Use the LHIGHLIGHT operator in a select statement to return a highlighted version of the master column of the index associated to the LCONTAINS query. By now only highlighting functionality is supported for the master column of the index and the return value of this function is a VARCHAR2 data type with the text highlighted. VARCHAR2 limitation is not a big problem because highlighted text usually is an small part of the original text of the column showed to user as a preview of the original document.
Syntax
LHIGHLIGHT(label NUMBER):VARCHAR2
label
Specify a number to identify the score produced by the query. Use this number to identify the LCONTAINS clause which returns this score.
Example
SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

D.4 phighlight pipeline table function

PHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2 or CLOB of the input query. Columns not included into cols argument will not be affected and they will be returned as is.
Syntax
PHIGHLIGHT(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN VARCHAR2) RETURN ANYDATASET
index_name
Specify a Lucene Index to use.
qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the Lucene Domain Index extension for pagination.
cols
A coma separated list of columns to highlight, note that are capitalized if you not use columns alias.
stmt
Any SQL text of the query to execute by DBMS_SQL package. Remember to use double single quote to represent a SQL single quote inside the string. Columns returned by this query should be mapped as String, BigDecimal, Timestamp, CLOB, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Java types, it means for example that for table with a column VARCHAR2(40) the associated Java type inside the OJVM is String, then it can be highlighted or returned by this pipeline table function.
Example
SELECT * FROM
TABLE(phighlight(
        'EMAILBODYTEXT',
        'lucene OR mysql',
        'SUBJECT,BODYTEXT',
        'select lscore(1) sc,e.* from eMails e where lcontains(bodytext,''rownum:[1 TO 10] AND (security OR mysql)'',''subject:ASC'',1)>0'
    ));

D.5 rhighlight pipeline table function

RHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2 or CLOB of the input query. Columns not included into cols argument will not be affected and they will be returned as is. This is a variant of PHighlight which requires an additional argument (rType) telling to this function the type that will be returned. This version is free to any kind of SQL injection and can start several invocation in parallel by the RDBMS based on the information of the last argument.
Syntax
RHIGHLIGHT(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET
index_name
Specify a Lucene Index to use.
qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the Lucene Domain Index extension for pagination.
cols
A coma separated list of columns to highlight, note that are capitalized if you not use columns alias.
rType
A collection to be returned by RHighlight table function, usually is "colType TABLE OF aRowType".
rws
Any SQL query wrapped by the function CURSOR if you are using SQLPlus for example, or a JDBC ResultSet passed as setObject(n,rs), if you are using an application in Java. Columns returned by this query should be mapped as String, BigDecimal, Timestamp, CLOB, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Java types, it means for example that for table with a column VARCHAR2(40) the associated Java type inside the OJVM is String, then it can be highlighted or returned by this pipeline table function.
Example
CREATE TYPE EMAILR AS OBJECT
( sc NUMBER,
  emailFrom VARCHAR2(256),
  emailTo VARCHAR2(256),
  subject VARCHAR2(4000),
  emailDate DATE,
  bodyText CLOB);

CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;

SELECT * FROM
TABLE(rhighlight(
        'EMAILBODYTEXT',
        'lucene OR mysql',
        'SUBJECT,BODYTEXT',
        'EMAILRSET',
        CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
            from eMails e where lcontains(bodytext,'rownum:[1 TO 10] AND (security OR mysql)','subject:ASC',1)>0)
    ));

D.6 MoreLike.this function

MoreLike.this function have two declarations, once using index_name argument which uses current connected users and owner,index_name pair for using index in another database schema.
Syntax
FUNCTION this(index_name IN VARCHAR2,
                      x IN ROWID,
                      f IN NUMBER DEFAULT 1,
                      t IN NUMBER DEFAULT 10,
                      minTermFreq IN NUMBER DEFAULT 2,
                      minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist
index_name
Specify a Lucene Index to use.
x
ROWID used as pivot, it defines which row is used to extract the text with term used for More Like This Lucene functionality. DefaultColumn parameter of the index is used to define the column used to get the text, only columns of type VARCHAR2, CLOB or XMLType are supported.
f,t
From to pagination information, default values are 1 to 10.
minTermFreq,minDocFreq
minTermFreq is the frequency below which terms will be ignored in the source doc, minDocFreq is the frequency at which words will be ignored which do not occur in at least this many docs, default values are 2 to 5.
SYS.odciridlist
Is an array of ROWIDs which can be wrapped with a pipeline table function ridlist_table for selecting his values, for example (select * from table(ridlist_table(ridlist))).

FUNCTION this(owner IN VARCHAR2,
                      index_name IN VARCHAR2,
                      x IN ROWID,
                      f IN NUMBER DEFAULT 1,
                      t IN NUMBER DEFAULT 10,
                      minTermFreq IN NUMBER DEFAULT 2,
                      minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist

owner
Database schema owner of the index, for example SCOTT. Previous function fill this argument using SQL USER function which returns a current connected user.
All the arguments are similar to the previous version.

D.7 lfacets aggregate function

lfacets() aggregate function have one argument which is a coma separated list of the index name and the category and sub category to be queries.
Syntax
FUNCTION lfacets(input IN VARCHAR2) RETURN LUCENE.agg_tbl
input
A coma separated list including index name, category and optional a sub category. Category and sub category are in Lucene Query Parser syntax including the column name indexed (Lucene Field), for example text:(Ciencias naturales y formales), line:[1 TO 10] and so on. When category and sub category are present the ODCI API start the computation by calculating the bit set of the main category and then iterate over each sub category doing a bit and operation between the two bit set.
LUCENE.AGG_TBL
Is a TABLE OF agg_attributes and AGG_ATTRIBUTES is an object type with two field qryText VARCHAR2(4000) and hits NUMBER when a category and sub category is passed as argument, this return value will be a table with each row representing the cardinality of intersection between the category and the sub category, it means a table with a number of rows equal to the number of sub categories.
To help formatting the output in a traditional query there is function ljoin() which receives as input an agg_tbl type plus a char separator and returns an string with all the rows, here the syntax:
FUNCTION ljoin(i_tbl  in agg_tbl,
                       i_glue IN VARCHAR2 := ',') RETURN VARCHAR2
i_tbl
A LUCENE.agg_tbl table to scan.
i_glue
A VARCHAR2 string to use as separator, default value ",".

E Project Change Log

2.9.1.1.0 Maintenance Release based on Lucene 2.9 (2.9.1) core base

2.9.0.1.0 (initial release based on Lucene 2.9.0 core base, 29/Sep/09) 

 

2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09)

  • Do not store internal parameters into system's views and force to PopulateIndex:false
  • After every sync, now files marked as deleted are purged to free BLOB storage
  • Added lfacets aggregated function for doing facets
  • CountHits function no longer requires sort argument
  • Filter are stored/retrived only using QueryParser.toString() key
  • UN_TOKENIZED format string at DefaultUserDataStore class was replaced by NOT_ANALYZED or NOT_ANALYZED_STORED according to new Lucene definitions.
  • Fix bug when sync try to process more than 32767 rowids enqueued.
  • Added parameters for highlighting functions Formatter, MaxNumFragmentsRequired, FragmentSeparator and FragmentSize.
  • Added PerFieldAnalyzer parameter to use independent Analyzer for each columns.
  • Added sample of a custom Formatter org.apache.lucene.search.highlight.MyHTMLFormatter

2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09)

  • Fix compatibility problem between 10g/11g SQL Date representation on pipeline table function.

2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09)

  • Added Rhighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET pipeline table function
  • Added Phighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN VARCHAR2) RETURN ANYDATASET pipeline table function
  • Added lhighlight(NUMBER):VARCHAR2 ancilliary operator
  • Removed usage of Lucene deprecated API (Hits and IndexWriter for example)
  • Usage of FIRST_ROWS optimizer hits to decide how many rows load at first time
  • sync, optimize and rebuild interfaces now use index_name or [owner,index_name] arguments
  • A better build system to build Lucene Domain Index from sources
  • More tests
  • Tested against 11.1.0.7 and 10.2.0.3
  • See online docs to see usage of FIRST_ROWS and lhighlight() operator

2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)

  • Added parameter for CLOB enconding
  • More Like this function
  • NGram analyzer
  • EnglishWikipediaAnalyzer
  • DataStore interface include API for setting current connection
  • Now analyzers, queries, snowball and WikiPedia contrib packages are required
 

2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08)

  • Compiled against Lucene 2.3.2 production release
  • Used latest API for merging based on RAM usage
  • Use Writer for deleting during Sync
  • Confirm 4x improvement during indexing reported by Lucene dev group
  • Fix workaround which changes order of the rowids in ODCRIDList
  • Added an Spanish WikiPedia Analyzer for testing
  • Reports IOException instead of RunTimeException to signal EOF or File Not Found
  • Decouple Flush functionality from TableIndexer
 

2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08)

  • Added Rowid to lucene doc id caching.
  • Usage of LoadFirstFieldSelector during Document loading to only load rowid field.
  • Added a test suite which index a wikipedia dump inside the OJVM.

2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07)

  • DefaultUserDataStore requires usage of XPath text() expresion for getting only textual value
  • Added logging info SQL being executed at table indexer
  • Change document logging to FINER level
  • More pre-defined mapping at DefaultUserDataStore for NUMBER, BINARY_FLOAT, BINARY_DOUBLE, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Oracle types.
  • New parameter PopulateIndex:[true|false] for populating or not Lucene Index at creation time.
  • New parameter IncludeMasterColumn:[true|false], to choose whether or not index master column, useful with Virtual Columns and XMLType.
  • New parameter BatchCount:integer, to choose how many rows count are enqueued for indexing using create ... index ... parameters('SyncMode:OnLine');
  • Creating an index with SyncMode:OnLine causes that LuceneDomain index will enqueue batchs of "BatchCount" rows for index by AQ PLSQL callback in background. Lucene Domain Index is intermediately ready for querying after create.
  • Batch rowid indexing is doing using a pipeline function.

2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07)

Binary download (see package ojvm):
http://sourceforge.net/project/showfiles.php?group_id=56183
CVS access:
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm
  • sort by column passed at lcontains(col,query_parser_str,sort_str,corr_id) syntax
  • Logging support using Java Util Logging package
  • JUnit test suites emulating middle tier environment
  • Support for rebuild and optimize online for SyncMode:OnLine index
  • XMLDB Export
  • AutoTuneMemory parameter for replacing MaxBufferedDocs parameter
  • Functional column support

2.2.0.1.1 (second release, 27/Sep/07 05:39 AM)

Binary download:
https://issues.apache.org/jira/secure/attachment/12366661/ojvm-09-27-07.tar.gz
CVS access:
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm
 
  • LuceneDomainIndex.countHits() function to replace select count from .. where lcontains(..)>0 syntax.
  • support inline pagination at lcontains(col,'rownum:[n TO m] AND ...") function
  • rounding and padding support for columns date, timestamp, mumber, float, varchar2 and char
  • ODCI API array DML support
  • BLOB parameter support

2.2.0.1.0 (first release synchronized with lucene 2.2.0, 14/Sep/07 06:44 AM)

CVS access:
cvs -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism login
cvs -z3 -d:pserver:anonymous@dbprism.cvs.sourceforge.net:/cvsroot/dbprism co -P ojvm
 
  • Synchronized with latest Lucene 2.2.0 production
  • Replaced in memory storage using Vector based implementation by direct BLOB IO, reducing memory usage for large index.
  • Support for user data stores, it means you can not only index one column at time (limited by Data Cartridge API on 10g), now you can index multiples columns at base table and columns on related tabled joined together.
  • User Data Stores can be customized by the user, it means writing a simple Java Class users can control which column are indexed, padding used or any other functionality previous to document adding step.
  • There is a DefaultUserDataStore which gets all columns of the query and built a Lucene Document with Fields representing each database columns these fields are automatically padded if they have NUMBER or rounded if they have DATE data, for example.
  • lcontains() SQL operator support full Lucene's QueryParser syntax to provide access to all columns indexed, see examples below.
  • Support for DOMAIN_INDEX_SORT and FIRST_ROWS hint, it means that if you want to get rows order by lscore() operator (ascending,descending) the optimizer hint will assume that Lucene Domain Index will returns rowids in proper order avoided an inline-view to sort it.
  • Automatic index synchronization by using AQ's Call Back.
  • Lucene Domain Index creates extra tables named IndexName$T and an Oracle AQ named IndexName$Q with his storage table IndexName$QT at user's schema, so you can alter storage's preference if you want.
  • ojvm project is at SourceForge.net CVS, so anybody can get it and collaborate
  • Tested against 10gR2 and 11gR1 database.

2.0.0.1.3 (third release, 09/Jan/07 11:40 AM)

https://issues.apache.org/jira/secure/attachment/12348574/ojvm-01-09-07.tar.gz
  • The Data Cartridge API is used without column data to reduce the data stored on the queue of changes and speedup the operation of the synchronize method.
  • Query Hits are cached associated to the index search and the string returned by the QueryParser.toString() method.
  • If no ancillary operator is used in the select, do not store the score list.
  • The "Stemmer" argument is recognized as parameter given the argument for the SnowBall analyzer, for example: 
create index it1 on t1(f2) indextype is lucene.LuceneIndex parameters('Stemmer:English');.
  • Before installing the ojvm extension is necessary to execute "ant jar-core" on the snowball directory.
  • The IndexWriter.setUseCompoundFile(false) is called to use multi file storage (faster than the compound file) because there is no file descriptor limitation inside the OJVM, BLOBs are used instead of File.
  • Files are marked for deletion and they are purged when calling to Sync or Optimize methods.
  • Blob are created and populated in one call using Oracle SQL RETURNING information.
  • A testing script for using OE sample schema, with query comparisons against Oracle Text ctxsys.context index.

2.0.0.1.2 (second release, 20/Dec/06 02:03 PM)

https://issues.apache.org/jira/secure/attachment/12347614/ojvm-12-20-06.tar.gz
This new release of the OJVMDirectory Lucene Store includes a fully functional Oracle Domain Index with a queue for update/insert massive operations and a lot of performance improvement.

2.0.0.1.1 (first release, 28/Nov/06 01:04 PM)

https://issues.apache.org/jira/secure/attachment/12345967/ojvm-11-28-06.tar.gz
  • The complet API for the Oracle Domain index was completed, but the solution for the operator contains outside the where clause is not good.
  • I will implement a singleton solution for the OJVMDirectory object when is used in read only mode, typically when user performs select operations against tables which have columns indexed with Lucene. This implementation will increase a lot the final performance because the index reader will be ready for each select operation. Obviously I will check if another user or thread makes a write operation on the index to reload the read-only singleton.
  • The queue for storing the changes on the index is not implemented yet, I'll add it in a short time. 

2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM)