Full-text Search Report

2006011291 卿培

view the latest version of this report at:

http://docs.google.com/View?id=dhbwwj2z_1956c3hdgtdq

Dev environment


  • Mac OS X 10.6.1
  • NetBeans 6.7.1
  • Eclipse 3.5.1
  • JDK 1.6

Using Guide


  1. Use console to start the app with parameter: java -Xmx200m -jar 2006011291.jar
    • The batch file to run this command is ./dist/run.bat for Windows or ./dist/run for Unix/Linux.
    • -Xmx200m: This application requires a large amount of memory. Without this parameter, you might encounter "out-of-memory" exception.
      200MB is enough for the sample data set of 3148 files. More memory might be necessary for a larger data set.

  2. The data files should be stored in the ./data folder. Sub-folders are allowed, contents in sub-folders will also be included in the search results.
  3. When the app has started or the data folder has changed, click the "init" button first to parse the text files and to create the inverted list of all the characters and there occurrences in documents.

  4. When you see the messages as the following, the initialization has completed. Now you can see the memory consumption to do either kind of search.

  5. Proceed with entering query words in the text-box and click the "search" button to start searching with 2 different algorithms.
  6. The search results are listed in the list-box while the searching time is shown below them.
  7. When you click on a result, the URI of the document will be shown in the URI box. The first 100 characters are shown in the preview box.
  8. In the console window, you will see more details of the search functions.
    The following information are given:
    • time consumption for different part of the function
    • total time consumption
    • number of results
    • speedup\ rate = {Linear\ search\ time\over Indexed\ search\ time}

Project Overview



Details on search algorithms

  • Documents are read into an ArrayList called index. Each item of index is an IndexNode, which contains docID, docURI, docTitle, docString storing the pre-processed plain-text of current document and another HashSet docContent storing every unique character of the document.
  • Inverted list is a Hashtable, in which the key is a Character while the value is an HashSet containing all the docIDs where the key is in.
    • In earlier versions, this list is stored in two ArrayLists, keys and values separately. Searching in the list is so slow that creating the list of about 3000 documents cost more than 30 minutes.
    • With Hashtable, the creation time of the list rapidly decrease to seconds.
  • In linear search, the app:
    1. Get the current time start
    2. Find in each IndexNode whether all the query characters appear in the content of current document
      • Yes
        1. Add current doc to result.
        2. Continue searching.
      • No
        1. Continue searching in next IndexNode.
    3. Get the current time end
    4. Calc time consumption (end - start) ns and store it in time1
    5. Return the result
  • In indexed search, the app:
    1. Get the current time start
    2. See if all the query characters appear as keys in invlist
      • Yes
        1. Get the shortest value of all the query characters as keys
        2. Find whether each docID in shortest appear in all the values of other query characters
          NOTE: In newer versions of code, this is done by calling .retainAll() method.
          • Yes
            1. This docID is a result
            2. Add this docID to result
          • No
            1. This docID must not be a result
        3. Go to next docID in shortest
      • No
        1. Return empty result
    1. Get the current time end
    2. Calc time consumption (end - start) ns and store it in time2
    3. Return the result

Optimizing the search speed

IndexNode Structure

Storage of document content changed from
    ArrayList<Character> docContent
to
    String docString + HashSet<Character> docContent

String offers a better performance when doing indexOf() operation while HashSet does better for .containsAll() operation.

This makes linear search about 10x faster.

Inverted List Structure

The structure of inverted list changed from
    ArrayList<Character> invIndexKey + ArrayList<ArrayList<Integer>> invIndexValue
to
    Hashtable<Character, HashSet<Integer>> invlist

This makes the creation time of the list drop dramatically from over 20 minutes to ~2 second.

HashSet Intersection

A probe-based algorithm is used to calculate the intersection of results for every character of the query.
.retainAll() is used to substitute a for statement here.

This trivial change normally leads to 10~20% faster intersection.

Linear v.s. Indexed




One more test



Duplicating the sample data set 4 times to test the application, the results are:

Reading files.
File loading time: 2.0433588E10
# of files processed: 12596
Creating index.
Index creation time: 5.887865E9
# of index entries: 4708

Staring search.
    Query splitter time: 1000
    Looking up time: 12638000
Linear search time: 1.2639E7
Found 528 results.
    Query splitter time: 1000
    Finding shortest time: 12000
    Intersection time: 559000 with 528 results
Indexed search time: 572000.0
Found 528 results.
Speedup rate = 22.096153846153847

The larger data set, the larger speedup rate.


Problems and solutions

Chinese character encoding

All the characters are stored using UTF-8 encoding.

Positioning of character in documents

Document contents are converted to a String during pre-process. The position of a character in a String could be easily obtained by methods implemented in java.lang.String.

Physical address and Logical address

The texts files are organized by the operation system's filesystem. The demo app only gets the URI of every file and call the Java VM to get the resources it needs. Therefore, this app does not operate with physical addresses.

Extra-large files and extremely small files

Loading text files costs an amount of time in this app. By using the BufferedReader, loading time is sharply reduced. If the app can access the hardware directly, however, more time might be saved by reading small files together in blocks. This is just a guess~

Compression

In linear search, we just need to know whether a character appears in the document. Therefore, a HashSet, (instead of an ArrayList in earlier versions,) is used to store all the characters in the document. This reduces the space need by eliminating the repetition of commonly used characters.

Logic operation

In this character-based search, a query is broken into characters and searched individually. Results are merged by intersecting all the sub-results.

Sub-folder processing

The text file loading method takes a File as input. It will check first whether the input is a folder. If so, it will call itself to process all the contents of that folder recursively. As a result, Any text file, as long as it is in the "data path", which is "./data/", will be processed and indexed.