© 2014 tiesky.com

dbreezebased.codeplex.com

dbreeze.tiesky.com based projects

DBreezeBased.DocumentsStorage 

Definitions

DocumentGroup - all versions of one document, identified by one document ID.

InternalID - the DocumentGroup ID assigned by the system (the document ID).

ExternalID - an ID supplied by an external system to identify the document, as an alternative to the InternalID.

DocumentSpace - a logical grouping of documents (more precisely, of document groups) that defines a search space.

Prerequisite

DBreezeBased.dll needs the corresponding versions of DBreeze and protobuf-net. They are included in the dbreezebased.codeplex.com download archives.

Quick Start (reading the complete documentation is recommended)

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using DBreeze;

namespace VisualTester

{

    public class DocumentStorageTester

    {

        DBreezeEngine engine = null;

        DBreezeBased.DocumentsStorage.Storage DocuStorage = null;

        /// <summary>

        /// Main test

        /// </summary>

        public void TestStart()

        {

            if (engine == null)

            {

                DBreezeConfiguration conf = new DBreezeConfiguration()

                {

                    DBreezeDataFolderName = @"D:\temp\DBreezeTest\DBR1",                

                    Storage = DBreezeConfiguration.eStorage.DISK,                    

                };

               

                engine = new DBreezeEngine(conf);

            }

            if (DocuStorage == null)

            {

                DocuStorage = new DBreezeBased.DocumentsStorage.Storage(this.engine);

                DocuStorage.VerboseConsoleEnabled = true;   //Write document processing progress to the console

            }

            //this.Test_SingleInsert();

            this.Test_Search();

        }

        void Test_SingleInsert()

        {

            DBreezeBased.DocumentsStorage.Document doc = null;

            List<DBreezeBased.DocumentsStorage.Document> docs = new List<DBreezeBased.DocumentsStorage.Document>();

            doc = new DBreezeBased.DocumentsStorage.Document()

            {

                DocumentSpace = "space1",

                Content = new byte[1000],

                DocumentName = "name 1",

                ExternalId = "e1",

                Searchables = "table song",

                Description = "descr 1"

            };

            docs.Add(doc);

            doc = new DBreezeBased.DocumentsStorage.Document()

            {

                DocumentSpace = "space1",

                Content = new byte[1000],

                DocumentName = "name 2 ",

                ExternalId = "e2",

                Searchables = "table plash",

                Description = "descr 2"

            };

            docs.Add(doc);

            var retdocs = DocuStorage.AddDocuments(docs);

            DocuStorage.StartDocumentsIndexing();

        }

        void Test_Search()

        {

            var res = DocuStorage.SearchDocumentSpace(new DBreezeBased.DocumentsStorage.SearchRequest()

            {

                IncludeDocuments = false,

                SearchLogicType = DBreezeBased.DocumentsStorage.SearchRequest.eSearchLogicType.AND,

                DocumentSpace = "space1",

                Quantity = 200,              

                SearchWords = "ash"

                //SearchWords = "able lash"

                //SearchWords = "table"

                //SearchWords = "ухгалтер енгерско ауряд"

            });

            Console.WriteLine(res.VisualizeSearch());

        }

    }

}

Main

The main purpose of the DBreezeBased.DocumentsStorage.Storage class is to provide a mechanism for searching a document space by words or parts of words.

A document space can contain many different documents.

One document can be stored under the same ID many times. All versions of the document are kept, and it is possible to roll back to any previous version, but only the last applied (inserted or restored) version is searchable.

DBreezeBased.dll needs a current version of DBreeze and a protobuf-net build that corresponds to the .NET Framework version in use.

It works under .NET and Mono.

With this class we get, in effect, a document database, or a text-search helper for any existing object model.

The DBreezeBased.DocumentsStorage.Storage class can be used with a DBreeze instance that already exists in your project, and adding searchable documents can be integrated into an existing transaction, so that other DBreeze entities and searchable entities are inserted in one DBreeze transaction.

Let’s start with initialization:

using DBreeze;

DBreezeEngine engine = null;

DBreezeBased.DocumentsStorage.Storage DocuStorage = null;

Then we initialize the DBreeze engine and bind the Storage class to it:

        string DbPath = @"D:\temp\DBreezeTest\DBR1\";

        void initDB()

        {

            if (engine != null)

                return;

            engine = new DBreezeEngine(new DBreezeConfiguration()

            {

                 DBreezeDataFolderName = DbPath

            });

            DocuStorage = new DBreezeBased.DocumentsStorage.Storage(this.engine);

            //We can follow the internal DocumentsStorage processing in the output window:

            DocuStorage.VerboseConsoleEnabled = true;   //Write document processing progress to the console

        }

DocumentsStorage can work in two modes: standalone mode and in-transaction mode.

Standalone mode

Standalone mode is used when the document that we want to search later does not have to be saved together with other entities in one transaction. In standalone mode we use the AddDocuments function:

DBreezeBased.DocumentsStorage.Document doc = null;

            List<DBreezeBased.DocumentsStorage.Document> docs = new List<DBreezeBased.DocumentsStorage.Document>();

            doc = new DBreezeBased.DocumentsStorage.Document()

            {

                DocumentSpace = "space1",

                Content = new byte[1000],

                DocumentName = "name 1",

                ExternalId = "e1",

                Searchables = "table song",

                Description = "descr 1"

            };

            docs.Add(doc);

            doc = new DBreezeBased.DocumentsStorage.Document()

            {

                DocumentSpace = "space1",

                Content = new byte[1000],

                DocumentName = "name 2 ",

                ExternalId = "e2",

                Searchables = "table plash",

                Description = "descr 2"

            };

            docs.Add(doc);

            var retdocs = DocuStorage.AddDocuments(docs);

            //initiating the indexing routine in a parallel thread

            DocuStorage.StartDocumentsIndexing();

AddDocuments itself executes quite fast, because it does nothing except store the documents in a special DBreeze table for further processing. Further processing must be initiated by calling DocuStorage.StartDocumentsIndexing(). Document processing, indexing and preparation for future searches are performed in a parallel thread and take time proportional to the document content.

AddDocuments returns a List<DBreezeBased.DocumentsStorage.Document>: the same list that we supplied, but filled with extra necessary information. The Searchables property of each document is cleared.

A closer look at the DBreezeBased.DocumentsStorage.Document class:

DocumentSpace - (document search space name) a container for many documents that are searched by one SearchRequest. A user can store all of his documents inside one DocumentSpace and search them there later. If a document must be searchable in several document spaces, it must be stored several times, once under each document space.

DocumentSpaceId - the document space ID, assigned internally and returned automatically. It must not be supplied when adding a document in standalone mode (the in-transaction examples set it explicitly).

DocumentName - free text for the programmer, which is returned by the SearchRequest routines.

ExternalId - an ID that we can assign to the document and later use for retrieving or identifying it. The programmer must guarantee the uniqueness of the ExternalId.

InternalId - an ID that is returned by the system after the first insert of the document. It can be used when there is no ExternalId. To update a document, supply the same InternalId.

DocumentSequentialId - an internal, monotonically growing document identifier. Even if we update a document using its InternalId or ExternalId, each version gets a unique DocumentSequentialId, because the system supports document versioning.

InternalStructure - for internal use only.

Content - a byte[] where the programmer can store the original representation of the document. DBreezeBased.DocumentsStorage can be used simply for storing documents: a pdf, jpg, text, or anything else converted into byte[].

Searchables - every document is searched by words, and this field is designed to supply the searchable words, separated by spaces. The system doesn’t know which document type is supplied in Content and can’t distinguish between types. It is up to the programmer to parse the document beforehand (if it is not a text document but a pdf, doc or other object), retrieve all searchable words and supply them in the Searchables field, separated by spaces. Words can be repeated without any problem. Searchables can contain a whole book without any special parsing; the system itself keeps only the characters that form words and removes other symbols.

For example, we can supply a pdf document in Content, retrieve the words from it and put them into Searchables:

held a meeting with Government members to discuss measures to ensure sustainable development of the national economy and stability in the social sector

After the data is stored, we can search by words (and their parts):

Government

measures

sustainable

or even

vernm

ainable

meas

and this document will be in the list of found documents.

Description - free text which is stored (and later returned) together with the document.
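The handling of Searchables described above can be pictured with a small stand-alone sketch. It is not the DBreezeBased implementation: the class SearchablesSketch and its methods are hypothetical, lower-casing is an assumption of this sketch, and matching is done by a naive in-memory scan instead of a DBreeze index. It only shows the idea of keeping letters and digits as words and then finding documents by parts of those words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only; not part of DBreezeBased.
public static class SearchablesSketch
{
    // Splits raw text into words, keeping only letters and digits.
    // Lower-casing is an assumption made for this sketch.
    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text ?? string.Empty)
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                // Any other symbol ends the current word.
                words.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
            words.Add(current.ToString());
        return words;
    }

    // True when every search fragment occurs inside at least one word.
    public static bool MatchesAll(IEnumerable<string> words, string searchWords)
    {
        List<string> wordList = words.ToList();
        List<string> fragments = ExtractWords(searchWords);
        return fragments.Count > 0
            && fragments.All(fragment => wordList.Any(w => w.Contains(fragment)));
    }
}
```

With the sentence from the example, SearchablesSketch.ExtractWords(...) yields words such as "government", "measures" and "sustainable", and SearchablesSketch.MatchesAll(words, "vernm ainable meas") is true, because each fragment is a part of one of those words.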

To initiate the indexing routine, the StartDocumentsIndexing command must be called. It does not block the calling thread, because indexing runs in a parallel thread.

The output window can show extra information:

DBreezeBased.DocumentStorage. Processing has started.

DBreezeBased.DocumentStorage. Block processing is in progress...

DBreezeBased.DocumentStorage. Processed 2 documents with 11 words in DocuSpace 1. Took 688 ms; Select 0 ms; Insert 400 ms; UniqueWords: 11

DBreezeBased.DocumentStorage. Processing has finished

If we update the same documents, the search indexes of the previous document version are removed first and then the new ones are added. So the search results never contain stale documents.

In-Transaction mode

This mode is designed for the case when we need to store an object in the database and make its properties searchable by words, and we need to be sure that both happen in one transaction.

When we were initializing DBreeze we omitted the following lines, which make DBreeze work with the Protobuf serializer automatically:

    //Setting up DBreeze to work with protobuf

            DBreeze.Utils.CustomSerializator.ByteArraySerializator = DBreezeBased.Serialization.ProtobufSerializer.SerializeProtobuf;

            DBreeze.Utils.CustomSerializator.ByteArrayDeSerializator = DBreezeBased.Serialization.ProtobufSerializer.DeserializeProtobuf;

We will need this for the following examples.

Let’s create a Customer class:

[ProtoBuf.ProtoContract]

    public class Customer

    {

        public Customer()

        {

            Id = 0;

            Name = String.Empty;

            Surname = String.Empty;

            Phone = String.Empty;

        }

        [ProtoBuf.ProtoMember(1, IsRequired = true)]

        public long Id { get; set; }

               

        [ProtoBuf.ProtoMember(2, IsRequired = true)]

        public string Name { get; set; }

        [ProtoBuf.ProtoMember(3, IsRequired = true)]

        public string Surname { get; set; }

        [ProtoBuf.ProtoMember(4, IsRequired = true)]

        public string Phone { get; set; }

    }    

Now we want to write a function that stores customers in the database and makes their properties, such as Name, Surname and Phone, searchable as strings.

DBreezeBased.DocumentStorage. Processed 3 documents with 23 words in DocuSpace 1. Took 634 ms; Select 0 ms; Insert 359 ms; UniqueWords: 23

An initialized DocuStorage has the property DocuStorage.DocumentsStorageTablesPrefix. The DBreeze tables that serve the DocumentsStorage subsystem (indexing, processing, meta tables) reside in the same DBreeze scheme as the other entities of the project, so that they can share the same transactions. DocumentsStorageTablesPrefix was added to distinguish user-defined DBreeze tables from DocumentsStorage tables. By default this prefix is “dcstr”, and it can be changed if necessary.

Let’s create a procedure that stores Customers and creates the index for searching Customer properties in one transaction:

        void AddCustomers(List<Customer> customers,long documentSpaceId)

        {

            try

            {

                DBreezeBased.DocumentsStorage.Storage.InTran_DocumentAppender docAppender = null;

                DBreezeBased.DocumentsStorage.Document doc = null;

           

                using (var tran = engine.GetTransaction())

                {

                     List<string> tbls = new List<string>();

                    tbls.Add("c");  //Customer table

                    //Blocking on write tables concerning DBreezeBased.DocumentsStorage

                    tbls.Add(DocuStorage.DocumentsStorageTablesPrefix + "d" + documentSpaceId.ToString());      //blocking documentSpace

                    tbls.Add(DocuStorage.DocumentsStorageTablesPrefix + "p");   //processing table

                    tran.SynchronizeTables(tbls);

                    //Initializing docAppender, supplying transaction and DocuStorage.DocumentsStorageTablesPrefix

                    docAppender = new DBreezeBased.DocumentsStorage.Storage.InTran_DocumentAppender(tran, DocuStorage.DocumentsStorageTablesPrefix);                    

                    foreach (var customer in customers)

                    {                      

                        doc = new DBreezeBased.DocumentsStorage.Document()

                        {                            

                            ExternalId = customer.Id.ToString(),

                            Searchables = GetSearchablesFromCustomer(customer),

                            DocumentSpaceId = documentSpaceId

                        };

                        docAppender.AppendDocument(doc);

                        tran.Insert<long, Customer>("c", customer.Id, customer);

                    }

                    tran.Commit();

                }

//Start parallel indexing procedure

                DocuStorage.StartDocumentsIndexing();

            }

            catch (Exception ex)

            {

                throw;  //rethrow without resetting the stack trace

            }

        }

Now let’s add some customers:

  void Test_AddCustomers()

        {

//Getting document spaceId

            long docSpaceId = DocuStorage.GetDocumentSpaceId("SearchableCustomers", true);

            List<Customer> customers = new List<Customer>();

            Customer customer = new Customer()

            {

                Id = 1,    //Obtaining and handling this ID is out of the scope, please refer to DBreeze documentation

                Name = "Liu",

                Surname = "Kang",

                Phone = "040 411 51 51"

            };

            customers.Add(customer);

            customer = new Customer()

            {

                Id = 2,    //Obtaining and handling this ID is out of the scope, please refer to DBreeze documentation

                Name = "Johny",

                Surname = "Cage",

                Phone = "040 411 51 53"

            };

            customers.Add(customer);

            customer = new Customer()

            {

                Id = 3,    //Obtaining and handling this ID is out of the scope, please refer to DBreeze documentation

                Name = "Kung",

                Surname = "Lao",

                Phone = "040 411 51 56"

            };

            customers.Add(customer);

            //We assume that we add all these customers for search under one Document Search Space

            AddCustomers(customers, docSpaceId);

        }

string GetSearchablesFromCustomer(Customer customer)

        {

            if (customer == null)

                return String.Empty;

            return ((customer.Name ?? "") + " " + (customer.Surname ?? "") + " " + (customer.Phone ?? "")).Trim();

        }

Searching data

It’s time to perform a search. Search results are not available at the moment DocuStorage.StartDocumentsIndexing is called, but after a small delay proportional to the quantity of inserted data.

Two special classes were developed for searching: DBreezeBased.DocumentsStorage.SearchRequest and DBreezeBased.DocumentsStorage.SearchResponse. They can be configured in different ways to achieve the necessary result.

void Test3()

        {

            try

            {

               // long docSpaceId = DocuStorage.GetDocumentSpaceId("SearchableCustomers", false);

                using (var tran = engine.GetTransaction())

                {

                    DBreezeBased.DocumentsStorage.SearchResponse res = null;

                    res = this.DocuStorage.SearchDocumentSpace(new DBreezeBased.DocumentsStorage.SearchRequest()

                    {

                        IncludeDocuments = true,

                        SearchLogicType = DBreezeBased.DocumentsStorage.SearchRequest.eSearchLogicType.AND,

                        DocumentSpace = "SearchableCustomers",

                        Quantity = 200,

                        MaximalExcludingOccuranceOfTheSearchPattern = 100,  //Must stay lower for low RAM systems (Mobile Phones), and bigger for servers

                        SearchWords = "ang",  //returns ExternalId 1 where SearchLogicType is AND

                        //SearchWords = "ang 51",   //returns ExternalId 1 where SearchLogicType is AND

                        //SearchWords = "ang 53",   //returns ExternalId 1,2 where SearchLogicType is OR

                        //SearchWords = "ang 53",   //returns nothing where SearchLogicType is AND

                        //SearchWords = "040",   //returns ExternalId 1,2,3 where SearchLogicType is AND

                        IncludeDocumentsContent = false,

                        IncludeDocumentsSearchanbles = false

                    },

                    tran);      //optional, for the case when used inside of a transaction, added on [20160127]

                    foreach (var rs in res.Documents)

                    {

                        //Here we can use rs.ExternalId to get Customers from the "c" table

                    }

                }

             

            }

            catch (Exception ex)

            {

                throw;  //rethrow without resetting the stack trace

            }

        }
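The results noted in the SearchWords comments above can be checked against the three sample customers with a stand-alone sketch. It does not call DBreezeBased: the class SearchLogicSketch is hypothetical, the searchable strings are the ones GetSearchablesFromCustomer would produce for the sample customers, and matching is a plain in-memory substring scan, which is an assumption about the visible behaviour only and not about the internal index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory model of AND / OR search logic; not part of DBreezeBased.
public static class SearchLogicSketch
{
    public enum Logic { AND, OR }

    // Searchables of the three sample customers, keyed by ExternalId.
    public static Dictionary<string, string> SampleCustomers()
    {
        return new Dictionary<string, string>
        {
            { "1", "Liu Kang 040 411 51 51" },
            { "2", "Johny Cage 040 411 51 53" },
            { "3", "Kung Lao 040 411 51 56" }
        };
    }

    // Returns the ExternalIds whose words contain the search fragments.
    public static List<string> Search(
        IDictionary<string, string> searchablesByExternalId,
        string searchWords,
        Logic logic)
    {
        char[] space = { ' ' };
        string[] fragments = searchWords.ToLowerInvariant()
            .Split(space, StringSplitOptions.RemoveEmptyEntries);

        var found = new List<string>();
        if (fragments.Length == 0)
            return found;

        foreach (KeyValuePair<string, string> pair in searchablesByExternalId)
        {
            string[] words = pair.Value.ToLowerInvariant()
                .Split(space, StringSplitOptions.RemoveEmptyEntries);

            // A fragment hits when it is a part of at least one word.
            Func<string, bool> hit = f => words.Any(w => w.Contains(f));

            // AND: every fragment must hit. OR: one hit is enough.
            bool match = logic == Logic.AND ? fragments.All(hit) : fragments.Any(hit);
            if (match)
                found.Add(pair.Key);
        }

        found.Sort(StringComparer.Ordinal);
        return found;
    }
}
```

In this model "ang" and "ang 51" with AND find only ExternalId 1, "ang 53" finds 1 and 2 with OR and nothing with AND, and "040" finds all three customers, because every phone number contains it.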

MaximalExcludingOccuranceOfTheSearchPattern

The default value is 10000. If the number of words found by StartsWith for a search pattern exceeds MaximalExcludingOccuranceOfTheSearchPattern, the pattern is excluded from the search. For example, after uploading 122 Russian books containing 700000 unique words, we search for the combination "ал".

         We have found it 3240 times:

         ал

         ала

         алабан

         алаберная

         алаберный

         алаболки

         алаболь

         ...

         etc.

Note: this is not the number of documents in which the pattern occurs, but the StartsWith result over all unique words in the document space.

If MaximalExcludingOccuranceOfTheSearchPattern of our request is 100, it can mean that we want to keep only 100 of the found words and retrieve the documents bound to those 100 words. Normally it means that the search word “ал” is too short for a good search and more characters must be supplied.
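The quantity that is compared with MaximalExcludingOccuranceOfTheSearchPattern can be illustrated with a stand-alone sketch. The class PatternOccurrenceSketch and its methods are hypothetical, and the word list is held in memory instead of a DBreeze table. The sketch only counts the unique words that start with the pattern and compares that count with the limit; what the library does with a pattern that exceeds the limit is described in the text above and is not modelled here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; not part of DBreezeBased.
public static class PatternOccurrenceSketch
{
    // Counts the unique words that start with the pattern.
    // This is a count of words, not of documents.
    public static int CountStartsWith(IEnumerable<string> words, string pattern)
    {
        return words
            .Distinct(StringComparer.Ordinal)
            .Count(w => w.StartsWith(pattern, StringComparison.Ordinal));
    }

    // True when the pattern is found more often than the limit allows.
    public static bool ExceedsLimit(IEnumerable<string> words, string pattern, int limit)
    {
        return CountStartsWith(words, pattern) > limit;
    }
}
```

For the seven words listed above (ал, ала, алабан, алаберная, алаберный, алаболки, алаболь), CountStartsWith gives 7 for "ал" and 5 for "алаб", so with a limit of 5 only the shorter pattern exceeds the limit. Supplying more characters narrows the pattern in the same way in a real document space.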

Rollback to previous document version

Rolling back to a previous document version is possible via the DocuStorage.RollbackToVersion command.

Standalone removing of documents

Removing documents in standalone mode is possible via DocuStorage.RemoveDocumentByInternalID or DocuStorage.RemoveDocumentByExternalID.

In-Transaction removing of documents

  void RemoveCustomers(List<Customer> customers, long documentSpaceId)

        {

            try

            {

                DBreezeBased.DocumentsStorage.Storage.InTran_DocumentAppender docAppender = null;

                DBreezeBased.DocumentsStorage.Document doc = null;

                using (var tran = engine.GetTransaction())

                {

                    List<string> tbls = new List<string>();

                    tbls.Add("c");  //Customer table

                    //Blocking on write tables concerning DBreezeBased.DocumentsStorage

                    tbls.Add(DocuStorage.DocumentsStorageTablesPrefix + "d" + documentSpaceId.ToString());      //blocking documentSpace

                    tbls.Add(DocuStorage.DocumentsStorageTablesPrefix + "p");   //processing table

                    tran.SynchronizeTables(tbls);

                    //Initializing docAppender, supplying transaction and DocuStorage.DocumentsStorageTablesPrefix

                    docAppender = new DBreezeBased.DocumentsStorage.Storage.InTran_DocumentAppender(tran, DocuStorage.DocumentsStorageTablesPrefix);

                    foreach (var customer in customers)

                    {                        

                        doc = new DBreezeBased.DocumentsStorage.Document()

                        {                            

                            ExternalId = customer.Id.ToString(),                            

                            DocumentSpaceId = documentSpaceId

                        };

                        docAppender.RemoveDocument(doc);

                        tran.RemoveKey<long>("c", customer.Id);

                    }

                    tran.Commit();

                }

                DocuStorage.StartDocumentsIndexing();

            }

            catch (Exception ex)

            {

                throw;  //rethrow without resetting the stack trace

            }

        }

Updating the document

…. inserting

 doc = new DBreezeBased.DocumentsStorage.Document()

            {

                DocumentSpace = "space1",

                // Content = new byte[1000],

                DocumentName = "name 3",

                ExternalId = "e3",

                Searchables = "New England is digging out this morning after receiving more than 30 inches of snow in some areas from a major northeast storm. A travel ban was lifted at midnight in Massachusetts, but authorities are urging drivers to stay off the roads if necessary as cleanup efforts continue.",

                Description = "descr 3"

            };

            docs.Add(doc);

            var retdocs = DocuStorage.AddDocuments(docs);

            DocuStorage.StartDocumentsIndexing();

…. updating

 doc = new DBreezeBased.DocumentsStorage.Document()

            {

                DocumentSpace = "space1",

                // Content = new byte[1000],

                DocumentName = "name 3",

                ExternalId = "e3",

                Searchables = "My Bonnie lies over the ocean",

                Description = "descr 3"

            };

            docs.Add(doc);

            var retdocs = DocuStorage.AddDocuments(docs);

            DocuStorage.StartDocumentsIndexing();

After the document update, searching for “New England is digging” will not return the document with ExternalId “e3”, but searching for “over the ocean” will.

[20160111]

How to store searchables for full-text (or full-text StartsWith) search only, economizing storage space.

Setting DocuStorage.SearchWordMinimalLength = 0

makes the system store only complete words (e.g. “mynewords to search”), so that later only full-word or word-StartsWith searches succeed, like this:

OK for search: “mynewords” or “mynewor” or “mynew sear”.

NOT OK for search: “ynewords earch”
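The difference between the two storage variants can be pictured with a stand-alone sketch. It assumes that searching inside words is made possible by also storing the tails of every word, which is this sketch's assumption and not a statement about the DBreezeBased internals. The class WordIndexSketch, the wholeWordsOnly flag and the minFragmentLength parameter are hypothetical names and do not correspond to library settings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; not part of DBreezeBased.
public static class WordIndexSketch
{
    // Builds the set of index entries for the given words.
    // wholeWordsOnly = true : only the complete words are stored.
    // wholeWordsOnly = false: every tail of a word that is at least
    //                         minFragmentLength long is stored as well.
    public static SortedSet<string> BuildEntries(
        IEnumerable<string> words,
        bool wholeWordsOnly,
        int minFragmentLength = 2)
    {
        var entries = new SortedSet<string>(StringComparer.Ordinal);
        foreach (string word in words)
        {
            entries.Add(word);
            if (wholeWordsOnly)
                continue;

            for (int start = 1; start + minFragmentLength <= word.Length; start++)
                entries.Add(word.Substring(start));
        }
        return entries;
    }

    // A fragment is found when some stored entry starts with it.
    public static bool Find(SortedSet<string> entries, string fragment)
    {
        return entries.Any(e => e.StartsWith(fragment, StringComparison.Ordinal));
    }
}
```

With the words "mynewords", "to" and "search" stored as whole words only, Find succeeds for "mynewor" and "sear" but fails for "ynewords" and "earch", which matches the OK and NOT OK cases above. With tails stored as well, all four fragments are found, at the cost of many more index entries.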

Notes (for future explanation)

-The maximal length of a word before it is split is 50 chars/digits/symbols

-VerboseConsoleEnabled = true;

Quantity of chars per processing block (this amount must be held in RAM while a data block is processed). The default value is 10,000,000 chars.

Android settings

DocuStorage.QuantityOfWordsInBlock = 100;

DocuStorage.MaxCharsToBeProcessedPerRound = 1000000;

DocuStorage.MinimalBlockReservInBytes = 1000;

-ReindexDocuments

Copyright © 2014  dbreeze.tiesky.com Alexey Solovyov < hhblaze@gmail.com > Ivars Sudmalis < zikills@gmail.com >