
INFORMATION RETRIEVAL SYSTEM


INTRODUCTION

Definition:

  • An Information Retrieval System is a system capable of the storage, retrieval, and maintenance of information.
  • → Information in this context can be composed of the following:
      • Text (including numeric and date data)
      • Images
      • Audio
      • Video
      • Other multimedia objects
  • → Of all the above data types, text is the only data type that supports full functional processing.
  • → The term “item” is used to represent the smallest complete unit that is processed and manipulated by the system.
  • → The definition of an item varies by how a specific source treats information. A complete document, such as a book, newspaper or magazine, could be an item; at other times each chapter or article may be defined as an item.
  • → The efficiency of an information system lies in its ability to minimize the overhead for a user to find the needed information.


Overhead from a user’s perspective is the time required to find the information needed, excluding the time for actually reading the relevant data. Thus search composition, search execution, and reading non-relevant items are all aspects of information retrieval overhead.

Objectives

The general objective of an Information Retrieval System is to minimize the overhead of a user locating needed information. Overhead can be expressed as the time a user spends in all of the steps leading to reading an item containing the needed information. For example:

  • Query generation
  • Query execution
  • Scanning results of the query to select items to read
  • Reading non-relevant items
  • etc.


→ The two major measures commonly associated with information systems are precision and recall.

→ Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

→ Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
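As a quick illustration, here is a minimal Python sketch of both measures for a single query; the document identifiers and the `retrieved`/`relevant` sets are invented for the example.

```python
# Precision and recall for one query over hypothetical result sets.
retrieved = {"doc1", "doc2", "doc3", "doc4"}   # items returned by the search
relevant = {"doc2", "doc4", "doc7"}            # items actually relevant

retrieved_relevant = retrieved & relevant
precision = len(retrieved_relevant) / len(retrieved)  # 2/4 = 0.50
recall = len(retrieved_relevant) / len(relevant)      # 2/3 = 0.67
print(f"precision={precision:.2f} recall={recall:.2f}")
```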


Functional Overview


Item Normalization: The first step in any integrated system is to normalize the incoming items to a standard format. In addition to translating the multiple external formats that might be received into a single consistent data structure that can be manipulated by the functional processes, item normalization provides logical restructuring of the item.

Relationship to DBMS, Digital Libraries and Data Warehouses:

→ An Information Retrieval System is software that has the features and functions required to manipulate “information” items, whereas a DBMS is optimized to handle “structured” data.

→ All three systems (IRS, Digital Libraries & Data Warehouses) are repositories of information, and their primary goal is to satisfy user information needs.

→ Digital Libraries value original sources, whereas these are of less concern to an IRS.

→ With direct electronic access available to users, the social aspects of congregating in a library and learning from librarians, friends and colleagues will be lost, and new electronic collaboration equivalencies will come into existence.


→ With the proliferation of information available in electronic form, the role of the search intermediary will shift from being an expert in search to being an expert in source analysis.

→ Information storage and retrieval technology has addressed only a small subset of the issues associated with digital libraries.

→ The term Data Warehouse comes more from the commercial sector than academic sources.

→ Its goal is to provide decision makers with the critical information needed to answer future-direction questions.

→ A data warehouse consists of several components, such as an input function that captures data and moves it into the warehouse; it is more focused on structured data and decision support technologies.

→ Data mining is a search process that automatically analyzes data and extracts relationships and dependencies that were not part of the database design.

INFORMATION RETRIEVAL SYSTEM CAPABILITIES

Search:

→ The major functions available in an Information Retrieval System are described below.


The objective of the search capability is to allow a mapping between a user’s specified need and the items in the information database that will answer that need. Depending upon the algorithms used in a system, many different functions are associated with the system’s understanding of the search statement. The functions define the relationships between the terms in the search statement.

Boolean Logic: Allows a user to logically relate multiple concepts together to define what information is needed.
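As an illustration, here is a minimal Python sketch of Boolean retrieval over an inverted index; the tiny index and the query are invented for the example.

```python
# Boolean retrieval: set operations over per-term document-id sets.
index = {
    "oil":    {1, 3, 5},
    "mexico": {1, 2},
    "peru":   {4, 5},
}

# Query: oil AND (mexico OR peru)
hits = index["oil"] & (index["mexico"] | index["peru"])
print(sorted(hits))  # [1, 5]
```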


Continuous Word Phrases (CWP): two or more words that are treated as a single semantic unit.

Fuzzy Searches: Provide the capability to locate spellings of words that are similar to the entered search term. Primarily used to compensate for errors in the spelling of words. Fuzzy searching increases recall at the expense of decreasing precision.
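One simple way to approximate fuzzy search is similarity matching against the index vocabulary; the sketch below uses `difflib` from the Python standard library, with an invented vocabulary.

```python
import difflib

vocabulary = ["retrieval", "retriever", "reversal", "revival"]
query_term = "retreival"  # misspelled search term

# Return vocabulary words whose similarity ratio exceeds the cutoff.
print(difflib.get_close_matches(query_term, vocabulary, n=3, cutoff=0.7))
# e.g. ['retrieval', 'retriever']
```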


Term Masking: The ability to expand a query term by masking a portion of the term and accepting as valid any processing token that maps to the unmasked portion of the term.
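A minimal sketch of term masking, using shell-style wildcards from Python's `fnmatch` module against an invented vocabulary:

```python
import fnmatch

vocabulary = ["compute", "computer", "computing", "complete"]

# Expand the masked query term into every matching processing token.
print(fnmatch.filter(vocabulary, "comput*"))  # ['compute', 'computer', 'computing']
print(fnmatch.filter(vocabulary, "comp?te"))  # ['compute'] (single-character mask)
```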

Numeric and Date Ranges: Allows for specialized numeric or date range processing against those words.

Concept/Thesaurus Expansion: A thesaurus is typically a one-level or two-level expansion of a term to other terms that are similar in meaning, whereas a concept class is a tree structure that expands each meaning of a word into potential concepts that are related to the initial term.


Natural Language Queries: Allow a user to enter a prose statement that describes the information that the user wants to find. The longer the prose, the more accurate the results returned.

→ Multimedia Queries: The user interface becomes far more complex with the introduction of the availability of multimedia items.

→ All of the previous discussions still apply to search of the textual portions of a multimedia database, but in addition the user has to be able to specify search terms for the other modalities.

Browse:

→ Once the search is complete, browse capabilities provide the user with the capability to determine which items are of interest and select those to be displayed.

→ There are two ways of displaying a summary of the items that are associated with a query: Line Item Status and Data Visualization.


Different browsing capabilities are:

  • Miscellaneous: There are many additional functions that facilitate the user’s ability to input queries, reducing the time it takes to generate the queries and reducing, a priori, the probability of entering a poor query.
  • Vocabulary Browse: This provides the capability to display, in alphabetically sorted order, the words in the document database. Logically, all unique words (processing tokens) in the database are kept in sorted order, along with a count of the number of unique items in which each word is found.
  • Iterative Search and Search History Log: Frequently a search returns a hit file containing many more items than the user wants to review. Rather than generate a completely new query, the user can iteratively refine the previous query, using the search history log as the starting point.


Multimedia: Once a list of potential items that satisfy the query is discovered, the techniques for displaying them introduce new challenges when the items are multimedia. To display more aggregate data, textual interfaces sometimes allow for clustering of the hits and then use a graphical display to show a higher-level view of the information. Neither of these techniques lends itself directly to multimodal information; only the textual aspect of the multimedia can use all of the techniques described.


CATALOGING AND INDEXING


INDEXING:

  • The transformation from a received item to a searchable data structure is called indexing.
  • The process can be manual or automatic.
  • Indexing creates either a direct search of the document database or an indirect search through index files.
  • Concept-based representation: instead of transforming the input into a searchable format, some systems transform the input into a different, concept-based representation; search then matches and returns items at the concept level.


History of indexing

  • The history of indexing shows the dependency of information processing capabilities on manual, and then automatic, processing systems.
  • Indexing, originally called cataloguing, is the oldest technique for identifying the contents of items to assist in their retrieval.
  • There is overlap between full item indexing and the public and private indexing of files.


Two decisions shape the indexing process:

1. Decide the scope of the indexing and the level of detail to be provided, based on the usage scenarios of the users.

2. Decide whether to link index terms together in a single index for a particular concept.


TEXT PROCESSING


Text processing phases

  • 1. Document Parsing. Documents come in all sorts of languages, character sets, and formats; often, the same document may contain multiple languages or formats, e.g., a French email with Portuguese PDF attachments. Document parsing deals with the recognition and “breaking down” of the document structure into individual components. In this preprocessing phase, unit documents are created; e.g., emails with attachments are split into one document representing the email and as many documents as there are attachments.
  • 2. Lexical Analysis. After parsing, lexical analysis tokenizes a document, seen as an input stream, into words. Issues related to lexical analysis include the correct identification of accents, abbreviations, dates, and cases. The difficulty of this operation depends much on the language at hand: for example, the English language has neither diacritics nor cases, French has diacritics but no cases, and German has both diacritics and cases. The recognition of abbreviations and, in particular, of time expressions would deserve a separate chapter due to its complexity and the extensive literature in the field.


Contd…

  • 3. Stop-Word Removal. A subsequent step optionally applied to the results of lexical analysis is stop-word removal, i.e., the removal of high-frequency words. For example, given the sentence “search engines are the most visible information retrieval applications” and a classic stop-word set such as the one adopted by the Snowball stemmer, the effect of stop-word removal would be: “search engine most visible information retrieval applications”.
  • 4. Phrase Detection. This step captures text meaning beyond what is possible with pure bag-of-words approaches, thanks to the identification of noun groups and other phrases. Phrase detection may be approached in several ways, including rules (e.g., retaining terms that are not separated by punctuation marks), morphological analysis, syntactic analysis, and combinations thereof. For example, scanning our example sentence “search engines are the most visible information retrieval applications” for noun phrases would probably result in identifying “search engines” and “information retrieval”.
  • 5. Stemming and Lemmatization. Following phrase extraction, stemming and lemmatization aim at stripping down word suffixes in order to normalize the word. In particular, stemming is a heuristic process that “chops off” the ends of words in the hope of achieving the goal correctly most of the time; a classic rule-based algorithm for this was devised by Porter [280]. According to the Porter stemmer, our example sentence “Search engines are the most visible information retrieval applications” would result in: “Search engin are the most visibl inform retriev applic” (see the sketch after this list).
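The following Python sketch reproduces steps 3 and 5 on the running example, using NLTK's Porter stemmer (this assumes the `nltk` package is installed; the stop list here is a small hand-picked subset, not Snowball's full list).

```python
from nltk.stem import PorterStemmer

stop_words = {"are", "the", "most"}   # tiny illustrative stop list
sentence = "search engines are the most visible information retrieval applications"

tokens = [t for t in sentence.split() if t not in stop_words]
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# ['search', 'engin', 'visibl', 'inform', 'retriev', 'applic']
```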


Lemmatization is a process that typically uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word, thereby collapsing its inflectional forms (see, e.g., [278]). For example, our sentence would result in “Search engine are the most visible information retrieval application” when lemmatized according to a WordNet-based lemmatizer.

Weighting. The final phase of text preprocessing deals with term weighting. As previously mentioned, words in a text have different descriptive power; hence, index terms can be weighted differently to account for their significance within a document and/or a document collection. Such a weighting can be binary, e.g., assigning 0 for term absence and 1 for presence.
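A minimal sketch contrasting binary weighting with raw term-frequency weighting for a single document; the token list is invented.

```python
from collections import Counter

tokens = ["search", "engin", "visibl", "inform", "retriev", "applic", "search"]

tf = Counter(tokens)                  # frequency-based weights
binary = {term: 1 for term in tf}     # 1 = present (absence would be 0)
print(tf["search"], binary["search"])  # 2 1
```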


SCOPE OF INDEXING

  • When indexing is performed manually, problems arise from two sources: the author and the indexer.
  • The vocabulary domain may differ between the author and the indexer.
  • This results in different quality levels of indexing.
  • The indexer must determine when to stop the indexing process.
  • Two factors decide the level at which to index the concepts in an item: the exhaustivity and the specificity of indexing desired.
  • Exhaustivity of indexing is the extent to which the different concepts in the item are indexed. For example, if two sentences of a 10-page item on microprocessors discuss on-board caches, should this concept be indexed?
  • Specificity relates to the preciseness of the index terms used in indexing. For example, whether the term “processor”, the term “microcomputer”, or the term “Pentium” should be used in the index of an item is based upon the specificity decision.
  • Indexing an item only on the most important concepts in it and using general index terms yields low exhaustivity and specificity.
  • Another indexing decision is what portion of an item is to be indexed. The simplest case is to limit the indexing to the title and abstract (conceptual) zones.
  • General indexing leads to loss of precision and recall.


PRECOORDINATION AND LINKAGES

  • Another decision in the indexing process is whether linkages are available between index terms for an item.
  • Linkages are used to correlate attributes associated with concepts discussed in an item. Coordinating index terms at index time is called precoordination.
  • When index terms are not coordinated at index time, the coordination occurs at search time. This is called postcoordination, implemented by “AND”ing index terms together.
  • A factor that must be determined in the linkage process is the number of terms that can be related.
  • Ex.: an item discusses “the drilling of oil wells in Mexico by CITGO and the introduction of oil refineries in Peru by the U.S.”


AUTOMATIC INDEXING

  • Automatic indexing requires only a few seconds per item, depending on the processor and the complexity of the algorithms used to generate the index.
  • Its advantage is consistency in the index term selection process.
  • Indexes resulting from automated indexing fall into two classes: weighted and unweighted.
  • Unweighted indexing system: the existence of an index term in a document, and sometimes its word location, is kept as part of the searchable data structure.
  • Weighted indexing system: an attempt is made to place a value on each index term associated with a concept in the document, based on the frequency of occurrence of the term in the item (a small sketch follows this list).
  • Values are normalized between 0 and 1.
  • The results are presented to the user in order of rank value, from highest to lowest.
  • Indexing by term:
  • The terms (vocabulary) of the original item are used as the basis of the index process.
  • There are two major techniques for creation of the index: statistical and natural language.
  • Statistical techniques can be based upon vector models and probabilistic models, with a special case being the Bayesian model (accounting for the uncertainty inherent in the model selection process).
  • They are called statistical because their calculation of weights uses information such as the frequency of occurrence of words.
  • Natural language techniques also use some statistical information, but perform more complex parsing to define the final set of index concepts.
  • Other weighted systems are discussed as vectorized information systems.
  • These systems emphasize weights as a foundation for information detection and store these weights in a vector form.
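A minimal sketch of a weighted index entry, normalizing term frequencies into [0, 1] by dividing by the most frequent term in the item; the tokens are invented.

```python
from collections import Counter

tokens = ["oil", "well", "oil", "mexico", "drill", "oil"]
tf = Counter(tokens)
max_tf = max(tf.values())

# Normalized weights in [0, 1]: 1.0 for the most frequent term.
weights = {term: count / max_tf for term, count in tf.items()}
print(weights)  # {'oil': 1.0, 'well': 0.33..., 'mexico': 0.33..., 'drill': 0.33...}
```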


Bayesian approach

  • Based on evidential reasoning (drawing conclusions from evidence).
  • It could be applied as part of index term weighting, but it is usually applied as part of the retrieval process, by calculating the relationship between an item and a specific query.
  • Graphical representation: each node represents a random variable; an arc between nodes represents a probabilistic dependency between the node and its parents.
  • Two-level Bayesian network:
  • “c” represents the concepts in a query

• “f” represents the concepts in an item


Another approach is natural language processing.

• DR-LINK (document retrieval through linguistic knowledge)

• Indexing by concept:

• Concept indexing determines a canonical set of concepts based upon a test set of terms, and uses them as the basis for indexing all items; one such technique is called latent semantic indexing.

• Ex.: the MatchPlus system developed by HNC Inc.

• It uses neural networks; the strength of the system lies in word relationships (synonyms), and it uses this information in generating context vectors.

• Two neural networks are used: one to generate stem context vectors and another to perform the query.

• Context vectors are interpreted in the same way as term weights.

• Multimedia indexing:

• Indexing video or images can be accomplished at the raw data level.

• Positional and temporal (time) searches can then be done.


INFORMATION EXTRACTION

  • There are two processes associated with information extraction:
  • determination of facts to go into structured fields in a database and
  • extraction of text that can be used to summarize an item.
  • The process of extracting facts to go into indexes is called Automatic File Build.
  • In establishing metrics to compare information extraction, precision and recall are applied with slight modifications.


DATA STRUCTURES

  • Introduction to Data Structures
  • Stemming Algorithms
  • Inverted File Structure
  • N-Gram Data Structure
  • PAT Data Structure
  • Signature File Structure
  • Hypertext and XML Data Structures


Data structures

  • 1. One structure, called the document manager, stores and manages received items in their normalized form.
  • 2. The other data structure contains processing tokens and associated data to support search.


(Figure: major data structures. Item normalization feeds document file creation; the document manager maintains the original document file, while a separate processing-token data structure supports the search function.)

Stemming: a transformation often applied to data before placing it in the searchable data structure.

Stemming reduces a concept (word) to a canonical (authorized, recognized, accepted) morphological (relating to the patterns of word formation in a particular language) representation.

Risk with stemming: concept discrimination information may be lost in the process, causing a decrease in performance (precision).

Advantage: it has the potential to increase recall.


PORTER STEMMING ALGORITHM

  • The algorithm is based on a set of conditions on the stem.
  • A consonant in a word is a letter other than A, E, I, O or U. Some important stem conditions are:
  • 1. The measure m of a stem is a function of sequences of vowels (V) followed by sequences of consonants (C): a word has the form [C](VC)^m[V], where m is the number of VC repeats. The case m = 0 covers the null word (see the sketch after this list).
  • 2. *<X> - the stem ends with the letter X.
  • 3. *v* - the stem contains a vowel.
  • 4. *d - the stem ends in a double consonant (e.g. -TT, -SS).
  • 5. *o - the stem ends in a consonant-vowel-consonant sequence, where the final consonant is not w, x or y (e.g. -WIL, -HOP).
  • Suffix conditions take the form: current_suffix == pattern. Actions take the form: old_suffix -> new_suffix.
  • Rules are divided into steps to define the order in which they are applied.
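A minimal sketch of computing the measure m; this simplified version treats a, e, i, o, u as the vowels and ignores the full algorithm's special handling of y.

```python
import re

def measure(stem: str) -> int:
    # Collapse consonant runs to 'C' and vowel runs to 'V',
    # then count VC pairs: the stem has the form [C](VC)^m[V].
    collapsed = re.sub(r"[aeiou]+", "V", re.sub(r"[^aeiou]+", "C", stem.lower()))
    return collapsed.count("VC")

print(measure("tr"), measure("tree"), measure("trouble"), measure("oaten"))
# 0 0 1 2
```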


Successor stemmers

  • Based on the lengths of prefixes and their successor varieties (the successor variety of a prefix is the number of distinct letters that can follow it in the corpus vocabulary).
  • The underlying linguistic unit is the smallest unit of speech that distinguishes one word from another.
  • The process uses the successor varieties for a word to divide the word into segments, and selects one of the segments as the stem.


INVERTED FILE STRUCTURE

  • The most common data structure.
  • Inverted file structures are composed of three files:
  • 1. The document file
  • 2. The inversion list (posting list)
  • 3. The dictionary
  • The inverted file is based on the methodology of storing an inversion of the documents: for each word, a list of the documents in which the word is found is stored (the inversion of the documents).
  • Each document is given a unique numerical identifier, which is stored in the inversion list. The dictionary, a sorted list of the processing tokens in the system with a pointer to the location of each token's inversion list, is used to locate the inversion list for a particular word (as sketched below).
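A minimal sketch of the three components; the three toy documents are invented.

```python
docs = {
    1: "oil wells in mexico",        # unique numerical identifiers
    2: "oil refineries in peru",
    3: "lakes in mexico",
}

inverted = {}  # dictionary: token -> inversion (posting) list of document ids
for doc_id, text in docs.items():
    for token in text.split():
        inverted.setdefault(token, []).append(doc_id)

print(inverted["oil"], inverted["mexico"])  # [1, 2] [1, 3]
```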


N-GRAM DATA STRUCTURE

  • N-Grams can be viewed as a special technique for conflation (stemming) and as a unique data structure in information systems.
  • N-Grams are a fixed length consecutive series of “n” characters.
  • Unlike stemming that generally tries to determine the stem of a word that represents the semantic meaning of the word, n-grams do not care about semantics.
  • The input text is transformed into overlapping n-grams, which are then used to create the searchable database.
  • Examples of bigrams, trigrams and pentagrams for the word phrase “sea colony”:
  •   se ea co ol lo on ny (bigrams, no interword symbols)
  •   sea col olo lon ony (trigrams, no interword symbols)
  •   #se sea ea# #co col olo lon ony ny# (trigrams, with interword symbol #)
  •   #sea# #colo colon olony lony# (pentagrams, with interword symbol #)
  • The symbol # is used to represent the interword symbol, which is any one of a set of symbols (e.g., blank, period, semicolon, colon, etc.). A generation sketch follows.
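A minimal sketch of n-gram generation with an interword symbol, reproducing the trigram example above:

```python
def ngrams(text: str, n: int, interword: str = "#") -> list:
    grams = []
    for word in text.split():
        padded = interword + word + interword   # mark word boundaries
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

print(ngrams("sea colony", 3))
# ['#se', 'sea', 'ea#', '#co', 'col', 'olo', 'lon', 'ony', 'ny#']
```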


PAT data structure

  • PAT: practical algorithm to retrieve information coded in alphanumeric.
  • A PAT structure (PAT tree or PAT array) is a continuous-text input data structure (string-oriented, like the N-gram data structure).
  • The input stream is transformed into a searchable data structure consisting of substrings, where all substrings are unique.
  • Each position in the input string is an anchor point for a substring.

In the creation of PAT trees, each position in the input string is the anchor point for a substring that starts at that point and includes all new text up to the end of the input.

  • A PAT tree is a binary tree, the most common class of structure for prefix search; but PAT trees are sorted logically, which facilitates range searches, and they can be more accurate than inverted files.
  • PAT trees provide an alternate structure for supporting string searches (see the sketch below).
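A minimal sketch of the anchored substrings underlying a PAT structure; a PAT array is, conceptually, the anchor positions sorted by the substrings they start.

```python
text = "sea colony"

# Every position anchors a substring running to the end of the input.
pat_array = sorted(range(len(text)), key=lambda i: text[i:])
for i in pat_array[:3]:
    print(i, repr(text[i:]))   # positions in sorted substring order
```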


Signature file structure

  • The coding is based upon the words in the item.
  • The words are mapped into word signatures.
  • A word signature is a fixed-length code with a fixed number of bits set to 1.
  • The bit positions that are set to one are determined via a hash function of the word.
  • The word signatures are OR’ed together to create the signature of an item.
  • Words are partitioned into blocks (a block is simply a set of words); the code length here is 16 bits.
  • Search is accomplished by template matching on the bit positions.
  • Signature files provide a practical solution, applied in parallel processing, distributed environments, etc.

  • To avoid signatures being too dense with “1”s, a maximum number of words is specified and an item is partitioned into blocks of that size.
  • In the example below, the block size is set at five words, the code length is 16 bits, and the number of bits that are allowed to be “1” for each word is five.
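A minimal sketch of block signatures; the 16-bit code length and five bits per word follow the example above, while the hash scheme itself is an invented stand-in.

```python
import hashlib

CODE_LENGTH, BITS_PER_WORD = 16, 5

def word_signature(word: str) -> int:
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.md5(f"{word}:{i}".encode()).digest()
        sig |= 1 << (digest[0] % CODE_LENGTH)   # set up to 5 bit positions
    return sig

block = ["oil", "wells", "in", "mexico", "citgo"]    # one 5-word block
block_signature = 0
for word in block:
    block_signature |= word_signature(word)          # OR the word signatures

query = word_signature("oil")
print((block_signature & query) == query)  # True => block may contain "oil"
```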


HYPERTEXT AND XML DATA STRUCTURES

  • The advent of the Internet and its exponential growth and wide acceptance as a new global information network has introduced new mechanisms for representing information.
  • This structure is called hypertext and differs from traditional information storage data structures in format and use.
  • Hypertext is stored in HTML and XML formats.
  • Both of these languages provide detailed descriptions for subsets of text, similar to the zoning discussed earlier.
  • Hypertext allows one item to reference another item via an embedded pointer.
  • HTML defines the internal structure for information exchange over the WWW on the Internet.
  • XML: defined by DTD, DOM, XSL, etc.


AUTOMATIC INDEXING

  • Automatic indexing is the process of analyzing an item to extract the information to be permanently kept in an index.
  • This process is associated with the generation of the searchable data structures associated with an item.
  • The left-hand side of the referenced figure (identifying processing tokens, applying the stop-list algorithm, characterizing tokens, applying stemming, and creating the searchable data structure) is part of the indexing process.
  • (Figure: data flow in an information processing system.)


Statistical approach: stores a single statistic, such as how often each word occurs in an item; used for generating relevance scores after a Boolean search.

Probabilistic indexing: stores the information that is used in calculating a probability that a particular item satisfies (i.e., is relevant to) a particular query.

Bayesian and vector space: store the information that is used in generating a relative confidence level of an item's relevance to a query.

Neural networks: dynamic learning structures that come under concept indexing; they determine concept classes.

Natural language approaches perform the same processing-token identification as the statistical techniques, with an additional level of parsing of the item (present, past, future actions) that enhances search precision.

Concept indexing uses the words within an item to correlate to the concepts discussed in the item.

When generating the concept classes automatically, there may not be a name applicable to the concept, but just a statistical significance.

Finally, a special class of indexing can be defined by the creation of hypertext linkages.

These linkages provide virtual threads of concepts between items, versus directly defining the concepts within an item.

Each technique has its own strengths and weaknesses.


Probabilistic weighting

  • Probabilistic systems attempt to calculate a probability value that should be invariant to both the calculation method and the text corpora (large collections of written/spoken texts).
  • The probabilistic approach is based on the direct application of the theory of probability to information retrieval systems.
  • Advantage: it uses probability theory to develop the algorithm.
  • This allows easy integration of the final results when searches are performed across multiple databases that use different search algorithms.
  • The use of probability theory is a natural choice: it is the basis of evidential reasoning (drawing conclusions from evidence).
  • This is summarized by the PRP (probability ranking principle) and its plausible corollary (reasonable result).


Natural Language Processing

  • Natural language processing not only produces more accurate term phrases, but can provide higher-level semantic information identifying relationships between concepts.
  • The system adds the functional processes Relationship Concept Detectors, Conceptual Graph Generators and Conceptual Graph Matchers, which generate higher-level linguistic relationships, including semantic and discourse-level relationships.
  • During the first phase of this approach, the processing tokens in the document are mapped to Subject Codes.
  • These codes equate to index term assignment and have some similarities to the concept-based systems.
  • The next phase is called the Text Structurer, which attempts to identify general discourse-level areas within an item.
  • The next level of semantic processing is the assignment of terms to components, classifying the intent of the terms in the text and identifying the topical statements.
  • The next level of natural language processing identifies the interrelationships between the concepts.


CONCEPT INDEXING

  • Natural language processing starts with a basis of the terms within an item and extends the information kept on an item to phrases and higher-level concepts, such as the relationships between concepts.
  • Concept indexing takes the abstraction a level further.
  • Its goal is to use concepts instead of terms as the basis for the index, producing a reduced-dimension vector space.
  • Concept indexing can start with a number of unlabeled concept classes and let the information in the items define the concept classes created.
  • A term such as “automobile” could be associated with concepts such as “vehicle,” “transportation,” “mechanical device,” “fuel,” and “environment.”

The term “automobile” is strongly related to “vehicle,” less strongly to “transportation,” and much less strongly to the other terms.

  • Thus a term in an item needs to be represented by many concept codes, each with a different weight, for a particular item (a small sketch follows).
  • The basis behind the generation of the concept approach is a neural network model.
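A minimal sketch of representing an item by weighted concept codes; the term-to-concept weights are invented for illustration.

```python
# Hypothetical weighted mapping of terms to concept codes.
term_to_concepts = {
    "automobile": {"vehicle": 0.9, "transportation": 0.6, "fuel": 0.2},
    "gasoline":   {"fuel": 0.9, "environment": 0.4},
}

item_terms = ["automobile", "gasoline"]
concept_vector = {}
for term in item_terms:
    for concept, weight in term_to_concepts.get(term, {}).items():
        # "fuel" accumulates weight from both terms (0.2 + 0.9).
        concept_vector[concept] = concept_vector.get(concept, 0.0) + weight

print(concept_vector)
```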


DOCUMENT AND TERM CLUSTERING

Cluster analysis is a technique for multivariate analysis that assigns items to automatically created groups based on a calculation of the degree of association between items and groups. In the information retrieval (IR) field, cluster analysis has been used to create groups of documents with the goal of improving the efficiency and effectiveness of retrieval, or to determine the structure of the literature of a field. The terms in a document collection can also be clustered to show their relationships. The two main types of cluster analysis methods are the non-hierarchical, which divide a data set of N items into M clusters, and the hierarchical, which produce a nested data set in which pairs of items or clusters are successively linked. Non-hierarchical methods such as the single-pass and reallocation methods are heuristic in nature and require less computation than the hierarchical methods. However, the hierarchical methods have usually been preferred for cluster-based document retrieval. The commonly used hierarchical methods, such as single link, complete link, group average link, and Ward's method, have high space and time requirements, which matters when clustering the large, high-dimensionality data sets typically found in IR applications.


CLUSTER ANALYSIS

  • Cluster analysis is a statistical technique used to generate a category structure which fits a set of observations.
  • The groups which are formed should have a high degree of association between members of the same group and a low degree between members of different groups (Anderberg 1973).
  • While cluster analysis is sometimes referred to as automatic classification, this is not strictly accurate since the classes formed are not known prior to processing, as classification implies, but are defined by the items assigned to them.


Applications in Information Retrieval

  • Documents may be clustered based on co-occurring citations in order to provide insights into the nature of the literature of a field (e.g., Small and Sweeney [1985]).
  • Terms may be clustered on the basis of the documents in which they co-occur, in order to aid in the construction of a thesaurus or in the enhancement of queries (e.g., Crouch [1988]).
  • Although cluster analysis can be easily implemented with available software packages, it is not without problems. These include:
  • Selecting the attributes on which items are to be clustered, and their representation.
  • Selecting an appropriate clustering method and similarity measure from those available.
  • Creating the clusters or cluster hierarchies, which can be expensive in terms of computational resources.
  • Assessing the validity of the result obtained.
  • If the collection to be clustered is a dynamic one, the requirements for update must be considered (a single-pass sketch follows this list).
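A minimal sketch of the single-pass (leader-style) method mentioned above: each item joins the first cluster whose leader is similar enough, otherwise it starts a new cluster. The vectors and the threshold are invented.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

items = [(1.0, 0.0, 1.0), (1.0, 0.0, 0.8), (0.0, 1.0, 0.0), (0.0, 0.9, 0.1)]
threshold = 0.8
clusters = []  # each cluster: {'leader': vector, 'members': [vectors]}

for item in items:
    for cluster in clusters:
        if cosine(item, cluster["leader"]) >= threshold:
            cluster["members"].append(item)
            break
    else:
        clusters.append({"leader": item, "members": [item]})

print(len(clusters))  # 2
```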


Search Statements & Binding

  • Search statements are statements of an information need, generated by users to specify the concepts they are trying to locate in items.
  • They use traditional Boolean logic and/or natural language.
  • In generating the search statement, the user may have the ability to weight, or assign an importance to, different concepts in the statement.
  • At this point the binding is to the vocabulary and past experiences of the user.
  • The search statement is the user's attempt to specify the conditions needed to logically subset the total item space to that cluster of items that contains the information needed by the user.

The next level of binding comes when the search statement is parsed for use by a specific search system.

  • The search system translates the query to its own metalanguage.
  • This process is similar to the item indexing process.


The length of the search statement directly affects the ability of Information Retrieval Systems to find relevant items.

The longer the search query, the easier it is for the system to find relevant items.


Similarity Measures and Ranking

Searching is concerned with calculating the similarity between a user's search statement and the items in the database.

 Restricting the similarity measure to passages gains significant precision, with some impact on recall.

 Once items are identified as possibly relevant to the user's query, it is best to present the most likely relevant items first. Ranking produces a scalar number that represents how similar an item is to the query.

 A variety of different similarity measures can be used to calculate the similarity between an item and the search statement.

 A characteristic of a similarity formula is that the result of the formula increases as the items become more similar.

 The value is zero if the items are totally dissimilar.

 SIM(Item_i, Item_j) = Σ_k (Term_i,k × Term_j,k)

 This formula uses the summation of the products of the corresponding terms of two items, treating the index as a vector.

 The problem with such simple measures is the normalization needed to account for variances in the length of items.

A similarity formula of this type, with normalization, was used by Salton in the SMART system.


To determine the weight an item has with respect to the search statement, the cosine formula is used to calculate the distance between the vector for the item and the vector for the query.

The Dice measure simplifies the denominator of the Jaccard formula and introduces a factor of 2 in the numerator. The normalization in the Dice formula is also invariant to the number of terms in common (both measures are sketched below).
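A minimal sketch of both measures on weighted vectors; the vectors are invented (the document vector matches DOC1 in the threshold example that follows).

```python
import math

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

def dice(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    return 2.0 * dot / (sum(a * a for a in q) + sum(b * b for b in d))

query = [0, 1, 0, 1, 0, 1, 0, 0]   # geography, mexico, oil
doc1  = [0, 1, 0, 2, 0, 3, 1, 0]
print(round(cosine(query, doc1), 3), round(dice(query, doc1), 3))
```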


The use of a similarity algorithm alone returns the complete database as search results: many of the items have a similarity close or equal to zero. The threshold defines which items appear in the resultant hit file for the query.

Query Threshold Process:

Vector: American, geography, Lakes, Mexico, Painter, oil, Reserve, Subject

DOC1: "geography of Mexico suggests oil reserves are available"; Vector (0, 1, 0, 2, 0, 3, 1, 0)

DOC2: "American geography has lakes available everywhere"; Vector (1, 3, 0, 0, 0, 0, 0, 0)

DOC3: "Painters suggest Mexico lakes as subjects"; Vector (0, 0, 1, 3, 3, 0, 0, 2)

The vector values may be based on the existence of an occurrence (binary) or on the frequency of occurrence (numerical).


Relevance Feedback

  • The thesaurus and semantic networks provide utility in generally expanding a user's search statement to include potentially related search terms.
  • But this still does not correlate to the vocabulary used by the authors who contribute to a particular database.
  • There is also a significant risk that the thesaurus does not include the latest words being used.
  • In an interactive system, users can manually modify an inefficient query, and the system can automatically expand the query via the thesaurus.
  • The relevant items are used to reweight the existing query terms and possibly expand the user's search statement with new terms.
  • The relevance feedback concept is that the new query should be based on the old query, modified to increase the weight of terms in relevant items and decrease the weight of terms that are in non-relevant items (a small sketch follows).
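A minimal sketch in the spirit of Rocchio-style feedback: the new query adds a weighted centroid of relevant item vectors and subtracts a weighted centroid of non-relevant ones. The coefficients and vectors are invented.

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def feedback(query, relevant, non_relevant, a=1.0, b=0.75, c=0.25):
    r, nr = centroid(relevant), centroid(non_relevant)
    # Raise weights of terms in relevant items, lower those in non-relevant.
    return [max(0.0, a * q + b * ri - c * ni)
            for q, ri, ni in zip(query, r, nr)]

query = [1.0, 0.0, 0.5, 0.0]
relevant = [[1.0, 1.0, 0.0, 0.0], [0.8, 0.6, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0, 1.0]]
print(feedback(query, relevant, non_relevant))
# approximately [1.675, 0.6, 0.25, 0.0]
```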


Selective Dissemination of Information Search

  • Selective Dissemination of Information systems, frequently called dissemination systems, are becoming more prevalent with the growth of the Internet.
  •  A dissemination system is sometimes described as a push system, in contrast with the pull model of search.
  •  Push: new items are sent to the user automatically, based on a stored search statement. Pull: the user actively extracts information by issuing a search statement against an existing database.
  •  In a dissemination system, the user defines a profile, and as new information is added to the system it is automatically compared to the user's profile.
  •  If it is considered a match, it is asynchronously sent to the user's mail file.
  •  The difference between the two functions lies in the dynamic nature of the profiling process, the size and diversity of the search statements, and the number of simultaneous searches per item.
  •  In the search system, an existing database exists.


Weighted Searches of Boolean Systems:

The two major approaches to generating queries are Boolean and natural language. Natural language queries are easily represented within statistical models and are usable by the similarity measures discussed.

 Issues arise when Boolean queries are associated with weighted index systems.

 Some of the issues are associated with how the logic operators (AND, OR, NOT) function with weighted values, and how weights are associated with the query terms.

 If the operators are interpreted in their normal sense, they act too restrictively or too generally (the AND and OR operators, respectively).

 Closely related to the strict-definition problem is the lack of ranking in a pure Boolean process.

Some of the early work addressing this problem recognized the fuzziness associated with mixing Boolean and weighted systems (Bookstein-78, Bookstein-80).

 To integrate the Boolean and weighted-system models, fuzzy sets are used (fuzzy sets are sets whose elements have degrees of membership).


(Figure: the similarity measure calculated when considering all item weights, versus the maximum/minimum approach.)


Searching the Internet & Hypertext

  • The Internet has multiple different mechanisms that are the basis for searching of items. The primary techniques are associated with servers on the Internet that create indexes of items on the Internet and allow search of them. Some of the most commonly used nodes are YAHOO, AltaVista and Lycos. In all of these systems there are active processes that visit a large number of Internet sites and retrieve textual data which they index.
  • The primary design decisions are on the level to which they retrieve data and their general philosophy on user access. Lycos (http://www.lycos.com) and AltaVista (http://www.altavista.digital.com) automatically go out to other Internet sites and return the text at the sites for automatic indexing. Lycos returns home pages from each site for automatic indexing, while AltaVista indexes all of the text at a site.


Introduction to Information Visualization:

  • Visualization is the transformation of information into a visual form, which enables the user to observe and understand the information.
  • The functions that are available with electronic display and visualization of data include:

  • Modifying representations of data and information, or the display conditions (e.g., changing color scales)
  • Using the same representation while showing changes in the data
  • Animating the display to show changes in space and time
  • Creating hyperlinks under user control to establish relationships between data

  • Information visualization addresses how the results of a search may be optimally displayed to users, to facilitate their understanding of what the search has provided and their selection of the items most likely to be of interest to read.


Link Visualization:

It displays relationships among items.

Attribute Visualization:

It reveals content relationships across large numbers of items.

Cognition & Perception

Cognition: processing and storing the information received (the action or process of acquiring knowledge and understanding through thought, experience and the senses).

Perception: receiving information (the ability to see, hear, or become aware of something through the senses).

Other Measures:

Proximity: nearby figures are grouped together.

Similarity: similar figures are grouped together.

Continuity: figures are interpreted as smooth continuous patterns rather than discontinuous concatenations of shapes (e.g., a circle with its diameter drawn is perceived as two continuous shapes, a circle and a line, versus two half-circles concatenated together).


Information Visualization Technologies

  • Visualization can include additional terms, drawn from relevance feedback or thesaurus expansion, along with the documents retrieved, and can indicate the importance of the terms to the retrieval and ranking process. Various information visualization technologies are as follows:
  • Tree Structure
  • Cone-Tree
  • Perspective Wall
  • Tree Maps

Envision System

  • Document Content Analysis & Retrieval System (DCARS)
  • City Space


Tree Structure:

CONE-TREE:

The Cone-Tree is a 3-dimensional representation of data, where one node of the tree is represented at the apex and all information subordinate to it is arranged in a circular structure at its base; any child node may also be the parent of another cone.

Perspective Wall:-


Tree Maps:-

Envision System:-

CITY SCAPE


Introduction to Text Search Techniques

The basic concept of a text scanning system is the ability for one or more users to enter queries; the text of the items to be searched is accessed sequentially and compared to the query terms.

When all of the text has been accessed, the query is complete.

One advantage of this type of architecture is that as soon as an item is identified as satisfying a query, the results can be presented to the user for retrieval.

The database contains the full text of the items.

 The term detector is the special hardware/software that contains all of the search terms and, in some systems, the logic between the terms.

 It outputs the detected terms to the query resolver.


The query resolver performs two major functions: accepting search statements from the users, and extracting the logic and search terms to pass to the detector.

 It also accepts results from the detector and determines which queries are satisfied by the item, and possibly the relevance weight associated with the hit.

 The query resolver passes information to the user interface, allowing it to continually update the search status and, on request, retrieve any items that satisfy the user's search statement.

Software Text Search Algorithms:

In software streaming techniques, the items to be searched are read into memory and then the algorithm is applied.

There are three Major Algorithms Associated with Software Text Search

 Brute Force Approach

 Boyer-Moore

 Knuth-Morris-Pratt

Boyer Moore Algorithm:

 It is a string search algorithm.

 It is significantly enhanced because the comparison process starts at the end of the search pattern, processing right to left, versus starting at the beginning of the search pattern.

 The advantage is that large jumps through the input stream are possible when a mismatched character in the input stream does not occur in the search pattern, which happens frequently (the pattern can move many positions at a time rather than one).

 The original Boyer-Moore algorithm became the basis for additional text search techniques.

 It was originally designed to support scanning for a single search string.

 It was expanded to handle multiple search strings in a single pass.

 Enhanced and simplified versions of the Boyer-Moore algorithm have been developed (one simplification is sketched below).
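A minimal sketch of the Boyer-Moore idea using the Horspool simplification (one of the simplified versions mentioned above): compare right to left and jump by a bad-character shift on mismatch.

```python
def bm_search(text: str, pattern: str) -> int:
    m = len(pattern)
    # Shift distance keyed by character, built from all but the last position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    i = m - 1
    while i < len(text):
        j, k = m - 1, i
        while j >= 0 and text[k] == pattern[j]:   # compare right to left
            j -= 1
            k -= 1
        if j < 0:
            return k + 1                          # match starts at k + 1
        i += shift.get(text[i], m)                # bad-character jump
    return -1

print(bm_search("the colony by the sea", "colony"))  # 4
```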

Knuth-Morris-Pratt Algorithm

It is a text search algorithm.

 Even in the worst case it performs well.

 Unlike the previous algorithms, it never backs up in the input stream, so its performance does not degrade with the structure of the input string.

 It also works well on long strings.

 The basic concept behind the algorithm is that whenever a mismatch is detected, the previously matched characters define the number of characters that can be skipped in the input stream prior to starting the comparison again.

 Position: 1 2 3 4 5 6 7 8

 Input stream: a b d a d e f g

 The search pattern implied by the example is "abdf".

 When the mismatch occurs at position 4, with an "f" in the pattern and an "a" in the input stream, the algorithm allows the comparison to jump the three positions associated with the recognized "abd".

 Since the mismatched character at that position could itself be the beginning of the search string, four positions cannot be skipped.
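A minimal sketch of the algorithm; the failure table records, for each pattern prefix, the length of the longest proper prefix that is also a suffix, so the scan never re-reads input characters.

```python
def kmp_search(text: str, pattern: str) -> int:
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):      # build the failure table
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    k = 0
    for i, ch in enumerate(text):         # scan the input stream once
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1              # 0-based start of the match
    return -1

print(kmp_search("abdadefg", "abdf"))  # -1 (the mismatch example above)
print(kmp_search("abdadefg", "ade"))   # 3
```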