INFORMATION RETRIEVAL SYSTEM
INTRODUCTION
Definition:
Overhead from a user’s perspective is the time required to find the information needed, excluding the time for actually reading the relevant data. Thus search composition, search execution, and reading non-relevant items are all
aspects of information retrieval overhead.
Objectives
The general objective of an Information Retrieval System is to minimize the overhead of a user locating needed information. Overhead can be expressed as
the time a user spends in all of the steps leading to reading an item containing the needed information. For example:
Query Generation
Query Execution
Scanning results of query to select items to read
Reading non-relevant items
etc.
→ The two major measures commonly associated with information systems are precision and recall.

Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
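These two measures can be computed directly from the set of retrieved items and the set of relevant items. A minimal sketch, where the item identifiers are hypothetical examples:

```python
# Precision and recall over retrieved and relevant item sets.

def precision(retrieved, relevant):
    # Fraction of retrieved items that are relevant.
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    # Fraction of all relevant items that were retrieved.
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4"]   # items returned by the query
relevant  = ["d1", "d3", "d5"]         # items actually relevant

print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```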
Functional Overview
Item Normalization: The first step in any integrated system is to normalize the incoming items to a standard format. In addition to translating multiple external formats that might be received into a single consistent data structure that can be manipulated by the functional processes, item normalization provides logical restructuring of the item
Relationship to DBMS, Digital Libraries and Data Warehouses:
→ An Information Retrieval System is software that has the features and functions required to manipulate “information” items, whereas a DBMS is optimized to handle “structured” data. → All three systems (IRS, Digital Libraries & Data Warehouses) are repositories of information, and their primary goal is to satisfy user information needs.
Digital Libraries value original sources, whereas these are of less concern to an IRS.
→ With direct electronic access available to users the social aspects of congregating in a library and learning from librarians, friends and colleagues will be lost and new electronic collaboration equivalencies will come into existence.
→ With the proliferation of information available in electronic form, the role of search intermediary
will shift from an expert in search to being an expert in source analysis.
→ Information storage and retrieval technology has addressed a small subset of the issues associated with digital
libraries.
→ The term Data Warehouse comes more from the commercial sector than academic sources.
→ Its goal is to provide to the decision makers the critical information to answer future direction questions.
→ A data warehouse consists of several components, including an input function that captures data and moves it into the warehouse.
A data warehouse is more focused on structured data & decision support technologies.
→ Data mining is a search process that automatically analyzes data and extracts relationships and dependencies that were not part of the database design.
Information Retrieval System Capabilities
Search:
→ Major functions that are available in an Information Retrieval System
The objective of the search capability is to allow for a mapping between a user’s specified need and the items in the information database that will answer that need. Depending on the algorithms used in a system, many different functions are associated with the system’s understanding of the search statement. The functions define the relationships between the terms in the search statement.
Boolean Logic: Allows a user to logically relate multiple Concepts together to define what information is needed
Continuous Word Phrases (CWP): two or more words that are treated as a single semantic unit.
Fuzzy Searches: provide the capability to locate spellings of words that are similar to the entered search term. Primarily used to compensate for errors in the spelling of words. Fuzzy searching increases recall at the expense of decreasing precision.
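One common way to sketch a fuzzy search is to rank the system’s dictionary words by edit (Levenshtein) distance from the entered term; the vocabulary below is a hypothetical example:

```python
# Fuzzy search sketch: find dictionary words whose spelling is closest
# to a (possibly misspelled) query term, using edit distance.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

vocab = ["retrieval", "retriever", "reversal", "removal"]
query = "retreival"  # misspelled
matches = sorted(vocab, key=lambda w: edit_distance(query, w))
print(matches[0])  # closest spelling: "retrieval"
```

Returning near matches in addition to exact ones is what raises recall while lowering precision.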
Term Masking: The ability to expand a query term by masking a portion of the term and accepting as valid any processing token that maps to the unmasked portions of the term
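Term masking can be sketched with the standard-library `fnmatch` module, where `*` stands for the masked portion of the term; the token dictionary is a hypothetical example:

```python
# Term masking sketch: expand a masked query term against the system's
# dictionary of processing tokens. '*' masks any run of characters.
from fnmatch import fnmatch

dictionary = ["compute", "computer", "computing", "computation", "commuter"]

def expand(masked_term, tokens):
    # Keep every token that matches the unmasked portions of the term.
    return [t for t in tokens if fnmatch(t, masked_term)]

print(expand("comput*", dictionary))   # suffix (right-truncation) masking
print(expand("*muter", dictionary))    # prefix masking
```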
Numeric and Date Ranges: Allows for specialized numeric or date range processing against those words.
Concept/Thesaurus Expansion: A thesaurus is typically a one-level or two-level expansion of a term to other terms that are similar in meaning, whereas a concept class is a tree structure that expands each meaning of a word into potential concepts that are related to the initial term.
Natural Language Queries : Allow a user to enter a prose statement that describes the information that the user wants to find. The longer the prose, the more accurate the results returned
→ Multimedia Queries: The user interface becomes far more complex with the introduction of the availability of multimedia items.
→ All of the previous discussions still apply for search of the textual portions of a multimedia database. But in addition, the user has to be able to specify search terms for the other modalities
Browse:
→ Once the search is complete, browse capabilities let the user determine which items are of interest and select those to be displayed.
→ There are two ways of displaying a summary of the items that are associated with a query: Line Item Status and Data Visualization.
Different browsing capabilities are:
Multimedia: Once a list of potential items that satisfy the query is discovered, the techniques for displaying them when they are multimedia introduce new challenges.
To display more aggregate data, textual interfaces sometimes allow for clustering of the hits and then use a graphical display to show a higher-level view of the information.
Neither of these techniques lends itself to multimodal information. The textual aspect of the multimedia can be used to apply all of the techniques described.
CATALOGING AND INDEXING
INDEXING:
History of indexing: reflects the dependency of information processing capabilities on manual and then automatic processing systems.
• Indexing, originally called cataloguing, is the oldest technique to identify the contents of items to assist in retrieval.
• Items overlap between full item indexing, public indexing and private indexing of files.
1. Decide the scope of indexing and the level of detail to be provided, based on the usage scenarios of the users.
2. The second decision is whether to link index terms together in a single index for a particular concept.
TEXT PROCESSING
Text processing phases
Lemmatization is a process that typically uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word, thereby collapsing its inflectional forms (see, e.g., [278]). For example, our sentence would result in “Search engine are the most visible information retrieval application” when lemmatized according to a WordNet-based lemmatizer
Weighting. The final phase of text preprocessing deals with term weighting. As previously mentioned, words in a text have different descriptive power; hence, index terms can be weighted differently to account for their significance within a document and/or a document collection. Such a weighting can be binary, e.g., assigning 0 for term absence and 1 for presence.
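Binary weighting as just described can be sketched directly; the vocabulary and document text are hypothetical examples:

```python
# Binary term weighting: a term's weight is 1 if it occurs in the
# document, 0 otherwise.

def binary_weights(vocabulary, document_tokens):
    present = set(document_tokens)
    return [1 if term in present else 0 for term in vocabulary]

vocab = ["search", "engine", "index", "query"]
doc = "the search engine builds a search index".split()
print(binary_weights(vocab, doc))  # [1, 1, 1, 0]
```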
SCOPE OF INDEXING
PREORDINATION AND LINKAGES
AUTOMATIC INDEXING
Bayesian approach
• “f” representing concepts in an item
Another approach is natural language processing.
• DR-LINK( document retrieval through linguistics knowledge )
• Indexing by concept
• Concept indexing determines a canonical set of concepts based upon a test set of terms and uses them as a basis for indexing all items. This is also called latent semantic indexing.
• Ex: the MatchPlus system developed by HNC Inc.
• It uses the strengths of neural networks to capture word relationships (synonyms) and uses that information in generating context vectors.
• Two neural networks are used: one to generate stem context vectors and another to perform the query.
• Interpretation is the same as for the weights.
• Multimedia indexing:
• Indexing video or images can be accomplished at the raw data level.
• Positional and temporal (time) search can be done.
INFORMATION EXTRACTION
DATA STRUCTURES
Data structure
The other data structure contains processing tokens and associated data to support search.
Item normalization
Document File creation
Document manager
Document Search manager
Original document file
Major data structure
Processing token data structure that supports the search function
Stemming: a transformation often applied to data before placing it in the searchable data structure.
Stemming reduces a concept (word) to a canonical (authorized; recognized; accepted) morphological (the patterns of word formation in a particular language) representation.
Risk with stemming: concept discrimination information may be lost in the process, causing a decrease in performance.
Advantage: it has the potential to increase recall.
PORTER STEMMING ALGORITHM
*o – the stem ends in a consonant-vowel-consonant sequence where the final consonant is not w, x or y (e.g. -WIL, -HOP).
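A greatly simplified suffix-stripping sketch in the spirit of the Porter algorithm (the real algorithm applies its rules in ordered phases guarded by measure and *o-style conditions; the suffix list here is only an illustrative subset):

```python
# Simplified suffix-stripping stemmer sketch (not the full Porter
# algorithm): strip the first matching suffix, keeping a minimum stem.

SUFFIXES = ["ational", "ization", "ing", "tion", "ness", "es", "ed", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

for w in ["indexing", "indexes", "indexed", "index"]:
    print(w, "->", stem(w))  # all collapse to the stem "index"
```

Collapsing these variants to one processing token is what gives stemming its recall advantage.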
Successor stemmers
INVERTED FILE STRUCTURE
An inverted file structure contains a sorted list of the processing tokens in the system, with each token having a pointer to the location of its inversion list.
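A minimal inverted file can be sketched as a dictionary mapping each processing token to its inversion (posting) list of item identifiers; the documents are hypothetical examples:

```python
# Minimal inverted file: token -> list of item identifiers.
from collections import defaultdict

docs = {
    1: "information retrieval system overview",
    2: "database system design",
    3: "retrieval of information from a database",
}

inverted = defaultdict(list)
for doc_id, text in docs.items():
    for token in text.split():
        if doc_id not in inverted[token]:
            inverted[token].append(doc_id)

print(sorted(inverted["retrieval"]))   # [1, 3]
print(sorted(inverted["database"]))    # [2, 3]
```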
N-GRAM DATA STRUCTURE
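Decomposing a processing token into its overlapping fixed-length character n-grams can be sketched as:

```python
# N-gram decomposition of a token: overlapping substrings of length n
# (here trigrams, n = 3).

def ngrams(token, n=3):
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(ngrams("colony"))  # ['col', 'olo', 'lon', 'ony']
```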
PAT data structure
In creation of PAT trees each position in the input string is the anchor point for a sub-string that starts at that point and includes all new text up to the end of the input.
Signature file structure
HYPERTEXT AND XML DATA STRUCTURES
AUTOMATIC INDEXING
Statistical approach – stores a single statistic, such as how often each word occurs in an item
– used for generating relevance scores after a Boolean search.
Probabilistic indexing – stores the information used in calculating a probability that a particular item satisfies (i.e., is relevant to) a particular query.
Bayesian and vector space – store the information used in generating a relative confidence level of an item’s relevancy to a query.
Neural networks –dynamic learning structures– comes under concept indexing – that determine concept class.
The natural language approach performs the same processing token identification as the statistical techniques; an additional level of parsing of the item (present, past, future action) enhances search precision.
Concept indexing uses the words within an item to correlate to concepts discussed in the index item.
When generating the concept classes automatically, there may not be a name applicable to the concept but just a statistical significance.
Finally, a special class of indexing can be defined by creation of hypertext linkages.
These linkages provide virtual threads of concepts between items versus directly defining the concept within an item.
Each technique has its own strengths and weaknesses.
Probabilistic weighting
Natural Language Processing
CONCEPT INDEXING
The term “automobile” is strongly related to “vehicle,” less strongly to “transportation,” and much less to the other terms.
DOCUMENT AND TERM CLUSTERING
CLUSTER ANALYSIS
Applications in Information Retrieval
Search Statements & Binding
The next level of binding comes when the search statement is parsed for use by a specific search system.
The length of search statements directly affects the ability of Information Retrieval Systems to find relevant items.
The longer the search query, the easier it is for the system to find relevant items.
Similarity Measures and Ranking
Searching is concerned with calculating the similarity between a user’s search statement and the items in the database.
Restricting the similarity measure to passages gains significant precision with an impact on recall.
Once items are identified as possibly relevant to the user’s query, it is best to present the most likely relevant items first. Ranking is a scalar number that represents how similar an item is to the query.
A variety of different similarity measures can be used to calculate the similarity between the item and the search statement.
A characteristic of a similarity formula is that its result increases as the items become more similar.
The value is zero if the items are totally dissimilar.
SIM(Item_i, Item_j) = Σ_k (Term_ik)(Term_jk)
This formula sums the products of the corresponding terms of two items when treating the index as a vector.
The problem with the simple measure is the normalization needed to account for variances in the length of items.
Similarity Formula by Salton in SMART System
To determine the weight an item has with respect to the search statement, the Cosine formula calculates the cosine of the angle between the vector for the item and the vector for the query.
The Dice measure simplifies the denominator from the Jaccard formula and introduces a factor of 2 in the numerator.
The normalization in the Dice formula is also invariant to the number of terms in common.
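Both measures can be sketched directly from their definitions over weighted term vectors; the query and item vectors below are hypothetical examples:

```python
# Cosine and Dice similarity between two weighted term vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def dice(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    # Factor of 2 in the numerator; sum of squared weights in the denominator.
    return 2 * dot / (sum(x * x for x in a) + sum(y * y for y in b))

query = [1, 0, 1, 0]
item  = [2, 1, 0, 0]
print(round(cosine(query, item), 3))  # 2 / sqrt(10) ≈ 0.632
print(round(dice(query, item), 3))    # 4 / 7 ≈ 0.571
```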
Without a threshold, the use of a similarity algorithm returns the complete database as search results.
Many items have similarity close or equal to zero.
The threshold defines which items appear in the resultant hit file for the query.
Query Threshold Process:
Vector: American, geography, Lakes, Mexico, Painter, oil, Reserve, Subject
DOC1: “geography of Mexico suggests oil reserves are available” – Vector (0,1,0,2,0,3,1,0)
DOC2: “American geography has lakes available everywhere” – Vector (1,3,0,0,0,0,0,0)
DOC3: “Painters suggest Mexico lakes as subjects” – Vector (0,0,1,3,3,0,0,2)
Vector values can be binary (based on existence of occurrence) or numerical (based on frequency of occurrence).
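The query threshold process can be sketched with the three document vectors above; the query vector, the dot-product similarity, and the threshold value are hypothetical choices for illustration:

```python
# Query threshold sketch over the vocabulary (American, geography,
# Lakes, Mexico, Painter, oil, Reserve, Subject).

docs = {
    "DOC1": [0, 1, 0, 2, 0, 3, 1, 0],
    "DOC2": [1, 3, 0, 0, 0, 0, 0, 0],
    "DOC3": [0, 0, 1, 3, 3, 0, 0, 2],
}
query = [0, 0, 0, 1, 0, 1, 1, 0]  # e.g. a query about Mexico, oil, reserves
threshold = 2

def similarity(a, b):
    # Simple dot-product similarity between term vectors.
    return sum(x * y for x, y in zip(a, b))

hits = {name: similarity(vec, query) for name, vec in docs.items()}
hit_file = [name for name, score in hits.items() if score >= threshold]
print(hits)      # DOC1 scores highest
print(hit_file)  # only items at or above the threshold enter the hit file
```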
Relevance Feedback
Selective Dissemination of Information Search
Weighted Searches of Boolean Systems:
The two major approaches to generating queries are Boolean and natural language. Natural language queries are easily represented within statistical models and are usable by the similarity measures discussed.
The Issues arise when Boolean Queries are Associated with weighted Index Systems
Some of the Issues are Associated with how the logic (AND,OR, NOT) Operators Function with Weighted values and How weights are Associated with the Query terms
If the operators are given their normal interpretation, they act too restrictively or too generally (the AND and OR operators, respectively).
Closely related to the strict definition problem is the lack of ranking that is missing from a pure Boolean Process
Some of the early work addressing this problem recognized the Fuzziness associated with mixing Boolean and weighted systems(Brookstein-78,Brookstein-80)
To integrate the Boolean and weighted systems, a fuzzy-set model can be used (fuzzy sets are sets whose elements have degrees of membership).
The similarity measure can consider all item weights versus taking the maximum/minimum approach.
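In the fuzzy-set interpretation of the weighted Boolean operators, AND is commonly taken as the minimum of the term weights and OR as the maximum; the weights below are hypothetical examples:

```python
# Fuzzy-set Boolean operators over term weights in [0, 1]:
# AND -> minimum of the weights, OR -> maximum of the weights.

def fuzzy_and(*weights):
    return min(weights)

def fuzzy_or(*weights):
    return max(weights)

# Item weights for the terms "oil" and "Mexico" in one item:
w_oil, w_mexico = 0.8, 0.3

print(fuzzy_and(w_oil, w_mexico))  # 0.3  (oil AND Mexico)
print(fuzzy_or(w_oil, w_mexico))   # 0.8  (oil OR Mexico)
```

Because the operators now return graded values rather than true/false, items can be ranked, addressing the lack of ranking in a pure Boolean process.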
Searching the Internet & Hypertext
Introduction of Information Visualization:
Modify representations of data and information, or display conditions (e.g., changing color scales).
Link Visualization:
It displays relationships among items
Attribute Visualization:
It reveals Content relationships across Large Numbers of Items
Cognition & Perception
Cognition: cognition means to store information (the action or process of acquiring knowledge and understanding through thought, experience & the senses).
Perception: perception means receiving information (the ability to see, hear, or become aware of something through the senses).
Other Measures:
Proximity – nearby figures are grouped together
Similarity – similar figures are grouped together
Continuity – figures are interpreted as smooth continuous patterns rather than discontinuous concatenations of shapes
(e.g., a circle with its diameter drawn is perceived as two continuous shapes, a circle & a line, versus two half circles concatenated together).
Information Visualization Technologies
Envision System
Tree Structure:
CONE-TREE:
The Cone-Tree is a 3-dimensional representation of data where one node of the tree is represented at the apex and all information subordinate to it is arranged in a circular structure at its base; any child node may also be the parent of another cone.
Perspective Wall:-
Tree Maps:-
Envision System:-
CITY SCAPE
Introduction to Text Search Techniques
The basic concept of a text scanning system is the ability for one or more users to enter queries, with the text of the items to be searched accessed sequentially and compared to the query terms.
When all of the text has been accessed the query is complete
One advantage of this type of architecture is that as soon as an item is identified as satisfying a query, the result can be presented to the user for retrieval.
The database contains the full text of the Items
The Term Detector is the Special Hardware/Software that contains all of the search terms and in Some Systems the logic between the terms
The detector outputs the detected terms to the query resolver.
The query resolver performs two major functions: accepting search statements from the users and extracting the logic and search terms to pass to the detector.
It also accepts results from the detector and determines which queries are satisfied by the item, and possibly the relevance weight associated with the hit.
The query resolver passes information to the user interface, allowing it to continually update the search status and, on request, retrieve any items that satisfy the user’s search statement.
Software Text Search Algorithms:
In software streaming techniques, the items to be searched are read into memory and then the algorithm is applied.
There are three Major Algorithms Associated with Software Text Search
Brute Force Approach
Boyer-Moore
Knuth-Morris-Pratt
Boyer Moore Algorithm:
It is a string search algorithm.
It is significantly enhanced by starting the comparison process at the end of the search pattern, processing right to left, versus at the start of the search pattern.
The advantage is that large jumps through the input stream are possible when a mismatched character does not occur in the search pattern, which happens frequently.
The original Boyer-Moore algorithm was designed to support scanning for a single search string.
It was expanded to handle multiple search strings on a single pass.
Enhanced and simplified versions of the Boyer-Moore algorithm have since been developed.
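One such simplified version keeps only the bad-character heuristic (the full algorithm also uses a good-suffix table); a minimal sketch:

```python
# Boyer-Moore sketch, bad-character heuristic only: compare right to
# left and, on a mismatch, jump the pattern forward past the mismatch.

def boyer_moore(text, pattern):
    m, n = len(pattern), len(text)
    # Rightmost position of each character in the pattern.
    last = {c: i for i, c in enumerate(pattern)}
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1            # compare right to left
        if j < 0:
            return i          # full match found at position i
        # Align the mismatched text character with its rightmost
        # occurrence in the pattern, or jump past it entirely.
        i += max(1, j - last.get(text[i + j], -1))
    return -1

print(boyer_moore("the string searching example", "searching"))  # 11
```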
Knuth-Morris Pratt Algorithm
It is a text search algorithm.
Even in the worst case it performs well.
Unlike the previous algorithms, it never backs up in the input stream, so its running time does not degrade with the input string.
It also works well on long strings.
The basic concept behind the algorithm is that whenever a mismatch is detected, the previously matched characters define the number of characters that can be skipped in the input stream prior to starting the comparison again.
Position:     1 2 3 4 5 6 7 8
Input stream: a b d a d e f g
When the mismatch occurs at position 4, with an “f” in the pattern and an “a” in the input stream, this algorithm allows the comparison to jump at least the three positions associated with the recognized “abd”.
Since the mismatched position could be the beginning of the search string, four positions cannot be skipped.
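The skip behaviour described above is what the KMP failure table encodes: for each pattern prefix, the length of its longest proper prefix that is also a suffix. A minimal sketch (the pattern “abdf” is a hypothetical example matching the input stream above):

```python
# Knuth-Morris-Pratt: on a mismatch, fall back in the pattern using the
# failure table; the input stream is never re-read.

def kmp_search(text, pattern):
    # Build the failure (prefix) table.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, never moving backwards in it.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1   # match starts here
    return -1

print(kmp_search("abdadefg", "abdf"))     # -1: pattern not present
print(kmp_search("abdadefabdf", "abdf"))  # 7: match starts at index 7
```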