# | # | Text of the question | Hint | Answer
---|---|---|---|---
2 | 1 | What is Boolean retrieval? | Lecture 12 slide 10 | Users express queries as Boolean expressions: AND, OR, NOT, which can be arbitrarily nested. Retrieval is based on the notion of sets: any given query divides the collection into two sets, retrieved and not-retrieved. |
3 | 2 | How do users express queries? | Lecture page 10. | As Boolean expressions, using AND, OR, NOT. |
4 | 3 | What is an inverted index? | http://en.wikipedia.org/wiki/Inverted_index | In computer science, an inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to their locations in a database file, or in a document or a set of documents. |
5 | 4 | How to execute a Boolean query? | Slide 13 | - Build query syntax tree - For each clause, look up postings - Traverse postings and apply the Boolean operator |
6 | 5 | How to execute Boolean query? | Slide 13 | To execute a Boolean query: - Build query syntax tree - For each clause, look up postings - Traverse postings and apply Boolean operator |
7 | 6 | If you use Boolean retrieval, then all results are considered equally good. Is this a weakness or a strength? | slides | A weakness. |
8 | 7 | What are the strengths of Boolean retrieval? | Slide #15 | - Precise, if you know the right strategies. - Precise, if you have an idea of what you're looking for. - Implementations are fast and efficient. |
9 | 8 | What are the advantages and disadvantages of the standard Boolean retrieval model? | http://en.wikipedia.org/wiki/Standard_Boolean_model | Advantages: clean formalism; easy to implement; intuitive concept. Disadvantages: exact matching may retrieve too few or too many documents; difficult to rank output (some documents are more important than others); hard to translate a query into a Boolean expression; all terms are equally weighted; more like data retrieval than information retrieval. |
10 | 9 | What two questions are important for term weights? | slide 19 | Local: how important is the term in this document? Global: how important is the term in the collection? |
11 | 10 | What do term weights consist of? | 19 | Local: how important is the term in this document? Global: how important is the term in the collection? |
12 | 11 | What are the principles of term weighting? | #19 | * Terms that appear often in a document should get high weights; * terms that appear in many documents should get lower weights. |
13 | 12 | What is TF-IDF? | Lecture slide 20 | Term frequency–inverse document frequency. |
14 | 13 | What is TF-IDF | http://tfidf.com/ | Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. |
15 | 14 | What components term weights consist of? | Slide 19. | Local: how important is the term in this document? Global: how important is the term in the collection? |
16 | 15 | What is inverse document frequency? | http://en.wikipedia.org/wiki/Tf*idf#Mathematical_details | The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. |
17 | 16 | Term frequency ranking function and TF-IDF are simple ranking functions. What are the more sophisticated functions? | http://en.wikipedia.org/wiki/Ranking_function | Okapi BM25: a variant of tf-idf. As of 2008, represents the state-of-the-art and is used in many practical applications. Machine-learned ranking formulas, obtained automatically from training data by machine learning methods. |
18 | 17 | What is retrieval based on? | Slide 10 | Retrieval is based on the notion of sets: any given query divides the collection into two sets, retrieved and not-retrieved. Pure Boolean systems do not define an ordering of the results. |
19 | 18 | What does TF-IDF mean? | http://en.wikipedia.org/wiki/Tf*idf | The tf*idf weight (term frequency–inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. |
20 | 19 | What is TF-IDF? | http://en.wikipedia.org/wiki/Tf*idf | TF-IDF weight (term frequency–inverse document frequency) can be defined as numerical statistic which reflects how important a word is to a document in a collection or corpus. Often it is used as a weighting factor in information retrieval and text mining. TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. |
21 | 20 | What data is required to calculate TF-IDF? | | Term frequency in the document, word count in the document, term frequency across all documents, total document count. |
22 | 21 | What are the indexing problems? | Lecture slides L6 Map reduce Lecture 3 page 31 | Must be relatively fast, but need not be real-time. For the web, incremental updates are important. Must have sub-second response; for the web, only relatively few results are needed. Fundamentally, a large sorting problem. |
23 | 22 | When using TF-IDF and MapReduce, buffering (doc, n, N) counts while summing 1’s into m may not fit in memory. What are the possible solutions to this? | Slide 29 | Possible solutions include: – Ignore very-high-frequency words – Write out intermediate data to a file – Use another MapReduce pass |
24 | 23 | Give a short summary of TF-IDF. | Lecture slide 30 | - Several small jobs add up to the full algorithm - Lots of code reuse possible - Jobs 3 and 4 can really be done at once in the same reducer, saving a write/read cycle - Very easy to handle medium-large scale, but must take care to ensure flat memory usage at the largest scale |
25 | 24 | What are the possible solutions when using TF-IDF and MapReduce buffering (doc, n, N) counts while summing 1’s into m, and it doesn't fit in memory? | Lecture 5/13 slide 29 | Possible solutions include: – Ignore very-high-frequency words – Write out intermediate data to a file – Use another MapReduce pass |
26 | 25 | What is MapReduce Twister? | Not on slides but mentioned | An extension to the MapReduce programming model that expands the applicability of MapReduce to more classes of applications. Twister is a lightweight MapReduce runtime, discussed further at http://www.iterativemapreduce.org/ |
27 | 26 | Is MapReduce meant for large- or small-data batch processing? | Cloud lecture 5, slide 32 | Large-data batch processing. |
28 | 27 | Describe two approaches of synchronization | Lecture 5 slide 3 | • Approach 1: turn synchronization into an ordering problem – Sort keys into correct order of computation – Partition key space so that each reducer gets the appropriate set of partial results – Hold state in reducer across multiple key-value pairs to perform computation – Illustrated by the “pairs” approach • Approach 2: construct data structures that “bring the pieces together” – Each reducer receives all the data it needs to complete the computation – Illustrated by the “stripes” approach |
29 | 28 | What are the weaknesses of boolean retrieval? | Slide 15 | – Users must learn Boolean logic – Boolean logic insufficient to capture the richness of language – No control over size of result set: either too many hits or none – When do you stop reading? All documents in the result set are considered “equally good” – What about partial matches? Documents that “don’t quite match” the query may be useful also |
30 | 29 | How does sorting algorithm work in Oracle? | http://docs.oracle.com/cd/F49540_01/DOC/inter.815/a67843/ascore.htm | To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are "noise" terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole. |
31 | 30 | What are the differences between Boolean and ranked retrieval? | Slide #15, 16 europa.nvc.cs.vt.edu/~jzhang/cs5604/Note-2.pdf | Boolean retrieval: every document is either relevant or not; the result is the set of relevant documents; implementations are fast and efficient (if you know what you are looking for). On the other hand, it is hard to formulate queries (users have to learn Boolean logic) and you might not get all relevant documents. Ranked retrieval: every document has a score for its relevancy to the query; the result is a list of ranked documents; it is easy to use and has better retrieval performance; ranked relevancy (sorted with the best result first) gives a better overview of the results. On the other hand, it assumes that the terms are independent. |
32 | 31 | What libraries are there to support the concept of MapReduce in C++? | http://www.craighenderson.co.uk/mapreduce/ | Boost MapReduce |
33 | 32 | Where are IR Systems used? | http://en.wikipedia.org/wiki/Information_retrieval | Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications. |
34 | 33 | What are the advantages and disadvantages of the Vector Space Model? | slide 17 http://en.wikipedia.org/wiki/Vector_space_model http://www.csee.umbc.edu/~ian/irF02/lectures/07Models-VSM.pdf | Advantages: simple model based on linear algebra; ranked retrieval; terms are weighted by importance; partial matches allowed; allows ranking documents according to their possible relevance. Disadvantages: assumes terms are independent; weighting is intuitive, but not very formal. |
35 | 34 | What's a word? | http://en.wikipedia.org/wiki/Word | In language, a word is the smallest element that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning). |
36 | 35 | What information is needed for TF-IDF | slide 24 | • Number of times term X appears in a given document • Number of terms in each document • Number of documents X appears in • Total number of documents |
37 | 36 | Name all steps of the information cycle. | Slide 4 | Source selection, query formulation, search, selection, examination, delivery. |
38 | 37 | What is Vector Space Model? | http://en.wikipedia.org/wiki/Vector_space_model | Vector space model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. |
39 | 38 | What is a word stem? What is stemming? | http://en.wikipedia.org/wiki/Word_stem http://en.wikipedia.org/wiki/Stemming Slide #14 | In linguistics, a stem is a part of a word. In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Many search engines treat words with the same stem as synonyms as a kind of query broadening, a process called conflation. |
40 | 39 | What are the limitations of vector space model? | http://en.wikipedia.org/wiki/Vector_space_model#Limitations | The vector space model has the following limitations: Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality) Search keywords must precisely match document terms; word substrings might result in a "false positive match" Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match". The order in which the terms appear in the document is lost in the vector space representation. Assumes terms are statistically independent Weighting is intuitive but not very formal Many of these difficulties can, however, be overcome by the integration of various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet. |
41 | 40 | What is Information Retrieval (IR) ? | http://en.wikipedia.org/wiki/Information_retrieval | Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web. |
42 | 41 | How do we represent text? | slide 7 | – Treat all the words in a document as index terms for that document – Assign a "weight" to each term based on "importance" – Disregard order, structure, meaning, etc. of the words |
43 | 42 | What do search engines use tf*idf weighting for ? | http://en.wikipedia.org/wiki/Tf*idf | Variations of the tf*idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf*idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification. |
44 | 43 | Which software implements the vector space model? | http://en.wikipedia.org/wiki/Vector_space_model#Software_that_implements_the_vector_space_model | Apache Lucene: a high-performance, full-featured text search engine library written entirely in Java. SemanticVectors: semantic vector indexes, created by applying a Random Projection algorithm (similar to Latent Semantic Analysis) to term-document matrices created using Apache Lucene. Gensim: a Python+NumPy framework for vector space modelling; it contains incremental (memory-efficient) algorithms for TF-IDF, Latent Semantic Indexing, Random Projections and Latent Dirichlet Allocation. Compressed vector space in C++ by Antonio Gulli. Text to Matrix Generator (TMG): a MATLAB toolbox for various text mining tasks, specifically i) indexing, ii) retrieval, iii) dimensionality reduction, iv) clustering, v) classification; most of TMG is written in MATLAB and parts in Perl; it contains implementations of LSI, clustered LSI, NMF and other methods. SenseClusters: an open source package that supports context and word clustering using Latent Semantic Analysis and word co-occurrence matrices. S-Space Package: a collection of algorithms for exploring and working with statistical semantics. |
45 | 44 | Which algorithm does Google use for its search engine? | http://en.wikipedia.org/wiki/PageRank | PageRank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. |
46 | 45 | What is the central problem in search? | slide #5, #7 | The difference between the query terms and the document terms. The terms may differ syntactically yet have the same meaning (semantics). Since computers do not understand meaning, the problem is how to match the two using algorithms. |
47 | 46 | What are the variants of inverted index? | http://en.wikipedia.org/wiki/Inverted_index | There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created. |
48 | 47 | Definition of vector space model | http://en.wikipedia.org/wiki/Vector_space_model | Documents and queries are represented as vectors. d_j = ( w_{1,j} ,w_{2,j} , ... ,w_{t,j} ) q = ( w_{1,q} ,w_{2,q} , ... ,w_{t,q} ) Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If the words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). Vector operations can be used to compare documents with queries. |
49 | 48 | What is vector space model? | Lecture slides 17 | Represent documents and queries as vectors. Assume that documents that are “close together” in vector space “talk about” the same things. Therefore, one can retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”) by calculating the distance between the two. |
50 | 49 | What is TF-IDF sometimes used for? | http://en.wikipedia.org/wiki/Tf*idf | Variations of the tf*idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf*idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification. |
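The inverted index and Boolean-query questions above (build the index, look up postings per clause, apply the operator) can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation; the names `build_index` and `boolean_and` and the sample documents are made up for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it (the postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """AND query: intersect the postings of every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_index(docs)
print(sorted(boolean_and(index, "home", "sales", "july")))  # → [2, 3]
```

The result is a set, matching the flashcards' point that pure Boolean retrieval divides the collection into retrieved and not-retrieved with no ordering.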
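The four quantities listed for TF-IDF (term count in the document, document length, number of documents containing the term, total document count) combine as sketched below. This is one common variant of the weighting, assuming the term occurs in at least one document; the function name and toy corpus are illustrative.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    n = Counter(doc_tokens)[term]               # times the term appears in this doc
    tf = n / len(doc_tokens)                    # local weight: normalised term frequency
    df = sum(1 for d in all_docs if term in d)  # docs containing the term (must be > 0)
    idf = math.log(len(all_docs) / df)          # global weight: inverse document frequency
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs and the cats are pets".split(),
]
score_cat = tf_idf("cat", docs[0], docs)  # rare-ish term: positive weight
score_the = tf_idf("the", docs[0], docs)  # appears in every doc: idf = log(1) = 0
```

Note how "the" gets weight zero despite being frequent locally, which is exactly the flashcards' principle that terms appearing in many documents get lower weights.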
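The vector space model questions above (documents and queries as vectors, similarity as "closeness") can be illustrated with raw term counts and cosine similarity. A real system would use TF-IDF weights instead of counts; the `cosine` helper and the two sample documents are assumptions for the sketch.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[t] * b[t] for t in a)  # missing terms count as 0 in a Counter
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "boolean retrieval divides the collection into sets",
    "vector space models rank documents by similarity",
]
query = "rank documents by vector similarity"

vectors = [Counter(d.lower().split()) for d in docs]
q = Counter(query.lower().split())
scores = [cosine(q, v) for v in vectors]
best = scores.index(max(scores))  # index of the closest document
```

Unlike Boolean retrieval, this produces a ranked list: every document gets a relevance score, and partial matches score above zero.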
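The synchronization discussion (turn synchronization into an ordering problem: sort keys, then let each reducer see all partial results for a key) can be mimicked in memory with a word-count job. This is a toy simulation of the framework's sort-and-group step, not Hadoop code; `map_fn`, `reduce_fn` and `run_mapreduce` are illustrative names.

```python
from itertools import groupby

def map_fn(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word occurrence."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reducer: sum the partial counts for one key."""
    return (word, sum(counts))

def run_mapreduce(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_fn(doc_id, text)]
    pairs.sort(key=lambda kv: kv[0])  # the framework's sort-by-key (shuffle) step
    return dict(
        reduce_fn(key, (v for _, v in group))
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    )

docs = {1: "map reduce map", 2: "reduce sort shuffle"}
counts = run_mapreduce(docs)
```

The sort guarantees that all values for a key arrive at one reducer together, which is the ordering trick the "pairs" approach relies on.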