1 of 11

Project Overview

CSE 2341

2 of 11

Scenario

  • Stack Overflow’s search system is broken and they need a new one quick!
  • You will build a search engine for a large collection of questions that have been extracted from Stack Overflow.
    • Users will be able to enter a search query and you’ll return results that are ranked by a relevancy scheme that you devise.

3 of 11

The Architecture of a Search Engine

4 of 11

Sample Data 2009-questions.psv

|Id|OwnerUserId|CreationDate|Score|Title|Body|Code�5827|404450||2009-01-01T02:55:13Z|1|LINQ to SQL|"I am finishing off a C# ASP.NET program that allows the user to build their own computer by selecting hardware components such as memory, cpu, etc from a drop down lists. The SQL datatable has 3 columns; ComputerID, Attribute and Value. The computerID is an ID that corresponds to a certain computer in my main datatable of products, the Attribtute is the name of the hardware component; memory,cpu, hard drive etc.. and the value is the value assigned to that attribute, such as 1GB or 2.8GHz 320GB. This means that a computer will have multiple attributes.

What I am trying to do it narrow down the results by first selecting all computers that meet the first attribute requirements and then getting from that list, all computers that meet the next requirement.. and so on for about 10+ attributes.

I thought it might be a good idea to show you an example of my LINQ to SQL query so that you have a btter idea of what I am trying to do. This basically selects the ComputerID where the the computers memory is larger than 1GB.

Next I want to select from the resultsList where the CPU is say, faster than 2.8Ghz and so on.

I hope I have given you enough information.

If anyone could please give me some advice as to how I might go about finishing this project that would be great.

Thanks

"|"var resultsList = from results in db.ComputerAttributes

where computer.Value == ""MEMORY"" && computer.Value >= ""1""

select results.ComputerID;

"

5 of 11

The Document Parser

the

running

a

formed

nicely

Remove Stop

Words

running

formed

nicely

Stem

run

form

nice

These are the words that will actually be placed in your index

6 of 11

The Inverted File Index

Documents:

d1 = computer network security

d2 = network cryptography

d3 = database security

Index:

computer = d1

network = d1, d2

security = d1, d3

cryptography = d2

database = d3

7 of 11

Query Processor

  • java
    • This query should return all questions that contain the word java.
  • AND programming computer java
    • This query should return all questions that contain the words programming and computer and java
  • OR java javascript
    • This query should return all questions that contain either java OR javascript OR both.
  • AND book c++ NOT java
    • This query should return all questions that contain book and c++, but not java.
  • Java NOT c++

8 of 11

Index Handler

  • You'll implement at least two underlying data structures to maintain the inverted file index
    • AVL Tree
    • Hash Table
  • Classes that will be used to store index should implement the same interface.
    • may be adapters of AVL or hash table.
  • Index should be persistent.
    • When program starts, user should have the option of importing index into AVL or Hash table.

9 of 11

Ranking the Results

  • Ranking will be done using Term Frequency/Inverse Document Frequency (TF/IDF) statistic.
  • General Idea:
    • If a word appears in a document D frequently, but appears rarely in other documents from the same corpus, D is an important result for the query.
    • If a word appears in a document but also appears in many other documents, it is less important.

10 of 11

Project Logistics

  • Can be done individually or in teams of 2 or 3 students (max)
    • Teams of 3 have to implement 2-word phrase queries
  • Must use OOP
  • May use as much of c++ std lib as you'd like
    • Caveat: you must implement your own AVL tree and Hash table.
  • Code base should be properly documented and formatted.

11 of 11

Schedule

  • Friday April 12: Team info to TAs
  • Mon April 16 @ 6am: Design Docs uploaded to Canvas
  • Week of April 16 in Lab: Demo document parser/processor and one index data structure to TA.
  • Week of April 23 in Lab:
  • Friday April 27: Sanity Check and Parsing Speed Demo with Fontenot
  • Week of April 30: Finish functionality, polish, work on paper.
  • Monday May 7 @ 6am: Project Due
  • Monday May 7: Demo project to TAs and Fontenot