1 of 22

CSE 163

Intro to HW4�

Suh Young Choi�

🎶 Listening to: The Killers

💬 Before Class: Are animals better when they’re orbular or angular (aesthetically)?

2 of 22

This Time

  • Magic methods
  • os library
  • Introducing HW4

Last Time

  • Private fields and methods
  • Default parameters

2

3 of 22

Announcements

Midterm evaluations dropping next week

  • Will be open Monday (7/24) – Wednesday (7/26)
  • If >70% of the class fills it out, everyone gets an extra day on HW4!
  • Feedback also helps us out a ton!!

Suh Young out of state again in Week 9

  • Zoom MW, in-person Friday
  • More details on this later

3

4 of 22

Lesson 13 Recap

Class methods and fields can be private

  • Indicated with an underscore (_) in front of the method or field name
  • Private methods and fields by convention are not accessed by the user
  • Setter vs. getter methods

Default parameters

  • Use None as a default when expecting an object

Lambda functions

  • Quick shorthand functions, particularly useful for sorting

4

5 of 22

Main Method Pattern

  • Why have we been making you do this annoying pattern?

  • If you don’t, it will run the main method if you import the file!
    • Usually not fun to run a 2 hour data analysis if you just wanted to import one helper function.

5

def main():

print('Hello world')

if __name__ == '__main__':

main()

6 of 22

Building Dogs

6

d1 = Dog('Chester')

d2 = Dog('Scout')

d3 = d1

d4 = Dog('Chester')

d1

name: 'Chester'

d2

name: 'Scout'

d3

d4

name: 'Chester'

7 of 22

Special Methods

  • The fact that you can use the same syntax for many classes is because of these “special methods”

7

Syntax

Method

x < y

x.__lt__(y)

x == y

x.__eq__(y)

x >= y

x.__ge__(y)

print(x)

print(x.__str__())

x[i]

x.__getitem__(i)

x[i] = v

x.__setitem__(i, v)

8 of 22

os

  • os is a built-in package in Python that helps work with the computer (e.g. files)

8

import os

for file_name in os.listdir(directory_name):

print(file_name) # relative path

print(os.path.join(directory_name,

file_name)) # absolute path

9 of 22

Don’t forget absolute paths!!

9

Only os.path.join() deals in absolutes

10 of 22

SearchEngine

  • SearchEngine: A class that manages which words appear in which document
  • Uses an inverted-index, to perform search

10

11 of 22

Project Part 0 Coming Soon

Project has a few parts

  • Proposal
  • Deliverables – Code and Report
  • Deliverables – Presentation Video
  • Peer Feedback – Reflections and Inter-Group Feedback

Bonus Project Component releasing next Friday

  • Find 2 datasets that you are interested in working on and complete a short writeup for each
  • Only bonus credit in the form of an extra Checkpoint Token
  • No penalty if not completed

Project can be completed in groups of ≤3, but bonus component is to be completed individually.

11

12 of 22

Before Next Time

  • Complete Lesson 15
    • Remember not for points, but do go towards Checkpoint Tokens
  • HW4, LR4, and CP4 out after class
  • Be on the lookout for the midterm evaluation!

Next Time

  • Geospatial data – mapmaking!

12

13 of 22

Slides from Video

13

14 of 22

Search Engine

  • HW4 involves writing a search engine from scratch
  • This will mainly involve
    • Processing files
    • Writing classes
  • Can be a bit complicated at first, recommend reading over these slides and the entire spec first to see how the pieces fit together, then tackle each part
  • We have parts separated into a development strategy

  • A search engine involves two main components
    • An index to efficiently find results for a query
    • A ranking algorithm to order the results
  • We’ll describe the index and searching first even though you will implementing part of the ranking algorithm first on the assignment

14

15 of 22

SearchEngine

  • SearchEngine: A class that manages which words appear in which document
  • Uses an inverted-index, to perform search

15

16 of 22

search

  • Part 1 - Single-word query
    • search(“love”)
      • Document 1
      • Document 2
      • Document 3
  • Part 2 - Multi-word query
    • search(“love corgis”)
      • Document 1
      • Document 2
      • Document 3
    • search(“corgis puppies”)
      • Document 1
      • Document 2

16

17 of 22

Document Ranking

  • Once we have a set of results, need rank them by relevance
  • Query: “the dogs”

  • Which order should we rank the results?
    • If we did it by raw number of hits, this would almost always list the dictionary as the first result

17

Dogs are the most amazing pets. Dogs are way better than cats because they are the best pets.

a - An article...

aardvark - An animal...

...

avocado - A fruit…

...

dog - The best pet…

...

dogs_rock.txt

dictionary.txt

18 of 22

TF-IDF

  • Raw counts fails because it favors long documents that have common words
  • Intuition, give higher weight to less common words and penalize common words like “the”

  • TF-IDF: Term Frequency - Inverse Document Frequency

Score(“the dogs”, D) = TFIDF(“the”, D) + TFIDF(“dogs”, D)

TFIDF(t, D) = TF(t, D) * IDF(t)

TF(t, D) = (# of times t in D) / (# of words in D)

IDF(t) = (# of documents) / (# of documents that have t)

18

19 of 22

TF-IDF

19

20 of 22

How to compute TF-IDF

  • It would be inefficient to calculate all these values from scratch every time we get a new query
  • Instead, we will store all the information ahead of time
    • Takes memory and a little bit to build, but then queries are fast
  • Document class stores Term-Frequency information
  • SearchEngine class stores Inverse-Document-Frequency information

20

21 of 22

Testing

  • Testing this assignment can be tricky with the wikipedia dataset
    • It’s large, takes a few minutes to build the index
    • It’s hard to know what the answer “should be”
  • You are recommend to make your own directories of test files

  • It’s okay to directly inspect your private fields in tests
    • We will do this for private methods we ask you to write

21

22 of 22

Development Strategy

  • Read the whole spec first
  • Take notes as you’re writing and try to make a mental map of how the pieces fit together
  • Write down any of the special method calls we tell you about in the spec and what they do so you know when you’ll use them
  • Planning up front can be very helpful with this project
    • You should probably still tackle the project in the order we suggested, but thinking about the whole thing is important

22