1 of 67

You may also like

Solr Clojure ML & Movies

2 of 67

Hello!

I am Leon Talbot

Senior Software Engineer at Stylitics

@leontalbot

3 of 67

Stylitics, Inc.

  • NYC Tech Startup

We are hiring!

  • Clojure(script) Engineer
  • Work remotely

4 of 67

Movie Recommendation Engine

[Diagram: Input -> Output: "You may also like..." with a grid of movies, Mov 1 to Mov 10]

5 of 67

Goal

Starter kit

  • In Clojure
  • Fast
  • Accurate
  • Time savvy

6 of 67

Agenda

  • 1. Explain
  • 2. Showcase

7 of 67

POC

[Diagram: input 1 (movie metadata) and a DB of 1 million movies feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

8 of 67

POC

[Diagram: inputs 1 (movie metadata) and 2 (user prefs), plus a DB of 1 million movies, feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

9 of 67

POC

[Diagram: inputs 1 (movie metadata), 2 (user profile) and 3 (user profiles, plural, and movie ratings), plus a DB of 1 million movies, feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

10 of 67

POC

[Diagram: inputs 1 (movie metadata), 2 (user pref) and 3 (user profiles, plural, and movie ratings), plus a DB of 1 million movies, feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

A lot of content to look at

Robust content-based filtering

11 of 67

Goals

  • In Clojure
  • Fast
  • Accurate
  • Time savvy

12 of 67

DB index vs reverse index

[Diagram: query "French Police". DB index: movie -> its words (Movie 1 to Movie 8). Reverse index: word -> movies, with "French" -> movies 1, 5, 8 and "Police" -> movies 2, 5, 6, 7.]
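
The reverse index idea can be sketched in a few lines of Clojure. This is only an illustration, not how Lucene stores its index: the movie texts are made up so that the postings match the slide, and `tokenize`, `build-index` and `lookup` are hypothetical helper names.

```clojure
(require '[clojure.string :as str])

;; Made-up texts, keyed by movie id, chosen to reproduce the slide's postings.
(def movies
  {1 "french french comedy"
   2 "police police drama"
   5 "french police french police"
   6 "police thriller"
   7 "police story"
   8 "french romance"})

(defn tokenize
  "Lowercases the text and splits it into alphanumeric tokens."
  [text]
  (re-seq #"[a-z0-9]+" (str/lower-case text)))

(defn build-index
  "Maps each token to the sorted set of ids of the movies containing it."
  [docs]
  (reduce (fn [index [id text]]
            (reduce (fn [idx token]
                      (update idx token (fnil conj (sorted-set)) id))
                    index
                    (tokenize text)))
          {}
          docs))

(def index (build-index movies))

(defn lookup
  "Returns the ids of all movies matching at least one query token."
  [index query]
  (into (sorted-set) (mapcat #(get index % #{}) (tokenize query))))

(get index "french")           ; movies 1, 5 and 8
(lookup index "French Police") ; every movie containing either word
```

Matching a query then costs one map lookup per token instead of a scan over every movie.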

13 of 67

Ranking fast

TF * IDF

Term (token) frequency * inverse document frequency

[Diagram: term frequency per movie for each query token]

french: Movie 1 (x 2), Movie 5 (x 2), Movie 8 (x 1)

police: Movie 2 (x 2), Movie 5 (x 2), Movie 6 (x 1), Movie 7 (x 1)
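
A minimal sketch of TF * IDF ranking over the postings from the slide. It assumes the textbook formula idf = ln(N / df) with N = 8 movies; Lucene's real scoring differs (Solr 8 defaults to BM25), so the numbers are illustrative only. `idf`, `scores` and `rank` are hypothetical names.

```clojure
;; Postings from the slide: token -> {movie-id term-frequency}.
(def postings
  {"french" {1 2, 5 2, 8 1}
   "police" {2 2, 5 2, 6 1, 7 1}})

(def total-docs 8) ; Movie 1 to Movie 8

(defn idf
  "Inverse document frequency: rarer tokens weigh more."
  [token]
  (Math/log (/ (double total-docs) (count (get postings token)))))

(defn scores
  "Sums tf * idf per movie over all query tokens."
  [tokens]
  (apply merge-with +
         (for [token tokens]
           (into {} (for [[id tf] (get postings token)]
                      [id (* tf (idf token))])))))

(defn rank
  "Movie ids ordered from highest to lowest score."
  [tokens]
  (map first (sort-by (comp - val) (scores tokens))))

(rank ["french" "police"]) ; movie 5 comes first: it matches both tokens twice
```

"french" appears in 3 movies and "police" in 4, so a "french" match counts slightly more than a "police" match.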

14 of 67

Indexing + Matching + Ranking

[Diagram comparing two pipelines. No indexing: matching runs live (label: 1k movies). With indexing: indexing runs offline, matching runs live.]

15 of 67

Search engine?

16 of 67

POC

[Diagram: inputs 1 (movie metadata), 2 (user prefs) and 3 (user profiles, plural, and movie ratings), plus a DB of 1 million movies, feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

Prefs are more movie content = still content-based filtering

17 of 67

POC

[Diagram: inputs 1 (movie metadata), 2 (user prefs) and 3 (user profiles, plural, and movie ratings), plus a DB of 1 million movies, feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

Collaborative filtering

18 of 67

Ranking vs re-ranking

[Diagram. Ranking: 1 million movies -> matching and ranking -> top 10. Re-ranking: 1 million movies -> matching and ranking -> top 100 movies -> re-ranking with an ML model -> top 10.]
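
The two-stage idea can be sketched as follows. The catalogue is made up: `:match` stands in for the search engine's relevance score and `:predicted-rating` for the output of an ML model; `top-n` and `recommend` are hypothetical helpers, not part of Solr or Corona.

```clojure
;; Toy catalogue of 1000 movies with made-up scores.
(def movies
  (for [i (range 1000)]
    {:id i :match i :predicted-rating (mod i 7)}))

(defn top-n
  "Returns the n items with the highest (score-fn item)."
  [n score-fn items]
  (take n (sort-by (comp - score-fn) items)))

(defn recommend
  "Cheap ranking over everything, costly re-ranking over the top 100 only."
  [movies]
  (->> movies
       (top-n 100 :match)
       (top-n 10 :predicted-rating)))

(def results (recommend movies))
```

The expensive model only ever sees 100 candidates, whatever the size of the catalogue.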

19 of 67

2 months ago,

kept stumbling upon...

  • Learning to Rank

20 of 67

Apache Solr

  • Apache Lucene token matching and ranking (Java)
  • Standalone server, REST API
  • Declarative syntax for most config
  • High-level, powerful query handlers (More Like This, MLT)
  • Re-ranking facilities, with LTR plugin

21 of 67

POC

[Diagram: inputs 1 (movie metadata), 2 (user profile) and 3 (user profiles, plural, and movie ratings), plus a DB of 1 million movies, feed an unknown component "?", which returns results: a grid of movies 1 to 10.]

22 of 67

Goals

  • In Clojure
  • Fast
  • Accurate
  • Time savvy

23 of 67

In Clojure?

24 of 67

github.com/Stylitics/corona

Solr 8 in Clojure!

25 of 67

Companies

Who is using Solr?

  • Netflix
  • Instagram
  • DuckDuckGo
  • Bloomberg

  • Walmart
  • Salesforce
  • Apple
  • Dell

26 of 67

Companies

Who is using Corona?

  • Stylitics

Experiment with us!

27 of 67

github.com/Stylitics/corona-demo

28 of 67

1. Basic content-based recommendations

29 of 67

Movies dataset (TMDB 5000)

{:db_id "42151",

:runtime "89",

:budget "31192",

:vote_count "26",

:vote_average "6.3",

:popularity "1.330379",

:revenue "10000",

:release_date "2009-09-01",

:status "Released",

:genres ["Drama" "Action" "Comedy"],

:title "Down Terrace",

:tagline "You're only as …",

:overview "After serving jail time...",

:original_language "en",

:keywords ["murder" "dark comedy" ...],

:director "Ben Wheatley",

:spoken_languages ["en"],

:production_countries ["GB"]}

30 of 67

Server config

____solr-8.0.0

|____bin

|____contrib

|____docs

|____dist

|____example

|____server

| |____solr

| | |____zoo.cfg

| | |____solr.xml

| |____logs

| | |____solr.log

| |____start.jar

|____solr.xml

debug

Copy to $SOLR_HOME

31 of 67

App config

____resources

|____solr

| |____tmdb

| | |____core.properties

| | |____data

| | |____conf

| | | |_____schema_model-store.json

| | | |____managed-schema

| | | |____elevate.xml

| | | |____mapping-FoldToASCII.txt

| | | |_____schema_feature-store.json

| | | |____protwords.txt

| | | |____currency.xml

| | | |____synonyms.txt

| | | |____mapping-ISOLatin1Accent.txt

| | | |____spellings.txt

| | | |____solrconfig.xml

| | | |____stopwords.txt

INDEX

32 of 67

Start

$ cd corona-demo

$ $SOLR_HOME/bin/solr start

$ lein repl

(def core-dir (str (System/getProperty "user.dir") "/resources/solr/tmdb"))

(def client-config {:core :tmdb})

;; clear core config and set it back based on solrconfig.xml

(solr.core/delete! client-config {:deleteIndex true})

(solr.core/create! client-config {:instanceDir core-dir})

33 of 67

Schemas

(first data/movies) ; inspect movies

;; add schema fields missing from managed-schema conf file.

(solr.schema/get-field-types client-config)

(solr.schema/add-field-type! client-config {...})

(solr.schema/get-fields client-config)

(solr.schema/add-field! client-config {}) ; {} or [{}]

(doseq [n (map :name data/content-fields)]

(solr.schema/update-field!

client-config

{:add-copy-field {:source n :dest "_text_"}})) ;default search field

34 of 67

Schemas: field types

{:name "text_en_splitting"

:class "solr.TextField"

...

:indexAnalyzer {:tokenizer {:class "solr.WhitespaceTokenizerFactory"}

:filters []}

:queryAnalyzer {:type "query"

:tokenizer {:class "solr.WhitespaceTokenizerFactory"}

:filters []}}

35 of 67

Schemas: field types

:indexAnalyzer

[{:class "solr.StopFilterFactory" ...} ;; Removes stop words

{:class "solr.WordDelimiterGraphFilterFactory"} ;; 'wi fi'='WiFi'= 'wi-fi'

{:class "solr.LowerCaseFilterFactory"} ;; Lowercases content

{:class "solr.KeywordMarkerFilterFactory"} ;; Protected from stemming

{:class "solr.PorterStemFilterFactory"} ;; "jump"="jumping"="jumped"

{:class "solr.FlattenGraphFilterFactory"}] ;; needed for synonyms

36 of 67

Schemas: field types

:queryAnalyzer

[{:class "solr.SynonymGraphFilterFactory"} ;; Adds synonyms at query time

{:class "solr.StopFilterFactory" ...} ;; Removes stop words

{:class "solr.WordDelimiterGraphFilterFactory"} ;; 'wi fi'='WiFi'= 'wi-fi'

{:class "solr.LowerCaseFilterFactory"} ;; Lowercases content

{:class "solr.KeywordMarkerFilterFactory"} ;; Protected from stemming

{:class "solr.PorterStemFilterFactory"}] ;; "jump"="jumping"="jumped"
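
To get a feel for what such a filter chain does to text, here is a toy analysis chain in plain Clojure. It is not Solr's implementation: the stop word and protected word lists are invented, the word delimiter and synonym steps are left out, and `naive-stem` is a crude suffix stripper standing in for Porter stemming.

```clojure
(require '[clojure.string :as str])

(def stop-words #{"the" "a" "an" "and" "of"}) ; invented list
(def protected-words #{"news"})               ; never stemmed

(defn naive-stem
  "Crude suffix stripping; a stand-in for Porter stemming, not the real algorithm."
  [token]
  (if (protected-words token)
    token
    (str/replace token #"(ing|ed|s)$" "")))

(defn analyze
  "Whitespace tokenizer followed by stop, lowercase and stemming filters."
  [text]
  (->> (str/split (str/trim text) #"\s+")
       (remove #(stop-words (str/lower-case %)))
       (map str/lower-case)
       (map naive-stem)))

(analyze "The spy jumped and jumps jumping")
(analyze "News of the World")
```

Because index time and query time text go through the same kind of chain, "jumped" in a query can match "jumping" in an overview.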

37 of 67

Index

(solr.index/clear! client-config {:commit true})

(solr.index/add! client-config data/movies {:commit true})

38 of 67

Normal Search Query

;; Get me bond movies

(solr.query/query

client-config

{:q "bond"

:fl ["db_id" "title" "score"] ; Results: fields to return (like select-keys)

:rows 10}) ; Results: number of docs to return

39 of 67

{:responseHeader

{:status 0,

:QTime 1,

:params {:q "bond",

:fl "db_id,title,score",

:rows "10"}},

:response

{:numFound 101,

:start 0,

:maxScore 2.9382193,

:docs [...]}}

(map (juxt :db_id :title :score) docs)

[["36670" "Never Say Never Again" 2.9382193]

["36669" "Die Another Day" 2.8057525]

["36643" "The World Is Not Enough" 2.732932]

["668" "On Her Majesty's Secret Service" 2.5980718]

["700" "Octopussy" 2.5700123]

["37707" "Splice" 2.5592384]

["253" "Live and Let Die" 2.469219]

["709" "Licence to Kill" 2.4265428]

["710" "GoldenEye" 2.4265428]

["646" "Dr. No" 2.4265428]]

40 of 67

More search queries...

;; Get me the Daniel Craig ones

{:q "bond cast:\"Daniel Craig\""}

;; from the last 20 years (filter query)

{:fq "release_date:[NOW-20YEARS TO NOW]"}

;; prefer recent (boosting fn)

{:defType "edismax"

:bf ["recip(ms(NOW/HOUR,release_date),3.16e-11,1,1)^5"]}
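
What the boost function computes: Solr's recip(x, m, a, b) returns a / (m * x + b). With m = 3.16e-11, roughly one over the number of milliseconds in a year, and a = b = 1, a movie released now gets about 1 and a one-year-old movie about 0.5. The sketch below reproduces that arithmetic in Clojure; `recency-boost` and `ms-per-year` are hypothetical names, and the ^5 weight from the query is left out.

```clojure
(defn recip
  "Solr's recip(x, m, a, b) function: a / (m * x + b)."
  [x m a b]
  (/ a (+ (* m x) b)))

(def ms-per-year (* 365.25 24 60 60 1000))

(defn recency-boost
  "Boost for a movie released age-ms milliseconds ago."
  [age-ms]
  (recip age-ms 3.16e-11 1 1))

(recency-boost 0)                  ; released right now
(recency-boost ms-per-year)        ; about half the boost after one year
(recency-boost (* 20 ms-per-year)) ; small but never zero
```

The curve decays smoothly, so old movies are demoted rather than excluded, unlike with the filter query.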

41 of 67

1st: More Like This

{:db_id "206647"

:title "Spectre"

:release_date #inst"2015-10-26..."

...}

42 of 67

1st: MLT: the query

(solr.query/query-mlt client-config

{:q "db_id:206647" ;“this” matched by id

:mlt.fl ["overview" "genres" "title" "keywords" ;fields to pull interesting terms from

"production_companies" "production_countries"

"spoken_languages" "director"]

:mlt.mintf "1" ;min Term Frequency below which terms are ignored

:mlt.mindf "3" ;min Document Frequency below which terms are ignored

:mlt.boost "true" ;interesting terms tf-idf score as 1st boost

:mlt.qf [["genres" 10] ["overview" 6] ["title" 3]] ;2nd boost we control

:fl ["db_id" "title" "release_date" "score"]})

43 of 67

[["36670" "Never Say Never Again" 2.9382193]

["36669" "Die Another Day" 2.8057525]

["36643" "The World Is Not Enough" 2.732932]

["668" "On Her Majesty's Secret Service" 2.5980718]

["700" "Octopussy" 2.5700123]

["37707" "Splice" 2.5592384]

["253" "Live and Let Die" 2.469219]

["709" "Licence to Kill" 2.4265428]

["710" "GoldenEye" 2.4265428]

["646" "Dr. No" 2.4265428]]

{:db_id "206647"

:title "Spectre"

:release_date "2015..."}

“Bond” is not in any titles!

44 of 67

:mlt.interestingTerms

[["keywords" "sequel" 1.0]

["overview" "uncov" 1.0249952]

["overview" "truth" 1.0323821]

["overview" "reveal" 1.048014]

["overview" "send" 1.06495]

["overview" "polit" 1.080229]

["overview" "organ" 1.0899804]

["overview" "aliv" 1.1073512]

["keywords" "servic" 1.1344733]

["keywords" "spy" 1.1344733]

["overview" "servic" 1.1610042]

["keywords" "british" 1.1756936]

["overview" "trail" 1.2087036]

["overview" "sinist" 1.227466]

["overview" "terribl" 1.2341261]

["keywords" "unit" 1.2794498]

["overview" "messag" 1.2794498]

["keywords" "kingdom" 1.306572]

["overview" "deceit" 1.3749144]

["production_companies" "danjaq" 1.4786706]

["director" "sam mendes" 1.5025941]

["overview" "cryptic" 1.5980587]

["overview" "spectr" 1.6433824]

["keywords" "secret" 1.7472144]

["overview" "bond" 3.212773]]

Bond!

45 of 67

1st: More Like This

Conclusions

;; Results are really good!

;; Could recognize "bond" movies from the word "bond" in the overview field

;; THOUGH:

;; Some movies are old

;; I am not looking for a secret agent movie as such

;; I was looking for a good action movie, implying spaceship races or car races

MLT: cannot make good use of timestamps

MLT: ranks only from interesting terms

46 of 67

2. Content-based filtering with more awareness

47 of 67

2nd pass: MLT “aware”...

;; Here are my past searches, from before I watched Spectre

(def past-search-queries

["star wars" "car" "fast car race" "bond"])

(def user-last-watched-movies

[{:title "Furious 7", :db_id "168259"}

{:title "Hidden Away", :db_id "258755"}

{:title "Need for Speed", :db_id "136797"}])

48 of 67

2nd: MLT with user prefs and num. fields

(let [watched-ids-str (string/join " "

(map :db_id user-last-watched-movies))

bond-release-date (inst-ms (:release_date bond-spectre-movie))

this-id (:db_id bond-spectre-movie)

mlt-q (str "{!mlt mintf=1 mindf=3 boost=true qf=overview}" this-id)

watched-q (str "{!mlt mintf=1 mindf=3 qf=overview}" watched-ids-str)

search-logs-q (string/join " " past-search-queries)

recent-boost-prefix "{!boost b=recip(sub(${now},ms(release_date)),3.16e-11,1,1)}"

neg-q (format "-db_id:(%s %s)" this-id watched-ids-str)]

49 of 67

2nd: MLT with user prefs and num. fields

...

(solr.query/query

client-config

{:defType "lucene"

:q (format "%s (%s) (%s) (%s)" recent-boost-prefix mlt-q watched-q search-logs-q)

:mm 10

:fq neg-q

:fl ["db_id" "title" "release_date" "score"] ; Results: Fields

:rows 15

:now bond-release-date})))

50 of 67

3. Recommendations with collaborative filtering

51 of 67

Goal

  • Re-rank the very top recommendations
  • Based on the predicted rating a user would give the movie they are viewing
  • Predicted by a neural network model

52 of 67

MovieLens users and ratings

Ratings (1M)

[{:userId 44,

:movieId 141,

:rating 3.0,

:timestamp "869251861"}

...]

Users (5800)

[{:userId 44,

:gender 1.0,

:age 56.0,

:occupation 16.0}

...]

Get TMDB ID via links.csv

USER SIMILARITY

USER PREFERENCE

53 of 67

(defn build-features [store-name movies]

(mapcat identity

[(ltr/gen-coll-features movies :genres store-name)

(ltr/gen-coll-features movies :production_countries store-name)

(ltr/gen-coll-features movies :spoken_languages store-name)

[(ltr/gen-field-feature :popularity store-name)

(ltr/gen-field-feature :vote_average store-name)

(ltr/gen-field-feature :runtime store-name)

(ltr/gen-field-feature :revenue store-name)

(ltr/gen-field-feature :budget store-name)

(ltr/gen-external-value-feature :gender false store-name)

(ltr/gen-external-value-feature :occupation false store-name)

(ltr/gen-external-value-feature :age false store-name)]]))

Converted to boolean features

User profile features

54 of 67

(def movie-features (ml/build-features "tmdb_features" data/movies))

({:store "tmdb_features",

:name "hasGenresAction",

:class "org.apache.solr.ltr.feature.SolrFeature",

:params {:q "{!func}termfreq(genres,'action')"}}...

{:store "tmdb_features",

:name "popularity",

:class "org.apache.solr.ltr.feature.FieldValueFeature",

:params {:field "popularity"}}

{:store "tmdb_features",

:name "age",

:class "org.apache.solr.ltr.feature.ValueFeature"...

Solr feature store name

Boolean feature [0, 1]

Query for Solr to call to get 0 or 1 val
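
How one feature per genre could be generated, as a rough sketch. This is not Corona's `gen-coll-features`; `coll-features` is a hypothetical stand-in that only reproduces the shape of the feature maps shown on the slide.

```clojure
(require '[clojure.string :as str])

(defn coll-features
  "One SolrFeature definition per distinct value of a multi-valued field."
  [movies field store-name]
  (for [v (->> movies (mapcat field) distinct sort)]
    {:store  store-name
     :name   (str "has" (str/capitalize (name field)) (str/replace v #"\s+" ""))
     :class  "org.apache.solr.ltr.feature.SolrFeature"
     :params {:q (format "{!func}termfreq(%s,'%s')"
                         (name field)
                         (str/lower-case v))}}))

;; Two made-up movies are enough to produce three genre features.
(def sample-features
  (coll-features [{:genres ["Drama" "Action"]}
                  {:genres ["Action" "Comedy"]}]
                 :genres
                 "tmdb_features"))
```

Each distinct genre becomes its own feature, which is why the feature vector grows to hundreds of entries.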

55 of 67

(solr.ltr/upload-features! client-config movie-features)

(ml/extract-mov-features {:type :http :core :tmdb} "tmdb_features")

(def ds (ml/build-training-dataset))

(def normed-ds (ml/normalize-training-dataset ds))

(def shuffled-ds (shuffle normed-ds))

(def splitted-ds (ml/split-dataset shuffled-ds))

(defn upload-train-and-test-datasets! [splitted-dataset]

(spit (str data-dir "train.txt") (vec (:train splitted-dataset)))

(spit (str data-dir "test.txt") (vec (:test splitted-dataset))))

(upload-train-and-test-datasets! splitted-ds)

(cortex/train! (:train splitted-ds) (:test splitted-ds) 1000 "ltr-goa-nn")
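
The normalize and split steps can be pictured with the sketch below. It assumes min-max scaling per feature column and a simple head/tail split; the demo's own `ml/normalize-training-dataset` and `ml/split-dataset` may work differently, and both function names here are hypothetical.

```clojure
(defn min-max-normalize
  "Scales every feature column to the range 0..1 (constant columns become 0.0)."
  [rows]
  (let [cols (apply map vector (map :features rows))
        mins (mapv #(apply min %) cols)
        maxs (mapv #(apply max %) cols)]
    (map (fn [row]
           (update row :features
                   (fn [fs]
                     (mapv (fn [x lo hi]
                             (if (== lo hi)
                               0.0
                               (/ (- x lo) (double (- hi lo)))))
                           fs mins maxs))))
         rows)))

(defn split-dataset
  "Puts the first ratio of the rows in :train and the rest in :test."
  [ratio rows]
  (let [n (long (* ratio (count rows)))]
    {:train (take n rows) :test (drop n rows)}))

;; Made-up rows: one boolean feature and one age-like feature.
(def rows
  [{:score [5.0] :features [0.0 56.0]}
   {:score [3.0] :features [1.0 18.0]}
   {:score [4.0] :features [0.0 37.0]}
   {:score [2.0] :features [1.0 18.0]}])

(def normed (min-max-normalize rows))
(def parts (split-dataset 0.75 normed))
```

Scaling keeps large-valued features such as age or budget from dominating the boolean ones during training.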

56 of 67

[{:movieId 1945, :movieIdTMDB 654, :userId 2, :score [5.0],

:features [0.0

0.0

0.0

0.0

1.0

0.0

...

16.015598

...

1.0

56.0

16.0]} ...]

57 of 67

[{:movieId 1945, :movieIdTMDB 654, :userId 2, :score [5.0],

:features [0.0

0.0

0.0

0.0

1.0

0.0

0.0

16.015598

...

1.0

56.0

16.0]} ...]

300 movie features

3 user features

58 of 67

[{:movieId 1945, :movieIdTMDB 654, :userId 2, :score [5.0],

:features [0.0

0.0

0.0

0.0

1.0

0.0

0.0

16.015598

...

1.0

56.0

16.0]} ...]

[Diagram: neural network with layers 300+3 (inputs) -> 512 -> 128 -> 16 -> rating. Example: predicted rating 2.8, expected 5.]

300 movie features

3 user features

Neural network with 4 classifier layers

59 of 67

Online query:

(solr.query/query

client-config

{... :rq "{!ltr model=ltrGoaModel

reRankDocs=100

efi.gender=0

efi.age=60

efi.occupation=7}"})

Top X to re-rank

External Feature Information

60 of 67

Conclusion

...

61 of 67

Experiment conclusion

  • In Clojure: mostly, via the Corona wrapper; queries still use the Solr query DSL
  • Speed: results are fast, thanks to Solr
  • Relevancy: OK, better than what we could do by hand, but not perfect
  • Focus: query parser DSLs have a learning curve

62 of 67

Perspective

  • Need metrics, statistics, and a lot of (ideally automated) experiments with features, architecture, and training parameters.
  • We just didn't focus on this task.
  • And it is completely OK.

63 of 67

Credits

64 of 67

Thanks!

Any questions?

You can find me at:

  • https://github.com/leontalbot
  • https://github.com/Stylitics/corona
  • https://github.com/Stylitics/corona-demo
