You may also like
Solr Clojure ML & Movies
Hello!
I am Leon Talbot
Senior Software Engineer at Stylitics
@leontalbot
Stylitics, Inc.
We are hiring!
Movie Recommendation Engine
Input
Output
You may also like...
[Diagram: grid of movies, Mov 1 to Mov 10]
Goal
Starter kit
Agenda
POC
[Diagram: input 1. Movie metadata; DB of 1 million movies; recommender (?); Results: 10 movies]
POC
[Diagram: inputs 1. Movie metadata and 2. User prefs; DB of 1 million movies; recommender (?); Results: 10 movies]
POC
[Diagram: inputs 1. Movie metadata, 2. User profile, and 3. User profiles and movie ratings; DB of 1 million movies; recommender (?); Results: 10 movies]
POC
[Diagram: the same inputs (1. Movie metadata, 2. User pref, 3. User profiles and movie ratings), DB of 1 million movies, recommender (?), and Results]
Lots of content to look at
Robust content-based filtering
Goals
vs
French Police
[Diagram: Movie 1 to Movie 8, each a list of words, stored in the DB]
DB Index
Inverted index
french → Movie 1 (x 2), Movie 5 (x 2), Movie 8 (x 1)
police → Movie 2 (x 2), Movie 5 (x 2), Movie 6 (x 1), Movie 7 (x 1)
Fast ranking
TF * IDF
Term Frequency * Inverse Document Frequency
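The inverted index and TF * IDF ranking pictured above can be sketched in a few lines of plain Clojure. This is a toy illustration only, not how Solr scores documents (Solr 8 defaults to BM25, a refinement of TF-IDF). The corpus text and all function names are made up; only the term counts for "french" and "police" mirror the diagram.

```clojure
(require '[clojure.string :as string])

;; Hypothetical toy corpus: movie id -> text. Term counts mirror the slide.
(def movies
  {1 "french french comedy"
   2 "police police drama"
   5 "french police french police thriller"
   6 "police academy"
   7 "police story"
   8 "french kiss"})

(defn tokenize [text]
  (string/split (string/lower-case text) #"\s+"))

(defn inverted-index
  "Builds {term {movie-id term-frequency}} from {movie-id text}."
  [docs]
  (reduce-kv
   (fn [index id text]
     (reduce (fn [idx [term tf]] (assoc-in idx [term id] tf))
             index
             (frequencies (tokenize text))))
   {}
   docs))

(defn idf
  "Inverse document frequency: rarer terms weigh more."
  [index n-docs term]
  (Math/log (/ (double n-docs) (count (get index term)))))

(defn score
  "Sums tf * idf per movie over the query terms found in the index.
  Returns [movie-id score] pairs, best first."
  [index n-docs query]
  (let [terms (filter index (tokenize query))]
    (->> (for [term terms
               [id tf] (get index term)]
           {id (* tf (idf index n-docs term))})
         (apply merge-with +)
         (sort-by val >))))

(def index (inverted-index movies))

;; Movie 5 mentions both terms twice, so it ranks first.
(score index (count movies) "french police")
```

Only the movies listed under the query terms are ever touched, which is what keeps matching fast on a large collection.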
Indexing + Matching + Ranking
No indexing
Matching
Indexing...
Offline
Live
Matching...
1k movies
Live
Search engine?
POC
[Diagram: the same POC inputs (1. Movie metadata, 2. User prefs, 3. User profiles and movie ratings), DB of 1 million movies, and Results]
Prefs are more movie content = still content-based filtering
POC
[Diagram: the same POC inputs, DB of 1 million movies, and Results]
Collaborative filtering
Ranking: 1 million movies → Matching... → Ranking... → Top 10
Re-ranking: 1 million movies → Matching... → Top 100 movies → Re-ranking... (ML model) → Top 10
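The two-stage idea on this slide (cheap ranking over everything, expensive re-ranking over the best candidates only) can be sketched as follows. This is a minimal sketch with made-up data and hypothetical function names; the `:text-score` and `:model-score` keys stand in for the search-engine score and the ML model's prediction.

```clojure
(defn top-n
  "Keeps the n items with the highest (score-fn item)."
  [n score-fn items]
  (take n (sort-by score-fn > items)))

(defn recommend
  "Stage 1: cheap score over all movies. Stage 2: model score over the survivors."
  [{:keys [cheap-score model-score candidates-n results-n]} movies]
  (->> movies
       (top-n candidates-n cheap-score)
       (top-n results-n model-score)))

;; Made-up data: 1000 movies with two arbitrary, distinct scores each.
(def movies
  (for [i (range 1000)]
    {:id i
     :text-score (mod (* i 37) 1000)
     :model-score (mod (* i 91) 1000)}))

(def top-10
  (recommend {:cheap-score :text-score
              :model-score :model-score
              :candidates-n 100
              :results-n 10}
             movies))
```

The model only ever sees 100 movies here, so its cost does not grow with the size of the collection.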
2 months ago,
kept stumbling upon...
Apache Solr
POC
[Diagram: the same POC inputs (1. Movie metadata, 2. User profile, 3. User profiles and movie ratings), DB of 1 million movies, and Results]
Goals
In Clojure?
github.com/Stylitics/corona
Solr 8 in Clojure!
Companies
Who is using Solr?
Companies
Who is using Corona?
Experiment with us!
github.com/Stylitics/corona-demo
1. Basic content-based recommendations
Movies dataset (TMDB 5000)
{:db_id "42151",
:runtime "89",
:budget "31192",
:vote_count "26",
:vote_average "6.3",
:popularity "1.330379",
:revenue "10000",
:release_date "2009-09-01",
:status "Released",
:genres ["Drama" "Action" "Comedy"],
:title "Down Terrace",
:tagline "You're only as …",
:overview "After serving jail time...",
:original_language "en",
:keywords ["murder" "dark comedy" ...],
:director "Ben Wheatley",
:spoken_languages ["en"],
:production_countries ["GB"]}
Server config
____solr-8.0.0
|____bin
|____contrib
|____docs
|____dist
|____example
|____server
| |____solr
| | |____zoo.cfg
| | |____solr.xml
| |____logs
| | |____solr.log
| |____start.jar
|____solr.xml
debug
Copy to $SOLR_HOME
App config
____resources
|____solr
| |____tmdb
| | |____core.properties
| | |____data
| | |____conf
| | | |_____schema_model-store.json
| | | |____managed-schema
| | | |____elevate.xml
| | | |____mapping-FoldToASCII.txt
| | | |_____schema_feature-store.json
| | | |____protwords.txt
| | | |____currency.xml
| | | |____synonyms.txt
| | | |____mapping-ISOLatin1Accent.txt
| | | |____spellings.txt
| | | |____solrconfig.xml
| | | |____stopwords.txt
INDEX
Start
$ cd corona-demo
$ $SOLR_HOME/bin/solr start
$ lein repl
(def core-dir (str (System/getProperty "user.dir") "/resources/solr/tmdb"))
(def client-config {:core :tmdb})
;; clear core config and set it back based on solrconfig.xml
(solr.core/delete! client-config {:deleteIndex true})
(solr.core/create! client-config {:instanceDir core-dir})
Schemas
(first data/movies) ; inspect movies
;; add schema fields missing from managed-schema conf file.
(solr.schema/get-field-types client-config)
(solr.schema/add-field-type! client-config {...})
(solr.schema/get-fields client-config)
(solr.schema/add-field! client-config {}) ; {} or [{}]
(doseq [n (map :name data/content-fields)]
  (solr.schema/update-field!
   client-config
   {:add-copy-field {:source n :dest "_text_"}})) ; default search field
Schemas: field types
{:name "text_en_splitting"
 :class "solr.TextField"
 ...
 :indexAnalyzer {:tokenizer {:class "solr.WhitespaceTokenizerFactory"}
                 :filters []}
 :queryAnalyzer {:type "query"
                 :tokenizer {:class "solr.WhitespaceTokenizerFactory"}
                 :filters []}}
Schemas: field types
:indexAnalyzer
[{:class "solr.StopFilterFactory" ...} ;; Removes stop words
{:class "solr.WordDelimiterGraphFilterFactory"} ;; 'wi fi'='WiFi'= 'wi-fi'
{:class "solr.LowerCaseFilterFactory"} ;; Lowercases content
{:class "solr.KeywordMarkerFilterFactory"} ;; Protected from stemming
{:class "solr.PorterStemFilterFactory"} ;; "jump"="jumping"="jumped"
{:class "solr.FlattenGraphFilterFactory"}] ;; needed for synonyms
Schemas: field types
:queryAnalyzer
[{:class "solr.SynonymGraphFilterFactory"} ;; Adds synonyms at query time
{:class "solr.StopFilterFactory" ...} ;; Removes stop words
{:class "solr.WordDelimiterGraphFilterFactory"} ;; 'wi fi'='WiFi'= 'wi-fi'
{:class "solr.LowerCaseFilterFactory"} ;; Lowercases content
{:class "solr.KeywordMarkerFilterFactory"} ;; Protected from stemming
{:class "solr.PorterStemFilterFactory"}] ;; "jump"="jumping"="jumped"
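To get a feel for what such a filter chain does to text, here is a deliberately naive sketch in plain Clojure. It is not Solr's implementation: the stop-word list is made up, the stemmer strips a few suffixes instead of running the Porter algorithm, the order of steps is simplified, and word-delimiter and synonym handling are left out.

```clojure
(require '[clojure.string :as string])

;; Hypothetical stop-word list; Solr reads its own from stopwords.txt.
(def stop-words #{"a" "an" "the" "of" "and"})

(defn naive-stem
  "Strips a few common suffixes. NOT the Porter algorithm."
  [token]
  (string/replace token #"(ing|ed|s)$" ""))

(defn analyze
  "Lowercases, splits on whitespace, drops stop words, then stems."
  [text]
  (->> (string/split (string/lower-case text) #"\s+")
       (remove stop-words)
       (map naive-stem)))

;; All three word forms collapse to the same indexed token.
(analyze "The jumping of jumped Jump")
```

Because the same kind of chain runs at index time and at query time, a search for "jumping" can match a document that only contains "jumped".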
Index
(solr.index/clear! client-config {:commit true})
(solr.index/add! client-config data/movies {:commit true})
Normal Search Query
;; Get me bond movies
(solr.query/query
client-config
{:q "bond"
:fl ["db_id" "title" "score"] ; Results: select-keys
:rows 10}) ; Results: Nbr of docs
{:responseHeader
{:status 0,
:QTime 1,
:params {:q "bond",
:fl "db_id,title,score",
:rows "10"}},
:response
{:numFound 101,
:start 0,
:maxScore 2.9382193,
:docs [...]
(map (juxt :db_id :title :score) docs)
[["36670" "Never Say Never Again" 2.9382193]
["36669" "Die Another Day" 2.8057525]
["36643" "The World Is Not Enough" 2.732932]
["668" "On Her Majesty's Secret Service" 2.5980718]
["700" "Octopussy" 2.5700123]
["37707" "Splice" 2.5592384]
["253" "Live and Let Die" 2.469219]
["709" "Licence to Kill" 2.4265428]
["710" "GoldenEye" 2.4265428]
["646" "Dr. No" 2.4265428]]
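As a small sketch of working with that response shape, the rows can be pulled out with `get-in` and `juxt`. The response below is a hand-trimmed copy of the one shown above (two docs only), and `result-rows` and `more-results?` are hypothetical helper names.

```clojure
;; Hand-trimmed response, shaped like the one returned by the query above.
(def response
  {:responseHeader {:status 0 :QTime 1}
   :response {:numFound 101
              :start 0
              :maxScore 2.9382193
              :docs [{:db_id "36670" :title "Never Say Never Again" :score 2.9382193}
                     {:db_id "36669" :title "Die Another Day" :score 2.8057525}]}})

(defn result-rows
  "Returns [db_id title score] vectors, in ranking order."
  [resp]
  (mapv (juxt :db_id :title :score) (get-in resp [:response :docs])))

(defn more-results?
  "True when Solr matched more docs than this page returned."
  [resp]
  (let [{:keys [numFound start docs]} (:response resp)]
    (> numFound (+ start (count docs)))))

(result-rows response)
```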
More search queries...
;; Get me the Daniel Craig ones
{:q "bond cast:\"Daniel Craig\""}
;; from the last 20 years ("filter query")
{:fq "release_date:[NOW-20YEARS TO NOW]"}
;; prefer recent (boosting fn)
{:defType "edismax"
 :bf ["recip(ms(NOW/HOUR,release_date),3.16e-11,1,1)^5"]}
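The boosting function is easier to read once the arithmetic is spelled out. Solr's `recip(x,m,a,b)` computes `a / (m*x + b)`, and `3.16e-11` is roughly one divided by the number of milliseconds in a year, so the value halves for a one-year-old movie. A minimal sketch of the same arithmetic in Clojure (the function names here are illustrative, not part of Corona):

```clojure
(defn recip
  "Same formula as Solr's recip(x,m,a,b): a / (m*x + b)."
  [x m a b]
  (/ a (+ (* m x) b)))

(def ms-per-year (* 365.25 24 60 60 1000.0))

(defn recency-boost
  "Boost for a movie released age-in-years ago, with the slide's constants."
  [age-in-years]
  (recip (* age-in-years ms-per-year) 3.16e-11 1 1))

;; Brand new: 1.0. One year old: about 0.5. Ten years old: below 0.1.
(map recency-boost [0 1 10])
```

The `^5` suffix in the `:bf` value then scales this number before it is added to the relevance score.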
1st: More Like This
{:db_id "206647"
:title "Spectre"
:release_date #inst"2015-10-26..."
...}
1st: MLT: the query
(solr.query/query-mlt
 client-config
 {:q "db_id:206647" ; "this" matched by id
  :mlt.fl ["overview" "genres" "title" "keywords" ; interesting terms come from these fields
           "production_companies" "production_countries"
           "spoken_languages" "director"]
  :mlt.mintf "1" ; min Term Frequency below which terms are ignored
  :mlt.mindf "3" ; min Document Frequency below which terms are ignored
  :mlt.boost "true" ; interesting terms tf-idf score as 1st boost
  :mlt.qf [["genres" 10] ["overview" 6] ["title" 3]] ; 2nd boost we control
  :fl ["db_id" "title" "release_date" "score"]})
[["36670" "Never Say Never Again" 2.9382193]
["36669" "Die Another Day" 2.8057525]
["36643" "The World Is Not Enough" 2.732932]
["668" "On Her Majesty's Secret Service" 2.5980718]
["700" "Octopussy" 2.5700123]
["37707" "Splice" 2.5592384]
["253" "Live and Let Die" 2.469219]
["709" "Licence to Kill" 2.4265428]
["710" "GoldenEye" 2.4265428]
["646" "Dr. No" 2.4265428]]
{:db_id "206647"
:title "Spectre"
:release_date "2015..."}
“Bond” is not in any titles!
:mlt.interestingTerms
[["keywords" "sequel" 1.0]
["overview" "uncov" 1.0249952]
["overview" "truth" 1.0323821]
["overview" "reveal" 1.048014]
["overview" "send" 1.06495]
["overview" "polit" 1.080229]
["overview" "organ" 1.0899804]
["overview" "aliv" 1.1073512]
["keywords" "servic" 1.1344733]
["keywords" "spy" 1.1344733]
["overview" "servic" 1.1610042]
["keywords" "british" 1.1756936]
["overview" "trail" 1.2087036]
["overview" "sinist" 1.227466]
["overview" "terribl" 1.2341261]
["keywords" "unit" 1.2794498]
["overview" "messag" 1.2794498]
["keywords" "kingdom" 1.306572]
["overview" "deceit" 1.3749144]
["production_companies" "danjaq" 1.4786706]
["director" "sam mendes" 1.5025941]
["overview" "cryptic" 1.5980587]
["overview" "spectr" 1.6433824]
["keywords" "secret" 1.7472144]
["overview" "bond" 3.212773]]
Bond!
1st: More Like This
Conclusions
;; Results are really good!
;; Could recognize "bond" movies from the word "bond" in the overview field
;; THOUGH:
;; Some movies are old
;; I am not looking for a secret agent movie as such
;; I was looking for a good action movie, implying spaceship
;; races or car races
MLT: cannot make good use of timestamps
MLT: only works from interesting terms
2. Content-based filtering with more awareness
2nd pass: MLT “aware”...
;; Here are my past searches, from before I watched Spectre
(def past-search-queries
["star wars" "car" "fast car race" "bond"])
(def user-last-watched-movies
[{:title "Furious 7", :db_id "168259"}
{:title "Hidden Away", :db_id "258755"}
{:title "Need for Speed", :db_id "136797"}])
2nd: MLT with user prefs and num. fields
(let [watched-ids-str (string/join " " (map :db_id user-last-watched-movies))
      bond-release-date (inst-ms (:release_date bond-spectre-movie))
      this-id (:db_id bond-spectre-movie)
      mlt-q (str "{!mlt mintf=1 mindf=3 boost=true qf=overview}" this-id)
      watched-q (str "{!mlt mintf=1 mindf=3 qf=overview}" watched-ids-str)
      search-logs-q (string/join " " past-search-queries)
      recent-boost-prefix "{!boost b=recip(sub(${now},ms(release_date)),3.16e-11,1,1)}"
      neg-q (format "-db_id:(%s %s)" this-id watched-ids-str)]
2nd: MLT with user prefs and num. fields
  ...
  (solr.query/query
   client-config
   {:defType "lucene"
    :q (format "%s (%s) (%s) (%s)" recent-boost-prefix mlt-q watched-q search-logs-q)
    :mm 10
    :fq neg-q
    :fl ["db_id" "title" "release_date" "score"] ; Results: Fields
    :rows 15
    :now bond-release-date}))
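To see what actually reaches Solr, the plain-string parts of that query can be evaluated on their own at the REPL. The sketch below reuses the data from the slides and hard-codes the Spectre id instead of reading it from `bond-spectre-movie`.

```clojure
(require '[clojure.string :as string])

(def past-search-queries
  ["star wars" "car" "fast car race" "bond"])

(def user-last-watched-movies
  [{:title "Furious 7", :db_id "168259"}
   {:title "Hidden Away", :db_id "258755"}
   {:title "Need for Speed", :db_id "136797"}])

(def this-id "206647") ; Spectre, hard-coded for this sketch

(def watched-ids-str
  (string/join " " (map :db_id user-last-watched-movies)))

;; Past searches become one bag of optional query terms.
(def search-logs-q (string/join " " past-search-queries))

;; The filter query excludes Spectre itself and everything already watched.
(def neg-q (format "-db_id:(%s %s)" this-id watched-ids-str))
```

Putting the exclusion in `:fq` keeps it out of scoring: it only removes documents, it never changes how the remaining ones rank.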
3. Recommendations with collaborative filtering
Goal
MovieLens Users and Ratings
Ratings (1M)
[{:userId 44,
:movieId 141,
:rating 3.0,
:timestamp "869251861"}
...]
Users (5800)
[{:userId 44,
:gender 1.0,
:age 56.0,
:occupation 16.0}
...]
Get TMDB ID via links.csv
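A minimal sketch of that lookup, assuming the standard MovieLens `links.csv` layout (`movieId,imdbId,tmdbId`). The CSV excerpt is illustrative: only the 1945 → 654 pair appears in these slides, and the other values are placeholders. The function names are hypothetical.

```clojure
(require '[clojure.string :as string])

;; Illustrative excerpt; imdbId values are placeholders.
(def links-csv
  "movieId,imdbId,tmdbId\n1945,0000001,654\n141,0000002,\n")

(defn parse-links
  "Returns {movieId tmdbId}, skipping rows without a TMDB id."
  [csv]
  (into {}
        (for [line (rest (string/split-lines csv))
              :let [[movie-id _imdb-id tmdb-id] (string/split line #",")]
              :when (seq tmdb-id)]
          [(Long/parseLong movie-id) (Long/parseLong tmdb-id)])))

(defn with-tmdb-id
  "Adds :movieIdTMDB to each rating; drops ratings with no known TMDB id."
  [links ratings]
  (keep (fn [{:keys [movieId] :as rating}]
          (when-let [tmdb-id (links movieId)]
            (assoc rating :movieIdTMDB tmdb-id)))
        ratings))

(def links (parse-links links-csv))

(with-tmdb-id links [{:userId 2 :movieId 1945 :rating 5.0}
                     {:userId 44 :movieId 141 :rating 3.0}])
```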
USER SIMILARITY
USER PREFERENCE
(defn build-features [store-name movies]
  (mapcat identity
          [(ltr/gen-coll-features movies :genres store-name)
           (ltr/gen-coll-features movies :production_countries store-name)
           (ltr/gen-coll-features movies :spoken_languages store-name)
           [(ltr/gen-field-feature :popularity store-name)
            (ltr/gen-field-feature :vote_average store-name)
            (ltr/gen-field-feature :runtime store-name)
            (ltr/gen-field-feature :revenue store-name)
            (ltr/gen-field-feature :budget store-name)
            (ltr/gen-external-value-feature :gender false store-name)
            (ltr/gen-external-value-feature :occupation false store-name)
            (ltr/gen-external-value-feature :age false store-name)]]))
Converted to boolean features
User profile features
(def movie-features (ml/build-features "tmdb_features" data/movies))
({:store "tmdb_features",
  :name "hasGenresAction",
  :class "org.apache.solr.ltr.feature.SolrFeature",
  :params {:q "{!func}termfreq(genres,'action')"}}...
 {:store "tmdb_features",
  :name "popularity",
  :class "org.apache.solr.ltr.feature.FieldValueFeature",
  :params {:field "popularity"}}
 {:store "tmdb_features",
  :name "age",
  :class "org.apache.solr.ltr.feature.ValueFeature"...
Solr feature store name
Boolean feature [0, 1]
Query that Solr runs to get a 0 or 1 value
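To make the "converted to boolean features" step concrete, here is a hypothetical stand-in for `ltr/gen-coll-features`. It is a sketch of the idea only (one feature per distinct value of a collection field), not Corona's actual code, and its naming scheme is only known to match the `hasGenresAction` example shown above.

```clojure
(require '[clojure.string :as string])

(defn coll-features
  "Hypothetical stand-in for ltr/gen-coll-features: emits one SolrFeature
  definition per distinct value found in a collection field."
  [movies field store-name]
  (for [v (sort (distinct (mapcat field movies)))]
    {:store store-name
     :name (str "has" (string/capitalize (name field)) (string/replace v #"\s+" ""))
     :class "org.apache.solr.ltr.feature.SolrFeature"
     :params {:q (format "{!func}termfreq(%s,'%s')"
                         (name field)
                         (string/lower-case v))}}))

(def sample-movies
  [{:genres ["Drama" "Action"]}
   {:genres ["Action" "Comedy"]}])

(def genre-features (coll-features sample-movies :genres "tmdb_features"))
```

Each feature carries a tiny function query that Solr evaluates per document, which is how a multi-valued text field ends up as a row of 0/1 values in the training data.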
(solr.ltr/upload-features! client-config movie-features)
(ml/extract-mov-features {:type :http :core :tmdb} "tmdb_features")
(def ds (ml/build-training-dataset))
(def normed-ds (ml/normalize-training-dataset ds))
(def shuffled-ds (shuffle normed-ds))
(def splitted-ds (ml/split-dataset shuffled-ds))
(defn upload-train-and-test-datasets! [splitted-dataset]
  (spit (str data-dir "train.txt") (vec (:train splitted-dataset)))
  (spit (str data-dir "test.txt") (vec (:test splitted-dataset))))
(upload-train-and-test-datasets! splitted-ds)
(cortex/train! (:train splitted-ds) (:test splitted-ds) 1000 "ltr-goa-nn")
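The slides do not show what `ml/normalize-training-dataset` and `ml/split-dataset` do internally. As a rough sketch of what such steps commonly look like, assuming min-max scaling per feature column and a simple ratio split (both assumptions, with hypothetical function names):

```clojure
(defn min-max-normalize
  "Rescales every feature column to [0, 1]; constant columns become 0.0.
  An assumed approach, not necessarily what the demo's ml namespace does."
  [rows]
  (if (empty? rows)
    rows
    (let [columns (apply map vector (map :features rows))
          mins (mapv #(apply min %) columns)
          maxs (mapv #(apply max %) columns)
          scale (fn [x lo hi]
                  (if (== lo hi)
                    0.0
                    (/ (- x lo) (double (- hi lo)))))]
      (mapv (fn [row] (update row :features #(mapv scale % mins maxs)))
            rows))))

(defn split-by-ratio
  "Puts the first train-ratio share of rows in :train, the rest in :test."
  [rows train-ratio]
  (let [n (long (* train-ratio (count rows)))]
    {:train (vec (take n rows))
     :test (vec (drop n rows))}))

(def sample-rows
  [{:userId 2 :score [5.0] :features [0.0 10.0 1.0]}
   {:userId 3 :score [3.0] :features [1.0 20.0 1.0]}
   {:userId 4 :score [4.0] :features [0.5 30.0 1.0]}])

(def normed (min-max-normalize sample-rows))
```

Scaling matters here because raw features mix 0/1 flags with values such as popularity or age, and unscaled inputs tend to dominate training.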
[{:movieId 1945, :movieIdTMDB 654, :userId 2, :score [5.0],
:features [0.0
0.0
0.0
0.0
1.0
0.0
...
16.015598
...
1.0
56.0
16.0]} ...]
300 movie features
3 user features
Neural network with 4 classifier layers
[Diagram: 300 movie features + 3 user features → 512 → 128 → 16 → Rating]
Predicted rating: 2.8, expected: 5
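For a sense of the model's size, the number of trainable parameters can be worked out from the layer widths. This assumes fully connected layers with biases and a single output unit for the rating, neither of which is stated on the slide.

```clojure
;; Layer widths from the diagram, plus an assumed single output unit.
(def layer-sizes [303 512 128 16 1])

(defn dense-params
  "Weights plus biases for one fully connected layer."
  [[n-in n-out]]
  (+ (* n-in n-out) n-out))

(def total-params
  (reduce + (map dense-params (partition 2 1 layer-sizes))))
;; 155648 + 65664 + 2064 + 17 = 223393
```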
Online query:
(solr.query/query
 client-config
 {...
  :rq "{!ltr model=ltrGoaModel
            reRankDocs=100
            efi.gender=0
            efi.age=60
            efi.occupation=7}"})
Top X to re-rank
External Feature Information
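In practice the `efi.*` values come from the current user's profile, so the re-rank string is usually assembled at request time. A small sketch of such a helper (`ltr-rq` is a hypothetical name, not a Corona function):

```clojure
(require '[clojure.string :as string])

(defn ltr-rq
  "Builds a Solr LTR re-rank query string from a model name, the number
  of top docs to re-rank, and a map of external feature values."
  [model rerank-docs efi]
  (str "{!ltr model=" model
       " reRankDocs=" rerank-docs
       (string/join (for [[k v] (sort-by key efi)]
                      (str " efi." (name k) "=" v)))
       "}"))

(ltr-rq "ltrGoaModel" 100 {:gender 0 :age 60 :occupation 7})
;; => "{!ltr model=ltrGoaModel reRankDocs=100 efi.age=60 efi.gender=0 efi.occupation=7}"
```

The keys are sorted only to make the output deterministic.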
Conclusion
...
Experiment conclusion
Perspective
Credits
Thanks!
Any questions?
You can find me at: