Democratizing data at Kiwi.com

Challenge

  • Kiwi.com - a lot of data
  • Route combinations (Moscow -> Prague -> Barcelona)
  • 10^12 possible combinations

Challenge

  • ~30-50 people on the Analytics team
  • 10,000 dashboards, docs, questions, charts…
  • 2,000-3,000 employees at Kiwi.com

Solution

A Slack chatbot that provides all the necessary information through human-like interaction.

  • Instead of:

  • We have:

  • Instead of:

  • We have:

Main technology stack

  • Dialogflow (natural-language conversation platform)
  • Elasticsearch (text search database)
  • Neo4j (graph database)

Workflow

Dialogflow

Dialogflow - classifier

  1. Popular questions (how many bookings?)
  2. Small talk (how are you?)
  3. Unclassified questions -> we search Elasticsearch (sketch below)
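
A minimal sketch of that routing, assuming the older `dialogflow` Python client; `search_elasticsearch` is a hypothetical helper for the unclassified branch:

import dialogflow

def answer(project_id, session_id, text):
    # send the user's message to Dialogflow
    sessions = dialogflow.SessionsClient()
    session = sessions.session_path(project_id, session_id)
    query_input = dialogflow.types.QueryInput(
        text=dialogflow.types.TextInput(text=text, language_code='en'))
    result = sessions.detect_intent(session=session, query_input=query_input).query_result

    if result.intent.is_fallback:
        # 3. nothing matched -> fall back to full-text search
        return search_elasticsearch(text)   # hypothetical helper
    # 1. popular question or 2. small talk -> the answer comes from the intent itself
    return result.fulfillment_text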

Dialogflow - intents

  • Popular-question intents (How many bookings did we have today?)
    • Action: give the link directly
    • Cons: they have to be created manually

Dialogflow - small talk

Dialogflow - intents

  • Unclassified questions:
    • Search in the database (Elasticsearch)
    • Main topic of this presentation

Dialogflow - problems

  • Problem: it is difficult to create small-talk intents manually
    • > 50 intents
    • 5-10 training phrases for each one

Dialogflow - Excel smalltalk
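
The sheet itself is on the slide; a rough sketch of how such rows could be pushed through the Dialogflow API (the column names and the `;`-separated phrase format are assumptions):

import csv
import dialogflow

project_id = 'kiwi-bot'   # hypothetical project id
intents = dialogflow.IntentsClient()
parent = intents.project_agent_path(project_id)

with open('smalltalk.csv') as f:   # exported from the Excel sheet
    for row in csv.DictReader(f):
        phrases = [
            dialogflow.types.Intent.TrainingPhrase(
                parts=[dialogflow.types.Intent.TrainingPhrase.Part(text=phrase)])
            for phrase in row['training_phrases'].split(';')]
        message = dialogflow.types.Intent.Message(
            text=dialogflow.types.Intent.Message.Text(text=[row['response']]))
        intents.create_intent(parent, dialogflow.types.Intent(
            display_name=row['intent'],
            training_phrases=phrases,
            messages=[message]))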

Dialogflow - problems - training

Dialogflow - other problems

  • API limits
  • Docs

  • Now we can understand our user (more or less)
  • What’s next?

Databases

  • Elasticsearch to store the text data.
  • Neo4j to store relations between documents and users.
  • Elasticsearch: one of the best databases for full-text queries
  • Neo4j: a graph database, good for fast prototyping

Why do we even need graphs?

  1. Store connections
  2. Improve results from Elasticsearch
  3. Get insights from graphs
  4. See the data flow inside the company

Our case

  • Statistics:
    • Number of views of a document
    • Distinct people viewed a document
    • PageRank score for each document (popularity score)

Document model in ES

from datetime import datetime
from elasticsearch_dsl import DocType, Keyword, Text, Date, Nested

class DocumentElastic(DocType):
    uuid = Keyword()
    # full-text fields share the analyzed sub-fields defined in `default_fields`
    title = Text(fields=default_fields)
    ...
    description = Text(fields=default_fields)
    updated_at = Date()
    ...
    parameters = Nested(Parameter)
    ...
    graph_statistics = Nested(ResultType)

    class Index:
        name = 'documents'

    def is_up_to_date(self, last_updated: datetime):
        return self.updated_at >= last_updated
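
A short usage sketch; the connection setup is an assumption (the slides refer to it as `elastic.client`):

from elasticsearch_dsl.connections import connections

client = connections.create_connection(hosts=['localhost'])  # assumed connection

DocumentElastic.init(using=client)   # create the index and mapping
doc = DocumentElastic(uuid='...', title='Bookings dashboard')  # hypothetical document
doc.save(using=client)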

User model in Neo4j

from neomodel import StructuredNode, StringProperty, DateTimeProperty, RelationshipTo

class UserNeo(StructuredNode):
    uuid = StringProperty()
    email = StringProperty(unique_index=True)
    time_created = DateTimeProperty()

    # relationships to documents, each with its own relationship model
    created = RelationshipTo('DocumentNeo', 'CREATED', model=CreatedRelation)
    consumed = RelationshipTo('DocumentNeo', 'CONSUMED', model=ConsumedRelation)
    modified = RelationshipTo('DocumentNeo', 'MODIFIED', model=ModifiedRelation)

Document model in Neo4j

from neomodel import (StructuredNode, StringProperty, IntegerProperty,
                      FloatProperty, RelationshipTo)

class DocumentNeo(StructuredNode):
    uuid = StringProperty()
    source = StringProperty(required=True, index=True)
    source_id = StringProperty(required=True, index=True)

    # graph statistics, filled in by the Cypher jobs shown later
    views = IntegerProperty(default=0)
    people_viewed = IntegerProperty(default=0)
    page_rank = FloatProperty(default=0)

    created_by = RelationshipTo('UserNeo', 'CREATED_BY', model=CreatedRelation)
    consumed_by = RelationshipTo('UserNeo', 'CONSUMED_BY', model=ConsumedRelation)
    modified_by = RelationshipTo('UserNeo', 'MODIFIED_BY', model=ModifiedRelation)
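
A usage sketch, assuming `ConsumedRelation` is a `StructuredRel` with a `times_viewed` property (it is referenced later in the Cypher queries) and a standard neomodel connection:

from neomodel import config
config.DATABASE_URL = 'bolt://neo4j:password@localhost:7687'  # assumed connection string

user = UserNeo(uuid='...', email='jane.doe@kiwi.com').save()         # hypothetical user
doc = DocumentNeo(uuid='...', source='wiki', source_id='42').save()  # hypothetical document

# relationship properties go into the second argument of connect()
user.consumed.connect(doc, {'times_viewed': 1})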

ES + Neo4j - how to use both dbs?

  • At first we were using plugins
  • But plugins only work with ES v2.x

ES + Neo4j - interface to unite them

class Document:
    """Unites Elasticsearch and Neo4j, representing an entity in both databases.

    Entities are available by `uuid` or by the tuple `(source, source_id)`.
    """

    def __init__(self):
        self._elastic_doc: DocumentElastic
        self._neo4j_doc: DocumentNeo

    def __getattr__(self, name):
        # look the attribute up in the ES document first, then in the Neo4j node
        if name not in ('_elastic_doc', '_neo4j_doc'):
            try:
                return getattr(self._elastic_doc, name)
            except AttributeError:
                pass
            return getattr(self._neo4j_doc, name)
        return None

ES + Neo4j - some methods

    @staticmethod
    def get_by_source_id(source, source_id):
        doc = Document()
        doc._elastic_doc = ElasticQuery.get_doc_by_source_id(source, source_id)
        doc._neo4j_doc = NeoQuery.get_doc_by_source_id(source, source_id)
        return doc

    @staticmethod
    def get_by_uuid(uuid):
        doc = Document()
        doc._elastic_doc = ElasticQuery.get_doc_by_uuid(uuid)
        doc._neo4j_doc = NeoQuery.get_doc_by_uuid(uuid)
        return doc

    def is_up_to_date(self, last_updated: datetime):
        return self._elastic_doc.is_up_to_date(last_updated)
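
Usage then looks like this; attribute access is transparently delegated by `__getattr__`:

doc = Document.get_by_uuid(uuid)
doc.title         # resolved from the Elasticsearch document
doc.page_rank     # not in ES, falls through to the Neo4j node
doc.is_up_to_date(last_updated)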

  • So far we have:
  • Discovered Dialogflow
  • Learned how to use Elasticsearch + Neo4j together

Elasticsearch-dsl - query examples

  • Filtering by field and limiting the results:

DocumentElastic \
    .search(index='documents', using=elastic.client) \
    .query('bool', filter=[Q('term', source=source)]) \
    .fields(['source_id'])[:limit] \
    .execute()

  • Getting a single document by id:

DocumentElastic.get(id=uuid, using=elastic.client, index='documents')

Elasticsearch - word order

  • Query: “bookings last year”
  • 1) “Average amount of bookings for last year”
  • 2) “Last bookings of the previous year”

Elasticsearch - word order

  • 2 separately analyzed fields (see the query sketch below):
  • single words: “last”, “year”
  • shingles (word combinations): “last year”, “number of bookings”
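
A hedged example of querying both sub-fields together with elasticsearch-dsl; the `^2` boost on the shingle fields is an assumed value, not taken from the slides:

from elasticsearch_dsl import Q

q = Q('multi_match',
      query='bookings last year',
      type='most_fields',
      fields=['title.default', 'title.shingles^2',
              'description.default', 'description.shingles^2'])

DocumentElastic.search(index='documents', using=elastic.client).query(q)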

Elasticsearch - analyzers

from elasticsearch_dsl import analyzer, Text

# both analyzers strip HTML, lowercase, stem, apply synonyms and drop stopwords;
# `shingles` additionally combines adjacent tokens via `shingle_filter`
root = analyzer(
    'root',
    type='custom',
    tokenizer='standard',
    char_filter=['html_strip'],
    filter=[english_possessive_stemmer, synonyms_case_sensitive, 'lowercase',
            synonyms_lowercase, english_stop, english_stemmer])

shingles = analyzer(
    'shingles',
    type='custom',
    tokenizer='standard',
    char_filter=['html_strip'],
    filter=[english_possessive_stemmer, synonyms_case_sensitive, 'lowercase',
            synonyms_lowercase, english_stop, english_stemmer, shingle_filter])

default_fields = {
    'default': Text(analyzer=root),
    'shingles': Text(analyzer=shingles)
}
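
`shingle_filter` is referenced above but not shown; a typical definition (settings assumed) that produces two-word shingles would be:

from elasticsearch_dsl import token_filter

shingle_filter = token_filter(
    'shingle_filter',
    type='shingle',
    min_shingle_size=2,
    max_shingle_size=2,        # only word pairs
    output_unigrams=False)     # single words stay in the `default` field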

Neo4j

  • Uses an SQL-inspired query language: Cypher

Neo4j - Graph statistics

  • Count the views:

from neomodel import db

db.cypher_query('''
    MATCH (doc:DocumentNeo)-[rel:CONSUMED_BY]-(user:UserNeo)  // match documents with their viewers
    WITH doc, sum(rel.times_viewed) AS views                  // aggregate views per document
    SET doc.views = views                                     // store the statistic on the node
''')
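
The other statistics from the earlier slide can be filled the same way; a sketch for the distinct-viewers count:

db.cypher_query('''
    MATCH (doc:DocumentNeo)-[:CONSUMED_BY]-(user:UserNeo)
    WITH doc, count(DISTINCT user) AS people
    SET doc.people_viewed = people
''')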

PageRank

  • A mathematical formula that judges the “value of a page” (see below)
  • Still used in Google’s search engine
  • Simple and cool
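
For reference, the classic formula (not on the slide): with damping factor d, N documents, B(p) the set of pages linking to p, and L(q) the number of outgoing links of q:

PR(p) = \frac{1 - d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}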

Neo4j - trick to project bipartite graph

  • How do we calculate PageRank if we have a bipartite graph? (one possible projection is sketched below)
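
The actual trick is on the diagram slides; one common approach (a sketch, not necessarily the exact one used here) is to project the user-document graph onto documents only and run PageRank on that projection. `RELATED` is a hypothetical relationship type:

db.cypher_query('''
    MATCH (d1:DocumentNeo)-[:CONSUMED_BY]->(u:UserNeo)<-[:CONSUMED_BY]-(d2:DocumentNeo)
    WHERE id(d1) < id(d2)
    MERGE (d1)-[:RELATED]->(d2)   // two documents are linked if the same user viewed both
''')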

Elasticsearch - Function score

  • TF_IDF_SCORE * ln(page_rank) * log10(number_of_views) * Gauss_filter

    • ln(page_rank)
      • 0 < page_rank < 10
      • 1 < multiplier < 3
    • log10(number_of_views)
      • 0 < number_of_views < 10000
      • 1 < multiplier < 3
    • Gauss_filter
      • Penalize docs which were updated > 1 year ago

Elasticsearch-dsl - Function score

import datetime
from elasticsearch_dsl.query import FunctionScore

query = FunctionScore(
    query=query,
    functions=[
        dict(  # Gauss multiplier: decay for documents not updated recently
            gauss={
                'updated_at': {
                    'origin': datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S'),
                    'offset': '365d',
                    'scale': '700d'
                }
            }
        ),
        dict(  # Multipliers computed from the graph features
            script_score=dict(script=dict(
                source=score_script,
                params=dict(
                    pg_offset=1,
                    pg_multiplier=1,
                    vw_offset=1,
                    vw_multiplier=0.2
                ),
            )))])
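
`score_script` is not shown on the slides; a hypothetical Painless script consistent with the parameters above might look like the following (the field names are placeholders, the real mapping nests them under `graph_statistics`; the `+ 1` guards against zero values):

score_script = """
    double pg = doc['page_rank'].value;   // placeholder field name
    double vw = doc['views'].value;       // placeholder field name
    return (params.pg_offset + params.pg_multiplier * Math.log(pg + 1))
         * (params.vw_offset + params.vw_multiplier * Math.log10(vw + 1));
"""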

Are the results good?

Future plans

  • Gather feedback and statistics
  • Replace Neo4j
  • Our own NLP model instead of Dialogflow

Thank you!

Questions?