Democratizing data at Kiwi.com

Challenge

  • Kiwi.com - a lot of data
  • Route combinations (Moscow -> Prague -> Barcelona)
  • 10^12 possible combinations

Challenge

  • ~30-50 people on the Analytics team
  • 10,000 dashboards, docs, questions, charts…
  • 2,000-3,000 employees at Kiwi.com

Solution

A Slack chatbot that provides all the necessary information through human-like interaction.

  • Instead of:

  • We have:

  • Instead of:

  • We have:

Main technology stack

  • Dialogflow (natural-language conversation platform)
  • Elasticsearch (text search database)
  • Neo4j (graph database)

Workflow

Dialogflow

Dialogflow - classifier

  1. Popular questions (how many bookings?)
  2. Small talk (how are you?)
  3. Unclassified questions -> we search Elasticsearch (sketch below)
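
A minimal sketch of that routing, assuming the older `dialogflow` Python client; `search_elasticsearch` is a hypothetical helper for the unclassified branch:

import dialogflow

def answer(project_id, session_id, text):
    # send the user's message to Dialogflow
    sessions = dialogflow.SessionsClient()
    session = sessions.session_path(project_id, session_id)
    query_input = dialogflow.types.QueryInput(
        text=dialogflow.types.TextInput(text=text, language_code='en'))
    result = sessions.detect_intent(session=session, query_input=query_input).query_result

    if result.intent.is_fallback:
        # 3. nothing matched -> fall back to full-text search
        return search_elasticsearch(text)   # hypothetical helper
    # 1. popular question or 2. small talk -> the answer comes from the intent itself
    return result.fulfillment_text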

Dialogflow - intents

  • Popular-question intents (How many bookings did we have today?)
    • Action: give the link directly
    • Cons: they have to be created manually

Dialogflow - small talk

Dialogflow - intents

  • Unclassified questions:
    • Search in the database (Elasticsearch)
    • Main topic of this presentation

Dialogflow - problems

  • Problem: it is difficult to create small-talk intents manually
    • > 50 intents
    • 5-10 training phrases for each one

Dialogflow - Excel smalltalk
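
The sheet itself is on the slide; a rough sketch of how such rows could be pushed through the Dialogflow API (the column names and the `;`-separated phrase format are assumptions):

import csv
import dialogflow

project_id = 'kiwi-bot'   # hypothetical project id
intents = dialogflow.IntentsClient()
parent = intents.project_agent_path(project_id)

with open('smalltalk.csv') as f:   # exported from the Excel sheet
    for row in csv.DictReader(f):
        phrases = [
            dialogflow.types.Intent.TrainingPhrase(
                parts=[dialogflow.types.Intent.TrainingPhrase.Part(text=phrase)])
            for phrase in row['training_phrases'].split(';')]
        message = dialogflow.types.Intent.Message(
            text=dialogflow.types.Intent.Message.Text(text=[row['response']]))
        intents.create_intent(parent, dialogflow.types.Intent(
            display_name=row['intent'],
            training_phrases=phrases,
            messages=[message]))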

Dialogflow - problems - training

Dialogflow - other problems

  • API limits
  • Docs

  • Now we can understand our user (more or less)
  • What’s next?

Databases

  • Elasticsearch to store the text data.
  • Neo4j to store relations between documents and users.
  • Elasticsearch: one of the best databases for full-text queries
  • Neo4j: a graph database, good for fast prototyping

Why do we even need graphs?

  1. Store connections
  2. Improve results from Elasticsearch
  3. Get insights from graphs
  4. See the data flow inside the company

Our case

  • Statistics:
    • Number of views of a document
    • Distinct people viewed a document
    • PageRank score for each document (popularity score)

Document model in ES

from datetime import datetime
from elasticsearch_dsl import DocType, Keyword, Text, Date, Nested

class DocumentElastic(DocType):
    uuid = Keyword()
    # full-text fields share the analyzed sub-fields defined in `default_fields`
    title = Text(fields=default_fields)
    ...
    description = Text(fields=default_fields)
    updated_at = Date()
    ...
    parameters = Nested(Parameter)
    ...
    graph_statistics = Nested(ResultType)

    class Index:
        name = 'documents'

    def is_up_to_date(self, last_updated: datetime):
        return self.updated_at >= last_updated
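
A short usage sketch; the connection setup is an assumption (the slides refer to it as `elastic.client`):

from elasticsearch_dsl.connections import connections

client = connections.create_connection(hosts=['localhost'])  # assumed connection

DocumentElastic.init(using=client)   # create the index and mapping
doc = DocumentElastic(uuid='...', title='Bookings dashboard')  # hypothetical document
doc.save(using=client)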

User model in Neo4j

from neomodel import StructuredNode, StringProperty, DateTimeProperty, RelationshipTo

class UserNeo(StructuredNode):
    uuid = StringProperty()
    email = StringProperty(unique_index=True)
    time_created = DateTimeProperty()

    # relationships to documents, each with its own relationship model
    created = RelationshipTo('DocumentNeo', 'CREATED', model=CreatedRelation)
    consumed = RelationshipTo('DocumentNeo', 'CONSUMED', model=ConsumedRelation)
    modified = RelationshipTo('DocumentNeo', 'MODIFIED', model=ModifiedRelation)

Document model in Neo4j

from neomodel import (StructuredNode, StringProperty, IntegerProperty,
                      FloatProperty, RelationshipTo)

class DocumentNeo(StructuredNode):
    uuid = StringProperty()
    source = StringProperty(required=True, index=True)
    source_id = StringProperty(required=True, index=True)

    # graph statistics, filled in by the Cypher jobs shown later
    views = IntegerProperty(default=0)
    people_viewed = IntegerProperty(default=0)
    page_rank = FloatProperty(default=0)

    created_by = RelationshipTo('UserNeo', 'CREATED_BY', model=CreatedRelation)
    consumed_by = RelationshipTo('UserNeo', 'CONSUMED_BY', model=ConsumedRelation)
    modified_by = RelationshipTo('UserNeo', 'MODIFIED_BY', model=ModifiedRelation)
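
A usage sketch, assuming `ConsumedRelation` is a `StructuredRel` with a `times_viewed` property (it is referenced later in the Cypher queries) and a standard neomodel connection:

from neomodel import config
config.DATABASE_URL = 'bolt://neo4j:password@localhost:7687'  # assumed connection string

user = UserNeo(uuid='...', email='jane.doe@kiwi.com').save()         # hypothetical user
doc = DocumentNeo(uuid='...', source='wiki', source_id='42').save()  # hypothetical document

# relationship properties go into the second argument of connect()
user.consumed.connect(doc, {'times_viewed': 1})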

ES + Neo4j - how to use both dbs?

  • At first we were using plugins
  • But plugins only work with ES v2.x

ES + Neo4j - interface to unite them

class Document:
    """Unites Elasticsearch and Neo4j, representing an entity in both databases.

    Entities are available by `uuid` or by the tuple `(source, source_id)`.
    """

    def __init__(self):
        self._elastic_doc: DocumentElastic
        self._neo4j_doc: DocumentNeo

    def __getattr__(self, name):
        # look the attribute up in the ES document first, then in the Neo4j node
        if name not in ('_elastic_doc', '_neo4j_doc'):
            try:
                return getattr(self._elastic_doc, name)
            except AttributeError:
                pass
            return getattr(self._neo4j_doc, name)
        return None

ES + Neo4j - some methods

    @staticmethod
    def get_by_source_id(source, source_id):
        doc = Document()
        doc._elastic_doc = ElasticQuery.get_doc_by_source_id(source, source_id)
        doc._neo4j_doc = NeoQuery.get_doc_by_source_id(source, source_id)
        return doc

    @staticmethod
    def get_by_uuid(uuid):
        doc = Document()
        doc._elastic_doc = ElasticQuery.get_doc_by_uuid(uuid)
        doc._neo4j_doc = NeoQuery.get_doc_by_uuid(uuid)
        return doc

    def is_up_to_date(self, last_updated: datetime):
        return self._elastic_doc.is_up_to_date(last_updated)
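
Usage then looks like this; attribute access is transparently delegated by `__getattr__`:

doc = Document.get_by_uuid(uuid)
doc.title         # resolved from the Elasticsearch document
doc.page_rank     # not in ES, falls through to the Neo4j node
doc.is_up_to_date(last_updated)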

  • So far we have:
  • Discovered Dialogflow
  • Learned how to use Elasticsearch + Neo4j together

Elasticsearch-dsl - query examples

  • Filtering by field and limiting the results:

DocumentElastic \
    .search(index='documents', using=elastic.client) \
    .query('bool', filter=[Q('term', source=source)]) \
    .fields(['source_id'])[:limit] \
    .execute()

  • Getting a single document by id:

DocumentElastic.get(id=uuid, using=elastic.client, index='documents')

Elasticsearch - word order

  • Query: “bookings last year”
  • 1) “Average amount of bookings for last year”
  • 2) “Last bookings of the previous year”

Elasticsearch - word order

  • 2 separately analyzed fields (see the query sketch below):
  • single words: “last”, “year”
  • shingles (word combinations): “last year”, “number of bookings”
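
A hedged example of querying both sub-fields together with elasticsearch-dsl; the `^2` boost on the shingle fields is an assumed value, not taken from the slides:

from elasticsearch_dsl import Q

q = Q('multi_match',
      query='bookings last year',
      type='most_fields',
      fields=['title.default', 'title.shingles^2',
              'description.default', 'description.shingles^2'])

DocumentElastic.search(index='documents', using=elastic.client).query(q)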

Elasticsearch - analyzers

from elasticsearch_dsl import analyzer, Text

# both analyzers strip HTML, lowercase, stem, apply synonyms and drop stopwords;
# `shingles` additionally combines adjacent tokens via `shingle_filter`
root = analyzer(
    'root',
    type='custom',
    tokenizer='standard',
    char_filter=['html_strip'],
    filter=[english_possessive_stemmer, synonyms_case_sensitive, 'lowercase',
            synonyms_lowercase, english_stop, english_stemmer])

shingles = analyzer(
    'shingles',
    type='custom',
    tokenizer='standard',
    char_filter=['html_strip'],
    filter=[english_possessive_stemmer, synonyms_case_sensitive, 'lowercase',
            synonyms_lowercase, english_stop, english_stemmer, shingle_filter])

default_fields = {
    'default': Text(analyzer=root),
    'shingles': Text(analyzer=shingles)
}
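
`shingle_filter` is referenced above but not shown; a typical definition (settings assumed) that produces two-word shingles would be:

from elasticsearch_dsl import token_filter

shingle_filter = token_filter(
    'shingle_filter',
    type='shingle',
    min_shingle_size=2,
    max_shingle_size=2,        # only word pairs
    output_unigrams=False)     # single words stay in the `default` field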

Neo4j

  • Uses an SQL-inspired query language: Cypher

Neo4j - Graph statistics

  • Count the views:

from neomodel import db

db.cypher_query('''
    MATCH (doc:DocumentNeo)-[rel:CONSUMED_BY]-(user:UserNeo)  // match documents with their viewers
    WITH doc, sum(rel.times_viewed) AS views                  // aggregate views per document
    SET doc.views = views                                     // store the statistic on the node
''')
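
The other statistics from the earlier slide can be filled the same way; a sketch for the distinct-viewers count:

db.cypher_query('''
    MATCH (doc:DocumentNeo)-[:CONSUMED_BY]-(user:UserNeo)
    WITH doc, count(DISTINCT user) AS people
    SET doc.people_viewed = people
''')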

PageRank

  • A mathematical formula that judges the “value of a page” (see below)
  • Still used in Google’s search engine
  • Simple and cool
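
For reference, the classic formula (not on the slide): with damping factor d, N documents, B(p) the set of pages linking to p, and L(q) the number of outgoing links of q:

PR(p) = \frac{1 - d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}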

Neo4j - trick to project bipartite graph

  • How do we calculate PageRank if we have a bipartite graph? (one possible projection is sketched below)
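
The actual trick is on the diagram slides; one common approach (a sketch, not necessarily the exact one used here) is to project the user-document graph onto documents only and run PageRank on that projection. `RELATED` is a hypothetical relationship type:

db.cypher_query('''
    MATCH (d1:DocumentNeo)-[:CONSUMED_BY]->(u:UserNeo)<-[:CONSUMED_BY]-(d2:DocumentNeo)
    WHERE id(d1) < id(d2)
    MERGE (d1)-[:RELATED]->(d2)   // two documents are linked if the same user viewed both
''')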

Elasticsearch - Function score

  • TF_IDF_SCORE * ln(page_rank) * log10(number_of_views) * Gauss_filter

    • ln(page_rank)
      • 0 < page_rank < 10
      • 1 < multiplier < 3
    • log10(number_of_views)
      • 0 < number_of_views < 10000
      • 1 < multiplier < 3
    • Gauss_filter
      • Penalize docs which were updated > 1 year ago

Elasticsearch-dsl - Function score

import datetime
from elasticsearch_dsl.query import FunctionScore

query = FunctionScore(
    query=query,
    functions=[
        dict(  # Gauss multiplier: decay for documents not updated recently
            gauss={
                'updated_at': {
                    'origin': datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S'),
                    'offset': '365d',
                    'scale': '700d'
                }
            }
        ),
        dict(  # Multipliers computed from the graph features
            script_score=dict(script=dict(
                source=score_script,
                params=dict(
                    pg_offset=1,
                    pg_multiplier=1,
                    vw_offset=1,
                    vw_multiplier=0.2
                ),
            )))])
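
`score_script` is not shown on the slides; a hypothetical Painless script consistent with the parameters above might look like the following (the field names are placeholders, the real mapping nests them under `graph_statistics`; the `+ 1` guards against zero values):

score_script = """
    double pg = doc['page_rank'].value;   // placeholder field name
    double vw = doc['views'].value;       // placeholder field name
    return (params.pg_offset + params.pg_multiplier * Math.log(pg + 1))
         * (params.vw_offset + params.vw_multiplier * Math.log10(vw + 1));
"""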

Are the results good?

Future plans

  • Gather feedback and statistics
  • Replace Neo4j
  • Our own NLP model instead of Dialogflow

Thank you!

Questions?