BUILDING AN AUTOMATIC KEYPHRASE EXTRACTION SYSTEM IN PYTHON

PRASTUT KUMAR

PYCON DELHI 2016

Hi, I am Prastut, a junior-year undergraduate at Vellore Institute of Technology, Chennai Campus.

UI/UX intern at BetterButter

Early-stage food-tech startup

Co-founded and built Cambuzz
A campus management system

Full stack developer intern at Houseafic

Early-stage real estate (fin-tech) startup

Stanford CrowdResearch Program

Worked as a Human-Computer Interaction researcher.

GSoC Student at Berkman Center, Harvard University

2015
(1st year summer)

2015
(2nd year)

2016
(2nd year winter)

2016
(2nd year)

2016
(2nd year summer)

My dev journey.

So let me give you a glimpse of my dev journey. I realized at the end of my first year that I wanted to work at the intersection of design and technology. I interned twice, both times at early-stage startups, to validate this interest. I co-founded and built Cambuzz, an open-source youth networking platform. And this past summer I worked as a Google Summer of Code student at the Berkman Center, Harvard University.

So just to gauge the audience, how many of you have experience working with Python for over 5 years?
How many of you have experience of over 1 year in Python?
Quite a few.

Well, I started using Python 5 months back, and here I am standing in front of you, giving a talk to people who know much, much more about Python than I do. I was quite surprised myself that I got the chance to speak in front of you.

Therefore I can’t tell you about the technicalities in depth, because you probably know more than me. And since this project deals with NLP, I can’t possibly tell you everything, because I am still discovering new things every day.

That’s where I realized the beauty of Python: it’s so easy to learn and start hacking on code.

What I can do is share my experience of working with Python through this project, and how it has brought me up here all the way from Chennai.

I hope this talk provides the necessary motivation for everyone here who is just starting their journey with Python, as well as some insights from my experience for the more experienced people.

Also if you are asking me questions, keep in mind that I am looking towards you with puppy eyes. :D

AGENDA

  • Intro + GSoC Project Details.
  • Science behind tag generation.
    • Methodology.
    • Key phrase selection.
      • Algorithms to quantify weights
      • Methods using Machine Learning
  • Building a basic tag generator.

TEEM

A P2PValue Collaboration App.

So I worked on Teem in my GSoC duration.

Teem is an app focused on increasing the participation and sustainability of commons-based peer production communities (e.g. Wikipedia, free software, Arduino). The model can also be applied to other open online communities (networks, open organizations) or even social movements (social centers, collectives).

The 1-9-90 rule states that in collaborations on projects like Wikipedia, 90% of the participants of a community only view content, 9% edit content, and 1% actively create new content. After intensive social research and prototype testing, the main needs of the different roles within a community, and the tools they typically lack (related to management and internal organization, listing the subprojects available and the needs of each), have been identified.

The app is grounded on these findings to reduce the frustrations of all participants and increase participation (90s => 9, 9s => 1), while providing a kind of project management tool for communities (but informal/liquid/open to fit the context), together with a workspace with collaborative editing (like Google Docs) and a group chat (like a WhatsApp/Telegram group).

Collaboration View

Teem runs on its own backend called SwellRT, which is a real-time collaborative web server with a JS SDK. To make the explanation easy: it’s an open-source equivalent of Google Docs, Parse, or Firebase.

GSoC Project

Matchmaking Proposal

PROBLEM STATEMENT

New user

So the problem statement I wrote a proposal for was: when people start using a tool such as Teem, they come with a set of skills and interests. On the other hand, community projects have needs to fulfil, i.e. they have gaps or missing skills that are slowing down their development. The challenge is matching them: connecting newcomers with communities appealing to them.

So Teem lacked a dedicated onboarding experience as well as search. Stats tell you that most people lose interest within 40 seconds on a website if they can't understand what it's about.

SOLUTION

  • Introduce a new route for exploration called interests.
  • Interests: tiles with categories like music, technology, culture etc.
  • Selecting interests -> matchmaking screen.
  • Onboarding a user is now frictionless.
    • User interests -> Areas of concern for the community.

OVERALL PICTURE

How this works is where AKES (the Automatic Keyphrase Extraction System) comes into the picture. Every community that gets put up in Teem’s environment has its description and content parsed through AKES, so that the community is auto-tagged. Once communities are auto-tagged, they become searchable on Teem, and the tags also help contextualize them into broad categories.

INTRODUCTION

Keyphrases provide:

  • Concise Description.
  • Document Categorization.
  • Clustering.
  • Indexing
  • Search (Most important feature)
  • Summarization.
  • Myopic Semantic Analysis

Quantifiable metrics to perform actions.

Now the intro part is done. I will now start with the science behind the AKES.

Myopic Semantic Analysis: a basic heuristic for comparing two documents would be to compare their tags first.

DIFFICULTY

    • No standard: human-labelled tags deal with perspective and interpretation.
    • Fundamental difficulty: to find the most relevant keyphrases and provide the best coverage.
    • Document length, changes in topic, and lack of correlation between topics all contribute to the difficulty.

Despite wide applicability and much research, keyphrase extraction suffers from poor performance relative to many other core NLP tasks, partly because there’s no objectively “correct” set of keyphrases for a given document.

While human-labeled keyphrases are generally considered to be the gold standard, they carry bias! What counts as a tag to you doesn’t necessarily count as a tag to me.

As a general rule of thumb, keyphrases should be relevant to one or more of a document’s major topics, and should provide good coverage of all major topics.

The fundamental difficulty lies in determining which keyphrases are the most relevant and provide the best coverage.

Document length -> if you are tagging Romeo and Juliet, good luck.

METHODOLOGY

[Pipeline diagram: DOCUMENT -> Candidates -> Properties -> Scoring -> Keywords]

Sliding window:

  • Break at stopwords & punctuation.
  • Normalize.

Heuristic Formula:

Combination of properties.

Supervised machine learning:

learns the mapping from properties to manually assigned keywords.

Calculation

(some possible metrics):

  • Frequency of occurrences.
  • Position in the document.
  • Phrase length.
  • Similarity to other candidates.
  • Is it a popular keyword?

The document used in the example is:

EXAMPLE

The P2Pvalue project is mapping the diffusion and hybridization of peer production and investigating the conditions which favour collaborative creation and the logic of value of these emerging forms. The research results will be used to develop a digital platform based on decentralized architecture and the design of public policies that promote the commons.

The tags are:

['diffusion', 'decentralized architecture', 'hybridization', 'mapping', 'logic']

Please make a mental note of this example, because I am going to use it again later.

AKES

UNSUPERVISED

Advantages
No training data required! In an age of massive but unlabeled datasets, this can be a huge advantage over other approaches.

Disadvantages
Unsupervised methods make assumptions that don’t necessarily hold across different domains.

SUPERVISED

Advantages
Use training data and hence generally perform better.

Disadvantages
Good training data is hard to find, and there is the danger of training a model that doesn’t generalize to unseen examples.

  • Candidate Identification
    • Brute-force method.
    • Need for a better algorithm:
      • Typically identify a smaller subset of better candidates.
    • Machine Learning methods.

So I will explore each step into some depth.

Brute Force:

consider all words and/or phrases in a document as candidate keyphrases.

Problematic:

Computational costs are heavy.

Not all words and phrases in a document are equally likely to convey its content.

Better Algo:

Common heuristics include

Removing stop words and punctuation.

Filtering for words with certain parts of speech or, for multi-word phrases, certain POS patterns.

Machine Learning Methods:

Using external knowledge bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.

Using an already-labeled dataset, such as the SemEval 2010 keyphrase extraction dataset.

It contains a lot of library documents from the UK and US, with corresponding tags attached to them.

  • Candidate Identification – cont.

Compare this with the brute-force n-grams (where n = 6) over the example document:

('The', 'P2Pvalue', 'project', 'is', 'mapping', 'the')

('P2Pvalue', 'project', 'is', 'mapping', 'the', 'diffusion')

('project', 'is', 'mapping', 'the', 'diffusion', 'and')

('is', 'mapping', 'the', 'diffusion', 'and', 'hybridization')

('mapping', 'the', 'diffusion', 'and', 'hybridization', 'of')

('the', 'diffusion', 'and', 'hybridization', 'of', 'peer')

('diffusion', 'and', 'hybridization', 'of', 'peer', 'production')

('and', 'hybridization', 'of', 'peer', 'production', 'and')

('hybridization', 'of', 'peer', 'production', 'and', 'investigating')

('of', 'peer', 'production', 'and', 'investigating', 'the')

('peer', 'production', 'and', 'investigating', 'the', 'conditions')

('production', 'and', 'investigating', 'the', 'conditions', 'which')

('and', 'investigating', 'the', 'conditions', 'which', 'favour')

('investigating', 'the', 'conditions', 'which', 'favour', 'collaborative')

('the', 'conditions', 'which', 'favour', 'collaborative', 'creation')

('conditions', 'which', 'favour', 'collaborative', 'creation', 'and')

('which', 'favour', 'collaborative', 'creation', 'and', 'the')

('favour', 'collaborative', 'creation', 'and', 'the', 'logic')

('collaborative', 'creation', 'and', 'the', 'logic', 'of')

('creation', 'and', 'the', 'logic', 'of', 'value')

('and', 'the', 'logic', 'of', 'value', 'of')

('the', 'logic', 'of', 'value', 'of', 'these')

('logic', 'of', 'value', 'of', 'these', 'emerging')

('of', 'value', 'of', 'these', 'emerging', 'forms.')

('value', 'of', 'these', 'emerging', 'forms.', 'The')

('of', 'these', 'emerging', 'forms.', 'The', 'research')

('these', 'emerging', 'forms.', 'The', 'research', 'results')

('emerging', 'forms.', 'The', 'research', 'results', 'will')

('forms.', 'The', 'research', 'results', 'will', 'be')

('The', 'research', 'results', 'will', 'be', 'used')

('research', 'results', 'will', 'be', 'used', 'to')

('results', 'will', 'be', 'used', 'to', 'develop')

('will', 'be', 'used', 'to', 'develop', 'a')

('be', 'used', 'to', 'develop', 'a', 'digital')

('used', 'to', 'develop', 'a', 'digital', 'platform')

('to', 'develop', 'a', 'digital', 'platform', 'based')

('develop', 'a', 'digital', 'platform', 'based', 'on')

('a', 'digital', 'platform', 'based', 'on', 'decentralized')

('digital', 'platform', 'based', 'on', 'decentralized', 'architecture')

('platform', 'based', 'on', 'decentralized', 'architecture', 'and')

('based', 'on', 'decentralized', 'architecture', 'and', 'the')

('on', 'decentralized', 'architecture', 'and', 'the', 'design')

('decentralized', 'architecture', 'and', 'the', 'design', 'of')

('architecture', 'and', 'the', 'design', 'of', 'public')

('and', 'the', 'design', 'of', 'public', 'policies')

('the', 'design', 'of', 'public', 'policies', 'that')

('design', 'of', 'public', 'policies', 'that', 'promote')

('of', 'public', 'policies', 'that', 'promote', 'the')

('public', 'policies', 'that', 'promote', 'the', 'commons.')

The full listing is shown deliberately to convey the computational blow-up.
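For reference, here is a minimal sketch of how this brute-force listing can be produced with plain whitespace splitting plus nltk.ngrams (the variable names are my own, not from the Teem code):

import nltk

text = ("The P2Pvalue project is mapping the diffusion and hybridization of peer "
        "production and investigating the conditions which favour collaborative "
        "creation and the logic of value of these emerging forms. The research "
        "results will be used to develop a digital platform based on decentralized "
        "architecture and the design of public policies that promote the commons.")

# Brute force: every contiguous 6-word window becomes a candidate.
words = text.split()
candidates = list(nltk.ngrams(words, 6))

print(candidates[0])    # ('The', 'P2Pvalue', 'project', 'is', 'mapping', 'the')
print(len(candidates))  # grows with document length, hence the long listing above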

  • Candidate Identification – cont.
    • Steps to generate suitable key phrases:
      • Take out specific POS patterns.
        • Don’t take all of n-grams permutation. (Bruteforce)
        • Noun phrases matching the POS pattern {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}. This matches any number of adjectives followed by at least one noun that may be joined by a preposition to one other adjective(s)+noun(s) sequence (see the sketch after this list).
      • Take into account features like:
        • frequency-based: term frequency in the entire document and in the corpus. Can the “commonness” of word be an important factor?
        • statistical: term length, max word length.
        • grammatical: “is acronym”, “is named entity”
        • positional: normalized positions of first and last occurrence, “is in title”, “is in key excerpt” (such as an abstract or introductory paragraph)
      • Filter out stop words and punctuation (string module in Python)
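Here is a rough sketch of how the noun-phrase pattern above can be applied with NLTK. It mirrors the approach, not the exact Teem code, and assumes the standard NLTK tokenizer and tagger data are available:

import nltk

GRAMMAR = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'

def extract_candidate_chunks(text, grammar=GRAMMAR):
    """Return lowercased noun-phrase chunks matching the POS pattern."""
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'KT'):
            candidates.append(' '.join(word for word, tag in subtree.leaves()).lower())
    return candidates

print(extract_candidate_chunks("The P2Pvalue project is mapping the diffusion "
                               "and hybridization of peer production."))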

I didn’t brainstorm this science myself. It’s an accumulation of reading a lot of articles and papers from around the web.

The most important feature other than extracting words is to build upon the structure of the content.

  • Candidate identification completely depends on the user.
  • In the interest of brevity and simplicity, we will consider noun phrases and positional factors.

INFERENCE

It depends on what feature set you are creating and what type of application you need the tags for.

    • nltk.sent_tokenize – to tokenize text into sentences.
    • nltk.pos_tag – to part-of-speech tag words.

eg:

[('the', 'DT'), ('p2pvalue', 'NN'), ('project', 'NN'), ('is', 'VBZ'), ('mapping', 'VBG'), ('the', 'DT'), ('diffusion', 'NN'), ('and', 'CC'), ('hybridization', 'NN'), ('of', 'IN'), ('peer', 'NN'), ('production', 'NN'), ('and', 'CC'), ('investigating', 'VBG'), ('the', 'DT'), ('conditions', 'NNS'), ('which', 'WDT'), ('favour', 'VBP'), ('collaborative', 'JJ'), ('creation', 'NN'), ('and', 'CC'), ('the', 'DT'), ('logic', 'NN'), ('of', 'IN'), ('value', 'NN'), ('of', 'IN'), ('these', 'DT'), ('emerging', 'VBG'), ('forms', 'NNS')]

    • nltk.PorterStemmer – the Porter stemming algorithm, used for stemming.
    • nltk.corpus.stopwords.words('english') – provides common stop words. NLTK has broad support for other languages as well.
    • nltk.chunk.regexp.RegexpParser – NLTK's regular-expression-based chunk parser.

SOME BASIC NLTK COMMANDS

It supports Hindi too.
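Here is a minimal sketch of these commands in action (it assumes the punkt, POS-tagger and stopwords data packages have already been downloaded with nltk.download):

import string

import nltk
from nltk.corpus import stopwords

text = ("The P2Pvalue project is mapping the diffusion and hybridization of peer "
        "production and investigating the conditions which favour collaborative "
        "creation and the logic of value of these emerging forms.")

sentences = nltk.sent_tokenize(text)                             # split text into sentences
tagged = nltk.pos_tag(nltk.word_tokenize(sentences[0].lower()))  # part-of-speech tag each word
print(tagged)                                                    # (word, tag) pairs, as in the example above

stops = set(stopwords.words('english')) | set(string.punctuation)
content_words = [word for word, tag in tagged if word not in stops]
print(content_words)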

BASIC READER CLASS

Code explanation

  • "cats" , "catlike", "catty" are based on the root ->"cat"
  • "stems", "stemmer", "stemming", "stemmed" -> "stem".
  • A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish".
  • On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".

    NLTK provides implementations of popular stemming algorithms like Porter, Lancaster, and Snowball. Extra checks need to be applied for cases like the last point.

2. Stemming – finding word roots and collapsing different forms of the same word.
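A quick sketch of the stemming behaviour described above, using nltk.PorterStemmer (the Lancaster and Snowball stemmers are used the same way):

import nltk

stemmer = nltk.PorterStemmer()

# Different forms of a word collapse onto the same stem.
print([stemmer.stem(w) for w in ["stems", "stemming", "stemmed"]])
# -> ['stem', 'stem', 'stem']

# The stem need not be a dictionary word...
print([stemmer.stem(w) for w in ["argue", "argued", "argues", "arguing"]])
# -> ['argu', 'argu', 'argu', 'argu']

# ...while "argument"/"arguments" keep a different stem, as noted above.
print([stemmer.stem(w) for w in ["argument", "arguments"]])
# -> ['argument', 'argument']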

3. Rating

    • Information gathered in the previous stages leads to a better rating of the keyphrases according to their “significance” or “relevance” to the document.
    • Present Research states that:
      • Information contained in the document itself is not enough.
      • For this reason, analyzing a corpus (i.e. a sample of documents written in the same language) to build a dictionary of known words is important.
    • Algorithms like TF*IDF and Okapi BM25 are used to define the relevance of a word.
      • A normalized dictionary is made by combining the dictionary of known words with the weights given by the above algorithms.

So if you take just one document to tag, the results will be very poor.

TF*IDF

  • The word “Romeo” is present 7000 times in the document “Romeo and Juliet” and is present 7 times in the whole collection of Shakespeare’s works.
  • The tf-idf of “Romeo” in the document “Romeo and Juliet” is 7000*(1/7) = 1000
  • The word “and” is present 10^4 times in “Romeo and Juliet” and is present 10^6 times in the whole collection of Shakespeare.
  • The tf-idf of “and” is 10^4*(1/10^6) = 0.01

EXAMPLE

  • TF = Term frequency
  • IDF = Inverse Document Frequency.

BACKGROUND

IDF: The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.
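Here is a tiny sketch of the simplified scoring used in the Romeo example above, together with the more common log-scaled IDF. This is only illustrative; real systems typically use something like scikit-learn's TfidfVectorizer:

import math

def simple_tf_idf(tf_in_document, count_in_collection):
    # Simplified scoring from the example: term frequency in the document
    # multiplied by the inverse of its count in the whole collection.
    return tf_in_document * (1.0 / count_in_collection)

print(simple_tf_idf(7000, 7))         # "Romeo" in Romeo and Juliet -> 1000.0
print(simple_tf_idf(10**4, 10**6))    # "and" -> 0.01

def idf(total_documents, documents_containing_term):
    # The more usual definition: log of (N / number of documents containing the term).
    return math.log(total_documents / documents_containing_term)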

BM25

For a document D and query terms q1..qn, the BM25 score is:

score(D, Q) = sum over i of IDF(qi) * f(qi, D) * (k1 + 1) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

where:

  • IDF(qi): IDF of the word qi in the whole corpus, computed as log((N - n(qi) + 0.5) / (n(qi) + 0.5)).
  • f(qi, D): TF of the word qi in the document.
  • k1 and b: free variables; k1 is typically in [1.2, 2.0] and b = 0.75.
  • N: total number of documents.
  • n(qi): number of documents containing qi.
  • |D| and avgdl: the document's length and the average document length in the corpus.

Explanation of the algorithm will be given in detail.
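Here is a rough, self-contained sketch of Okapi BM25 scoring over a tokenized corpus. This is my own illustrative code, not the Teem implementation:

import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score `doc` (a list of tokens) against `query_terms`.
    `corpus` is a list of tokenized documents used for the IDF statistics."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)              # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)   # non-negative IDF variant
        tf = doc.count(q)                                    # term frequency in the document
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["peer", "production", "commons"], ["digital", "platform"], ["peer", "value"]]
print(bm25_score(["peer", "production"], corpus[0], corpus))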

SUPERVISED

SUPERVISED APPROACHES

  • Task reformulation: some fraction of candidates are classified as keyphrases, the rest as non-keyphrases.
    • Binary classification problem: Naive Bayes, decision trees, SVMs.
    • Conceptually Problematic.
  • Feature Design:

However, this reformulation of the task is conceptually problematic; humans don’t judge keyphrases independently of one another, instead they judge certain phrases as more key than others. As such, more recently the problem has been reformulated as a ranking problem, in which a function is trained to rank candidates pairwise according to degree of “keyness”. The best candidates rise to the top, and the top N are taken to be the document’s keyphrases.

Statistical features: phrase length (number of constituent words), phrase position (normalized position within a document of its first and/or last occurrence), and “supervised keyphraseness” (number of times a keyphrase appears as such in the training data). Some models take advantage of a document’s structural features (titles, abstracts, intros and conclusions, metadata, and so on), because a candidate is more likely to be a keyphrase if it appears in a notable section. Others use external resource-based features: “Wikipedia-based keyphraseness” assumes that keyphrases are more likely to appear as Wikipedia article links and/or titles.
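To make the binary-classification formulation concrete, here is a hedged sketch with scikit-learn that uses only a couple of the statistical features mentioned above. The feature set, the GaussianNB choice and the training-data format are my own illustration, not a specific published system:

from sklearn.naive_bayes import GaussianNB

def candidate_features(candidate, document):
    """A few of the statistical features discussed above."""
    lowered = document.lower()
    first = lowered.find(candidate.lower())
    return [
        len(candidate.split()),            # phrase length in words
        first / max(len(document), 1),     # normalized position of first occurrence
        lowered.count(candidate.lower()),  # term frequency in the document
    ]

def train_keyphrase_classifier(labeled_examples):
    """labeled_examples: iterable of (candidate, document, is_keyphrase) triples,
    e.g. built from the SemEval 2010 dataset (loading it is not shown here)."""
    X = [candidate_features(c, d) for c, d, _ in labeled_examples]
    y = [label for _, _, label in labeled_examples]
    return GaussianNB().fit(X, y)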

ARCHITECTURE DESIGN

Tech stack of Teem.

Since the backend server was in Java, I was advised to use Weka. But Weka is something like MATLAB or RStudio; it's tough to write an open-ended wrapper around these things. Therefore I switched to NLTK.

[Architecture diagram: SwellRT runs in one Docker container; the tag extractor daemon server (Flask) with the tag extractor script runs in another. Changes in text in Teem are sent by SwellRT as the text to be tagged, the extractor returns tags, and the tags make up the interests. When a user selects an interest, related communities are shown.]

SwellRT makes a POST request to the teem-tag container. The teem-tag container receives the POST request, extracts the material from the received JSON, and then processes it to generate the tags as well as a summary of the content.
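A minimal sketch of what such an endpoint could look like in Flask. The route name, the JSON fields and the extract_tags helper are illustrative placeholders, not the actual teem-tag API:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/tag', methods=['POST'])
def tag():
    payload = request.get_json()      # JSON body POSTed by SwellRT
    text = payload.get('text', '')    # hypothetical field holding the material to tag
    tags = extract_tags(text)         # the keyphrase extraction script (not shown here)
    return jsonify({'tags': tags})

if __name__ == '__main__':
    app.run()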

APPLICATIONS

Document = network whose:

  • Nodes are candidate keyphrases (key words).
  • Edges (optionally weighted by the degree of relatedness) connect related candidates.
  • Then, a graph-based ranking algorithm, such as Google’s famous PageRank, is run over the network, and the highest-scoring terms are taken to be the document’s keyphrases.
  • Google applies its magic NLP sauce for further ranking. (A rough sketch of this graph-based approach follows below.)
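Here is a small sketch of this graph-based idea using networkx's PageRank. The candidate filter and co-occurrence window are deliberately simplified; this is neither Google's nor Teem's exact method:

import networkx as nx
import nltk

def graph_rank_keywords(text, top_n=5, window=4):
    # Nodes: candidate words (nouns and adjectives only).
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    words = [word for word, tag in tagged if tag.startswith(('NN', 'JJ'))]

    # Edges: connect candidates that co-occur within a small sliding window.
    graph = nx.Graph()
    graph.add_nodes_from(words)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            graph.add_edge(words[i], words[j])

    # Run PageRank over the network and keep the highest-scoring terms.
    ranks = nx.pagerank(graph)
    return sorted(ranks, key=ranks.get, reverse=True)[:top_n]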

USEFUL LINKS

    • Source code for Teem Tag Server + Script: https://github.com/P2Pvalue/teem-tag
    • Other similar implementations:
      • Rake: https://github.com/aneesha/RAKE
      • Supervised Code Implementation: https://github.com/chartbeat-labs/textacy
    • Libraries that can help:
      • NLTK: http://www.nltk.org/
      • NLP for Node: https://github.com/NaturalNode/natural
    • Datasets for supervised methods: https://github.com/snkim/AutomaticKeyphraseExtraction

CREDITS

    • Samer Hassan, researcher at the Berkman Klein Center, Harvard University; UCM Principal Investigator in P2Pvalue. Leads research at Teem.
    • Antonio Tapidor, developer, sysops, researcher at UCM.
    • Pablo Ojanguren, lead developer of SwellRT, the Teem backend.
    • Elena Martínez, lead designer at Teem.
    • Google - for funding me. :D

THANK YOU!

https://github.com/prastut

kr.prastut@gmail.com
