IPTC (https://iptc.org/) is looking for linguists to write classification rules for EXTRA (https://iptc.github.io/extra/overview.html), an open-source, rules-based classification engine for news. The linguist will write Boolean rules that analyze the text of news articles and suggest the most relevant IPTC Media Topics, a news taxonomy of roughly 1,000 subjects. Rules will be written for a selected portion of these subjects during this project.
A browsable tree of the taxonomy is available here: (English) http://show.newscodes.org/index.html?newscodes=medtop&lang=en-GB&startTo=Show (German) http://show.newscodes.org/index.html?newscodes=medtop&lang=de&startTo=Show
The project requires both an English and a German linguist, and requires that applicants complete a short application below demonstrating rule-writing proficiency (German linguists are asked to submit their responses in English). The position will be off-site and involve collaborating remotely with team members from different countries. The initial project phase is expected to run from the end of March to the end of June 2017, with an estimated 100-125 total hours of work per language required for the linguist.
EXTRA is the EXTraction Rules Apparatus, a multilingual open-source platform for rules-based classification of news content. IPTC was awarded a grant from the first round of Google’s Digital News Initiative Innovation Fund (https://www.digitalnewsinitiative.com/) to build and freely distribute the initial version of EXTRA. "Classification" means assigning one or more categories to the text of a news document. Rules-based classifiers use a set of Boolean rules, rather than machine-learning or statistical techniques, to determine which categories to apply.
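To make the idea of a Boolean rule concrete, here is a minimal Python sketch of a rules-based classifier. It is an illustration only and does not reflect EXTRA's actual rule syntax or API: the helper names (`term`, `all_of`, `any_of`, `not_`), the two topic labels, and the word lists are all assumptions made up for this example.

```python
import re
from typing import Callable, Dict, List

# A rule is just a function from article text to True/False.
Predicate = Callable[[str], bool]


def term(word: str) -> Predicate:
    """Match `word` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


def all_of(*preds: Predicate) -> Predicate:
    """Boolean AND over sub-rules."""
    return lambda text: all(p(text) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Boolean OR over sub-rules."""
    return lambda text: any(p(text) for p in preds)


def not_(pred: Predicate) -> Predicate:
    """Boolean NOT of a sub-rule."""
    return lambda text: not pred(text)


# Hypothetical rules; topic names and vocabularies are illustrative only.
RULES: Dict[str, Predicate] = {
    "weather forecast": all_of(
        any_of(term("forecast"), term("outlook")),
        any_of(term("rain"), term("snow"), term("storm")),
        not_(term("earnings")),  # guard against financial "forecast" stories
    ),
    "election": all_of(
        any_of(term("election"), term("ballot")),
        any_of(term("vote"), term("voters")),
    ),
}


def classify(text: str) -> List[str]:
    """Return every topic whose rule evaluates to True for `text`."""
    return [topic for topic, rule in RULES.items() if rule(text)]
```

Because each rule is evaluated independently, a single document can receive zero, one, or several categories, which matches the definition of classification given above.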
* Master’s degree in Library or Information Science, or equivalent professional experience (e.g. taxonomy, classification, computational linguistics, data science or information architecture).
* Experience using rules-based categorization software, Regex, Natural Language Processing (NLP), and text mining tools.
* Familiarity with one or more query languages (ElasticSearch Query DSL, SQL, Lucene, XQueryFT, Teragram etc.).
* Familiarity with general tagging principles using taxonomies and scope notes.
* Experience with news content or working in the news industry is a plus.
* Ability to work independently, while collaborating with remote team members.
* Fluency in English. For the German position: fluency in both English and German.
To submit your application, please complete the form below. For the German position, please submit your responses in English. First preference will be given to applications received by 27th February 2017, and review will continue until the positions are filled.
For question #6: To help us understand your approach to writing rules, please provide a rule for the Media Topic "Civil Unrest", whose scope note is: "Dissatisfaction among the population as evidenced by rallies, strikes, demonstrations or sabotage." The rule should match news content related to this scope note in general, and it must also match the sample articles provided below (please follow the links to view). For each article, the rule should match the text appearing in the headline, body, byline, dateline and caption fields; all other ancillary content on the page may be disregarded for the purposes of this exercise.
When writing your rule, you may either use your preferred query syntax or describe your approach in narrative form. For example, describe which words or patterns of words you would look for to suggest this topic, which sections of a structured news article you might look in, and what steps you might take to minimize false positives from figurative or metaphorical language.
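As a rough illustration of the three considerations just listed (word patterns, article sections, and figurative language), the following Python sketch scores an article held as a dictionary of fields. It is not a model answer and not EXTRA syntax: the field weights, the vocabulary, the figurative-phrase list, and the threshold are all hypothetical choices made for this example.

```python
import re
from typing import Dict

# Hypothetical weights: a hit in the headline counts more than one in the body.
FIELD_WEIGHTS = {"headline": 3, "caption": 2, "body": 1}

# Words suggested by the scope note (rallies, strikes, demonstrations, sabotage).
UNREST = re.compile(
    r"\b(?:protest(?:s|ers?|ed)?|demonstrat(?:ions?|ors?|ed)|"
    r"riot(?:s|ers|ing)?|rall(?:y|ies)|strikes?|sabotage)\b",
    re.IGNORECASE,
)

# Figurative or unrelated uses of the same words, removed before counting.
FIGURATIVE = re.compile(
    r"\b(?:stocks?|shares|markets?|prices)\s+rall(?:y|ied|ies)\b"
    r"|\b(?:air|lightning|missile)\s*strikes?\b"
    r"|\bstrikes?\s+(?:a\s+deal|out|gold)\b",
    re.IGNORECASE,
)


def unrest_score(article: Dict[str, str]) -> int:
    """Weighted count of unrest terms after stripping figurative phrases."""
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        text = FIGURATIVE.sub(" ", article.get(field, ""))
        score += weight * len(UNREST.findall(text))
    return score


def is_civil_unrest(article: Dict[str, str], threshold: int = 3) -> bool:
    """Suggest the topic when the weighted score reaches the threshold."""
    return unrest_score(article) >= threshold


# Invented sample articles, not the ones referenced in the application.
protest_story = {
    "headline": "Thousands rally against pension reform",
    "body": "Protesters marched through the capital as unions called a general strike.",
}
market_story = {
    "headline": "Stocks rally as airline strikes a deal with pilots",
    "body": "Shares rallied after the announcement.",
}
```

With these assumed weights, a single headline hit is enough to suggest the topic, while a lone mention in the body is not; a narrative answer could describe the same trade-offs without any code.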