1
NLP Task (Use-case Category) | library | feature | paper | jupyter notebook example | example | model parameter | data url | input | output | experimental results | quality | remark | pipeline
2
Chinese word segmentation | jieba | probability | dictionary, word frequency statistics
https://github.com/ChiLunHuang/DataAnalysis1-in-Python-with-jieba/blob/master/Data%20Analysis%20in%20Python%20with%20jieba.ipynb
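jieba word segmentation
custom dictionary, stop words,
Accurate mode: tries to cut the sentence as precisely as possible; suited to text analysis;
Full mode: scans out every word that can be formed from the sentence; very fast, but cannot resolve ambiguity;
Search-engine mode: on top of accurate mode, splits long words again to improve recall; suited to search-engine tokenization.
我来到北京清华大学 | 我/ 来到/ 北京/ 清华大学 | no experimental results on standard data were found | + | the best Chinese word segmenter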
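The dictionary-plus-word-frequency idea in this row can be sketched in a few lines. This is not jieba's code: it assumes a tiny hand-made dictionary (`FREQ`, with made-up counts) and picks the segmentation with the highest total log probability by dynamic programming, which is the general idea behind accurate mode. The hypothetical `full_mode` helper lists every dictionary word without resolving ambiguity.

```python
import math

# Toy dictionary with hypothetical frequencies (an assumption, not jieba's real dictionary).
FREQ = {
    "我": 50, "来": 20, "到": 30, "来到": 40,
    "北": 5, "京": 5, "北京": 60,
    "清": 5, "华": 5, "清华": 20, "大": 10, "学": 10, "大学": 40,
    "华大": 1, "清华大学": 30,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown characters become single-character tokens
    return dag


def segment(sentence):
    """Return the path through the DAG with the highest total log probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    best = {n: (0.0, n)}  # index -> (best log prob from here to the end, end of first word)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


def full_mode(sentence):
    """List every dictionary word found in the sentence; ambiguity is not resolved."""
    dag = build_dag(sentence)
    return [sentence[i:j] for i in range(len(sentence)) for j in dag[i] if sentence[i:j] in FREQ]


print("/ ".join(segment("我来到北京清华大学")))  # 我/ 来到/ 北京/ 清华大学
```

With this toy dictionary, `full_mode` also returns overlapping candidates such as 清华, 大学 and 清华大学, which is why full mode is fast but ambiguous.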
3
Word vector representation | gensim | distributed representation,
probabilistic modeling
(Euclidean embedding)
words or phrases from the vocabulary are mapped to a continuous vector space, softmax
synonym lookup | moving window,
dimension size,
minimum occurrence frequency,
Skip-gram or CBOW model
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb
"Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"
word vector matrix: [
human 0.122 0.566 ...
machine 0.655 0.444 ...
...]
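no experimental results on standard data were found | + | synonym lookup is not very accurate;
a topic-specific corpus is needed for training
word segmentation, stop-word removal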
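To make the model parameters in this row concrete (moving window, minimum occurrence frequency, Skip-gram), here is a minimal sketch of how skip-gram training pairs can be generated from tokenized sentences. It is an illustration only, not gensim's implementation; the function name `skipgram_pairs` is hypothetical, and it assumes rare words are dropped before the window is applied.

```python
from collections import Counter


def skipgram_pairs(sentences, window=2, min_count=1):
    """Return (center, context) pairs from tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        # Assumption: words below min_count are removed before windowing.
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs


corpus = [
    "human machine interface for lab abc computer applications".split(),
    "a survey of user opinion of computer system response time".split(),
]
for center, context in skipgram_pairs(corpus, window=1)[:3]:
    print(center, context)
```

A skip-gram model is trained to predict each context word from its center word; CBOW reverses the direction and predicts the center word from its context.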
4
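embedding visualization | tensorboard | dimension reduction | nonlinear dimensionality reduction; maps high-dimensional word vectors into 2D or 3D space | embedding vector visualization | t-SNE or PCA | word vector matrix, dimension | 2D or 3D matrix | + | can be used to visualize word vectors graphically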
5
Sentiment analysis | SnowNLP | Bayesian classification | SVM, HMM, naive Bayes, maximum entropy, K-NN, dictionary | Twitter sentiment analysis | model training method, word segmentation method
https://github.com/isnowfy/snownlp/blob/master/snownlp/sentiment/pos.txt
https://github.com/isnowfy/snownlp/blob/master/snownlp/sentiment/neg.txt
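总之,前两位作者写得比较好 (In short, the first two authors wrote fairly well)
小熊宝宝我觉得孩子不喜欢,能换别的吗 (Little Bear Baby: I don't think my child likes it; can it be exchanged for something else?)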
positive
negative
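no experimental results on standard data were found | -
the library's training data does not include stock-market news;
sentiment-classification training data for stock-market news is needed
word segmentation, stop-word removal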
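The Bayesian classification listed for this row can be illustrated with a small multinomial naive Bayes classifier. This is a sketch under stated assumptions, not SnowNLP's implementation: the class name `NaiveBayes`, the add-one smoothing, and the four toy training documents (English stand-ins for segmented reviews) are all made up for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over pre-tokenized documents."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {tok for counter in self.word_counts.values() for tok in counter}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training set.
train_docs = [
    ["well", "written", "good"],
    ["like", "good", "book"],
    ["child", "dislike", "bad"],
    ["bad", "want", "exchange"],
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["good", "book"]))
```

Because the word likelihoods come entirely from the training set, a classifier trained on product reviews has no evidence for stock-market vocabulary, which is consistent with the remark that domain-specific training data is needed.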
6
Entity extraction | NER_IDCNN_CRF | classification
End to End Chinese Named Entity Recognition by Iterated Dilated Convolution Neural Networks with Conditional Random Field layer
entity extraction | IDCNN+CRF or BiLSTM+CRF
https://github.com/crownpku/Information-Extraction-Chinese/blob/master/NER_IDCNN_CRF/data/example.test
Donald Trump is the president of U.S.
I live in Florida.
SEC is an organization.
Donald Trump(Person)
Florida(Place)
SEC(Organization)
+
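can be applied to Chinese entity extraction
currently supports only person names, place names, and organizations;
to extract other types, training data for those types must be added
word segmentation, word vectorization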
7
Relation extraction | RE_BGRU_2ATT | classification
Bi-directional GRU with Word and Sentence Dual Attentions for End-to-End Relation Extraction
relation extraction | word embedding...
https://github.com/crownpku/Information-Extraction-Chinese/blob/master/RE_BGRU_2ATT/origin_data/test.txt
鲁迅 鲁瑞 -人物事迹鲁迅的父亲鲁瑞,母亲周伯宜鲁迅后来曾说,因为母亲要看书,他必须到处搜集小说,而且老人
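父母 (parents) 鲁迅 鲁瑞 | +
can be applied to Chinese relation extraction
currently supports only the relations already present in the training data;
to extract other types, training data for those types must be added
word segmentation, word vectorization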
8
Relation extraction | DeepDive | classification
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference. End-to-end information extraction pipelines; a trained system that uses statistical learning to cope with various forms of noise and imprecision.
https://nbviewer.jupyter.org/github/HazyResearch/deepdive/blob/master/examples/spouse/DeepDive%20Tutorial%20-%20Extracting%20mentions%20of%20spouses%20from%20the%20news.ipynb
information extraction
What are the variables of interest that we want DeepDive to predict for us?
What are the features for each of these variables?
What are the connections between the variables?
https://pan.baidu.com/s/1slLpYVz
1201734370,证券代码:600969 证券简称:郴电国际 编号:公告临 2015-033 湖南郴电国际发展股份有限公司 为郴州市城市建设投资发展集团有限公司 提供担保公告 本公司董事会及全体董事保证本公告内容不存在任何虚假记载、 误导性陈述或者重大遗漏,并对其内容的真实性、准确性和完整性承 担个别及连带责任。如有董事对临时公告内容的真实性、准确性和完整性无法保证或 存在异议的,公司应当在公告中作特别提示。 重要内容提示: ●被担保人名称:郴州市城市建设投资发展集团有限公司(以下简称“郴州城投公司”)。 ●本次担保金额及累计为其提供的担保余
额:本次对外担保金额为 4500 万元人民币,截止本公告日,本公司对郴州城投公司的担保 除本次担保外不存在其他任何担保。●本次担保是否有反担保:无。●对外担保逾期的累计数量:截至本公告日,本公司无逾期担保 事项。一、 担保情况概述经本公司第五届董事会第十四次会议审议通过,同意为郴州市城 市建设投资发展集团有限公司向国家开发银行湖南省分行申请国家 第二批专项建设基金借款 4500 万元人民币的贷款事项提供全额、全程连带责任保证担保,本次担保的借款款项用于本公司 2015 年城镇 配电网建设改造工程项目。二、被担保人基本情况(一)被担保人是郴州市城市建设投资发展集团有限公司,注册 地点:郴州市北湖区五岭大道 1 号(市政府机关政务文化休闲中心 7 楼),法定代表人为刘建国,经营范围:城市基础设施建设项目
投资、 融资及相关的配套服务,农、林、水项目投资、开发建设及相关的配 套服务,房地产开发经营,土地一级开发及整理。(国家禁止经营的 除外,涉及行政许可的凭证可证经营)。截止 2015 年 6 月 29 日郴州 城投公司的信用等级为双 A 级,截止 2014 年 12 月 31 日,资产总额 为 2,888,924.53 万元,资产净额为 1,618,965,52 万元,资产负债率为 43.61%,营业收入为 180,524.54 万元,利润总额为 61,656.39 万元。(二)详细说明被担保人与上市公司关联关系或其他关系。 本公司与郴州城投公司无关联关系。三、担保协议的主要内容 本次担保的方式是提供连带责任保证,担保
范围:根据主合同的约定,借款人向债权人借款 4500 万元人民币,借款期限 15 年(即 2015 年 10 月 30 日至 2030 年 10 月 29 日止)。保证人愿意就借款人 偿付主合同项下全部借款本金、利息、罚息、复利、补偿金、违约金、 损害赔偿金和实现债权的费用向债权人提供担保。本合同的保证期间 为主合同下债务履行届满之日起两年。四、董事会意见为争取国家专项建设基金,支持项目建设,经国家发改委批准, 公司 2015 年城市电网改造项目列入了国家第二批专项建设基金支持 项目,由国家开发银行湖南分行发放建设基金借款 4500 万元,期限 15 年,年利率 1.2%。按照国家对专项建设
基金发放的有关规定,须 由郴电国际的股东向国家开发银行借款,专项用于 2015 年郴电国际 城镇配电网建设改造工程项目。郴州市政府旗下的郴州
城投公司,是 国家开发银行湖南分行的信贷客户,在银行存在着良好的信用,符合 国家开发银行放贷条件。郴州城投公司同意为本公司承贷 4500 万
元 借款,但须由本公司对此项借款提供全额、全程连带责任保证担保。 本次担保的借款款项是用于本公司 2015 年城镇配电网建设改造工程 项目。故本公司董事会同意本次担保事项。本公司的独立董事对本次担保事项发表了独立意见如下:我们认 为,本次担保的借款款项是用于本公司 2015 年城镇
配电网建设改造 工程项目,此次担保事项符合公司章程及相关规定的要求。我们同意 公司为上述贷款项目进行担保。 五、累计对外担保数量及逾期
担保的数量截至公告披露日上市公司及其控股子公司均无对外担保,本次对 外担保 4500 万元人民币占 2014 年 12 月 31 日本公司经审计归属于上
市公司股东净资产的 1.79%,截止目前,本公司无逾期担保。 六、上网公告附件 公司第五届董事会第十四次会议决议。 特此公告。湖南郴电国际
湖南郴电国际发展股份有限公司 | 郴州市城市建设投资发展集团有限公司 | 股权交易 (equity transaction)
https://www-cs.stanford.edu/people/chrismre/papers/deepdive_vlds.pdf
+
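tested successfully; currently it can only extract equity-transaction relations between two entities
word segmentation, word vectorization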
9
Relation extraction | Labeled Span Graph Networks | classification
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference. End-to-end information extraction pipelines; a trained system that uses statistical learning to cope with various forms of noise and imprecision.
https://nbviewer.jupyter.org/github/HazyResearch/deepdive/blob/master/examples/spouse/DeepDive%20Tutorial%20-%20Extracting%20mentions%20of%20spouses%20from%20the%20news.ipynb
http://demo.allennlp.org/semantic-role-labeling/MjUzMzUz
http://barbar.cs.lth.se:8081/
SRL | the CoNLL data could not be downloaded
word segmentation, word vectorization
10
Relation extraction | CoType | classification
CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases
https://drive.google.com/drive/folders/0B--ZKWD8ahE4RmFBTjR6aUJjTkU?usp=sharing
"sentText": "Mohammed Oudeh, a former math teacher who became the mastermind of the deadly attack on Israeli athletes at the 1972 Munich Olympics, died Friday in Damascus..", "articleId": "NYT-ENG-20100704.0043"
"relationMentions": [{"em1Text": "Mohammed Oudeh", "em2Text": "Damascus", "label": "per:country_of_death"}, {"em1Text": "Munich", "em2Text": "Damascus", "label": "None"}, {"em1Text": "Damascus", "em2Text": "Munich", "label": "None"}, {"em1Text": "Mohammed Oudeh", "em2Text": "Munich", "label": "None"}, {"em1Text": "Munich", "em2Text": "Mohammed Oudeh", "label": "None"}], "entityMentions": [{"start": 0, "label": "/person", "text": "Mohammed Oudeh"}, {"start": 1, "label": "/art/film,/art", "text": "Munich"}, {"start": 2, "label": "/location/city,/location", "text": "Damascus"}]
error: Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory

Evaluate on Relation Extraction...
Traceback (most recent call last):
File "code/Evaluation/emb_test.py", line 32, in <module>
We ran Stanford NER on the training set to detect entity mentions -> mapped entity names to Freebase entities using DBpedia Spotlight -> aligned Freebase facts to sentences -> assigned entity types of Freebase entities to their mapped names in sentences
11
Event extraction | JointEE-NN | classification | Convolutional Neural Networks for Event Detection | event extraction | entity type embedding size,
word embedding size,
RNN hidden layers units,
local features,
mini batch size,
Frobenius norms
ACE2005, https://raw.githubusercontent.com/anoperson/jointEE-NN/master/data/fistDoc.nnData4.txt
sentence, model | entity, event | 69.30% | + | not yet tested | word segmentation, word vectorization
12
Keyword/Sentence extraction
SnowNLP | graph probability theory | TextRank | document keyword extraction, document topic extraction
filter, stopwords, POS, minimum occurrence,
keywords number
1、公司深度报告:寻价格涨跌之因、需求之形,论茅台的成长
2、核心观点:茅台价格作为白酒板块的风向标,受到消费需求、投资需求以及公司量价政策的共同影响。
'茅台', '价格' | +
can be applied to document keyword extraction for SAR extraction
word segmentation
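TextRank, named in this row, ranks words by running PageRank over a word co-occurrence graph. The sketch below is a simplified illustration, not SnowNLP's implementation: the function name `textrank_keywords`, the window size, the damping factor, and the toy token list (a hand-picked stand-in for a segmented, stop-word-filtered document) are all assumptions.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, top_k=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the next `window` words that follow it.
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical token stream after segmentation and stop-word filtering.
tokens = ["茅台", "价格", "茅台", "需求", "茅台", "成长", "价格", "政策"]
print(textrank_keywords(tokens, window=1, top_k=2))
```

In this toy graph 茅台 and 价格 are the two best-connected nodes, so they are the two keywords returned; the order between them is not meaningful because their scores are symmetric.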
13
entity linking | DeepType | integrating symbolic information into the reasoning process of a neural network with a type system
DeepType: Multilingual Entity Linking by Neural Type System Evolution
https://github.com/openai/deeptype/blob/master/learning/SentencePredictions.ipynb
discovering types of entity disambiguation | name
batch_size
max_epochs
hidden_sizes
cudnn
anneal_rate
weight_noise
CoNLL (YAGO)
Napoleon was the emperor of the First French Empire. He was defeated at Waterloo by Wellington and [[Blücher]]. He was banned to Saint Helena, died of stomach cancer, and was buried at Invalides.
Napoleon [https://en.wikipedia.org/wiki/Napoleon] was the emperor of the First French Empire. He was defeated at Waterloo [http://en.wikipedia.org/wiki/Battle%20of%20Waterloo] by Wellington [http://en.wikipedia.org/wiki/Arthur%20Wellesley%2C%201st%20Duke%20of%20Wellington] and Blücher [Gebhard Leberecht von Blücher]. He was banned to Saint Helena [Saint Helena], died of stomach cancer, and was buried at Invalides [Les Invalides].
Possible accuracies of 98.6-99% on two benchmark tasks, CoNLL (YAGO) and the TAC KBP 2010 challenge.
+
Too many steps to run; the error occurs at: To use the saved graph projection on wikipedia data to test out how discriminative this classification is (Oracle performance) (edit the config file to make changes to the classification used):

export DATA_DIR=data/
python3 extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}
issue reported
Get wikiarticle -> wikidata mapping (all languages) + Get anchor tags, redirections, category links, statistics (per language). To store all wikidata ids, their key properties (instance of, part of, etc..), and a mapping from all wikipedia article names to a wikidata id do as follows, along with wikipedia anchor tags and links, in three languages: English (en), French (fr), and Spanish (es)
14
entity linking | EARL | probability theory
Joint Entity and Relation Linking for Question Answering
https://figshare.com/articles/Full_Annotated_LC_QuAD_dataset/5782197
LC-QuAD
"question": "Which comic characters are painted by Bill Finger?",
"predicate mapping": [
{
"label": "painted by",
"uri": "http://dbpedia.org/ontology/creator"
},
{
"label": "comic characters",
"uri": "http://dbpedia.org/ontology/ComicsCharacter"
}
]
0.737 | cannot run
15
entity linking | deep-ed | probability theory
Deep Joint Entity Disambiguation with Local Neural Attention
http://people.inf.ethz.ch/ganeao/emnlp17_poster.pdf | entity embedding
learning rate
attention
AIDA-CoNLL
Napoleon was the emperor of the First French Empire. He was defeated at Waterloo by Wellington and [[Blücher]]. He was banned to Saint Helena, died of stomach cancer, and was buried at Invalides.
Napoleon [https://en.wikipedia.org/wiki/Napoleon] was the emperor of the First French Empire. He was defeated at Waterloo [http://en.wikipedia.org/wiki/Battle%20of%20Waterloo] by Wellington [http://en.wikipedia.org/wiki/Arthur%20Wellesley%2C%201st%20Duke%20of%20Wellington] and Blücher [Gebhard Leberecht von Blücher]. He was banned to Saint Helena [Saint Helena], died of stomach cancer, and was buried at Invalides [Les Invalides].
92.22%
running ./install.sh inside torch fails
16
entity linking | neural-el | probability theory
Entity Linking via Joint Encoding of Types, Descriptions, and Context
http://cogcomp.org/files/presentations/Poster_GuptaSiRo17.pdf
entity, types embedding size,
LSTM size,
document-context encoder size
document context vocabulary strings number
dropout
Adam optimization
learning rate
mini-batches size
AIDA-CoNLL
“12th Asian Nations Cup finals are hosted by Lebanon until this
October 29th.”
Model CD : Lebanon_Football_Team
Model CT, CDTE : Lebanon (correct)
81.80% | does not run on python3.4
17
entity linking | mulrel-nel | probability theory
Improving Entity Linking by Modeling Latent Relations between Mentions
https://docs.google.com/document/d/1UlekWlzN54E6Mn6C1aNn1Zf4T9EFMm8hwJ0VgGHJgFM/edit
https://drive.google.com/file/d/1IDjXFnNnHf__MO5j_onw4YwR97oS8lAy/view
EU rejects German call to boycott British lamb.
EU B EU --NME--
rejects
German B German Germany http://en.wikipedia.org/wiki/Germany 11867 /m/0345h
call
to
boycott
British B British United_Kingdom http://en.wikipedia.org/wiki/United_Kingdom 31717 /m/07ssc
lamb
.
aida-A micro F1: 0.9224506836447135
aida-B micro F1: 0.9311036789297659
msnbc micro F1: 0.9441469013006886
aquaint micro F1: 0.8867132867132868
ace2004 micro F1: 0.89738430583501
clueweb micro F1: 0.7784764642472153
wikipedia micro F1: 0.7799718955698544
The only one that can be used locally. Unfortunately, the source code doesn't include a part for handling raw data. You have to convert your raw text into the format of the train/test files. To do so, you have to do two steps:

NER your text (using e.g. Stanford CoreNLP, spaCy)
then use the source code from https://github.com/dalab/deep-ed to convert to the correct format. But, you need to modify it a little bit.
Our next version will include a piece of source code to do that. But you have to wait, maybe some months.

NER your text (using e.g. Stanford CoreNLP, spaCy) ->
then use the source code from https://github.com/dalab/deep-ed to convert to the correct format.
18
entity linking | TAGME | API
TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)
https://tagme.d4science.org/tagme/wiki-disamb30
On this day 24 years ago Maradona scored his infamous "Hand of God" goal against England in the quarter-final of the 1986
On this day 24 years ago Maradona[http://en.wikipedia.org/wiki/Diego%20Maradona] scored his infamous "Hand of God"[http://en.wikipedia.org/wiki/Argentina%20v%20England%20(1986%20FIFA%20World%20Cup)] goal against England[http://en.wikipedia.org/wiki/England%20national%20football%20team] in the quarter-final[http://en.wikipedia.org/wiki/2010%20FIFA%20World%20Cup%20knockout%20stage] of the 1986
0.775 | a fairly popular API
currently handles only English, German, and Italian
text -> annotated text object
19
entity linking | jist2016 | Web table entity linking with KB
Entity Linking in Web Tables with Multiple Linked Knowledge Bases
SVM parameter, KBs | 电影 类型 导演 (movie, genre, director)
纳尼亚传奇 科幻片 迈克尔·艾普特
电影 类型 导演
纳尼亚传奇|纳尼亚传奇[小说改编系列电影] 科幻片|科幻片 迈克尔·艾普特|
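The author's code still has many bugs, and no other libraries for table entity linking are available online yet; an issue has been filed with the author and a reply is pending.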
20
Document classification | gensim | probabilistic modeling | Document-term matrix LDA, TF-IDF | document similarity comparison | "我在九寨沟,很喜欢", "我在九寨沟" | 0.94496047 | +
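Document similarity of this kind is typically a cosine score between document vectors. The sketch below is a simplified stand-in that uses raw bag-of-words counts rather than gensim's TF-IDF or LDA models, so it will not reproduce the 0.94496047 reported in this row; the function name `cosine_similarity` and the hand-made segmentation of the two sentences are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical segmentation of the two example sentences.
doc_a = ["我", "在", "九寨沟", "很", "喜欢"]
doc_b = ["我", "在", "九寨沟"]
print(round(cosine_similarity(doc_a, doc_b), 4))
```

The score is 1.0 for identical token counts and 0.0 when the documents share no tokens; weighting schemes such as TF-IDF change the vector entries but not the cosine formula.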
21
Part-of-speech tagging | StanfordNLP | classification | word-based sequence labeling | data extraction | model, text | Part-of-Speech Tagger | + | written in Java, not a priority
22
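PDF document conversion | pdfminer | data processing | data cleaning, regex conversion | document conversion | max pages, output format | PDF document | plain text | + | can be used for Chinese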
23
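Knowledge Graph | OpenCN | information extraction/storage | information extraction, information storage | Neo4j,
Cypher | information query, disambiguation
model training for user-selected domains, information query, robo-advisory, company knowledge graph | text | nodes, relations | +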
24
Web crawler | requests | data extraction/storage | web crawling | information search | keywords, websites | structured data | + | firewall-bypass issue (solved), CAPTCHA issue
25
extract pdf meta | pdfx | extract reference and title | http://pdfx.cs.man.ac.uk/ | sentence-level tags
DOIs for references
PDF document | XML, HTML, ARCHIVE | -
26
extract pdf structure | pdf-extract | various region extraction
It performs structural analysis to determine column bounds, headers, footers, sections, titles and so on. It can analyse and categorise sections into reference and non-reference sections and can split reference sections into individual references.
-
27
text mining | GROBID | classification | machine learning of sentence classification | scholarly documents, PDF | text metadata | +
28
Mathematical equation recognition
MathOCR
image preprocessing, layout analysis and character recognition, especially the ability to recognize mathematical expression
http://mathocr.sourceforge.net/report.pdf | png | LaTeX | - | recognition accuracy is poor for non-standard fonts
29
Knowledge Graph Embedding
OpenKE | probability theory
Simple interfaces to configure various training environments and classical models
TensorFlow
TransD: Knowledge Graph Embedding via Dynamic Mapping Matrix
Link Prediction, triple classification
TransE, TransH, TransR, TransD, RESCAL, DistMult, ComplEx and HolE.
(entity 1, entity 2) | link | +
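TransE, the first model in the list above, scores a triple (head, relation, tail) by how close head + relation lands to tail in embedding space; link prediction then ranks candidate entities by that distance. The sketch below illustrates the scoring rule only and is not OpenKE's API: the function names and the 2-dimensional toy embeddings are hypothetical, and training (learning the vectors) is omitted.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a smaller value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def predict_tail(head, relation, entities):
    """Link prediction: rank candidate tails by ascending TransE distance."""
    return sorted(entities, key=lambda name: transe_score(head, relation, entities[name]))


# Hypothetical hand-set embeddings; a trained model would learn these.
entities = {
    "Beijing": [1.0, 0.0],
    "China": [1.0, 1.0],
    "Paris": [-1.0, 0.0],
    "France": [-1.0, 1.0],
}
capital_of = [0.0, 1.0]

print(predict_tail(entities["Paris"], capital_of, entities)[0])
```

Triple classification uses the same score with a threshold: a triple is accepted when its distance falls below a relation-specific cutoff.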
30
Event Embedding | HEBE | distributed representation
Large-Scale Embedding Learning in Heterogeneous Event Data
phrase#embedding jiawei_han#jialu kdd p1
clustering#embedding jiawei_han#eric kdd p2
phrase#eric fangbo#eric kdd p3 1
phrase#eric kdd p3
4 200
phrase 0.002001 0.002210 -0.001915
embedding -0.002217 -0.001630 0.001743
clustering -0.002353 -0.000528 -0.000986
eric 0.002218 -0.001360 0.000523
31
argument mining
Multi-Task Learning for Argumentation Mining in Low-Resource Settings
32
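question answering, machine translation, document summarization, semantic parsing, sentiment analysis, relation extraction, natural language inference, semantic role labeling | pytorch | QA, classification | data transfer learning | question answering | word embedding, biLSTM, self attention | context: 近年来,深度学习技术上有了许多的提高,学者们提出了ReLU 激活函数、Dropout、Batch Normalization、残差神经网络等新的模型和技术。 (In recent years deep learning techniques have improved considerably; researchers have proposed new models and techniques such as the ReLU activation function, Dropout, Batch Normalization, and residual neural networks.) question: 深度学习策略应用了哪些技术? (Which techniques do deep learning strategies apply?) | answer: ReLU 激活函数、Dropout、Batch Normalization、残差神经网络
can be found in paper | runs on part of the author's dataset. Note: I modified the IO part of the code and debugged it successfully; given a corpus, it runs. It has not yet been trained to check the results. | word vectorization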
33
Sentiment Classification | tsinghua website | classification
Sentiment Classification with User and Product Attention
sentiment classification | word embedding, LSTM, user/product attention
can be found in github
userid:0ur2480402/; , productid: \tt0085809; context: this is a stunningly beautiful movie . <sssss> the music by phillip glass is just a work of pure genius . <sssss> i can watch this movie again and again . <sssss> the final sequence of the legend 's judgment where the container falls from the sky is just unbelievable . <sssss> how was it filmed ? <sssss> it 's so amazing . <sssss> if you have not seen this film watch it - again and again ! <sssss> this must be the only movie which in a powerful way | sentiment: 10 | can be found in github | runs (on the author's dataset) | word vectorization
34
Text Summarization | tsinghua website | Summarization | (related paper) Neural Machine Translation by Jointly Learning to Align and Translate
http://localhost:8888/notebooks/TensorFlow-Summarization.ipynb
summarization | word embedding, GRU
harvardnlp/sent-summary.
context: us business leaders lashed out wednesday at legislation that would penalize companies for employing illegal immigrants . | title: us business attacks tough immigration law
https://docs.google.com/document/d/1L-EJ_Byf4iyi8S6MfQaolIH-GHXJKk71CP33K_FR1fw/edit
runs (on the author's data)