NLP libraries

1	NLP Task (Usecase Category)	library	feature	paper	jupyter notebook example	example	model parameter	data url	input	output	experimental results	quality	remark	pipeline

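2	Chinese word segmentation	jieba	probability	dictionary, word-frequency statistics	https://github.com/ChiLunHuang/DataAnalysis1-in-Python-with-jieba/blob/master/Data%20Analysis%20in%20Python%20with%20jieba.ipynb	jieba word segmentation	Custom dictionary; stop words; precise mode (tries to cut the sentence as accurately as possible, suited to text analysis); full mode (scans out every possible word in the sentence, very fast but cannot resolve ambiguity); search-engine mode (re-segments long words on top of precise mode to raise recall, suited to search-engine tokenization).		我来到北京清华大学	我/ 来到/ 北京/ 清华大学	No experimental results on a standard dataset were found	+	The best Chinese word segmenter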
3	Word vector representation	gensim	distributed representation, probabilistic modeling (Euclidean embedding)	words or phrases from the vocabulary are mapped to a continuous vector space; softmax		synonym lookup	sliding window, vector dimension, minimum occurrence frequency, Skip-gram or CBOW model	https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb	"Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of EPS", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"	word vector matrix: [ human 0.122 0.566 ... machine 0.655 0.444 ... ...]	No experimental results on a standard dataset were found	+	Synonym lookup is not very accurate; a topic-specific corpus is needed for training	word segmentation, stop-word removal
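4	embedding visualization	tensorboard	dimensionality reduction	nonlinear dimensionality reduction; high-dimensional word vectors are mapped to 2D or 3D space		embedding vector visualization	t-SNE or PCA		word vector matrix, dimension	2D or 3D matrix		+	Can be used for graphical display	word vectorization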
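5	Sentiment analysis	SnowNLP	Bayesian classification	SVM, HMM, naive Bayes, maximum entropy, K-NN, dictionary		Twitter sentiment analysis	model training method, word segmentation method	https://github.com/isnowfy/snownlp/blob/master/snownlp/sentiment/pos.txt https://github.com/isnowfy/snownlp/blob/master/snownlp/sentiment/neg.txt	总之，前两位作者写得比较好 小熊宝宝我觉得孩子不喜欢，能换别的吗	positive negative	No experimental results on a standard dataset were found	-	The library's training data does not include stock-market news; sentiment-labelled stock-market news is needed for training	word segmentation, stop-word removal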
6	Entity extraction	NER_IDCNN_CRF	classification	End to End Chinese Named Entity Recognition by Iterated Dilated Convolution Neural Networks with Conditional Random Field layer		entity extraction	IDCNN+CRF or BiLSTM+CRF	https://github.com/crownpku/Information-Extraction-Chinese/blob/master/NER_IDCNN_CRF/data/example.test	Donald Trump is the president of U.S. I live in Florida. SEC is an organization.	Donald Trump(Person) Florida(Place) SEC(Organization)		+	Applicable to Chinese entity extraction. Currently supports only person, place, and organization names; extracting other types requires adding training data for those types	word segmentation, word vectorization
7	Relation extraction	RE_BGRU_2ATT	classification	Bi-directional GRU with Word and Sentence Dual Attentions for End-to-End Relation Extraction		relation extraction	word embedding...	https://github.com/crownpku/Information-Extraction-Chinese/blob/master/RE_BGRU_2ATT/origin_data/test.txt	鲁迅鲁瑞 -人物事迹鲁迅的父亲鲁瑞，母亲周伯宜鲁迅后来曾说，因为母亲要看书，他必须到处搜集小说，而且老人	父母鲁迅鲁瑞		+	Applicable to Chinese relation extraction. Currently supports only the relations present in the training data; extracting other types requires adding training data for those types	word segmentation, word vectorization
8	Relation extraction	DeepDive	classification	DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference. End-to-end information extraction pipelines, a trained system that uses statistical learning to cope with various forms of noise and imprecision.	https://nbviewer.jupyter.org/github/HazyResearch/deepdive/blob/master/examples/spouse/DeepDive%20Tutorial%20-%20Extracting%20mentions%20of%20spouses%20from%20the%20news.ipynb	information extraction	What are the variables of interest that we want DeepDive to predict for us? What are the features for each of these variables? What are the connections between the variables?	https://pan.baidu.com/s/1slLpYVz	1201734370,证券代码:600969 证券简称:郴电国际编号:公告临 2015-033 湖南郴电国际发展股份有限公司为郴州市城市建设投资发展集团有限公司提供担保公告本公司董事会及全体董事保证本公告内容不存在任何虚假记载、误导性陈述或者重大遗漏，并对其内容的真实性、准确性和完整性承担个别及连带责任。如有董事对临时公告内容的真实性、准确性和完整性无法保证或存在异议的，公司应当在公告中作特别提示。重要内容提示: ●被担保人名称:郴州市城市建设投资发展集团有限公司(以下简称“郴州城投公司”)。 ●本次担保金额及累计为其提供的担保余额:本次对外担保金额为 4500 万元人民币，截止本公告日，本公司对郴州城投公司的担保除本次担保外不存在其他任何担保。●本次担保是否有反担保:无。●对外担保逾期的累计数量:截至本公告日，本公司无逾期担保事项。一、担保情况概述经本公司第五届董事会第十四次会议审议通过，同意为郴州市城市建设投资发展集团有限公司向国家开发银行湖南省分行申请国家第二批专项建设基金借款 4500 万元人民币的贷款事项提供全额、全程连带责任保证担保，本次担保的借款款项用于本公司 2015 年城镇配电网建设改造工程项目。二、被担保人基本情况(一)被担保人是郴州市城市建设投资发展集团有限公司,注册地点:郴州市北湖区五岭大道 1 号(市政府机关政务文化休闲中心 7 楼),法定代表人为刘建国,经营范围:城市基础设施建设项目投资、融资及相关的配套服务,农、林、水项目投资、开发建设及相关的配套服务,房地产开发经营,土地一级开发及整理。(国家禁止经营的除外，涉及行政许可的凭证可证经营)。截止 2015 年 6 月 29 日郴州城投公司的信用等级为双 A 级，截止 2014 年 12 月 31 日，资产总额为 2，888，924.53 万元，资产净额为 1，618，965，52 万元，资产负债率为 43.61%，营业收入为 180，524.54 万元，利润总额为 61，656.39 万元。(二)详细说明被担保人与上市公司关联关系或其他关系。本公司与郴州城投公司无关联关系。三、担保协议的主要内容本次担保的方式是提供连带责任保证，担保范围:根据主合同的约定，借款人向债权人借款 4500 万元人民币，借款期限 15 年(即 2015 年 10 月 30 日至 2030 年 10 月 29 日止)。保证人愿意就借款人偿付主合同项下全部借款本金、利息、罚息、复利、补偿金、违约金、损害赔偿金和实现债权的费用向债权人提供担保。本合同的保证期间为主合同下债务履行届满之日起两年。四、董事会意见为争取国家专项建设基金，支持项目建设，经国家发改委批准，公司 2015 年城市电网改造项目列入了国家第二批专项建设基金支持项目，由国家开发银行湖南分行发放建设基金借款 4500 万元，期限 15 年，年利率 1.2%。按照国家对专项建设基金发放的有关规定，须由郴电国际的股东向国家开发银行借款，专项用于 2015 年郴电国际城镇配电网建设改造工程项目。郴州市政府旗下的郴州城投公司，是国家开发银行湖南分行的信贷客户，在银行存在着良好的信用，符合国家开发银行放贷条件。郴州城投公司同意为本公司承贷 4500 万元借款，但须由本公司对此项借款提供全额、全程连带责任保证担保。本次担保的借款款项是用于本公司 2015 年城镇配电网建设改造工程项目。故本公司董事会同意本次担保事项。本公司的独立董事对本次担保事项发表了独立意见如下:我们认为，本次担保的借款款项是用于本公司 2015 年城镇配电网建设改造工程项目，此次担保事项符合公司章程及相关规定的要求。我们同意公司为上述贷款项目进行担保。五、累计对外担保数量及逾期担保的数量截至公告披露日上市公司及其控股子公司均无对外担保，本次对外担保 4500 万元人民币占 2014 年 12 月 31 日本公司经审计归属于上市公司股东净资产的 1.79%，截止目前，本公司无逾期担保。六、上网公告附件公司第五届董事会第十四次会议决议。特此公告。湖南郴电国际	湖南郴电国际发展股份有限公司郴州市城市建设投资发展集团有限公司股权交易	https://www-cs.stanford.edu/people/chrismre/papers/deepdive_vlds.pdf	+	Tested successfully; currently extracts only equity-transaction relations between two entities	word segmentation, word vectorization
9	Relation extraction	Labeled Span Graph Networks	classification	DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference. End-to-end information extraction pipelines, a trained system that uses statistical learning to cope with various forms of noise and imprecision.	https://nbviewer.jupyter.org/github/HazyResearch/deepdive/blob/master/examples/spouse/DeepDive%20Tutorial%20-%20Extracting%20mentions%20of%20spouses%20from%20the%20news.ipynb	http://demo.allennlp.org/semantic-role-labeling/MjUzMzUz http://barbar.cs.lth.se:8081/							The SRL CoNLL data could not be downloaded	word segmentation, word vectorization
10	Relation extraction	CoType	classification	CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases				https://drive.google.com/drive/folders/0B--ZKWD8ahE4RmFBTjR6aUJjTkU?usp=sharing	"sentText": "Mohammed Oudeh, a former math teacher who became the mastermind of the deadly attack on Israeli athletes at the 1972 Munich Olympics, died Friday in Damascus..", "articleId": "NYT-ENG-20100704.0043"	"relationMentions": [{"em1Text": "Mohammed Oudeh", "em2Text": "Damascus", "label": "per:country_of_death"}, {"em1Text": "Munich", "em2Text": "Damascus", "label": "None"}, {"em1Text": "Damascus", "em2Text": "Munich", "label": "None"}, {"em1Text": "Mohammed Oudeh", "em2Text": "Munich", "label": "None"}, {"em1Text": "Munich", "em2Text": "Mohammed Oudeh", "label": "None"}], "entityMentions": [{"start": 0, "label": "/person", "text": "Mohammed Oudeh"}, {"start": 1, "label": "/art/film,/art", "text": "Munich"}, {"start": 2, "label": "/location/city,/location", "text": "Damascus"}]			error: Learn CoType embeddings... code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory Evaluate on Relation Extraction... Traceback (most recent call last): File "code/Evaluation/emb_test.py", line 32, in <module>	We ran Stanford NER on training set to detect entity mentions-> mapped entity names to Freebase entities using DBpediaSpotlight-> aligned Freebase facts to sentences-> assign entity types of Freebase entities to their mapped names in sentences
11	Event extraction	JointEE-NN	classification	Convolutional Neural Networks for Event Detection		event extraction	entity type embedding size, word embedding size, RNN hidden layers units, local features, mini batch size, Frobenius norms	ACE2005, https://raw.githubusercontent.com/anoperson/jointEE-NN/master/data/fistDoc.nnData4.txt	sentence, model	entity, event	69.30%	+	Not yet tested	word segmentation, word vectorization
12	Keyword/Sentence extraction	SnowNLP	graph probability theory	TextRank		document keyword extraction, document topic extraction	filter, stopwords, POS, minimum occurrence, keywords number		1、公司深度报告：寻价格涨跌之因、需求之形，论茅台的成长 2、核心观点：茅台价格作为白酒板块的风向标，受到消费需求、投资需求以及公司量价政策的共同影响。	'茅台', '价格'		+	Can be applied to document keyword extraction for SAR extraction	word segmentation
13	entity linking	Deeptype	integrating symbolic information into the reasoning process of a neural network with a type system	DeepType: Multilingual Entity Linking by Neural Type System Evolution	https://github.com/openai/deeptype/blob/master/learning/SentencePredictions.ipynb	discovering types of entity disambiguation	name batch_size max_epochs hidden_sizes cudnn anneal_rate weight_noise	CoNLL (YAGO)	Napoleon was the emperor of the First French Empire. He was defeated at Waterloo by Wellington and [[Blücher]]. He was banned to Saint Helena, died of stomach cancer, and was buried at Invalides.	Napoleon [https://en.wikipedia.org/wiki/Napoleon] was the emperor of the First French Empire. He was defeated at Waterloo [http://en.wikipedia.org/wiki/Battle%20of%20Waterloo] by Wellington [http://en.wikipedia.org/wiki/Arthur%20Wellesley%2C%201st%20Duke%20of%20Wellington] and Blücher [Gebhard Leberecht von Blücher]. He was banned to Saint Helena [Saint Helena], died of stomach cancer, and was buried at Invalides [Les Invalides].	Possible accuracies of 98.6-99% on two benchmark tasks, CoNLL (YAGO) and the TAC KBP 2010 challenge.	+	Too many steps to run; the error occurs at: To use the saved graph projection on wikipedia data to test out how discriminative this classification is (Oracle performance) (edit the config file to make changes to the classification used): export DATA_DIR=data/ python3 extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR} An issue has been reported.	Get wikiarticle -> wikidata mapping (all languages) + Get anchor tags, redirections, category links, statistics (per language). To store all wikidata ids, their key properties (instance of, part of, etc..), and a mapping from all wikipedia article names to a wikidata id do as follows, along with wikipedia anchor tags and links, in three languages: English (en), French (fr), and Spanish (es)
14	entity linking	EARL	probability theory	Joint Entity and Relation Linking for Question Answering		https://figshare.com/articles/Full_Annotated_LC_QuAD_dataset/5782197		LC-QuAD	"question": "Which comic characters are painted by Bill Finger?",	"predicate mapping": [ { "label": "painted by", "uri": "http://dbpedia.org/ontology/creator" }, { "label": "comic characters", "uri": "http://dbpedia.org/ontology/ComicsCharacter" } ]	0.737		Could not be run
15	entity linking	deep-ed	probability theory	Deep Joint Entity Disambiguation with Local Neural Attention		http://people.inf.ethz.ch/ganeao/emnlp17_poster.pdf	entity embedding learning rate attention	AIDA-CoNLL	Napoleon was the emperor of the First French Empire. He was defeated at Waterloo by Wellington and [[Blücher]]. He was banned to Saint Helena, died of stomach cancer, and was buried at Invalides.	Napoleon [https://en.wikipedia.org/wiki/Napoleon] was the emperor of the First French Empire. He was defeated at Waterloo [http://en.wikipedia.org/wiki/Battle%20of%20Waterloo] by Wellington [http://en.wikipedia.org/wiki/Arthur%20Wellesley%2C%201st%20Duke%20of%20Wellington] and Blücher [Gebhard Leberecht von Blücher]. He was banned to Saint Helena [Saint Helena], died of stomach cancer, and was buried at Invalides [Les Invalides].	92.22%		Running ./install.sh inside torch fails
16	entity linking	neural-el	probability theory	Entity Linking via Joint Encoding of Types, Descriptions, and Context		http://cogcomp.org/files/presentations/Poster_GuptaSiRo17.pdf	entity, types embedding size, LSTM size, document-context encoder size document context vocabulary strings number dropout Adam optimization learning rate mini-batches size	AIDA-CoNLL	“12th Asian Nations Cup finals are hosted by Lebanon until this October 29th.”	Model CD : Lebanon_Football_Team Model CT, CDTE : Lebanon (correct)	81.80%	python3.4	Could not be run
17	entity linking	mulrel-nel	probability theory	Improving Entity Linking by Modeling Latent Relations between Mentions			https://docs.google.com/document/d/1UlekWlzN54E6Mn6C1aNn1Zf4T9EFMm8hwJ0VgGHJgFM/edit	https://drive.google.com/file/d/1IDjXFnNnHf__MO5j_onw4YwR97oS8lAy/view	EU rejects German call to boycott British lamb.	EU B EU --NME-- rejects German B German Germany http://en.wikipedia.org/wiki/Germany 11867 /m/0345h call to boycott British B British United_Kingdom http://en.wikipedia.org/wiki/United_Kingdom 31717 /m/07ssc lamb .	aida-A micro F1: 0.9224506836447135 aida-B micro F1: 0.9311036789297659 msnbc micro F1: 0.9441469013006886 aquaint micro F1: 0.8867132867132868 ace2004 micro F1: 0.89738430583501 clueweb micro F1: 0.7784764642472153 wikipedia micro F1: 0.7799718955698544		The only one that can be used locally. Unfortunately, the source code doesn't include a part for handling raw data. You have to convert your raw text into the format of the train/test files. To do so, you have to do two steps: NER your text (using e.g. Stanford CoreNLP, spaCy) then use the source code from https://github.com/dalab/deep-ed to convert to the correct format. But, you need to modify it a little bit. Our next version will include a piece of source code to do that. But you have to wait, maybe some months.	NER your text (using e.g. Stanford CoreNLP, spaCy) -> then use the source code from https://github.com/dalab/deep-ed to convert to the correct format.
18	entity linking	TAGME	API	TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)		https://tagme.d4science.org/tagme/		wiki-disamb30	On this day 24 years ago Maradona scored his infamous "Hand of God" goal against England in the quarter-final of the 1986	On this day 24 years ago Maradona[http://en.wikipedia.org/wiki/Diego%20Maradona] scored his infamous "Hand of God"[http://en.wikipedia.org/wiki/Argentina%20v%20England%20(1986%20FIFA%20World%20Cup)] goal against England[http://en.wikipedia.org/wiki/England%20national%20football%20team] in the quarter-final[http://en.wikipedia.org/wiki/2010%20FIFA%20World%20Cup%20knockout%20stage] of the 1986	0.775	A fairly popular API	Currently handles only English, German, and Italian	text->annotated text object
19	entity linking	jist2016	Web table entity linking with KB	Entity Linking in Web Tables with Multiple Linked Knowledge Bases			SVM parameter, KBs		电影类型导演纳尼亚传奇科幻片迈克尔·艾普特	电影类型导演纳尼亚传奇\|纳尼亚传奇[小说改编系列电影] 科幻片\|科幻片迈克尔·艾普特\|			The author's code still has many bugs, and no other table entity linking libraries were found online; an issue has been filed and a reply is awaited.
20	Document classification	gensim	probabilistic modeling	Document-term matrix LDA, TF-IDF		document similarity comparison			"我在九寨沟，很喜欢","我在九寨沟"	0.94496047		+
21	Part-of-speech tagging	StanfordNLP	classification	word-based sequence labeling		data extraction			model, text	part-of-speech tags		+	Written in Java; not a priority
22	PDF document conversion	pdfminer	data processing	data cleaning, regex-based conversion		document conversion	max pages, output format		PDF document	plain text		+	Works with Chinese
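23	Knowledge Graph	OpenCN	information extraction/storage	information extraction, storage in Neo4j, querying with Cypher, disambiguation		model training for a user-selected domain, information query, robo-advisory, company knowledge graph			text	nodes, relations		+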
24	Web crawler	requests	data extraction/storage	web crawling		information search			keywords, websites	structured data		+	Firewall-circumvention issue (solved); CAPTCHA issue
25	extract pdf meta	pdfx	extract reference and title			http://pdfx.cs.man.ac.uk/	sentence-level tags DOIs for references		PDF document	XML, HTML, ARCHIVE		-
26	extract pdf structure	pdf-extract	various region extraction	It performs structural analysis to determine column bounds, headers, footers, sections, titles and so on. It can analyse and categorise sections into reference and non-reference sections and can split reference sections into individual references.								-
27	text mining	GROBID	classification	Machine learning of sentence classification					scholar documents, PDF	text metadata		+
28	Mathematical equation recognition	MathOCR	image preprocessing, layout analysis and character recognition, especially the ability to recognize mathematical expression	http://mathocr.sourceforge.net/report.pdf					png	LaTeX		-	Accuracy is poor on non-standard fonts
29	Knowledge Graph Embedding	OpenKE	probability theory	Simple interfaces to configure various training environments and classical models TensorFlow TransD: Knowledge Graph Embedding via Dynamic Mapping Matrix		Link Prediction, triple classification	TransE, TransH, TransR, TransD, RESCAL, DistMult, ComplEx and HolE.		(entity 1, entity 2)	link		+
30	Event Embedding	HEBE	distributed representation	Large-Scale Embedding Learning in Heterogeneous Event Data					phrase#embedding jiawei_han#jialu kdd p1 clustering#embedding jiawei_han#eric kdd p2 phrase#eric fangbo#eric kdd p3 1 phrase#eric kdd p3	4 200 phrase 0.002001 0.002210 -0.001915 embedding -0.002217 -0.001630 0.001743 clustering -0.002353 -0.000528 -0.000986 eric 0.002218 -0.001360 0.000523
31	argument mining			Multi-Task Learning for Argumentation Mining in Low-Resource Settings
32	question answer,machine translation,document summarization,semantic parsing,sentiment analysis,relation extraction,natural language inference,semantic role labeling	pytorch	QA,classification	data transfer learning		question answer	word embedding,biLSTM,self attention		context: 近年来，深度学习技术上有了许多的提高，学者们提出了ReLU 激活函数、Dropout、Batch Normalization、残差神经网络等新的模型和技术。question:深度学习策略应用了哪些技术？	answer:ReLU 激活函数、Dropout、Batch Normalization、残差神经网络	can be found in paper		Ran successfully on part of the author's datasets. Note: I modified the I/O part of the code and debugged it successfully; it runs on any given corpus. It has not yet been trained to check the results.	word vectorization
33	Sentiment Classification	tsinghua website	classification	Sentiment Classification with User and Product Attention		sentiment classification	word embedding,LSTM,user/product attention	can be found in GitHub	userid:0ur2480402/； , productid: \tt0085809； context: this is a stunningly beautiful movie . <sssss> the music by phillip glass is just a work of pure genius . <sssss> i can watch this movie again and again . <sssss> the final sequence of the legend 's judgment where the container falls from the sky is just unbelievable . <sssss> how was it filmed ? <sssss> it 's so amazing . <sssss> if you have not seen this film watch it - again and again ! <sssss> this must be the only movie which in a powerful way	sentiment:10	can be found in GitHub		Runs successfully (on the author's dataset)	word vectorization
34	Text Summarization	tsinghua website	Summarization	(related paper) Neural Machine Translation by Jointly Learning to Align and Translate	http://localhost:8888/notebooks/TensorFlow-Summarization.ipynb	summarization	word embedding,GRU	harvardnlp/sent-summary.	context:us business leaders lashed out wednesday at legislation that would penalize companies for employing illegal immigrants .	title:us business attacks tough immigration law	https://docs.google.com/document/d/1L-EJ_Byf4iyi8S6MfQaolIH-GHXJKk71CP33K_FR1fw/edit		Runs successfully (on the author's data)
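
Row 2 of the table describes two jieba modes: precise mode, which cuts the sentence as accurately as possible, and full mode, which scans out every possible word. The sketch below illustrates those two ideas in plain Python. It is not jieba's implementation or API: the dictionary `TOY_FREQ` and its counts are invented for illustration, and `full_mode` and `precise_mode` are hypothetical names. Precise mode is approximated here as a dynamic-programming search for the segmentation with the highest summed log-probability, and full mode lists only multi-character dictionary words.

```python
import math

# Toy dictionary: word -> count. These counts are invented for illustration only.
TOY_FREQ = {
    "我": 50, "来": 20, "到": 20, "来到": 30, "北": 5, "京": 5, "北京": 40,
    "清": 5, "华": 5, "清华": 20, "大": 10, "学": 10, "大学": 30,
    "清华大学": 25, "华大": 1,
}
TOTAL = sum(TOY_FREQ.values())
MAX_WORD_LEN = max(len(word) for word in TOY_FREQ)


def full_mode(sentence):
    """List every dictionary word of two or more characters found in the sentence."""
    found = []
    for start in range(len(sentence)):
        longest_end = min(len(sentence), start + MAX_WORD_LEN)
        for end in range(start + 2, longest_end + 1):
            if sentence[start:end] in TOY_FREQ:
                found.append(sentence[start:end])
    return found


def precise_mode(sentence):
    """Return the one segmentation with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (best score for sentence[i:], end index of the first word in that cut)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            word = sentence[i:j]
            # A single character is always allowed (count 1 if unknown),
            # so every sentence has at least one valid segmentation.
            if word in TOY_FREQ or j == i + 1:
                count = TOY_FREQ.get(word, 1)
                candidates.append((math.log(count / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    # Walk the stored end indices from left to right to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print("/ ".join(precise_mode("我来到北京清华大学")))
print("/ ".join(full_mode("我来到北京清华大学")))
```

With this toy dictionary, `precise_mode("我来到北京清华大学")` yields the same cut as the output column of row 2. A real dictionary, unknown-word handling, and search-engine mode (re-cutting long words from the precise result) would need more than this sketch covers.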