NCHC TWCC: Introduction to Natural Language Processing Methods
李柏翰
leepohan@gmail.com
10 May, 2022
Lorenz Lab
AI & Python website
Making Friends with AI
https://www.openedu.tw/course.jsp?id=876
Fundamentals 2: Hitting the Books
Fundamentals 2: NLP Books
Outline
(I) NCHC Registration (Taiwania 1)
nchc.org.tw
NCHC Personal Information Registration
NCHC (Host Account Setup)
Searching for TWCC (Taiwania 3)
TWCC (Taiwania 3)
TWCC (NCHC Academic Research)
TWCC (Choosing a Container)
TWCC (Development Environment)
TWCC (Billing Unit: 1 GPU)
TWCC (Creating the Container)
TWCC (Container Initialization)
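A container that is not deleted costs at least NT$86 per hour.
Billing stops only when the container is deleted; closing the web page does not stop it.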
TWCC : ln01.twcc.ai
TWCC : Jupyter
Download PuTTY, pscp, and psftp
Copy these files to C:\Windows\
Launch PuTTY from cmd
putty : ln01.twcc.ai
Connecting to TWCC
Host account -> password -> MOTP password
Adjusting the Window Font Size
Transferring Files
DOS folder -> Linux
User> pscp -r folder phlee0514@140.110.148.11:/work1/phlee0514
Linux folder -> DOS
User> pscp -r phlee0514@140.110.148.11:/work1/phlee0514/folder .
Linux file -> DOS
User> pscp phlee0514@140.110.148.11:/work1/phlee0514/folder/file .
(II) Tomas Mikolov
(NIPS Deep Learning Workshop 2013)
NLP: Word2vec
Feedforward Neural Net Language Model
T Mikolov (2013)
Efficient Learning
T Mikolov (2013)
Skip-gram Architecture
T Mikolov (2013)
Continuous Bag-of-words Architecture: CBOW
T Mikolov (2013)
Efficient Learning - Summary
T Mikolov (2013)
Linguistic Regularities in Word Vector Space
T Mikolov (2013)
Linguistic Regularities - Results
T Mikolov (2013)
Linguistic Regularities in Word Vector Space
T Mikolov (2013)
Performance on Rare Words
T Mikolov (2013)
Performance on Rare Words - Results
T Mikolov (2013)
Rare Words - Examples of Nearest Neighbours
T Mikolov (2013)
From Words to Phrases and Beyond
T Mikolov (2013)
Compositionality by Vector Addition
T Mikolov (2013)
Visualization of Regularities in Word Vector Space
T Mikolov (2013)
Machine Translation
Machine Translation - English to Spanish
T Mikolov (2013)
Machine Translation
MT - Accuracy of English to Spanish translation
T Mikolov (2013)
Machine Translation
(III) Hands-on Training: TWCC
Practical Natural Language Processing (自然語言處理最佳實務)
1) https://github.com/practical-nlp/practical-nlp-code.git
Word2vec
2) https://github.com/Alex-CHUN-YU/Word2vec.git
Hands-on Practice with TWCC
Side Project 1:
Chinese-language NLP book
Practical Natural Language Processing (自然語言處理最佳實務), practical-nlp-code-master
Side Project 2:
Chinese NLP with Word2vec
What Is a Perceptron?
W1: weight / W2: weight
b: bias / O: called a "neuron" or "node"
A perceptron receives several input signals, combines them, and then outputs a single signal.
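A minimal sketch of the perceptron described on this slide, written in Python. The slide names only the weights W1 and W2, the bias b, and the neuron O; the step threshold at zero and the specific weight values below are assumptions chosen for illustration (they make the neuron behave like a logical AND gate).

```python
def perceptron(x1, x2, w1, w2, b):
    """Combine two input signals into one weighted sum plus bias,
    then emit a single output signal (1 if the sum is positive, else 0)."""
    total = w1 * x1 + w2 * x2 + b
    return 1 if total > 0 else 0


# Hypothetical parameters (not from the slides): these make the neuron act as AND.
W1, W2, B = 0.5, 0.5, -0.7

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron(x1, x2, W1, W2, B))
```

Only the input pair (1, 1) pushes the weighted sum above zero, so the neuron fires for that case alone. Changing the weights or the bias changes which inputs fire, which is what training adjusts.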
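Deep Learning: Built on the Multilayer Perceptron
Training a digit-recognition AI: how many parameters does it learn?
28x28 = 784 inputs
3. ReLU function (activation function)
4. softmax: converts the values into decimals that can be read as probabilities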
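The two functions named on this slide can be sketched in a few lines of plain Python. This is an illustration only: the input scores are made-up values, and subtracting the maximum inside softmax is a common numerical-stability habit rather than something the slides specify.

```python
import math


def relu(values):
    """ReLU activation: negative values become 0, positive values pass through."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into decimals between 0 and 1 that sum to 1."""
    shift = max(values)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores (not from the slides).
scores = [2.0, -1.0, 0.5]

print(relu(scores))  # [2.0, 0.0, 0.5]

probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))  # the probabilities add up to 1 (within rounding error)
```

The largest score always receives the largest probability, so the predicted digit is simply the position of the maximum in the softmax output.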
Step 1: Change environment
Change environment -> py36 (an empty environment)
>> conda -V
>> conda create --name py36 python=3.6
>> conda activate py36
>> pip list
>> ls
>> pip list
>> conda install -c anaconda ipykernel
>> python -m ipykernel install --user --name=py36
# registers py36 as its own Jupyter kernel
Step 2: Change environment
# Set up the textbook's environment -> saved as the py36 environment
>> pip install scikit-learn==0.21.3
>> pip install matplotlib==3.2.2
>> pip install numpy==1.19.5
>> pip install pandas==1.1.5
>> pip install requests==2.23.0
>> pip install gensim==3.6.0
>> exit  # leave the session
Step 3: run Jupyter
Change environment -> py36 (an empty environment)
>> cd NLP_test/practical-nlp-code-master
>> ls
>> conda activate py36
>> cd Ch3  # or Ch4, ...
# use Jupyter with the py36 kernel to run all the cases
Step 4: run Jupyter
Side Project 2:
Chinese NLP with Word2vec
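Chinese NLP with Word2vec
(PS: if every distinct word is treated as its own dimension and the vocabulary holds 4,000 words, each word vector has 4,000 dimensions. Word2vec can be used to reduce that dimensionality.)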
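The dimensionality point can be illustrated with a toy vocabulary. In the sketch below the four-word vocabulary and the 2-dimensional dense vectors are made-up values for illustration; real Word2vec vectors are learned from a corpus and typically have a few hundred dimensions.

```python
# Toy vocabulary (hypothetical); a real one might hold 4,000 words or more.
vocab = ["爸爸", "媽媽", "老師", "老公"]


def one_hot(word, vocab):
    """One dimension per vocabulary word: 1 at the word's position, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


# Hypothetical dense vectors standing in for what Word2vec would learn.
dense = {
    "爸爸": [0.8, 0.1],
    "媽媽": [0.7, 0.3],
    "老師": [0.2, 0.9],
    "老公": [0.9, 0.2],
}

print(one_hot("老師", vocab))       # [0, 0, 1, 0]
print(len(one_hot("老師", vocab)))  # grows with the vocabulary size
print(len(dense["老師"]))           # stays fixed no matter how many words exist
```

With one-hot vectors every pair of distinct words is equally far apart, whereas dense vectors can place related words close together.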
Side Project 2: Chinese NLP with Word2vec
1. Download the wiki data (see the dataset reference)
>> git clone https://github.com/Alex-CHUN-YU/Word2vec.git
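2. >> cd Word2Vec  # enter the Word2Vec folder
3. >> python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2  # convert the wiki XML to wiki text
4. >> python segmentation.py  # convert Simplified to Traditional Chinese, then segment words and filter stop words; the file is large, so this takes about 30 min
5. >> python train.py  # train and produce the model, about 3 hours. If line 17 raises an error, change `size` to `vector_size` (`size` is the old-version name)
6. >> python main.py  # use the model: enter words to query
Note: if running python in Windows cmd causes encoding problems, run this command: chcp 65001 (switches to UTF-8)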
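The `size` versus `vector_size` error in step 5 comes from a keyword rename in gensim 4.0. The sketch below shows one way to pick the right keyword from the installed version. The helper names (`dimension_keyword`, `word2vec_kwargs`, `train_tiny_model`), the 100-dimension default, and the two-sentence corpus are hypothetical and are not taken from `train.py`.

```python
def dimension_keyword(gensim_version):
    """Return the Word2Vec keyword that sets the vector length for this gensim version."""
    major = int(gensim_version.split(".")[0])
    return "vector_size" if major >= 4 else "size"


def word2vec_kwargs(gensim_version, dimensions=100):
    """Build keyword arguments for Word2Vec that work with the given gensim version."""
    return {dimension_keyword(gensim_version): dimensions, "min_count": 1}


def train_tiny_model(dimensions=10):
    """Train on a made-up two-sentence corpus; requires gensim to be installed."""
    import gensim
    from gensim.models import Word2Vec

    sentences = [["爸爸", "媽媽"], ["老師", "學生"]]
    return Word2Vec(sentences, **word2vec_kwargs(gensim.__version__, dimensions))


print(word2vec_kwargs("3.6.0"))  # {'size': 100, 'min_count': 1}
print(word2vec_kwargs("4.1.2"))  # {'vector_size': 100, 'min_count': 1}
```

Calling `train_tiny_model()` inside the py36 environment then works whether `pip install gensim` resolved to a 3.x or a 4.x release.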
Side Project 2: Chinese NLP with Word2vec
Change environment -> py36 (an empty environment)
>> pip install gensim
>> pip install jieba
>> pip install hanziconv
>> ls
>> cat README.md
>> wget --no-check-certificate https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest
>> python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2
>> python segmentation.py
>> python train.py  # train and produce the model, about 3 hours. If line 17 raises an error, change `size` to `vector_size` (`size` is the old-version name)
>> python main.py  # use the model: enter words to query
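Chinese Word2vec: Word Analysis Results
Output:
1. Enter one word to find its top 5 most similar words
2. Enter two words to compute the similarity between them
3. Enter three words for an analogy: 爸爸 (dad) is to 老公 (husband) as 媽媽 (mom) is to 老婆 (wife)
Input format (e.g. 爸爸,媽媽,.... Note: at most three words)
>>老師
Top 5 words most similar to 老師 (teacher)
班導,0.6360481977462769
班導師,0.6360464096069336
代課,0.6358826160430908
級任,0.6271134614944458
班主任,0.6270170211791992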
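A ranking like the top 5 list for 老師 can be reproduced in miniature by scoring every other word with cosine similarity and sorting. The sketch below is not the code in `main.py`; the 2-dimensional vectors are invented for illustration, so the scores differ from the trained model's.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; real ones come from the trained Word2vec model.
toy_vectors = {
    "老師": [0.9, 0.4],
    "班導": [0.8, 0.5],
    "代課": [0.7, 0.6],
    "爸爸": [0.1, 0.9],
}

for other, score in most_similar("老師", toy_vectors, topn=2):
    print(f"{other},{score}")
```

With these toy values 班導 ranks first and 代課 second, because their vectors point in nearly the same direction as the vector for 老師.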
Chinese NLP: Word2vec Word Analysis
Input format (e.g. 爸爸,媽媽,.... Note: at most three words)
>>爸爸,媽媽
Cosine similarity between the two words
0.780765200371
Input format (e.g. 爸爸,媽媽,.... Note: at most three words)
>>爸爸,老公,媽媽
爸爸 (dad) is to 老公 (husband) as 媽媽 (mom) is to
老婆,0.5401346683502197
蠢萌,0.5245970487594604
夠秤,0.5059393048286438
駁命,0.4888317286968231
孔爵,0.4857243597507477
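The three-word query is commonly computed by vector arithmetic: take 老公 minus 爸爸 plus 媽媽, then look for the nearest remaining word. The sketch below assumes that approach (the slides do not show how `main.py` does it) and uses invented 3-dimensional vectors, so only the mechanics carry over to the real model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors, topn=5):
    """Answer 'a is to b as c is to ?' by ranking words near the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the query words themselves are excluded
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; real ones come from the trained model.
toy_vectors = {
    "爸爸": [1.0, 0.0, 0.2],
    "媽媽": [0.0, 1.0, 0.2],
    "老公": [1.0, 0.0, 1.0],
    "老婆": [0.0, 1.0, 1.0],
    "老師": [0.5, 0.5, 0.0],
}

print(cosine(toy_vectors["爸爸"], toy_vectors["媽媽"]))  # two-word query
for word, score in analogy("爸爸", "老公", "媽媽", toy_vectors):
    print(f"{word},{score}")  # three-word query
```

In this toy space 老婆 comes out on top, mirroring the first line of the model output on the slide.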
What You Learned Today
Thanks for your attention