NCHC TWCC: An Introduction to Natural Language Processing Methods

李柏翰

leepohan@gmail.com

10 May, 2022


Lorenz Lab

AI & Python Learning Resources

  1. Fundamentals: 莫煩Python (Mofan Python) ( https://mofanpy.com/ )
  2. Machine Learning Crash Course (https://developers.google.com/machine-learning/crash-course)
  3. Deep Learning Specialization (https://www.coursera.org/specializations/deep-learning)
  4. Comprehensive Python Cheatsheet; open it and Ctrl+F to find an example of almost anything ( https://gto76.github.io/python-cheatsheet/ )
  5. Self-study resource: 100-Days-Of-ML-Code ( https://github.com/Avik-Jain/100-Days-Of-ML-Code )
  6. Machine learning in practice: AiLearning ( https://github.com/apachecn/AiLearning )
  7. Hung-yi Lee (李宏毅) (https://www.youtube.com/channel/UC2ggjtuuWvxrHHHiaDH1dlQ/featured)
  8. Fast.ai: uses PyTorch as the development tool; it first teaches you to build a model, then goes back and explains why it works that way. (https://www.fast.ai/)
  9. A popular-science introduction to BERT (https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html)
  10. Prof. Hsuan-Tien Lin (林軒田) ( https://www.coursera.org/instructor/htlin )
  11. [Keras Deep Learning] #3 Building a model | HiSKIO online programming tutorial ( https://www.youtube.com/watch?v=JR8bo75U3fc )
  12. TensorFlow Playground ( http://playground.tensorflow.org/ )


「和AI做朋友」(Making Friends with AI), an online course:

https://www.openedu.tw/course.jsp?id=876


Fundamentals 2: Hitting the Books


Fundamentals 2: NLP Books


Outline

  1. Register an NCHC account (Taiwania 1: 140.110.148.11)
  2. Introduction to TWCC computing (Taiwania 3: ln01.twcc.ai)
  3. Connect with PuTTY and transfer files with pscp
  4. Use TWCC to learn NLP
  5. Set up the py36 environment for Jupyter and change the kernel
  6. Use the book's resources to get started with NLP (Natural Language Processing)
  7. Learn Word2vec


(1) NCHC Registration (Taiwania 1)

nchc.org.tw

NCHC registration (Taiwania 1)

NCHC personal data registration

NCHC host account setup

TWCC search (Taiwania 3)

TWCC (Taiwania 3)

TWCC (NCHC academic project)

TWCC (container selection)

TWCC (development environment)

TWCC (billing unit: 1 GPU)

TWCC (create a container)

TWCC (container initialization)

If the container is not deleted, it costs at least NT$86 per hour.

Billing stops only once the container is actually deleted; just closing the web page does not help.


TWCC : ln01.twcc.ai


TWCC : Jupyter


Download PuTTY, pscp, and psftp


Copy these files to C:\Windows\


Launch PuTTY from cmd


putty : ln01.twcc.ai


Connecting to TWCC

host account 🡺 password 🡺 MOTP one-time password


Adjust the window font size


Transferring files

DOS folder 🡺 Linux

User> pscp -r folder phlee0514@140.110.148.11:/work1/phlee0514

Linux folder 🡺 DOS

User> pscp -r phlee0514@140.110.148.11:/work1/phlee0514/folder .

Linux file 🡺 DOS

User> pscp phlee0514@140.110.148.11:/work1/phlee0514/folder/file .


(2) Tomas Mikolov

  • Distributed Representations of Words and Phrases and their Compositionality
  • Efficient Estimation of Word Representations in Vector Space
  • Learning Representations of Text using Neural Networks

(NIPS Deep Learning Workshop 2013)


NLP: Word2vec

Topics covered (slides and figures from T Mikolov, 2013):

  • Feedforward Neural Net Language Model
  • Efficient Learning
  • Skip-gram Architecture
  • Continuous Bag-of-words Architecture: CBOW
  • Efficient Learning - Summary
  • Linguistic Regularities in Word Vector Space
  • Linguistic Regularities - Results
  • Performance on Rare Words
  • Performance on Rare Words - Results
  • Rare Words - Examples of Nearest Neighbours
  • From Words to Phrases and Beyond
  • Compositionality by Vector Addition
  • Visualization of Regularities in Word Vector Space
  • Machine Translation
  • Machine Translation - English to Spanish
  • MT - Accuracy of English to Spanish translation

(3) Hands-on practice on TWCC

Practical Natural Language Processing (自然語言處理最佳實務):

1) https://github.com/practical-nlp/practical-nlp-code.git

Word2vec:

2) https://github.com/Alex-CHUN-YU/Word2vec.git


Hands-on work on TWCC

Side Project 1: the NLP book

Practical Natural Language Processing (practical-nlp-code-master)

Side Project 2: Chinese Word2vec (NLP 中文 Word2vec)


What is a perceptron?

W1: weight / W2: weight

b: bias / O: called the "neuron" or "node"

A perceptron receives multiple input signals, weights and combines them, and then outputs a single signal, as in the sketch below.
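A minimal sketch of such a two-input perceptron in plain Python; the weights w1, w2, the bias b, and the threshold at 0 are illustrative values, not taken from the slide:

def perceptron(x1, x2, w1=0.5, w2=0.5, b=-0.7):
    # Weight the two inputs, add the bias, and fire (output 1)
    # only if the combined signal is positive.
    signal = w1 * x1 + w2 * x2 + b
    return 1 if signal > 0 else 0

# With these example weights and bias the unit behaves like an AND gate.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", perceptron(x1, x2))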


Deep learning: built on multilayer perceptrons

Example: train an AI to recognize handwritten digits, and see how many parameters it has to learn. The input is a 28x28 image (28x28 = 784 pixels), the hidden layers use the ReLU activation function, and a final softmax layer converts the raw values into probabilities.
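A minimal Keras sketch of such a digit classifier, assuming TensorFlow is installed; the hidden-layer width and the number of epochs are illustrative choices, not values from the slide:

import tensorflow as tf

# MNIST: 28x28 grayscale digit images; scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 = 784 inputs
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer with ReLU
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 digit classes as probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))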


Step 1: Change environment

Change environment 🡺 a clean py36 environment

>>conda -V
>>conda create --name py36 python=3.6
>>conda activate py36
>>pip list
>>ls
>>pip list
>>conda install -c anaconda ipykernel
>>python -m ipykernel install --user --name=py36

# register the py36 kernel so it can be selected in Jupyter #


Step 2: Install the textbook packages

# set up the textbook environment 🡺 save it as the py36 environment

>>pip install scikit-learn==0.21.3

>>pip install matplotlib==3.2.2

>>pip install numpy==1.19.5

>>pip install pandas==1.1.5

>>pip install requests==2.23.0

>>pip install gensim==3.6.0

>> exit # (leave the session)


Step 3: run Jupyter

Change environment 🡺 the py36 environment

>> cd NLP_test/practical-nlp-code-master
>> ls
>>conda activate py36
>> cd Ch3 (then Ch4, and so on)

# use Jupyter with the py36 kernel to run all the cases


Step 4: run Jupyter


Side Project 2: Chinese Word2vec (NLP 中文 Word2vec)


NLP: Chinese Word2vec

  • Word2vec is based on unsupervised learning.
  • The larger the training set, the better.
  • The more comprehensive the corpus coverage, the better the resulting model tends to be.
  • It can still be garbage in, garbage out.
  • The dataset used here is fairly large, so please be patient with every step.

(ps: if word2vec treated every distinct word as its own dimension, a vocabulary of 4,000 words would mean 4,000-dimensional vectors; word2vec instead learns much lower-dimensional dense vectors, as the sketch below illustrates.)
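A small illustration of that dimensionality point; the 4,000-word vocabulary and the 250-dimension embedding size are example numbers only:

import numpy as np

vocab_size = 4000     # one dimension per distinct word in a one-hot encoding
embedding_dim = 250   # a typical word2vec vector size

# One-hot encoding: a 4000-dimensional sparse vector with a single 1.
word_index = 123                      # hypothetical index of some word
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0
print(one_hot.shape)                  # (4000,)

# Word2vec instead learns a dense, low-dimensional vector for the same word.
dense_vector = np.random.randn(embedding_dim)   # stand-in for a trained embedding
print(dense_vector.shape)             # (250,)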


Side Project 2: Chinese Word2vec

1. Download the wiki data (see the dataset notes) and clone the Word2vec repo:

>>git clone https://github.com/Alex-CHUN-YU/Word2vec.git

2. >>cd Word2Vec (enter the Word2Vec folder)

3. >>python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2 (convert the wiki XML dump into plain wiki text)

4. >>python segmentation.py (convert Simplified to Traditional Chinese, then segment the text into words while filtering stop words; the file is large, so this takes about 30 min; a sketch of this step follows below)

5. >>python train.py (train and produce the model, about 3 hours) # if line 17 raises an error, change the old parameter name size to vector_size

6. >>python main.py (use the model: type in words to query)

Note: if you hit encoding problems when running python under Windows cmd, first run: chcp 65001 (switch to UTF-8)
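A rough sketch of what the segmentation step (step 4 above) does, assuming jieba and hanziconv are installed; the stop-word list here is a tiny illustrative one, not the list the repo's segmentation.py actually uses:

import jieba
from hanziconv import HanziConv

stopwords = {"的", "了", "和", "是"}   # tiny illustrative stop-word list

def segment(text):
    # Convert to Traditional Chinese, cut the text into words with jieba,
    # then drop stop words and pure-whitespace tokens.
    text = HanziConv.toTraditional(text)
    words = jieba.cut(text, cut_all=False)
    return [w for w in words if w.strip() and w not in stopwords]

print(segment("自然语言处理是人工智能的一个领域"))
# prints the segmented Traditional-Chinese tokens for this sample sentence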


Side Project 2: Chinese Word2vec (commands)

Change environment 🡺 the py36 environment

>> pip install gensim
>>pip install jieba
>>pip install hanziconv
>>ls
>>cat README.md
>>wget --no-check-certificate https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
>> python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2
>> python segmentation.py
>> python train.py # (train and produce the model, about 3 hours) # if line 17 errors, change size to vector_size (old parameter name)
>> python main.py # (use the model: type in words to query)
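For reference, a sketch of what train.py roughly does with gensim 4.x, where the old size parameter is now called vector_size; the input file name and the hyper-parameters below are assumptions, not values read from the repo:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Assumed output of the segmentation step: one sentence per line, words space-separated.
sentences = LineSentence("wiki_seg.txt")

# In gensim >= 4.0 the parameter is vector_size; older gensim 3.x called it size.
model = Word2Vec(sentences,
                 vector_size=250,   # dimensionality of the word vectors
                 window=5,          # context window
                 min_count=5,       # ignore very rare words
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 workers=4)
model.save("word2vec.model")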


Chinese Word2vec word-analysis results

Output:

1. Entering one word returns the 5 most similar words.

2. Entering two words computes the similarity between them.

3. Entering three words performs an analogy: 爸爸 is to 老公 as 媽媽 is to 老婆.

Input format (Ex: 爸爸,媽媽,... note: at most three words)

>>老師

Top 5 most similar words:

班導,0.6360481977462769
班導師,0.6360464096069336
代課,0.6358826160430908
級任,0.6271134614944458
班主任,0.6270170211791992


NLP: Chinese Word2vec word analysis

Input format (Ex: 爸爸,媽媽,... note: at most three words)

>>爸爸,媽媽

Cosine similarity between the two words:

0.780765200371

Input format (Ex: 爸爸,媽媽,... note: at most three words)

>>爸爸,老公,媽媽

爸爸 is to 老公 as 媽媽 is to:

老婆,0.5401346683502197
蠢萌,0.5245970487594604
夠秤,0.5059393048286438
駁命,0.4888317286968231
孔爵,0.4857243597507477
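The three query modes above roughly correspond to these gensim calls (a sketch, assuming the trained model was saved as word2vec.model; that file name is an assumption):

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")

# 1. One word: the 5 most similar words.
print(model.wv.most_similar("老師", topn=5))

# 2. Two words: cosine similarity between them.
print(model.wv.similarity("爸爸", "媽媽"))

# 3. Three words: analogy, 爸爸 is to 老公 as 媽媽 is to ?
print(model.wv.most_similar(positive=["老公", "媽媽"], negative=["爸爸"], topn=5))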


What you learned today

  • How to connect to TWCC and NCHC.
  • How to set up the py36 environment.
  • How to use the book's code to enter the world of NLP.
  • Get to know the legendary Tomas Mikolov.
  • Jupyter and TWCC containers 🡺 great helpers.
  • Word2vec makes learning word representations more efficient.


Thanks for your attention

Lorenz Lab