1 of 12

Sentiment Analysis for Event-Driven Stock Prediction

Code review by Kim Hoyeob

2 of 12

Sentiment Analysis for Event-Driven Stock Prediction

[Github] https://github.com/WayneDW/Sentiment-Analysis-in-Event-Driven-Stock-Price-Movement-Prediction

[Papers]

Convolutional Neural Networks for Sentence Classification

Global Vectors for Word Representation

On the Importance of Text Analysis for Stock Price Prediction

Deep Learning for Event-Driven Stock Prediction

3 of 12

Brief code review

Data Collection

Crawl news and stock prices

Apply GloVe to train dense word vectors from the Reuters corpus in NLTK

Build a word-word co-occurrence matrix and factorize it

Feature Engineering

Preprocess text and extract features.

Train a ConvNet to predict stock price movement with a reasonable choice of parameters
The result shows a significant 1-2% improvement on the test set

4 of 12

Get stock list : crawler_allTickers.py

Load ticker information for the NASDAQ, NYSE, and AMEX markets
Crawl the ticker information from www.nasdaq.com/screening/… using the urllib2 library
Save it to tickerList.csv

1. Data Collection

5 of 12

Get news data : crawler_reuters.py

Build a crawler with BeautifulSoup and urllib2
Construct the URL from the ticker and the date, then receive the news in the response
http://www.reuters.com/finance/stocks/companyNews?symbol=AMZN&date=09202017
Save ticker, date, headline, body, and news type to news_reuters.csv
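
The URL rule above can be sketched as follows. This is a minimal illustration, not the code in crawler_reuters.py: the function name build_news_url is hypothetical, the MMDDYYYY date layout is read off the example URL, and it uses Python 3's urllib.parse rather than urllib2.

```python
from datetime import date
from urllib.parse import urlencode

# Base path taken from the example URL on this slide.
BASE_URL = "http://www.reuters.com/finance/stocks/companyNews"


def build_news_url(ticker: str, day: date) -> str:
    """Return the company-news URL for one ticker on one day.

    The date is formatted as MMDDYYYY, matching the example URL.
    """
    query = urlencode({"symbol": ticker, "date": day.strftime("%m%d%Y")})
    return f"{BASE_URL}?{query}"


print(build_news_url("AMZN", date(2017, 9, 20)))
```

Looping this over every ticker in tickerList.csv and every date of interest gives the full set of pages to request.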

1. Data Collection

6 of 12

Get stock price : crawler_yahoo_finance.py

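Using urllib2, fetch open, high, low, close, volume, and adjusted close for each ticker from the Yahoo Finance website
Save as stockPrices_raw.json
create_label.py computes 1-, 7-, and 28-day returns from the adjusted close data
Save as stockReturns.json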
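
The return calculation can be sketched as below. This is an assumption-laden illustration rather than the logic of create_label.py: the function name forward_returns is hypothetical, the returns are taken to be simple forward returns, and each horizon counts rows in the price series (trading days), which may differ from how the repository counts days.

```python
def forward_returns(adj_close, horizons=(1, 7, 28)):
    """Compute simple forward returns for each horizon.

    adj_close: adjusted closing prices ordered from oldest to newest.
    Returns a dict mapping each horizon k to a list whose entry t is
    (p[t+k] - p[t]) / p[t], or None when t+k falls past the end.
    """
    n = len(adj_close)
    returns = {}
    for k in horizons:
        returns[k] = [
            (adj_close[t + k] - adj_close[t]) / adj_close[t] if t + k < n else None
            for t in range(n)
        ]
    return returns


prices = [100.0, 110.0, 121.0]
print(forward_returns(prices, horizons=(1, 2)))
```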

1. Data Collection

7 of 12

Make word matrix : embeddingWord.py

Word embedding algorithm : GloVe
Input word data : NLTK’s Reuters corpus
Output

sentences.json, word2idx.json
cc_matrix.npy, glove_model_50.npz

Understanding GloVe
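
For reference, the objective that GloVe minimizes, as given in the GloVe paper, is written out below. Here X_{ij} is the co-occurrence count of words i and j, w and \tilde{w} are the word and context vectors, and b and \tilde{b} are their biases. The symbols x_max and alpha presumably correspond to the xmax and alpha arguments of fit in embeddingWord.py; that mapping is an assumption, since the slides do not spell it out.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function f damps the influence of rare co-occurrences and caps the influence of very frequent ones.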

2. Word Embedding

8 of 12

Make word matrix : embeddingWord.py

get_reuters_data(n_vocab)

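Load the Reuters corpus from the NLTK corpora
Assign an index to each unique word and count word frequencies in the corpus (sort by descending frequency, then drop words outside the top n_vocab)
Using the word indices, convert the corpus from the form [‘ASIAN’, ‘EXPORT’, ‘...’, …] to the form [4,667,4,...]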
Output

sentences.json (sentences encoded as word indices), word2idx.json (word -> index mapping)
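
The indexing steps above can be sketched as follows. The function name build_vocab is hypothetical, and the sketch drops out-of-vocabulary words as the slide describes; the actual get_reuters_data may handle them differently.

```python
from collections import Counter


def build_vocab(sentences, n_vocab):
    """Index the n_vocab most frequent words and encode the sentences.

    sentences: list of token lists, e.g. [["ASIAN", "EXPORT"], ...].
    Returns (encoded_sentences, word2idx); words outside the top
    n_vocab are removed from the encoded sentences.
    """
    counts = Counter(word for sent in sentences for word in sent)
    # most_common sorts by descending frequency, so index 0 is the most frequent word.
    word2idx = {word: i for i, (word, _) in enumerate(counts.most_common(n_vocab))}
    encoded = [[word2idx[w] for w in sent if w in word2idx] for sent in sentences]
    return encoded, word2idx


encoded, word2idx = build_vocab([["a", "b", "a"], ["a", "c", "b"]], n_vocab=2)
print(encoded, word2idx)
```

In the real pipeline the token lists would come from the NLTK Reuters corpus, and the two return values would be written to sentences.json and word2idx.json.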

2. Word Embedding

9 of 12

Make word matrix : embeddingWord.py

fit(self, sentences, cc_matrix=None, learning_rate=10e-5, reg=0.1, xmax=100, alpha=0.75, epochs=10, gd=False, use_theano=True)

If the co-occurrence matrix does not exist, build it and keep it both on disk and in memory (cc_matrix.npy)
Train the word embedding (50 dimensions) by gradient descent (using Theano), then save it (glove_model_50.npz)
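
The first step can be sketched in NumPy as below. This is not the code of fit: the function names, the context_size parameter, and the 1/distance weighting of co-occurrences are assumptions borrowed from the GloVe paper, while xmax=100 and alpha=0.75 are the defaults shown in the fit signature.

```python
import numpy as np


def build_cc_matrix(sentences, n_vocab, context_size=2):
    """Build a symmetric word-word co-occurrence matrix.

    sentences: lists of word indices. Each pair of words at most
    context_size positions apart contributes 1/distance to the count.
    """
    X = np.zeros((n_vocab, n_vocab))
    for sent in sentences:
        for i, wi in enumerate(sent):
            for j in range(max(0, i - context_size), i):
                wj = sent[j]
                weight = 1.0 / (i - j)
                X[wi, wj] += weight
                X[wj, wi] += weight
    return X


def glove_weights(X, xmax=100, alpha=0.75):
    """GloVe weighting: (x / xmax) ** alpha below xmax, otherwise 1."""
    fX = np.ones_like(X, dtype=float)
    mask = X < xmax
    fX[mask] = (X[mask] / xmax) ** alpha
    return fX


X = build_cc_matrix([[0, 1, 2]], n_vocab=3)
print(X)
print(glove_weights(X))
```

Caching X with np.save and reloading it with np.load would match the cc_matrix.npy behaviour described above, since the matrix is the expensive part to recompute.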

2. Word Embedding

10 of 12

Generate Feature : genFeatureMat.py

gen_FeatureMatrix(wordEmbedding, word2idx, priceDt, max_words=60, mtype="test"):

Build the feature matrix from the word embedding, price data, and Reuters news
Split into test and train sets and save them as featureMatrix_test.csv and featureMatrix_train.csv
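
One plausible way to turn a single news item into a fixed-length feature row is sketched below. The function name news_to_feature is hypothetical, and skipping unknown words, zero padding, and flattening are assumptions; only max_words=60 comes from the gen_FeatureMatrix signature.

```python
import numpy as np


def news_to_feature(tokens, word_embedding, word2idx, max_words=60):
    """Map a tokenized news item to a flat vector of length max_words * dim.

    Words missing from word2idx are skipped, items longer than max_words
    are truncated, and shorter items are padded with zero vectors.
    """
    dim = word_embedding.shape[1]
    feature = np.zeros((max_words, dim))
    row = 0
    for token in tokens:
        if row == max_words:
            break
        idx = word2idx.get(token)
        if idx is None:
            continue
        feature[row] = word_embedding[idx]
        row += 1
    return feature.ravel()


embedding = np.arange(6, dtype=float).reshape(3, 2)  # toy 3-word, 2-dim embedding
vocab = {"a": 0, "b": 1, "c": 2}
print(news_to_feature(["b", "zzz", "c"], embedding, vocab, max_words=4))
```

Stacking one such row per news item, together with the return label for the matching ticker and date, would give a matrix of the kind saved to the two CSV files.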

3. Feature Engineering

11 of 12

Train CNN model : model_cnn.py

Sentence classification with a CNN (reference material)
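
The core idea of CNN sentence classification is a convolution over windows of consecutive word vectors, followed by max-over-time pooling and a softmax layer. The NumPy forward pass below is a minimal sketch of that idea with a single filter width and random weights; it is not the architecture in model_cnn.py, and all names and shapes are illustrative assumptions.

```python
import numpy as np


def cnn_forward(x, filters, bias, W_out, b_out):
    """Forward pass of a one-layer sentence CNN.

    x:       (n_words, dim) matrix of word vectors for one text
    filters: (n_filters, h, dim) convolution filters spanning h words
    bias:    (n_filters,) filter biases
    W_out:   (n_filters, n_classes) output weights
    b_out:   (n_classes,) output biases
    Returns a vector of class probabilities.
    """
    n_words = x.shape[0]
    h = filters.shape[1]
    # Every window of h consecutive words: (n_windows, h, dim).
    windows = np.stack([x[i:i + h] for i in range(n_words - h + 1)])
    # Dot each window with each filter: (n_windows, n_filters).
    conv = np.tensordot(windows, filters, axes=([1, 2], [1, 2])) + bias
    conv = np.maximum(conv, 0.0)  # ReLU
    pooled = conv.max(axis=0)  # max-over-time pooling: (n_filters,)
    logits = pooled @ W_out + b_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
x = rng.normal(size=(60, 50))  # 60 words, 50-dim embeddings
probs = cnn_forward(
    x,
    filters=rng.normal(size=(4, 3, 50)),
    bias=np.zeros(4),
    W_out=rng.normal(size=(4, 2)),
    b_out=np.zeros(2),
)
print(probs)
```

The sentence-classification paper listed on slide 2 combines several filter widths; the single width here keeps the sketch short.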

4. Train ConvNet

12 of 12

Prediction : model_cnn.py

Significant improvement: 1~2 percentage points more accurate than a random pick (?)
How about applying this project to market microstructure rather than daily events?

5. Stock Prediction