LO 7.2.3.B

Learning Objective: Explain how to treat stop words and unknown words during training.

Review:

Stop words, very frequent words like ‘the’ and ‘a’, are usually removed from the ‘bag of words’. This can be done by sorting the vocabulary by frequency in the training set and defining the top 10–100 vocabulary entries as stop words, or alternatively by using one of the many predefined stop word lists available online. Every instance of these stop words is then simply removed from both the training and the test documents.
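
The frequency-based approach described above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function names (top_k_stop_words, remove_stop_words), the toy documents, and the cutoff k=2 are all assumptions made for the example, and documents are assumed to be already tokenized into lists of lowercase words. On a real corpus the cutoff would be somewhere in the 10–100 range mentioned above.

```python
from collections import Counter


def top_k_stop_words(train_docs, k):
    """Return the k most frequent tokens in the training set as a set."""
    counts = Counter(tok for doc in train_docs for tok in doc)
    return {word for word, _ in counts.most_common(k)}


def remove_stop_words(doc, stop_words):
    """Drop every occurrence of a stop word, keeping the other tokens in order."""
    return [tok for tok in doc if tok not in stop_words]


# Toy, pre-tokenized documents (hypothetical data for illustration only).
train_docs = [
    ["the", "movie", "was", "a", "delight"],
    ["the", "plot", "was", "a", "mess"],
    ["a", "fine", "cast", "saved", "the", "movie"],
]
test_doc = ["a", "delight", "for", "the", "cast"]

# The stop word list is derived from the training set only ...
stop_words = top_k_stop_words(train_docs, k=2)

# ... and then applied to both the training and the test documents.
cleaned_train = [remove_stop_words(doc, stop_words) for doc in train_docs]
cleaned_test = remove_stop_words(test_doc, stop_words)
```

Note that the stop word list is computed from the training set alone and then reused unchanged on the test document, so the test data never influences which words are dropped.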

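Unknown words, that is, words that appear in the test data but not in the training vocabulary, are simply ignored: they are removed from the test document, and no probability is included for them.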
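
As a rough sketch of this step, the snippet below builds a vocabulary from the training documents and filters a test document against it. The helper names (build_vocabulary, drop_unknown_words) and the toy data are hypothetical, chosen only for illustration, and documents are again assumed to be pre-tokenized lists of words.

```python
def build_vocabulary(train_docs):
    """Collect every distinct token seen in the training documents."""
    return {tok for doc in train_docs for tok in doc}


def drop_unknown_words(test_doc, vocabulary):
    """Keep only the test tokens that were seen during training."""
    return [tok for tok in test_doc if tok in vocabulary]


# Toy, pre-tokenized documents (hypothetical data for illustration only).
train_docs = [
    ["great", "plot", "and", "cast"],
    ["dull", "plot", "and", "pacing"],
]
test_doc = ["great", "cinematography", "and", "pacing"]

vocabulary = build_vocabulary(train_docs)

# "cinematography" never occurs in training, so it is dropped from the test document.
known_tokens = drop_unknown_words(test_doc, vocabulary)
```

Here the unseen word is discarded before classification, so it contributes nothing to the score of any class.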