By Juniper Johnson
This document provides detailed checklists and questions for consideration to guide the process of data preparation. Every corpus is different, yet there are some core features of textual data that can usefully be considered when preparing data for word embedding models.
Data preparation is ongoing and just as much a part of experimenting with word embedding models as training the models themselves. How you prepare your data directly impacts the kind of results you get when training a model. I have split the process of data preparation into five stages: 1) performing initial data exploration, 2) analyzing the data and identifying “noise”, 3) creating a data preparation plan, 4) cleaning or modifying the corpus, and 5) reflecting and repeating (as necessary). As with testing a model, preparing data is an evolving process; the more you do, the more you recognize what you could do next.

In outlining data preparation in this way, I hope to emphasize two key concepts. First, it is important to be realistic about time and computing power. It is tempting to make all of your changes at once, but this process is iterative for a reason. Second, the best thing that anyone can do in this process is simply take careful notes. Write down every step you take to prepare your data: if you are making big changes, you likely will not remember everything, and it is harder to retrace your steps than you think. While there can be a great deal of variability with word embedding models, having a data preparation plan with detailed notes will reduce some of the uncertainty about the data you are using to train your models.
A sample lab notebook for tracking this information can be found here: https://docs.google.com/document/d/1POFR8vAmUwb9BE-Z6mppAWH8gOC6D81ERca2Ua-veqs/edit?usp=sharing
In order to prepare a corpus, you have to understand what it contains. In this context, exploring and understanding a corpus can be conceptualized as creating a “profile” with basic and advanced kinds of information. Consider the following questions:
If you are working with textual data, there are a number of ways to collect and compile information for your corpus profile. Some user-friendly (and free) tools are Voyant Tools (https://voyant-tools.org/) and AntConc (https://www.laurenceanthony.net/software/antconc/). Voyant Tools is a web-based tool and does not require any installation. If you are unfamiliar with it, I recommend this tutorial and documentation for understanding what kinds of textual analysis it is capable of: https://voyant-tools.org/docs/#!/guide/tutorial. AntConc, unlike Voyant, needs to be downloaded and installed but also has a lot of documentation: https://www.laurenceanthony.net/software/antconc/releases/AntConc358/help.pdf.
With either of these tools, it is easy to survey a corpus for important structural and thematic elements using different functions: word frequency, concordance, collocations, word clusters, n-grams, sentence length, and vocabulary density. To gain an “aerial” view of your data, I also recommend using topic modeling. One easy-to-use program is DARIAH (https://dariah-de.github.io/TopicsExplorer/). While there are a lot of different perspectives on topic modeling as a form of textual analysis, it can be very useful for reading across a larger corpus for key themes, and thus can help you to better understand what is in your corpus as you plan for data preparation.
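As a complement to these tools, even a few lines of code can produce a quick frequency profile of a text. Here is a minimal Python sketch using a deliberately crude tokenization (Voyant and AntConc offer far more nuanced options, including stopword handling):

```python
import re
from collections import Counter

# A short stand-in passage; in practice you would read in corpus files
sample = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness.")

# Crude tokenization: lowercase everything and keep alphabetic runs only
tokens = re.findall(r"[a-z]+", sample.lower())

# Count tokens and show the most frequent ones
freq = Counter(tokens)
print(freq.most_common(4))
```

Even a rough count like this can reveal transcription artifacts or boilerplate that a close reading might miss.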
On the topic of reading, another useful way to understand a corpus—especially if it was not hand-curated or if you are unfamiliar with the file format, structure, and content—is simply to choose random files or sections and begin reading. Reading with data preparation in mind surfaces different features than reading strictly for content. Take a sample of the corpus by choosing random pages or files and read those with the above questions in mind. If you are using a text editor like Oxygen, BBEdit, or Atom, you can additionally read across a corpus by using simple features like “find all” and, for XML documents, you can explore your corpus with XPath. Using tools to aid in exploring a corpus has two positive effects: increased knowledge and improved navigability. The more time you spend navigating the corpus without an initial agenda, the easier it is to understand what it contains. Once you have all of this information, next comes analysis and creating an action plan.
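If your corpus lives in a folder of plain-text files, this kind of random sampling can also be scripted. A minimal Python sketch (the throwaway files created here stand in for a real corpus; in practice, point corpus_dir at your own collection):

```python
import random
import tempfile
from pathlib import Path

# For illustration only: create a tiny throwaway corpus of three files.
# In a real project, corpus_dir would be your existing corpus folder.
corpus_dir = Path(tempfile.mkdtemp())
samples = {
    "letter01.txt": "Dear Madam, I write to you concerning the late harvest...",
    "letter02.txt": "CHAPTER I.\nIt was a dark and stormy night...",
    "letter03.txt": "_Italics_ are often marked with underscores in transcriptions.",
}
for name, text in samples.items():
    (corpus_dir / name).write_text(text, encoding="utf-8")

# Choose a couple of files at random and print a short excerpt of each,
# reading with data preparation (not content) in mind
files = sorted(corpus_dir.glob("*.txt"))
for path in random.sample(files, k=min(2, len(files))):
    print(f"--- {path.name} ---")
    print(path.read_text(encoding="utf-8")[:200])
```

Skimming excerpts like these makes features such as transcription markup (underscores, bracketed page numbers, running headers) easier to spot.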
After your initial data exploration, the next step is to analyze your data to identify what features are “noise” and may impact your word embedding model. Depending on the type of data in your corpus, there are several different features to look for:
Examples of metadata include:
For corpora that contain data that is not already in plain text, an important consideration is how you will transform the text. If your data is TEI or any other form of XML, it is fairly straightforward to use XQuery to transform XML to plain text. Additionally, due to how features are tagged in XML documents, you can combine the cleaning and transformation into the same process, removing certain elements or setting the parameters to transform certain portions of the text. The WWP has several XQueries for this purpose on GitHub (https://github.com/NEU-DSG/wwp-public-code-share/tree/master/fulltext). These include an XQuery that can be used to transform non-TEI XML data to plain text (https://github.com/NEU-DSG/wwp-public-code-share/blob/master/fulltext/fulltext2table.non-tei.enmasse.xq).
We have created a tutorial for these XQuery documents, including instructions on how to configure transformation settings in Oxygen (https://docs.google.com/document/d/1hMtQH7bh90SqeyKyjJLU5tu0T0PNlTN3uiR5VEx7rP4/edit?usp=sharing). This document is still a work in progress, so if there are any problems, please let us know.
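The WWP XQueries are the most configurable route for XML data. If you prefer Python, the standard library's xml.etree module can serve for a rough first pass. This sketch (which assumes a made-up document with <note> elements you want to drop) illustrates the point above about combining cleaning and transformation in one step:

```python
import xml.etree.ElementTree as ET

# A small inline XML document standing in for a corpus file; in
# practice you would parse each file with ET.parse(path)
xml = """<doc>
  <head>Chapter I</head>
  <p>It was a dark and <hi rend="italic">stormy</hi> night.</p>
  <note>editorial note to drop</note>
</doc>"""

root = ET.fromstring(xml)

# Cleaning step: remove elements we do not want in the plain text
# (here, every <note>), before extracting the remaining text
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "note":
            parent.remove(child)

# Transformation step: itertext() walks the remaining text nodes in
# document order; splitting and rejoining normalizes the whitespace
plain_text = " ".join(" ".join(root.itertext()).split())
print(plain_text)
```

This approach ignores namespaces and TEI-specific structure, which the WWP XQueries handle properly, so treat it as a quick exploratory tool rather than a replacement.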
Another important aspect of data preparation is regularization. Popular forms of regularization are modernizing archaic spellings, expanding abbreviations, fixing OCR errors, and correcting misspelled words. If you are using a corpus that was already prepared, it is important to consider what steps have been taken (if any) to regularize the text. See if you can find other versions of your texts that are more or less regularized to use in comparison. Regularization can be quite time-intensive, but if you choose not to do it, you may find inconsistencies in your final word embedding model (for example, when the same word is spelled several different ways, each spelling is treated as a distinct token).
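If you maintain a table of variant spellings, regularization can be partly automated. A minimal Python sketch, using a small hypothetical regularization table (a real project would maintain a much longer, corpus-specific list):

```python
import re

# Hypothetical regularization table: archaic or variant spellings
# mapped to modernized forms
regularizations = {
    "shew": "show",
    "to-day": "today",
    "compleat": "complete",
}

# Compile one alternation so each variant is matched on word
# boundaries; IGNORECASE also catches capitalized forms
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, regularizations)) + r")\b",
    re.IGNORECASE,
)

def regularize(text):
    # Lowercase the match for lookup; this loses original casing,
    # which is usually acceptable for word embedding training
    return pattern.sub(lambda m: regularizations[m.group(0).lower()], text)

print(regularize("Compleat records shew that to-day differs."))
```

Keeping the table in a separate file (and under version control) also serves as documentation of exactly which regularizations were applied.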
Below are slides from the Word Vector Institute on data analysis and identifying what features might cause “noise” for your word embedding model:
After completing data analysis for your corpus, the next stage is to create a plan for how you want to modify your original corpus before model training. This plan can take many forms, but the information that it should contain is: a) what textual features you will be removing or changing, b) what documents from within your corpus will be modified for each change, and c) how you will make these changes. This last piece of information is particularly important to document, and it is likely to change as you start implementing your plan.
As with the earlier recommendation for taking notes about your corpus, taking notes about the changes you make throughout data preparation is essential, especially if you make a change that you may want to revert. Before you make any changes to your data, save (and clearly label) a completely unmodified copy. Whether or not you are using a form of version control for your corpus—either with tools like GitHub or by saving different copies on your own computer—documenting changes between corpora is essential. In the event that you want to go back to an earlier form of your corpus, having this documentation (both as a plan and in tracking any changes) is very helpful. For example, here is a brief outline of a generic data preparation plan from the first Word Vectors institute in summer 2019:
After creating your data preparation plan, the next step is to put this plan into action and modify your corpus. This process can take shape in a variety of ways, depending on many factors: your experience with different tools, your research question and issues of interest, and the makeup of your corpus (content, format, and quantity). Indeed, there are many different tools and tutorials for data manipulation that are great for all different skill levels. Here are some useful resources for data cleaning and preparation:
A free, open-source tool for data manipulation, originally developed at Google, that allows for exploration, transformation, data matching, and manipulation. Here are tutorials and documentation on how to use it:
Regular expressions are helpful for finding, modifying, or removing repetition in data by describing a sequence of characters in a text or dataset. Many text editors will have an option to use regular expressions along with the “find all” or “find and replace” features. While there are a few different notations for regular expressions (it is helpful to check which is used in your preferred text editor), here are some generally useful introductory and intermediate sources:
Here is a sample process that can be used to remove common features such as chapter titles:
The following are some regular expressions that can be useful in data preparation for humanities texts (pay attention to case sensitivity and spacing):
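As one illustration of such a process, the following Python sketch removes a common heading pattern—lines like “CHAPTER I.”—with a multiline regular expression. The exact pattern will vary with your corpus (note the case sensitivity of the character class), so test it on a sample before applying it everywhere:

```python
import re

# A short stand-in text with Project Gutenberg-style chapter headings
sample = """CHAPTER I.
It was a dark and stormy night.

CHAPTER II.
The rain fell in torrents."""

# Match a whole line consisting of "CHAPTER" plus a Roman numeral and
# optional period; re.MULTILINE makes ^ and $ match at line boundaries
chapter_heading = re.compile(r"^CHAPTER\s+[IVXLC]+\.?\s*$", re.MULTILINE)

cleaned = chapter_heading.sub("", sample)
print(cleaned)
```

Using a “find all” with the same pattern first lets you confirm that every match really is a heading before you delete anything.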
During the model training process, punctuation is largely removed, but _underscores_ are an exception. In any corpus with texts from Project Gutenberg or several other text transcription projects, underscores are frequently used to mark italics. We advise removing these underscores because the model treats an underscored word as distinct from its non-underscored version, even though they are the same from a human reading perspective. When used deliberately, however, this behavior can be quite useful: if there are words or phrases that you would like to treat as a single token in the model, joining them with an underscore (e.g., free_trade or queer_liberation) will let you explore phrases in a trained model.
Using this feature in your data manipulation stage is fairly easy. Using a “find and replace” feature, search for the words or phrases of interest and replace them with the same phrase with underscores in place of spaces. For more information, here is a helpful exploration by Kavita Ganesan: “How to incorporate phrases into Word2Vec-a text mining approach.”
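In a scripting context, the same replacement can be done programmatically. A minimal Python sketch using the example phrases mentioned above:

```python
# Phrases to treat as single tokens in the trained model
phrases = ["free trade", "queer liberation"]

def join_phrases(text, phrases):
    # Replace each multi-word phrase with an underscore-joined token
    # so the model sees it as one word
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

sample = "Debates over free trade shaped arguments for queer liberation."
print(join_phrases(sample, phrases))
```

A simple string replacement like this is case-sensitive and does not check word boundaries; for large phrase lists, a regular-expression version (or an automated phrase-detection approach like the one Ganesan describes) is more robust.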
The final stage of data preparation before training a word embedding model is simply to reflect on the process so far. Regardless of how much you document and plan, new issues with data preparation will arise as you are doing the work. Before you train your first (or twentieth) model, it is useful to think about how you have prepared the data for this step. Are there issues that you could not or did not address in this round of preparation? If so, what are these and why did you choose to leave them as is? Might you need to change them in the future?
Data preparation is iterative; it is tempting to try to make all the changes at once, but slowing the process down to observe, reflect on, and explore the data and the resulting word embedding model is an important step, especially at the beginning of a project. After training your first model, there will likely be new changes or ways of organizing your data that you will be interested in exploring. Documenting your preparation process and reflecting on it will help as you move forward to testing your model and exploring the effect of training parameters.