Writing Code for NLP Research
EMNLP 2018
{joelg,mattg,markn}@allenai.org
Who we are
Matt Gardner (@nlpmattg)
Matt is a research scientist on AllenNLP. He was the original architect of AllenNLP, and he co-hosts the NLP Highlights podcast.
Mark Neumann (@markneumannnn)
Mark is a research engineer on AllenNLP. He helped build AllenNLP and its precursor DeepQA with Matt, and has implemented many of the models in the demos.
Joel Grus (@joelgrus)
Joel is a research engineer on AllenNLP, although you may know him better from "I Don't Like Notebooks" or from "Fizz Buzz in Tensorflow" or from his book Data Science from Scratch.
Outline
BREAK
What we expect you know already
modern (neural) NLP
Python
the difference between good science and bad science
What you'll learn today
how to write code in a way that facilitates good science and reproducible experiments
how to write code in a way that makes your life easier
The Elephant in the Room: AllenNLP
AllenNLP
Two modes of writing research code
1: prototyping
2: writing components
Prototyping New Models
Main goals during prototyping
Writing code quickly - Use a framework!
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM,
len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
validation_losses = []
patience = 10
for epoch in range(1000):
training_loss = 0.0
validation_loss = 0.0
for dataset, training in [(training_data, True),
(validation_data, False)]:
correct = total = 0
torch.set_grad_enabled(training)
t = tqdm.tqdm(dataset)
for i, (sentence, tags) in enumerate(t):
model.zero_grad()
model.hidden = model.init_hidden()
sentence_in = prepare_sequence(sentence, word_to_ix)
targets = prepare_sequence(tags, tag_to_ix)
tag_scores = model(sentence_in)
loss = loss_function(tag_scores, targets)
predictions = tag_scores.max(-1)[1]
correct += (predictions == targets).sum().item()
total += len(targets)
accuracy = correct / total
if training:
loss.backward()
training_loss += loss.item()
t.set_postfix(training_loss=training_loss/(i + 1),
accuracy=accuracy)
optimizer.step()
else:
validation_loss += loss.item()
t.set_postfix(validation_loss=validation_loss/(i + 1),
accuracy=accuracy)
validation_losses.append(validation_loss)
if (patience and
len(validation_losses) >= patience and
validation_losses[-patience] ==
min(validation_losses[-patience:])):
print("patience reached, stopping early")
break
Writing code quickly - Get a good starting place
Add ELMo / BERT here
Writing code quickly - Copy first, refactor later
We’re prototyping! Just go fast and find something that works, then go back and refactor (if you made something useful)
Writing code quickly - Do use good code style
Meaningful names
Shape comments on tensors
Comments describing non-obvious logic
Write code for people, not machines
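For instance, a small sketch combining these conventions (the names and shapes here are illustrative, not from a real model):

def forward(self, tokens: torch.Tensor) -> torch.Tensor:
    # (batch_size, sequence_length, embedding_dim)
    embedded_tokens = self.embedder(tokens)
    # (batch_size, sequence_length, hidden_dim)
    encoded_tokens = self.encoder(embedded_tokens)
    # We only need a summary of the sequence, so take the final state:
    # (batch_size, hidden_dim)
    final_state = encoded_tokens[:, -1, :]
    return self.tag_projection(final_state)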
Writing code quickly - Minimal testing (but not no testing)
Make sure data processing works consistently, tensor operations run, and gradients are non-zero
Run on small test fixtures, so the debugging cycle is seconds, not minutes
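A minimal sketch of such a test, assuming an AllenNLP-style model whose forward pass returns a dict containing "loss" (the fixture attributes are hypothetical):

def test_gradients_are_nonzero(self):
    # run a forward and backward pass on a tiny fixture batch
    output_dict = self.model(**self.training_tensors)
    output_dict["loss"].backward()
    for name, parameter in self.model.named_parameters():
        # every parameter should receive a non-zero gradient
        assert parameter.grad is not None, name
        assert parameter.grad.abs().sum().item() > 0.0, name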
Writing code quickly - How much to hard-code?
I’m just prototyping! Why shouldn’t I just hard-code an embedding layer?
Why so abstract?
For the parts that aren’t your focus, start simple. Later you can add ELMo, etc., without rewriting your code.
This also makes controlled experiments easier (both for you and for people who come after you).
And it helps you think more clearly about the pieces of your model.
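A sketch of what this looks like in practice (the class and argument names are illustrative): take the pieces as constructor arguments instead of hard-coding them.

import torch

class MyTagger(torch.nn.Module):
    # Because the embedder and encoder are passed in rather than hard-coded,
    # swapping GloVe for ELMo, or an LSTM for a Transformer, requires no
    # changes to this class.
    def __init__(self,
                 embedder: torch.nn.Module,
                 encoder: torch.nn.Module,
                 encoder_output_dim: int,
                 num_tags: int) -> None:
        super().__init__()
        self.embedder = embedder
        self.encoder = encoder
        self.tag_projection = torch.nn.Linear(encoder_output_dim, num_tags)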
Main goals during prototyping
Running experiments - Keep track of what you ran
This is important!
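One low-tech way to do this (a sketch, not part of AllenNLP): dump the config and the exact git commit next to your results.

import json
import subprocess

def log_experiment(serialization_dir: str, params: dict) -> None:
    # record the exact code version and configuration used for this run
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip()
    with open(serialization_dir + "/experiment.json", "w") as f:
        json.dump({"git_commit": commit, "params": params}, f, indent=2)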
Running experiments - Controlled experiments
Important for putting your work in context
But… too many moving parts, hard to know what caused the difference
Very controlled experiments, varying one thing: we can make causal claims
How do you set up your code for this?
Possible ablations:
GloVe vs. character CNN vs. ELMo vs. BERT
LSTM vs. Transformer vs. GatedCNN vs. QRNN
Main goals during prototyping
Analyze results - Tensorboard
A good training loop will give you this for free, for any model.
Tensorboard will find optimisation bugs for you for free.
Here, the gradient for the embedding is 2 orders of magnitude different from the rest of the gradients.
Can anyone guess why?
Embeddings have sparse gradients (only some embeddings are updated), but the momentum coefficients from Adam are calculated for the whole embedding every time.
Solution:
from allennlp.training.optimizers import DenseSparseAdam
(uses sparse accumulators for gradient moments)
Analyze results - Look at your data!
How do you design your code for this?
We’ll say more later, but the key point is this:
Key point during prototyping:
The components that you use matter. A lot.
We’ll give specific thoughts on designing components after the break
Developing Good Processes
Source Control
We Hope You're Already Using Source Control!
makes it easy to safely experiment with code changes
That's right, code reviews!
About Code Reviews
Clear, readable code allows your code reviews to be discussions of your modeling decisions.
Continuous Integration
(+ Build Automation)
Continuous Integration
always be merging (into a branch)
Build Automation
always be running your tests (+ other checks)
(this means you have to write tests)
Example: Typical AllenNLP PR
if you're not building a library that lots of other people rely on, you probably don't need all these steps
but you do need some of them
Testing Your Code
What do we mean by "test your code"?
Write Unit Tests
a unit test is an automated check that a small part of your code works correctly
What should I test?
If You're Prototyping, Test the Basics
def test_read_from_file(self):
conll_reader = Conll2003DatasetReader()
instances = conll_reader.read('data/conll2003.txt')
instances = ensure_list(instances)
expected_labels = ['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']
fields = instances[0].fields
tokens = [t.text for t in fields['tokens'].tokens]
assert tokens == ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
assert fields["tags"].labels == expected_labels
fields = instances[1].fields
tokens = [t.text for t in fields['tokens'].tokens]
assert tokens == ['AI2', 'engineer', 'Joel', 'lives', 'in', 'Seattle', '.']
assert fields["tags"].labels == expected_labels
Prototyping? Test the Basics
def test_forward_pass_runs_correctly(self):
output_dict = self.model(**self.training_tensors)
tags = output_dict['tags']
assert len(tags) == 2
assert len(tags[0]) == 7
assert len(tags[1]) == 7
for example_tags in tags:
for tag_id in example_tags:
tag = idx_to_token[tag_id]
assert tag in {'O', 'I-ORG', 'I-PER', 'I-LOC'}
If You're Writing Reusable Components, Test Everything
test your model can train, save, and load
test that it's computing / backpropagating gradients
but how?
Use Test Fixtures
create tiny datasets that look like the real thing
The###DET dog###NN ate###V the###DET apple###NN
Everybody###NN read###V that###DET book###NN
use them to create tiny pretrained models
It’s ok if the weights are essentially random. We’re not testing that the model is any good.
Use your knowledge to write clever tests
def test_attention_is_normalised_correctly(self):
input_dim = 7
sequence_tensor = torch.randn([2, 5, input_dim])
# spans to extract (inclusive start/end indices); values chosen for this test
indices = torch.LongTensor([[[1, 3]], [[2, 4]]])
extractor = SelfAttentiveSpanExtractor(input_dim=input_dim)
# In order to test the attention, we'll make the weight which
# computes the logits zero, so the attention distribution is
# uniform over the sentence. This lets us check that the
# computed spans are just the averages of their representations.
extractor._global_attention._module.weight.data.fill_(0.0)
extractor._global_attention._module.bias.data.fill_(0.0)
span_representations = extractor(sequence_tensor, indices)
spans = span_representations[0]
mean_embeddings = sequence_tensor[0, 1:4, :].mean(0)
numpy.testing.assert_array_almost_equal(spans[0].data.numpy(),
mean_embeddings.data.numpy())
Attention is hard to test because it relies on parameters
Idea: Make the parameters deterministic so you can test everything else
Pre-Break Summary
BREAK
please fill out our survey:
https://tinyurl.com/emnlp-tutorial-survey
will tweet out link to slides after talk
@ai2_allennlp
Reusable Components
What are the right abstractions for NLP?
The Right Abstractions
Things That We Use A Lot
Things That Require a Fair Amount of Code
Things That Have Many Variations
Things that reflect our higher-level thinking
Along the way, we need to worry about some things that make NLP tricky
Inputs are text, but neural models want tensors
Inputs are sequences of things
and order matters
Inputs can vary in length
Some sentences are short.
Whereas other sentences are so long that by the time you finish reading them you've already forgotten what they started off talking about and you have to go back and read them a second time in order to remember the parts at the beginning.
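Concretely, variable-length inputs get padded to a common length, with a mask to remember which positions are real; a minimal sketch:

import torch

# token ids for two sentences of different lengths (values illustrative)
sentences = [[2, 5, 9], [4, 7, 1, 3, 8]]
max_len = max(len(sentence) for sentence in sentences)

padded = torch.zeros(len(sentences), max_len, dtype=torch.long)
mask = torch.zeros(len(sentences), max_len, dtype=torch.long)
for i, sentence in enumerate(sentences):
    padded[i, :len(sentence)] = torch.tensor(sentence)
    mask[i, :len(sentence)] = 1  # 1 = real token, 0 = padding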
Reusable Components in AllenNLP
AllenNLP is built on PyTorch
and is inspired by the question "what higher-level components would help NLP researchers do their research better + more easily?"
under the covers, every piece of a model is a torch.nn.Module and every number is part of a torch.Tensor
but we want you to be able to reason at a higher level most of the time
hence the higher level concepts
the Model
class Model(torch.nn.Module, Registrable):
def __init__(self,
vocab: Vocabulary,
regularizer: RegularizerApplicator = None) -> None: ...
def forward(self, *inputs) -> Dict[str, torch.Tensor]: ...
def get_metrics(self, reset: bool = False) -> Dict[str, float]: ...
@classmethod
def load(cls,
config: Params,
serialization_dir: str,
weights_file: str = None,
cuda_device: int = -1) -> 'Model': ...
Model.forward
def forward(self, *inputs) -> Dict[str, torch.Tensor]: ...
every NLP project needs a Vocabulary
class Vocabulary(Registrable):
def __init__(self,
counter: Dict[str, Dict[str, int]] = None,
min_count: Dict[str, int] = None,
max_vocab_size: Union[int, Dict[str, int]] = None,
non_padded_namespaces: Iterable[str] = DEFAULT_NON_PADDED_NAMESPACES,
pretrained_files: Optional[Dict[str, str]] = None,
only_include_pretrained_words: bool = False,
tokens_to_add: Dict[str, List[str]] = None,
min_pretrained_embeddings: Dict[str, int] = None) -> None: ...
@classmethod
def from_instances(cls, instances: Iterable['Instance'], ...) -> 'Vocabulary': ...
def add_token_to_namespace(self, token: str, namespace: str = 'tokens') -> int: ...
def get_token_index(self, token: str, namespace: str = 'tokens') -> int: ...
def get_token_from_index(self, index: int, namespace: str = 'tokens') -> str:
    return self._index_to_token[namespace][index]
def get_vocab_size(self, namespace: str = 'tokens') -> int:
    return len(self._token_to_index[namespace])
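A quick usage sketch (the token and the instances variable are illustrative):

vocab = Vocabulary.from_instances(train_instances)
index = vocab.get_token_index("dog")             # id in the default 'tokens' namespace
assert vocab.get_token_from_index(index) == "dog"
num_tags = vocab.get_vocab_size("labels")        # tags live in their own namespace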
a Vocabulary is built from Instances
class Instance(Mapping[str, Field]):
def __init__(self, fields: MutableMapping[str, Field]) -> None: ...
def add_field(self, field_name: str, field: Field, vocab: Vocabulary = None) -> None: ...
def count_vocab_items(self, counter: Dict[str, Dict[str, int]]): ...
def index_fields(self, vocab: Vocabulary) -> None: ...
def get_padding_lengths(self) -> Dict[str, Dict[str, int]]: ...
def as_tensor_dict(self,
padding_lengths: Dict[str, Dict[str, int]] = None) -> Dict[str, DataArray]: ...
an Instance is a collection of Fields
a Field contains a data element and knows how to turn it into a tensor
class Field(Generic[DataArray]):
def count_vocab_items(self, counter: Dict[str, Dict[str, int]]): ...
def index(self, vocab: Vocabulary): ...
def get_padding_lengths(self) -> Dict[str, int]: ...
def as_tensor(self, padding_lengths: Dict[str, int]) -> DataArray: ...
def empty_field(self) -> 'Field': ...
def batch_tensors(self, tensor_list: List[DataArray]) -> DataArray: ...
Many kinds of Fields
Example: an Instance for SNLI
def text_to_instance(self,
premise: str,
hypothesis: str,
label: str = None) -> Instance:
fields: Dict[str, Field] = {}
premise_tokens = self._tokenizer.tokenize(premise)
hypothesis_tokens = self._tokenizer.tokenize(hypothesis)
fields['premise'] = TextField(premise_tokens, self._token_indexers)
fields['hypothesis'] = TextField(hypothesis_tokens, self._token_indexers)
if label:
fields['label'] = LabelField(label)
metadata = {"premise_tokens": [x.text for x in premise_tokens],
"hypothesis_tokens": [x.text for x in hypothesis_tokens]}
fields["metadata"] = MetadataField(metadata)
return Instance(fields)
Example: an Instance for SQuAD
def make_reading_comprehension_instance(question_tokens: List[Token],
passage_tokens: List[Token],
token_indexers: Dict[str, TokenIndexer],
token_spans: List[Tuple[int, int]] = None) -> Instance:
fields: Dict[str, Field] = {}
passage_field = TextField(passage_tokens, token_indexers)
fields['passage'] = passage_field
fields['question'] = TextField(question_tokens, token_indexers)
if token_spans:
# There may be multiple answer annotations, so we pick the one that occurs the most.
candidate_answers: Counter = Counter()
for span_start, span_end in token_spans:
candidate_answers[(span_start, span_end)] += 1
span_start, span_end = candidate_answers.most_common(1)[0][0]
fields['span_start'] = IndexField(span_start, passage_field)
fields['span_end'] = IndexField(span_end, passage_field)
return Instance(fields)
What's a TokenIndexer?
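Roughly: a TokenIndexer decides how each token becomes array indices, and one TextField can carry several of them at once. A sketch using two indexers from the library:

token_indexers = {
    # one id per token, from the word vocabulary
    "tokens": SingleIdTokenIndexer(),
    # a sequence of character ids per token
    "token_characters": TokenCharactersIndexer(),
}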
And don't forget DatasetReader
Library also handles batching, via DataIterator
Tokenizer
back to the Model
Model is a subclass of torch.nn.Module
*usually on the first try it won't "just work", but usually that's your fault not PyTorch's
TokenEmbedder
Seq2VecEncoder
(batch_size, sequence_length, embedding_dim)
(batch_size, embedding_dim)
Seq2SeqEncoder
(batch_size, sequence_length, embedding_dim)
(batch_size, sequence_length, embedding_dim)
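A shape-level sketch of the two contracts, using the library's wrappers around torch.nn.LSTM (dimensions are illustrative):

import torch
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.seq2vec_encoders import PytorchSeq2VecWrapper

embeddings = torch.randn(2, 7, 50)  # (batch_size, sequence_length, embedding_dim)
mask = torch.ones(2, 7)

seq2seq = PytorchSeq2SeqWrapper(torch.nn.LSTM(50, 100, batch_first=True))
seq2vec = PytorchSeq2VecWrapper(torch.nn.LSTM(50, 100, batch_first=True))

sequence = seq2seq(embeddings, mask)  # (2, 7, 100): one vector per token
vector = seq2vec(embeddings, mask)    # (2, 100): one vector per sequence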
Wait, Two Different Abstractions for RNNs?
Attention
(batch_size, sequence_length, embedding_dim),
(batch_size, embedding_dim)
(batch_size, sequence_length)
MatrixAttention
(batch_size, sequence_length1, embedding_dim),
(batch_size, sequence_length2, embedding_dim)
(batch_size, sequence_length1, sequence_length2)
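For example, with the library's dot-product implementation of Attention (a sketch; shapes as above):

import torch
from allennlp.modules.attention import DotProductAttention

attention = DotProductAttention()
vector = torch.randn(2, 50)     # (batch_size, embedding_dim)
matrix = torch.randn(2, 7, 50)  # (batch_size, sequence_length, embedding_dim)

# (batch_size, sequence_length): a normalised distribution over the sequence
weights = attention(vector, matrix)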
Attention and MatrixAttention
SpanExtractor
inputs: embedded text (batch_size, sequence_length, embedding_dim) and span indices (batch_size, num_spans, 2)
output: embedded spans (batch_size, num_spans, embedding_dim)
This seems like a lot of abstractions!
Abstractions just to make your life nicer
Declarative syntax
"model": {
"type": "crf_tagger",
"label_encoding": "BIOUL",
"constrain_crf_decoding": true,
"calculate_span_f1": true,
"dropout": 0.5,
"include_start_end_transitions": false,
"text_field_embedder": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": 50,
"pretrained_file": "glove.6B.50d.txt.gz",
"trainable": true
},
"token_characters": {
"type": "character_encoding",
"embedding": {
"embedding_dim": 16
},
"encoder": {
"type": "cnn",
"embedding_dim": 16,
"num_filters": 128,
"ngram_filter_sizes": [3],
"conv_layer_activation": "relu"
}
}
},
},
"encoder": {
"type": "lstm",
"input_size": 50 + 128,
"hidden_size": 200,
"num_layers": 2,
"dropout": 0.5,
"bidirectional": true
},
},
most AllenNLP objects can be instantiated from Jsonnet blobs
Declarative syntax
"encoder": {
"type": "lstm",
"input_size": 50 + 128,
"hidden_size": 200,
"num_layers": 2,
"dropout": 0.5,
"bidirectional": true
},
"encoder": {
"type": "gru",
"input_size": 50 + 128,
"hidden_size": 200,
"num_layers": 1,
"dropout": 0.5,
"bidirectional": true
},
"encoder": {
"type": "pass_through",
"input_dim": 50 + 128
},
Declarative syntax
How does it work?
Registrable
class Model(torch.nn.Module, Registrable): ...
@Model.register("bidaf")
class BidirectionalAttentionFlow(Model): ...
@Model.register("decomposable_attention")
class DecomposableAttention(Model): ...
@Model.register("simple_tagger")
class SimpleTagger(Model):
model = Model.by_name("bidaf")(param1,
param2,
...)
returns the class itself
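The same mechanism works for your own models; a sketch (the name "my_tagger" is made up):

@Model.register("my_tagger")
class MyTagger(Model):
    ...

# now "type": "my_tagger" in a config file resolves to this class:
model_class = Model.by_name("my_tagger")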
Model config, again
"model": {
"type": "crf_tagger",
"label_encoding": "BIOUL",
"constrain_crf_decoding": true,
"calculate_span_f1": true,
"dropout": 0.5,
"include_start_end_transitions": false,
"text_field_embedder": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": 50,
"pretrained_file": "glove.6B.50d.txt.gz",
"trainable": true
},
"token_characters": {
"type": "character_encoding",
"embedding": {
"embedding_dim": 16
},
"encoder": {
"type": "cnn",
"embedding_dim": 16,
"num_filters": 128,
"ngram_filter_sizes": [3],
"conv_layer_activation": "relu"
}
}
},
},
"encoder": {
"type": "lstm",
"input_size": 50 + 128,
"hidden_size": 200,
"num_layers": 2,
"dropout": 0.5,
"bidirectional": true
},
},
from_params, originally
@Model.register("crf_tagger")
class CrfTagger(Model):
def __init__(
self,
vocab: Vocabulary,
text_field_embedder: TextFieldEmbedder,
encoder: Seq2SeqEncoder,
label_namespace: str = "labels",
constraint_type: str = None,
include_start_end_transitions: bool = True,
dropout: float = None,
initializer: InitializerApplicator = None,
regularizer: Optional[RegularizerApplicator] = None
) -> None:
...
@classmethod
def from_params(cls,
vocab: Vocabulary,
params: Params) -> 'CrfTagger':
embedder_params = params.pop("text_field_embedder")
text_field_embedder = TextFieldEmbedder.from_params(vocab,
embedder_params)
encoder = Seq2SeqEncoder.from_params(params.pop("encoder"))
label_namespace = params.pop("label_namespace", "labels")
constraint_type = params.pop("constraint_type", None)
dropout = params.pop("dropout", None)
include_start_end_transitions = \
params.pop("include_start_end_transitions", True)
initializer_params = params.pop('initializer', [])
initializer = InitializerApplicator.from_params(initializer_params)
regularizer_params = params.pop('regularizer', [])
regularizer = RegularizerApplicator.from_params(regularizer_params)
params.assert_empty(cls.__name__)
return cls(vocab=vocab,
text_field_embedder=text_field_embedder,
encoder=encoder,
label_namespace=label_namespace,
constraint_type=constraint_type,
dropout=dropout,
include_start_end_transitions=include_start_end_transitions,
initializer=initializer,
regularizer=regularizer)
from_params, now
class FromParams:
@classmethod
def from_params(cls: Type[T], params: Params, **extras) -> T:
from allennlp.common.registrable import Registrable # import here to avoid circular imports
if params is None: return None
registered_subclasses = Registrable._registry.get(cls)
if registered_subclasses is not None:
as_registrable = cast(Type[Registrable], cls)
default_to_first_choice = as_registrable.default_implementation is not None
choice = params.pop_choice("type",
choices=as_registrable.list_available(),
default_to_first_choice=default_to_first_choice)
subclass = registered_subclasses[choice]
if not takes_arg(subclass.from_params, 'extras'):
extras = {k: v for k, v in extras.items() if takes_arg(subclass.from_params, k)}
return subclass.from_params(params=params, **extras)
else:
if cls.__init__ == object.__init__:
kwargs: Dict[str, Any] = {}
else:
kwargs = create_kwargs(cls, params, **extras)
return cls(**kwargs) # type: ignore
from_params, now
def create_kwargs(cls: Type[T], params: Params, **extras) -> Dict[str, Any]:
"""
Given some class, a `Params` object, and potentially other keyword arguments,
create a dict of keyword args suitable for passing to the class's constructor.
The function does this by finding the class's constructor, matching the constructor
arguments to entries in the `params` object, and instantiating values for the parameters
using the type annotation and possibly a from_params method.
Any values that are provided in the `extras` will just be used as is.
For instance, you might provide an existing `Vocabulary` this way.
"""
...
Trainer
class Trainer(Registrable):
def __init__(
self,
model: Model,
optimizer: torch.optim.Optimizer,
iterator: DataIterator,
train_dataset: Iterable[Instance],
validation_dataset: Optional[Iterable[Instance]] = None,
patience: Optional[int] = None,
validation_metric: str = "-loss",
validation_iterator: DataIterator = None,
shuffle: bool = True,
num_epochs: int = 20,
serialization_dir: Optional[str] = None,
num_serialized_models_to_keep: int = 20,
keep_serialized_model_every_num_seconds: int = None,
model_save_interval: float = None,
cuda_device: Union[int, List] = -1,
grad_norm: Optional[float] = None,
grad_clipping: Optional[float] = None,
learning_rate_scheduler: LearningRateScheduler = None,
summary_interval: int = 100,
histogram_interval: int = None,
should_log_parameter_statistics: bool = True,
should_log_learning_rate: bool = False) -> None:
Model archives
Making Predictions
Predictor
@Predictor.register('sentence-tagger')
class SentenceTaggerPredictor(Predictor):
def __init__(self,
model: Model,
dataset_reader: DatasetReader) -> None:
super().__init__(model, dataset_reader)
self._tokenizer = SpacyWordSplitter(language='en_core_web_sm',
pos_tags=True)
def predict(self, sentence: str) -> JsonDict:
return self.predict_json({"sentence" : sentence})
@overrides
def _json_to_instance(self, json_dict: JsonDict) -> Instance:
sentence = json_dict["sentence"]
tokens = self._tokenizer.split_words(sentence)
return self._dataset_reader.text_to_instance(tokens)
this is (partly) why we split out text_to_instance as its own function in the dataset reader
and this is enabled by all of our models taking optional labels and returning an optional loss, along with various model internals and interesting results
Serving a demo
With this setup, serving a demo is easy.
We don't have it all figured out!
still figuring out some abstractions that we may not have correct
you can do all these things, but almost certainly not in the most optimal / generalizable way
Case study
"an LSTM for part-of-speech tagging"
(based on the official PyTorch tutorial)
The Problem
Given a training dataset that looks like
The###DET dog###NN ate###V the###DET apple###NN
Everybody###NN read###V that###DET book###NN
learn to predict part-of-speech tags
With a Few Enhancements to Make Things More Realistic
Start With a Simple Baseline Model
[Diagram: the words "The dog ate the apple" are mapped to word vectors by an embedding layer, encoded by an LSTM, and passed through a Linear layer to produce tag logits.]
v0: numpy (aka "this is why we use libraries")
class LSTM:
def __init__(self, input_size: int, hidden_size: int) -> None:
self.params = {
# forget gate
"w_f": np.random.randn(input_size, hidden_size)
"b_f": np.random.randn(hidden_size)
"u_f": np.random.randn(hidden_size, hidden_size)
# external input gate
"w_g": np.random.randn(input_size, hidden_size)
"b_g": np.random.randn(hidden_size)
"u_g": np.random.randn(hidden_size, hidden_size)
# output gate
"w_q": np.random.randn(input_size, hidden_size)
"b_q": np.random.randn(hidden_size)
"u_q": np.random.randn(hidden_size, hidden_size)
# usual params
"w": np.random.randn(input_size, hidden_size)
"b": np.random.randn(hidden_size)
"u": np.random.randn(hidden_size, hidden_size)
}
self.grads = {name: None for name in self.params}
v1: PyTorch
v1: PyTorch - Load Data
def load_data(file_path: str) -> List[Tuple[str, str]]:
"""
One sentence per line, formatted like
The###DET dog###NN ate###V the###DET apple###NN
Returns a list of pairs (tokenized_sentence, tags)
"""
data = []
with open(file_path) as f:
for line in f:
pairs = line.strip().split()
sentence, tags = zip(*(pair.split("###") for pair in pairs))
data.append((sentence, tags))
return data
seems reasonable
v1: PyTorch - Define Model
class LSTMTagger(nn.Module):
def __init__(self, embedding_dim: int, hidden_dim: int, vocab_size: int, tagset_size: int) -> None:
super().__init__()
self.hidden_dim = hidden_dim
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
# The LSTM takes word embeddings as inputs,
# and outputs hidden states with dimensionality hidden_dim.
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
# The linear layer that maps from hidden state space to tag space
self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
self.hidden = self.init_hidden()
def forward(self, sentence: torch.Tensor) -> torch.Tensor:
embeds = self.word_embeddings(sentence)
lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
tag_scores = F.log_softmax(tag_space, dim=1)
return tag_scores
much nicer than writing our own LSTM!
v1: PyTorch - Train Model
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM,
len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
validation_losses = []
patience = 10
for epoch in range(1000):
training_loss = 0.0
validation_loss = 0.0
for dataset, training in [(training_data, True),
(validation_data, False)]:
correct = total = 0
torch.set_grad_enabled(training)
t = tqdm.tqdm(dataset)
for i, (sentence, tags) in enumerate(t):
model.zero_grad()
model.hidden = model.init_hidden()
sentence_in = prepare_sequence(sentence, word_to_ix)
targets = prepare_sequence(tags, tag_to_ix)
tag_scores = model(sentence_in)
loss = loss_function(tag_scores, targets)
predictions = tag_scores.max(-1)[1]
correct += (predictions == targets).sum().item()
total += len(targets)
accuracy = correct / total
if training:
loss.backward()
training_loss += loss.item()
t.set_postfix(training_loss=training_loss/(i + 1),
accuracy=accuracy)
optimizer.step()
else:
validation_loss += loss.item()
t.set_postfix(validation_loss=validation_loss/(i + 1),
accuracy=accuracy)
validation_losses.append(validation_loss)
if (patience and
len(validation_losses) >= patience and
validation_losses[-patience] ==
min(validation_losses[-patience:])):
print("patience reached, stopping early")
break
this part is maybe less than ideal
v2: AllenNLP
(but without config files)
v2: AllenNLP - Dataset Reader
class PosDatasetReader(DatasetReader):
def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
super().__init__(lazy=False)
self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
sentence_field = TextField(tokens, self.token_indexers)
fields = {"sentence": sentence_field}
if tags:
label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
fields["labels"] = label_field
return Instance(fields)
def _read(self, file_path: str) -> Iterator[Instance]:
with open(file_path) as f:
for line in f:
pairs = line.strip().split()
sentence, tags = zip(*(pair.split("###") for pair in pairs))
yield self.text_to_instance([Token(word) for word in sentence], tags)
v2: AllenNLP - Model
class LstmTagger(Model):
def __init__(self, word_embeddings: TextFieldEmbedder, encoder: Seq2SeqEncoder, vocab: Vocabulary) -> None:
super().__init__(vocab)
self.word_embeddings = word_embeddings
self.encoder = encoder
self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
out_features=vocab.get_vocab_size('labels'))
self.accuracy = CategoricalAccuracy()
def forward(self, sentence: Dict[str, torch.Tensor], labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
mask = get_text_field_mask(sentence)
embeddings = self.word_embeddings(sentence)
encoder_out = self.encoder(embeddings, mask)
tag_logits = self.hidden2tag(encoder_out)
output = {"tag_logits": tag_logits}
if labels is not None:
self.accuracy(tag_logits, labels, mask)
output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)
return output
def get_metrics(self, reset: bool = False) -> Dict[str, float]:
return {"accuracy": self.accuracy.get_metric(reset)}
v2: AllenNLP - Training
reader = PosDatasetReader()
train_dataset = reader.read(cached_path('https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/training.txt'))
validation_dataset = reader.read(
cached_path('https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/validation.txt'))
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'), embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)
optimizer = optim.SGD(model.parameters(), lr=0.1)
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)
trainer = Trainer(model=model, optimizer=optimizer, iterator=iterator,
train_dataset=train_dataset, validation_dataset=validation_dataset,
patience=10, num_epochs=1000)
trainer.train()
this is where the config-driven approach would make our lives a lot easier
v3: AllenNLP + config
v3: AllenNLP - config
local embedding_dim = 6;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;
{
"train_data_path": "...",
"validation_data_path": "...",
"dataset_reader": { "type": "pos-tutorial" },
"model": {
"type": "lstm-tagger",
"word_embeddings": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": embedding_dim
}
}
},
"encoder": {
"type": "lstm",
"input_size": embedding_dim,
"hidden_size": hidden_dim
}
},
"iterator": {
"type": "bucket",
"batch_size": batch_size,
"sorting_keys": [["sentence", "num_tokens"]]
},
"trainer": {
"num_epochs": num_epochs,
"optimizer": {
"type": "sgd",
"lr": learning_rate
},
"patience": patience
}
}
params = Params.from_file('...')
serialization_dir = tempfile.mkdtemp()
model = train_model(params, serialization_dir)
Augmenting the Tagger with Character-Level Features
v1: PyTorch
class LSTMTagger(nn.Module):
def __init__(self, embedding_dim: int, hidden_dim: int,
vocab_size: int, tagset_size: int) -> None:
super().__init__()
self.hidden_dim = hidden_dim
self.word_embeddings = nn.Embedding(vocab_size,embedding_dim)
# The LSTM takes word embeddings as inputs,
# and outputs hidden states with dimensionality hidden_dim.
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
# Linear layer that maps from hidden state space to tag space
self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
self.hidden = self.init_hidden()
def forward(self, sentence: torch.Tensor) -> torch.Tensor:
embeds = self.word_embeddings(sentence)
lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
tag_scores = F.log_softmax(tag_space, dim=1)
return tag_scores
add char_embedding_dim
add char_embedding layer = embedding + LSTM?
change LSTM input dim
compute char embeddings
concatenate inputs
we really have to change our model code and how it works
I'm not really that thrilled to do this exercise
v2: AllenNLP
reader = PosDatasetReader()
# ...
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
# ...
token_embedding = Embedding(
num_embeddings=vocab.get_vocab_size('tokens'),
embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder(
{"tokens": token_embedding}
)
# ...
reader = PosDatasetReader(token_indexers={
"tokens": SingleIdTokenIndexer(),
"token_characters": TokenCharactersIndexer()
})
# ...
WORD_EMBEDDING_DIM = 5
CHAR_EMBEDDING_DIM = 3
EMBEDDING_DIM = WORD_EMBEDDING_DIM + CHAR_EMBEDDING_DIM
HIDDEN_DIM = 6
# ...
token_embedding = Embedding(
num_embeddings=vocab.get_vocab_size('tokens'),
embedding_dim=WORD_EMBEDDING_DIM)
char_embedding = TokenCharactersEncoder(
    embedding=Embedding(
        num_embeddings=vocab.get_vocab_size('token_characters'),
        embedding_dim=CHAR_EMBEDDING_DIM),
    encoder=PytorchSeq2VecWrapper(
        torch.nn.LSTM(CHAR_EMBEDDING_DIM, CHAR_EMBEDDING_DIM,
                      batch_first=True)))
word_embeddings = BasicTextFieldEmbedder({
"tokens": token_embedding,
"token_characters": char_embedding})
# ...
add a second token indexer
add an extra parameter
add a character embedder
use the character embedder
no changes to the model itself!
v3: AllenNLP - config
local embedding_dim = 6;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;
{
"train_data_path": "...",
"validation_data_path": "...",
"dataset_reader": { "type": "pos-tutorial" },
"model": {
"type": "lstm-tagger",
"word_embeddings": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": embedding_dim
}
}
},
"encoder": {
"type": "lstm",
"input_size": embedding_dim,
"hidden_size": hidden_dim
}
},
we can accomplish this with just a couple of minimal config changes
v3: AllenNLP - config
local embedding_dim = 6;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;
local word_embedding_dim = 5;
local char_embedding_dim = 3;
local embedding_dim = word_embedding_dim + char_embedding_dim;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;
add a couple of new Jsonnet variables
v3: AllenNLP - config
"dataset_reader": { "type": "pos-tutorial" }
"dataset_reader": {
"type": "pos-tutorial",
"token_indexers": {
"tokens": { "type": "single_id" },
"token_characters": { "type": "characters" }
}
}
add a second token indexer
v3: AllenNLP - config
"model": {
"type": "lstm-tagger",
"word_embeddings": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": embedding_dim
}
}
},
"encoder": {
"type": "lstm",
"input_size": embedding_dim,
"hidden_size": hidden_dim
}
}
"model": {
"type": "lstm-tagger",
"word_embeddings": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": word_embedding_dim
},
"token_characters": {
"type": "character_encoding",
"embedding": {
"embedding_dim": char_embedding_dim,
},
"encoder": {
"type": "lstm",
"input_size": char_embedding_dim,
"hidden_size": char_embedding_dim
}
}
},
},
"encoder": {
"type": "lstm",
"input_size": embedding_dim,
"hidden_size": hidden_dim
}
}
add a corresponding token embedder
For a one-time change this is maybe not such a big win.
But being able to experiment with lots of architectures without having to change any code (and with a reproducible JSON description of each experiment) is a huge boon to research! (we think)
Sharing Your Research
How to make it easy to release your code
In the least amount of time possible:
Simplify your workflow for installation and data
Make your code run anywhere*
Isolated environments for your project
Docker
Objective: You don’t feel like this about Docker
What does Docker Do?
3 Ideas: Dockerfiles, Images and Containers
Step 1: Write a Dockerfile
Here is a finished Dockerfile.
How does this work?
Step 1: Write a Dockerfile
COMMAND <command>
Dockerfile commands are capitalised. Some important ones are:
FROM, RUN, ENV, COPY and CMD
FROM python:3.6.3-jessie
FROM includes another Dockerfile in your one. Here we start from a base Python Dockerfile.
RUN pip install -r requirements.txt
RUN … runs a command. To use a command, it must be installed in a previous step!
ENV LANG=C.UTF-8
ENV sets an environment variable which can be used inside the container.
COPY my_research/ my_research/
COPY copies code from your current folder into the Docker image. Do yourself a favour: don't change the names of things during this step.
CMD ["/bin/bash"]
CMD ["python", "my/script.py"]
CMD is what gets run when you run a built image.
Step 1: Write a Dockerfile
Here is a finished Dockerfile.
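Putting the commands above together, a sketch of what that finished Dockerfile might look like (the paths are illustrative):

FROM python:3.6.3-jessie

ENV LANG=C.UTF-8

COPY requirements.txt .
RUN pip install -r requirements.txt

# code changes most often, so copy it last (see the caching tip below)
COPY my_research/ my_research/

CMD ["/bin/bash"]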
Step 2: Build your Dockerfile into an Image
docker build --tag <name> .
<name> is what you want the image to be called, e.g. markn/my-paper-code.
The final . describes where docker should look for a Dockerfile. It can also be a URL.
You can see what images you have built already by running docker images
If you’ve already built a line of your dockerfile before, Docker will remember and not build it again (so long as things before it haven’t changed).
TIP: Put things that change more frequently (like your code) lower down in your Dockerfile.
Step 3: Run your Image as a Container
docker run <image-name>
docker run -it <image-name>
-i: interactive
-t: tty (with a terminal)
docker run -it <image-name> /bin/bash
Passing /bin/bash after the image name gives you a command prompt inside any docker container, regardless of the CMD in the Dockerfile.
Optional Step 4: DockerHub
docker push <image-name>
DockerHub is to Docker as GitHub is to git
Docker automatically looks at DockerHub to find Docker images to run
Pros of Docker
Cons of Docker
Releasing your data
Use a simple file cache
There are currently 27 CoreNLP Jar files you could download from the CoreNLP website
embedding_file = cached_path("embedding_url")
datasets = cached_path("dataset_url")
But now I have to write a file cache ....
from file_cache import cached_path
embeddings = cached_path(url)
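If you don't want a dependency, a minimal sketch of such a cache (hypothetical; AllenNLP ships a more robust cached_path):

import hashlib
import os
import urllib.request

CACHE_DIR = os.path.expanduser("~/.my_project_cache")

def cached_path(url: str) -> str:
    # download on first use, then reuse the local copy
    os.makedirs(CACHE_DIR, exist_ok=True)
    filename = hashlib.md5(url.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, filename)
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path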
Isolated (Python) environments
Python environments
Stable environments for Python can be tricky
This makes releasing code very annoying
Docker is ideal, but not great for developing locally. For this, you should either use virtualenvs or anaconda.
Here we will talk about anaconda, because it’s what we use.
Anaconda is a very stable distribution of Python (amongst other things). Installing it is easy.
One annoying install step - adding where you installed it to the front of your PATH environment variable:
export PATH="/path/to/anaconda/bin:$PATH"
Now, your default python should be an anaconda one (you did install Python 3.6+, didn’t you?).
Virtual environments
Every time you start a new project, make a new virtual environment which has only its dependencies in.
conda create -n your-env-name python=3.6
Before you work on your project, run this command. This prepends the location of this particular copy of Python to your PATH.
source activate your-env-name
pip install -r requirements.txt
etc
When you’re done, or you want to work on a different project, run:
source deactivate
In Conclusion
Thanks for Coming!
Questions?
please fill out our survey:
https://tinyurl.com/emnlp-tutorial-survey
will tweet out link to slides after talk
@ai2_allennlp