1 of 254

Writing Code for NLP Research

EMNLP 2018

{joelg,mattg,markn}@allenai.org

2 of 254

Who we are

Matt Gardner (@nlpmattg)

Matt is a research scientist on AllenNLP. He was the original architect of AllenNLP, and he co-hosts the NLP Highlights podcast.

Mark Neumann (@markneumannnn)

Mark is a research engineer on AllenNLP. He helped build AllenNLP and its precursor DeepQA with Matt, and has implemented many of the models in the demos.

Joel Grus (@joelgrus)

Joel is a research engineer on AllenNLP, although you may know him better from "I Don't Like Notebooks" or from "Fizz Buzz in Tensorflow" or from his book Data Science from Scratch.

3 of 254

Outline

  • How to write code when prototyping
  • Developing good processes

BREAK

  • How to write reusable code for NLP
  • Case Study: A Part-of-Speech Tagger
  • Sharing Your Research

4 of 254

What we expect you know already

5 of 254

What we expect you know already

modern (neural) NLP

6 of 254

What we expect you know already

Python

7 of 254

What we expect you know already

the difference between good science and bad science

8 of 254

What you'll learn today

9 of 254

What you'll learn today

how to write code in a way that facilitates good science and reproducible experiments

10 of 254

What you'll learn today

how to write code in a way that makes your life easier

11 of 254

The Elephant in the Room: AllenNLP

  • This is not a tutorial about AllenNLP
  • But (obviously, seeing as we wrote it) AllenNLP represents our experiences and opinions about how best to write research code
  • Accordingly, we'll use it in most of our examples
  • And we hope you'll come out of this tutorial wanting to give it a try
  • But our goal is that you find the tutorial useful even if you never use AllenNLP

AllenNLP

12 of 254

Two modes of writing research code

13 of 254

1: prototyping

2: writing components

14 of 254

Prototyping New Models

15 of 254

Main goals during prototyping

  • Write code quickly
  • Run experiments, keep track of what you tried
  • Analyze model behavior - did it do what you wanted?

16 of 254

Main goals during prototyping

  • Write code quickly
  • Run experiments, keep track of what you tried
  • Analyze model behavior - did it do what you wanted?

17 of 254

Writing code quickly - Use a framework!

18 of 254

Writing code quickly - Use a framework!

  • Training loop?

19 of 254

Writing code quickly - Use a framework!

  • Training loop?

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM,
                   len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

validation_losses = []
patience = 10

for epoch in range(1000):
    training_loss = 0.0
    validation_loss = 0.0
    for dataset, training in [(training_data, True),
                              (validation_data, False)]:
        correct = total = 0
        torch.set_grad_enabled(training)
        t = tqdm.tqdm(dataset)
        for i, (sentence, tags) in enumerate(t):
            model.zero_grad()
            model.hidden = model.init_hidden()
            sentence_in = prepare_sequence(sentence, word_to_ix)
            targets = prepare_sequence(tags, tag_to_ix)
            tag_scores = model(sentence_in)
            loss = loss_function(tag_scores, targets)
            predictions = tag_scores.max(-1)[1]
            correct += (predictions == targets).sum().item()
            total += len(targets)
            accuracy = correct / total
            if training:
                loss.backward()
                training_loss += loss.item()
                t.set_postfix(training_loss=training_loss/(i + 1),
                              accuracy=accuracy)
                optimizer.step()
            else:
                validation_loss += loss.item()
                t.set_postfix(validation_loss=validation_loss/(i + 1),
                              accuracy=accuracy)
    validation_losses.append(validation_loss)
    if (patience and
            len(validation_losses) >= patience and
            validation_losses[-patience] ==
            min(validation_losses[-patience:])):
        print("patience reached, stopping early")
        break

20 of 254

Writing code quickly - Use a framework!

  • Tensorboard logging?
  • Model checkpointing?
  • Complex data processing, with smart batching?
  • Computing span representations?
  • Bi-directional attention matrices?

  • Easily thousands of lines of code!

21 of 254

Writing code quickly - Use a framework!

  • Don’t start from scratch! Use someone else’s components.

22 of 254

Writing code quickly - Use a framework!

  • But...

23 of 254

Writing code quickly - Use a framework!

  • But...
  • Make sure you can bypass the abstractions when you need to

24 of 254

Writing code quickly - Get a good starting place

25 of 254

Writing code quickly - Get a good starting place

  • First step: get a baseline running

  • This is good research practice, too

26 of 254

Writing code quickly - Get a good starting place

  • Could be someone else’s code... as long as you can read it

27 of 254

Writing code quickly - Get a good starting place

  • Could be someone else’s code... as long as you can read it

28 of 254

Writing code quickly - Get a good starting place

  • Even better if this code already modularizes what you want to change

Add ELMo / BERT here

29 of 254

Writing code quickly - Get a good starting place

  • Re-implementing a SOTA baseline is incredibly helpful for understanding what’s going on, and where some decisions might have been made better

30 of 254

Writing code quickly - Copy first, refactor later

  • CS degree:

31 of 254

Writing code quickly - Copy first, refactor later

  • CS degree:

32 of 254

Writing code quickly - Copy first, refactor later

  • CS degree:

We’re prototyping! Just go fast and find something that works, then go back and refactor (if you made something useful)

33 of 254

Writing code quickly - Copy first, refactor later

  • Really bad idea: using inheritance to share code for related models
  • Instead: just copy the code, figure out how to share later, if it makes sense

34 of 254

Writing code quickly - Do use good code style

  • CS degree:

35 of 254

Writing code quickly - Do use good code style

  • CS degree:

36 of 254

Writing code quickly - Do use good code style

37 of 254

Writing code quickly - Do use good code style

38 of 254

Writing code quickly - Do use good code style

39 of 254

Writing code quickly - Do use good code style

Meaningful names

40 of 254

Writing code quickly - Do use good code style

Shape comments on tensors
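For example (a made-up snippet; the names and shapes are just for illustration):

# (batch_size, sequence_length, embedding_dim)
embedded_text = self.text_field_embedder(tokens)
# (batch_size, sequence_length, 2 * hidden_dim)
encoded_text = self.encoder(embedded_text, mask)
# (batch_size, 2 * hidden_dim)
pooled = encoded_text.max(dim=1)[0]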

41 of 254

Writing code quickly - Do use good code style

Comments describing non-obvious logic

42 of 254

Writing code quickly - Do use good code style

Write code for people, not machines

43 of 254

Writing code quickly - Minimal testing (but not no testing)

  • CS degree:

44 of 254

Writing code quickly - Minimal testing (but not no testing)

  • CS degree:

45 of 254

Writing code quickly - Minimal testing (but not no testing)

  • A test that checks experimental behavior is a waste of time

46 of 254

Writing code quickly - Minimal testing (but not no testing)

  • But, some parts of your code aren’t experimental

47 of 254

Writing code quickly - Minimal testing (but not no testing)

  • And even experimental parts can have useful tests

48 of 254

Writing code quickly - Minimal testing (but not no testing)

  • And even experimental parts can have useful tests

Make sure data processing works consistently, that tensor operations run, and that gradients are non-zero
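For example, a minimal "gradients are non-zero" check might look like this (a sketch assuming a model that returns a dict with a "loss" entry, as AllenNLP models do):

def test_gradients_flow(self):
    output = self.model(**self.training_tensors)
    output["loss"].backward()
    # every trainable parameter should have received some gradient
    for name, parameter in self.model.named_parameters():
        if parameter.requires_grad:
            assert parameter.grad is not None, name
            assert parameter.grad.abs().sum().item() > 0.0, name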

49 of 254

Writing code quickly - Minimal testing (but not no testing)

  • And even experimental parts can have useful tests

Run on small test fixtures, so debugging cycle is seconds, not minutes

50 of 254

Writing code quickly - How much to hard-code?

  • Which one should I do?

51 of 254

Writing code quickly - How much to hard-code?

  • Which one should I do?

I’m just prototyping! Why shouldn’t I just hard-code an embedding layer?

52 of 254

Writing code quickly - How much to hard-code?

  • Which one should I do?

Why so abstract?

53 of 254

Writing code quickly - How much to hard-code?

  • Which one should I do?

For the parts you aren't focusing on, start simple; later you can add ELMo, etc., without rewriting your code.
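A minimal sketch of what "abstract, not hard-coded" means here (class and argument names are invented for illustration):

class MyTagger(torch.nn.Module):
    def __init__(self,
                 embedder: torch.nn.Module,   # nn.Embedding today; ELMo / BERT later
                 encoder: torch.nn.Module) -> None:
        super().__init__()
        self.embedder = embedder
        self.encoder = encoder

Swapping the embedding is now a change at the construction site (or in a config file), not a rewrite of the model.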

54 of 254

Writing code quickly - How much to hard-code?

  • Which one should I do?

This also makes controlled experiments easier (both for you and for people who come after you).

55 of 254

Writing code quickly - How much to hard-code?

  • Which one should I do?

And it helps you think more clearly about the pieces of your model.

56 of 254

Main goals during prototyping

  • Write code quickly
  • Run experiments, keep track of what you tried
  • Analyze model behavior - did it do what you wanted?

57 of 254

Running experiments - Keep track of what you ran

  • You run a lot of stuff when you're prototyping; it can be hard to keep track of what happened when, and with what code

58 of 254

Running experiments - Keep track of what you ran

59 of 254

Running experiments - Keep track of what you ran

This is important!

60 of 254

Running experiments - Keep track of what you ran

  • Currently in invite-only alpha; public beta coming soon
  • https://github.com/allenai/beaker
  • https://beaker-pub.allenai.org

61 of 254

Running experiments - Keep track of what you ran

62 of 254

Running experiments - Keep track of what you ran

63 of 254

Running experiments - Keep track of what you ran

64 of 254

Running experiments - Controlled experiments

  • Which one gives more understanding?

65 of 254

Running experiments - Controlled experiments

  • Which one gives more understanding?

Important for putting your work in context

66 of 254

Running experiments - Controlled experiments

  • Which one gives more understanding?

But… too many moving parts, hard to know what caused the difference

67 of 254

Running experiments - Controlled experiments

  • Which one gives more understanding?

Very controlled experiments, varying one thing: we can make causal claims

68 of 254

Running experiments - Controlled experiments

  • Which one gives more understanding?

How do you set up your code for this?

69 of 254

Running experiments - Controlled experiments

70 of 254

Running experiments - Controlled experiments

Possible ablations

71 of 254

Running experiments - Controlled experiments

GloVe vs. character CNN vs. ELMo vs. BERT

72 of 254

Running experiments - Controlled experiments

LSTM vs. Transformer vs. GatedCNN vs. QRNN

73 of 254

Running experiments - Controlled experiments

  • Not good: modifying code to run different variants; hard to keep track of what you ran
  • Better: configuration files, or separate scripts, or something similar (see the sketch below)
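Even something as simple as a tiny argparse wrapper (a hypothetical sketch, not AllenNLP code) beats editing the model file between runs, because the exact variant you ran survives in your shell history and logs:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--embedding", choices=["glove", "char_cnn", "elmo"], default="glove")
parser.add_argument("--encoder", choices=["lstm", "transformer"], default="lstm")
args = parser.parse_args()
# build the model from args, and record args alongside every result file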

74 of 254

Main goals during prototyping

  • Write code quickly
  • Run experiments, keep track of what you tried
  • Analyze model behavior - did it do what you wanted?

75 of 254

Analyze results - Tensorboard

  • Crucial tool for understanding model behavior during training
  • There is no better visualizer. If you don’t use this, start now.

76 of 254

Analyze results - Tensorboard

  • Crucial tool for understanding model behavior during training
  • There is no better visualizer. If you don’t use this, start now.

A good training loop will give you this for free, for any model.

77 of 254

Analyze results - Tensorboard

  • Metrics
    • Loss
    • Accuracy etc.
  • Gradients
    • Mean values
    • Std values
    • Actual update values
  • Parameters
    • Mean values
    • Std values
  • Activations
    • Log problematic activations
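A hand-rolled sketch of logging a few of the quantities above (using the tensorboardX package; a good training loop, like AllenNLP's, does the equivalent for you automatically):

from tensorboardX import SummaryWriter

writer = SummaryWriter("logs/run1")
writer.add_scalar("loss/train", loss.item(), global_step)
for name, param in model.named_parameters():
    writer.add_histogram("parameters/" + name, param, global_step)
    if param.grad is not None:
        writer.add_histogram("gradients/" + name, param.grad, global_step)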

78 of 254

Analyze results - Tensorboard

Tensorboard will find optimisation bugs for you for free.

Here, the gradient for the embedding is 2 orders of magnitude different from the rest of the gradients.

79 of 254

Analyze results - Tensorboard

Tensorboard will find optimisation bugs for you for free.

Here, the gradient for the embedding is 2 orders of magnitude different from the rest of the gradients.

Can anyone guess why?

80 of 254

Analyze results - Tensorboard

Tensorboard will find optimisation bugs for you for free.

Here, the gradient for the embedding is 2 orders of magnitude different from the rest of the gradients.

Embeddings have sparse gradients (only some embeddings are updated), but the momentum coefficients for Adam are calculated for the whole embedding every time.

Solution:

from allennlp.training.optimizers import DenseSparseAdam

(uses sparse accumulators for gradient moments)
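If you're training from a config, you'd select it in the trainer section; we believe the registered name is "dense_sparse_adam" (treat this snippet as a sketch):

"trainer": {
    "optimizer": {
        "type": "dense_sparse_adam"
    }
}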

81 of 254

Analyze results - Look at your data!

  • Good:

82 of 254

Analyze results - Look at your data!

  • Better:

83 of 254

Analyze results - Look at your data!

  • Better:

84 of 254

Analyze results - Look at your data!

  • Best:

85 of 254

Analyze results - Look at your data!

  • Best:

How do you design your code for this?

86 of 254

Analyze results - Look at your data!

  • Best:

How do you design your code for this?

We'll say more later, but the key points are (sketched below):

  • Separate data processing that also works on JSON
  • Model needs to run without labels / computing loss
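Concretely, making labels optional in forward is what lets the same model both train and serve predictions (a sketch in the style of the models shown later; _compute_logits and _loss are hypothetical helpers):

def forward(self,
            tokens: Dict[str, torch.Tensor],
            labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
    logits = self._compute_logits(tokens)
    output = {"logits": logits}
    if labels is not None:
        # loss is only computed (and returned) when labels exist
        output["loss"] = self._loss(logits, labels)
    return output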

87 of 254

Key point during prototyping:

The components that you use matter. A lot.

88 of 254

We’ll give specific thoughts on designing components after the break

89 of 254

Developing Good Processes

90 of 254

Source Control

91 of 254

We Hope You're Already Using Source Control!

makes it easy to safely experiment with code changes

    • if things go wrong, just revert!

92 of 254

We Hope You're Already Using Source Control!

  • makes it easy to collaborate

93 of 254

We Hope You're Already Using Source Control!

  • makes it easy to revisit older versions of your code

94 of 254

We Hope You're Already Using Source Control!

  • makes it easy to implement code reviews

95 of 254

That's right, code reviews!

96 of 254

About Code Reviews

  • code reviewers find mistakes

97 of 254

About Code Reviews

  • code reviewers point out improvements

98 of 254

About Code Reviews

  • code reviewers force you to make your code readable

99 of 254

About Code Reviews

and clear, readable code allows your code reviews to be discussions of your modeling decisions

100 of 254

About Code Reviews

  • code reviewers can be your scapegoat when it turns out your results are wrong because of a bug

101 of 254

Continuous Integration

(+ Build Automation)

102 of 254

Continuous Integration (+ Build Automation)

Continuous Integration

always be merging (into a branch)

Build Automation

always be running your tests (+ other checks)

(this means you have to write tests)

103 of 254

Example: Typical AllenNLP PR

104 of 254

105 of 254

if you're not building a library that lots of other people rely on, you probably don't need all these steps

106 of 254

but you do need some of them

107 of 254

Testing Your Code

108 of 254

What do we mean by "test your code"?

109 of 254

Write Unit Tests

a unit test is an automated check that a small part of your code works correctly

110 of 254

What should I test?

111 of 254

If You're Prototyping, Test the Basics

112 of 254

Prototyping? Test the Basics

def test_read_from_file(self):
    conll_reader = Conll2003DatasetReader()
    instances = conll_reader.read('data/conll2003.txt')
    instances = ensure_list(instances)

    expected_labels = ['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']

    fields = instances[0].fields
    tokens = [t.text for t in fields['tokens'].tokens]
    assert tokens == ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
    assert fields["tags"].labels == expected_labels

    fields = instances[1].fields
    tokens = [t.text for t in fields['tokens'].tokens]
    assert tokens == ['AI2', 'engineer', 'Joel', 'lives', 'in', 'Seattle', '.']
    assert fields["tags"].labels == expected_labels

113 of 254

Prototyping? Test the Basics

def test_forward_pass_runs_correctly(self):
    output_dict = self.model(**self.training_tensors)
    tags = output_dict['tags']
    assert len(tags) == 2
    assert len(tags[0]) == 7
    assert len(tags[1]) == 7
    for example_tags in tags:
        for tag_id in example_tags:
            tag = idx_to_token[tag_id]
            assert tag in {'O', 'I-ORG', 'I-PER', 'I-LOC'}

114 of 254

If You're Writing Reusable Components, Test Everything

115 of 254

Test Everything

test your model can train, save, and load

116 of 254

Test Everything

test that it's computing / backpropagating gradients

117 of 254

Test Everything

but how?

118 of 254

Use Test Fixtures

create tiny datasets that look like the real thing

The###DET dog###NN ate###V the###DET apple###NN

Everybody###NN read###V that###DET book###NN

119 of 254

Use Test Fixtures

use them to create tiny pretrained models

It’s ok if the weights are essentially random. We’re not testing that the model is any good.
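One way to produce such a fixture (a sketch; the paths are placeholders): train on the tiny fixture dataset with a tiny config, and check the resulting archive into your repo:

params = Params.from_file("tests/fixtures/tagger/experiment.json")
train_model(params, "tests/fixtures/tagger/serialization")
# the resulting model.tar.gz is a few KB; its weights being garbage is fine,
# because we're testing the plumbing, not the model quality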

120 of 254

Use Test Fixtures

  • write unit tests that use them to run your data pipelines and models
    • detect logic errors
    • detect malformed outputs
    • detect incorrect outputs

121 of 254

Use your knowledge to write clever tests

def test_attention_is_normalised_correctly(self):
    input_dim = 7
    sequence_tensor = torch.randn([2, 5, input_dim])
    extractor = SelfAttentiveSpanExtractor(input_dim=input_dim)
    # [not shown on the slide] inclusive span indices for each batch element,
    # e.g. the span (1, 3) for the first sequence:
    indices = torch.LongTensor([[[1, 3]], [[2, 4]]])
    # In order to test the attention, we'll make the weight which
    # computes the logits zero, so the attention distribution is
    # uniform over the sentence. This lets us check that the
    # computed spans are just the averages of their representations.
    extractor._global_attention._module.weight.data.fill_(0.0)
    extractor._global_attention._module.bias.data.fill_(0.0)
    span_representations = extractor(sequence_tensor, indices)
    spans = span_representations[0]
    mean_embeddings = sequence_tensor[0, 1:4, :].mean(0)
    numpy.testing.assert_array_almost_equal(spans[0].data.numpy(),
                                            mean_embeddings.data.numpy())

Attention is hard to test because it relies on parameters

122 of 254

Use your knowledge to write clever tests

def test_attention_is_normalised_correctly(self):
    input_dim = 7
    sequence_tensor = torch.randn([2, 5, input_dim])
    extractor = SelfAttentiveSpanExtractor(input_dim=input_dim)
    # [not shown on the slide] inclusive span indices for each batch element,
    # e.g. the span (1, 3) for the first sequence:
    indices = torch.LongTensor([[[1, 3]], [[2, 4]]])
    # In order to test the attention, we'll make the weight which
    # computes the logits zero, so the attention distribution is
    # uniform over the sentence. This lets us check that the
    # computed spans are just the averages of their representations.
    extractor._global_attention._module.weight.data.fill_(0.0)
    extractor._global_attention._module.bias.data.fill_(0.0)
    span_representations = extractor(sequence_tensor, indices)
    spans = span_representations[0]
    mean_embeddings = sequence_tensor[0, 1:4, :].mean(0)
    numpy.testing.assert_array_almost_equal(spans[0].data.numpy(),
                                            mean_embeddings.data.numpy())

Idea: Make the parameters deterministic so you can test everything else

123 of 254

Pre-Break Summary

  • Two Modes of Writing Research Code
    • Difference between prototyping and building components
    • When should you transition?
    • Good ways to analyse results
  • Developing Good Processes
    • How to write good tests
    • How to know what to test
    • Why you should do code reviews

124 of 254

BREAK

please fill out our survey:

https://tinyurl.com/emnlp-tutorial-survey

will tweet out link to slides after talk

@ai2_allennlp

125 of 254

Reusable Components

126 of 254

What are the right abstractions for NLP?

127 of 254

The Right Abstractions

  • AllenNLP now has more than 20 models in it
    • some simple
    • some complex
  • Some abstractions have consistently proven useful
  • (Some haven't)

128 of 254

Things That We Use A Lot

  • training a model
  • mapping words (or characters, or labels) to indexes
  • summarizing a sequence of tensors with a single tensor

129 of 254

Things That Require a Fair Amount of Code

  • training a model
  • (some ways of) summarizing a sequence of tensors with a single tensor
  • some neural network modules

130 of 254

Things That Have Many Variations

  • turning a word (or a character, or a label) into a tensor
  • summarizing a sequence of tensors with a single tensor
  • transforming a sequence of tensors into a sequence of tensors

131 of 254

Things that reflect our higher-level thinking

  • we'll have some inputs:
    • text, almost certainly
    • tags/labels, often
    • spans, sometimes
  • we need some ways of embedding them as tensors
    • one hot encoding
    • low-dimensional embeddings
  • we need some ways of dealing with sequences of tensors
    • sequence in -> sequence out (e.g. all outputs of an LSTM)
    • sequence in -> tensor out (e.g. last output of an LSTM)

132 of 254

Along the way, we need to worry about some things that make NLP tricky

133 of 254

Inputs are text, but neural models want tensors

134 of 254

Inputs are sequences of things

and order matters

135 of 254

Inputs can vary in length

Some sentences are short.

Whereas other sentences are so long that by the time you finish reading them you've already forgotten what they started off talking about and you have to go back and read them a second time in order to remember the parts at the beginning.

136 of 254

Reusable Components in AllenNLP

137 of 254

AllenNLP is built on PyTorch

138 of 254

AllenNLP is built on PyTorch

and is inspired by the question "what higher-level components would help NLP researchers do their research better + more easily?"

139 of 254

AllenNLP is built on PyTorch

under the covers, every piece of a model is a torch.nn.Module and every number is part of a torch.Tensor

140 of 254

AllenNLP is built on PyTorch

but we want you to be able to reason at a higher level most of the time

141 of 254

hence the higher level concepts

142 of 254

the Model

class Model(torch.nn.Module, Registrable):
    def __init__(self,
                 vocab: Vocabulary,
                 regularizer: RegularizerApplicator = None) -> None: ...

    def forward(self, *inputs) -> Dict[str, torch.Tensor]: ...

    def get_metrics(self, reset: bool = False) -> Dict[str, float]: ...

    @classmethod
    def load(cls,
             config: Params,
             serialization_dir: str,
             weights_file: str = None,
             cuda_device: int = -1) -> 'Model': ...

143 of 254

Model.forward

def forward(self, *inputs) -> Dict[str, torch.Tensor]: ...

  • returns a dict [!]
  • by convention, "loss" tensor is what the training loop will optimize
  • but as a dict entry, "loss" is completely optional
    • which is good, since at inference / prediction time you don't have one
  • can also return predictions, model internals, or any other outputs you'd want in an output dataset or a demo

144 of 254

every NLP project needs a Vocabulary

class Vocabulary(Registrable):
    def __init__(self,
                 counter: Dict[str, Dict[str, int]] = None,
                 min_count: Dict[str, int] = None,
                 max_vocab_size: Union[int, Dict[str, int]] = None,
                 non_padded_namespaces: Iterable[str] = DEFAULT_NON_PADDED_NAMESPACES,
                 pretrained_files: Optional[Dict[str, str]] = None,
                 only_include_pretrained_words: bool = False,
                 tokens_to_add: Dict[str, List[str]] = None,
                 min_pretrained_embeddings: Dict[str, int] = None) -> None: ...

    @classmethod
    def from_instances(cls, instances: Iterable['Instance'], ...) -> 'Vocabulary': ...

    def add_token_to_namespace(self, token: str, namespace: str = 'tokens') -> int: ...

    def get_token_index(self, token: str, namespace: str = 'tokens') -> int: ...

    def get_token_from_index(self, index: int, namespace: str = 'tokens') -> str:
        return self._index_to_token[namespace][index]

    def get_vocab_size(self, namespace: str = 'tokens') -> int:
        return len(self._token_to_index[namespace])

145 of 254

a Vocabulary is built from Instances

class Instance(Mapping[str, Field]):
    def __init__(self, fields: MutableMapping[str, Field]) -> None: ...

    def add_field(self, field_name: str, field: Field, vocab: Vocabulary = None) -> None: ...

    def count_vocab_items(self, counter: Dict[str, Dict[str, int]]): ...

    def index_fields(self, vocab: Vocabulary) -> None: ...

    def get_padding_lengths(self) -> Dict[str, Dict[str, int]]: ...

    def as_tensor_dict(self,
                       padding_lengths: Dict[str, Dict[str, int]] = None) -> Dict[str, DataArray]: ...

146 of 254

an Instance is a collection of Fields

a Field contains a data element and knows how to turn it into a tensor

class Field(Generic[DataArray]):
    def count_vocab_items(self, counter: Dict[str, Dict[str, int]]): ...

    def index(self, vocab: Vocabulary): ...

    def get_padding_lengths(self) -> Dict[str, int]: ...

    def as_tensor(self, padding_lengths: Dict[str, int]) -> DataArray: ...

    def empty_field(self) -> 'Field': ...

    def batch_tensors(self, tensor_list: List[DataArray]) -> DataArray: ...

147 of 254

Many kinds of Fields

  • TextField: represents a sentence, or a paragraph, or a question, or ...
  • LabelField: represents a single label (e.g. "entailment" or "sentiment")
  • SequenceLabelField: represents the labels for a sequence (e.g. part-of-speech tags)
  • SpanField: represents a span (start, end)
  • IndexField: represents a single integer index
  • ListField[T]: for repeated fields
  • MetadataField: represents anything (but not tensorizable)

148 of 254

Example: an Instance for SNLI

def text_to_instance(self,
                     premise: str,
                     hypothesis: str,
                     label: str = None) -> Instance:
    fields: Dict[str, Field] = {}
    premise_tokens = self._tokenizer.tokenize(premise)
    hypothesis_tokens = self._tokenizer.tokenize(hypothesis)
    fields['premise'] = TextField(premise_tokens, self._token_indexers)
    fields['hypothesis'] = TextField(hypothesis_tokens, self._token_indexers)
    if label:
        fields['label'] = LabelField(label)
    metadata = {"premise_tokens": [x.text for x in premise_tokens],
                "hypothesis_tokens": [x.text for x in hypothesis_tokens]}
    fields["metadata"] = MetadataField(metadata)
    return Instance(fields)

149 of 254

Example: an Instance for SQuAD

def make_reading_comprehension_instance(question_tokens: List[Token],
                                        passage_tokens: List[Token],
                                        token_indexers: Dict[str, TokenIndexer],
                                        token_spans: List[Tuple[int, int]] = None) -> Instance:
    fields: Dict[str, Field] = {}
    # named so the IndexFields below can refer to it
    passage_field = TextField(passage_tokens, token_indexers)
    fields['passage'] = passage_field
    fields['question'] = TextField(question_tokens, token_indexers)
    if token_spans:
        # There may be multiple answer annotations, so we pick the one that occurs the most.
        candidate_answers: Counter = Counter()
        for span_start, span_end in token_spans:
            candidate_answers[(span_start, span_end)] += 1
        span_start, span_end = candidate_answers.most_common(1)[0][0]
        fields['span_start'] = IndexField(span_start, passage_field)
        fields['span_end'] = IndexField(span_end, passage_field)
    return Instance(fields)

150 of 254

What's a TokenIndexer?

  • how to represent text in our model is one of the fundamental decisions in doing NLP
  • many ways, but pretty much always want to turn text into indices
  • many choices
    • sequence of unique token_ids (or id for OOV) from a vocabulary
    • sequence of sequence of character_ids
    • sequence of ids representing byte-pairs / word pieces
    • sequence of pos_tag_ids
  • might want to use several
  • this is (deliberately) independent of the choice about how to embed these as tensors

151 of 254

And don't forget DatasetReader

  • "given a path [usually but not necessarily to a file], produce Instances"
  • decouples your modeling code from your data-on-disk format
  • two pieces:
    • text_to_instance: creates an instance from named inputs ("passage", "question", "label", etc..)
    • read: parses data from a file and (typically) hands it to text_to_instance
  • new dataset -> create a new DatasetReader (not too much code), but keep the model as-is
  • same dataset, new model -> just re-use the DatasetReader
  • default is to read all instances into memory, but base class handles laziness if you want it

152 of 254

Library also handles batching, via DataIterator

  • BasicIterator just shuffles (optionally) and produces fixed-size batches
  • BucketIterator groups together instances with similar "length" to minimize padding (typical usage sketched below)
  • (Correctly padding and sorting instances that contain a variety of fields is slightly tricky; a lot of the API here is designed around getting this right)
  • Maybe someday we'll have a working AdaptiveIterator that creates variable-size batches sized to fit GPU memory
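A typical usage sketch (variable names are placeholders; `instances` is whatever your DatasetReader produced):

iterator = BucketIterator(batch_size=32,
                          sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)  # the iterator needs the vocabulary to turn tokens into ids
for batch in iterator(instances, num_epochs=1):
    ...  # each batch is a dict of correctly padded tensors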

153 of 254

Tokenizer

  • Single abstraction for both word-level and character-level tokenization
  • Possibly this wasn't the right decision!
  • Pros:
    • easy to switch between words-as-tokens and characters-as-tokens in the same model
  • Cons:
    • non-standard names + extra complexity
    • doesn’t seem to get used this way at all

154 of 254

back to the Model

155 of 254

Model is a subclass of torch.nn.Module

  • so if you give it members that are torch.nn.Parameters or are themselves torch.nn.Modules, all the optimization will just work*
  • for reasons we'll see in a bit, we'll also inject any model component that we might want to configure
  • and AllenNLP provides NLP / deep-learning abstractions that allow us not to reinvent the wheel

*usually on the first try it won't "just work", but usually that's your fault not PyTorch's

156 of 254

TokenEmbedder

  • turns ids (the outputs of your TokenIndexers) into tensors
  • many options:
    • learned word embeddings
    • pretrained word embeddings
    • contextual embeddings (e.g. ELMo)
    • character embeddings + Seq2VecEncoder

157 of 254

Seq2VecEncoder

in:  (batch_size, sequence_length, embedding_dim)
out: (batch_size, embedding_dim)

  • bag of words
  • (last output of) LSTM
  • CNN + pooling

158 of 254

Seq2SeqEncoder

in:  (batch_size, sequence_length, embedding_dim)
out: (batch_size, sequence_length, embedding_dim)

  • LSTM (and friends)
  • self-attention
  • do-nothing

159 of 254

Wait, Two Different Abstractions for RNNs?

  • Conceptually, RNN-for-Seq2Seq is different from RNN-for-Seq2Vec
  • In particular, the class of possible replacements for the former is different from the class of replacements for the latter
  • That is, "RNN" is not the right abstraction for NLP!
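To make that concrete, here's a sketch using the wrappers AllenNLP provides around the same underlying torch.nn.LSTM (shapes in the comments):

lstm = torch.nn.LSTM(input_size=50, hidden_size=100, batch_first=True)

# Seq2SeqEncoder: (batch, seq_len, 50) -> (batch, seq_len, 100)
seq2seq_encoder = PytorchSeq2SeqWrapper(lstm)

# Seq2VecEncoder: (batch, seq_len, 50) -> (batch, 100)
seq2vec_encoder = PytorchSeq2VecWrapper(lstm)

A self-attention layer is a drop-in replacement for the first, but not for the second, which is exactly why they are separate abstractions.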

160 of 254

Attention

in:  (batch_size, sequence_length, embedding_dim), (batch_size, embedding_dim)
out: (batch_size, sequence_length)

  • dot product (xᵀy)
  • bilinear (xᵀWy)
  • linear ([x; y; x∗y; …]ᵀw)

161 of 254

MatrixAttention

in:  (batch_size, sequence_length1, embedding_dim), (batch_size, sequence_length2, embedding_dim)
out: (batch_size, sequence_length1, sequence_length2)

  • dot product (xᵀy)
  • bilinear (xᵀWy)
  • linear ([x; y; x∗y; …]ᵀw)

162 of 254

Attention and MatrixAttention

  • These look similar - you could imagine sharing the similarity computation code
  • We did this at first - code sharing, yay!
  • But it was very memory inefficient - code sharing isn’t always a good idea
  • You could also imagine having a single Attention abstraction that also works for attention matrices
  • But then you have a muddied and confusing input/output spec
  • So, again, more duplicated (or at least very similar) code, but in this case that’s probably the right decision, especially for efficiency

163 of 254

SpanExtractor

  • Many modern NLP models use representations of spans of text
    • Used by the Constituency Parser and the Co-reference model in AllenNLP
    • We generalised this after needing it again to implement the Constituency Parser.
  • Lots of ways to represent a span:
    • Difference of endpoints
    • Concatenation of endpoints (etc)
    • Attention over intermediate words

in:  Sequence of Text: (batch_size, sequence_length, embedding_dim)
     Span Indices: (batch_size, num_spans, 2)
out: Embedded Spans: (batch_size, num_spans, embedding_dim)

164 of 254

This seems like a lot of abstractions!

  • But in most cases it's pretty simple:
    • create a DatasetReader that generates the Instances you want
      • (if you're using a standard dataset, likely one already exists)
    • create a Model that turns Instances into predictions and a loss
      • use off-the-shelf components => can often write little code
    • create a JSON config and use the AllenNLP training code
    • (and also often a Predictor, coming up next)
  • We'll go through a detailed example at the end of the tutorial
  • And you can write as much PyTorch as you want when the built-in components don't do what you need

165 of 254

Abstractions just to make your life nicer

166 of 254

Declarative syntax

"model": {

"type": "crf_tagger",

"label_encoding": "BIOUL",

"constrain_crf_decoding": true,

"calculate_span_f1": true,

"dropout": 0.5,

"include_start_end_transitions": false,

"text_field_embedder": {

"token_embedders": {

"tokens": {

"type": "embedding",

"embedding_dim": 50,

"pretrained_file": "glove.6B.50d.txt.gz",

"trainable": true

},

"token_characters": {

"type": "character_encoding",

"embedding": {

"embedding_dim": 16

},

"encoder": {

"type": "cnn",

"embedding_dim": 16,

"num_filters": 128,

"ngram_filter_sizes": [3],

"conv_layer_activation": "relu"

}

}

},

},

"encoder": {

"type": "lstm",

"input_size": 50 + 128,

"hidden_size": 200,

"num_layers": 2,

"dropout": 0.5,

"bidirectional": true

},

},

most AllenNLP objects can be instantiated from Jsonnet blobs

167 of 254

Declarative syntax

  • allows us to specify an entire experiment using JSON
  • allows us to change architectures without changing code

"encoder": {

"type": "lstm",

"input_size": 50 + 128,

"hidden_size": 200,

"num_layers": 2,

"dropout": 0.5,

"bidirectional": true

},

"encoder": {

"type": "gru",

"input_size": 50 + 128,

"hidden_size": 200,

"num_layers": 1,

"dropout": 0.5,

"bidirectional": true

},

"encoder": {

"type": "pass_through",

"input_dim": 50 + 128

},

168 of 254

Declarative syntax

How does it work?

  • Registrable
    • retrieve a class by its name
  • FromParams
    • instantiate a class instance from JSON

169 of 254

Registrable

class Model(torch.nn.Module, Registrable): ...

@Model.register("bidaf")
class BidirectionalAttentionFlow(Model): ...

@Model.register("decomposable_attention")
class DecomposableAttention(Model): ...

@Model.register("simple_tagger")
class SimpleTagger(Model): ...

# Model.by_name("bidaf") returns the class itself
model = Model.by_name("bidaf")(param1,
                               param2,
                               ...)

  • so now, given a model "type" (specified in the JSON config), we can programmatically retrieve the class
  • remaining problem: how do we programmatically call the constructor?

170 of 254

Model config, again

"model": {

"type": "crf_tagger",

"label_encoding": "BIOUL",

"constrain_crf_decoding": true,

"calculate_span_f1": true,

"dropout": 0.5,

"include_start_end_transitions": false,

"text_field_embedder": {

"token_embedders": {

"tokens": {

"type": "embedding",

"embedding_dim": 50,

"pretrained_file": "glove.6B.50d.txt.gz",

"trainable": true

},

"token_characters": {

"type": "character_encoding",

"embedding": {

"embedding_dim": 16

},

"encoder": {

"type": "cnn",

"embedding_dim": 16,

"num_filters": 128,

"ngram_filter_sizes": [3],

"conv_layer_activation": "relu"

}

}

},

},

"encoder": {

"type": "lstm",

"input_size": 50 + 128,

"hidden_size": 200,

"num_layers": 2,

"dropout": 0.5,

"bidirectional": true

},

},

171 of 254

from_params, originally

@Model.register("crf_tagger")

class CrfTagger(Model):

def __init__(

self,

vocab: Vocabulary,

text_field_embedder: TextFieldEmbedder,

encoder: Seq2SeqEncoder,

label_namespace: str = "labels",

constraint_type: str = None,

include_start_end_transitions: bool = True,

dropout: float = None,

initializer: InitializerApplicator = None,

regularizer: Optional[RegularizerApplicator] = None

) -> None:

...

@classmethod

def from_params(cls,

vocab: Vocabulary,

params: Params) -> 'CrfTagger':

embedder_params = params.pop("text_field_embedder")

text_field_embedder = TextFieldEmbedder.from_params(vocab,

embedder_params)

encoder = Seq2SeqEncoder.from_params(params.pop("encoder"))

label_namespace = params.pop("label_namespace", "labels")

constraint_type = params.pop("constraint_type", None)

dropout = params.pop("dropout", None)

include_start_end_transitions = \

params.pop("include_start_end_transitions", True)

initializer_params = params.pop('initializer', [])

initializer = InitializerApplicator.from_params(initializer_params)

regularizer_params = params.pop('regularizer', [])

regularizer = RegularizerApplicator.from_params(regularizer_params)

params.assert_empty(cls.__name__)

return cls(vocab=vocab,

text_field_embedder=text_field_embedder,

encoder=encoder,

label_namespace=label_namespace,

constraint_type=constraint_type,

dropout=dropout,

include_start_end_transitions=include_start_end_transitions,

initializer=initializer,

  • have to write all the parameters twice
  • better make sure you use the same default values in both places!
  • tedious + error-prone
  • the way from_params works should (in most cases) be obvious from the constructor

172 of 254

from_params, now

class FromParams:
    @classmethod
    def from_params(cls: Type[T], params: Params, **extras) -> T:
        # import here to avoid circular imports
        from allennlp.common.registrable import Registrable

        if params is None:
            return None

        registered_subclasses = Registrable._registry.get(cls)
        if registered_subclasses is not None:
            as_registrable = cast(Type[Registrable], cls)
            default_to_first_choice = as_registrable.default_implementation is not None
            choice = params.pop_choice("type",
                                       choices=as_registrable.list_available(),
                                       default_to_first_choice=default_to_first_choice)
            subclass = registered_subclasses[choice]
            if not takes_arg(subclass.from_params, 'extras'):
                extras = {k: v for k, v in extras.items() if takes_arg(subclass.from_params, k)}
            return subclass.from_params(params=params, **extras)
        else:
            if cls.__init__ == object.__init__:
                kwargs: Dict[str, Any] = {}
            else:
                kwargs = create_kwargs(cls, params, **extras)
            return cls(**kwargs)  # type: ignore

173 of 254

from_params, now

def create_kwargs(cls: Type[T], params: Params, **extras) -> Dict[str, Any]:
    """
    Given some class, a `Params` object, and potentially other keyword arguments,
    create a dict of keyword args suitable for passing to the class's constructor.

    The function does this by finding the class's constructor, matching the constructor
    arguments to entries in the `params` object, and instantiating values for the parameters
    using the type annotation and possibly a from_params method.

    Any values that are provided in the `extras` will just be used as is.
    For instance, you might provide an existing `Vocabulary` this way.
    """
    ...

174 of 254

Trainer

class Trainer(Registrable):
    def __init__(self,
                 model: Model,
                 optimizer: torch.optim.Optimizer,
                 iterator: DataIterator,
                 train_dataset: Iterable[Instance],
                 validation_dataset: Optional[Iterable[Instance]] = None,
                 patience: Optional[int] = None,
                 validation_metric: str = "-loss",
                 validation_iterator: DataIterator = None,
                 shuffle: bool = True,
                 num_epochs: int = 20,
                 serialization_dir: Optional[str] = None,
                 num_serialized_models_to_keep: int = 20,
                 keep_serialized_model_every_num_seconds: int = None,
                 model_save_interval: float = None,
                 cuda_device: Union[int, List] = -1,
                 grad_norm: Optional[float] = None,
                 grad_clipping: Optional[float] = None,
                 learning_rate_scheduler: LearningRateScheduler = None,
                 summary_interval: int = 100,
                 histogram_interval: int = None,
                 should_log_parameter_statistics: bool = True,
                 should_log_learning_rate: bool = False) -> None: ...

  • configurable training loop with tons of options
    • your favorite PyTorch optimizer
    • early stopping
    • many logging options
    • many serialization options
    • learning rate schedulers
  • (almost all of them optional)
  • as always, configuration happens in your JSON experiment config

175 of 254

Model archives

  • training loop produces a model.tar.gz
    • config.json + vocabulary + trained model weights
  • can be used with command line tools to evaluate on test datasets or to make predictions (example commands below)
  • can be used to power an interactive demo
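For example, with the 0.x command-line interface (a sketch; the data file names are placeholders):

allennlp evaluate model.tar.gz dev.txt
allennlp predict model.tar.gz inputs.jsonl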

176 of 254

Making Predictions

177 of 254

Predictor

  • models are tensor-in, tensor-out
  • for creating a web demo, want JSON-in, JSON-out
  • same for making predictions interactively
  • Predictor is just a simple JSON wrapper for your model

@Predictor.register('sentence-tagger')
class SentenceTaggerPredictor(Predictor):
    def __init__(self,
                 model: Model,
                 dataset_reader: DatasetReader) -> None:
        super().__init__(model, dataset_reader)
        self._tokenizer = SpacyWordSplitter(language='en_core_web_sm',
                                            pos_tags=True)

    def predict(self, sentence: str) -> JsonDict:
        return self.predict_json({"sentence": sentence})

    @overrides
    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        sentence = json_dict["sentence"]
        tokens = self._tokenizer.split_words(sentence)
        return self._dataset_reader.text_to_instance(tokens)

this is (partly) why we split out text_to_instance as its own function in the dataset reader

and it's enabled by all of our models taking optional labels and returning an optional loss, along with model internals and other interesting outputs

178 of 254

Serving a demo

With this setup, serving a demo is easy.

    • DatasetReader gives us text_to_instance
    • Labels are optional in the model and dataset reader
    • Model returns an arbitrary dict, so can get and visualize model internals
    • Predictor wraps it all in JSON
    • Archive lets us load a pre-trained model in a server
    • Even better: pre-built UI components (using React) to visualize standard pieces of a model, like attentions, or span labels

179 of 254

We don't have it all figured out!

still figuring out some abstractions that we may not have correct

  • regularization and initialization
  • models with pretrained components
  • more complex training loops
    • e.g. multi-task learning
  • Caching preprocessed data
  • Expanding vocabulary / embeddings at test time
  • Discoverability of config options

you can do all these things, but almost certainly not in the most optimal / generalizable way

180 of 254

Case study

181 of 254

"an LSTM for part-of-speech tagging"

182 of 254

The Problem

Given a training dataset that looks like

The###DET dog###NN ate###V the###DET apple###NN

Everybody###NN read###V that###DET book###NN

learn to predict part-of-speech tags

183 of 254

With a Few Enhancements to Make Things More Realistic

  • read data from files
  • check performance on a separate validation dataset
  • use tqdm to track training progress
  • implement early stopping based on validation loss
  • track accuracy as we're training

184 of 254

Start With a Simple Baseline Model

  • compute a vector embedding for each word
  • feed the sequence of embeddings into an LSTM
  • feed the hidden states into a feed-forward layer to produce a sequence of logits

The dog ate the apple
    | embedding
    v
word vectors: v_The, v_dog, v_ate, v_the, v_apple
    | LSTM
    v
encodings: w_The, w_dog, w_ate, w_the, w_apple
    | Linear
    v
tag logits: L_The, L_dog, L_ate, L_the, L_apple

185 of 254

v0: numpy

aka "this is why we use libraries"

186 of 254

v0: numpy (aka "this is why we use libraries")

class LSTM:
    def __init__(self, input_size: int, hidden_size: int) -> None:
        self.params = {
            # forget gate
            "w_f": np.random.randn(input_size, hidden_size),
            "b_f": np.random.randn(hidden_size),
            "u_f": np.random.randn(hidden_size, hidden_size),
            # external input gate
            "w_g": np.random.randn(input_size, hidden_size),
            "b_g": np.random.randn(hidden_size),
            "u_g": np.random.randn(hidden_size, hidden_size),
            # output gate
            "w_q": np.random.randn(input_size, hidden_size),
            "b_q": np.random.randn(hidden_size),
            "u_q": np.random.randn(hidden_size, hidden_size),
            # usual params
            "w": np.random.randn(input_size, hidden_size),
            "b": np.random.randn(hidden_size),
            "u": np.random.randn(hidden_size, hidden_size),
        }
        self.grads = {name: None for name in self.params}

187 of 254

v1: PyTorch

188 of 254

v1: PyTorch - Load Data

def load_data(file_path: str) -> List[Tuple[str, str]]:
    """
    One sentence per line, formatted like

        The###DET dog###NN ate###V the###DET apple###NN

    Returns a list of pairs (tokenized_sentence, tags)
    """
    data = []
    with open(file_path) as f:
        for line in f:
            pairs = line.strip().split()
            sentence, tags = zip(*(pair.split("###") for pair in pairs))
            data.append((sentence, tags))
    return data

seems reasonable

189 of 254

v1: PyTorch - Define Model

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim: int, hidden_dim: int,
                 vocab_size: int, tagset_size: int) -> None:
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs,
        # and outputs hidden states with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

much nicer than writing our own LSTM!

190 of 254

v1: PyTorch - Train Model

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM,
                   len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

validation_losses = []
patience = 10

for epoch in range(1000):
    training_loss = 0.0
    validation_loss = 0.0
    for dataset, training in [(training_data, True),
                              (validation_data, False)]:
        correct = total = 0
        torch.set_grad_enabled(training)
        t = tqdm.tqdm(dataset)
        for i, (sentence, tags) in enumerate(t):
            model.zero_grad()
            model.hidden = model.init_hidden()
            sentence_in = prepare_sequence(sentence, word_to_ix)
            targets = prepare_sequence(tags, tag_to_ix)
            tag_scores = model(sentence_in)
            loss = loss_function(tag_scores, targets)
            predictions = tag_scores.max(-1)[1]
            correct += (predictions == targets).sum().item()
            total += len(targets)
            accuracy = correct / total
            if training:
                loss.backward()
                training_loss += loss.item()
                t.set_postfix(training_loss=training_loss/(i + 1),
                              accuracy=accuracy)
                optimizer.step()
            else:
                validation_loss += loss.item()
                t.set_postfix(validation_loss=validation_loss/(i + 1),
                              accuracy=accuracy)
    validation_losses.append(validation_loss)
    if (patience and
            len(validation_losses) >= patience and
            validation_losses[-patience] ==
            min(validation_losses[-patience:])):
        print("patience reached, stopping early")
        break

this part is maybe less than ideal

191 of 254

v2: AllenNLP

(but without config files)

192 of 254

v2: AllenNLP - Dataset Reader

class PosDatasetReader(DatasetReader):
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}
        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field
        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)

193 of 254

v2: AllenNLP - Model

class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)
        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

194 of 254

v2: AllenNLP - Training

reader = PosDatasetReader()
train_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/training.txt'))
validation_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/validation.txt'))

vocab = Vocabulary.from_instances(train_dataset + validation_dataset)

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)

optimizer = optim.SGD(model.parameters(), lr=0.1)
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(model=model, optimizer=optimizer, iterator=iterator,
                  train_dataset=train_dataset, validation_dataset=validation_dataset,
                  patience=10, num_epochs=1000)
trainer.train()

this is where the config-driven approach would make our lives a lot easier

195 of 254

v3: AllenNLP + config

196 of 254

v3: AllenNLP - config

local embedding_dim = 6;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;

{
    "train_data_path": "...",
    "validation_data_path": "...",
    "dataset_reader": { "type": "pos-tutorial" },
    "model": {
        "type": "lstm-tagger",
        "word_embeddings": {
            "token_embedders": {
                "tokens": {
                    "type": "embedding",
                    "embedding_dim": embedding_dim
                }
            }
        },
        "encoder": {
            "type": "lstm",
            "input_size": embedding_dim,
            "hidden_size": hidden_dim
        }
    },
    "iterator": {
        "type": "bucket",
        "batch_size": batch_size,
        "sorting_keys": [["sentence", "num_tokens"]]
    },
    "trainer": {
        "num_epochs": num_epochs,
        "optimizer": {
            "type": "sgd",
            "lr": learning_rate
        },
        "patience": patience
    }
}

params = Params.from_file('...')
serialization_dir = tempfile.mkdtemp()
model = train_model(params, serialization_dir)

197 of 254

Augmenting the Tagger with Character-Level Features

198 of 254

v1: PyTorch

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim: int, hidden_dim: int,
                 vocab_size: int, tagset_size: int) -> None:
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs,
        # and outputs hidden states with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # Linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

  • add char_embedding_dim
  • add a char_embedding layer (= embedding + LSTM?)
  • change the LSTM input dim
  • compute char embeddings
  • concatenate inputs

we really have to change our model code and how it works

199 of 254

v1: PyTorch

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim: int, hidden_dim: int,
                 vocab_size: int, tagset_size: int) -> None:
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs,
        # and outputs hidden states with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # Linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

I'm not really that thrilled to do this exercise

200 of 254

v2: AllenNLP

reader = PosDatasetReader()
# ...
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
# ...
token_embedding = Embedding(
    num_embeddings=vocab.get_vocab_size('tokens'),
    embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder(
    {"tokens": token_embedding})
# ...

# ... becomes ...

reader = PosDatasetReader(token_indexers={
    "tokens": SingleIdTokenIndexer(),
    "token_characters": TokenCharactersIndexer()})
# ...
WORD_EMBEDDING_DIM = 5
CHAR_EMBEDDING_DIM = 3
EMBEDDING_DIM = WORD_EMBEDDING_DIM + CHAR_EMBEDDING_DIM
HIDDEN_DIM = 6
# ...
token_embedding = Embedding(
    num_embeddings=vocab.get_vocab_size('tokens'),
    embedding_dim=WORD_EMBEDDING_DIM)
char_embedding = TokenCharactersEncoder(
    embedding=Embedding(
        num_embeddings=vocab.get_vocab_size('token_characters'),
        embedding_dim=CHAR_EMBEDDING_DIM),
    encoder=PytorchSeq2VecWrapper(
        torch.nn.LSTM(CHAR_EMBEDDING_DIM, CHAR_EMBEDDING_DIM,
                      batch_first=True)))
word_embeddings = BasicTextFieldEmbedder({
    "tokens": token_embedding,
    "token_characters": char_embedding})
# ...

  • add a second token indexer
  • add an extra parameter
  • add a character embedder
  • use the character embedder
  • no changes to the model itself!

201 of 254

v3: AllenNLP - config

local embedding_dim = 6;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;

{
    "train_data_path": "...",
    "validation_data_path": "...",
    "dataset_reader": { "type": "pos-tutorial" },
    "model": {
        "type": "lstm-tagger",
        "word_embeddings": {
            "token_embedders": {
                "tokens": {
                    "type": "embedding",
                    "embedding_dim": embedding_dim
                }
            }
        },
        "encoder": {
            "type": "lstm",
            "input_size": embedding_dim,
            "hidden_size": hidden_dim
        }
    },

we can accomplish this with just a couple of minimal config changes

202 of 254

v3: AllenNLP - config

Before:

local embedding_dim = 6;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;

After:

local word_embedding_dim = 5;
local char_embedding_dim = 3;
local embedding_dim = word_embedding_dim + char_embedding_dim;
local hidden_dim = 6;
local num_epochs = 1000;
local patience = 10;
local batch_size = 2;
local learning_rate = 0.1;

add a couple of new Jsonnet variables

203 of 254

v3: AllenNLP - config

"dataset_reader": { "type": "pos-tutorial" }

"dataset_reader": {

"type": "pos-tutorial",

"token_indexers": {

"tokens": { "type": "single_id" },

"token_characters": { "type": "characters" }

}

}

add a second token indexer

204 of 254

v3: AllenNLP - config

"model": {

"type": "lstm-tagger",

"word_embeddings": {

"token_embedders": {

"tokens": {

"type": "embedding",

"embedding_dim": embedding_dim

}

}

},

"encoder": {

"type": "lstm",

"input_size": embedding_dim,

"hidden_size": hidden_dim

}

}

"model": {

"type": "lstm-tagger",

"word_embeddings": {

"token_embedders": {

"tokens": {

"type": "embedding",

"embedding_dim": word_embedding_dim

},

"token_characters": {

"type": "character_encoding",

"embedding": {

"embedding_dim": char_embedding_dim,

},

"encoder": {

"type": "lstm",

"input_size": char_embedding_dim,

"hidden_size": char_embedding_dim

}

}

},

},

"encoder": {

"type": "lstm",

"input_size": embedding_dim,

"hidden_size": hidden_dim

}

}

add a corresponding token embedder

205 of 254

For a one-time change this is maybe not such a big win.

206 of 254

But being able to experiment with lots of architectures without having to change any code (and with a reproducible JSON description of each experiment) is a huge boon to research! (we think)

207 of 254

Sharing Your Research

How to make it easy to release your code

208 of 254

209 of 254

In the least amount of time possible:

Simplify your workflow for installation and data

Make your code run anywhere*

Isolated environments for your project

210 of 254

Docker

211 of 254

Objective: You don’t feel like this about Docker

212 of 254

What does Docker Do?

  • Creates a virtual machine that will always run the same anywhere (in theory)

  • Allows you to package up a virtual machine and some code and send it to someone, knowing the same thing will run

  • This includes the operating system, your code's dependencies, the code itself, etc.

  • Lets you specify, as a series of steps, how to create this virtual machine, and does clever caching when you change it

213 of 254

214 of 254

215 of 254

3 Ideas: Dockerfiles, Images and Containers

216 of 254

Step 1: Write a Dockerfile

Here is a finished Dockerfile.

How does this work?
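The finished Dockerfile shown on the slide isn't reproduced in this text, but it assembles exactly the commands explained next; a plausible reconstruction (project paths and the script name are placeholders):

# start from a base image that already has Python installed
FROM python:3.6.3-jessie

# an environment variable available inside the container
ENV LANG=C.UTF-8

# copy the dependency list and install it first, so this layer
# stays cached even when your code changes
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# copy the code itself (changes most often, so it goes last)
COPY my_research/ my_research/

# what gets run when you `docker run` the built image
CMD ["python", "my_research/run.py"]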

217 of 254

Step 1: Write a Dockerfile

COMMAND <command>

218 of 254

Step 1: Write a Dockerfile

COMMAND <command>

Dockerfile commands are capitalised. Some important ones are:

FROM, RUN, ENV, COPY and CMD

219 of 254

Step 1: Write a Dockerfile

FROM python:3.6.3-jessie

FROM includes another Dockerfile in your one. Here we start from a base Python Dockerfile.

220 of 254

Step 1: Write a Dockerfile

RUN pip install -r requirements.txt

RUN … runs a command. To use a command, it must be installed in a previous step!

221 of 254

Step 1: Write a Dockerfile

ENV LANG=C.UTF-8

ENV sets an environment variable which can be used inside the container.

222 of 254

Step 1: Write a Dockerfile

COPY my_research/ my_research/

COPY copies code from your current folder into the Docker image.

223 of 254

Step 1: Write a Dockerfile

COPY my_research/ my_research/

Do yourself a favour. Don’t change the names of things during this step.

224 of 254

Step 1: Write a Dockerfile

CMD ["/bin/bash"]

CMD ["python", "my/script.py"]

CMD is what gets run when you run a built image.

225 of 254

Step 1: Write a Dockerfile

Here is a finished Dockerfile.

226 of 254

Step 2: Build your Dockerfile into an Image

docker build --tag <name> .

227 of 254

Step 2: Build your Dockerfile into an Image

docker build --tag <name> .

This is what you want the image to be called, e.g. markn/my-paper-code.

228 of 254

Step 2: Build your Dockerfile into an Image

docker build --tag <name> .

You can see what images you have built already by running

docker images

229 of 254

Step 2: Build your Dockerfile into an Image

docker build --tag <name> .

This describes where docker should look for a Dockerfile. It can also be a URL.

230 of 254

Step 2: Build your Dockerfile into an Image

docker build --tag <name> .

If you've already built a line of your Dockerfile before, Docker will remember and not build it again (so long as the steps before it haven't changed).

231 of 254

Step 2: Build your Dockerfile into an Image

docker build --tag <name> .

TIP: Put things that change more frequently (like your code) lower down in your Dockerfile.

232 of 254

Step 3: Run your Image as a Container

docker run <image-name>

233 of 254

Step 3: Run your Image as a Container

docker run -it <image-name>

-i: interactive

-t: tty (with a terminal)

234 of 254

Step 3: Run your Image as a Container

docker run -it <image-name> /bin/bash

Passing a command after the image name gives you a command prompt inside any Docker container, regardless of the CMD in the Dockerfile.

235 of 254

Optional Step 4: DockerHub

docker push <image-name>

DockerHub is to Docker as GitHub is to Git

Docker automatically looks at DockerHub to find images to run

236 of 254

Pros of Docker

  • Good for running CI - ALL your code dependencies are pinned, even system-level stuff.

  • Good for debugging people's problems with your code - just ask: "Can you reproduce the bug in a Docker container?"

  • Great for deploying demos where you just need a model to run as a service.

237 of 254

Cons of Docker

  • Docker is designed for production systems - it is very hard to debug inside a minimal Docker container

  • Images take up a lot of disk space if you have large dependencies (e.g. the JVM makes up about half of the AllenNLP Docker image)

  • Just because your code is exactly reproducible doesn't mean that it's any good

238 of 254

Releasing your data

239 of 254

Use a simple file cache

There are currently 27 CoreNLP Jar files you could download from the CoreNLP website

240 of 254

Use a simple file cache

embedding_file = cached_path("embedding_url")
datasets = cached_path("dataset_url")

241 of 254

Use a simple file cache

But now I have to write a file cache ....

242 of 254

Use a simple file cache

Copy this file into your project

from file_cache import cached_path

embeddings = cached_path(url)

243 of 254

Isolated (Python) environments

244 of 254

Python environments

Stable environments for Python can be tricky

This makes releasing code very annoying

245 of 254

Python environments

Docker is ideal, but not great for developing locally. For this, you should either use virtualenvs or anaconda.

Here we will talk about anaconda, because it’s what we use.

246 of 254

Python environments

Anaconda is a very stable distribution of Python (amongst other things). Installing it is easy:

https://www.anaconda.com/

247 of 254

Python environments

One annoying install step: adding the install location to the front of your PATH environment variable.

export PATH="/path/to/anaconda/bin:$PATH"

248 of 254

Python environments

Now, your default python should be an Anaconda one (you did install Python 3.6+, didn't you?).

249 of 254

Virtual environments

Every time you start a new project, make a new virtual environment which has only its dependencies in.

conda create -n your-env-name python=3.6

250 of 254

Virtual environments

Before you work on your project, run this command. This prepends the location of this particular copy of Python to your PATH.

source activate your-env-name

pip install -r requirements.txt

etc

251 of 254

Virtual environments

When you’re done, or you want to work on a different project, run:

source deactivate

252 of 254

In Conclusion

253 of 254

In Conclusion

  • Prototype fast (but still safely)
  • Write production code safely (but still fast)
  • Good processes => good science
  • Use the right abstractions
  • Check out AllenNLP

254 of 254

Thanks for Coming!

Questions?

please fill out our survey:

https://tinyurl.com/emnlp-tutorial-survey

will tweet out link to slides after talk

@ai2_allennlp