1 of 23

Natural Language Processing...

In the kitchen

@anthonyjpesce / Los Angeles Times

2 of 23

Starting point: 465,590 line text dump

..TE:

Bread pudding and assembly

..TE:

6 egg yolks

..TE:

1 quart heavy cream

..TE:

3/4 cups sugar

..TE:

1 1/2 tablespoons brandy

..TE:

1 tablespoon vanilla extract

..TE:

1/2 vanilla bean, scraped

..TE:

1 pound brioche bread, cut into 1-inch cubes and dried

..TE:

Creme anglaise

..TE:

Caramel sauce

..TE:

1. Heat the oven to 350 degrees.

..TE:

2. In a large bowl, whisk together the egg yolks, cream, sugar,

brandy, vanilla extract and the seeds from the bean to form a custard

base. Add the bread cubes and toss to combine. Set the mixture aside

until the bread cubes are soaked with the custard base, about 30

minutes, tossing every 10 minutes.

..TE:

3. Spoon the mixture into a greased 13-by-9-inch baking dish.

Place the dish in a larger roasting pan, and fill the pan with hot

water until it comes up the baking dish halfway. Wrap the top of the

baking dish with foil and place the roasting pan in the oven.

..TE:

4. Bake the bread pudding until the custard is set, about 45

minutes.

..TE:

5. Remove the roasting pan from the oven and carefully remove the

baking dish from the pan. Uncover the bread pudding and place the

baking dish back in the oven. Increase the oven temperature to 400

degrees. Continue to bake the bread pudding until the top is golden

brown, an additional 8 to 10 minutes.

..TE:

5. Remove the baking dish to a rack to cool slightly. Serve the

bread pudding warm topped with creme anglaise and caramel sauce.

..TE:

At your door, at your service

..TE:

Many fresh meal delivery services operate in Los Angeles and offer

diet programs. Here are the three featured in the story.

..TE:

Fresh Dining. (818) 981-4700, www.freshdining.com. Delivers to Los

Angeles and Orange counties. Daily menus include organic or

conventional ingredients in three meals and two daily snacks. Prices

range from $42.95 a day for a 90-day subscription to $55.95 a day for

two weeks of organic meals. A corporate program will offer meals

delivered to offices.

..TE:

Zone Los Angeles. (323) 290-0200; www.zone-la.com. Delivers

throughout the L.A. area, Malibu to Burbank, Huntington Beach to

Hollywood and points in between. The Zone-compliant programs consist

of three meals and two snacks delivered daily (two days' worth in one

weekend delivery). A five-day trial can be credited to a longer

subscription. All programs are $45 per day; free days with longer

commitments.

..TE:

Zone Chefs. (800) 939-0663, Ext. 1; www.zonechefs.com. Delivers

three-meal, two-snack daily packages throughout Los Angeles, San

Bernardino, Riverside and Orange counties. Prices range from $37 a

day for a 21-day plan to $42.50 for a seven-day trial.

..TE:

..GT:

..CP:

PHOTO: (no caption)

..CP:

ID NUMBER:20050727iixabdkn

..CP:

PHOTOGRAPHER: Bryan Chan Los Angeles Times

..CP:

PHOTO: (no caption)

..CP:

ID NUMBER:20050727iixagkkn

..CP:

PHOTOGRAPHER: Bryan Chan Los Angeles Times

..CP:

PHOTO: MIDDAY CHOICES: Zone Los Angeles' shrimp salad with

edamame, left, and a snack of string cheese, grapes and tapenade.

..CP:

ID NUMBER:20050727ih5qurkf

..CP:

3 of 23

There were about 6,000 recipes in there.

Um, where?

4 of 23

We needed:

  • Each individual recipe
  • Sub recipes (pie, crust, filling)
  • Related recipes
  • Description
  • Title
  • Ingredients
  • Steps
  • Time and serving
  • Nutrition
  • And more

5 of 23

6 of 23

>>> import nltk

7 of 23

The basic approach

  1. Walk through the text file line by line
  2. For each line, classify it as one of several fields (ingredient, step, title, etc.)
  3. Circle back and reassemble into a database
  4. Human review
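
Steps 1 and 2 above can be sketched in plain Python. This is only an illustration, not the code used on the project: it assumes each "..TE:" marker in the dump starts a new block, and the names `split_blocks` and `classify` are hypothetical. The rule inside `classify` is a placeholder for where a trained classifier would go.

```python
# A short sample in the same shape as the text dump.
SAMPLE = """..TE:
6 egg yolks
..TE:
1 quart heavy cream
..TE:
1. Heat the oven to 350 degrees.
..TE:
4. Bake the bread pudding until the custard is set, about 45
minutes.
"""

def split_blocks(dump, marker="..TE:"):
    """Yield the text between markers, with wrapped lines rejoined."""
    block = []
    for line in dump.splitlines():
        if line.strip() == marker:
            if block:
                yield " ".join(block)
            block = []
        elif line.strip():
            block.append(line.strip())
    if block:
        yield " ".join(block)

def classify(block):
    """Placeholder rule (an assumption); a trained classifier would go here."""
    first = block.split()[0]
    if first.endswith(".") and first.rstrip(".").isdigit():
        return "step"        # looks like "1." or "2."
    if first[0].isdigit():
        return "ingredient"  # looks like "6" or "3/4"
    return "other"

# Step 3 would write these pairs to a database; a list stands in for it here.
records = [(classify(block), block) for block in split_blocks(SAMPLE)]
```

Hand-written rules like these break down quickly on a dump this messy, which is the reason for training a classifier.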

8 of 23

NLTK can:

  • Tokenize: sentences to words, paragraphs to sentences, etc.
  • Part-of-speech tag text
  • Find “named entities”
  • N-grams (groups of N words)
  • Stem words
  • More!

9 of 23

NLTK also includes

Several classifiers you can train to tag text.

You just need to teach it what to look for.

The more you teach, the better it gets!

10 of 23

It works like this:

  1. Train a classifier on a small, random sample of your data
  2. Try it out, train on a larger sample if necessary
  3. Accurately classify huge amounts of new data based on the sample
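
To make the train-then-classify loop concrete, here is a minimal sketch of a simplified naive Bayes classifier in plain Python. It is a stand-in for NLTK's classifiers, not the one used in the project, and the six training lines are a toy sample taken from the dump shown earlier. The function names `features`, `train` and `classify` are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """The set of lowercased, whitespace-separated tokens in a line."""
    return set(text.lower().split())

def train(samples):
    """Count how many lines under each label contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in samples:
        label_counts[label] += 1
        for word in features(text):
            word_counts[label][word] += 1
    return label_counts, word_counts

def classify(model, text):
    """Pick the label with the highest smoothed log-probability."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        for word in features(text):
            # Add-one smoothing so an unseen word does not rule out a label.
            score += math.log((word_counts[label][word] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_sample = [
    ("6 egg yolks", "ingredient"),
    ("1 quart heavy cream", "ingredient"),
    ("3/4 cups sugar", "ingredient"),
    ("Heat the oven to 350 degrees.", "step"),
    ("Bake the bread pudding until the custard is set", "step"),
    ("Spoon the mixture into a greased baking dish.", "step"),
]
model = train(training_sample)
```

Calling `classify(model, "1 tablespoon vanilla extract")` labels a line the model has never seen. If results on a held-out sample look poor, add more labeled lines and train again.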

11 of 23

Bag of words classification

whip, saucepan, season, while, bake, chop, cover, reduce = Step

cup, 3/4, scant, kale, chicken, quartered, sugar, dried = Ingredient
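
The bag-of-words idea can be shown with nothing more than set overlap. The sketch below uses the two word lists from this slide as fixed vocabularies; a real classifier learns its vocabularies and weights from training data, so treat `guess_label` as a hypothetical illustration only.

```python
STEP_WORDS = {"whip", "saucepan", "season", "while", "bake", "chop", "cover", "reduce"}
INGREDIENT_WORDS = {"cup", "3/4", "scant", "kale", "chicken", "quartered", "sugar", "dried"}

def guess_label(text):
    """Label a line by which vocabulary it shares more words with."""
    # Word order and repeats are ignored: that is the "bag" in bag of words.
    words = set(text.lower().replace(",", " ").split())
    step_hits = len(words & STEP_WORDS)
    ingredient_hits = len(words & INGREDIENT_WORDS)
    if step_hits == ingredient_hits:
        return "unknown"
    return "step" if step_hits > ingredient_hits else "ingredient"
```

A line such as "3/4 cup sugar" matches only ingredient words, while a line with no words from either list comes back as "unknown".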

12 of 23

But...

You can really train a classifier to use anything.

13 of 23

The trick...

Is finding the combination of approaches that produces the most accurate classification.

14 of 23

I used...

  1. Individual words
  2. Trigrams -- groups of three words
  3. Parts of speech

15 of 23

import nltk

# Note: nltk.tag.simplify exists in NLTK 2.x only; it was removed in NLTK 3.
from nltk.tag.simplify import simplify_wsj_tag

def get_features(text):
    words = []
    # Break the text into sentences, then each sentence into words
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = words + nltk.word_tokenize(sentence)

    # Part-of-speech tag each of the words
    pos = nltk.pos_tag(words)

    # Sometimes it's helpful to simplify the tags NLTK returns by default.
    pos = [simplify_wsj_tag(tag) for word, tag in pos]

    # Then, convert the words to lowercase
    words = [i.lower() for i in words]

    # Grab the trigrams
    trigrams = nltk.trigrams(words)

    # We need to concatenate each trigram into a single string to process
    trigrams = ["%s/%s/%s" % (i[0], i[1], i[2]) for i in trigrams]

    # Combine the words, tags and trigrams into one list
    features = words + pos + trigrams

    # Get our feature dict rolling
    features = dict([(i, True) for i in features])
    return features

A function to grab the features I want

16 of 23

In:

Stir the lentils into the cooked kale

Out:

{
    'lentils': True,
    'the/cooked/kale': True,
    'kale': True,
    'P': True,
    'ADJ': True,
    'into': True,
    'DET': True,
    'lentils/into/the': True,
    'N': True,
    'cooked': True,
    'into/the/cooked': True,
    'NP': True,
    'the/lentils/into': True,
    'the': True,
    'stir': True,
    'stir/the/lentils': True
}

17 of 23

Pull it all together

import nltk
from nltk.classify import MaxentClassifier

txt = "Stir the lentils into the cooked kale"
feats = get_features(txt)

# train() expects a list of (features, label) pairs
classifier = MaxentClassifier.train([(feats, "Step")])
classifier.classify(feats)
>>> "Step"

18 of 23

19 of 23

Structured data

20 of 23

This approach works great...

When you have consistent fields that are (at least) somewhat well defined.

21 of 23

22 of 23

23 of 23

More info:

lat.ms/nltk

LA Times internships: latimes.com/interns

Slides: lat.ms/nlpslides

@anthonyjpesce / Los Angeles Times