1 of 14

FIRE: Food Image to REcipe generation

Prateek Chhikara, Dhiraj Chaurasia, Yifan Jiang, Omkar Masur, and Filip Ilievski

{pchhikar, dchauras, yjiang44, omasur}@usc.edu, f.ilievski@vu.nl

2 of 14

Background

  • Food and Identity: Food shapes cultural identity and reflects individuality, influencing social connections.

  • Social Media Impact: Social media’s prevalence underscores food’s societal importance, with millions of posts tagged as #food and #foodie.

  • Ambitious Food Computing: Food computing aims to generate recipes from images, bridging computer vision and natural language processing for personalized recommendations and automated cooking.

3 of 14

Goal

  • To create an efficient way to generate recipes from food images by leveraging the capabilities of state-of-the-art (SotA) models.

  • To leverage LLMs to extend the recipe generation process to advanced applications.

4 of 14

Contributions

  1. A Vision Transformer (ViT) that produces expressive embeddings from food images, paired with an attention-based decoder that extracts the recipe's ingredients.

  2. An end-to-end pipeline for generating recipe titles and cooking instructions, utilizing SotA vision (BLIP) and language (T5) models, respectively.

  3. A showcase of FIRE's ability to support two novel food computing applications, through integration with few-shot prompting of large LMs:
    1. Recipe Customization
    2. Recipe-to-Machine-Code Generation
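The few-shot prompting step can be sketched as plain prompt assembly. This is a minimal illustration only: the example pair, field labels, and template below are assumptions for the sketch, not the actual prompts used by FIRE.

```python
# Sketch of few-shot prompt construction for recipe customization.
# The example pair and template are illustrative assumptions; the
# actual FIRE prompts are not reproduced here.
FEW_SHOT_EXAMPLES = [
    {
        "recipe": "Pancakes: mix flour, milk, eggs; fry in butter.",
        "request": "make it vegan",
        "customized": "Pancakes: mix flour, soy milk, mashed banana; fry in oil.",
    },
]

def build_customization_prompt(recipe: str, request: str) -> str:
    """Assemble a few-shot prompt for a large LM."""
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Recipe: {ex['recipe']}\n"
            f"Request: {ex['request']}\n"
            f"Customized recipe: {ex['customized']}\n"
        )
    parts.append(f"Recipe: {recipe}\nRequest: {request}\nCustomized recipe:")
    return "\n".join(parts)

prompt = build_customization_prompt(
    "Omelette: whisk eggs, add cheese, cook in pan.", "make it dairy-free"
)
# `prompt` would then be sent to an LLM completion endpoint.
```

The same pattern serves recipe-to-machine-code generation: only the few-shot examples and the final instruction change.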

5 of 14

Proposed Methodology

Proposed architecture to extract ingredients and generate the recipe title and cooking instructions from a food image. (Ingredient quantities are provided during training only.)

1. Title Generation: We use a BLIP model fine-tuned on 10% of the Recipe1M dataset.

2. Ingredient Extraction: Features are extracted from the input food image using a Vision Transformer (ViT); the image embeddings are then passed through an attention-based ingredient decoder.
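The encoder-plus-ingredient-decoder stage can be sketched in PyTorch. The toy dimensions, the learned-query design, and the random stand-in for the ViT output are all assumptions made for illustration; this is not the FIRE implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration; real ViT embeddings are e.g. 768-d
# over ~196 patches, with a vocabulary of ~1,500 ingredients.
D_MODEL, N_PATCHES, VOCAB, N_QUERIES = 64, 16, 100, 8

class IngredientDecoder(nn.Module):
    """Attention-based decoder: learned queries cross-attend over image
    patch embeddings; each query emits logits over the ingredient vocab."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_QUERIES, D_MODEL))
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # Expand the shared queries to the batch size, decode, project.
        q = self.queries.unsqueeze(0).expand(patch_emb.size(0), -1, -1)
        return self.head(self.decoder(q, patch_emb))

# Random stand-in for ViT patch embeddings: (batch, patches, d_model).
patch_emb = torch.randn(1, N_PATCHES, D_MODEL)
logits = IngredientDecoder()(patch_emb)   # (1, N_QUERIES, VOCAB)
pred_ids = logits.argmax(-1)              # one ingredient id per query
```

The design choice here (a fixed set of learned queries decoded in parallel) treats ingredient extraction as set prediction rather than sequence generation, which matches the unordered nature of ingredient lists.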

6 of 14

Proposed Methodology (... continued)

3. Cooking Instruction Generation: We fine-tune a T5 model on recipe titles and ingredients. At test time, the title and ingredients produced by the previous two stages are passed to the fine-tuned T5 model.
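The slide does not show how the two upstream outputs are serialized for T5; the `title: ... ingredients: ...` format below is an assumed sketch of how they could be combined into a single conditioning string.

```python
def build_t5_input(title: str, ingredients: list[str]) -> str:
    """Serialize the predicted title and ingredient set into one string
    that a fine-tuned T5 can condition on. The field prefixes are an
    assumed format, not necessarily the one used by FIRE."""
    return f"title: {title} ingredients: {', '.join(ingredients)}"

src = build_t5_input("Pav Bhaji", ["potato", "tomato", "butter", "pav buns"])
# At train time, the target sequence would be the gold cooking
# instructions; at test time, `src` is fed to the fine-tuned model,
# e.g. (hypothetical call):
#   model.generate(**tokenizer(src, return_tensors="pt"))
```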

7 of 14

Experiment Setup

  1. Datasets
    • Recipe1M: 259,932 training, 55,773 validation, and 56,029 testing recipes
  2. Baselines
    • Ingredient Extraction: RI2L (retrieval-based), RI2LR (retrieval-based), FFTD, and InverseCooking
    • Cooking Instruction Generation: InverseCooking and ChefTransformer
  3. Evaluation Metrics
    • Ingredient Extraction: IoU and F1
    • Cooking Instruction Generation: SacreBLEU and ROUGE-L
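Since ingredient extraction is evaluated as set prediction, the IoU and F1 metrics reduce to set overlap. A minimal sketch:

```python
def iou(pred: set, gold: set) -> float:
    """Intersection over union of predicted and gold ingredient sets."""
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def f1(pred: set, gold: set) -> float:
    """Harmonic mean of precision and recall over ingredient sets."""
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)  # precision
    r = len(pred & gold) / len(gold)  # recall
    return 2 * p * r / (p + r) if p + r else 0.0

pred = {"flour", "sugar", "egg"}
gold = {"flour", "sugar", "butter"}
print(iou(pred, gold))  # 2 shared / 4 in union = 0.5
print(f1(pred, gold))   # p = r = 2/3, so F1 = 2/3
```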

8 of 14

Results

End-to-end scores

9 of 14

Ablation Study

  1. Ingredient Extraction (Table 2)
  2. Image Feature Extraction (Table 3)
  3. Zero-shot vs Fine-tuned (Table 4)

10 of 14

Case Study

  • FIRE is often able to generate a correct recipe for dishes similar to those present in the Recipe1M dataset.

  • For Pav Bhaji (a popular Indian dish not present in Recipe1M), it produced a recipe unrelated to the intended dish.

  • Conventional evaluation metrics such as SacreBLEU and ROUGE fail to capture the accuracy of the generated recipes and to detect certain text hallucinations.

11 of 14

FIRE Applications

12 of 14

FIRE Applications – Analysis

We conducted a human evaluation with seven experts, covering 10 recipes and their customizations.

1. Recipe Customization

  • Evaluators rated four attributes (efficacy, coherence, soundness, and proportions and measurements) on a 0 to 4 scale (0: strongly disagree, 1: disagree, 2: neutral, 3: agree, 4: strongly agree).
  • On average, each attribute scores highly, between 3.5 and 3.76, with high Fleiss' kappa inter-annotator agreement (0.78 to 0.92).

2. Recipe-to-Machine-Code Generation

  • A similar human evaluation focused on how well ingredients, cooking instructions, and their descriptions are translated into code, on a scale of 0 (extremely poor) to 5 (excellent).
  • Each property is rated on average between 4.27 and 4.47, with inter-annotator agreement between 0.75 and 0.83.
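The agreement statistic above can be computed with Fleiss' kappa over a subjects-by-categories count matrix. A minimal sketch, using illustrative ratings rather than the actual study data:

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa from a subjects-by-categories matrix, where
    counts[i][j] is how many raters assigned subject i to category j.
    Assumes the same number of raters for every subject."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters
    # Mean per-subject agreement P-bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement P_e from the category marginals.
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Illustrative data only: 3 items, 7 raters, 3 rating categories.
ratings = [[7, 0, 0], [0, 7, 0], [1, 6, 0]]
kappa = fleiss_kappa(ratings)  # near-perfect agreement, kappa ~ 0.80
```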

13 of 14

Future Work

  1. Conventional metrics are insufficient to capture language grounding. Can we develop a metric that effectively captures the coherence and plausibility of generated recipes?

  2. The diversity and availability of recipes depend heavily on location, climate, and religion, which can prevent users from preparing food based on predefined recipes. Could injecting knowledge graphs inform the models about alternative ingredients?

  3. Hallucination remains a critical challenge in recipe generation with natural language and vision models. Incorporating methods for state tracking of participants could enhance the production of reasonable and accurate results.

14 of 14

Thanks

Reach out for questions at pchhikar@usc.edu or connect at