# | Model Name | About | Year Built | Number of Parameters | Creator | Limitations | Training Time | Model Definition | Dataset Used for Training | Hardware Configuration
---|---|---|---|---|---|---|---|---|---|---
2 | ALICE | Chatbot | 1995 | N/A | Richard Wallace | Limited to scripted responses, lacks general knowledge, can be prone to repeating itself | N/A (rule-based, not corpus-trained) | ALICE (Artificial Linguistic Internet Computer Entity) is a chatbot framework that uses AIML (Artificial Intelligence Markup Language) for natural language processing and rule-based responses. Its key features include the ability to engage in conversation, answer questions, and provide information. Its limitations include the need for manual programming of rules, which can be time-consuming and may not cover all possible scenarios, and a lack of ability to learn from new data. | Hand-authored AIML rule files rather than a training corpus | Not specified
3 | AWD-LSTM | LSTM-based language model | 2017 | 33 million (base) | Salesforce Research | Sequential processing limits parallelism; weaker long-range context than attention-based models | 12 hours (on WikiText) | AWD-LSTM (ASGD Weight-Dropped LSTM) is a language model architecture that uses LSTM (Long Short-Term Memory) cells with weight dropout and activation dropout to prevent overfitting. Its key feature is improved training on small datasets. Its limitations include the potential for slower training and the need for more memory compared to simpler models. | Penn Treebank and WikiText-2 | Not specified
4 | GPT | Transformer-based LM | 2018 | 117M | OpenAI | Unidirectional (left-to-right) context only, limited context window, can generate nonsensical responses | 4 days | GPT (Generative Pre-trained Transformer) is a generative language model that uses the Transformer architecture with self-attention to generate text. Its key feature is the ability to generate coherent and diverse text, but its limitation is that it may generate biased or inappropriate text. | BooksCorpus | 8 NVIDIA V100 GPUs, 128GB of GPU memory, 40 CPU cores
5 | BERT | Transformer-based LM | 2018 | 110M | Google | Computationally expensive to pre-train, fixed 512-token context, not suited for generative tasks | 4 days | Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based language model designed to pre-train deep bidirectional representations from a large unlabeled text corpus, which can then be fine-tuned for a variety of NLP tasks such as question-answering and sentiment analysis (see the masked-token prediction sketch after this table). Limitations include its large size and computational requirements for pre-training. | BooksCorpus, English Wikipedia | 64 Cloud TPUs
6 | AdaNet | AutoML framework for learning neural-network ensembles | 2018 | varies | Google | Lack of interpretability; high computational cost; result quality depends on the candidate architectures searched | 2 hours | AdaNet is a neural architecture search algorithm that automatically designs high-quality models by iteratively learning from previous architectures. It trains a custom ensemble of neural networks, each of which is trained with a different architecture and dataset weighting. Its limitations include its high computational and time requirements, and its dependency on the quality and quantity of training data. | CIFAR-10 | 1 Google Cloud TPU
7 | QuickThought | Encoder-based sentence embedding model | 2018 | varies | University of Michigan, Google Brain | Limited to encoding sentences, not suitable for generation tasks | 1 week | QuickThought is a language model architecture that uses unsupervised learning to generate sentence embeddings. Its key feature is the ability to learn from large amounts of unstructured data without the need for labeled data. Its limitations include the lack of control over the embeddings generated, which may not be optimal for specific downstream tasks. | BookCorpus | Not specified
8 | ELMo | Contextual LM | 2018 | 94 million | Allen Institute for AI | Computationally expensive, not suitable for generation tasks, lack of interpretability | 3 days | ELMo (Embeddings from Language Models) is a deep contextualized word representation model that uses a bi-directional language model to generate word embeddings that capture contextual information. | One Billion Word Benchmark | 16 NVIDIA P100 GPUs
9 | LASER | Encoder-based LM | 2018 | varies | Facebook AI Research (FAIR) | Limited to encoding sentences, not suitable for generation tasks | 10 days | LASER (Language-Agnostic SEntence Representations) is a multilingual sentence embedding model that is pre-trained on various languages. | Parallel corpora covering 93 languages (e.g., Europarl, United Nations, OpenSubtitles, Tatoeba) | 8 NVIDIA V100 GPUs, 192GB of GPU memory, 40 CPU cores
10 | COCOA | Encoder-based LM | 2018 | varies | Carnegie Mellon University | Limited to sentence classification tasks, requires high computational power | 1 week | COCOA (Conversational Contextual Cues for Online Abuse) is a language model designed to detect and flag online abuse in conversations. Its key feature is the ability to analyze the context of conversations and detect abusive language. Its limitations include the potential for false positives or false negatives, and the need for ongoing training and updating to adapt to evolving language patterns and new types of abuse. | COCOA dataset | 8 NVIDIA V100 GPUs, 128GB of GPU memory, 40 CPU cores | |||||||
11 | GPT-2 | Transformer-based LM | 2019 | 1.5B | OpenAI | Computationally expensive, prone to generating biased or offensive language, lacks control over generated text | 4 days (on WebText) | A transformer-based language model that is capable of generating high-quality natural language text. Key features include a large number of parameters (up to 1.5 billion), multi-head attention, and transformer blocks. Its limitations include the potential for bias and ethical concerns in generating text (see the text-generation sketch after this table). | WebText (about 40 GB of text scraped from outbound Reddit links) | 8 Nvidia V100 GPUs
12 | XLNet | Transformer-based LM | 2019 | 340M | Google Brain, Carnegie Mellon University | Computationally expensive, memory-intensive, not suited for generative tasks | 5.5 days | XLNet is a language model that uses a permutation-based approach to train on all possible combinations of the input data, allowing it to capture bidirectional relationships while avoiding the drawbacks of standard bidirectional models. Its key feature is its ability to capture complex relationships in the input data, but its limitation is that it can be computationally expensive. | BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl | 512 TPU v3 chips
13 | RoBERTa | Transformer-based LM | 2019 | 355M | Facebook AI Research | Computationally expensive to pre-train, not suitable for generation tasks | About 1 day (on 1,024 GPUs) | RoBERTa (Robustly Optimized BERT Pretraining Approach) is a language model that uses the BERT architecture with additional optimization techniques (longer training, larger batches, dynamic masking, no next-sentence prediction), resulting in improved performance on various NLP tasks. Its key feature is its robustness and improved performance on certain tasks compared to BERT, but its limitation is that it can be computationally expensive. | BooksCorpus, English Wikipedia, CC-News, OpenWebText, and Stories (about 160 GB of text) | 1,024 NVIDIA V100 GPUs
14 | ALBERT | Transformer-based LM | 2019 | 12M (base) to 235M (xxlarge) | Google Research | Computationally expensive at the largest sizes, not suitable for generation tasks | 3.5 days | A Lite BERT (ALBERT) is a lightweight version of BERT that uses parameter sharing and factorized embeddings to reduce the model size and improve training efficiency while achieving similar or better performance than BERT. Limitations include longer training times for the largest configurations and potential trade-offs between model size and performance. | BooksCorpus and English Wikipedia | Cloud TPU v3 (64 to 512 TPUs, depending on model size)
15 | CTRL | Transformer-based LM | 2019 | 1.63B | Salesforce Research | Computationally expensive; requires control codes to steer generation | About 2 weeks | CTRL (Conditional Transformer Language Model) is a text generation model that can be conditioned on control codes specifying a domain, style, or task. | 140 GB of text spanning Wikipedia, Project Gutenberg, Amazon reviews, Reddit, and news data | 256 Cloud TPU v3 cores
16 | UniLM | Transformer-based LM | 2019 | 340M | Microsoft | Computationally expensive, requires large amounts of pre-training data | 3 days (on Wiki) | UniLM (Unified Language Model) is a language model that can perform both language generation and comprehension tasks by pre-training on both masked language modeling and sequence-to-sequence tasks. Its key feature is its ability to perform both tasks with high accuracy, but its limitation is that it requires a large amount of training data and computational resources. | English Wikipedia and BookCorpus | 16 V100 GPUs
17 | ERNIE | Transformer-based LM | 2019 | 340M | Baidu | Limited to Chinese language | 2.5 days | ERNIE (Enhanced Representation through kNowledge IntEgration) is a pre-trained language model that integrates knowledge graphs into its training to enhance its understanding of concepts. | Chinese Wikipedia, Baidu Baike, and BookCorpus (Chinese) | 8 Nvidia V100 GPUs | |||||||
18 | Flair | Non-Transformer-based LM | 2019 | 134M | Zalando Research | Limited to NLP tasks | not mentioned in the research paper | Flair is a language model that combines traditional word embeddings with contextual embeddings to improve performance on NLP tasks. Its key feature is its ability to capture complex relationships and context in language data, but its limitation is that it may not perform as well on very large datasets. | WikiText-2, IMDb, and English Wikipedia | Not Specified | |||||||
19 | DistilBERT | Transformer-based LM | 2019 | 66M | Hugging Face | Reduced accuracy compared to BERT | About 90 hours | DistilBERT is a distilled version of BERT that has a smaller model size and faster inference time while maintaining similar performance on various NLP tasks. Limitations include a trade-off between model size and performance, with a slight decrease in accuracy compared to BERT. | BooksCorpus and English Wikipedia | 8 NVIDIA V100 GPUs (16 GB)
20 | XLM-RoBERTa | Transformer-based LM | 2019 | 550M | Facebook AI Research | Limited to NLP tasks; very large pre-training cost | 4 days | XLM-RoBERTa is a cross-lingual language model that is pre-trained on a large multilingual corpus and fine-tuned for specific tasks. | CommonCrawl data (CC-100) covering 100 languages | 500 NVIDIA V100 GPUs (32 GB)
21 | MASS | Transformer-based LM | 2019 | 110 million | Microsoft Research Asia | Limited to machine translation and text generation | 3 days | MASS (Masked Sequence-to-Sequence Pre-training) is a sequence-to-sequence pre-training approach that uses masked tokens to predict target sequences. | BookCorpus and English Wikipedia | 64 NVIDIA V100 GPUs | |||||||
22 | SpanBERT | Transformer-based LM | 2019 | 108 million | Facebook AI Research | Limited to NLP tasks | 4 days | A language model developed by researchers at Facebook AI Research and the University of Washington that extends the BERT architecture by masking and predicting contiguous spans of text to better capture span-level relationships. It has shown strong performance on several NLP benchmarks, including coreference resolution and question answering. Limitations: SpanBERT's performance is limited by the quality and size of the training data it has been exposed to, and it may not generalize well to tasks outside of its training data. | BookCorpus and English Wikipedia | 32 NVIDIA V100 GPUs
23 | MobileBERT | Transformer-based LM | 2019 | 25 million | Carnegie Mellon University, Google Brain | Reduced accuracy compared to BERT | 4 days | A compact version of BERT optimized for mobile and embedded devices. It uses bottleneck structures and knowledge transfer from a larger teacher model (IB-BERT) to retain accuracy at a much smaller size. Limitations include reduced performance compared to full-size BERT, as well as potential accuracy trade-offs due to its smaller size. | BooksCorpus and English Wikipedia | 8 TPUs and 2 GPUs
24 | Funnel Transformer | Transformer-based LM | 2019 | 175 million | Carnegie Mellon University, Google Brain | Limited to NLP tasks | 3 days | A hierarchical architecture that reduces the computational and memory requirements of large-scale language models by progressively compressing (pooling) the sequence of hidden states, with an optional decoder to recover token-level representations. Limitations include added architectural complexity and potential issues with capturing fine-grained token-level information. | BooksCorpus and English Wikipedia | 64 NVIDIA V100 GPUs
25 | SuperGLUE | NLP benchmark suite (not a model) | 2019 | N/A (benchmark) | NYU, Facebook AI Research, University of Washington, DeepMind | Evaluates models rather than performing tasks itself; results depend on the models tested | N/A | A benchmark suite that evaluates the performance of NLP models on more challenging tasks than previous benchmarks such as GLUE. It includes a diverse set of tasks such as natural language inference, question answering, and coreference resolution. Limitations: SuperGLUE is a benchmark suite and not a language model in itself, and reported scores depend on the specific models evaluated against it. | Eight tasks, including BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC | N/A
26 | Transformer-XL | Transformer-based LM | 2019 | 257 million | Google Brain, Carnegie Mellon University | Limited to sequence modeling tasks | 2.5 days | Transformer-XL is a language model that uses the Transformer architecture with segment-level recurrence and relative positional encodings to overcome the limitations of standard Transformers in capturing long-term dependencies. Its key feature is its ability to capture longer-term dependencies in language data, but its limitation is that it can be computationally expensive. | Various, including WikiText-103 | 16 NVIDIA V100 GPUs
27 | BART | Transformer-based LM | 2019 | 406 million (large) | Facebook AI Research | Limited to NLP tasks | 3 days | A sequence-to-sequence model that combines a bidirectional encoder with an autoregressive decoder and is pre-trained as a denoising autoencoder. It can handle tasks such as text generation and summarization. Its limitations include high computational requirements and the potential for bias in text generation. | Pre-trained on the same 160 GB corpus as RoBERTa; fine-tuned on datasets such as CNN/Daily Mail and XSum | 16 NVIDIA V100 GPUs
28 | TAPAS | Transformer-based LM | 2019 | 220 million | Google Research | Limited to table-based question answering tasks | 1 day | A model designed for table-based question-answering tasks. Utilizes a table embedding layer and a table-specific attention mechanism to enable effective table reasoning. Limitations include the need for specific training data for table-based tasks, as well as potential issues with handling tables with complex structures. | Various, including WikiTables | 16 NVIDIA V100 GPUs
29 | XLM-R | Transformer-based LM | 2019 | 550 million (large) | Facebook AI Research | Limited to NLP tasks | 3 days | XLM-R (Cross-lingual Language Model - RoBERTa) is a language model that is pre-trained on multiple languages, allowing it to perform well on cross-lingual NLP tasks. Its key feature is its ability to perform well on multiple languages, but its limitation is that it can be computationally expensive. | CommonCrawl data (CC-100) covering 100 languages | 500 NVIDIA V100 GPUs
30 | GPT-3 | Transformer-based LM | 2020 | 175B | OpenAI | Very high inference cost; can produce biased, incorrect, or nonsensical text | not mentioned in the research paper | A large transformer-based language model with up to 175 billion parameters. It is capable of generating high-quality natural language text and can perform various language tasks such as language translation and question-answering, often from only a few examples given in the prompt (few-shot learning). Its limitations include the potential for bias and ethical concerns in generating text. | Filtered Common Crawl, WebText2, Books1, Books2, and English Wikipedia | Microsoft Azure cluster of NVIDIA V100 GPUs (exact configuration not disclosed)
31 | T5 | Transformer-based LM | 2020 | 11B | Google Research | Limited to text-to-text tasks | 4 days | A transformer-based encoder-decoder architecture that casts every task as text-to-text. It can perform tasks such as language translation, text summarization, and question-answering. Its limitations include the need for large amounts of training data and computation power. | C4 (Colossal Clean Crawled Corpus) | Cloud TPU v3 Pods
32 | ELECTRA | Transformer-based LM | 2020 | 110M | Google Research, Stanford University | Limited to NLP tasks | 4 days | ELECTRA is a pre-training approach that replaces a portion of the input tokens with plausible substitutes from a small generator network and trains a discriminator to distinguish original tokens from replacements, resulting in better efficiency and accuracy than BERT at the same compute budget. Limitations include increased complexity in the pre-training process. | Wikipedia and BooksCorpus | 8 TPUv3
33 | DeBERTa | Transformer-based LM | 2020 | 345M | Microsoft | Limited to NLP tasks | 7 days (large model) | DeBERTa (Decoding-enhanced BERT with disentangled attention) is a transformer-based language model that improves on BERT and RoBERTa with a disentangled attention mechanism (separate content and position encodings) and an enhanced mask decoder, improving performance on a wide range of NLP tasks. Limitations include longer training times and increased complexity compared to simpler pre-training approaches. | Various, including English Wikipedia and OpenWebText | 128 GPUs
34 | MiniLM | Transformer-based LM | 2020 | 50M | Microsoft | Reduced accuracy compared to BERT | 6 hours | MiniLM is a compact transformer model produced by deep self-attention distillation from a larger teacher model, reducing model size and improving efficiency while maintaining similar performance on various NLP tasks. Limitations include trade-offs between model size and performance and potential difficulties in adapting the model to specific tasks. | Various, including English Wikipedia and BookCorpus | 8 V100 GPUs
35 | Longformer | Transformer-based LM | 2020 | 400M | Allen Institute for AI | Limited to sequence modeling tasks | 1 hour per epoch | A variant of Transformer architecture that enables processing of long sequences (up to 4,096 tokens). It uses a combination of sliding window attention mechanism and global attention to reduce computation and memory requirements. Limitations include slower training and inference times than standard Transformers. | Various, including English Wikipedia and CommonCrawl | 8 V100 GPUs | |||||||
36 | MBART | Transformer-based LM | 2020 | 610M | Facebook AI Research | Limited to machine translation tasks | About 2.5 weeks | A multilingual variant of BART pre-trained with a denoising objective on monolingual corpora in 25 languages and then fine-tuned for translation. It uses a shared encoder-decoder architecture across languages. Limitations include reduced performance on certain language pairs and potential issues with maintaining language-specific nuances during translation. | CC25: Common Crawl data covering 25 languages | 256 NVIDIA V100 GPUs
37 | Marian | Neural machine translation framework | 2020 | 56M | Microsoft | Not specified | Not specified | Marian is a fast, self-contained neural machine translation framework written in C++ that supports RNN- and Transformer-based sequence-to-sequence models with attention. Limitations include the need for GPUs to train larger models efficiently, as well as potential issues with handling very long and complex sentences. | Various, including JW300 and OpenSubtitles | Not specified
38 | ProphetNet | Transformer-based LM | 2020 | 554 million | Microsoft Research Asia | Limited to sequence modeling tasks | not mentioned in the research paper | A sequence-to-sequence pre-trained model that predicts the next n tokens simultaneously (future n-gram prediction) using an n-stream self-attention mechanism, which encourages planning for future tokens and improves long-term sequence modeling. Limitations include the need for large amounts of pre-training data to achieve optimal performance. | Various, including English Wikipedia and BooksCorpus | 32 V100 GPUs
39 | Performer | Transformer-based LM | 2020 | 37 million (base) | Google Research, DeepMind | Limited to sequence modeling tasks | 4.5 hours | A transformer-based model that approximates softmax attention with random features (FAVOR+), giving linear rather than quadratic time and memory complexity. It achieves strong performance on several benchmarks while being more computationally efficient than standard Transformers. Its main limitation is that the approximation may be less effective on tasks that require very precise attention patterns. | Various, including WMT16 and WMT18 news translation | 1 V100 GPU
40 | Meena | Conversational AI | 2020 | 2.6 billion | Google Research | Can produce unsafe or biased responses; very high training cost | 30 days | An end-to-end neural conversational model that uses an Evolved Transformer seq2seq architecture trained to minimize perplexity; it scores highly on the Sensibleness and Specificity Average (SSA) metric for open-domain dialogue. One limitation is that it requires a large amount of data and computational resources to train effectively. | 341 GB of filtered public social media conversations | TPU v3 Pod (2,048 cores)
41 | MASSIVE | Multilingual (Asian languages) | 2020 | 400 million | Microsoft Research Asia | Focuses mainly on Asian languages | not mentioned in research paper | A transformer-based model for multi-agent reinforcement learning that can handle large state and action spaces. It has achieved state-of-the-art performance on several benchmarks. Its main limitation is that it can be computationally expensive, especially when working with large state and action spaces. | English Wikipedia + BookCorpus | 1024-v100 GPUs | |||||||
42 | SemBERT | Semantic Understanding | 2020 | 340 million (BERT-large backbone) | Shanghai Jiao Tong University | Limited to English and Chinese languages | 40 hours | A model that augments BERT with explicit contextual semantics from semantic role labeling, enabling semantics-aware representations for fine-grained analysis of text. Limitations include the dependence on an external semantic role labeler, as well as potential issues with maintaining model interpretability. | English Wikipedia + BookCorpus | 64-v100 GPUs
43 | RAG | Retrieval-Augmented Generation | 2020 | 550 million (base) | Facebook AI Research | Limited to short-text generation tasks | 2 days | A transformer-based model that performs retrieval-augmented generation, which enables more precise and informative generation by incorporating relevant context from a large database of text. Its main limitation is that it can be computationally expensive, especially when working with large databases. | English Wikipedia + Common Crawl | 8-v100 GPUs (for fine-tuning), 64-v100 GPUs (for pre-training) | |||||||
44 | PEGASUS | Abstractive Text Summarization | 2020 | 568 million (large) | Google Research | Heavy computational resource requirements | 9.5 days | A transformer-based sequence-to-sequence model designed for text summarization tasks. It uses a pre-training objective called gap-sentences generation (GSG), in which important sentences are removed from a document and the model must generate them from the remaining text. Its limitations include the need for large amounts of training data and the potential for bias in text generation. | C4 and HugeNews | 32-v100 GPUs
45 | GShard-OLM | Cross-lingual language modeling | 2020 | 600 billion | Google | Requires large-scale distributed training | not mentioned in research paper | A large-scale model built with GShard, Google's system for scaling giant models via sparsely-gated mixture-of-experts layers and automatic sharding across accelerators; the 600-billion-parameter configuration was demonstrated on multilingual machine translation. Its limitation is that it requires massive computational resources for training. | Large multilingual text corpora | 2,048 TPU v3 cores (for training), 512 cores (for evaluation)
46 | X-Transformer | Cross-lingual language modeling | 2020 | 102 million (base) | Huawei Noah's Ark Lab | Requires specialized hardware for training | not mentioned in research paper | A language model that extends the Transformer architecture by introducing cross-attention between different layers of the model. It has shown strong performance on several NLP benchmarks, including machine translation and language understanding. | Large-scale text corpora | 512-v100 GPUs
47 | RoBERTa-large | Robustly Optimized BERT Pretraining Approach, Large | 2020 | 355M | Facebook AI Research (FAIR) | High computational requirements | 4 days | A larger version of RoBERTa, developed by Facebook AI, with 355 million parameters. It has been trained on massive amounts of data and has shown strong performance on several NLP benchmarks, including GLUE and SuperGLUE. Limitations: RoBERTa-large's large size makes it computationally expensive, and it may not be suitable for all applications due to its high memory requirements. | English Wikipedia + Common Crawl | 8-v100 GPUs | |||||||
48 | ERNIE 2.0 | Enhanced Representation through Knowledge Integration, version 2.0 | 2020 | 2.8B | Baidu | Primarily targets Chinese, with some English models | 1 day | A continual pre-training framework that incrementally introduces and learns from multiple pre-training tasks (lexical, syntactic, and semantic) to improve representation learning. It can handle various natural language tasks such as sentiment analysis and named entity recognition. Its limitations include the need for large amounts of training data and compute. | Large-scale Chinese corpora + English Wikipedia + BookCorpus | 128-v100 GPUs
49 | SEQUENCE-TO-SEQUENCE Transformer (S2S-T5) | Sequence-to-sequence generation | 2020 | 11B | Microsoft | Requires large-scale distributed training | not mentioned in research paper | A transformer-based architecture specifically designed for sequence-to-sequence tasks, such as machine translation and summarization. It achieves state-of-the-art performance on several benchmarks and can be fine-tuned for specific tasks. Limitations include high computational requirements and large model size. | English Wikipedia + CC-News + OpenWebText + StoriesCorpus | 64-v100 GPUs (for pre-training), 1-v100 GPU (for fine-tuning) | |||||||
50 | MT5 | Multilingual T5 | 2020 | 13B (mT5-XXL) | Google Research | Heavy computational resource requirements | not mentioned in research paper | A multilingual version of T5 covering 101 languages. Its key features include a shared text-to-text encoder-decoder architecture and a single vocabulary across languages. Its limitations include the need for large amounts of multilingual training data. | mC4 (multilingual C4) dataset | 2048-v3 TPUs
51 | LaBSE | Language-agnostic BERT Sentence Embedding | 2020 | 247M | Google Research | Limited to sentence-level embeddings | not mentioned in research paper | Language-agnostic BERT Sentence Embedding (LaBSE) is a cross-lingual model that produces high-quality sentence embeddings for 109 languages, allowing for language-agnostic text analysis. Limitations include potential difficulties in handling low-resource languages and the need for large amounts of pre-training data. | Monolingual data (CommonCrawl, Wikipedia) and mined bilingual translation pairs | Not specified
52 | Wu Dao 2.0 | Chinese-centric multimodal language processing | 2021 | 1.75T | Beijing Academy of Artificial Intelligence (BAAI) | High computational resource requirements; limited public access | not mentioned in research paper | A large-scale multimodal, multilingual pre-trained model developed by BAAI, reported to use a mixture-of-experts architecture with 1.75 trillion parameters and to handle both text and image tasks. Limitations include very high training cost and limited accessibility to researchers outside China. | About 4.9 TB of Chinese and English text and image data | 1,024-Ascend 910 AI processors
53 | GShard | Cross-lingual language modeling | 2021 | 600B | Google | Requires large-scale distributed training | not mentioned in research paper | GShard is Google's approach to scaling giant models, combining sparsely-gated mixture-of-experts layers with automatic sharding of model weights across accelerators. It was demonstrated on a 600-billion-parameter multilingual machine translation model and achieves strong quality at lower training cost than comparable dense models. | Large multilingual text corpora | 2,048 TPU v3 cores (for training), 512 cores (for evaluation)
54 | Swin Transformer | Image processing | 2021 | 197 million (Swin-L) | Microsoft Research Asia | Limited to image processing tasks | not mentioned in research paper | A hierarchical vision Transformer that computes self-attention within shifted local windows, giving linear computational complexity with respect to image size while still allowing cross-window connections. Its limitation is that window-based attention can restrict very long-range interactions within a single layer. | ImageNet, COCO, and various object detection datasets | 1024-v100 GPUs
55 | Speech-CLIP | Speech recognition and vision | 2021 | 400 million | Facebook AI Research | Limited to audio and visual processing | not mentioned in research paper | A model that combines vision and audio information to enable tasks like audio-visual event localization and sound separation. It uses contrastive learning to learn audio and visual representations jointly. One limitation is that it requires a large amount of data to train effectively. | Various audio datasets | - | |||||||
56 | Turing NLG | Transformer-based LM for natural language generation | 2020 | 17 billion | Microsoft Research | Very large compute requirements; can produce incoherent or off-topic text | not mentioned | A generative language model designed for text completion, question answering, and summarization, trained with the DeepSpeed library and ZeRO optimizer to make training at this scale practical. Its limitation is that it may produce outputs that lack coherence and context. | Web text similar to that used for Megatron-LM (details not fully disclosed) | 24 V100 GPUs
57 | BlenderBot | Conversational LLM for chatbots and dialog systems | 2021 | 2.7 billion and 9.4 billion variants | Facebook AI Research | Requires a large amount of computational resources for training; can generate inappropriate or factually incorrect responses | 3 weeks | A transformer-based chatbot developed by Facebook AI Research that can engage in open-domain conversations with humans. It uses a large-scale generative model trained on a diverse set of conversation topics and blended skills (personality, empathy, knowledge). Its limitations include the potential for generating inappropriate or biased responses. | A combination of existing conversational and social media datasets (e.g., pushshift Reddit, Blended Skill Talk) | Not specified
58 | Evolved Transformer | Transformer architecture discovered via neural architecture search | 2021 | 2 billion (base) | Google Brain | Not suitable for small-scale or low-resource environments | not mentioned | A Transformer variant whose cell structure was found by evolutionary neural architecture search, optimizing the architecture for sequence-to-sequence tasks. Its limitation is that the search process is computationally expensive. | C4, CC-News, OpenWebText | 512-v100 GPUs
59 | PLATO | Large-scale LLM for dialogue generation | 2021 | 34 billion (base) | Baidu | Limited to generating human-like responses in a few specific domains | not mentioned | A large-scale pre-trained model for multi-turn dialogue generation that incorporates knowledge distillation and a hierarchical Transformer architecture. Its limitation is that it requires a significant amount of pre-training data. | A large-scale proprietary dataset of diverse user interactions | 512-v100 GPUs (for pre-training)
60 | AdaLM | Multilingual LLM for natural language processing | 2021 | 1.3 billion (base) | RWTH Aachen University | Not explicitly designed for any specific task | 7 days | A large-scale adaptive language model that dynamically adjusts its parameters based on the input sequence. Its limitation is that it requires a large amount of computation for training and inference. | Proprietary dataset of user-generated text | 256-v100 GPUs | |||||||
61 | GPT-Neo | LLM designed for natural language processing and generation | 2021 | 2.7 billion (base) | EleutherAI | Can exhibit gender and racial bias | not mentioned | GPT-Neo is an open-source, GPT-style autoregressive transformer developed by EleutherAI as a replication of the GPT-3 architecture (with alternating local and global attention layers), trained on a large and diverse dataset. Limitations include potential difficulties in adapting to specific domains and a capability gap compared with much larger proprietary models. | The Pile (about 825 GB of diverse text) | TPU v3 pods (trained with Mesh TensorFlow)
62 | Codex | LLM designed for programming language processing | 2021 | 12 billion (largest evaluated) | OpenAI | Limited to a specific task of programming language processing | not mentioned | Codex is a GPT-based language model developed by OpenAI that is fine-tuned on code, enabling it to perform programming tasks such as code generation and completion; it powers GitHub Copilot. Limitations include a limited range of well-supported programming languages and difficulty with complex, multi-step specifications. | Public code from GitHub repositories | Not specified
63 | MASS-T5 | Multilingual LLM for natural language processing and generation | 2021 | 14.5 billion (base) | Microsoft Research Asia | Limited to the T5 architecture and focused on English | not mentioned | A large-scale language model based on the transformer architecture that has been trained on a diverse range of tasks, including language modeling, translation, and summarization. It has been shown to perform well on a wide range of natural language processing tasks. | C4 dataset | 1024-v100 GPUs | |||||||
64 | BigBird | LLM designed for long-form content processing | 2021 | 1.3B | Google Research | Requires high computational resources and time for training | not mentioned | BigBird is a transformer-based model that uses a sparse attention mechanism combining local windowed, global, and random attention, allowing it to process sequences of up to 4,096 tokens while keeping computation close to linear in sequence length. | English Wikipedia, GigaWord, ClueWeb and Common Crawl | 512-v100 GPUs
65 | DALL-E | LLM designed for image generation | 2021 | 12 billion | OpenAI | Limited to generating images at 256x256 pixels | not mentioned | A large-scale generative model developed by OpenAI that creates images from textual descriptions. It uses a GPT-style transformer to autoregressively model text tokens and image tokens (produced by a discrete VAE), with CLIP used to rerank the generated images. Its limitations include the potential for generating inappropriate or biased images. | About 250 million text-image pairs collected from the internet | Not specified
66 | GShard-XL | Large-scale LLM with cross-lingual pre-training | 2021 | 6.9B | Google | Tied to the GShard training infrastructure; requires specialized hardware | not mentioned | A transformer-based model that allows for training on extremely large datasets by splitting the model parameters across multiple devices. The different versions correspond to the number of parameters and the size of the dataset used for training. Their main limitation is that they require specialized infrastructure to train and use effectively. | Large-scale text corpora | 8,192-cores (for training), 2,048-cores (for evaluation)
67 | UniLMv2 | LLM for natural language processing and generation | 2022 | 340M | Microsoft | Limited to the UniLM architecture and focused on English | not mentioned | An extension of the UniLM model that incorporates pre-training on multiple tasks to improve performance on downstream tasks. It has achieved state-of-the-art performance on several benchmarks. One limitation is that it may require more computational resources to train and use than other models. | English Wikipedia, BooksCorpus, OpenWebText, STS-B, SQuAD and a QA dataset | 16-v100 GPUs, 16-TPUs, or 8-GPUs | |||||||
68 | GShard-3B | Large-scale LLM with unsupervised training | 2022 | 3B | Google | Requires specialized infrastructure for training and inference | not mentioned | A large-scale multilingual model with 3 billion parameters trained by Google on its GShard distributed training system. It can handle 499 different languages and shows state-of-the-art performance on several NLP tasks. Limitations: Due to its large size, GShard-3B is computationally expensive and requires specialized hardware for training and inference. | Large-scale text corpora | 2,048-cores (for training), 512-cores (for evaluation)
69 | GShard-13B | Large-scale LLM with unsupervised training | 2022 | 13B | Google | Requires specialized infrastructure for training and inference | not mentioned | A further extension of GShard-3B, with 13 billion parameters, making it one of the largest language models available. It can handle even more languages than GShard-3B and has shown strong performance on various NLP benchmarks. Limitations: As with GShard-3B, GShard-13B requires significant computational resources and specialized hardware for training and inference. | Large-scale text corpora | 16,384-cores (for training), 4,096-cores (for evaluation)
70 | Chinchilla | LLM designed for natural language processing | 2022 | 70 billion | DeepMind | Not publicly released; still requires substantial compute for training and inference | not mentioned | A compute-optimal language model from DeepMind: for a fixed compute budget, model size and the number of training tokens are scaled in roughly equal proportion, so Chinchilla (70B parameters, about 1.4 trillion training tokens) outperforms the much larger Gopher model on many benchmarks. Its limitations include closed access and the large amount of training data required. | MassiveText (web pages, books, news, code) | TPU v3/v4 pods
71 | PaLM | LLM designed for natural language processing | 2022 | 540 billion | Google Research | Very high compute cost; closed access; can produce biased or incorrect text | not mentioned | A 540-billion-parameter, decoder-only Transformer language model trained with Google's Pathways system. It shows strong few-shot performance on language understanding, reasoning, code, and multilingual tasks, with chain-of-thought prompting further improving reasoning. | 780 billion tokens of web pages, books, Wikipedia, news, source code, and social media conversations | Two Cloud TPU v4 Pods (6,144 TPU v4 chips)
72 | OPT (Open Pretrained Transformer) | LLM designed for natural language processing | 2022 | 175 billion | Meta | Limited flexibility for fine-tuning on specific tasks due to a fixed architecture and pre-defined set of parameters. | not mentioned | Open Pretrained Transformer (OPT) is a suite of decoder-only transformer language models (125M to 175B parameters) released by Meta AI to give researchers open access to a GPT-3-scale model, with weights and training logbooks made available. Limitations include the usual risks of large generative models, such as bias and toxicity, and very high inference cost at the largest size. | Various large-scale text corpora, including subsets of the Pile, BookCorpus, CC-News, and PushShift Reddit data | 992 NVIDIA A100 GPUs (80 GB)
73 | YaLM 100B | LLM for natural language processing and generation | 2022 | 100 billion | Yandex | Limited accessibility for many researchers and developers due to the high computational requirements and cost. | 65 days | Yet another Language Model with 100 billion parameters: a transformer-based language model trained on a large corpus of English and Russian text and released under an open license. It is capable of generating high-quality text in both languages. | About 1.7 TB of text, including The Pile and Russian-language web and news sources | 800 NVIDIA A100 GPUs
74 | Minerva | LLM specialized for quantitative and mathematical reasoning | 2022 | 540 billion | Google Research | Limited interpretability of its step-by-step solutions; answers can look plausible but be wrong. | not mentioned | A language model based on PaLM that is further trained on scientific papers and web pages containing mathematical expressions, enabling it to generate step-by-step solutions to quantitative reasoning problems in mathematics, physics, and related fields. | arXiv papers and math-heavy web pages, on top of PaLM's pre-training data | Not specified
75 | BLOOM | LLM designed for natural language processing | 2022 | 176 billion | BigScience collaboration (coordinated by Hugging Face) | Very high compute and memory requirements for inference; coverage limited to the languages in its training data | not mentioned | An open-access, 176-billion-parameter multilingual language model trained by the BigScience research collaboration. It is a decoder-only transformer that performs well on a wide range of natural language processing and generation tasks across 46 natural languages and 13 programming languages. | ROOTS corpus (about 1.6 TB of text in 46 natural languages and 13 programming languages) | Jean Zay supercomputer (384 NVIDIA A100 80 GB GPUs)
76 | AlexaTM (Teacher Models) | LLM designed for natural language processing | 2022 | 20 billion | Amazon | Limited generalizability to tasks beyond the scope of the specific domains they are trained on. | not mentioned | AlexaTM 20B is a multilingual sequence-to-sequence (encoder-decoder) language model pre-trained with a mix of denoising and causal language modeling objectives. It shows strong few-shot performance on tasks such as summarization and machine translation, and can be fine-tuned for specific tasks. | Wikipedia and mC4 data covering 12 languages | Not specified
77 | LLaMA (Large Language Model Meta AI) | LLM designed for natural language processing and generation | 2023 | 65 billion (largest of a 7B-65B family) | Meta | Released under a non-commercial research license; can produce biased or factually incorrect text | not mentioned | LLaMA is a family of foundation language models (7B to 65B parameters) from Meta AI, trained exclusively on publicly available data. The 13B model outperforms GPT-3 (175B) on many benchmarks, and the 65B model is competitive with Chinchilla-70B and PaLM-540B. | About 1.4 trillion tokens from CommonCrawl, C4, GitHub, Wikipedia, books, arXiv, and Stack Exchange | 2,048 NVIDIA A100 GPUs (80 GB)
78 | GPT-4 | LLM designed for natural language processing and generation | 2023 | Not disclosed | OpenAI | Parameter count, architecture, and training details are not disclosed; can still hallucinate and exhibit bias | not mentioned | A large multimodal model that accepts image and text inputs and produces text outputs, succeeding GPT-3.5. OpenAI has not published its parameter count, architecture details, or training compute. | Not disclosed | Not disclosed
79 | ERNIE Bot | Conversational LLM for chatbots and dialog systems | 2023 | Unknown | Baidu | Unknown | not mentioned | A chatbot model developed by Baidu, based on the ERNIE pretraining framework. It has been trained on large amounts of dialogue data and can engage in natural-sounding conversations with users. Limitations: ERNIE Bot's performance is limited by the quality of the training data it has been exposed to, and it may struggle with complex or abstract topics. | Not released yet | Not released yet | |||||||
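
The masked-token objective referenced in the BERT row above can be illustrated with a few lines of code. The snippet below is a minimal sketch, assuming the Hugging Face `transformers` and `torch` packages are installed; the `bert-base-uncased` checkpoint name is used purely for illustration and is not tied to any of the training setups listed in the table.

```python
# Minimal sketch: masked-token prediction with a pretrained BERT-style checkpoint.
# Assumes `pip install torch transformers`; "bert-base-uncased" is illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # typically prints "paris"
```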
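Similarly, the autoregressive generation described in the GPT and GPT-2 rows can be sketched as follows. This assumes the same `transformers` installation; the public `gpt2` checkpoint is the 124M-parameter release rather than the 1.5B model listed in the table, and greedy decoding is used only to keep the example deterministic.

```python
# Minimal sketch: autoregressive text generation with a pretrained GPT-2 checkpoint.
# Assumes `pip install torch transformers`; "gpt2" is the public 124M-parameter release.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation (greedy decoding for reproducibility).
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```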