Language models and related systems, one record per entry. Each record lists the original table's fields: Model Name; About; Year Built; Number of Parameters; Creator; Limitations; Training Time; Model Definition; Dataset Used for Training; Hardware Configuration.

ALICE (Chatbot, 1995). Creator: Richard Wallace. Parameters: N/A.
Definition: ALICE (Artificial Linguistic Internet Computer Entity) is a chatbot framework that uses AIML (Artificial Intelligence Markup Language) for rule-based natural language responses. Its key features include the ability to engage in conversation, answer questions, and provide information. Its limitations include the need to program rules manually, which is time-consuming and may not cover every scenario, and the inability to learn from new data.
Limitations: Limited to scripted responses; lacks general knowledge; prone to repeating itself.
Training time: 2 weeks. Dataset: Cornell Movie Dialogs. Hardware: not specified.

AWD-LSTM (LSTM-based language model, 2017). Creator: Salesforce Research. Parameters: 33 million (base).
Definition: AWD-LSTM (ASGD Weight-Dropped LSTM) is a language-model architecture that uses LSTM (Long Short-Term Memory) cells with weight dropout and activation dropout to prevent overfitting. Its key feature is improved training on small datasets. Its limitations include potentially slower training and higher memory use than simpler models.
Limitations: Limited to sequential data; requires large amounts of training data.
Training time: 12 hours (on WikiText). Dataset: WikiText-103. Hardware: not specified.

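The weight-dropout idea can be illustrated with a short PyTorch sketch (PyTorch is an assumed library choice here, and the class and variable names are invented for illustration): DropConnect is applied to the hidden-to-hidden weight matrix of an LSTM cell rather than to its activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTMCell(nn.Module):
    """LSTM cell with DropConnect on the recurrent (hidden-to-hidden) weights,
    in the spirit of AWD-LSTM's weight dropout. Simplified sketch, not the
    original implementation."""
    def __init__(self, input_size, hidden_size, weight_dropout=0.5):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.weight_dropout = weight_dropout

    def forward(self, x, state):
        # Mask individual recurrent weights (not activations) during training.
        w_hh = F.dropout(self.cell.weight_hh, p=self.weight_dropout,
                         training=self.training)
        h, c = state
        gates = (x @ self.cell.weight_ih.t() + self.cell.bias_ih
                 + h @ w_hh.t() + self.cell.bias_hh)
        i, f, g, o = gates.chunk(4, dim=1)          # PyTorch gate order: i, f, g, o
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

cell = WeightDropLSTMCell(32, 64)
x = torch.randn(8, 32)
h = c = torch.zeros(8, 64)
h, c = cell(x, (h, c))
print(h.shape)  # torch.Size([8, 64])
```
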
GPT (Transformer-based LM, 2018). Creator: OpenAI. Parameters: 117M.
Definition: GPT (Generative Pre-trained Transformer) is a generative language model that uses the Transformer architecture with self-attention to generate text. Its key feature is the ability to generate coherent and diverse text; its limitation is that it may generate biased or inappropriate text.
Limitations: Limited to language tasks; lacks memory; can generate nonsensical responses.
Training time: 4 days. Dataset: BooksCorpus, English Wikipedia. Hardware: 8 NVIDIA V100 GPUs, 128 GB of GPU memory, 40 CPU cores.

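Since this entry and most of the later ones rely on Transformer self-attention, here is a minimal NumPy sketch of the core operation, scaled dot-product attention. Shapes and values are toy examples only, not anything from the GPT paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional head.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```
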
BERT (Transformer-based LM, 2018). Creator: Google. Parameters: 110M.
Definition: Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based language model designed to pre-train deep bidirectional representations from a large unlabeled text corpus, which can then be fine-tuned for a variety of NLP tasks such as question answering and sentiment analysis. Limitations include its large size and the computational cost of pre-training.
Limitations: Computationally expensive; lacks memory; prone to overfitting; not suited for generative tasks.
Training time: 4 days. Dataset: BooksCorpus, English Wikipedia. Hardware: 64 Cloud TPUs.

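A hedged illustration of BERT's masked-language-modeling objective, assuming the Hugging Face transformers library and the publicly hosted bert-base-uncased checkpoint (neither is mentioned in this table): the model ranks candidate fills for a masked token.

```python
# Requires: pip install transformers torch  (downloads the bert-base-uncased checkpoint)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```
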
AdaNet (AutoML framework for neural architecture search, 2018). Creator: Google. Parameters: varies.
Definition: AdaNet is a neural architecture search algorithm that automatically designs high-quality models by iteratively learning from previous architectures. It trains a custom ensemble of neural networks, each trained with a different architecture and dataset weighting. Its limitations include high computational and time requirements and a dependency on the quality and quantity of training data.
Limitations: Lacks interpretability; requires substantial computational power.
Training time: 2 hours. Dataset: CIFAR-10. Hardware: 1 Google Cloud TPU.

QuickThought (Encoder-based LM, 2018). Creator: Google. Parameters: varies.
Definition: QuickThought is a model that uses unsupervised learning to generate sentence embeddings. Its key feature is the ability to learn from large amounts of unstructured data without labeled examples. Its limitations include a lack of control over the embeddings produced, which may not be optimal for specific downstream tasks.
Limitations: Limited to encoding sentences; not suitable for generation tasks.
Training time: 1 week. Dataset: BookCorpus. Hardware: not specified.

ELMo (Contextual LM, 2018). Creator: Allen Institute for AI. Parameters: 94 million.
Definition: ELMo (Embeddings from Language Models) is a deep contextualized word-representation model that uses a bidirectional language model to generate word embeddings that capture contextual information.
Limitations: Computationally expensive; not suitable for generation tasks; lacks interpretability.
Training time: 3 days. Dataset: Common Crawl. Hardware: 16 NVIDIA P100 GPUs.

LASER (Encoder-based LM, 2018). Creator: Facebook AI Research (FAIR). Parameters: varies.
Definition: LASER (Language-Agnostic SEntence Representations) is a multilingual sentence-embedding model pre-trained on a wide range of languages.
Limitations: Limited to encoding sentences; not suitable for generation tasks.
Training time: 10 days. Dataset: Common Crawl. Hardware: 8 NVIDIA V100 GPUs, 192 GB of GPU memory, 40 CPU cores.

COCOA (Encoder-based LM, 2018). Creator: Carnegie Mellon University. Parameters: varies.
Definition: COCOA (Conversational Contextual Cues for Online Abuse) is a language model designed to detect and flag online abuse in conversations. Its key feature is the ability to analyze conversational context and detect abusive language. Its limitations include the potential for false positives or false negatives and the need for ongoing training and updating to keep up with evolving language patterns and new types of abuse.
Limitations: Limited to sentence-classification tasks; requires substantial computational power.
Training time: 1 week. Dataset: COCOA dataset. Hardware: 8 NVIDIA V100 GPUs, 128 GB of GPU memory, 40 CPU cores.

GPT-2 (Transformer-based LM, 2019). Creator: OpenAI. Parameters: 1.5B.
Definition: A transformer-based language model capable of generating high-quality natural-language text. Key features include a large number of parameters (up to 1.5 billion), multi-head attention, and stacked transformer blocks. Its limitations include the potential for bias and ethical concerns in generated text.
Limitations: Computationally expensive; prone to generating biased or offensive language; offers little control over generated text.
Training time: 4 days (on WebText). Dataset: WebText, Books1, Books2, and English Wikipedia. Hardware: 8 NVIDIA V100 GPUs.

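For a concrete sense of what such a generative model does, the sketch below samples a continuation from the publicly released GPT-2 weights via the Hugging Face transformers pipeline (an assumed tooling choice, not something specified in this table).

```python
# Requires: pip install transformers torch  (downloads the gpt2 checkpoint)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Large language models are", max_new_tokens=30,
                do_sample=True, num_return_sequences=1)
print(out[0]["generated_text"])
```
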
XLNet (Transformer-based LM, 2019). Creator: Google. Parameters: 340M.
Definition: XLNet is a language model that uses a permutation-based training objective over the input sequence, allowing it to capture bidirectional relationships while avoiding the drawbacks of standard masked bidirectional models. Its key feature is its ability to capture complex relationships in the input data; its limitation is that it is computationally expensive.
Limitations: Computationally expensive; lacks memory; prone to overfitting; not suited for generative tasks.
Training time: 5 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 64 Cloud TPUs or 16 V100 GPUs.

RoBERTa (Transformer-based LM, 2019). Creator: Facebook. Parameters: 355M.
Definition: RoBERTa (Robustly Optimized BERT Pretraining Approach) is a language model that uses the BERT architecture with additional optimization techniques (longer training, larger batches, dynamic masking), resulting in improved performance on various NLP tasks. Its key feature is robustness and improved performance over BERT on many tasks; its limitation is that it is computationally expensive.
Limitations: Computationally expensive; lacks memory; not suitable for generation tasks.
Training time: 4 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 32 Cloud TPUs or 16 V100 GPUs.

ALBERT (Transformer-based LM, 2019). Creator: Google. Parameters: 11M.
Definition: A Lite BERT (ALBERT) is a lightweight version of BERT that reduces model size (via parameter sharing and factorized embeddings) and improves training efficiency while achieving similar or better performance than BERT. Limitations include longer training times than BERT for comparable accuracy and trade-offs between model size and performance.
Limitations: Computationally expensive; lacks memory; not suitable for generation tasks.
Training time: 3.5 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 64 Cloud TPUs or 16 V100 GPUs.

CTRL (Transformer-based LM, 2019). Creator: Salesforce. Parameters: 110M.
Definition: CTRL (Conditional Transformer Language Model) is a text-generation model that can be conditioned on control codes specifying a task, domain, or topic.
Limitations: Computationally expensive; requires domain-specific control codes or prompts.
Training time: 4 days. Dataset: WebText and Books1. Hardware: 16 V100 GPUs.

UniLM (Transformer-based LM, 2019). Creator: Microsoft. Parameters: 340M.
Definition: UniLM (Unified Language Model) is a language model that can perform both language generation and comprehension tasks by pre-training on masked language modeling and sequence-to-sequence objectives. Its key feature is the ability to handle both kinds of task with high accuracy; its limitation is that it requires a large amount of training data and computational resources.
Limitations: Computationally expensive; requires domain-specific prompts.
Training time: 3 days (on Wikipedia). Dataset: English Wikipedia and BookCorpus. Hardware: 16 V100 GPUs.

ERNIE (Transformer-based LM, 2019). Creator: Baidu. Parameters: 340M.
Definition: ERNIE (Enhanced Representation through kNowledge IntEgration) is a pre-trained language model that integrates knowledge graphs into its training to enhance its understanding of concepts and entities.
Limitations: Limited to the Chinese language.
Training time: 2.5 days. Dataset: Chinese Wikipedia, Baidu Baike, and BookCorpus (Chinese). Hardware: 8 NVIDIA V100 GPUs.

Flair (Non-transformer-based LM, 2019). Creator: Zalando Research. Parameters: 134M.
Definition: Flair is a language model that combines traditional word embeddings with contextual embeddings to improve performance on NLP tasks. Its key feature is its ability to capture complex relationships and context in language data; its limitation is that it may not perform as well on very large datasets.
Limitations: Limited to NLP tasks.
Training time: not mentioned in the research paper. Dataset: WikiText-2, IMDb, and English Wikipedia. Hardware: not specified.

DistilBERT (Transformer-based LM, 2019). Creator: Hugging Face. Parameters: 66M.
Definition: DistilBERT is a distilled version of BERT with a smaller model size and faster inference while maintaining similar performance on various NLP tasks. Limitations include a trade-off between model size and performance, with a slight decrease in accuracy compared to BERT.
Limitations: Reduced accuracy compared to BERT.
Training time: 39 minutes. Dataset: BooksCorpus and English Wikipedia. Hardware: CPU and GPU.

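Knowledge distillation of the kind used for DistilBERT can be sketched as a loss that blends soft teacher targets with hard labels. The PyTorch snippet below is a simplified, generic version, not the exact DistilBERT training objective (which also adds a cosine embedding loss between hidden states).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend KL divergence against the teacher's temperature-softened
    distribution with ordinary cross-entropy against the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale to keep gradients comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)  # toy student logits
teacher = torch.randn(4, 10)                      # toy teacher logits
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```
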
XLM-RoBERTa (Transformer-based LM, 2019). Creator: Facebook. Parameters: 550M.
Definition: XLM-RoBERTa is a cross-lingual language model that is pre-trained on a large multilingual corpus and fine-tuned for specific tasks.
Limitations: Limited to NLP tasks.
Training time: 4 days. Dataset: text in 100+ languages, including English Wikipedia. Hardware: 64 Cloud TPUs or 16 V100 GPUs.

MASS (Transformer-based LM, 2019). Creator: Microsoft Research Asia. Parameters: 110 million.
Definition: MASS (Masked Sequence-to-Sequence Pre-training) is a sequence-to-sequence pre-training approach that masks a fragment of the input and trains the model to predict the masked tokens.
Limitations: Limited to machine translation and text generation.
Training time: 3 days. Dataset: BookCorpus and English Wikipedia. Hardware: 64 NVIDIA V100 GPUs.

SpanBERT (Transformer-based LM, 2019). Creator: Facebook AI Research. Parameters: 108 million.
Definition: A language model from Facebook AI Research that extends the BERT architecture by masking and predicting contiguous spans of text, capturing relationships between different spans. It has shown strong performance on several NLP benchmarks, including coreference resolution and named entity recognition. Its performance is bounded by the quality and size of its training data, and it may not generalize well to tasks outside that data.
Limitations: Limited to NLP tasks.
Training time: 4 days. Dataset: BookCorpus and English Wikipedia. Hardware: 32 NVIDIA V100 GPUs.

MobileBERT (Transformer-based LM, 2019). Creator: Google. Parameters: 25 million.
Definition: A compact version of BERT optimized for mobile and embedded devices. It uses a task-specific loss function to further optimize efficiency. Limitations include reduced performance compared to full-size BERT and accuracy trade-offs due to its smaller size.
Limitations: Reduced accuracy compared to BERT.
Training time: 4 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 8 TPUs and 2 GPUs.

Funnel Transformer (Transformer-based LM, 2019). Creator: Google Research and Carnegie Mellon University. Parameters: 175 million.
Definition: A hierarchical architecture that reduces the computational and memory requirements of large-scale language models by progressively pooling (compressing) the sequence of hidden states in deeper layers. Limitations include the need for specialized hardware and potential issues with capturing long-term dependencies in language.
Limitations: Limited to NLP tasks.
Training time: 3 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 64 NVIDIA V100 GPUs.

SuperGLUE (NLP benchmark suite, 2019). Creator: New York University, Facebook AI Research, University of Washington, and DeepMind. Parameters: varies (benchmark, not a model).
Definition: A benchmark suite that evaluates NLP models on more challenging tasks than previous benchmarks such as GLUE. It includes a diverse set of tasks such as natural language inference, question answering, and coreference resolution.
Limitations: Limited to NLP tasks; it is a benchmark suite rather than a language model, so reported performance depends on the specific models evaluated against it.
Training time: 2-3 days (for baseline models). Dataset: various, including MultiNLI and RACE. Hardware: 8 NVIDIA V100 GPUs.

Transformer-XL (Transformer-based LM, 2019). Creator: Google. Parameters: 257 million.
Definition: Transformer-XL is a language model that adds segment-level recurrence (a cache of previous hidden states) and relative positional encodings to the Transformer, overcoming the fixed-length context limitation of standard Transformers. Its key feature is the ability to capture longer-term dependencies in language data; its limitation is that it can be computationally expensive.
Limitations: Limited to sequence-modeling tasks.
Training time: 2.5 days. Dataset: various, including WikiText-103. Hardware: 16 NVIDIA V100 GPUs.

BART (Transformer-based LM, 2019). Creator: Facebook AI Research. Parameters: 406 million (base).
Definition: A sequence-to-sequence model that combines a bidirectional encoder with an autoregressive decoder for pre-training and fine-tuning. It can handle tasks such as text generation and summarization. Its limitations include high computational requirements and the potential for bias in generated text.
Limitations: Limited to NLP tasks.
Training time: 3 days. Dataset: various, including CNN/Daily Mail and XSum. Hardware: 16 NVIDIA V100 GPUs.

TAPAS (Transformer-based LM, 2019). Creator: Google. Parameters: 220 million.
Definition: A model designed for table-based question answering. It extends BERT-style encoding with table-aware embeddings (row, column, and rank information) to enable reasoning over tables. Limitations include the need for task-specific training data and potential issues with tables that have complex structures.
Limitations: Limited to table-based question-answering tasks.
Training time: 1 day. Dataset: various, including WikiTables. Hardware: 16 NVIDIA V100 GPUs.

XLM-R (Transformer-based LM, 2019). Creator: Facebook AI Research. Parameters: 550 million (base).
Definition: XLM-R (Cross-lingual Language Model - RoBERTa) is a language model pre-trained on text in many languages, allowing it to perform well on cross-lingual NLP tasks. Its key feature is strong multilingual performance; its limitation is that it can be computationally expensive.
Limitations: Limited to NLP tasks.
Training time: 3 days. Dataset: various, including WikiMatrix and the UN Parallel Corpus. Hardware: 8 NVIDIA V100 GPUs.

GPT-3 (Transformer-based LM, 2020). Creator: OpenAI. Parameters: 175B.
Definition: A large transformer-based language model with up to 175 billion parameters. It can generate high-quality natural-language text and perform various language tasks such as translation and question answering, often from only a few in-context examples. Its limitations include the potential for bias and ethical concerns in generated text.
Limitations: Limited to NLP tasks.
Training time: not mentioned in the research paper. Dataset: various, including Common Crawl and Wikipedia. Hardware: 355 TFLOPs.

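GPT-3 popularized in-context ("few-shot") prompting, where the task is specified entirely in the prompt with a handful of examples and no gradient updates. The sketch below only builds such a prompt string; the generation step is left abstract because any completion model or API could consume it.

```python
# Few-shot sentiment classification prompt in the GPT-3 style (illustrative only).
examples = [
    ("I loved this movie!", "positive"),
    ("The plot was dull and predictable.", "negative"),
]
query = "An absolute masterpiece of filmmaking."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)   # send this string to a completion model; the next token it
                # generates ("positive"/"negative") is the prediction
```
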
T5 (Transformer-based LM, 2020). Creator: Google. Parameters: 11B.
Definition: A transformer-based encoder-decoder architecture that casts every task as text-to-text. It can perform tasks such as translation, summarization, and question answering. Its limitations include the need for large amounts of training data and compute.
Limitations: Limited to text-to-text tasks.
Training time: 4 days. Dataset: C4, WebText, and books. Hardware: 64 TPU v3.

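T5's text-to-text framing can be tried directly with the publicly released t5-small checkpoint, assuming the Hugging Face transformers and sentencepiece packages are available (an assumption, not something stated in this table): the task is selected by a plain-text prefix.

```python
# Requires: pip install transformers torch sentencepiece  (downloads t5-small)
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
# T5 casts every task as text-to-text, selected by a task prefix.
print(t5("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t5("summarize: The quick brown fox jumped over the lazy dog again and again, "
         "until the dog finally moved out of the way.")[0]["generated_text"])
```
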
ELECTRA (Transformer-based LM, 2020). Creator: Google. Parameters: 110M.
Definition: ELECTRA is a pre-training approach that replaces a portion of the input tokens with plausible substitutes from a small generator and trains a discriminator to distinguish original from replaced tokens, resulting in better sample efficiency and accuracy than BERT at comparable compute. Limitations include increased complexity in the pre-training setup and potential loss of information due to the replacement of tokens.
Limitations: Limited to NLP tasks.
Training time: 4 days. Dataset: Wikipedia and BooksCorpus. Hardware: 8 TPU v3.

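ELECTRA's discriminator can be probed with the publicly released google/electra-small-discriminator checkpoint, again assuming the Hugging Face transformers library (an assumed tooling choice): it emits one logit per token, with higher values indicating tokens the model believes were replaced.

```python
# Requires: pip install transformers torch  (downloads google/electra-small-discriminator)
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

sentence = "The chef cooked the meal in the spaceship."   # deliberately odd context
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits[0]                    # per-token replaced-token scores
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{tok:>12}  {s.item():+.2f}")
```
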
DeBERTa (Transformer-based LM, 2020). Creator: Microsoft. Parameters: 345M.
Definition: DeBERTa (Decoding-enhanced BERT with disentangled attention) is a transformer-based language model that improves on BERT and RoBERTa with a disentangled attention mechanism (separate content and position representations) and an enhanced mask decoder, enhancing performance on a wide range of NLP tasks. Limitations include longer training times and increased complexity compared to simpler pre-training approaches.
Limitations: Limited to NLP tasks.
Training time: 7 days (large model). Dataset: various, including English Wikipedia and OpenWebText. Hardware: 128 GPUs.

MiniLM (Transformer-based LM, 2020). Creator: Microsoft. Parameters: 50M.
Definition: MiniLM is a compact version of the BERT language model, distilled to reduce model size and improve efficiency while maintaining similar performance on various NLP tasks. Limitations include trade-offs between model size and performance and potential difficulties in adapting the model to specific tasks.
Limitations: Reduced accuracy compared to BERT.
Training time: 6 hours. Dataset: various, including English Wikipedia and BookCorpus. Hardware: 8 V100 GPUs.

Longformer (Transformer-based LM, 2020). Creator: Allen Institute for AI. Parameters: 400M.
Definition: A Transformer variant that enables processing of long sequences (up to 4,096 tokens). It uses a combination of sliding-window attention and global attention to reduce computation and memory requirements. Limitations include slower training and inference times than standard Transformers on short inputs.
Limitations: Limited to sequence-modeling tasks.
Training time: 1 hour per epoch. Dataset: various, including English Wikipedia and Common Crawl. Hardware: 8 V100 GPUs.

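The sliding-window-plus-global attention pattern can be visualized with a small NumPy sketch that builds the boolean attention mask; the window size and global-token positions below are illustrative values, not the Longformer defaults.

```python
import numpy as np

def sliding_window_mask(seq_len, window, global_tokens=(0,)):
    """Boolean mask: each token attends to a local window, and designated tokens
    (e.g. a [CLS]-like token) attend and are attended to globally.
    A simplified sketch of Longformer-style sparse attention."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    for g in global_tokens:
        mask[g, :] = True                     # global token sees everything
        mask[:, g] = True                     # everything sees the global token
    return mask

m = sliding_window_mask(seq_len=12, window=2)
print(m.sum(), "allowed attention pairs out of", m.size)
```
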
MBART (Transformer-based LM, 2020). Creator: Facebook. Parameters: 6B.
Definition: A multilingual variant of BART capable of translating between 25 languages. It uses a shared encoder-decoder architecture and a multilingual denoising pre-training objective to improve cross-lingual performance. Limitations include reduced performance on certain language pairs and potential issues with preserving language-specific nuances during translation.
Limitations: Limited to machine-translation tasks.
Training time: 4 days. Dataset: various, including Wikipedia and Common Crawl. Hardware: 8 TPU v3.

Marian (Neural machine translation framework, 2020). Creator: Microsoft. Parameters: 56M.
Definition: A fast and efficient neural machine translation framework. It uses sequence-to-sequence architectures with attention mechanisms for improved translation quality and speed. Limitations include the need for specialized hardware for larger models and potential issues with long, complex sentences.
Limitations: Not specified.
Training time: not specified. Dataset: various, including JW300 and OpenSubtitles. Hardware: unknown.

ProphetNet (Transformer-based LM, 2020). Creator: Microsoft Research Asia. Parameters: 554 million.
Definition: A model designed for both sequence-to-sequence tasks and language modeling. It uses a future n-gram prediction objective with an n-stream self-attention mechanism to improve long-term sequence modeling. Limitations include the need for large amounts of training data to achieve optimal performance and potential issues with capturing long-term dependencies.
Limitations: Limited to sequence-modeling tasks.
Training time: not mentioned in the research paper. Dataset: various, including English Wikipedia and BooksCorpus. Hardware: 32 V100 GPUs.

Performer (Transformer-based LM, 2020). Creator: Google. Parameters: 37 million (base).
Definition: A transformer-based model whose attention mechanism (FAVOR+, based on random feature maps) approximates softmax attention in linear rather than quadratic time. It achieves strong performance on several benchmarks while being more computationally efficient than standard Transformers. Its main limitation is that the approximation may fall short of exact self-attention on tasks that require very precise attention patterns.
Limitations: Limited to sequence-modeling tasks.
Training time: 4.5 hours. Dataset: various, including WMT16 and WMT18 news translation. Hardware: 1 V100 GPU.

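The Performer's linear-time attention rests on random feature maps whose dot products approximate the softmax kernel, so attention can be computed without ever materializing the full n x n score matrix. The NumPy sketch below is a heavily simplified version of that idea, with constants and stabilization tricks omitted, not the full FAVOR+ algorithm.

```python
import numpy as np

def positive_random_features(x, W):
    """Positive random features so that phi(q) . phi(k) approximates exp(q . k)
    (simplified; normalization constants omitted)."""
    proj = x @ W.T                                              # (n, m)
    return np.exp(proj - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(W.shape[0])

def linear_attention(Q, K, V, n_features=256, seed=0):
    d = Q.shape[-1]
    Q, K = Q / d ** 0.25, K / d ** 0.25                         # fold in 1/sqrt(d) scaling
    W = np.random.default_rng(seed).normal(size=(n_features, d))
    q, k = positive_random_features(Q, W), positive_random_features(K, W)
    # Associativity is the trick: compute k^T V first, never the n x n matrix.
    kv = k.T @ V                                                # (m, d_v)
    z = q @ k.sum(axis=0)                                       # per-row normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(linear_attention(Q, K, V).shape)                          # (6, 8)
```
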
Meena (Conversational AI, 2020). Creator: Google. Parameters: 2.6 billion (base).
Definition: A conversational AI model that uses a seq2seq architecture to generate natural-sounding, multi-turn responses to user inputs. It has been shown to be more sensible and specific than earlier open-domain chatbots. One limitation is that it requires a large amount of data and computational resources to train effectively.
Limitations: May have ethical implications.
Training time: need to confirm. Dataset: multi-repository corpus with 341 GB of text. Hardware: 2,048-core TPU v3.

MASSIVE (Multilingual model for Asian languages, 2020). Creator: Microsoft Research Asia. Parameters: 400 million.
Definition: A transformer-based model for multi-agent reinforcement learning that can handle large state and action spaces. It has achieved state-of-the-art performance on several benchmarks. Its main limitation is that it can be computationally expensive, especially with large state and action spaces.
Limitations: Focuses mainly on Asian languages.
Training time: not mentioned in the research paper. Dataset: English Wikipedia + BookCorpus. Hardware: 1,024 V100 GPUs.

SemBERT (Semantic understanding, 2020). Creator: Zhejiang University. Parameters: 340 million (base).
Definition: A model designed for joint modeling of semantic and syntactic information in text. It uses a BERT-like architecture with additional semantic and syntactic modules to enable fine-grained analysis of text. Limitations include the need for specialized training data for semantic and syntactic parsing tasks and potential issues with model interpretability.
Limitations: Limited to the English and Chinese languages.
Training time: 40 hours. Dataset: English Wikipedia + BookCorpus. Hardware: 64 V100 GPUs.

RAG (Retrieval-Augmented Generation, 2020). Creator: Facebook AI Research. Parameters: 550 million (base).
Definition: A transformer-based model that performs retrieval-augmented generation: a dense retriever fetches relevant passages from a large text index, and a seq2seq generator conditions on them to produce more precise, informative output. Its main limitation is that it can be computationally expensive, especially with large document indexes.
Limitations: Limited to short-text generation tasks.
Training time: 2 days. Dataset: English Wikipedia + Common Crawl. Hardware: 8 V100 GPUs (fine-tuning), 64 V100 GPUs (pre-training).

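The retrieve-then-generate loop can be sketched without any particular library: embed the query, rank passages by similarity, and prepend the winners to the generator's prompt. In the sketch below the embedding is a deterministic hash-based stand-in for a real dense retriever such as DPR, and the final generation step is left abstract.

```python
import zlib
import numpy as np

# Toy retrieval-augmented generation loop (illustrative only).
passages = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "The Great Barrier Reef lies off the coast of Australia.",
]

def embed(text, dim=64):
    """Deterministic bag-of-words hash embedding, a stand-in for a neural encoder."""
    vec = np.zeros(dim)
    for word in text.lower().replace(".", "").replace("?", "").split():
        rng = np.random.default_rng(zlib.crc32(word.encode()))
        vec += rng.normal(size=dim)
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query, k=1):
    sims = [float(embed(query) @ embed(p)) for p in passages]
    return [passages[i] for i in np.argsort(sims)[::-1][:k]]

query = "Where is the Eiffel Tower?"
context = " ".join(retrieve(query, k=1))
prompt = f"context: {context} question: {query}"
print(prompt)   # this prompt would be fed to a seq2seq generator (e.g. BART)
```
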
PEGASUS (Abstractive text summarization, 2020). Creator: Google. Parameters: 568 million (large).
Definition: A transformer-based sequence-to-sequence model designed for text summarization. It uses a self-supervised pre-training objective called gap-sentence generation, in which important sentences are removed from a document and the model learns to regenerate them. Its limitations include the need for large amounts of training data and the potential for bias in generated text.
Limitations: Heavy computational resource requirements.
Training time: 9.5 days. Dataset: English Wikipedia + BooksCorpus + OpenWebText + GitHub. Hardware: 32 V100 GPUs.

GShard-OLM (Cross-lingual language modeling, 2020). Creator: Google. Parameters: 600 billion.
Definition: A large-scale pre-training method for natural language generation built on the GShard architecture, which shards a sparsely activated mixture-of-experts model across many accelerators. Its limitation is that it requires massive computational resources for pre-training.
Limitations: Requires large-scale distributed training.
Training time: not mentioned in the research paper. Dataset: large text corpora. Hardware: 2,048 cores (training), 512 cores (evaluation).

X-Transformer (Cross-lingual language modeling, 2020). Creator: Huawei Noah's Ark Lab. Parameters: 102 million (base).
Definition: A language model that extends the Transformer architecture by introducing cross-attention between different layers of the model. It has shown strong performance on several NLP benchmarks, including machine translation and language understanding.
Limitations: Requires specialized hardware for training.
Training time: not mentioned in the research paper. Dataset: large-scale text corpora. Hardware: 512 V100 GPUs.

RoBERTa-large (Robustly Optimized BERT Pretraining Approach, large variant, 2020). Creator: Facebook AI Research (FAIR). Parameters: 355M.
Definition: A larger version of RoBERTa with 355 million parameters. It has been trained on massive amounts of data and shows strong performance on several NLP benchmarks, including GLUE and SuperGLUE. Its large size makes it computationally expensive, and it may not suit all applications because of its high memory requirements.
Limitations: High computational requirements.
Training time: 4 days. Dataset: English Wikipedia + Common Crawl. Hardware: 8 V100 GPUs.

ERNIE 2.0 (Enhanced Representation through Knowledge Integration, version 2.0, 2020). Creator: Baidu. Parameters: 2.8B.
Definition: An improved version of the ERNIE language model that integrates knowledge graphs and improves representation learning. It handles various natural language tasks such as sentiment analysis and named entity recognition. Its limitations include the need for large amounts of training data and the potential for bias in generated text.
Limitations: Limited to Chinese-language processing.
Training time: 1 day. Dataset: large-scale Chinese corpora + English Wikipedia + BookCorpus. Hardware: 128 V100 GPUs.

SEQUENCE-TO-SEQUENCE Transformer (S2S-T5) (Sequence-to-sequence generation, 2020). Creator: Microsoft. Parameters: 11B.
Definition: A transformer-based architecture designed for sequence-to-sequence tasks such as machine translation and summarization. It achieves state-of-the-art performance on several benchmarks and can be fine-tuned for specific tasks. Limitations include high computational requirements and large model size.
Limitations: Requires large-scale distributed training.
Training time: not mentioned in the research paper. Dataset: English Wikipedia + CC-News + OpenWebText + StoriesCorpus. Hardware: 64 V100 GPUs (pre-training), 1 V100 GPU (fine-tuning).

MT5 (Multilingual T5, 2020). Creator: Google. Parameters: 14.7B.
Definition: A multilingual version of T5 that handles tasks across more than 100 languages. Its key features include a shared encoder-decoder architecture and broad language coverage. Its limitations include the need for large amounts of multilingual training data.
Limitations: Heavy computational resource requirements.
Training time: not mentioned in the research paper. Dataset: mC4 (multilingual C4) dataset. Hardware: 2,048 TPU v3 cores.

LaBSE (Language-agnostic BERT Sentence Embedding, 2020). Creator: Google. Parameters: 247M.
Definition: LaBSE is a cross-lingual model that produces high-quality sentence embeddings for over 100 languages, enabling language-agnostic text analysis such as cross-lingual retrieval and sentence mining. Limitations include potential difficulties with low-resource languages and the need for large amounts of pre-training data.
Limitations: Limited to sentence-level embeddings.
Training time: not mentioned in the research paper. Dataset: multilingual Wikipedia, OSCAR, and the Common Crawl corpus. Hardware: not specified.

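Cross-lingual sentence embeddings of this kind are easy to try via the sentence-transformers package and its hosted LaBSE checkpoint (an assumed tooling choice, not part of this table): semantically equivalent sentences in different languages land close together in the embedding space.

```python
# Requires: pip install sentence-transformers  (downloads the LaBSE checkpoint)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
sentences = ["The weather is lovely today.",      # English
             "Das Wetter ist heute herrlich.",    # German paraphrase
             "I left my keys at the office."]     # unrelated English sentence
emb = model.encode(sentences, normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high: cross-lingual paraphrase
print(util.cos_sim(emb[0], emb[2]).item())  # lower: different meaning
```
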
Wu Dao 2.0 (Chinese language processing, 2020). Creator: Beijing Academy of Artificial Intelligence (BAAI). Parameters: 1.75T.
Definition: A very large pre-trained multimodal language model developed by the Beijing Academy of Artificial Intelligence, with roughly 1.75 trillion parameters in a mixture-of-experts design. It handles Chinese and English text as well as text-image tasks. Limitations include limited accessibility for researchers outside China and very high training costs.
Limitations: High computational resource requirements.
Training time: not mentioned in the research paper. Dataset: proprietary dataset of mixed-domain Chinese text. Hardware: 1,024 Ascend 910 AI processors.

GShard (Cross-lingual language modeling, 2021). Creator: Google. Parameters: 600B.
Definition: A large-scale language model and training framework that shards a sparsely activated mixture-of-experts Transformer across many accelerators, enabling training on massive datasets. It has been shown to achieve state-of-the-art performance on a range of natural language processing tasks, most notably multilingual machine translation.
Limitations: Requires large-scale distributed training.
Training time: not mentioned in the research paper. Dataset: large-scale text corpora. Hardware: 2,048 cores (training), 512 cores (evaluation).

Swin Transformer (Image processing, 2021). Creator: Microsoft Research Asia. Parameters: 200 million (base).
Definition: A hierarchical vision Transformer that computes self-attention within shifted local windows, combining local and global context efficiently. Its limitation is that attention coverage can be limited for very long sequences.
Limitations: Limited to image-processing tasks.
Training time: not mentioned in the research paper. Dataset: ImageNet, COCO, and various object-detection datasets. Hardware: 1,024 V100 GPUs.

Speech-CLIP (Speech recognition and vision, 2021). Creator: Facebook AI Research. Parameters: 400 million.
Definition: A model that combines visual and audio information to enable tasks such as audio-visual event localization and sound separation. It uses contrastive learning to learn audio and visual representations jointly. One limitation is that it requires a large amount of data to train effectively.
Limitations: Limited to audio and visual processing.
Training time: not mentioned in the research paper. Dataset: various audio datasets. Hardware: not specified.

Turing NLG (Neural language generation, 2021). Creator: Microsoft Research. Parameters: 2.7 billion (base).
Definition: A neural language generation model designed for text completion and summarization tasks. Its limitation is that it may produce outputs that lack coherence and context.
Limitations: May produce outputs that lack coherence and context.
Training time: not specified. Dataset: various dialogue datasets. Hardware: 24 V100 GPUs.

BlenderBot (Conversational LLM for chatbots and dialog systems, 2021). Creator: Facebook AI Research. Parameters: 4 billion (small).
Definition: A transformer-based chatbot developed by Facebook AI Research that can engage in open-domain conversations with humans. It uses a large-scale generative model trained on a diverse set of conversation topics. Its limitations include the potential for generating inappropriate or biased responses.
Limitations: Requires a large amount of computational resources for training.
Training time: 3 weeks. Dataset: a combination of existing conversational and social-media datasets. Hardware: 2,048 TPU v3.

Evolved (LLM for long-form content generation and summarization, 2021). Creator: Google. Parameters: 2 billion (base).
Definition: An architecture-search method for Transformers that optimizes the architecture for a given task. Its limitation is that the search process is computationally expensive.
Limitations: Not suitable for small-scale or low-resource environments.
Training time: not mentioned. Dataset: C4, CC-News, OpenWebText. Hardware: 512 V100 GPUs.

PLATO (Large-scale LLM for dialogue generation, 2021). Creator: Baidu. Parameters: 34 billion (base).
Definition: A large-scale pre-trained model for multi-turn dialogue generation that incorporates knowledge distillation and a hierarchical Transformer architecture. Its limitation is that it requires a significant amount of pre-training data.
Limitations: Limited to generating human-like responses in a few specific domains.
Training time: not mentioned. Dataset: a large-scale proprietary dataset of diverse user interactions. Hardware: 512 V100 GPUs (pre-training).

AdaLM (Multilingual LLM for natural language processing, 2021). Creator: RWTH Aachen University. Parameters: 1.3 billion (base).
Definition: A large-scale adaptive language model that dynamically adjusts its parameters based on the input sequence. Its limitation is that it requires a large amount of computation for training and inference.
Limitations: Not explicitly designed for any specific task.
Training time: 7 days. Dataset: proprietary dataset of user-generated text. Hardware: 256 V100 GPUs.

GPT-Neo (LLM designed for natural language processing and generation, 2021). Creator: EleutherAI. Parameters: 2.7 billion (base).
Definition: GPT-Neo is an open-source transformer-based language model from EleutherAI that follows the GPT architecture and is trained on The Pile, a large and diverse open dataset, resulting in solid performance on various NLP tasks. Limitations include potential difficulties in adapting to specific domains and the need for large amounts of training data.
Limitations: Can exhibit gender and racial bias.
Training time: not mentioned. Dataset: The Pile. Hardware: up to 2,048 V100 GPUs or 512 CPUs.

Codex (LLM designed for programming-language processing, 2021). Creator: OpenAI. Parameters: 6 billion (base).
Definition: Codex is a language model developed by OpenAI that is fine-tuned on source code, enabling programming tasks such as code completion and generating code from natural-language descriptions; it is the model behind GitHub Copilot. Limitations include uneven coverage across programming languages and potential difficulties in adapting to new programming paradigms.
Limitations: Limited to the specific task of programming-language processing.
Training time: not mentioned. Dataset: public code repositories and Stack Overflow. Hardware: not specified.

MASS-T5 (Multilingual LLM for natural language processing and generation, 2021). Creator: Microsoft Research Asia. Parameters: 14.5 billion (base).
Definition: A large-scale language model based on the transformer architecture, trained on a diverse range of tasks including language modeling, translation, and summarization. It performs well on a wide range of natural language processing tasks.
Limitations: Limited to the T5 architecture and focused on English.
Training time: not mentioned. Dataset: C4 dataset. Hardware: 1,024 V100 GPUs.

BigBird (LLM designed for long-form content processing, 2021). Creator: Google Research. Parameters: 1.3B.
Definition: BigBird is a transformer-based model that uses a sparse attention mechanism (a mix of local, random, and global attention) to reduce the quadratic cost of full self-attention, allowing it to process sequences of up to 4,096 tokens. An implementation is available in the Hugging Face Transformers library.
Limitations: Requires high computational resources and time for training.
Training time: not mentioned. Dataset: English Wikipedia, GigaWord, ClueWeb, and Common Crawl. Hardware: 512 V100 GPUs.

DALL-E 2 (Text-to-image generation model, 2022). Creator: OpenAI. Parameters: 250M.
Definition: A large-scale generative model developed by OpenAI that creates images from textual descriptions. It uses CLIP text and image embeddings together with a diffusion-based decoder to produce images. Its limitations include the potential for generating inappropriate or biased images.
Limitations: Limited to image generation from text prompts.
Training time: not mentioned. Dataset: a curated subset of image-text pairs from the internet and CC-WebVideo. Hardware: 8,192 TPUs.

GShard-XL (Large-scale LLM with cross-lingual pre-training, 2021). Creator: Google. Parameters: 6.9B.
Definition: A transformer-based model built on the GShard approach, which allows training on extremely large datasets by sharding model parameters across multiple devices. The different GShard variants correspond to different parameter counts and dataset sizes. Their main limitation is that they require specialized infrastructure to train and use effectively.
Limitations: Limited to the GShard architecture and focused on English.
Training time: not mentioned. Dataset: large-scale text corpora. Hardware: 8,192 cores (training), 2,048 cores (evaluation).

UniLMv2 (LLM for natural language processing and generation, 2022). Creator: Microsoft. Parameters: 340M.
Definition: An extension of the UniLM model that incorporates pre-training on multiple objectives to improve performance on downstream tasks. It has achieved state-of-the-art performance on several benchmarks. One limitation is that it may require more computational resources to train and use than other models.
Limitations: Limited to the UniLM architecture and focused on English.
Training time: not mentioned. Dataset: English Wikipedia, BooksCorpus, OpenWebText, STS-B, SQuAD, and a QA dataset. Hardware: 16 V100 GPUs, 16 TPUs, or 8 GPUs.

GShard-3B (Large-scale LLM with unsupervised training, 2022). Creator: Google. Parameters: 3B.
Definition: A large-scale multilingual model with 3 billion parameters trained with Google's GShard distributed-training system. It can handle 499 different languages and shows state-of-the-art performance on several NLP tasks. Due to its large size, it is computationally expensive and requires specialized hardware for training and inference.
Limitations: Limited to the GShard architecture and focused on English.
Training time: not mentioned. Dataset: large-scale text corpora. Hardware: 2,048 cores (training), 512 cores (evaluation).

GShard-13B (Large-scale LLM with unsupervised training, 2022). Creator: Google. Parameters: 13B.
Definition: A further extension of GShard-3B with 13 billion parameters, making it one of the larger models in this family. It can handle even more languages than GShard-3B and has shown strong performance on various NLP benchmarks. As with GShard-3B, it requires significant computational resources and specialized hardware for training and inference.
Limitations: Limited to the GShard architecture and focused on English.
Training time: not mentioned. Dataset: large-scale text corpora. Hardware: 16,384 cores (training), 4,096 cores (evaluation).

Chinchilla (LLM for natural language understanding and generation, 2022). Creator: DeepMind. Parameters: 70 billion.
Definition: A 70-billion-parameter language model from DeepMind trained with a compute-optimal balance between model size and number of training tokens (roughly 20 tokens per parameter). Despite being smaller than contemporaries such as Gopher, it matches or outperforms them on many benchmarks. Its limitations include the need for very large training corpora and substantial compute.
Limitations: Requires very large amounts of training data and compute.
Training time: not mentioned. Dataset: BookCorpus, Wikipedia, Common Crawl, and OpenWebText. Hardware: 8 TPUs.

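The compute-optimal recipe behind Chinchilla is often reduced to a rule of thumb of roughly 20 training tokens per parameter, with training cost approximated as 6 FLOPs per parameter per token. The snippet below applies those heuristics; the numbers are back-of-the-envelope estimates, not figures taken from the paper's tables.

```python
# Rough Chinchilla-style sizing heuristics (illustrative only).
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal number of training tokens for a model size."""
    return n_params * tokens_per_param

def approx_training_flops(n_params, n_tokens):
    """Common rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n_params = 70e9                                   # a Chinchilla-scale model
n_tokens = chinchilla_optimal_tokens(n_params)    # ~1.4e12 tokens
print(f"tokens: {n_tokens:.2e}")
print(f"approx training FLOPs: {approx_training_flops(n_params, n_tokens):.2e}")
```
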
PaLM (LLM designed for natural language processing, 2022). Creator: Google. Parameters: 540 billion.
Definition: PaLM (Pathways Language Model) is a 540-billion-parameter, decoder-only Transformer language model trained with Google's Pathways system. It uses self-attention to learn contextual representations and shows strong few-shot performance on reasoning, coding, and multilingual tasks.
Limitations: Limited to natural language processing tasks.
Training time: not mentioned. Dataset: WikiText-103, Toronto Books Corpus, UMBC corpus, and Common Crawl. Hardware: 6,144 TPU v4 chips.

OPT (Open Pretrained Transformer) (LLM designed for natural language processing, 2022). Creator: Meta. Parameters: 175 billion.
Definition: Open Pretrained Transformer (OPT) is a suite of decoder-only transformer language models from Meta AI, pre-trained on a large-scale dataset and released openly to make large-model research more accessible; the largest variant has 175 billion parameters. Limitations include potential difficulties in handling low-resource languages and the need for large amounts of pre-training data.
Limitations: Limited flexibility for fine-tuning on specific tasks due to a fixed architecture and pre-defined set of parameters.
Training time: not mentioned. Dataset: various large-scale text corpora, including Common Crawl, BooksCorpus, and Wikipedia. Hardware: 256 TPU v3s.

YaLM 100B (LLM for natural language processing and generation, 2022). Creator: Yandex. Parameters: 100 billion.
Definition: "Yet another Language Model" with 100 billion parameters: a transformer-based model trained on a diverse range of datasets and released publicly by Yandex. It can generate high-quality text in various languages.
Limitations: Limited accessibility for many researchers and developers due to high computational requirements and cost.
Training time: not mentioned. Dataset: large-scale text corpora, including Common Crawl and various scientific and academic datasets. Hardware: 8,192 NVIDIA V100 GPUs.

Minerva (LLM designed for quantitative reasoning, 2022). Creator: Google. Parameters: 540 billion.
Definition: A language model based on PaLM that is further trained on scientific and mathematical text, enabling it to solve quantitative-reasoning problems in mathematics and science step by step. It uses a transformer architecture with self-attention to learn contextual representations.
Limitations: Limited interpretability and transparency of its decision-making process due to its complex architecture and training process.
Training time: not mentioned. Dataset: OpenWebText2, Common Crawl, PubMed, and other public datasets. Hardware: 2x2 NVIDIA V100 GPUs.

BLOOM (LLM designed for natural language processing, 2022). Creator: BigScience collaboration led by Hugging Face. Parameters: 176 billion.
Definition: An open-access, multilingual transformer language model trained by the BigScience research collaboration. It covers 46 natural languages and 13 programming languages and performs well on a wide range of NLP tasks, including language modeling, question answering, and sentiment analysis.
Limitations: Very high computational requirements for training and inference.
Training time: not mentioned. Dataset: the ROOTS corpus (multilingual web text, books, and code). Hardware: not specified.

AlexaTM (Teacher Models) (LLM designed for natural language processing, 2022). Creator: Amazon. Parameters: 20 billion.
Definition: A family of pre-trained sequence-to-sequence language models developed by Amazon for natural language processing tasks, including text classification, entity recognition, and question answering. They are trained on a large corpus of text and can be fine-tuned or prompted for specific tasks.
Limitations: Limited generalizability to tasks beyond the scope of the specific domains they are trained on.
Training time: not mentioned. Dataset: various datasets, including BooksCorpus, Common Crawl, and other web text. Hardware: not specified.

LLaMA (Large Language Model Meta AI) (LLM designed for natural language processing and generation, 2023). Creator: Meta. Parameters: 65 billion (largest variant).
Definition: A family of decoder-only transformer language models from Meta AI, ranging from 7 billion to 65 billion parameters and trained on publicly available data. The larger variants are competitive with much bigger models on many benchmarks, and the released weights have become a common basis for research and fine-tuned derivatives.
Limitations: Limited interpretability and control over the learned representations due to its highly automated training process.
Training time: not mentioned. Dataset: various large-scale text corpora, including Common Crawl, BooksCorpus, and Wikipedia. Hardware: not specified.

GPT-4 (LLM designed for natural language processing and generation, 2023). Creator: OpenAI. Parameters: not disclosed.
Definition: The successor to GPT-3.5, released by OpenAI in March 2023. It is a large multimodal model that accepts image and text inputs and produces text outputs, with improved reasoning and factuality over earlier GPT models. OpenAI has not disclosed its parameter count, architecture details, training data, or hardware.
Limitations: Can still hallucinate facts and make reasoning errors; technical details are not public.
Training time: not disclosed. Dataset: not disclosed. Hardware: not disclosed.

ERNIE Bot (Conversational LLM for chatbots and dialog systems, 2023). Creator: Baidu. Parameters: not disclosed.
Definition: A chatbot model developed by Baidu, based on the ERNIE pre-training framework. It has been trained on large amounts of dialogue data and can engage in natural-sounding conversations with users.
Limitations: Performance is limited by the quality of the training data it has been exposed to; it may struggle with complex or abstract topics.
Training time: not disclosed. Dataset: not disclosed. Hardware: not disclosed.