Language models and related systems, one record per entry. Each record lists the original table's fields: Model Name; About; Year Built; Number of Parameters; Creator; Limitations; Training Time; Model Definition; Dataset Used for Training; Hardware Configuration.

ALICE (Chatbot, 1995). Creator: Richard Wallace. Parameters: N/A.
Definition: ALICE (Artificial Linguistic Internet Computer Entity) is a chatbot framework that uses AIML (Artificial Intelligence Markup Language) for rule-based natural language responses. Its key features include the ability to engage in conversation, answer questions, and provide information. Its limitations include the need to program rules manually, which is time-consuming and may not cover every scenario, and the inability to learn from new data.
Limitations: Limited to scripted responses; lacks general knowledge; prone to repeating itself.
Training time: 2 weeks. Dataset: Cornell Movie Dialogs. Hardware: not specified.

AWD-LSTM (LSTM-based language model, 2017). Creator: Salesforce Research. Parameters: 33 million (base).
Definition: AWD-LSTM (ASGD Weight-Dropped LSTM) is a language-model architecture that uses LSTM (Long Short-Term Memory) cells with weight dropout and activation dropout to prevent overfitting. Its key feature is improved training on small datasets. Its limitations include potentially slower training and higher memory use than simpler models.
Limitations: Limited to sequential data; requires large amounts of training data.
Training time: 12 hours (on WikiText). Dataset: WikiText-103. Hardware: not specified.

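The weight-dropout idea can be illustrated with a short PyTorch sketch (PyTorch is an assumed library choice here, and the class and variable names are invented for illustration): DropConnect is applied to the hidden-to-hidden weight matrix of an LSTM cell rather than to its activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTMCell(nn.Module):
    """LSTM cell with DropConnect on the recurrent (hidden-to-hidden) weights,
    in the spirit of AWD-LSTM's weight dropout. Simplified sketch, not the
    original implementation."""
    def __init__(self, input_size, hidden_size, weight_dropout=0.5):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.weight_dropout = weight_dropout

    def forward(self, x, state):
        # Mask individual recurrent weights (not activations) during training.
        w_hh = F.dropout(self.cell.weight_hh, p=self.weight_dropout,
                         training=self.training)
        h, c = state
        gates = (x @ self.cell.weight_ih.t() + self.cell.bias_ih
                 + h @ w_hh.t() + self.cell.bias_hh)
        i, f, g, o = gates.chunk(4, dim=1)          # PyTorch gate order: i, f, g, o
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

cell = WeightDropLSTMCell(32, 64)
x = torch.randn(8, 32)
h = c = torch.zeros(8, 64)
h, c = cell(x, (h, c))
print(h.shape)  # torch.Size([8, 64])
```
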
GPT (Transformer-based LM, 2018). Creator: OpenAI. Parameters: 117M.
Definition: GPT (Generative Pre-trained Transformer) is a generative language model that uses the Transformer architecture with self-attention to generate text. Its key feature is the ability to generate coherent and diverse text; its limitation is that it may generate biased or inappropriate text.
Limitations: Limited to language tasks; lacks memory; can generate nonsensical responses.
Training time: 4 days. Dataset: BooksCorpus, English Wikipedia. Hardware: 8 NVIDIA V100 GPUs, 128 GB of GPU memory, 40 CPU cores.

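Since this entry and most of the later ones rely on Transformer self-attention, here is a minimal NumPy sketch of the core operation, scaled dot-product attention. Shapes and values are toy examples only, not anything from the GPT paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional head.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```
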
BERT (Transformer-based LM, 2018). Creator: Google. Parameters: 110M.
Definition: Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based language model designed to pre-train deep bidirectional representations from a large unlabeled text corpus, which can then be fine-tuned for a variety of NLP tasks such as question answering and sentiment analysis. Limitations include its large size and the computational cost of pre-training.
Limitations: Computationally expensive; lacks memory; prone to overfitting; not suited for generative tasks.
Training time: 4 days. Dataset: BooksCorpus, English Wikipedia. Hardware: 64 Cloud TPUs.

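A hedged illustration of BERT's masked-language-modeling objective, assuming the Hugging Face transformers library and the publicly hosted bert-base-uncased checkpoint (neither is mentioned in this table): the model ranks candidate fills for a masked token.

```python
# Requires: pip install transformers torch  (downloads the bert-base-uncased checkpoint)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```
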
AdaNet (AutoML framework for neural architecture search, 2018). Creator: Google. Parameters: varies.
Definition: AdaNet is a neural architecture search algorithm that automatically designs high-quality models by iteratively learning from previous architectures. It trains a custom ensemble of neural networks, each trained with a different architecture and dataset weighting. Its limitations include high computational and time requirements and a dependency on the quality and quantity of training data.
Limitations: Lacks interpretability; requires substantial computational power.
Training time: 2 hours. Dataset: CIFAR-10. Hardware: 1 Google Cloud TPU.

QuickThought (Encoder-based LM, 2018). Creator: Google. Parameters: varies.
Definition: QuickThought is a model that uses unsupervised learning to generate sentence embeddings. Its key feature is the ability to learn from large amounts of unstructured data without labeled examples. Its limitations include a lack of control over the embeddings produced, which may not be optimal for specific downstream tasks.
Limitations: Limited to encoding sentences; not suitable for generation tasks.
Training time: 1 week. Dataset: BookCorpus. Hardware: not specified.

ELMo (Contextual LM, 2018). Creator: Allen Institute for AI. Parameters: 94 million.
Definition: ELMo (Embeddings from Language Models) is a deep contextualized word-representation model that uses a bidirectional language model to generate word embeddings that capture contextual information.
Limitations: Computationally expensive; not suitable for generation tasks; lacks interpretability.
Training time: 3 days. Dataset: Common Crawl. Hardware: 16 NVIDIA P100 GPUs.

LASER (Encoder-based LM, 2018). Creator: Facebook AI Research (FAIR). Parameters: varies.
Definition: LASER (Language-Agnostic SEntence Representations) is a multilingual sentence-embedding model pre-trained on a wide range of languages.
Limitations: Limited to encoding sentences; not suitable for generation tasks.
Training time: 10 days. Dataset: Common Crawl. Hardware: 8 NVIDIA V100 GPUs, 192 GB of GPU memory, 40 CPU cores.

COCOA (Encoder-based LM, 2018). Creator: Carnegie Mellon University. Parameters: varies.
Definition: COCOA (Conversational Contextual Cues for Online Abuse) is a language model designed to detect and flag online abuse in conversations. Its key feature is the ability to analyze conversational context and detect abusive language. Its limitations include the potential for false positives or false negatives and the need for ongoing training and updating to keep up with evolving language patterns and new types of abuse.
Limitations: Limited to sentence-classification tasks; requires substantial computational power.
Training time: 1 week. Dataset: COCOA dataset. Hardware: 8 NVIDIA V100 GPUs, 128 GB of GPU memory, 40 CPU cores.

GPT-2 (Transformer-based LM, 2019). Creator: OpenAI. Parameters: 1.5B.
Definition: A transformer-based language model capable of generating high-quality natural-language text. Key features include a large number of parameters (up to 1.5 billion), multi-head attention, and stacked transformer blocks. Its limitations include the potential for bias and ethical concerns in generated text.
Limitations: Computationally expensive; prone to generating biased or offensive language; offers little control over generated text.
Training time: 4 days (on WebText). Dataset: WebText, Books1, Books2, and English Wikipedia. Hardware: 8 NVIDIA V100 GPUs.

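For a concrete sense of what such a generative model does, the sketch below samples a continuation from the publicly released GPT-2 weights via the Hugging Face transformers pipeline (an assumed tooling choice, not something specified in this table).

```python
# Requires: pip install transformers torch  (downloads the gpt2 checkpoint)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Large language models are", max_new_tokens=30,
                do_sample=True, num_return_sequences=1)
print(out[0]["generated_text"])
```
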
XLNet (Transformer-based LM, 2019). Creator: Google. Parameters: 340M.
Definition: XLNet is a language model that uses a permutation-based training objective over the input sequence, allowing it to capture bidirectional relationships while avoiding the drawbacks of standard masked bidirectional models. Its key feature is its ability to capture complex relationships in the input data; its limitation is that it is computationally expensive.
Limitations: Computationally expensive; lacks memory; prone to overfitting; not suited for generative tasks.
Training time: 5 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 64 Cloud TPUs or 16 V100 GPUs.

RoBERTa (Transformer-based LM, 2019). Creator: Facebook. Parameters: 355M.
Definition: RoBERTa (Robustly Optimized BERT Pretraining Approach) is a language model that uses the BERT architecture with additional optimization techniques (longer training, larger batches, dynamic masking), resulting in improved performance on various NLP tasks. Its key feature is robustness and improved performance over BERT on many tasks; its limitation is that it is computationally expensive.
Limitations: Computationally expensive; lacks memory; not suitable for generation tasks.
Training time: 4 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 32 Cloud TPUs or 16 V100 GPUs.

ALBERT (Transformer-based LM, 2019). Creator: Google. Parameters: 11M.
Definition: A Lite BERT (ALBERT) is a lightweight version of BERT that reduces model size (via parameter sharing and factorized embeddings) and improves training efficiency while achieving similar or better performance than BERT. Limitations include longer training times than BERT for comparable accuracy and trade-offs between model size and performance.
Limitations: Computationally expensive; lacks memory; not suitable for generation tasks.
Training time: 3.5 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 64 Cloud TPUs or 16 V100 GPUs.

CTRL (Transformer-based LM, 2019). Creator: Salesforce. Parameters: 110M.
Definition: CTRL (Conditional Transformer Language Model) is a text-generation model that can be conditioned on control codes specifying a task, domain, or topic.
Limitations: Computationally expensive; requires domain-specific control codes or prompts.
Training time: 4 days. Dataset: WebText and Books1. Hardware: 16 V100 GPUs.

UniLM (Transformer-based LM, 2019). Creator: Microsoft. Parameters: 340M.
Definition: UniLM (Unified Language Model) is a language model that can perform both language generation and comprehension tasks by pre-training on masked language modeling and sequence-to-sequence objectives. Its key feature is the ability to handle both kinds of task with high accuracy; its limitation is that it requires a large amount of training data and computational resources.
Limitations: Computationally expensive; requires domain-specific prompts.
Training time: 3 days (on Wikipedia). Dataset: English Wikipedia and BookCorpus. Hardware: 16 V100 GPUs.

ERNIE (Transformer-based LM, 2019). Creator: Baidu. Parameters: 340M.
Definition: ERNIE (Enhanced Representation through kNowledge IntEgration) is a pre-trained language model that integrates knowledge graphs into its training to enhance its understanding of concepts and entities.
Limitations: Limited to the Chinese language.
Training time: 2.5 days. Dataset: Chinese Wikipedia, Baidu Baike, and BookCorpus (Chinese). Hardware: 8 NVIDIA V100 GPUs.

Flair (Non-transformer-based LM, 2019). Creator: Zalando Research. Parameters: 134M.
Definition: Flair is a language model that combines traditional word embeddings with contextual embeddings to improve performance on NLP tasks. Its key feature is its ability to capture complex relationships and context in language data; its limitation is that it may not perform as well on very large datasets.
Limitations: Limited to NLP tasks.
Training time: not mentioned in the research paper. Dataset: WikiText-2, IMDb, and English Wikipedia. Hardware: not specified.

DistilBERT (Transformer-based LM, 2019). Creator: Hugging Face. Parameters: 66M.
Definition: DistilBERT is a distilled version of BERT with a smaller model size and faster inference while maintaining similar performance on various NLP tasks. Limitations include a trade-off between model size and performance, with a slight decrease in accuracy compared to BERT.
Limitations: Reduced accuracy compared to BERT.
Training time: 39 minutes. Dataset: BooksCorpus and English Wikipedia. Hardware: CPU and GPU.

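Knowledge distillation of the kind used for DistilBERT can be sketched as a loss that blends soft teacher targets with hard labels. The PyTorch snippet below is a simplified, generic version, not the exact DistilBERT training objective (which also adds a cosine embedding loss between hidden states).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend KL divergence against the teacher's temperature-softened
    distribution with ordinary cross-entropy against the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale to keep gradients comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)  # toy student logits
teacher = torch.randn(4, 10)                      # toy teacher logits
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```
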
XLM-RoBERTa (Transformer-based LM, 2019). Creator: Facebook. Parameters: 550M.
Definition: XLM-RoBERTa is a cross-lingual language model that is pre-trained on a large multilingual corpus and fine-tuned for specific tasks.
Limitations: Limited to NLP tasks.
Training time: 4 days. Dataset: text in 100+ languages, including English Wikipedia. Hardware: 64 Cloud TPUs or 16 V100 GPUs.

MASS (Transformer-based LM, 2019). Creator: Microsoft Research Asia. Parameters: 110 million.
Definition: MASS (Masked Sequence-to-Sequence Pre-training) is a sequence-to-sequence pre-training approach that masks a fragment of the input and trains the model to predict the masked tokens.
Limitations: Limited to machine translation and text generation.
Training time: 3 days. Dataset: BookCorpus and English Wikipedia. Hardware: 64 NVIDIA V100 GPUs.

SpanBERT (Transformer-based LM, 2019). Creator: Facebook AI Research. Parameters: 108 million.
Definition: A language model from Facebook AI Research that extends the BERT architecture by masking and predicting contiguous spans of text, capturing relationships between different spans. It has shown strong performance on several NLP benchmarks, including coreference resolution and named entity recognition. Its performance is bounded by the quality and size of its training data, and it may not generalize well to tasks outside that data.
Limitations: Limited to NLP tasks.
Training time: 4 days. Dataset: BookCorpus and English Wikipedia. Hardware: 32 NVIDIA V100 GPUs.

MobileBERT (Transformer-based LM, 2019). Creator: Google. Parameters: 25 million.
Definition: A compact version of BERT optimized for mobile and embedded devices. It uses a task-specific loss function to further optimize efficiency. Limitations include reduced performance compared to full-size BERT and accuracy trade-offs due to its smaller size.
Limitations: Reduced accuracy compared to BERT.
Training time: 4 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 8 TPUs and 2 GPUs.

Funnel Transformer (Transformer-based LM, 2019). Creator: Google Research and Carnegie Mellon University. Parameters: 175 million.
Definition: A hierarchical architecture that reduces the computational and memory requirements of large-scale language models by progressively pooling (compressing) the sequence of hidden states in deeper layers. Limitations include the need for specialized hardware and potential issues with capturing long-term dependencies in language.
Limitations: Limited to NLP tasks.
Training time: 3 days. Dataset: BooksCorpus and English Wikipedia. Hardware: 64 NVIDIA V100 GPUs.

SuperGLUE (NLP benchmark suite, 2019). Creator: New York University, Facebook AI Research, University of Washington, and DeepMind. Parameters: varies (benchmark, not a model).
Definition: A benchmark suite that evaluates NLP models on more challenging tasks than previous benchmarks such as GLUE. It includes a diverse set of tasks such as natural language inference, question answering, and coreference resolution.
Limitations: Limited to NLP tasks; it is a benchmark suite rather than a language model, so reported performance depends on the specific models evaluated against it.
Training time: 2-3 days (for baseline models). Dataset: various, including MultiNLI and RACE. Hardware: 8 NVIDIA V100 GPUs.

Transformer-XL (Transformer-based LM, 2019). Creator: Google. Parameters: 257 million.
Definition: Transformer-XL is a language model that adds segment-level recurrence (a cache of previous hidden states) and relative positional encodings to the Transformer, overcoming the fixed-length context limitation of standard Transformers. Its key feature is the ability to capture longer-term dependencies in language data; its limitation is that it can be computationally expensive.
Limitations: Limited to sequence-modeling tasks.
Training time: 2.5 days. Dataset: various, including WikiText-103. Hardware: 16 NVIDIA V100 GPUs.

BART (Transformer-based LM, 2019). Creator: Facebook AI Research. Parameters: 406 million (base).
Definition: A sequence-to-sequence model that combines a bidirectional encoder with an autoregressive decoder for pre-training and fine-tuning. It can handle tasks such as text generation and summarization. Its limitations include high computational requirements and the potential for bias in generated text.
Limitations: Limited to NLP tasks.
Training time: 3 days. Dataset: various, including CNN/Daily Mail and XSum. Hardware: 16 NVIDIA V100 GPUs.

TAPAS (Transformer-based LM, 2019). Creator: Google. Parameters: 220 million.
Definition: A model designed for table-based question answering. It extends BERT-style encoding with table-aware embeddings (row, column, and rank information) to enable reasoning over tables. Limitations include the need for task-specific training data and potential issues with tables that have complex structures.
Limitations: Limited to table-based question-answering tasks.
Training time: 1 day. Dataset: various, including WikiTables. Hardware: 16 NVIDIA V100 GPUs.

XLM-R (Transformer-based LM, 2019). Creator: Facebook AI Research. Parameters: 550 million (base).
Definition: XLM-R (Cross-lingual Language Model - RoBERTa) is a language model pre-trained on text in many languages, allowing it to perform well on cross-lingual NLP tasks. Its key feature is strong multilingual performance; its limitation is that it can be computationally expensive.
Limitations: Limited to NLP tasks.
Training time: 3 days. Dataset: various, including WikiMatrix and the UN Parallel Corpus. Hardware: 8 NVIDIA V100 GPUs.

GPT-3 (Transformer-based LM, 2020). Creator: OpenAI. Parameters: 175B.
Definition: A large transformer-based language model with up to 175 billion parameters. It can generate high-quality natural-language text and perform various language tasks such as translation and question answering, often from only a few in-context examples. Its limitations include the potential for bias and ethical concerns in generated text.
Limitations: Limited to NLP tasks.
Training time: not mentioned in the research paper. Dataset: various, including Common Crawl and Wikipedia. Hardware: 355 TFLOPs.

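GPT-3 popularized in-context ("few-shot") prompting, where the task is specified entirely in the prompt with a handful of examples and no gradient updates. The sketch below only builds such a prompt string; the generation step is left abstract because any completion model or API could consume it.

```python
# Few-shot sentiment classification prompt in the GPT-3 style (illustrative only).
examples = [
    ("I loved this movie!", "positive"),
    ("The plot was dull and predictable.", "negative"),
]
query = "An absolute masterpiece of filmmaking."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)   # send this string to a completion model; the next token it
                # generates ("positive"/"negative") is the prediction
```
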
T5 (Transformer-based LM, 2020). Creator: Google. Parameters: 11B.
Definition: A transformer-based encoder-decoder architecture that casts every task as text-to-text. It can perform tasks such as translation, summarization, and question answering. Its limitations include the need for large amounts of training data and compute.
Limitations: Limited to text-to-text tasks.
Training time: 4 days. Dataset: C4, WebText, and books. Hardware: 64 TPU v3.

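T5's text-to-text framing can be tried directly with the publicly released t5-small checkpoint, assuming the Hugging Face transformers and sentencepiece packages are available (an assumption, not something stated in this table): the task is selected by a plain-text prefix.

```python
# Requires: pip install transformers torch sentencepiece  (downloads t5-small)
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
# T5 casts every task as text-to-text, selected by a task prefix.
print(t5("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t5("summarize: The quick brown fox jumped over the lazy dog again and again, "
         "until the dog finally moved out of the way.")[0]["generated_text"])
```
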
ELECTRA (Transformer-based LM, 2020). Creator: Google. Parameters: 110M.
Definition: ELECTRA is a pre-training approach that replaces a portion of the input tokens with plausible substitutes from a small generator and trains a discriminator to distinguish original from replaced tokens, resulting in better sample efficiency and accuracy than BERT at comparable compute. Limitations include increased complexity in the pre-training setup and potential loss of information due to the replacement of tokens.
Limitations: Limited to NLP tasks.
Training time: 4 days. Dataset: Wikipedia and BooksCorpus. Hardware: 8 TPU v3.

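ELECTRA's discriminator can be probed with the publicly released google/electra-small-discriminator checkpoint, again assuming the Hugging Face transformers library (an assumed tooling choice): it emits one logit per token, with higher values indicating tokens the model believes were replaced.

```python
# Requires: pip install transformers torch  (downloads google/electra-small-discriminator)
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

sentence = "The chef cooked the meal in the spaceship."   # deliberately odd context
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits[0]                    # per-token replaced-token scores
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{tok:>12}  {s.item():+.2f}")
```
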
DeBERTa (Transformer-based LM, 2020). Creator: Microsoft. Parameters: 345M.
Definition: DeBERTa (Decoding-enhanced BERT with disentangled attention) is a transformer-based language model that improves on BERT and RoBERTa with a disentangled attention mechanism (separate content and position representations) and an enhanced mask decoder, enhancing performance on a wide range of NLP tasks. Limitations include longer training times and increased complexity compared to simpler pre-training approaches.
Limitations: Limited to NLP tasks.
Training time: 7 days (large model). Dataset: various, including English Wikipedia and OpenWebText. Hardware: 128 GPUs.

MiniLM (Transformer-based LM, 2020). Creator: Microsoft. Parameters: 50M.
Definition: MiniLM is a compact version of the BERT language model, distilled to reduce model size and improve efficiency while maintaining similar performance on various NLP tasks. Limitations include trade-offs between model size and performance and potential difficulties in adapting the model to specific tasks.
Limitations: Reduced accuracy compared to BERT.
Training time: 6 hours. Dataset: various, including English Wikipedia and BookCorpus. Hardware: 8 V100 GPUs.

Longformer (Transformer-based LM, 2020). Creator: Allen Institute for AI. Parameters: 400M.
Definition: A Transformer variant that enables processing of long sequences (up to 4,096 tokens). It uses a combination of sliding-window attention and global attention to reduce computation and memory requirements. Limitations include slower training and inference times than standard Transformers on short inputs.
Limitations: Limited to sequence-modeling tasks.
Training time: 1 hour per epoch. Dataset: various, including English Wikipedia and Common Crawl. Hardware: 8 V100 GPUs.

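The sliding-window-plus-global attention pattern can be visualized with a small NumPy sketch that builds the boolean attention mask; the window size and global-token positions below are illustrative values, not the Longformer defaults.

```python
import numpy as np

def sliding_window_mask(seq_len, window, global_tokens=(0,)):
    """Boolean mask: each token attends to a local window, and designated tokens
    (e.g. a [CLS]-like token) attend and are attended to globally.
    A simplified sketch of Longformer-style sparse attention."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    for g in global_tokens:
        mask[g, :] = True                     # global token sees everything
        mask[:, g] = True                     # everything sees the global token
    return mask

m = sliding_window_mask(seq_len=12, window=2)
print(m.sum(), "allowed attention pairs out of", m.size)
```
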
MBART (Transformer-based LM, 2020). Creator: Facebook. Parameters: 6B.
Definition: A multilingual variant of BART capable of translating between 25 languages. It uses a shared encoder-decoder architecture and a multilingual denoising pre-training objective to improve cross-lingual performance. Limitations include reduced performance on certain language pairs and potential issues with preserving language-specific nuances during translation.
Limitations: Limited to machine-translation tasks.
Training time: 4 days. Dataset: various, including Wikipedia and Common Crawl. Hardware: 8 TPU v3.

Marian (Neural machine translation framework, 2020). Creator: Microsoft. Parameters: 56M.
Definition: A fast and efficient neural machine translation framework. It uses sequence-to-sequence architectures with attention mechanisms for improved translation quality and speed. Limitations include the need for specialized hardware for larger models and potential issues with long, complex sentences.
Limitations: Not specified.
Training time: not specified. Dataset: various, including JW300 and OpenSubtitles. Hardware: unknown.

ProphetNet (Transformer-based LM, 2020). Creator: Microsoft Research Asia. Parameters: 554 million.
Definition: A model designed for both sequence-to-sequence tasks and language modeling. It uses a future n-gram prediction objective with an n-stream self-attention mechanism to improve long-term sequence modeling. Limitations include the need for large amounts of training data to achieve optimal performance and potential issues with capturing long-term dependencies.
Limitations: Limited to sequence-modeling tasks.
Training time: not mentioned in the research paper. Dataset: various, including English Wikipedia and BooksCorpus. Hardware: 32 V100 GPUs.

Performer (Transformer-based LM, 2020). Creator: Google. Parameters: 37 million (base).
Definition: A transformer-based model whose attention mechanism (FAVOR+, based on random feature maps) approximates softmax attention in linear rather than quadratic time. It achieves strong performance on several benchmarks while being more computationally efficient than standard Transformers. Its main limitation is that the approximation may fall short of exact self-attention on tasks that require very precise attention patterns.
Limitations: Limited to sequence-modeling tasks.
Training time: 4.5 hours. Dataset: various, including WMT16 and WMT18 news translation. Hardware: 1 V100 GPU.

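The Performer's linear-time attention rests on random feature maps whose dot products approximate the softmax kernel, so attention can be computed without ever materializing the full n x n score matrix. The NumPy sketch below is a heavily simplified version of that idea, with constants and stabilization tricks omitted, not the full FAVOR+ algorithm.

```python
import numpy as np

def positive_random_features(x, W):
    """Positive random features so that phi(q) . phi(k) approximates exp(q . k)
    (simplified; normalization constants omitted)."""
    proj = x @ W.T                                              # (n, m)
    return np.exp(proj - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(W.shape[0])

def linear_attention(Q, K, V, n_features=256, seed=0):
    d = Q.shape[-1]
    Q, K = Q / d ** 0.25, K / d ** 0.25                         # fold in 1/sqrt(d) scaling
    W = np.random.default_rng(seed).normal(size=(n_features, d))
    q, k = positive_random_features(Q, W), positive_random_features(K, W)
    # Associativity is the trick: compute k^T V first, never the n x n matrix.
    kv = k.T @ V                                                # (m, d_v)
    z = q @ k.sum(axis=0)                                       # per-row normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(linear_attention(Q, K, V).shape)                          # (6, 8)
```
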
Meena (Conversational AI, 2020). Creator: Google. Parameters: 2.6 billion (base).
Definition: A conversational AI model that uses a seq2seq architecture to generate natural-sounding, multi-turn responses to user inputs. It has been shown to be more sensible and specific than earlier open-domain chatbots. One limitation is that it requires a large amount of data and computational resources to train effectively.
Limitations: May have ethical implications.
Training time: need to confirm. Dataset: multi-repository corpus with 341 GB of text. Hardware: 2,048-core TPU v3.

MASSIVE (Multilingual model for Asian languages, 2020). Creator: Microsoft Research Asia. Parameters: 400 million.
Definition: A transformer-based model for multi-agent reinforcement learning that can handle large state and action spaces. It has achieved state-of-the-art performance on several benchmarks. Its main limitation is that it can be computationally expensive, especially with large state and action spaces.
Limitations: Focuses mainly on Asian languages.
Training time: not mentioned in the research paper. Dataset: English Wikipedia + BookCorpus. Hardware: 1,024 V100 GPUs.

SemBERT (Semantic understanding, 2020). Creator: Zhejiang University. Parameters: 340 million (base).
Definition: A model designed for joint modeling of semantic and syntactic information in text. It uses a BERT-like architecture with additional semantic and syntactic modules to enable fine-grained analysis of text. Limitations include the need for specialized training data for semantic and syntactic parsing tasks and potential issues with model interpretability.
Limitations: Limited to the English and Chinese languages.
Training time: 40 hours. Dataset: English Wikipedia + BookCorpus. Hardware: 64 V100 GPUs.

RAG (Retrieval-Augmented Generation, 2020). Creator: Facebook AI Research. Parameters: 550 million (base).
Definition: A transformer-based model that performs retrieval-augmented generation: a dense retriever fetches relevant passages from a large text index, and a seq2seq generator conditions on them to produce more precise, informative output. Its main limitation is that it can be computationally expensive, especially with large document indexes.
Limitations: Limited to short-text generation tasks.
Training time: 2 days. Dataset: English Wikipedia + Common Crawl. Hardware: 8 V100 GPUs (fine-tuning), 64 V100 GPUs (pre-training).

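The retrieve-then-generate loop can be sketched without any particular library: embed the query, rank passages by similarity, and prepend the winners to the generator's prompt. In the sketch below the embedding is a deterministic hash-based stand-in for a real dense retriever such as DPR, and the final generation step is left abstract.

```python
import zlib
import numpy as np

# Toy retrieval-augmented generation loop (illustrative only).
passages = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "The Great Barrier Reef lies off the coast of Australia.",
]

def embed(text, dim=64):
    """Deterministic bag-of-words hash embedding, a stand-in for a neural encoder."""
    vec = np.zeros(dim)
    for word in text.lower().replace(".", "").replace("?", "").split():
        rng = np.random.default_rng(zlib.crc32(word.encode()))
        vec += rng.normal(size=dim)
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query, k=1):
    sims = [float(embed(query) @ embed(p)) for p in passages]
    return [passages[i] for i in np.argsort(sims)[::-1][:k]]

query = "Where is the Eiffel Tower?"
context = " ".join(retrieve(query, k=1))
prompt = f"context: {context} question: {query}"
print(prompt)   # this prompt would be fed to a seq2seq generator (e.g. BART)
```
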
PEGASUS (Abstractive text summarization, 2020). Creator: Google. Parameters: 568 million (large).
Definition: A transformer-based sequence-to-sequence model designed for text summarization. It uses a self-supervised pre-training objective called gap-sentence generation, in which important sentences are removed from a document and the model learns to regenerate them. Its limitations include the need for large amounts of training data and the potential for bias in generated text.
Limitations: Heavy computational resource requirements.
Training time: 9.5 days. Dataset: English Wikipedia + BooksCorpus + OpenWebText + GitHub. Hardware: 32 V100 GPUs.

GShard-OLM (Cross-lingual language modeling, 2020). Creator: Google. Parameters: 600 billion.
Definition: A large-scale pre-training method for natural language generation built on the GShard architecture, which shards a sparsely activated mixture-of-experts model across many accelerators. Its limitation is that it requires massive computational resources for pre-training.
Limitations: Requires large-scale distributed training.
Training time: not mentioned in the research paper. Dataset: large text corpora. Hardware: 2,048 cores (training), 512 cores (evaluation).

X-Transformer (Cross-lingual language modeling, 2020). Creator: Huawei Noah's Ark Lab. Parameters: 102 million (base).
Definition: A language model that extends the Transformer architecture by introducing cross-attention between different layers of the model. It has shown strong performance on several NLP benchmarks, including machine translation and language understanding.
Limitations: Requires specialized hardware for training.
Training time: not mentioned in the research paper. Dataset: large-scale text corpora. Hardware: 512 V100 GPUs.

RoBERTa-large (Robustly Optimized BERT Pretraining Approach, large variant, 2020). Creator: Facebook AI Research (FAIR). Parameters: 355M.
Definition: A larger version of RoBERTa with 355 million parameters. It has been trained on massive amounts of data and shows strong performance on several NLP benchmarks, including GLUE and SuperGLUE. Its large size makes it computationally expensive, and it may not suit all applications because of its high memory requirements.
Limitations: High computational requirements.
Training time: 4 days. Dataset: English Wikipedia + Common Crawl. Hardware: 8 V100 GPUs.

ERNIE 2.0 (Enhanced Representation through Knowledge Integration, version 2.0, 2020). Creator: Baidu. Parameters: 2.8B.
Definition: An improved version of the ERNIE language model that integrates knowledge graphs and improves representation learning. It handles various natural language tasks such as sentiment analysis and named entity recognition. Its limitations include the need for large amounts of training data and the potential for bias in generated text.
Limitations: Limited to Chinese-language processing.
Training time: 1 day. Dataset: large-scale Chinese corpora + English Wikipedia + BookCorpus. Hardware: 128 V100 GPUs.

SEQUENCE-TO-SEQUENCE Transformer (S2S-T5) (Sequence-to-sequence generation, 2020). Creator: Microsoft. Parameters: 11B.
Definition: A transformer-based architecture designed for sequence-to-sequence tasks such as machine translation and summarization. It achieves state-of-the-art performance on several benchmarks and can be fine-tuned for specific tasks. Limitations include high computational requirements and large model size.
Limitations: Requires large-scale distributed training.
Training time: not mentioned in the research paper. Dataset: English Wikipedia + CC-News + OpenWebText + StoriesCorpus. Hardware: 64 V100 GPUs (pre-training), 1 V100 GPU (fine-tuning).

MT5 (Multilingual T5, 2020). Creator: Google. Parameters: 14.7B.
Definition: A multilingual version of T5 that handles tasks across more than 100 languages. Its key features include a shared encoder-decoder architecture and broad language coverage. Its limitations include the need for large amounts of multilingual training data.
Limitations: Heavy computational resource requirements.
Training time: not mentioned in the research paper. Dataset: mC4 (multilingual C4) dataset. Hardware: 2,048 TPU v3 cores.

LaBSE (Language-agnostic BERT Sentence Embedding, 2020). Creator: Google. Parameters: 247M.
Definition: LaBSE is a cross-lingual model that produces high-quality sentence embeddings for over 100 languages, enabling language-agnostic text analysis such as cross-lingual retrieval and sentence mining. Limitations include potential difficulties with low-resource languages and the need for large amounts of pre-training data.
Limitations: Limited to sentence-level embeddings.
Training time: not mentioned in the research paper. Dataset: multilingual Wikipedia, OSCAR, and the Common Crawl corpus. Hardware: not specified.

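Cross-lingual sentence embeddings of this kind are easy to try via the sentence-transformers package and its hosted LaBSE checkpoint (an assumed tooling choice, not part of this table): semantically equivalent sentences in different languages land close together in the embedding space.

```python
# Requires: pip install sentence-transformers  (downloads the LaBSE checkpoint)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
sentences = ["The weather is lovely today.",      # English
             "Das Wetter ist heute herrlich.",    # German paraphrase
             "I left my keys at the office."]     # unrelated English sentence
emb = model.encode(sentences, normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high: cross-lingual paraphrase
print(util.cos_sim(emb[0], emb[2]).item())  # lower: different meaning
```
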
Wu Dao 2.0 (Chinese language processing, 2020). Creator: Beijing Academy of Artificial Intelligence (BAAI). Parameters: 1.75T.
Definition: A very large pre-trained multimodal language model developed by the Beijing Academy of Artificial Intelligence, with roughly 1.75 trillion parameters in a mixture-of-experts design. It handles Chinese and English text as well as text-image tasks. Limitations include limited accessibility for researchers outside China and very high training costs.
Limitations: High computational resource requirements.
Training time: not mentioned in the research paper. Dataset: proprietary dataset of mixed-domain Chinese text. Hardware: 1,024 Ascend 910 AI processors.

GShard (Cross-lingual language modeling, 2021). Creator: Google. Parameters: 600B.
Definition: A large-scale language model and training framework that shards a sparsely activated mixture-of-experts Transformer across many accelerators, enabling training on massive datasets. It has been shown to achieve state-of-the-art performance on a range of natural language processing tasks, most notably multilingual machine translation.
Limitations: Requires large-scale distributed training.
Training time: not mentioned in the research paper. Dataset: large-scale text corpora. Hardware: 2,048 cores (training), 512 cores (evaluation).

Swin Transformer (Image processing, 2021). Creator: Microsoft Research Asia. Parameters: 200 million (base).
Definition: A hierarchical vision Transformer that computes self-attention within shifted local windows, combining local and global context efficiently. Its limitation is that attention coverage can be limited for very long sequences.
Limitations: Limited to image-processing tasks.
Training time: not mentioned in the research paper. Dataset: ImageNet, COCO, and various object-detection datasets. Hardware: 1,024 V100 GPUs.

Speech-CLIP (Speech recognition and vision, 2021). Creator: Facebook AI Research. Parameters: 400 million.
Definition: A model that combines visual and audio information to enable tasks such as audio-visual event localization and sound separation. It uses contrastive learning to learn audio and visual representations jointly. One limitation is that it requires a large amount of data to train effectively.
Limitations: Limited to audio and visual processing.
Training time: not mentioned in the research paper. Dataset: various audio datasets. Hardware: not specified.

Turing NLG (Neural language generation, 2021). Creator: Microsoft Research. Parameters: 2.7 billion (base).
Definition: A neural language generation model designed for text completion and summarization tasks. Its limitation is that it may produce outputs that lack coherence and context.
Limitations: May produce outputs that lack coherence and context.
Training time: not specified. Dataset: various dialogue datasets. Hardware: 24 V100 GPUs.

BlenderBot (Conversational LLM for chatbots and dialog systems, 2021). Creator: Facebook AI Research. Parameters: 4 billion (small).
Definition: A transformer-based chatbot developed by Facebook AI Research that can engage in open-domain conversations with humans. It uses a large-scale generative model trained on a diverse set of conversation topics. Its limitations include the potential for generating inappropriate or biased responses.
Limitations: Requires a large amount of computational resources for training.
Training time: 3 weeks. Dataset: a combination of existing conversational and social-media datasets. Hardware: 2,048 TPU v3.

Evolved (LLM for long-form content generation and summarization, 2021). Creator: Google. Parameters: 2 billion (base).
Definition: An architecture-search method for Transformers that optimizes the architecture for a given task. Its limitation is that the search process is computationally expensive.
Limitations: Not suitable for small-scale or low-resource environments.
Training time: not mentioned. Dataset: C4, CC-News, OpenWebText. Hardware: 512 V100 GPUs.

PLATO (Large-scale LLM for dialogue generation, 2021). Creator: Baidu. Parameters: 34 billion (base).
Definition: A large-scale pre-trained model for multi-turn dialogue generation that incorporates knowledge distillation and a hierarchical Transformer architecture. Its limitation is that it requires a significant amount of pre-training data.
Limitations: Limited to generating human-like responses in a few specific domains.
Training time: not mentioned. Dataset: a large-scale proprietary dataset of diverse user interactions. Hardware: 512 V100 GPUs (pre-training).

AdaLM (Multilingual LLM for natural language processing, 2021). Creator: RWTH Aachen University. Parameters: 1.3 billion (base).
Definition: A large-scale adaptive language model that dynamically adjusts its parameters based on the input sequence. Its limitation is that it requires a large amount of computation for training and inference.
Limitations: Not explicitly designed for any specific task.
Training time: 7 days. Dataset: proprietary dataset of user-generated text. Hardware: 256 V100 GPUs.

GPT-Neo (LLM designed for natural language processing and generation, 2021). Creator: EleutherAI. Parameters: 2.7 billion (base).
Definition: GPT-Neo is an open-source transformer-based language model from EleutherAI that follows the GPT architecture and is trained on The Pile, a large and diverse open dataset, resulting in solid performance on various NLP tasks. Limitations include potential difficulties in adapting to specific domains and the need for large amounts of training data.
Limitations: Can exhibit gender and racial bias.
Training time: not mentioned. Dataset: The Pile. Hardware: up to 2,048 V100 GPUs or 512 CPUs.

Codex (LLM designed for programming-language processing, 2021). Creator: OpenAI. Parameters: 6 billion (base).
Definition: Codex is a language model developed by OpenAI that is fine-tuned on source code, enabling programming tasks such as code completion and generating code from natural-language descriptions; it is the model behind GitHub Copilot. Limitations include uneven coverage across programming languages and potential difficulties in adapting to new programming paradigms.
Limitations: Limited to the specific task of programming-language processing.
Training time: not mentioned. Dataset: public code repositories and Stack Overflow. Hardware: not specified.

MASS-T5 (Multilingual LLM for natural language processing and generation, 2021). Creator: Microsoft Research Asia. Parameters: 14.5 billion (base).
Definition: A large-scale language model based on the transformer architecture, trained on a diverse range of tasks including language modeling, translation, and summarization. It performs well on a wide range of natural language processing tasks.
Limitations: Limited to the T5 architecture and focused on English.
Training time: not mentioned. Dataset: C4 dataset. Hardware: 1,024 V100 GPUs.

BigBird (LLM designed for long-form content processing, 2021). Creator: Google Research. Parameters: 1.3B.
Definition: BigBird is a transformer-based model that uses a sparse attention mechanism (a mix of local, random, and global attention) to reduce the quadratic cost of full self-attention, allowing it to process sequences of up to 4,096 tokens. An implementation is available in the Hugging Face Transformers library.
Limitations: Requires high computational resources and time for training.
Training time: not mentioned. Dataset: English Wikipedia, GigaWord, ClueWeb, and Common Crawl. Hardware: 512 V100 GPUs.

DALL-E 2 (Text-to-image generation model, 2022). Creator: OpenAI. Parameters: 250M.
Definition: A large-scale generative model developed by OpenAI that creates images from textual descriptions. It uses CLIP text and image embeddings together with a diffusion-based decoder to produce images. Its limitations include the potential for generating inappropriate or biased images.
Limitations: Limited to image generation from text prompts.
Training time: not mentioned. Dataset: a curated subset of image-text pairs from the internet and CC-WebVideo. Hardware: 8,192 TPUs.

GShard-XL (Large-scale LLM with cross-lingual pre-training, 2021). Creator: Google. Parameters: 6.9B.
Definition: A transformer-based model built on the GShard approach, which allows training on extremely large datasets by sharding model parameters across multiple devices. The different GShard variants correspond to different parameter counts and dataset sizes. Their main limitation is that they require specialized infrastructure to train and use effectively.
Limitations: Limited to the GShard architecture and focused on English.
Training time: not mentioned. Dataset: large-scale text corpora. Hardware: 8,192 cores (training), 2,048 cores (evaluation).

UniLMv2 (LLM for natural language processing and generation, 2022). Creator: Microsoft. Parameters: 340M.
Definition: An extension of the UniLM model that incorporates pre-training on multiple objectives to improve performance on downstream tasks. It has achieved state-of-the-art performance on several benchmarks. One limitation is that it may require more computational resources to train and use than other models.
Limitations: Limited to the UniLM architecture and focused on English.
Training time: not mentioned. Dataset: English Wikipedia, BooksCorpus, OpenWebText, STS-B, SQuAD, and a QA dataset. Hardware: 16 V100 GPUs, 16 TPUs, or 8 GPUs.

GShard-3B (Large-scale LLM with unsupervised training, 2022). Creator: Google. Parameters: 3B.
Definition: A large-scale multilingual model with 3 billion parameters trained with Google's GShard distributed-training system. It can handle 499 different languages and shows state-of-the-art performance on several NLP tasks. Due to its large size, it is computationally expensive and requires specialized hardware for training and inference.
Limitations: Limited to the GShard architecture and focused on English.
Training time: not mentioned. Dataset: large-scale text corpora. Hardware: 2,048 cores (training), 512 cores (evaluation).

GShard-13B (Large-scale LLM with unsupervised training, 2022). Creator: Google. Parameters: 13B.
Definition: A further extension of GShard-3B with 13 billion parameters, making it one of the larger models in this family. It can handle even more languages than GShard-3B and has shown strong performance on various NLP benchmarks. As with GShard-3B, it requires significant computational resources and specialized hardware for training and inference.
Limitations: Limited to the GShard architecture and focused on English.
Training time: not mentioned. Dataset: large-scale text corpora. Hardware: 16,384 cores (training), 4,096 cores (evaluation).

Chinchilla (LLM for natural language understanding and generation, 2022). Creator: DeepMind. Parameters: 70 billion.
Definition: A 70-billion-parameter language model from DeepMind trained with a compute-optimal balance between model size and number of training tokens (roughly 20 tokens per parameter). Despite being smaller than contemporaries such as Gopher, it matches or outperforms them on many benchmarks. Its limitations include the need for very large training corpora and substantial compute.
Limitations: Requires very large amounts of training data and compute.
Training time: not mentioned. Dataset: BookCorpus, Wikipedia, Common Crawl, and OpenWebText. Hardware: 8 TPUs.

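The compute-optimal recipe behind Chinchilla is often reduced to a rule of thumb of roughly 20 training tokens per parameter, with training cost approximated as 6 FLOPs per parameter per token. The snippet below applies those heuristics; the numbers are back-of-the-envelope estimates, not figures taken from the paper's tables.

```python
# Rough Chinchilla-style sizing heuristics (illustrative only).
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal number of training tokens for a model size."""
    return n_params * tokens_per_param

def approx_training_flops(n_params, n_tokens):
    """Common rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n_params = 70e9                                   # a Chinchilla-scale model
n_tokens = chinchilla_optimal_tokens(n_params)    # ~1.4e12 tokens
print(f"tokens: {n_tokens:.2e}")
print(f"approx training FLOPs: {approx_training_flops(n_params, n_tokens):.2e}")
```
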
PaLM (LLM designed for natural language processing, 2022). Creator: Google. Parameters: 540 billion.
Definition: PaLM (Pathways Language Model) is a 540-billion-parameter, decoder-only Transformer language model trained with Google's Pathways system. It uses self-attention to learn contextual representations and shows strong few-shot performance on reasoning, coding, and multilingual tasks.
Limitations: Limited to natural language processing tasks.
Training time: not mentioned. Dataset: WikiText-103, Toronto Books Corpus, UMBC corpus, and Common Crawl. Hardware: 6,144 TPU v4 chips.

OPT (Open Pretrained Transformer) (LLM designed for natural language processing, 2022). Creator: Meta. Parameters: 175 billion.
Definition: Open Pretrained Transformer (OPT) is a suite of decoder-only transformer language models from Meta AI, pre-trained on a large-scale dataset and released openly to make large-model research more accessible; the largest variant has 175 billion parameters. Limitations include potential difficulties in handling low-resource languages and the need for large amounts of pre-training data.
Limitations: Limited flexibility for fine-tuning on specific tasks due to a fixed architecture and pre-defined set of parameters.
Training time: not mentioned. Dataset: various large-scale text corpora, including Common Crawl, BooksCorpus, and Wikipedia. Hardware: 256 TPU v3s.

YaLM 100B (LLM for natural language processing and generation, 2022). Creator: Yandex. Parameters: 100 billion.
Definition: "Yet another Language Model" with 100 billion parameters: a transformer-based model trained on a diverse range of datasets and released publicly by Yandex. It can generate high-quality text in various languages.
Limitations: Limited accessibility for many researchers and developers due to high computational requirements and cost.
Training time: not mentioned. Dataset: large-scale text corpora, including Common Crawl and various scientific and academic datasets. Hardware: 8,192 NVIDIA V100 GPUs.

Minerva (LLM designed for quantitative reasoning, 2022). Creator: Google. Parameters: 540 billion.
Definition: A language model based on PaLM that is further trained on scientific and mathematical text, enabling it to solve quantitative-reasoning problems in mathematics and science step by step. It uses a transformer architecture with self-attention to learn contextual representations.
Limitations: Limited interpretability and transparency of its decision-making process due to its complex architecture and training process.
Training time: not mentioned. Dataset: OpenWebText2, Common Crawl, PubMed, and other public datasets. Hardware: 2x2 NVIDIA V100 GPUs.

BLOOM (LLM designed for natural language processing, 2022). Creator: BigScience collaboration led by Hugging Face. Parameters: 176 billion.
Definition: An open-access, multilingual transformer language model trained by the BigScience research collaboration. It covers 46 natural languages and 13 programming languages and performs well on a wide range of NLP tasks, including language modeling, question answering, and sentiment analysis.
Limitations: Very high computational requirements for training and inference.
Training time: not mentioned. Dataset: the ROOTS corpus (multilingual web text, books, and code). Hardware: not specified.

AlexaTM (Teacher Models) (LLM designed for natural language processing, 2022). Creator: Amazon. Parameters: 20 billion.
Definition: A family of pre-trained sequence-to-sequence language models developed by Amazon for natural language processing tasks, including text classification, entity recognition, and question answering. They are trained on a large corpus of text and can be fine-tuned or prompted for specific tasks.
Limitations: Limited generalizability to tasks beyond the scope of the specific domains they are trained on.
Training time: not mentioned. Dataset: various datasets, including BooksCorpus, Common Crawl, and other web text. Hardware: not specified.

LLaMA (Large Language Model Meta AI) (LLM designed for natural language processing and generation, 2023). Creator: Meta. Parameters: 65 billion (largest variant).
Definition: A family of decoder-only transformer language models from Meta AI, ranging from 7 billion to 65 billion parameters and trained on publicly available data. The larger variants are competitive with much bigger models on many benchmarks, and the released weights have become a common basis for research and fine-tuned derivatives.
Limitations: Limited interpretability and control over the learned representations due to its highly automated training process.
Training time: not mentioned. Dataset: various large-scale text corpora, including Common Crawl, BooksCorpus, and Wikipedia. Hardware: not specified.

GPT-4 (LLM designed for natural language processing and generation, 2023). Creator: OpenAI. Parameters: not disclosed.
Definition: The successor to GPT-3.5, released by OpenAI in March 2023. It is a large multimodal model that accepts image and text inputs and produces text outputs, with improved reasoning and factuality over earlier GPT models. OpenAI has not disclosed its parameter count, architecture details, training data, or hardware.
Limitations: Can still hallucinate facts and make reasoning errors; technical details are not public.
Training time: not disclosed. Dataset: not disclosed. Hardware: not disclosed.

ERNIE Bot (Conversational LLM for chatbots and dialog systems, 2023). Creator: Baidu. Parameters: not disclosed.
Definition: A chatbot model developed by Baidu, based on the ERNIE pre-training framework. It has been trained on large amounts of dialogue data and can engage in natural-sounding conversations with users.
Limitations: Performance is limited by the quality of the training data it has been exposed to; it may struggle with complex or abstract topics.
Training time: not disclosed. Dataset: not disclosed. Hardware: not disclosed.