List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, trained with self-supervised learning on vast amounts of text.
This page lists notable large language models.
List
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/s sustained for 1 day = 8.64E19 FLOP. Where a family includes multiple model sizes, only the training cost of the largest model is listed.
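As a rough cross-check of this column, the training compute of a dense transformer is commonly approximated as 6 × (parameters) × (training tokens) FLOP; dividing by 8.64E19 converts the result to petaFLOP-days. The sketch below is only an illustration of that approximation applied to the GPT-3 row (175 billion parameters, 300 billion tokens); the factor of 6 is the usual rule of thumb, not a figure reported by the model developers.

```python
# Rough cross-check of the "Training cost (petaFLOP-day)" column.
# Assumes the common dense-transformer estimate: compute ≈ 6 * parameters * tokens (FLOP).

PFLOP_DAY_IN_FLOP = 1e15 * 86_400  # 1 petaFLOP/s sustained for one day = 8.64e19 FLOP

def petaflop_days(parameters: float, tokens: float) -> float:
    """Approximate training compute in petaFLOP-days for a dense transformer."""
    total_flop = 6 * parameters * tokens
    return total_flop / PFLOP_DAY_IN_FLOP

# GPT-3 row: 175 billion parameters, 300 billion training tokens.
print(round(petaflop_days(175e9, 300e9)))  # ~3646, close to the listed 3640
```

Note that mixture-of-experts models (for example GLaM, Mixtral, and DeepSeek-V3) activate only a fraction of their parameters per token, so this estimate does not transfer to them directly.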
| Name | Release date[a] | Developer | Number of parameters (billion)[b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
|---|---|---|---|---|---|---|---|
| Attention Is All You Need | June 2017 | Vaswani et al. at Google | 0.213 | 36 million English-French sentence pairs | 0.09[1] | | Trained for 0.3M steps on 8 NVIDIA P100 GPUs. |
| GPT-1 | June 2018 | OpenAI | 0.117 | | 1[2] | MIT[3] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | October 2018 | Google | 0.340[4] | 3.3 billion words[4] | 9[5] | Apache 2.0[6] | An early and influential language model.[7] Encoder-only and thus not built to be prompted or generative.[8] Training took 4 days on 64 TPUv2 chips.[9] |
| T5 | October 2019 | Google | 11[10] | 34 billion tokens[10] | | Apache 2.0[11] | Base model for many Google projects, such as Imagen.[12] |
| XLNet | June 2019 | Google | 0.340[13] | 33 billion words | 330 | Apache 2.0[14] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[15] |
| GPT-2 | February 2019 | OpenAI | 1.5[16] | 40 GB[17] (~10 billion tokens)[18] | 28[19] | MIT[20] | Trained on 32 TPUv3 chips for 1 week.[19] |
| GPT-3 | May 2020 | OpenAI | 175[21] | 300 billion tokens[18] | 3640[22] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[23] |
| GPT-Neo | March 2021 | EleutherAI | 2.7[24] | 825 GiB[25] | | MIT[26] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[26] |
| GPT-J | June 2021 | EleutherAI | 6[27] | 825 GiB[25] | 200[28] | Apache 2.0 | GPT-3-style language model. |
| Megatron-Turing NLG | October 2021[29] | Microsoft and Nvidia | 530[30] | 338.6 billion tokens[30] | 38,000[31] | Restricted web access | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours.[31] |
| Ernie 3.0 Titan | December 2021 | Baidu | 260[32] | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude[33] | December 2021 | Anthropic | 52[34] | 400 billion tokens[34] | | beta | Fine-tuned for desirable behavior in conversations.[35] |
| GLaM (Generalist Language Model) | December 2021 | Google | 1200[36] | 1.6 trillion tokens[36] | 5600[36] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference than GPT-3. |
| Gopher | December 2021 | DeepMind | 280[37] | 300 billion tokens[38] | 5833[39] | Proprietary | Later developed into the Chinchilla model. |
| LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137[40] | 1.56T words,[40] 168 billion tokens[38] | 4110[41] | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20[42] | 825 GiB[25] | 740[28] | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70[43] | 1.4 trillion tokens[43][38] | 6805[39] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM (Pathways Language Model) | April 2022 | Google | 540[44] | 768 billion tokens[43] | 29,250[39] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[39] As of October 2024, it is the largest dense Transformer published. |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[45] | 180 billion tokens[46] | 310[28] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[47] |
| YaLM 100B | June 2022 | Yandex | 100[48] | 1.7 TB[48] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | June 2022 | Google | 540[49] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[49] | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[50] Initialized from PaLM models, then fine-tuned on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[51] | 350 billion tokens (1.6 TB)[52] | | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages). |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens[53] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | November 2022 | Amazon | 20[54] | 1.3 trillion[55] | | Proprietary[56] | Bidirectional sequence-to-sequence architecture. |
| LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[57] | 1.4 trillion[57] | 6300[58] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters.[57] |
| GPT-4 | March 2023 | OpenAI | Unknown[f] (according to rumors: 1760)[60] | Unknown | Unknown, estimated 230,000 | Proprietary | Available to ChatGPT Plus users and used in several products. |
| Chameleon | June 2024 | Meta AI | 34[61] | 4.4 trillion | | | |
| Cerebras-GPT | March 2023 | Cerebras | 13[62] | | 270[28] | Apache 2.0 | Trained with the Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40[63] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[64] plus some "curated corpora"[65] | 2800[58] | Apache 2.0[66] | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets[67] | | Proprietary | Trained on financial data from proprietary sources, for financial tasks. |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[68] | | Proprietary | |
| OpenAssistant[69] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data. |
| Jurassic-2[70] | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual.[71] |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[72] | 3.6 trillion tokens[72] | 85,000[58] | Proprietary | Was used in the Bard chatbot.[73] |
| Llama 2 | July 2023 | Meta AI | 70[74] | 2 trillion tokens[74] | 21,000 | Llama 2 license | 1.7 million A100 GPU-hours.[75] |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot.[76] |
| Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[77] |
| Mistral 7B | September 2023 | Mistral AI | 7.3[78] | Unknown | | Apache 2.0 | |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[79] |
| Grok 1[80] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).[81] |
| Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[82] |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[83] Mixture-of-experts model, with 12.9 billion parameters activated per token.[84] |
| Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [85] |
| DeepSeek-LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens[86]: table 2 | 12,000 | DeepSeek License | Trained on English and Chinese text. 1e24 FLOP for the 67B model; 1e23 FLOP for the 7B model.[86]: figure 5 |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[87] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[87] |
| Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a mixture-of-experts (MoE) architecture. Context window above 1 million tokens.[88] |
| Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | | |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[89] | |
| Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus.[90] |
| Nova | October 2024 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | Comprised three models: Nova-Instant, Nova-Air, and Nova-Pro. The company later shifted to Sonus AI. |
| Sonus[91] | January 2025 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | |
| DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on the Fugaku supercomputer.[92] |
| Phi-3 | April 2024 | Microsoft | 14[93] | 4.8T tokens | | MIT | Microsoft markets them as "small language models".[94] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| Qwen2 | June 2024 | Alibaba Cloud | 72[95] | 3T tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | 1.4M GPU-hours on H800.[96] |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch on 6144 H100 GPUs between December 2023 and May 2024.[97][98] |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million GPU-hours on H100-80GB, at 3.8E25 FLOP.[99][100] |
| DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | MIT | 2.788M GPU-hours on H800 GPUs.[101] Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.[102] |
| Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro.[103] |
| DeepSeek-R1 | January 2025 | DeepSeek | 671 | Not applicable | Unknown | MIT | No pretraining; reinforcement-learned on top of V3-Base.[104][105] |
| Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | Unknown | Qwen License | 7 dense models, with parameter counts from 0.5B to 72B. They also released 2 MoE variants.[106] |
| MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens[107] | Unknown | Minimax Model license | [108][107] |
| Gemini 2.0 | February 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro.[109][110][111] |
| Mistral Large | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Upgraded over time; the latest version is 24.11.[112] |
| Pixtral | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Multimodal. There is also a 12B version, which is under the Apache 2.0 license.[112] |
| Grok 3 | February 2025 | xAI | Unknown | Unknown | Unknown, estimated 5,800,000 | Proprietary | xAI claimed training used "10x the compute of previous state-of-the-art models".[113] |
| Llama 4 | April 5, 2025 | Meta AI | 400 | 40T tokens | | Llama 4 license | [114][115] |
| Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | Unknown | Apache 2.0 | Multiple sizes, the smallest being 0.6B.[116] |
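Several Notes cells report training effort in GPU-hours rather than petaFLOP-days. The two units can be related only by assuming a per-GPU throughput; the sketch below is a rough illustration (not the method used to fill the column) that converts Llama 2's reported 1.7 million A100-hours at the A100's peak dense BF16 throughput of 312 teraFLOP/s, landing near the listed 21,000 petaFLOP-days. Because real training runs sustain well below peak, figures derived this way are order-of-magnitude estimates at best.

```python
# Relating GPU-hours (reported in several Notes cells) to petaFLOP-days.
# Assumes the GPU's peak dense BF16 throughput; sustained utilization in
# practice is lower, so this gives only a rough upper-bound conversion.

PFLOP_DAY_IN_FLOP = 8.64e19       # 1 petaFLOP/s sustained for one day
A100_PEAK_BF16_FLOPS = 312e12     # NVIDIA A100 peak dense BF16 throughput (FLOP/s)

def gpu_hours_to_petaflop_days(gpu_hours: float, peak_flops: float) -> float:
    """Convert a GPU-hour budget to petaFLOP-days at a given per-GPU throughput."""
    total_flop = gpu_hours * 3600 * peak_flops
    return total_flop / PFLOP_DAY_IN_FLOP

# Llama 2 row: 1.7 million A100-hours, listed as 21,000 petaFLOP-days.
print(round(gpu_hours_to_petaflop_days(1.7e6, A100_PEAK_BF16_FLOPS)))  # ~22,100
```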
Notes
a. This is the date that documentation describing the model's architecture was first released.
b. In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
c. This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
d. The smaller models including 66B are publicly available, while the 175B model is available on request.
e. Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
f. As stated in the technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."[59]
References
^ "AI and compute" . openai.com . 2022-06-09. Retrieved 2025-04-24 .
^ "Improving language understanding with unsupervised learning" . openai.com . June 11, 2018. Archived from the original on 2023-03-18. Retrieved 2023-03-18 .
^ "finetune-transformer-lm" . GitHub . Archived from the original on 19 May 2023. Retrieved 2 January 2024 .
^ a b Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv :1810.04805v2 [cs.CL ].
^ Prickett, Nicole Hemsoth (2021-08-24). "Cerebras Shifts Architecture To Meet Massive AI/ML Models" . The Next Platform . Archived from the original on 2023-06-20. Retrieved 2023-06-20 .
^ "BERT" . March 13, 2023. Archived from the original on January 13, 2021. Retrieved March 13, 2023 – via GitHub.
^ Manning, Christopher D. (2022). "Human Language Understanding & Reasoning" . Daedalus . 151 (2): 127– 138. doi :10.1162/daed_a_01905 . S2CID 248377870 . Archived from the original on 2023-11-17. Retrieved 2023-03-09 .
^ Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022). "Bidirectional Language Models Are Also Few-shot Learners". arXiv :2209.14500 [cs.LG ].
^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv :1810.04805v2 [cs.CL ].
^ a b Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" . Journal of Machine Learning Research . 21 (140): 1– 67. arXiv :1910.10683 . ISSN 1533-7928 .
^ google-research/text-to-text-transfer-transformer , Google Research, 2024-04-02, archived from the original on 2024-03-29, retrieved 2024-04-04
^ "Imagen: Text-to-Image Diffusion Models" . imagen.research.google . Archived from the original on 2024-03-27. Retrieved 2024-04-04 .
^ "Pretrained models — transformers 2.0.0 documentation" . huggingface.co . Archived from the original on 2024-08-05. Retrieved 2024-08-05 .
^ "xlnet" . GitHub . Archived from the original on 2 January 2024. Retrieved 2 January 2024 .
^ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2 January 2020). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". arXiv :1906.08237 [cs.CL ].
^ "GPT-2: 1.5B Release" . OpenAI . 2019-11-05. Archived from the original on 2019-11-14. Retrieved 2019-11-14 .
^ "Better language models and their implications" . openai.com . Archived from the original on 2023-03-16. Retrieved 2023-03-13 .
^ a b "OpenAI's GPT-3 Language Model: A Technical Overview" . lambdalabs.com . 3 June 2020. Archived from the original on 27 March 2023. Retrieved 13 March 2023 .
^ a b "openai-community/gpt2-xl · Hugging Face" . huggingface.co . Archived from the original on 2024-07-24. Retrieved 2024-07-24 .
^ "gpt-2" . GitHub . Archived from the original on 11 March 2023. Retrieved 13 March 2023 .
^ Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter" . TechCrunch . Archived from the original on 16 March 2023. Retrieved 9 March 2023 .
^ Table D.1 in Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (May 28, 2020). "Language Models are Few-Shot Learners". arXiv :2005.14165v4 [cs.CL ].
^ "ChatGPT: Optimizing Language Models for Dialogue" . OpenAI . 2022-11-30. Archived from the original on 2022-11-30. Retrieved 2023-01-13 .
^ "GPT Neo" . March 15, 2023. Archived from the original on March 12, 2023. Retrieved March 12, 2023 – via GitHub.
^ a b c Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv :2101.00027 [cs.CL ].
^ a b Iyer, Abhishek (15 May 2021). "GPT-3's free alternative GPT-Neo is something to be excited about" . VentureBeat . Archived from the original on 9 March 2023. Retrieved 13 March 2023 .
^ "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront" . www.forefront.ai . Archived from the original on 2023-03-09. Retrieved 2023-02-28 .
^ a b c d Dey, Nolan; Gosal, Gurpreet; Zhiming; Chen; Khachane, Hemant; Marshall, William; Pathria, Ribhu; Tom, Marvin; Hestness, Joel (2023-04-01). "Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster". arXiv :2304.03208 [cs.LG ].
^ Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model" . Microsoft Research . Archived from the original on 13 March 2023. Retrieved 13 March 2023 .
^ a b Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia (2022-02-04). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model". arXiv :2201.11990 [cs.CL ].
^ a b Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong (2022-07-21), DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale , arXiv :2201.05596
^ Wang, Shuohuan; Sun, Yu; Xiang, Yang; Wu, Zhihua; Ding, Siyu; Gong, Weibao; Feng, Shikun; Shang, Junyuan; Zhao, Yanbin; Pang, Chao; Liu, Jiaxiang; Chen, Xuyi; Lu, Yuxiang; Liu, Weixin; Wang, Xi; Bai, Yangfan; Chen, Qiuliang; Zhao, Li; Li, Shiyong; Sun, Peng; Yu, Dianhai; Ma, Yanjun; Tian, Hao; Wu, Hua; Wu, Tian; Zeng, Wei; Li, Ge; Gao, Wen; Wang, Haifeng (December 23, 2021). "ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation". arXiv :2112.12731 [cs.CL ].
^ "Product" . Anthropic . Archived from the original on 16 March 2023. Retrieved 14 March 2023 .
^ a b Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. (9 December 2021). "A General Language Assistant as a Laboratory for Alignment". arXiv :2112.00861 [cs.CL ].
^ Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. (15 December 2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv :2212.08073 [cs.CL ].
^ a b c Dai, Andrew M; Du, Nan (December 9, 2021). "More Efficient In-Context Learning with GLaM" . ai.googleblog.com . Archived from the original on 2023-03-12. Retrieved 2023-03-09 .
^ "Language modelling at scale: Gopher, ethical considerations, and retrieval" . www.deepmind.com . 8 December 2021. Archived from the original on 20 March 2023. Retrieved 20 March 2023 .
^ a b c Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. (29 March 2022). "Training Compute-Optimal Large Language Models". arXiv :2203.15556 [cs.CL ].
^ a b c d Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways Archived 2023-06-10 at the Wayback Machine
^ a b Cheng, Heng-Tze; Thoppilan, Romal (January 21, 2022). "LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything" . ai.googleblog.com . Archived from the original on 2022-03-25. Retrieved 2023-03-09 .
^ Thoppilan, Romal; De Freitas, Daniel; Hall, Jamie; Shazeer, Noam; Kulshreshtha, Apoorv; Cheng, Heng-Tze; Jin, Alicia; Bos, Taylor; Baker, Leslie; Du, Yu; Li, YaGuang; Lee, Hongrae; Zheng, Huaixiu Steven; Ghafouri, Amin; Menegali, Marcelo (2022-01-01). "LaMDA: Language Models for Dialog Applications". arXiv :2201.08239 [cs.CL ].
^ Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. (2022-05-01). GPT-NeoX-20B: An Open-Source Autoregressive Language Model . Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. Vol. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. pp. 95– 136. Archived from the original on 2022-12-10. Retrieved 2022-12-19 .
^ a b c Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent (12 April 2022). "An empirical analysis of compute-optimal large language model training" . Deepmind Blog . Archived from the original on 13 April 2022. Retrieved 9 March 2023 .
^ Narang, Sharan; Chowdhery, Aakanksha (April 4, 2022). "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance" . ai.googleblog.com . Archived from the original on 2022-04-04. Retrieved 2023-03-09 .
^ Susan Zhang; Mona Diab; Luke Zettlemoyer. "Democratizing access to large-scale language models with OPT-175B" . ai.facebook.com . Archived from the original on 2023-03-12. Retrieved 2023-03-12 .
^ Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv :2205.01068 [cs.CL ].
^ "metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq" . GitHub . Retrieved 2024-10-18 .
^ a b Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay (2022-06-22), YaLM 100B , archived from the original on 2023-06-16, retrieved 2023-03-18
^ a b Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra, Vedant (30 June 2022). "Solving Quantitative Reasoning Problems with Language Models". arXiv :2206.14858 [cs.CL ].
^ "Minerva: Solving Quantitative Reasoning Problems with Language Models" . ai.googleblog.com . 30 June 2022. Retrieved 20 March 2023 .
^ Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?" . Nature . 615 (7951): 202– 205. Bibcode :2023Natur.615..202A . doi :10.1038/d41586-023-00641-w . PMID 36890378 . S2CID 257380916 . Archived from the original on 16 March 2023. Retrieved 9 March 2023 .
^ "bigscience/bloom · Hugging Face" . huggingface.co . Archived from the original on 2023-04-12. Retrieved 2023-03-13 .
^ Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv :2211.09085 [cs.CL ].
^ "20B-parameter Alexa model sets new marks in few-shot learning" . Amazon Science . 2 August 2022. Archived from the original on 15 March 2023. Retrieved 12 March 2023 .
^ Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. (3 August 2022). "AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model". arXiv :2208.01448 [cs.CL ].
^ "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog" . aws.amazon.com . 17 November 2022. Archived from the original on 13 March 2023. Retrieved 13 March 2023 .
^ a b c "Introducing LLaMA: A foundational, 65-billion-parameter large language model" . Meta AI . 24 February 2023. Archived from the original on 3 March 2023. Retrieved 9 March 2023 .
^ a b c "The Falcon has landed in the Hugging Face ecosystem" . huggingface.co . Archived from the original on 2023-06-20. Retrieved 2023-06-20 .
^ "GPT-4 Technical Report" (PDF) . OpenAI . 2023. Archived (PDF) from the original on March 14, 2023. Retrieved March 14, 2023 .
^ Schreiner, Maximilian (2023-07-11). "GPT-4 architecture, datasets, costs and more leaked" . THE DECODER . Archived from the original on 2023-07-12. Retrieved 2024-07-26 .
^ Dickson, Ben (22 May 2024). "Meta introduces Chameleon, a state-of-the-art multimodal model" . VentureBeat .
^ Dey, Nolan (March 28, 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models" . Cerebras . Archived from the original on March 28, 2023. Retrieved March 28, 2023 .
^ "Abu Dhabi-based TII launches its own version of ChatGPT" . tii.ae . Archived from the original on 2023-04-03. Retrieved 2023-04-03 .
^ Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien (2023-06-01). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only". arXiv :2306.01116 [cs.CL ].
^ "tiiuae/falcon-40b · Hugging Face" . huggingface.co . 2023-06-09. Retrieved 2023-06-20 .
^ UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free Archived 2024-02-08 at the Wayback Machine , 31 May 2023
^ Wu, Shijie; Irsoy, Ozan; Lu, Steven; Dabravolski, Vadim; Dredze, Mark; Gehrmann, Sebastian; Kambadur, Prabhanjan; Rosenberg, David; Mann, Gideon (March 30, 2023). "BloombergGPT: A Large Language Model for Finance". arXiv :2303.17564 [cs.LG ].
^ Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun (March 19, 2023). "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing". arXiv :2303.10845 [cs.CL ].
^ Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations – Democratizing Large Language Model Alignment". arXiv :2304.07327 [cs.CL ].
^ Wrobel, Sharon. "Tel Aviv startup rolls out new advanced AI language model to rival OpenAI" . www.timesofisrael.com . Archived from the original on 2023-07-24. Retrieved 2023-07-24 .
^ Wiggers, Kyle (2023-04-13). "With Bedrock, Amazon enters the generative AI race" . TechCrunch . Archived from the original on 2023-07-24. Retrieved 2023-07-24 .
^ a b Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its predecessor" . CNBC . Archived from the original on 16 May 2023. Retrieved 18 May 2023 .
^ "Introducing PaLM 2" . Google . May 10, 2023. Archived from the original on May 18, 2023. Retrieved May 18, 2023 .
^ a b "Introducing Llama 2: The Next Generation of Our Open Source Large Language Model" . Meta AI . 2023. Archived from the original on 2024-01-05. Retrieved 2023-07-19 .
^ "llama/MODEL_CARD.md at main · meta-llama/llama" . GitHub . Archived from the original on 2024-05-28. Retrieved 2024-05-28 .
^ "Claude 2" . anthropic.com . Archived from the original on 15 December 2023. Retrieved 12 December 2023 .
^ Nirmal, Dinesh (2023-09-07). "Building AI for business: IBM's Granite foundation models" . IBM Blog . Archived from the original on 2024-07-22. Retrieved 2024-08-11 .
^ "Announcing Mistral 7B" . Mistral . 2023. Archived from the original on 2024-01-06. Retrieved 2023-10-06 .
^ "Introducing Claude 2.1" . anthropic.com . Archived from the original on 15 December 2023. Retrieved 12 December 2023 .
^ xai-org/grok-1 , xai-org, 2024-03-19, archived from the original on 2024-05-28, retrieved 2024-03-19
^ "Grok-1 model card" . x.ai . Retrieved 12 December 2023 .
^ "Gemini – Google DeepMind" . deepmind.google . Archived from the original on 8 December 2023. Retrieved 12 December 2023 .
^ Franzen, Carl (11 December 2023). "Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance" . VentureBeat . Archived from the original on 11 December 2023. Retrieved 12 December 2023 .
^ "Mixtral of experts" . mistral.ai . 11 December 2023. Archived from the original on 13 February 2024. Retrieved 12 December 2023 .
^ AI, Mistral (2024-04-17). "Cheaper, Better, Faster, Stronger" . mistral.ai . Archived from the original on 2024-05-05. Retrieved 2024-05-05 .
^ a b DeepSeek-AI; Bi, Xiao; Chen, Deli; Chen, Guanting; Chen, Shanhuang; Dai, Damai; Deng, Chengqi; Ding, Honghui; Dong, Kai (2024-01-05), DeepSeek LLM: Scaling Open-Source Language Models with Longtermism , arXiv :2401.02954
^ a b Hughes, Alyssa (12 December 2023). "Phi-2: The surprising power of small language models" . Microsoft Research . Archived from the original on 12 December 2023. Retrieved 13 December 2023 .
^ "Our next-generation model: Gemini 1.5" . Google . 15 February 2024. Archived from the original on 16 February 2024. Retrieved 16 February 2024 . This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we've also successfully tested up to 10 million tokens.
^ "Gemma" – via GitHub.
^ "Introducing the next generation of Claude" . www.anthropic.com . Archived from the original on 2024-03-04. Retrieved 2024-03-04 .
^ "Sonus AI" . sonus.ai . Retrieved 2025-03-07 .
^ "Fugaku-LLM/Fugaku-LLM-13B · Hugging Face" . huggingface.co . Archived from the original on 2024-05-17. Retrieved 2024-05-17 .
^ "Phi-3" . azure.microsoft.com . 23 April 2024. Archived from the original on 2024-04-27. Retrieved 2024-04-28 .
^ "Phi-3 Model Documentation" . huggingface.co . Archived from the original on 2024-05-13. Retrieved 2024-04-28 .
^ "Qwen2" . GitHub . Archived from the original on 2024-06-17. Retrieved 2024-06-17 .
^ DeepSeek-AI; Liu, Aixin; Feng, Bei; Wang, Bin; Wang, Bingxuan; Liu, Bo; Zhao, Chenggang; Dengr, Chengqi; Ruan, Chong (2024-06-19), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , arXiv :2405.04434
^ "nvidia/Nemotron-4-340B-Base · Hugging Face" . huggingface.co . 2024-06-14. Archived from the original on 2024-06-15. Retrieved 2024-06-15 .
^ "Nemotron-4 340B | Research" . research.nvidia.com . Archived from the original on 2024-06-15. Retrieved 2024-06-15 .
^ "The Llama 3 Herd of Models" (July 23, 2024) Llama Team, AI @ Meta
^ "llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models" . GitHub . Archived from the original on 2024-07-23. Retrieved 2024-07-23 .
^ deepseek-ai/DeepSeek-V3 , DeepSeek, 2024-12-26, retrieved 2024-12-26
^ Feng, Coco (25 March 2025). "DeepSeek wows coders with more powerful open-source V3 model" . South China Morning Post . Retrieved 6 April 2025 .
^ Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards3 , Amazon, 2024-12-27, retrieved 2024-12-27
^ deepseek-ai/DeepSeek-R1 , DeepSeek, 2025-01-21, retrieved 2025-01-21
^ DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22), DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , arXiv :2501.12948
^ Qwen; Yang, An; Yang, Baosong; Zhang, Beichen; Hui, Binyuan; Zheng, Bo; Yu, Bowen; Li, Chengyuan; Liu, Dayiheng (2025-01-03), Qwen2.5 Technical Report , arXiv :2412.15115
^ a b MiniMax; Li, Aonian; Gong, Bangwei; Yang, Bo; Shan, Boji; Liu, Chang; Zhu, Cheng; Zhang, Chunhao; Guo, Congchao (2025-01-14), MiniMax-01: Scaling Foundation Models with Lightning Attention , arXiv :2501.08313
^ MiniMax-AI/MiniMax-01 , MiniMax, 2025-01-26, retrieved 2025-01-26
^ Kavukcuoglu, Koray (5 February 2025). "Gemini 2.0 is now available to everyone" . Google . Retrieved 6 February 2025 .
^ "Gemini 2.0: Flash, Flash-Lite and Pro" . Google for Developers . Retrieved 6 February 2025 .
^ Franzen, Carl (5 February 2025). "Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search" . VentureBeat . Retrieved 6 February 2025 .
^ a b "Models Overview" . mistral.ai . Retrieved 2025-03-03 .
^ "Grok 3 Beta — The Age of Reasoning Agents" . x.ai . Retrieved 2025-02-22 .
^ "meta-llama/Llama-4-Maverick-17B-128E · Hugging Face" . huggingface.co . 2025-04-05. Retrieved 2025-04-06 .
^ "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation" . ai.meta.com . Archived from the original on 2025-04-05. Retrieved 2025-04-05 .
^ Team, Qwen (2025-04-29). "Qwen3: Think Deeper, Act Faster" . Qwen . Retrieved 2025-04-29 .