GPT-J or GPT-J-6B is an open-source large language model (LLM) developed by EleutherAI in 2021.[1] As the name suggests, it is a generative pre-trained transformer model designed to produce human-like text that continues from a prompt. The optional "6B" in the name refers to its 6 billion parameters.[2] The model is available on GitHub, but the web interface no longer communicates with it. Development stopped in 2021.[3]
The GPT-J model uses rotary position embeddings, which have been found to be a superior method of injecting positional information into transformers.[5][6]
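The idea behind rotary position embeddings can be illustrated with a minimal sketch. It is not taken from GPT-J's code: the function names are hypothetical, and the pairing of adjacent dimensions and the base of 10000 are assumptions that follow the formulation in the RoFormer paper. Each pair of vector components is rotated by an angle proportional to the token's position, so the dot product between a rotated query and a rotated key depends only on how far apart the two positions are.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Hypothetical helper: pairs (0, 1), (2, 3), ... are each treated as a
    2-D point and rotated by position * base ** (-i / dim).
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (made-up values, for illustration only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both pairs of positions are 2 apart, so the two scores agree
# up to floating-point rounding.
score_near_start = dot(rotary_embed(q, 3), rotary_embed(k, 1))
score_further_on = dot(rotary_embed(q, 12), rotary_embed(k, 10))
```

Because the rotation only changes direction and not length, the embedding encodes relative position without adding a separate position vector to the token representation.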
GPT-J uses dense attention rather than the more efficient sparse attention used in GPT-3.
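The difference between dense and sparse attention can be sketched with attention masks. This is a simplified illustration, not either model's implementation: the function names are hypothetical, and the fixed local window is only one assumed form of sparsity, chosen because it is easy to show. In a dense causal mask every token attends to all earlier tokens; in the windowed mask it attends only to the most recent few.

```python
def dense_causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (all j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_causal_mask(n, window):
    """Assumed sparse variant: token i sees only the last `window` tokens."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


n = 6
dense = attended_pairs(dense_causal_mask(n))               # grows as n * (n + 1) / 2
sparse = attended_pairs(local_causal_mask(n, window=2))    # grows roughly as n * window
```

The dense pattern costs more as the sequence grows, which is the trade-off the sentence above refers to.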
Beyond that, the model has 28 transformer layers and 16 attention heads. Its vocabulary size is 50257 tokens, the same size as GPT-2's.[2] It has a context window size of 2048 tokens.[7]
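The figures above can be collected into a small configuration object. The sketch below is illustrative only: the class and function names are hypothetical, and trimming the oldest tokens is an assumed policy rather than anything GPT-J prescribes. It shows how the 2048-token context window limits the combined length of the prompt and the text to be generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTJHyperparams:
    """Values stated in the article; the class itself is hypothetical."""
    n_layers: int = 28
    n_heads: int = 16
    vocab_size: int = 50257
    context_window: int = 2048


def fit_to_context(token_ids, max_new_tokens, hp=GPTJHyperparams()):
    """Keep the most recent prompt tokens so prompt + output fit the window."""
    if not 0 <= max_new_tokens < hp.context_window:
        raise ValueError("max_new_tokens must leave room for the prompt")
    if any(not 0 <= t < hp.vocab_size for t in token_ids):
        raise ValueError("token id outside the vocabulary")
    budget = hp.context_window - max_new_tokens
    return token_ids[-budget:]


# A 3000-token prompt with 48 tokens reserved for output keeps 2000 tokens.
trimmed = fit_to_context(list(range(3000)), max_new_tokens=48)
```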
It was trained on the Pile dataset,[2][4] using the Mesh Transformer JAX library in JAX to handle the parallelization scheme.[2][8]
Performance
GPT-J was designed to generate English text from a prompt. It was not designed to translate or generate text in other languages, nor to perform specific tasks without first being fine-tuned for them.[2] Nonetheless, GPT-J performs reasonably well without fine-tuning, including in translation (at least from English to French).[9]
When neither is fine-tuned, GPT-J-6B performs almost as well as the 6.7 billion parameter GPT-3 (Curie) on a variety of tasks.[4] It even outperforms the 175 billion parameter GPT-3 (Davinci) on code generation tasks.[10] With fine-tuning, it outperforms an untuned GPT-3 (Davinci) on a number of tasks.[1]
Like all LLMs, it is not programmed to give factually accurate information, only to generate text based on probability.[2]
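What "generating text based on probability" means can be shown with a toy example. Everything in it is invented for illustration: the candidate tokens, the scores, and the function names are hypothetical and do not come from GPT-J. A model assigns a score to each candidate next token, the scores are turned into probabilities, and one token is drawn at random according to those probabilities.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens, logits, rng):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up scores for continuing the prompt "The capital of France is".
candidates = ["Paris", "Lyon", "London"]
scores = [2.0, 1.0, 0.1]

rng = random.Random(0)
next_token = sample_next_token(candidates, scores, rng)
```

The most likely continuation is usually chosen, but a less likely and factually wrong one can still be drawn, since nothing in the procedure checks the output against facts.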
Applications
The untuned GPT-J is available on EleutherAI's website,[11] NVIDIA's Triton Inference Server,[12] and NLP Cloud's website.[13] Cerebras[1] and Amazon Web Services[14][15] offer services to fine-tune the GPT-J model for company-specific tasks. Graphcore offers fine-tuning and hosting services for the untuned GPT-J, and also offers to host the resulting fine-tuned models.[16] CoreWeave offers hosting services for both the untuned GPT-J and fine-tuned variants.[17][18]
In March 2023, Databricks released Dolly, an Apache-licensed, instruction-following model created by fine-tuning GPT-J on the Stanford Alpaca dataset.[19] NovelAI's Sigurd[20] and Genji-JP 6B[21] models are both fine-tuned versions of GPT-J. NovelAI also offers further fine-tuning services to produce and host custom models.[22]
EleutherAI has received praise from Cerebras,[1] GPT-3 Demo,[4] NLP Cloud,[13] and Databricks[19] for making the model open-source, and its open-source status is often cited as a major advantage when choosing which model to use.[10][16][23]
"GPT-J". GPT-3 Demo. Retrieved 13 June 2023.
Biderman, Stella; Black, Sid; Foster, Charles; Gao, Leo; Hallahan, Eric; He, Horace; Wang, Ben; Wang, Phil (20 April 2021). "Rotary Embeddings: A Relative Revolution". EleutherAI. Retrieved 14 June 2023. In general we have found that across a large suite of setups including regular, linear, and local self-attention, it either matches or surpasses all other methods currently available for injecting positional information into transformers.
Su, Jianlin; Lu, Yu; Pan, Shengfeng; Murtadha, Ahmed; Wen, Bo; Liu, Yunfeng (9 August 2022). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864 [cs.CL].