A lot of people have been wondering what makes GPT-4 so much better than GPT-3. It has taken the world by storm and is currently the most talked-about AI model, so people want to know more about it. OpenAI has not released anything about GPT-4 - its size, its data, its internal structure, or how it was trained and built - and we've all been wondering why they have been concealing this information.
Well, you’re about to find out because the details on GPT-4 have been leaked!
So what details have we found out about GPT-4? Let’s dive in…
Model Size
Large language models (LLMs) have been growing over the years, and model size reflects this. By 2022, the largest models had reached around 1 trillion parameters, roughly a 15,000x increase over the previous five years (GPT-3 itself has 175 billion parameters). GPT-4 is said to be roughly 10x the size of its predecessor, GPT-3, with around 1.8 trillion parameters spread across 120 layers. At 120 layers, GPT-4 is a deep architecture able to handle a wide range of complex tasks - making it one of the most advanced models out there!
Mixture of Experts
OpenAI is reportedly using MoE - a mixture of experts. Unlike GPT-3, which is one static, dense model, GPT-4 is said to be a mixture of 16 experts, each with roughly 111 billion parameters for its multi-layer perceptrons (some reports describe this as 8 models of around 220 billion parameters each). These experts were trained on different data and task distributions, with each expert taking on a specific role, for example coding or formatting.
Mixture-of-experts is not a new idea and has been around for a while. Google, for example, uses a mixture of experts with expert-choice routing, meaning that depending on what type of question you ask, you are routed to a different expert that answers it.
GPT-4 reportedly uses roughly 55 billion parameters solely for attention - the mechanism that weighs which parts of the input matter when generating each token, for example keeping the model on the topic at hand.
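To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer. Everything in it is illustrative rather than taken from the leak: the layer sizes, the top-2 routing, and the simple gating network are assumptions for the example only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELayer:
    """Toy mixture-of-experts layer: a gating network scores every expert
    for a given token, and only the top-k experts are actually run."""

    def __init__(self, d_model=16, n_experts=16, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Each "expert" here is just a small feed-forward weight matrix.
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]
        # The gate maps a token embedding to one score per expert.
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02
        self.top_k = top_k

    def forward(self, token_embedding):
        scores = softmax(token_embedding @ self.gate)
        # Route the token to its top-k experts only (sparse activation).
        chosen = np.argsort(scores)[-self.top_k:]
        output = np.zeros_like(token_embedding)
        for idx in chosen:
            output += scores[idx] * (token_embedding @ self.experts[idx])
        return output, chosen

layer = MoELayer()
out, experts_used = layer.forward(np.random.default_rng(1).standard_normal(16))
print("experts consulted for this token:", experts_used)
```

Only the chosen experts' weights are touched for each token, which is how a model with 1.8 trillion total parameters can run with only a fraction of them active on any given forward pass.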
Inference
Inference is how LLMs make predictions from a trained model, and GPT-4 is doing pretty well here in comparison to other models. It has been said that each forward pass - the generation of a single token - uses only around 280 billion parameters and roughly 560 teraflops (trillions of floating-point operations).
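Those two numbers line up with the usual rule of thumb that a forward pass costs about 2 floating-point operations per active parameter per token. A quick back-of-the-envelope check (the rule of thumb is an approximation, not a figure from the leak):

```python
# Rough sanity check: forward-pass FLOPs per token ≈ 2 × active parameters.
active_params = 280e9          # ~280 billion parameters used per forward pass
flops_per_token = 2 * active_params
print(f"{flops_per_token / 1e12:.0f} TFLOPs per generated token")  # ~560 TFLOPs
```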
Datasets
Given its performance as a state-of-the-art model, you can imagine how much data GPT-4 was trained on. It is stated that GPT-4 was trained on roughly 13 trillion tokens, which is roughly 10 trillion words, using 2 epochs for text-based data and 4 epochs for code-based data.
The actual size of the dataset is unknown, as some of these tokens were re-used across epochs, but it can be roughly estimated at several trillion unique tokens. Internally, there are also millions of rows of instruction fine-tuning data from ScaleAI.
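As a rough illustration of why "several trillion" unique tokens is plausible: if the ~13 trillion token-views are split between text seen twice and code seen four times, the unique count lands at a few trillion. The 80/20 text-to-code split below is purely an assumption for the sake of the arithmetic.

```python
# Illustrative estimate of unique tokens, assuming (hypothetically) that
# 80% of the ~13T token-views are text (2 epochs) and 20% are code (4 epochs).
total_token_views = 13e12
text_views, code_views = 0.8 * total_token_views, 0.2 * total_token_views
unique_tokens = text_views / 2 + code_views / 4
print(f"~{unique_tokens / 1e12:.2f} trillion unique tokens")  # ~5.85 trillion
```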
Context Length
For its pre-training phase, GPT-4 used a context length of 8 thousand tokens. The longer 32k-token version was produced by fine-tuning the 8k model after pre-training.
Batch Size
The batch size is the number of samples processed before the model is updated. The batch size was ramped up over the course of training, with OpenAI eventually using a batch size of around 60 million tokens, which works out to roughly 7.5 million tokens per expert. To find the real batch size in sequences, divide this token count by the sequence length, as in the quick calculation below.
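Using the 8k pre-training context length mentioned above, the division works out like this:

```python
# "Real" batch size in sequences = tokens per batch / tokens per sequence.
batch_tokens = 60_000_000      # ~60 million tokens per batch
sequence_length = 8_000        # pre-training context length
batch_sequences = batch_tokens // sequence_length
print(batch_sequences, "sequences per batch")  # 7,500 sequences
```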
Training Costs
This is an area that a lot of you will be interested in - training costs. You can imagine how expensive GPT-4 was to build and train.
It took OpenAI roughly 2.1e25 FLOPs (floating-point operations) of compute to train GPT-4, using around 25 thousand A100 GPUs over the space of roughly three months. It is stated that GPT-4 is around 3x more computationally expensive to run than GPT-3.5, and it is also said that GPT-4 costs 3x more than GPT-3 per prompt.
For example, if OpenAI's cost in the cloud were around $1 per A100-hour, the training cost for this run alone would have been roughly $63 million.
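That figure is easy to sanity-check: the cost is just GPU count × hours × hourly price. The 105-day run length below is an assumption for the sketch (roughly the "three months" mentioned above) chosen because it reproduces the quoted $63 million; slightly shorter runs give the same order of magnitude.

```python
# Back-of-the-envelope training cost: GPUs × days × 24 hours × price per GPU-hour.
num_gpus = 25_000              # ~25 thousand A100s
training_days = 105            # assumption: roughly the "three months" above
price_per_gpu_hour = 1.00      # $1 per A100-hour, as in the example

gpu_hours = num_gpus * training_days * 24
cost = gpu_hours * price_per_gpu_hour
print(f"~${cost / 1e6:.0f} million")  # ~$63 million
```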
Speculative Decoding
It has also been said that OpenAI might be using speculative decoding - the key word being 'might'. This means using a smaller, faster model to draft several tokens ahead and then feeding those draft tokens to the large model to check in a single batch.
If the predictions made by the smaller model are correct, the large model agrees with them and several tokens are accepted at once. However, if the larger model rejects a prediction from the smaller model, the rest of that batch is discarded.
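Here is a minimal sketch of that accept/reject loop. Everything in it is illustrative: the draft_model and large_model functions are hypothetical stand-ins, and the greedy "agreement" check is a simplification of the probabilistic acceptance rule used in practice.

```python
def speculative_decode_step(prompt, draft_model, large_model, k=4):
    """One round of (greedy) speculative decoding.

    draft_model(tokens) -> next token proposed by the small, fast model
    large_model(tokens) -> next token the big model would actually pick
    Both are hypothetical callables used only to illustrate the control flow.
    """
    # 1. The small model drafts k tokens ahead, one at a time (cheap).
    draft = []
    context = list(prompt)
    for _ in range(k):
        token = draft_model(context)
        draft.append(token)
        context.append(token)

    # 2. The large model checks the whole draft (in practice as one batched
    #    pass; simulated here position by position for clarity).
    accepted = []
    context = list(prompt)
    for token in draft:
        verified = large_model(context)
        if verified == token:
            accepted.append(token)        # big model agrees: keep the draft token
            context.append(token)
        else:
            accepted.append(verified)     # big model disagrees: take its token...
            break                         # ...and discard the rest of the draft
    return accepted

# Toy usage: the "draft" model guesses the next letter of the alphabet,
# while the "large" model does the same but keeps emitting 'e' after 'd'.
draft_model = lambda toks: chr(ord(toks[-1]) + 1)
large_model = lambda toks: chr(ord(toks[-1]) + 1) if toks[-1] < 'd' else 'e'
print(speculative_decode_step(list("abc"), draft_model, large_model))
```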
Wrapping it up
This is more of a high-level architecture leak than the full model leak a lot of people were expecting. Although it is not the same, this kind of information is still useful to know as we continue to watch the growth of LLMs and see just how much it takes to create an AI model such as GPT-4.
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is or can be of benefit to the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.