This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/NixTheFolf on 2024-04-03 01:35:24.


The authors of the paper “Logits of API-Protected LLMs Leak Proprietary Information” describe how they identified and exploited a “softmax bottleneck” in an API-protected LLM over a large number of API calls, which they then used to obtain a close estimate of GPT-3.5-Turbo’s embedding size: around 4096 ± 512. They then note that this makes GPT-3.5-Turbo either a 7B dense model (by comparison with other models known to have an embedding size of ~4096) or an Nx7B MoE.
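To give a rough sense of why logits leak the embedding size: the logits are a linear projection of a d-dimensional hidden state, so any collection of full logit vectors spans a subspace of dimension at most d. Below is a minimal toy sketch of that rank argument, assuming we can already obtain full logit vectors (the paper's actual contribution, recovering them from a restricted API, is not shown here); the names and sizes are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes so the demo runs quickly; the real setting would be
# vocab_size ~ 100k and hidden_size ~ 4096.
vocab_size, hidden_size, n_queries = 2_000, 256, 1_024

# Stand-in for an API-protected model: a fixed unembedding matrix W and a
# fresh final hidden state h for every "prompt".
W = rng.standard_normal((vocab_size, hidden_size))

def get_logits() -> np.ndarray:
    """Return one full logit vector, shape (vocab_size,)."""
    h = rng.standard_normal(hidden_size)
    return W @ h

# Collect more logit vectors than the suspected hidden size, then estimate
# the rank of the stacked matrix from its singular values. Every logit
# vector lies in the column space of W, so the numerical rank equals the
# hidden (embedding) size.
L = np.stack([get_logits() for _ in range(n_queries)])
s = np.linalg.svd(L, compute_uv=False)
est_dim = int((s > s[0] * 1e-6).sum())
print(f"estimated embedding size ≈ {est_dim}")  # prints 256
```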

I have done some thinking, and I predict that GPT-3.5-Turbo (the one that has been in use since early 2023, not the original GPT-3.5) is almost CERTAINLY an 8x7B model.

Evidence also points to this indirectly when we take a look at Mixtral-8x7B. Mixtral-8x7B has been used by many, with the general consensus being that it is on par with or slightly exceeds GPT-3.5-Turbo on most things.

GPT-3.5-Turbo-0613 and Mixtral-8x7B-Instruct-v0.1 sit about 1 Elo point apart on the LMSYS Chatbot Arena Leaderboard, with GPT-3.5-Turbo-0613's rating carrying an uncertainty of roughly +3/-4 points.
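For context on what a ~1-point gap means, here is the standard Elo win-probability formula applied to a few gaps (a back-of-the-envelope calculation, not data from the leaderboard):

```python
# Expected win probability under the Elo model: 1 / (1 + 10^(-diff / 400)).
def elo_win_prob(diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

for diff in (1, 4, 25, 100):
    print(f"Elo gap {diff:>3}: expected win rate {elo_win_prob(diff):.3f}")
# Elo gap   1: expected win rate 0.501
# Elo gap   4: expected win rate 0.506
# Elo gap  25: expected win rate 0.536
# Elo gap 100: expected win rate 0.640
```

A 1-point gap is an expected head-to-head win rate of about 50.1%, i.e. the two models are effectively indistinguishable in arena votes.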

While the evidence points to this, some still might not believe that GPT-3.5-Turbo is the same size as Mixtral-8x7B because of differences in performance on other languages, but this could come down to differences in training data. We have no idea what training data was used for either Mixtral-8x7B or GPT-3.5-Turbo, so differences in their performance could simply reflect differences in what they were trained on.

Differences can also arise from how the two models were tuned: GPT-3.5-Turbo is fine-tuned with RLHF on a LOT of human feedback, collected through ChatGPT's feature that lets people vote on the answers the LLM gives, while Mixtral-8x7B-Instruct is a more general instruction fine-tune.

The use of a MoE by OpenAI makes a lot of sense too. They originally released GPT-3.5 back in November of 2022 inside ChatGPT, which they thought not many people would use, so compute was not much of a concern. When ChatGPT blew up over the next two months, compute became the MAIN concern, as tens of millions of people were now using ChatGPT and that model, with more and more jumping on in the following months. They needed a new model that could be as smart as the original GPT-3.5 but able to be served to millions of people at the same time to keep up with heavy demand.

OpenAI had just finished training GPT-4 not long before, which reportedly used an 8-expert MoE (based on indirect knowledge). It showed great promise not only for its capability but also for how efficiently it could be run compared to a fully dense 1T+ model. They likely figured that a smaller MoE could get similar performance to GPT-3.5 while costing much less compute to serve to many people, needing only the VRAM to load the model into GPU memory. If we assume they activate 2 out of 8 experts per token when serving (similar to the default for Mixtral-8x7B), this would roughly quadruple their effective serving capacity for the growing ChatGPT user base.
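A quick parameter-count sketch shows where that rough 4x comes from. The numbers below are Mixtral-8x7B's published configuration, used purely as a stand-in since GPT-3.5-Turbo's actual architecture is unknown:

```python
# Rough parameter-count arithmetic for a Mixtral-8x7B-style MoE.
hidden, layers, ffn, vocab = 4096, 32, 14336, 32000
n_heads, n_kv_heads = 32, 8
n_experts, active_experts = 8, 2

head_dim = hidden // n_heads
# Attention projections: q, o are full-width; k, v use grouped-query heads.
attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)
# One SwiGLU feed-forward expert: gate, up, and down projections.
expert = 3 * hidden * ffn
router = hidden * n_experts

total_per_layer  = attn + n_experts * expert + router
active_per_layer = attn + active_experts * expert + router
embeddings = 2 * vocab * hidden  # input embedding + unembedding

total  = layers * total_per_layer + embeddings
active = layers * active_per_layer + embeddings

print(f"total params:     {total / 1e9:.1f}B")   # ~46.7B
print(f"active per token: {active / 1e9:.1f}B")  # ~12.9B
print(f"ratio:            {total / active:.1f}x  (per-token compute vs. a dense model of the same total size)")
```

With 2 of 8 experts active, per-token compute tracks the ~13B active parameters rather than the ~47B total, roughly a 3.6x saving over a dense model of the same total size, which is in the ballpark of the quadrupling described above.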

They first released GPT-3.5-Turbo in the API and in ChatGPT Plus to get a better read from the public on how the model performed compared to the original GPT-3.5, and to set aside enough compute to run it at full ChatGPT scale, which they did a little while later. They may also have used RLHF data from the public to tune the new GPT-3.5-Turbo to behave a lot like ChatGPT in most cases, which helped them swap the model into ChatGPT seamlessly without most users noticing.

Based on ALL of this, I can very confidently say that I think GPT-3.5-Turbo is an 8x7B model, basically the same size as Mixtral.

One thing I also want to note: the paper I mentioned is newer than one from about a month or two ago that described a similar technique (I forget the name of that earlier paper). That team did something similar and recovered the embedding sizes of other, smaller OpenAI models, but at OpenAI's request they did not disclose GPT-3.5-Turbo's embedding size. Their withholding that information might be because a model with the same specifications as GPT-3.5-Turbo already exists, and that model is Mixtral!