streetyogi on Nostr:
As the Codex-based GitHub Copilot shows its age, I am in search of a better LLM for Python code completion.
The benchmark that measures this performance is called HumanEval.
I found this interesting site with a chart of Code Generation results on HumanEval.
Together with some open-source models and xAI's new Grok-1, this is the top 20:
1.) Language Agent Tree Search (GPT-4) 94.4%
2.) Reflexion (GPT-4) 91.0%
3.) Language Agent Tree Search (GPT-3.5) 86.9%
4.) OctoPack (GPT-4) 86.6%
5.) ANPL (GPT-4) 86.6%
6.) Parsel (GPT-4 + CodeT) 85.1%
7.) MetaGPT (GPT-4) 81.7%
8.) ANPL (GPT-3.5) 76.2%
9.) Phind-CodeLlama-34B-v2 73.8%
10.) WizardCoder-Python-34B-v2 73.2%
11.) Claude 2 70%
12.) Phind-CodeLlama-34B-Python-v1 69.5%
13.) GPT-4 67%
14.) CODE-T (code-davinci-002) 65.8%
15.) CODE-T-Iter (code-davinci-002) 65.2%
16.) Grok-1 63.2%
17.) Unnatural Code Llama 62.2%
18.) PanGu-Coder2 15B 61.64%
19.) WizardCoder 15B 57.3%
20.) Code Llama - Python 53.7%
...
36.) Codex-12B/copilot 28.81%
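For context on what these scores mean: HumanEval gives the model 164 Python function stubs (signature plus docstring) and counts a problem as solved only if the generated body passes that problem's hidden unit tests; the percentages above are essentially that pass rate. A minimal sketch of the check, with generate() as a stand-in for whatever model is being tested:

# Minimal sketch of a HumanEval-style check; generate() is a placeholder
# for the model under test, and real harnesses sandbox the exec() calls.
def generate(prompt: str) -> str:
    """Pretend the model completed the function body."""
    return "    return a + b\n"

def passes(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Define prompt + completion, then run the benchmark's check() on it."""
    ns: dict = {}
    try:
        exec(prompt + completion, ns)   # defines the candidate function
        exec(test_code, ns)             # defines check()
        ns["check"](ns[entry_point])    # raises AssertionError on failure
        return True
    except Exception:
        return False

# One toy problem in HumanEval's shape: a stub plus hidden tests.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}
print(passes(problem["prompt"], generate(problem["prompt"]), problem["test"], problem["entry_point"]))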
I already tried WizardCoder-Python-34B-v2 via the llama-cpp-python server in Visual Studio Code, but it is not that fast for me since CUDA support is still in early development. Some other alternatives I can test are vLLM, FlexFlow, LocalAI, FastChat and OpenLLM.
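For anyone curious, the same llama-cpp-python package that backs the server can also be called directly from Python. A rough sketch, with a placeholder GGUF path and illustrative parameters (n_gpu_layers controls how many layers get offloaded to the GPU, which is where the immature CUDA support hurts):

# Rough sketch only: the model path and parameter values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/wizardcoder-python-34b.Q6_K.gguf",  # placeholder filename
    n_ctx=4096,       # context window
    n_gpu_layers=40,  # layers offloaded to the GPU; 0 means CPU only
)

prompt = "# Reverse a string in Python\ndef reverse_string(s):"
out = llm(prompt, max_tokens=128, temperature=0.1, stop=["\n\n"])
print(prompt + out["choices"][0]["text"])

The OpenAI-compatible server (python -m llama_cpp.server) exposes the same model over HTTP, which is how it plugs into the editor.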
So the best options for me are:
1.) OpenAI API (GPT-4) - see the sketch after this list
2.) OpenAI API (GPT-3)
3.) Self-hosted 6-bit or 8-bit quantized Phind-CodeLlama-34B-v2
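For options 1 and 2 the integration is just a chat completion call. A minimal sketch, assuming the openai Python package (v1 client) with OPENAI_API_KEY set in the environment; the prompt is only illustrative:

# Minimal sketch of options 1/2: code completion via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4",
    temperature=0.1,
    messages=[
        {"role": "system", "content": "Complete the given Python code. Return only code."},
        {"role": "user", "content": 'def fizzbuzz(n):\n    """Return the FizzBuzz string for n."""\n'},
    ],
)
print(resp.choices[0].message.content)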
https://paperswithcode.com/sota/code-generation-on-humaneval