An academic with limited resources shared their experience using several large language models (LLMs) for code assistance on a 16GB GeForce RTX 4060 Ti GPU.
Evaluating LLMs for Specific Tasks
The user tested several LLMs, including GLM 4.7, Qwen3 Coder 30B A3B, OSS 20B, Qwen3.5 (27B and 9B), and Qwen2.5 Coder 14B, with context windows ranging from 20,000 to 48,000 tokens. The goal was to assess each model's ability to understand and extend existing code, specifically a reinforcement learning implementation of a transitive inference task.
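The post does not describe the task code itself, but a minimal sketch of what a reward-based transitive-inference setup might look like is shown below; the item ranking, epsilon-greedy choice rule, value update, and hyperparameters are illustrative assumptions, not the author's actual implementation.

import random

# Illustrative sketch: an agent is rewarded for picking the higher-ranked item
# from adjacent pairs (A>B, B>C, ...), then probed on an unseen non-adjacent
# pair (B vs D) to test transitive inference. All names and parameters here are
# assumptions for illustration only.

ITEMS = ["A", "B", "C", "D", "E"]        # implicit rank: A > B > C > D > E
values = {item: 0.0 for item in ITEMS}   # learned value per stimulus
ALPHA, EPSILON, EPISODES = 0.1, 0.1, 5000

def choose(pair):
    """Epsilon-greedy choice between two stimuli based on learned values."""
    if random.random() < EPSILON:
        return random.choice(pair)
    return max(pair, key=lambda s: values[s])

def train():
    adjacent_pairs = list(zip(ITEMS, ITEMS[1:]))   # only adjacent pairs are trained
    for _ in range(EPISODES):
        high, low = random.choice(adjacent_pairs)
        pick = choose((high, low))
        reward = 1.0 if pick == high else 0.0      # higher-ranked item is rewarded
        values[pick] += ALPHA * (reward - values[pick])

def test_transitive():
    # Non-adjacent probe: did the agent infer B > D without ever seeing that pair?
    return values["B"] > values["D"]

if __name__ == "__main__":
    train()
    print("learned values:", values)
    print("transitive inference (B > D)?", test_transitive())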
Devstral Small 2: An Unexpected Surprise
Contrary to expectations based on previous online assessments, Devstral Small 2 (24B) stood out as the most effective model. While it did not provide perfect answers, it was the only model that produced partially correct results usable as a starting point. Other models, including GLM 4.7, produced lower-quality output even with longer processing times.
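For context on the hardware constraint, the sketch below shows one way a roughly 24B-parameter model could be run locally on a 16GB card using llama-cpp-python with a quantized GGUF file; the filename, quantization level, and context size are assumptions for illustration, not the author's reported configuration.

from llama_cpp import Llama

# Assumptions: a locally downloaded 4-bit GGUF build of a ~24B model and a 32k
# context, chosen so the weights plus KV cache can fit in 16 GB of VRAM.
llm = Llama(
    model_path="devstral-small-2-24b.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=32768,       # context window in the 20,000-48,000 token range used above
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Extend this RL script to log per-episode reward."},
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])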