The recording is available off the live stream here.
I am back in Canada now and after some rest I am following up on the next steps I proposed.
First, I’ve opened a pull request on Codeberg with the MCP server as it ran in the talk’s demo.
Testing, questions, comments & reviews welcome.
You might notice between README.rst and AGENTS.md, there is heavy guidance for AI assistance and the MCP server. This is expected and on purpose: this is a tool for LLMs, after all.
It comes included with the necessary context so that LLMs can work with and on the MCP server.
There are still some tweaks I would like to make, but I feel like this is a good first iteration that we can improve and build upon.
I wrote down some ideas in TODO.md, with some use cases that we could probably support, like:
- scanning files, results or host facts for sensitive information/tokens that could benefit from no_log or from being ignored by ara
- comparing two hosts or two playbooks with each other to find out how they might differ
- summarizing tasks that resulted in changes, with some heuristics, since modules might falsely report CHANGED, like a shell task running echo foo without changed_when: false
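To make the last idea a bit more concrete, here is a minimal sketch of what such a heuristic could look like. It is not taken from the pull request: the `TaskResult` record and its `has_changed_when` field are hypothetical simplifications rather than ara's actual data model, and the set of modules is an assumption that would need extending. It relies only on the fact that shell and command report "changed" on every run unless changed_when says otherwise.

```python
from dataclasses import dataclass

# Modules that report "changed" on every run unless changed_when is set.
# Assumption: this set is illustrative and incomplete.
ALWAYS_CHANGED_MODULES = {
    "shell",
    "command",
    "ansible.builtin.shell",
    "ansible.builtin.command",
}


@dataclass
class TaskResult:
    """Hypothetical, simplified result record (not ara's real schema)."""
    name: str
    action: str
    status: str
    has_changed_when: bool = False


def classify_changes(results):
    """Split changed results into (likely real, likely spurious)."""
    real, suspect = [], []
    for result in results:
        if result.status != "changed":
            continue
        if result.action in ALWAYS_CHANGED_MODULES and not result.has_changed_when:
            suspect.append(result)
        else:
            real.append(result)
    return real, suspect


results = [
    TaskResult("Print a message", "ansible.builtin.shell", "changed"),
    TaskResult("Install package", "ansible.builtin.package", "changed"),
    TaskResult("Check uptime", "ansible.builtin.command", "ok"),
]
real, suspect = classify_changes(results)
print("likely real:", [r.name for r in real])
print("likely spurious:", [r.name for r in suspect])
```

A summary built this way would list the suspect tasks separately instead of hiding them, since the heuristic can be wrong in both directions.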
I still plan to find out how to hook this up to ara’s CI jobs in Codeberg/Forgejo Actions so that a model can help troubleshoot CI jobs.
I haven’t looked at opportunities to leverage LLM “skills” yet but they could be interesting in different ways.
I think there is a lot of potential in asking competent models about our infrastructure and automation, so I’m confident we’ll come up with other ideas, but that’s what I’ve got for now.
@rfc2549 nice talk indeed, and the only one I felt got down to the practical use of LLMs, with some demos and all. All the other talks were mostly either philosophy or marketing on why LLMs are good/bad for you.
I’m wondering what is the largest model you have tried to run, at what quantization level, and what are your thoughts on it? Did you make any comparison with commercial ones? Have you experimented with local LLMs in agentic tasks via Codex, OpenCode, Qwen Code and similar? Did you experiment with spec-driven development in Ansible?
Personally I’ve maxed out at the 30b-parameter class of models with 4-bit quantization and small contexts (up to 64K), like Qwen3 Coder 30b, gpt-oss 20b, GLM 4.7 flash. That’s pretty much the best I can fit into 24 GB of VRAM.
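For anyone wondering why 30b at 4-bit is about the limit for 24 GB, the back-of-the-envelope arithmetic can be sketched as below. This counts the weights only; the context (KV cache) and runtime overhead come on top, and real 4-bit quantization formats usually store a bit more than 4 bits per weight, so treat the result as a lower bound. The function name is made up for illustration.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Weights only: context and runtime overhead are not included.
for label, params in [("30b-class", 30), ("20b-class", 20)]:
    gib = weight_memory_gib(params, bits_per_weight=4)
    print(f"{label} at 4-bit: ~{gib:.1f} GiB of weights")
```

A 30b-class model lands at roughly 14 GiB of weights, which leaves the rest of a 24 GB card for context and overhead.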
My Framework Desktop has 128GB of RAM, of which 96GB can be allocated to the GPU as VRAM, which means I am able to run fairly large models relatively easily.
The examples I gave in my presentation run very well out of the box using the Vulkan backend:
The number of active parameters and whether the models are using a mixture of experts (MoE) architecture influence the performance quite a bit. For example, even though gpt-oss-120b is very large, it is really quick due to MoE whereas the devstral model is smaller, but it is also slower.
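A rough way to see why active parameters matter: token generation is often limited by memory bandwidth, because the weights used for each token have to be read from memory. Under that simplifying assumption, an upper bound on speed is bandwidth divided by the bytes read per token. The sketch below uses illustrative placeholder numbers, not measurements or specifications of the models or hardware mentioned above.

```python
def decode_tokens_per_second(
    active_params_billions: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Rough upper bound on generation speed if memory bandwidth is the limit."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Placeholder figures for illustration only.
BANDWIDTH_GB_S = 200  # hypothetical memory bandwidth
moe = decode_tokens_per_second(5, 4, BANDWIDTH_GB_S)     # MoE: few active params
dense = decode_tokens_per_second(24, 4, BANDWIDTH_GB_S)  # dense: all params active
print(f"MoE, 5B active:    ~{moe:.0f} tokens/s upper bound")
print(f"dense, 24B active: ~{dense:.0f} tokens/s upper bound")
```

The total size of the MoE model still decides whether it fits in memory at all; only the per-token cost follows the active parameter count.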
I don’t know about “agentic coding” but the models run fine in a coding context, I am using Zed and it works fine: https://zed.dev/
I haven’t done any formal comparisons but I do have a $20/mo subscription to Claude for the time being. It is very good but that doesn’t mean that I use it for everything.
We could establish some Ansible baseline benchmarks for LLMs so that we know how they compare. This would also give people some sense of what the models are capable of and how much help they can be.
While testing LLMs myself, I used a rather simple request on a very convoluted and unorthodox pile of Ansible code. The request was to implement argument_specs.yml by analyzing the code of the role and its interrelated files (tasks imported from outside the role, global variables…). Then I “fuzzily” compared the generated output to the expected, near-perfect, hand-written argument_specs.yml. I had an LLM do the comparison, so the model was effectively critiquing its own output. A more deterministic scoring mechanism is still lacking, but it gave me a sense of what the models are capable of.
Suffice it to say, commercial LLMs like gpt-5.1-codex-max or claude-opus-4.6 did a fine job. There were some hallucinations and errors here and there, but overall they halved the time needed to write the argument_specs.yml. Of the free LLMs I’ve tried, Qwen3 Coder 30b showed promising results on the same benchmark. Older and smaller models produced output I could only laugh at.
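One possible shape for a more deterministic score, as a minimal sketch: compare the `options` mapping of the hand-written spec with the generated one, and count which options were found, which were missed, and which were invented. The function, the choice of compared fields and the example variables are all hypothetical; in practice the two mappings would be loaded from the YAML files first.

```python
def score_argument_specs(expected: dict, generated: dict) -> dict:
    """Score generated role options against the hand-written reference."""
    expected_names, generated_names = set(expected), set(generated)
    common = expected_names & generated_names

    precision = len(common) / len(generated_names) if generated_names else 0.0
    recall = len(common) / len(expected_names) if expected_names else 0.0

    # For options present in both, require these fields to agree exactly.
    fields = ("type", "required", "default")
    exact = sum(
        all(expected[name].get(f) == generated[name].get(f) for f in fields)
        for name in common
    )
    field_accuracy = exact / len(common) if common else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "field_accuracy": field_accuracy,
        "missing": sorted(expected_names - generated_names),
        "hallucinated": sorted(generated_names - expected_names),
    }


# Hypothetical example data, shaped like argument_specs -> main -> options.
expected = {
    "app_port": {"type": "int", "required": True},
    "app_user": {"type": "str", "default": "app"},
}
generated = {
    "app_port": {"type": "int", "required": True},
    "app_user": {"type": "str", "default": "root"},
    "app_debug": {"type": "bool"},
}
print(score_argument_specs(expected, generated))
```

Descriptions are deliberately left out of the comparison since they are free text; that part would probably still need a fuzzy or human review.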
Yeah, that sounds pretty aligned with what most people hit in practice.
On consumer hardware, 24GB VRAM is basically the “ceiling of sanity” for anything beyond 30B-ish models at 4-bit. Once you go above that, you either start heavily compromising context, offloading to CPU (which kills latency), or just accept slower iteration cycles.
I’ve also found that in the 20B–35B range, quality differences between models become more about training/data + instruction tuning than raw parameter count. A well-tuned 30B (like the ones you mentioned) often feels closer to much larger proprietary models than you’d expect, especially for coding tasks.
For agentic workflows (Codex-style loops, tool use, etc.), smaller but more responsive models tend to actually perform better in real use. Once latency creeps up, multi-step reasoning pipelines start to break down in practice, even if the single-response quality is higher.
I’ve experimented with Qwen-style coder models and similar setups in tool-using loops, and the biggest bottleneck usually isn’t reasoning ability; it’s stability across steps (staying on task, not drifting, not over-editing outputs). That’s where commercial models still tend to be ahead, especially in long-horizon agent tasks.
I completely agree with this very important point. In my experience, not much “model intelligence” is needed to generate Ansible YAML content, i.e. smaller models will suffice.
I think that if there are enough guardrails (lint rules, good implementation examples, fast and light tests) and agent tasks are divided into manageable chunks, that should move the horizon of what is possible with smaller (local) models. But I have yet to quantify that.
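To illustrate what I mean by guardrails, here is a minimal sketch of a generate, check, retry loop. Everything in it is hypothetical: `fake_model` stands in for a call to a local model, and `require_task_names` is a toy string check standing in for a real linter or test run.

```python
from typing import Callable

Check = Callable[[str], list[str]]  # returns a list of error messages


def generate_with_guardrails(
    generate: Callable[[list[str]], str],
    checks: list[Check],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Regenerate until every check passes, feeding errors back to the model."""
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = [error for check in checks for error in check(candidate)]
        if not feedback:
            return candidate, attempt
    raise RuntimeError(f"checks still failing after {max_attempts} attempts: {feedback}")


def require_task_names(text: str) -> list[str]:
    """Toy check: every top-level list item must start with a name."""
    items = [line for line in text.splitlines() if line.startswith("- ")]
    if all(line.startswith("- name:") for line in items):
        return []
    return ["every task needs a name"]


def fake_model(feedback: list[str]) -> str:
    """Stand-in for a model call: fixes its output once it gets feedback."""
    if feedback:
        return "- name: Install nginx\n  ansible.builtin.package:\n    name: nginx\n"
    return "- ansible.builtin.package:\n    name: nginx\n"


tasks, attempts = generate_with_guardrails(fake_model, [require_task_names])
print(f"accepted after {attempts} attempt(s):\n{tasks}")
```

The point of keeping the checks fast and small is that a less capable model can afford several cheap attempts on a narrow task.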