The recording is available from the live stream here.
I am back in Canada now and after some rest I am following up on the next steps I proposed.
First, I’ve opened a pull request on Codeberg with the MCP server as it ran in the talk’s demo.
Testing, questions, comments & reviews welcome.
You might notice that, between README.rst and AGENTS.md, there is heavy guidance for AI assistance and the MCP server. This is expected and intentional: this is a tool for LLMs, after all.
It ships with the necessary context so that LLMs can work with and on the MCP server.
There are still some tweaks I would like to make, but I feel this is a good first iteration that we can improve and build upon.
I wrote down some ideas in TODO.md for use cases we could probably support, like:
- scanning files, results or host facts for sensitive information/tokens that could benefit from no_log or from being ignored by ara
- comparing two hosts or two playbooks to find out how they might differ
- summarizing tasks that resulted in changes, using some heuristics since modules can falsely report CHANGED, like a shell task running echo foo without changed_when: false (see the sketch after this list).
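To make the no_log and changed_when ideas concrete, here is a minimal Ansible sketch; the task names and values are made up for illustration. The first task hides its output from logs and callbacks with no_log, and the second would always report CHANGED unless changed_when tells Ansible otherwise.

```yaml
- name: Fetch an API token (made-up example; output hidden from logs and callbacks)
  ansible.builtin.shell: echo "s3cr3t-t0ken"
  register: api_token
  no_log: true

- name: Print a message (shell tasks always report CHANGED unless told otherwise)
  ansible.builtin.shell: echo foo
  changed_when: false
```

A change-summarizing heuristic could discount tasks like the second one, while a scanner could flag secrets that are missing no_log like the first.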
I still plan to figure out how to hook this up to ara’s CI jobs in Codeberg/Forgejo Actions so that a model can help troubleshoot them.
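Nothing is wired up yet, but as a rough idea of where such a hook could live, here is a hypothetical Forgejo Actions workflow skeleton; the file name, runner label, test command and the troubleshooting step are all assumptions, not something that exists in ara’s CI today.

```yaml
# .forgejo/workflows/tests.yml (hypothetical)
on: [push, pull_request]

jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the test suite
        run: tox  # placeholder for whatever the real CI job runs
      - name: Gather context for troubleshooting
        if: failure()
        # A future step could collect ara's playbook and task results here and
        # hand them to a model (for example through the MCP server) so it can
        # help explain why the job failed.
        run: echo "feed logs and ara results to a model here"
```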
I haven’t looked at opportunities to leverage LLM “skills” yet but they could be interesting in different ways.
I think there is a lot of potential in asking competent models about our infrastructure and automation, so I’m not worried about us coming up with other ideas, but that’s what I’ve got for now.
@rfc2549 nice talk indeed, and the only one I felt got down to practical utilization of LLMs, with some demos and all. All the other talks were mostly either philosophy or marketing about why LLMs are good/bad for you.
I’m wondering: what is the largest model you have tried to run, at what quantization level, and what are your thoughts on it? Did you make any comparisons with commercial ones? Have you experimented with local LLMs on agentic tasks via Codex, OpenCode, Qwen Code and similar? Did you experiment with spec-driven development in Ansible?
Personally, I’ve maxed out at the 30b parameter class of models with 4-bit quantization and small contexts (up to 64K), like Qwen3 Coder 30b, gpt-oss 20b and GLM 4.7 flash. That’s pretty much the best I can fit into 24 GB of VRAM.
My Framework Desktop has 128GB of RAM, of which 96GB can be allocated to the GPU as VRAM, which means I am able to run fairly large models relatively easily.
The examples I gave in my presentation run very well out of the box using the Vulkan backend.
The number of active parameters and whether the models use a mixture of experts (MoE) architecture influence performance quite a bit. For example, even though gpt-oss-120b is very large, it is really quick due to MoE, whereas the Devstral model is smaller but also slower.
I don’t know about “agentic coding”, but the models run well in a coding context. I am using Zed and it works fine: https://zed.dev/
I haven’t done any formal comparisons, but I do have a $20/month subscription to Claude for the time being. It is very good, but that doesn’t mean I use it for everything.
We could establish some baseline Ansible benchmarks for LLMs so that we know how they compare. This would also give people a sense of what they are capable of and how much help they can be.
While testing LLMs myself, I’ve used a rather simple request on a very convoluted and uncommon/unorthodox pile of Ansible code. The request was to implement argument_specs.yml by analyzing the code of the role and its interrelated files (tasks imported from outside the role, global variables…). Then I “fuzzily” compared the generated output to the expected, near-perfect, hand-written argument_specs.yml. I had an LLM do the comparison, so it was essentially critiquing itself. A more deterministic scoring mechanism is still lacking, but it gave me a sense of what they are capable of.
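For anyone unfamiliar with the file, here is a minimal, entirely made-up meta/argument_specs.yml of the kind the models were asked to generate; the real, hand-written spec for a convoluted role is of course much longer.

```yaml
# meta/argument_specs.yml (made-up role and variable names)
argument_specs:
  main:
    short_description: Install and configure the example service
    options:
      example_service_port:
        type: int
        default: 8080
        description: TCP port the service listens on.
      example_service_packages:
        type: list
        elements: str
        required: true
        description: Packages to install before configuring the service.
```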
Suffice it to say, commercial LLMs like gpt-5.1-codex-max or claude-opus-4.6 were able to do a fine job. There were some hallucinations and errors here and there, but overall it halved the time needed to document the argument_specs.yml. Of the free LLMs I’ve tried, Qwen3 Coder 30b showed promising results on the same benchmark. Older and smaller models produced output I could only laugh at.