My Journey with LLMs and AI Agents: From Skeptic to Adopter

24 May 2026 - 12 minute read

When AI services like ChatGPT and Claude Code hit the scene, I was impressed by what they could do against LeetCode problems and mini projects, but I was skeptical that they would be able to handle the complexity of a real product’s codebase.

We had access to several different AI services and tools at my day job at Gather, but every time I tried using them for code generation, I wasn’t happy with the results. They didn’t follow existing patterns in the codebase, failed to handle edge cases, or in some cases output code that didn’t even build, especially for automated tests.

It wasn’t until 2026 that I started to see an uptick in the use of AI and LLMs among people on my team and developer friends - it felt like something had genuinely changed, especially with the recent release of top-of-the-line Claude and OpenAI models. They were frequently described as a marked improvement on previous models, so I decided to give it another go to see what I was missing.

The Early Struggles: Too Much Human in the Loop

I had a list of small issues in our app that I had wanted to deal with for a while, but I couldn’t prioritise hands-on time because there was always a higher priority task in our backlog.

As I knew these issues well and knew what the solutions should be, I felt this was a solid test of whether using AI would improve my overall productivity. Gather’s engineering AI tool stack was:

Devin - An AI agent in the cloud that has its own virtual computer to work in and build projects from prompts.
Cursor - A code editor with AI-powered autocomplete and an agent mode for prompt-based code generation.
Claude Code - An AI agent for prompt-based code generation on your local projects.
Grapevine - Our own internal “ChatGPT” that’s been trained on our knowledge sources in Notion, Slack, GitHub, etc.

Devin was the tool I settled on because it uses its own virtual computer and workspace, which meant it could “work” alongside me while I was working on other tasks. It wouldn’t change or interfere with my local workspace in any way.

It did okay at first. It would create a GitHub PR and our bots would leave feedback, but I had to keep sending that feedback back to Devin to make adjustments. Other times it would generate a PR with a solution that was incorrect or not what I had in mind, and I would have to instruct it to backtrack, give it the direction I wanted, and repeat the process.

It required too much interaction with me in the loop to get to a reasonable result instead of the holy grail of being completely hands off.

I didn’t care about how fast it could generate code, but about how fast I could produce deliverables that impact our users. Having to switch problem context between my main tasks and these Devin tasks was slowing me down rather than helping me deliver that value.

The Turning Point: Playbooks and “One-Shot” Workflows

That’s when I discovered we actually had “playbooks” already added to our account that we weren’t actively using. (They were left over from our previous AI products team. Read more.)

A playbook (which is Devin’s equivalent of Claude Skills) is a mix between a knowledge base and a boilerplate prompt. It gives the agent a repeatable framework or high-level approach on exactly how to deal with the task it’s been given.

In our case, the playbook splits the workflow into multiple phases. The first phase is research and classification, where the agent investigates the codebase and proposes a plan for me to approve or tweak. Once the plan is approved, it moves on to implementation, does an internal code review, and submits a PR to GitHub.

Crucially, the playbook instructs the agent to keep an eye on CI tasks and automated tests. If any of those fail, or if a review bot (or even another engineer) leaves feedback, the agent handles it automatically.
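To make the shape of that loop concrete, here is a minimal sketch of the phases as a small state machine in Python. This is not Devin’s playbook format or API. The phase names, the event names, and the choice to send a tweaked plan back to research are all assumptions made for illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    RESEARCH = auto()
    AWAITING_PLAN_APPROVAL = auto()
    IMPLEMENTATION = auto()
    INTERNAL_REVIEW = auto()
    PR_OPEN = auto()
    READY_FOR_HUMAN_REVIEW = auto()


# (current phase, event) -> next phase. All event names are hypothetical.
TRANSITIONS = {
    (Phase.RESEARCH, "plan_proposed"): Phase.AWAITING_PLAN_APPROVAL,
    (Phase.AWAITING_PLAN_APPROVAL, "plan_approved"): Phase.IMPLEMENTATION,
    # Assumption: a tweaked plan sends the agent back to research.
    (Phase.AWAITING_PLAN_APPROVAL, "plan_tweaked"): Phase.RESEARCH,
    (Phase.IMPLEMENTATION, "code_written"): Phase.INTERNAL_REVIEW,
    (Phase.INTERNAL_REVIEW, "review_passed"): Phase.PR_OPEN,
    # Failed CI or reviewer feedback loops back without a human prompt.
    (Phase.PR_OPEN, "ci_failed"): Phase.IMPLEMENTATION,
    (Phase.PR_OPEN, "review_feedback"): Phase.IMPLEMENTATION,
    (Phase.PR_OPEN, "checks_green"): Phase.READY_FOR_HUMAN_REVIEW,
}


def advance(phase: Phase, event: str) -> Phase:
    """Return the next phase, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in phase {phase.name}") from None


def run(events: list[str]) -> Phase:
    """Replay a sequence of events starting from the research phase."""
    phase = Phase.RESEARCH
    for event in events:
        phase = advance(phase, event)
    return phase


if __name__ == "__main__":
    final = run([
        "plan_proposed", "plan_approved",
        "code_written", "review_passed",
        "ci_failed",  # the agent goes back and fixes it on its own
        "code_written", "review_passed",
        "checks_green",
    ])
    print(final.name)
```

The only transitions that need a human are the plan approval and whatever happens after the final phase. Everything between opening the PR and a green build is handled inside the loop.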

So I tried another ticket, but this time I attached a playbook. The difference was night and day. It managed the full loop, and after a couple of hours, pinged me to say, “I’m done, ready for your human review.”

I no longer had to be involved in the iteration loops at all and this is where I truly felt like a task could be fully handed off to an agent. I could give it a ticket, go off and do my deep work, and whenever I had a clear gap, be it a few hours or a few days, I could come back in to prompt for changes, review the PR and/or pull it down to test locally.

And because the playbooks are reusable from task to task, the input I need to give the AI is minimal. I just give it the task and tell it to use the playbook instead of having to re-prompt the guidelines on how to approach the problem, how to deal with our tool stack, or how to handle the automated feedback loop every time.

The knowledge base aspect of the playbook also ensures the agent follows our specific engineering patterns. It explicitly tells the agent: “This is the specific way we do things. You must run the lint and prettier tasks via these commands.” It provides guidelines on where to look for patterns we use, and just as importantly, examples of what not to do.
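One way to picture the “knowledge base plus boilerplate prompt” idea is as a small data structure that renders into the same instructions every time, so that only the task changes. The sketch below is an assumption about how such a thing could be modelled, not how Devin stores playbooks. The `Playbook` class, its fields, and the angle-bracket placeholders are hypothetical, and the real commands and paths are deliberately left unfilled.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Hypothetical model of a reusable playbook (not Devin's real format)."""
    name: str
    phases: list[str]
    required_commands: list[str] = field(default_factory=list)
    pattern_locations: list[str] = field(default_factory=list)
    anti_patterns: list[str] = field(default_factory=list)

    def render(self, task: str) -> str:
        """Combine the one-line task with the reusable guidance."""
        lines = [f"Task: {task}", f"Playbook: {self.name}", "", "Phases:"]
        lines += [f"  {i}. {phase}" for i, phase in enumerate(self.phases, start=1)]
        sections = [
            ("You must run:", self.required_commands),
            ("Look for existing patterns in:", self.pattern_locations),
            ("Do not:", self.anti_patterns),
        ]
        for heading, items in sections:
            if items:  # skip sections the playbook leaves empty
                lines += ["", heading]
                lines += [f"  - {item}" for item in items]
        return "\n".join(lines)


# Placeholder values only: the actual commands and paths are not given here.
TECH_DEBT = Playbook(
    name="tech-debt",
    phases=[
        "Research and classification, then propose a plan",
        "Implementation",
        "Internal code review",
        "Submit a PR and handle CI and review feedback",
    ],
    required_commands=["<lint command>", "<prettier command>"],
    pattern_locations=["<path to reference code>"],
    anti_patterns=["<example of what not to do>"],
)

if __name__ == "__main__":
    print(TECH_DEBT.render("Add confetti to the virtual office"))
```

The point of the design is that the per-task input shrinks to the single string passed to `render`, while the guidance that used to be re-typed in every prompt lives in one shared, versionable place.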

We also have a team culture that embraces these tools. If a session produces something worth repeating - a useful approach, a pattern that worked well - you can ask the agent to turn that session into a new skill. It will output it directly into your shared collection, ready to be reused by anyone on the team in future tasks. The playbook and skill library essentially grows itself.

The Parallel Workflow: Compartmentalising AI

I found that the real power of these tools wasn’t just in what or how fast they could generate, but in how they changed the way I could structure my work. I started treating agents as a parallel workflow.

As Devin operates on a remote computer, I could easily compartmentalise tasks. I could have Devin start on a few small tasks in the background, including planning and suggesting solutions, while I stayed fully focused on a major task on my main computer.

This was the turning point for me to adopt AI more into my workflows. The ability to offload small, clearly scoped pieces of work or chores, without the friction of constantly context switching my own workspace, is where I found the real value.

A good example is a recent week where I was heavily focused on our payment system. Because the payment system is essential to the business and we can’t afford for it to be wrong, this work required my full focus as much as possible. I had to ensure I understood every edge case so we could be confident when it shipped since it directly affects revenue.

While I was deep in that work, I had a backlog of low-priority tech debt and small features piling up. It included adding confetti to the virtual office, and removing the need for a licensed tool to generate the image files for some art by replacing it with a command line tool everyone can access.

I gave Devin the spec for these tasks and features, along with examples of how other games and projects had solved similar problems. It asked me a few clarifying questions, and then I let it go off and work in the background. It didn’t matter how long it took, because I was only going to check on it when I hit a natural breakpoint in my payment system work.

Workflow of using Devin in day to day workflow

That is what my week generally looks like now: one very deep, high-focus task running on my local machine, with a few bits of tech debt clearance running in parallel via an agent in the background.

What Makes a Task Suitable for Parallel Workflows?

To make this parallel workflow actually useful, a task needs to meet two main criteria:

I must know the relevant area of the codebase well enough. I have to be comfortable reviewing the code output and fully understanding it when it comes back. If I don’t know the system, I don’t have the confidence to review the PR and own the output.
There must be a very clear definition of done. The task needs to have a concrete expected output and a straightforward way to verify it.

If a task requires constant iteration or experimentation to see what works, it is a bad fit for an autonomous agent.
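Those criteria are simple enough to write down as a checklist. The sketch below is only an illustration of the reasoning. The `Task` fields and the two example tasks’ flag values are assumptions for the example, not a tool we actually run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    title: str
    know_the_area: bool             # can I confidently review the output?
    clear_definition_of_done: bool  # concrete output, easy to verify
    needs_experimentation: bool = False  # "see what works" style iteration


def blockers(task: Task) -> list[str]:
    """List the reasons a task is a poor fit for handing to an agent."""
    reasons = []
    if not task.know_the_area:
        reasons.append("I can't confidently review this area of the codebase")
    if not task.clear_definition_of_done:
        reasons.append("there is no clear definition of done")
    if task.needs_experimentation:
        reasons.append("it needs constant iteration or experimentation")
    return reasons


def fits_parallel_workflow(task: Task) -> bool:
    return not blockers(task)


if __name__ == "__main__":
    # Hypothetical flag values, chosen for illustration.
    confetti = Task("Add confetti to the virtual office", True, True)
    ui_tweak = Task("UI change I need to see on screen first", True, False,
                    needs_experimentation=True)
    for task in (confetti, ui_tweak):
        verdict = "hand off" if fits_parallel_workflow(task) else "do it myself"
        print(f"{task.title}: {verdict}")
```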

For example, we had a task recently where we needed to add a warning when a user deletes a chat channel (stating that all related data will also be deleted). I ended up doing that manually because I didn’t know exactly how I wanted it to look until I saw it on screen.

Initially, I thought it would just be a bit of text, but that didn’t stand out enough. So I changed it to a box-out with an exclamation mark. That still wasn’t quite right, so I had to tweak the colours further. Because we don’t yet have a clear playbook for UI iteration, and because the task required visual experimentation rather than a strict logical output, handing it to Devin would have been frustrating and slow.

The confetti script, on the other hand, perfectly met the criteria for a parallel task. I knew the area well, I knew exactly what the output should look like, and it was easily testable. Even though the final output was visual, everything prior to that was systems work, making it perfect for a Devin task. If it worked, great. If it was close, I could ask Devin to revise it. If it was completely wrong, I could reject the code entirely.

In the end, Devin managed to nail everything but the final visual implementation of how the confetti looked, and that required me to manually iterate on it before I was happy to ship it into production.

Gather Confetti Feature — Throwing Confetti in Gather 🎉

The Hidden Costs: Mental Bandwidth and Team Capacity

So, am I actually “faster”? That’s the tough part to measure. I definitely feel more productive, and I’m definitely shipping more PRs by splitting my workload this way.

However, there is a flip side that rarely gets talked about: team capacity.

It’s fine being able to spit out so many PRs and generate so much code, but this code still needs to be reviewed. Our system relies on having one person who hasn’t been involved in the code (or the prompt generation) review the PR before it gets merged to main.

My review process for AI-generated code is exactly the same as it would be for a human contributor. If the solution isn’t immediately understandable, or if it overreaches its scope, I challenge it. If it still doesn’t make sense, I completely reject the PR and start again, either by taking over the task myself or using a different tool. I never submit code for a human PR review unless I fully understand it.

In terms of velocity as a company, we have to consider our team’s mental capacity to not only write features but also review them. If I had a whole team of “me’s” doing the same thing, would we have enough capacity to review all the output? Honestly, I’m not entirely sure.

What Needs to Be in Place Before This Works?

I’m actually still trying to find the definitive answer to this.

My current theory is that a major reason this parallel workflow succeeds for me at work is that we have a large, well-managed codebase. The code quality is high, and that standard has been maintained throughout the years. This means the agents have excellent data to work from. When Devin generates code, it’s pattern-matching against a clean, consistent source rather than pulling fragmented examples from the internet.

The frontier models certainly help, and the tools built around them are getting better, but my gut feeling, based on my own experience and anecdotal conversations with other developers, is that the underlying quality of the codebase matters immensely.

The part I’m trying to figure out now is what happens when you don’t have that foundation. If you start from a new codebase, or a messy one, what do you need to put in place for these agents to work well? That’s something I’m currently testing myself on a personal project, and I’m hoping to get a better understanding of it soon.

In the End…

Is the 10x or 100x developer claim actually true? Honestly, I don’t think so. It’s not about the speed of work; it’s about the speed of deliverables that actually reach our users.

I want to understand why my experience at work has been so different from that of developers who are experimenting and struggling to get satisfactory results. What is needed to bridge the gap here?

Steven Yau's Blog
