Over the past few months, I led a team through intensive iterations on an AI development tool. We hit many walls and validated some counterintuitive findings. The core conclusion: the efficiency, capability, and quality of LLM-assisted development are thoroughly proven - no question about it. But to extract maximum value, you need the right collaboration methods. Some align with traditional engineering practices; others are the complete opposite.
I won’t prescribe a perfect methodology. AI capabilities are evolving rapidly - today’s best practices may be obsolete tomorrow. What follows is more of a field notebook: what worked, what failed, and why. The most surprising discovery was the impact on traditional engineering practices - we gradually stopped our CI pipeline, abandoned manual code review, and fully embraced AI pair programming. This isn’t laziness; we found more effective approaches through practice. Sounds like heresy, but there’s solid logic behind it.
Why We Fully Embraced AI Pair Programming
I started with a naive idea: since AI is so powerful, why not let it run autonomously overnight and harvest results in the morning? Humans sleep, machines work - everyone plays to their strengths. I tried it. The result was a profound lesson.
That night I set up tasks and let AI work autonomously while I slept peacefully. Next morning, all unit tests passed. I was thrilled, thinking I’d found the holy grail of productivity. Then I spent an entire day manually validating and discovered the results weren’t what I wanted at all. Worse, previously correct code had been broken. The entire codebase was in a state of “looks normal, actually a mess.” Since all tests were green, problems were perfectly hidden. I nearly merged this code into the main branch.
What went wrong? After repeated retrospectives, I found the root cause: drift. AI might deviate slightly from your intent at each step. Looking at any single step, the deviation is small, even reasonable - you can’t spot obvious flaws. But without someone pulling it back in time, deviations accumulate. Like an uncalibrated compass that’s off by one degree per step - after a hundred steps, you’re completely lost. After hours of autonomous work, AI may have severely deviated from the original goal without realizing it.
What’s worse, AI will “self-rationalize,” drifting further in the wrong direction while becoming increasingly self-consistent. It won’t stop to question whether it’s off track - from its perspective, each reasoning step is logical. It finds perfect justifications for its deviations, even modifying tests to fit incorrect implementations, ultimately presenting you with a “self-consistent” result. This self-rationalization ability is an asset when going the right direction, but a disaster when going wrong.
An analogy: AI is like an extremely capable executor without global perspective - similar to an engineer with top-tier technical skills but limited understanding of business objectives. It can execute the current step with high quality, producing clean code with clear logic, but may not know if this step is still on the path to the correct goal. It needs someone with global perspective to constantly calibrate direction - this is the value of pair programming.
I completely changed my collaboration approach, shifting from “fully autonomous” to “small-step iteration” pair programming mode. Check AI output every 5-10 minutes; when deviations appear, immediately pull it back instead of letting errors compound. Confirm each milestone before proceeding to the next step. It seems less efficient - humans spend more time watching - but actual effectiveness improved dramatically because there’s no need for massive post-hoc validation and rework. The cost of rework is ten to a hundred times higher than mid-course correction.
When can you safely let AI work independently? High-certainty tasks: modules where interfaces are clearly defined and only implementation is needed, with no design decisions involved; work with clear boundaries that doesn’t touch other modules, with no gray areas; repetitive operations like batch refactoring, adding caching, or reskinning - tasks where “correct” is objectively verifiable. Team members validated this judgment: “The interface was already deterministic - I just needed to implement one module. With almost no feedback to AI, it produced a working first version.” Another member said: “Adding L2 caching was a very certain task. AI nailed it in one shot, no adjustments needed, straight to PR.” These cases share a common trait: task boundaries and acceptance criteria are clear, with no ambiguity requiring human judgment.
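To make “clearly defined interface” concrete, here is a minimal sketch in TypeScript. The names (`L2Cache`, `RedisL2Cache`) are hypothetical, not from our project - the point is the shape of a high-certainty task: once a contract like this exists, “correct” simply means “implements the interface and passes the existing tests.”

```typescript
// Hypothetical illustration - names are invented; only the task shape matters.
export interface L2Cache<V> {
  get(key: string): Promise<V | undefined>;
  set(key: string, value: V, ttlSeconds: number): Promise<void>;
  delete(key: string): Promise<void>;
}

// The delegated task reduces to: implement L2Cache backed by Redis and make
// the existing L2Cache test suite pass. No design decisions, no cross-module
// edits - the interface itself is the acceptance boundary.
export class RedisL2Cache<V> implements L2Cache<V> {
  async get(key: string): Promise<V | undefined> {
    throw new Error("left for the AI to implement");
  }
  async set(key: string, value: V, ttlSeconds: number): Promise<void> {
    throw new Error("left for the AI to implement");
  }
  async delete(key: string): Promise<void> {
    throw new Error("left for the AI to implement");
  }
}
```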
Conversely, exploratory projects, work involving design decisions, and cross-module changes are unsuitable for AI working independently. These scenarios share a characteristic: what constitutes “correct” is inherently ambiguous and requires human judgment. This demands pair programming - humans and AI working closely together, with humans handling direction and judgment while AI handles execution and implementation. This division lets both parties maximize their value.
The Intent Paradox: Why Doing Beats Thinking
This was my most counterintuitive discovery, directly challenging the conventional wisdom of “think it through before acting”: you cannot produce correct Intent without actually doing the work.
Traditional software engineering methodology says you should first do requirements analysis, write design documents, think through every detail, then implement. This approach assumes thinking can substitute for practice - that through sufficiently deep analysis, you can figure out all problems before starting. But in AI-assisted development, this assumption is often wrong. Requirements you think are clear actually contain massive flexibility and uncertainty. Many critical technology choices only surface after you’ve built something. Pure “thinking” only captures surface-level information; true complexity hides in the details.
Here’s a concrete example. The team debated using Tauri versus Electron for a desktop client - seemingly a simple technology choice. We analyzed thoroughly: benchmarks, bundle size, memory usage, community activity. We concluded Tauri was better: smaller bundles, better performance, more modern stack. The analysis seemed rigorous, every argument backed by data. We were confident in this decision.
Then we built the first version with Tauri. During code review, we discovered the two frameworks have completely different architectural philosophies - and this difference matters far more than benchmark numbers. Tauri uses a sidecar architecture: the application’s core logic is compiled into a separate sidecar process that communicates with the Tauri main UI via IPC. The frontend renders through the system’s native WebView while backend logic runs in a Rust process - the two sides are isolated. This architecture strongly constrains how you divide frontend and backend responsibilities. You must think through which logic goes where, and all cross-boundary calls must go through IPC. Electron is completely different - it bundles Chromium and Node.js together, with frontend and backend running in the same process space. Communication has virtually no cost, boundaries can be blurry, making it suitable for rapid iteration and complex UI logic.
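A minimal sketch makes the boundary difference tangible. Everything here is hypothetical - the two halves would never live in the same project, the `read_config` command is invented, and the import path shown is the Tauri v2 JS API - but it shows why Tauri forces you to design the frontend/backend split while Electron lets you defer it.

```typescript
// Illustrative sketch only - the two halves belong to two different apps.
import { invoke } from "@tauri-apps/api/core"; // Tauri v2 JS API
import { readFile } from "node:fs/promises";   // available to Electron code with Node access

// Tauri: every call from the WebView into backend logic crosses the IPC
// boundary. "read_config" must be registered as a command on the Rust side,
// and arguments/results are serialized across the process boundary.
async function loadConfigTauri(): Promise<string> {
  return invoke<string>("read_config", { path: "app.config.json" });
}

// Electron: in the main process (or a renderer with Node integration),
// the same logic is a direct import - no boundary to design up front.
async function loadConfigElectron(): Promise<string> {
  return readFile("app.config.json", "utf8");
}
```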
This architectural difference was invisible before we started building - it doesn’t appear in any benchmark comparison articles. During discussions, we looked at surface metrics. The real architectural differences only emerged after writing code and hitting concrete problems. During implementation, we discovered fundamental conflicts between Tauri’s model and our requirements - conflicts invisible at the design stage. “Paper knowledge feels shallow; true understanding requires practice” - still applies in the AI era.
This is the Intent Paradox: you need Intent to guide implementation, but correct Intent can only be obtained through implementation. A chicken-and-egg problem with no perfect solution, only pragmatic ones. The solution: build a prototype first, clarify intent through practice, refine as you go, don’t expect perfection on the first try. Accept that intent will iterate; treat the first implementation as a learning process, not the final deliverable.
This leads to another counterintuitive discovery: rebuilding from scratch after completing a prototype is dramatically faster - unbelievably so.
Back to the Tauri vs Electron example. Building the first version with Tauri took about an hour, including hitting pitfalls, debugging, and learning various concepts. This process fully revealed the requirements’ complexity and clarified the real technical needs. Then I decided to scrap it and rebuild with Electron. I let AI start working. Five minutes later, it said the task was complete, all unit tests passing. I thought it had made a mistake or cut corners. Manual testing showed all features worked correctly, with high code quality too.
Why was the second attempt so fast? Clear logic behind it. All pitfalls had been encountered once - which areas have traps, which APIs have gotchas, which edge cases need handling - all exposed in the first implementation. Intent was crystal clear with no ambiguity; every feature’s exact behavior was in my head. Technology choices were settled, no more exploratory work needed. All details were captured in context; AI’s second implementation had nothing to guess or infer. These four factors combined to make the second attempt pure “translation” work: translating clear intent into code. This deterministic translation is exactly what AI excels at.
This reframed my understanding of prototypes. A prototype’s value isn’t producing code; it’s producing learning. The first implementation is for learning: learning the true complexity of requirements, learning the actual limitations of technical approaches, learning various edge cases and exception scenarios. The second implementation is for production - by then everything is clear, and execution efficiency is remarkable. If you treat the first implementation as the final deliverable, you’ll agonize over sunk costs, reluctant to start over, ultimately struggling forward with a half-baked solution. The right mindset: prototypes are meant to be thrown away. Don’t get attached.
TDD Becomes Essential: The Two-Phase Task Method
“AI is perfect for TDD. AI desperately needs TDD.” This is one of my most confident conclusions from practice, though it sounds ironic - conventional wisdom says TDD was designed for humans.
Traditional software engineering has always preached TDD; every textbook says writing tests before code is best practice. But honestly, few projects truly follow it strictly - at least most projects I’ve seen can’t. Tight timelines, heavy workloads - testing time gets squeezed out first. Writing code first, then backfilling tests (or not writing tests at all) is normal for many projects. In the human era, TDD was somewhat of an ideal - everyone knew they should do it, but real-world constraints made it hard to implement.
The AI era is completely different. TDD has shifted from “ideal” to “essential.”
AI doesn’t like writing tests - an interesting phenomenon I observed. When asked to write code and tests simultaneously, it instinctively focuses energy on code and goes through the motions on tests - cutting corners, skipping cases, writing weak tests. Coverage looks okay, but actual validation capability is weak. This isn’t a bug but a “preference” - it’s more eager to demonstrate coding ability than testing ability. Having it write tests after code is even worse - it produces tests that “make existing code pass.” These tests have no validation value because they’re designed around implementation, not requirements. Tests become footnotes to code, not guardians.
But if you write tests first, then code, effectiveness transforms dramatically. Correctness rates improve significantly.
I developed a “two-phase task method.” Phase one: only have AI write tests - this is a standalone task. Give it requirements and ask for unit tests that must cover happy paths, unhappy paths, at least N severe exception scenarios (network failures, database unavailability, etc.), and at least N security scenarios (injection attacks, privilege escalation, etc.). The key: AI doesn’t know it will write the implementation next. It thinks seriously about test scenarios because there’s no “implementation detail” baggage - it’s not constrained by specific implementation ideas and can think more purely from a requirements perspective about “what situations need testing.”
Phase two: give the tests to AI (usually in a new session) and have it implement functionality based on requirements and these tests. Now it has clear acceptance criteria and can self-check using tests. Run tests after writing each portion of code, forming a tight feedback loop. Tests become its “navigation system,” telling it whether it’s still on the right track.
Why separate the phases with different sessions? To avoid context contamination. If you have AI write tests then implementation in the same session, it’s already “envisioning” implementation approaches while writing tests - tests unconsciously lean toward implementation. Its tests will happen to validate the implementation approach it imagined, not the requirements themselves. Separated, phase-one AI is in “pure testing mindset,” thinking entirely from requirements and user perspectives; phase two is “pure implementation mindset,” focused on passing tests. Each does its own job.
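As a concrete illustration of a phase-one deliverable, here is a sketch in Vitest style around a hypothetical `deployProject` function - the function, fixtures, and error messages are all invented. What matters is the scenario coverage demanded of the AI, not the specific assertions; in phase two, a fresh session receives this file as the acceptance criteria.

```typescript
// Phase one: tests written from requirements only - no implementation exists yet.
import { describe, it, expect } from "vitest";
import { deployProject } from "./deploy"; // to be implemented in phase two

describe("deployProject", () => {
  // Happy path
  it("deploys a valid static site and returns its public URL", async () => {
    const result = await deployProject({ path: "./fixtures/static-site" });
    expect(result.url).toMatch(/^https:\/\//);
  });

  // Unhappy path
  it("rejects a project with no recognizable build output", async () => {
    await expect(deployProject({ path: "./fixtures/empty" }))
      .rejects.toThrow(/no build output/i);
  });

  // Severe exception scenario
  it("surfaces a clear error when the registry is unreachable", async () => {
    await expect(deployProject({ path: "./fixtures/static-site", registryUrl: "http://127.0.0.1:1" }))
      .rejects.toThrow(/unreachable|timed out/i);
  });

  // Security scenario
  it("refuses paths that escape the project root", async () => {
    await expect(deployProject({ path: "../../etc" }))
      .rejects.toThrow(/outside project root/i);
  });
});
```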
Another discovery: AI’s E2E test design capability is remarkably strong - beyond expectations. I had AI list all possible cases for a deployment feature: edge cases, bad cases, different frameworks, various boundary conditions. It produced dozens of different test scenarios, including edge cases I never thought of. I added a few more, then had it generate all test fixtures. After about fifteen minutes of discussion, it produced over 50 different test projects covering various tech stacks, deployment methods, and exception scenarios. This coverage would take humans days to design manually.
But there are key techniques for AI E2E testing. First, don’t let AI operate browsers - browser-based testing is slow and unstable, with timeout issues making tests fragile. Better to have each application include a verify script using curl or similar tools to check key endpoints - much faster and more stable. Second, disconnect code context. I explicitly tell AI: “When doing E2E testing, you’re not allowed to look at my code. The only thing you can use is this CLI. If the CLI can’t complete an operation, that’s a bug.” This prevents AI from taking shortcuts by directly manipulating internal structures, bypassing normal user paths, ensuring you’re truly testing external interfaces rather than internal implementation. Third, prohibit auto-fixing. When AI finds failing tests, it instinctively wants to fix your code - its instinct is to make tests green. But this can mask real problems, hiding bugs rather than exposing them. Testing phase only records issues; fixing is a separate task.
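Here is a minimal sketch of what such a verify script might look like - the endpoints and expected statuses are hypothetical, and it deliberately only reports results rather than fixing anything, per the third rule above.

```typescript
// verify.ts - hypothetical per-app verification script (Node 18+, global fetch).
// Checks key endpoints over HTTP, records pass/fail, never touches the code.
const BASE_URL = process.env.APP_URL ?? "http://localhost:3000";

async function check(path: string, expectStatus = 200): Promise<boolean> {
  try {
    const res = await fetch(`${BASE_URL}${path}`);
    const ok = res.status === expectStatus;
    console.log(`${ok ? "PASS" : "FAIL"} ${path} -> got ${res.status}, expected ${expectStatus}`);
    return ok;
  } catch (err) {
    console.log(`FAIL ${path} -> ${(err as Error).message}`);
    return false;
  }
}

async function main() {
  const results = [
    await check("/healthz"),             // liveness
    await check("/api/projects"),        // a key read path
    await check("/does-not-exist", 404), // error handling behaves as expected
  ];
  // Record issues only; fixing is a separate task.
  process.exit(results.every(Boolean) ? 0 : 1);
}

main();
```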
Why We Stopped CI and Abandoned Code Review
Now to explain the most controversial part of the title: why we stopped our CI pipeline and abandoned manual code review. It sounds like regression - abandoning decades of recognized software engineering best practices. But in the context of AI-assisted development, the value of these practices is undergoing a fundamental change.
First, CI. What’s the core value of traditional CI? Ensuring every commit passes all checks - lint, test, type check, build - preventing bad code from entering the main branch. Under “humans write code, machines check” mode, this makes sense because humans make various low-level errors that need machine gatekeeping. But in AI-assisted development, AI has already completed all these checks locally. It runs tests while writing code, does lint while fixing issues - code is basically “green” at commit time. Running CI again on GitHub mostly just confirms “yes, it’s green” without finding new issues. Waiting minutes to tens of minutes for CI to complete produces no value.
Does CI still have value? Yes, but the value point has shifted - from “checking code quality” to “deployment automation.” Code quality checks are already done locally. CI now mainly handles deployment pipelines: build, package, release, deploy to various environments. This still needs automation, but it’s a different thing from traditional “continuous integration.”
Now, Code Review. What’s the core value of traditional code review? Knowledge sharing, quality gatekeeping, team collaboration. Experienced engineers reviewing newcomers’ code can spot potential issues, teach best practices, ensure code follows team standards. Under “humans write code, humans review code” mode, this makes sense because code volume is limited and humans have bandwidth to look carefully. But in AI-assisted development, code output has changed by orders of magnitude. AI-produced code is too large - often thousands of lines, with a single feature’s PR potentially touching a dozen files. Humans simply cannot review carefully. Most of the time you glance at the diff, scan key sections, and call it done. This isn’t laziness; it’s reality - human bandwidth is limited.
How do we replace code review’s value? Two approaches. One: use AI for code review, having another AI (or the same AI in a new session) audit the code. It can examine every line carefully, never tires, never misses. The other, more fundamental approach: change what gets reviewed. What should really be reviewed isn’t code but Intent. Intent is the true source code; code is just one compiled artifact of Intent. If the Intent is correct, flawed code can simply be regenerated at low cost. If the Intent is wrong, no amount of beautiful code helps - running faster in the wrong direction is more dangerous. So now we review Intent documents, design decisions, and the rationale behind technology choices, not every line of code.
There’s good news and bad news.
Good news: refactoring has become painless. Previously, changing one name meant touching many files - too much work, so we’d leave mistakes alone, knowing a variable name was misleading but not daring to change it. Small problems accumulated into big problems, technical debt piling up until something eventually broke. Now AI completes full refactoring in minutes, and because Intent is very clear (“rename A to B, global replace”), refactoring is usually an extremely deterministic task that rarely goes wrong. Don’t tolerate technical debt - refactoring cost is already very low. Change what you want, keep the codebase healthy.
Bad news: merging will become the biggest challenge. AI can produce tens of thousands of lines of code changes daily. Everyone is refactoring frequently. Codebase change velocity is ten to a hundred times what it was. Traditional merge workflows will collapse - before you finish reviewing one PR, the codebase has changed three times, with conflicts everywhere. We’re still exploring coping strategies. Current ideas: First, return to multi-repo, using interfaces to isolate different modules. Each module is its own codebase, reducing the potential for merge conflicts. Modules communicate through well-defined interfaces; internal implementation changes don’t affect others. Second, establish clear module boundaries. Each module has its own “territory,” with a single owner for the code within that territory. AI can only work within its own territory, never crossing boundaries to modify others’ code. Third, something like a check-out/check-in mechanism, clearly establishing who’s modifying which module. Only one person can modify a module at a time, avoiding conflicts from parallel modifications to the same area. These ideas are still being validated.
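As a sketch of the first idea - interfaces isolating modules - here is a hypothetical `DeployService` contract. The names are invented; the point is that consumers depend only on the contract, so heavy AI-driven refactoring inside one territory produces no merge conflicts in another.

```typescript
// Hypothetical sketch: the deploy module owns this contract; other modules
// depend on it, never on deploy's internals. The owner of deploy/ can let
// AI refactor freely behind it without touching anyone else's territory.
export interface DeployService {
  deploy(projectId: string): Promise<{ url: string }>;
  status(deploymentId: string): Promise<"pending" | "live" | "failed">;
}

// A consumer (e.g., the UI module) receives the interface, not the implementation.
export async function showDeployResult(svc: DeployService, projectId: string) {
  const { url } = await svc.deploy(projectId);
  console.log(`Deployed to ${url}`);
}
```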
Finally, let’s talk about mindset when collaborating with AI. This may matter more than specific methodologies.
One scene left a deep impression on me. I designed what I thought was an elegant data structure, spent considerable time refining it, convinced it was optimal. Then I had AI implement it. It started implementing but kept getting confused, asking various questions: “Is this handling intentional here?” “Is this edge case deliberate?” I insisted on my approach and told it to continue. It complied, but the code came out awkward, workarounds everywhere. Finally, I reviewed carefully and discovered I was wrong. Its initial confusion was correct - the approach it wanted to take was better from the start: simpler, more robust, easier to understand.
This made me realize something: when you direct a stronger person using weaker methods, they’ll perform worse. Because they can’t think at such a low level - they must suppress their own judgment to follow your instructions. They’re striving to understand your intent, striving to execute your commands, even when your commands are suboptimal. This “obedience” is actually waste - wasting AI’s capability.
After building good AI development tools, we essentially have access to an extremely high-level coder. The question is: are we qualified to direct this person? This isn’t modesty; it’s pragmatic self-examination. AI may be more correct than we think, especially at the implementation level. Maintaining a learning and open mindset, willing to admit our approach isn’t optimal, willing to hear AI’s “suggestions” (even when expressed as confusion and questions) - this is how to truly leverage this tool.
AI capabilities are evolving rapidly. This article records the past few months’ practical experience. I’m confident the core insights will remain valid for some time, but specific practices may need constant adjustment. Looking back a year from now, some views may be outdated, new best practices may have emerged. But the spirit of exploration won’t change: continuous validation, staying open, not being bound by old paradigms, willing to overturn our own previous conclusions.
Intent is the source code. Code is just the compiled artifact.
