Chat-Oriented Programming (CHOP) is an evolving paradigm where you use a chat-based LLM interface to generate code instead of writing it yourself.

I’ve been experimenting with using CHOP to accelerate my work on side projects, and wanted to share some of the results of my testing.

Specifically, I’ll test the following VS Code extensions that support CHOP:

  1. GitHub Copilot
  2. Cline
  3. Cody
  4. Deep-Cody (Cody using the Deep Cody agentic model)

I’ll be comparing their performance on the following tasks:

  1. Writing tests for existing code.
  2. Refactoring existing code.
  3. Adding a small feature (w/tests).
  4. Adding a large feature (w/tests).

All the extensions are configured to use the Claude 3.5 Sonnet model. Theoretically, the only difference is in the context/prompts that the extensions provide.

All these extensions are evolving quickly. This should be seen as just a snapshot of their behavior at this point in time. Expect improvements in the coming months.

There are two broad approaches to CHOP:

  1. Using a tool that’s integrated with your text editor / IDE.
  2. Using a standalone chat interface (e.g. ChatGPT or Claude web UI) and copy/pasting code back and forth.

In this post, I’m focused on the former, which feels more “autonomous” to me (the tool can actually update your code files with a single click).

Some folks prefer the latter because it gives more explicit control over the prompt and context. On the flip side, it forces you to carefully manage your prompt and context! 🙃

Prompts Used

I’m testing these prompts on the codebase for a React-based “clicker-style” game I’m working on. The codebase is relatively small (about 2000 SLOC).

Also, these prompts are not “optimized” for best results. I just wrote what seemed like a decent prompt and rolled with it, the same way I normally do CHOP.

Prompt for writing tests:

Write tests for the tick function. Include tests about what happens when there are tasks in the taskQueue. Don’t test anything related to debugMode.

I included the src/game/index.ts file (where the tick() function is) in the context for this test.
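For a sense of what I was after, here’s a minimal sketch of the kind of test I’d consider a success. It assumes a Vitest-style runner; the state fields and the createInitialState helper are hypothetical stand-ins, not the real exports of src/game/index.ts.

```ts
// Illustrative sketch only -- assumes a Vitest-style runner; the state shape
// and createInitialState helper are invented, not the game's real exports.
import { describe, expect, it } from "vitest";
import { createInitialState, tick } from "../src/game";

describe("tick", () => {
  it("advances time by one step when the taskQueue is empty", () => {
    const state = createInitialState();
    const next = tick(state);
    expect(next.time).toBe(state.time + 1);
  });

  it("runs the queued task and removes it from the taskQueue", () => {
    const state = {
      ...createInitialState(),
      taskQueue: [{ action: "forage", remainingTicks: 1 }],
    };
    const next = tick(state);
    expect(next.taskQueue).toHaveLength(0);
    expect(next.resources.food).toBeGreaterThan(state.resources.food);
  });
});
```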


Prompt for refactoring:

Extract action-related code, and action definitions, from the src/game/index.ts file and move them to a src/game/actions.ts file. Each action should define a “tick” function that has the code to run inside the main tick function when that action is in the taskQueue. Also update all relevant imports to reference the new file.

I included the whole workspace in context for this test (except for Cline, which chooses files to read automatically, and Deep-Cody, which builds the context with code search).
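For reference, the structure the prompt is asking for would look roughly like the sketch below. The ActionDefinition shape, the GameState import, and the resource fields are all hypothetical; the real codebase almost certainly differs.

```ts
// Hypothetical target shape for src/game/actions.ts -- names and fields are
// illustrative guesses, not the real code.
import type { GameState } from "./index";

export interface ActionDefinition {
  id: string;
  // Run by the main tick() when this action is at the front of the taskQueue.
  tick: (state: GameState) => GameState;
}

export const actions: Record<string, ActionDefinition> = {
  forage: {
    id: "forage",
    tick: (state) => ({
      ...state,
      resources: { ...state.resources, food: state.resources.food + 1 },
    }),
  },
  chopWood: {
    id: "chopWood",
    tick: (state) => ({
      ...state,
      resources: { ...state.resources, wood: state.resources.wood + 1 },
    }),
  },
};
```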


Prompt for adding a small feature:

Add a new facility called “Wood shed”. This facility should allow the user to store 100 more of the wood resource per wood shed built. It should be unlocked by a new “Wood shed” research that requires the “Construction I” research to be unlocked. Update the tests for the tick function to verify the behavior of the new facility and research.

I included the whole workspace in context for this test (except for Cline, which chooses files to read automatically, and Deep-Cody, which builds the context with code search).
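To give a sense of scale, the change amounts to two small data definitions plus a storage-cap tweak, roughly like the sketch below. The field names are guesses at the codebase’s facility/research shape, not the actual definitions.

```ts
// Illustrative only -- field names are guesses at the game's facility and
// research shapes, not the real definitions.
export const woodShedFacility = {
  id: "woodShed",
  name: "Wood shed",
  // Each wood shed built raises the wood storage cap by 100.
  storageBonus: { wood: 100 },
  requiredResearch: "woodShed",
};

export const woodShedResearch = {
  id: "woodShed",
  name: "Wood shed",
  // Only becomes available once "Construction I" has been researched.
  requires: ["constructionI"],
};
```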


Prompt for adding a large feature:

Introduce the idea of seasons to the game. There should be four seasons (spring, summer, fall, winter), each lasting 30 days. The results of the foraging action should be higher in the spring, and lower in the fall, and even lower in the winter. Also, the player should have a temperature that goes down in the winter, and the player dies when it gets too low. The player’s temperature should not go down if they’ve built a Shelter facility. Include tests for all of the new functionality.

I included the whole workspace in context for this test (except for Cline, which chooses files to read automatically, and Deep-Cody, which builds the context with code search).
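Again, purely to illustrate the expected scope: the core of the feature boils down to a couple of small helpers like these. The constants, multipliers, and state fields are invented for the sketch, not taken from the real game.

```ts
// Invented constants and helpers to illustrate the feature's scope -- the real
// implementation would hang off the game's existing state and tick() loop.
export type Season = "spring" | "summer" | "fall" | "winter";

const SEASONS: Season[] = ["spring", "summer", "fall", "winter"];
const SEASON_LENGTH_DAYS = 30;

// Foraging yield: higher in spring, lower in fall, lowest in winter.
export const FORAGE_MULTIPLIER: Record<Season, number> = {
  spring: 1.5,
  summer: 1.0,
  fall: 0.75,
  winter: 0.5,
};

export function seasonForDay(day: number): Season {
  const index = Math.floor(day / SEASON_LENGTH_DAYS) % SEASONS.length;
  return SEASONS[index];
}

export function nextTemperature(
  current: number,
  season: Season,
  hasShelter: boolean
): number {
  // Temperature only drops in winter, and a Shelter facility prevents the drop.
  return season === "winter" && !hasShelter ? current - 1 : current;
}
```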

Results

The results of the tests are summarized in the table below.

Legend:

  • ❌ = Did not succeed at task before I gave up on it.
  • ✔ = Succeeded automatically.
  • ✏ = Needed some adjustment.
  • ✏✏ = Needed a lot of adjustment.

| Task | Copilot | Cline | Cody | Deep-Cody |
| --- | --- | --- | --- | --- |
| Writing tests | ✏ (12 tests; needed test results to be copy/pasted manually) | ✔ (19 tests) | ✏✏ (10 tests; needed file to be moved; needed many chunks to be copy/pasted manually) | ✏ (9 tests; needed test results to be copy/pasted manually) |
| Refactoring logic | ✏ (didn’t update imports in files that weren’t open) | | ❌ (unable to produce code that compiles; unable to fix problems) | ❌ (unable to produce code that compiles; unable to fix problems) |
| Small feature | | | ❌ (generated very wrong code and didn’t put it in the right place, either) | ✏ (the suggested code was good, but had to manually fix some diff issues when applying) |
| Large feature | ✏✏ (produces code that only needs minor adjustments, fixed failing tests when prompted, but needed interpretation) | ✔ | ✏✏ (tests created in wrong file, multiple test failures that couldn’t be automatically fixed, generated invalid JS) | ✏✏ (some logic was mangled, tests created in wrong file, needed significant massaging) |

Cline and Copilot succeeded on the most tasks. However, Copilot required careful management of the context to succeed, sometimes needing the whole workspace loaded into the context for good results, whereas Cline builds its context automatically with its agentic functionality.

Also, Cline did impressively well at implementing the large feature without any adjustment required.

Deep-Cody has good UX and seems very promising, but ultimately fell slightly short.

Overall, Cline was the most pleasant to work with, exemplifying the passive nature of CHOP — just write what you want to do, click Approve a bunch of times, read the diff, and be done.

Additional Notes/Comparisons

Some more detailed comparisons and points of interest:

Building Context

Both Cline and Deep-Cody have the ability to automatically decide what parts of the project should be added to the context, whereas Copilot and Cody require the user to manually choose which files to add, and produce bad results if relevant information isn’t included.

Copilot requires the user to manually add relevant files to the context. Alternatively, it works fairly well if you just drop @workspace into the prompt to add the whole workspace to the context, but it’s quite slow.
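For example, prefixing the test-writing prompt from earlier with @workspace pulls the whole workspace in as context:

```
@workspace Write tests for the tick function. Include tests about what happens when there are tasks in the taskQueue. Don’t test anything related to debugMode.
```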

Cline asks permission to read whole files that it thinks might be relevant, explaining why it wants access to each file.

Cody requires the user to manually add relevant files to the context.

Deep-Cody uses a code search to decide what chunks of files should be added to the context. This seems to work fairly well, but not quite as well as Cline’s approach. I believe it’s because Deep-Cody extracts specific, relevant blocks of code only, whereas Cline reads whole files, which provides more, well, context.

User experience

Copilot has a very polished-looking interface, but requires more effort from the user when actually applying the suggested changes. The user has to make sure to open the right file before applying changes and accept each “chunk” individually, and Copilot will not automatically name or save newly created files. This adds a lot of overhead for more complicated changes.

Although Cline has the least refined-looking interface, it actually tends to be the most intuitive and requires the fewest clicks. You pretty much just click “Approve” over and over again, and then it’s done.

Cody and Deep-Cody have the same interface. They do better than Copilot in that you only need to click Apply once per file change, without needing to open that file first. Similar to Cline, you just need to scroll through and click Apply a bunch of times. But they seem to struggle a bit with creating new files; the new files frequently have the wrong name or folder path, requiring the user to create or move them manually. I’m not sure if that’s a widespread problem or just something with my particular codebase.

Self-correction

Copilot and (non-Deep) Cody do not include any agentic capabilities or self-correction that I can see.

Cline is the only one of the tested extensions that will automatically self-correct out of the box. It “sees” and addresses Problems in VS Code — like TS compile errors — automatically, and it also automatically runs and fixes tests. You can configure whether approval is required for each step.

Deep-Cody might get there as its agentic capabilities improve, but in the meantime you still need to copy/paste test results and manually select/resolve problems.

Notably, I kept terminal permissions disabled for Deep-Cody: enabling them requires an experimental flag, and it’s not documented whether I’d get to see/approve commands before they’re run. Perhaps with terminal permissions enabled, Deep-Cody would run tests automatically too.

Pricing / API access

Copilot and Cody (including Deep Cody) both have a limited number of available models and operate on a subscription basis (a fixed monthly rate for a “Pro plan” with unlimited queries).

The Pro plan for Copilot is $10/mo, and for Cody is $9/mo.

Cline can use any compatible backend (including locally run models); instead of offering a subscription, it requires an API key for a third-party service.

When used with the Anthropic API (Claude 3.5 Sonnet), you are charged per API request. The four tests above cost $0.19, $0.20, $0.17, and $0.43 respectively, for a total of $0.99 across all four tests.