We have recently contributed to a research study investigating how AI can help with realistic software development tasks. METR initiated this study to measure how AI tools affect real-world software engineering productivity, particularly in substantial open-source projects. In particular, the study was designed to assess how experts can use AI tools to improve their workflows. It’s all well and good if the latest model can fix an artificial test case, but what’s more interesting is how AI can be driven by expert knowledge.
The study required us to work on around 20 small, everyday development tasks. These tasks were randomised: in some, we were allowed to use AI tools, and in others, we were not. Apart from that, we could solve them however we wanted. We compiled notes on how we approached using the tools and recorded our screens to provide a record of our experience.
Sam used his hours to work on GHC tickets, and I (Matt) used mine to work on Cabal tickets. I have included a list of the issues we worked on at the end of the post. Sam focused on fixing small bugs, while I fixed all the known regressions in the `cabal-install-3.14` release.
In this post, I’ll briefly discuss how we used the tools and what our overall experience was.
Haskell Programming with the Help of AI
Before this study, neither of us was experienced with using AI tools to help with software development. I was impressed that the models could interact with Haskell code at all. At the start, it was quite overwhelming trying to understand what was available and what the trade-offs were between different tools. The AI landscape is changing rapidly at the moment; there is a new model or tool every week. Therefore, I won’t go into too much detail about the specific models or tools we used, but rather focus on our findings and experiences.
Development Environment
For the study, we were primarily using the following models and tools:
- The text editor Cursor with AI autocomplete. Cursor is a fork of VSCode with AI-related features. In the version of Cursor we used (0.45), there were two modes: the “chat” mode, which does not directly edit your files, and the “compose” mode, which does.
- From within Cursor, the LLM `claude-3.7-sonnet-thinking` for the “chat” and “compose” features.
- The standard ChatGPT 4o model from the web interface.
- Haskell Language Server
Using an editor with integrated LLM support, in particular one that supports Haskell Language Server, is key to getting the most out of the AI tools:
- Within the editor, the LLM has access to relevant context for the task. This includes any files we pass to the model explicitly, but also the rest of the codebase, which the LLM can search.
- When an LLM suggests a change, it receives feedback from HLS, which allows it to fix issues (e.g. adding missing imports, resolving typechecker errors, etc.). In practice, this made LLMs much more autonomous and reliable.
We didn’t use anything more complicated or novel, such as the Model Context Protocol or more advanced reasoning models like ChatGPT o1.
Armed with these tools, we were ready to set about our task.
Architectural Understanding Tasks
For the AI-enabled tasks, we were encouraged to use the AI as much as possible. Therefore, I typically started by just giving the AI a link to the GitHub issue and asking it to explain to me what needed to be done. The summary was useful for checking that I had understood correctly, and hearing the problem phrased differently was a good sanity check before starting work on the issue.
Asking specific questions about the codebase had more mixed results. In general, the AI could usually give plausible answers to understanding tasks, but they were often wrong in some subtle way. The AI is also very suggestible: it will happily agree with whatever you state you think the solution is.
My impression for architectural understanding tasks was that you would have to provide a summary document as context for the AI to answer questions more accurately.
Technology Understanding Tasks
For tasks that required me to understand something new or unfamiliar, the AI was very good. In one issue, I had to investigate something wrong with the GitHub CI setup, which was an area I was quite clueless about. ChatGPT was able to suggest the probable cause of the issue with minimal prompting and just the issue description for context. That certainly saved a lot of time.
The ability to generate ad-hoc scripts for particular tasks was also very useful. I generated several useful single-use bash and Python scripts for extracting specific pieces of information from the codebase. These scripts can also be used to generate information to feed back into the prompt, which creates a useful feedback loop.
Code Generation Tasks
Once the AI had demonstrated to me that it understood the problem, I would ask it to generate a solution. The AI could generate plausible, syntactically correct code, but it was often the wrong idea. I think this was the biggest waste of time. Once a solution was generated, it was quite tempting to just “fix” what was wrong with it, but more often than not, the architecture or design was wrong. Many fixes in a codebase like Cabal require changing a few lines very precisely; that’s not something the AI is good at doing on its own at the moment.
On the other hand, if you are precise with your prompts and set the correct context, the AI can save a lot of time by generating specific definitions for you. I would often use it to generate routine instances, simple definitions, or other well-defined pieces of code. It normally got these correct, which surprised me.
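To give a concrete flavour, here is the kind of routine definition I mean. The record type below is made up for illustration; it stands in for the sort of data type that accumulates boilerplate instances in a codebase like Cabal.

```haskell
module Example where

import Control.DeepSeq (NFData (..))

-- A hypothetical record type, standing in for a real Cabal data structure.
data ReplOptions = ReplOptions
  { replTargets :: [String]
  , replVerbose :: Bool
  }

-- The sort of routine, mechanical instance the model reliably produced:
-- force every field to normal form, nothing clever.
instance NFData ReplOptions where
  rnf (ReplOptions targets verbose) = rnf targets `seq` rnf verbose
```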
Generating test cases was also a good use of the AI. It was able to handle generating the right structure for the custom Cabal test framework. These invariably required some tweaking, but getting all the right files in place made it a much simpler task.
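As a rough sketch of what that structure involves (assuming the cabal-testsuite conventions at the time of writing), a test is a small directory containing a minimal package together with a driver script along these lines; the directory contents and assertion text here are placeholders.

```haskell
-- cabal.test.hs, placed in a new directory under cabal-testsuite/PackageTests/
-- alongside a minimal .cabal file and source module reproducing the scenario.
import Test.Cabal.Prelude

main = cabalTest $ do
  -- Build the reproducer and check the output for the expected diagnostic.
  res <- cabal' "v2-build" ["all"]
  assertOutputContains "expected warning text" res
```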
The final approach to improve generation tasks is to first converse with the “chat” interface to clarify the problem, discuss different parts of the design, and point out any issues. Once this context is established, you can ask the “chat” window to generate a prompt for the “compose” window. The resulting prompt gives “compose” precise instructions from the start, and it can be edited further if something is not quite right.
Documentation Tasks
Opinion was split between us about how useful the LLMs were for documentation tasks.
I thought that this was a strong point in favour of using LLMs. Often when working on an issue, you end up having to explain the same thing several different times. First, you explain precisely to the machine what your intent is with the code you write. Then you explain to a developer in the comments and commit message. Finally, you explain to the user in the changelog and documentation. Each of these requires writing in a slightly different place with a slightly different focus. I found that I was much more inclined to include all these different parts when using the AI, since it could do a good job generating the necessary files without requiring too much further editing.
The code changes themselves, along with the context developed in “chat,” were normally enough to be able to generate the commit message, changelog, and documentation updates with very little effort.
On the other hand, the suggestions weren’t to Sam’s taste. He thought that the style generated for commit messages was rambly and indirect: the model might focus on explaining a small detail rather than giving a bigger-picture overview. For the more complicated code in GHC, the explanation was a vague transcription of the code rather than a summary of the higher-level ideas a reader would want to know.
He felt similarly when it came to writing Notes, a developer documentation artifact common in GHC development: the LLMs would “get stuck” explaining details of the code rather than the bigger picture. He did have some success with commit messages: the LLMs were good at summarising which functions and parts of the code were modified, which gave a good starting point for structuring the necessary explanations.
It’s interesting that we had different experiences in this area; perhaps it was due to the difference in codebase, or a difference in our style of using the models. People often struggle to write commit messages or documentation, and I think using LLMs can reduce the barrier to entry here. A human-crafted commit message is often much better than one generated by a model, but I would much prefer a commit message generated by an LLM to none at all.
Verification Tasks
Another interesting use case is to use the AI to perform ad-hoc verification tasks. For example, I used the AI to check that all `NFData` instances had a certain structure. For this, I first worked with the AI to generate a script to extract all the code for `NFData` instances from the codebase. This required a small amount of debugging, but it would have taken me several hours to write the `awk` script myself due to my unfamiliarity with the language. Once I had the script, I extracted all the `NFData` instances and asked ChatGPT to check that they all had the correct structure. The instance-by-instance summary allowed me to quickly verify the AI’s answer, and it spotted a few missed cases that would have been very hard to catch by eye.
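The script itself was awk, but to give a flavour of the kind of throwaway extraction tool involved, here is a rough Haskell sketch of the same idea (not the script we actually used): it prints every `instance NFData` block from the files passed on the command line, assuming instance bodies are indented under their heads.

```haskell
module Main where

import Data.List (isPrefixOf)
import System.Environment (getArgs)

main :: IO ()
main = do
  files <- getArgs
  mapM_ (\f -> putStr . unlines . extract . lines =<< readFile f) files

-- Keep each instance head together with the indented lines that follow it.
extract :: [String] -> [String]
extract [] = []
extract (l : ls)
  | "instance NFData" `isPrefixOf` dropWhile (== ' ') l =
      let (body, rest) = span indented ls
       in l : body ++ "" : extract rest
  | otherwise = extract ls
  where
    indented x = null x || " " `isPrefixOf` x
```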
LLMs were also useful for diagnosing failing test cases. For example, Sam implemented a change to GHC which led to a few dozen failing tests. After being given relevant context about the change, the LLM was able to categorise the failing tests:
- Some test failures only involved minor output changes or improvements in error messages; these could simply be accepted.
- The LLM further categorised the serious test failures, e.g. “tests 1, 4 and 5 failed for one reason, while test 2 failed for another reason”.
This categorisation was useful for identifying potential issues with a change and quickly addressing them. It often happens in GHC development that a small change leads to hundreds of failing test cases, and it can be very time-consuming to go over all of them individually. Having an assistant that can quickly do a first pass at sorting the test failures is very helpful.
Of course, the answers given to you by an LLM must always be treated with suspicion. But in situations where 95% confidence is good enough, or when it is quick and easy to check the correctness of an answer, they can be very useful.
Refactoring Tasks
Using an LLM can be helpful for refactoring tasks that are routine and well-defined. In our experience, however, they tend to struggle with larger tasks or those requiring nuance.
For instance, the LLMs performed well when adding a new error message to Cabal’s diagnostic infrastructure. This kind of task requires modifying quite a few different places in a routine manner; there is not much code to add, and nothing to move around or delete. Similarly, for smaller tasks like lifting an expression to a top-level definition or adding debugging traces, the AI handled things reliably.
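To illustrate the shape of this kind of change (with entirely hypothetical names rather than Cabal’s actual diagnostics API), adding a new error usually means extending a data type and keeping a couple of functions in sync with it:

```haskell
module Diagnostics where

-- A hypothetical, much-simplified structured diagnostics module.
data BuildError
  = MissingDependency String
  | InvalidFlagCombination String String
  | UnsupportedCompiler String          -- the newly added diagnostic

errorCode :: BuildError -> Int
errorCode (MissingDependency _)        = 1001
errorCode (InvalidFlagCombination _ _) = 1002
errorCode (UnsupportedCompiler _)      = 1003   -- new case

errorMessage :: BuildError -> String
errorMessage (MissingDependency dep) =
  "Missing dependency: " ++ dep
errorMessage (InvalidFlagCombination f g) =
  "The flags " ++ f ++ " and " ++ g ++ " cannot be combined"
errorMessage (UnsupportedCompiler c) =          -- new case
  "The compiler " ++ c ++ " is not supported"
```

Each step is mechanical, which is exactly why this kind of task suits an LLM well.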
For repetitive refactoring tasks, such as renaming fields or parameters, the Cursor autocomplete is extremely useful. I could often just change the name of a field, navigate to the next type error, and the autocomplete would suggest the correct modification. I personally also found it useful that the AI liked to keep things consistent. Fields were named consistently, functions named consistently, etc. I find this task of making sure the whole API is consistent quite difficult to do manually.
Finally, I didn’t really try to use the AI for large refactoring tasks: when I did attempt one, there tended to be a lot of random or incomplete changes to the codebase, and subtle things would go wrong. Sam also reported that when working on more routine refactorings, there might be one or two places which required a decision to be made, and you could waste quite a lot of time if the LLM chose incorrectly. It would be useful if an LLM could indicate the places it modified with lower confidence.
Conclusion
Overall, I found the experience of using AI tools in my normal development workflow to be very useful, and I will continue to use them after the study. It’s clear to me that it is going to become essential to be familiar with these tools as a developer in the future.
Sam has a more negative outlook in comparison. While he found LLMs useful, he is concerned that increased use of LLMs will affect our shared ability to reason about our code. Using LLMs risks disincentivising thinking deeply about the design or architecture of our software, which increases the burden placed on reviewers and risks the community losing its shared understanding of how the codebase is supposed to operate.
We appreciate METR’s support in conducting this research, which has helped us better understand both the potential and limitations of AI-assisted development in the Haskell ecosystem.
If your company is interested in funding open-source work then we offer Haskell Ecosystem Support Packages to provide commercial users with support from Well-Typed’s experts, while investing in the Haskell community and its technical ecosystem.
Issues Fixed
Sam’s GHC Issues
Sam focused on fixing small bugs in the typechecker, together with a couple of bugfixes related to LLVM code generation.
Issue | Description | MR |
---|---|---|
#24035 | Address incorrect unused imports warning when using the `DuplicateRecordFields` extension. | !14066 |
#25778 | Fix oversight in the implementation of `NamedDefaults` in the typechecker. | !14075 |
#25529 | Stop caching `HasCallStack` constraints to resolve an inconsistency in callstacks in GHC 9.8. | !14084 |
#25807 | Document defaulting in the user’s guide. | !14057 (first commit) |
#23388 | Enhance documentation for the `ExtendedDefaultRules` and `OverloadedStrings` extensions. | !14057 (second commit) |
#25777 | Document the change in defaulting semantics caused by `NamedDefaults`. | !14072 + !14057 (third commit) |
#25825 | Improve defaulting of representational equalities. | !14100 |
#24027 | Address issues related to `type data` declarations and their import/export behaviour. | !14119 |
#23982 | Improve error messages for out-of-scope data constructors and types in terms. | !14122 (second commit) |
#22688 | Improve error messages when an instance is declared for something that isn’t a class. | !14105 |
#25770 | Fix segmentation fault occurring in the LLVM backend, due to a bug in FP register padding. | !14134 |
#25769 | Identify and fix the `AtomicFetch` test failure with the LLVM backend. | !14129 |
#25857 | Stop emitting incorrect duplicate export warnings when using `NamedDefaults`. | !14142 |
#25882 | Ensure `NamedDefaults` properly handles poly-kinded classes such as `Typeable`. | !14143 |
#25874 | Refactor: remove the `GhcHint` field from the `TcRnNotInScope` error message constructor. | !14150 |
#25848 | Refactor: split up `CtEvidence` into `CtGiven` and `CtWanted`. | !14139 |
#25877 | Preserve user-written module qualification in error messages. | !14122 (first commit) |
#25204 | Fix Windows binary distribution error when there are spaces in the installation directory path. | !14137 |
#25881 | Improve recompilation checking mechanisms when explicit import lists are used. | !14178 |
#24090 | Investigate bug with ill-scoped type synonyms; implement next step of a deprecation plan. | !14171 |
Matt’s Cabal Issues
I focused on fixing regressions in `cabal-3.14.1.0`, updating commands to use the project infrastructure, and fixing bugs in the multi-repl.
Issue | Description | PR |
---|---|---|
#2015 | `cabal repl` does not reflect changes in the PATH environment variable, causing inconsistencies in environment-dependent behaviours. | #10817 |
#10783 | Add a test to verify the fix implemented by the contributor in #10783. | #10783 |
#10295 | Enhance `cabal check` to reject packages containing invalid file names, improving package validation. | #10816 |
#10810 | Allow compatibility with containers-0.8 by updating package constraints. | #10814 |
#7504 | Implement `v2-gen-bounds`. | #10840 |
#10718 | Create a reproducer for the issue where Cabal 3.14.1.0 invokes test binaries with a corrupt (duplicated) environment variable list, aiding in debugging and resolution. | #10827 |
#10718 | Cabal 3.14.1.0 invokes test binaries with a corrupt (duplicated) environment variable list, causing test execution failures. | #10827 |
#10759 | Write a reproducer for the issue where cabal-install 3.14 linking fails with “shared object file not found”. | #10828 |
#10759 | cabal-install 3.14 linking fails with “shared object file not found”, indicating issues in locating shared libraries during the linking process. | #10828 |
#10717 | Regression in Cabal-3.14.1.0: `v1-test` and `Setup.hs test` cause test suites of alex-3.4.0.1 and happy-1.20.1.1 to be unable to find data files. | #10830 |
#10717 | Develop a test for the regression where Cabal-3.14.1.0’s `v1-test` and `Setup.hs test` cause test suites of alex-3.4.0.1 and happy-1.20.1.1 to be unable to find data files. | #10830 |
#10797 | Ensure that C++ environment variables are correctly passed to configure scripts, facilitating proper configuration of packages requiring C++ settings. | #10844 |
#8419 | Continuous Integration (CI) using GitHub Actions produces error annotations despite tests passing, leading to misleading CI results. | #10837 |
#7502 | Implement a per-version index cache in the ~/.cabal directory to improve performance and accuracy of package index operations. | #10848 |
#10744 | The fix-whitespace job produces overly verbose output, making it difficult to identify relevant information in logs. | #71 |
#10726 | The plan.json file generated by Cabal does not include the compiler ID, omitting crucial information about the compiler used in the build plan. | #10845 |
#10775 | Encountering “Error: Dependency on unbuildable library” when using three internal libraries with `--enable-multi-repl`, indicating issues in handling multiple internal dependencies. | #10841 |
#10818 | Unable to access the `__HADDOCK_VERSION__` macro during documentation generation, affecting conditional compilation based on Haddock version. | #10851 |
#8283 | Extend the `cabal outdated` command to support multi-package projects, allowing users to check for outdated dependencies across all packages in a project. | #10878 |
#10181 | The `cabal repl` command does not support renaming of re-exported modules when loading multiple components. This limitation leads to failures when one component re-exports a module from another with a different name. | #10880 |