zachblume / autospec

Autospec is an open-source AI agent that takes a web app URL and autonomously QAs it, and saves its passing specs as E2E test code
https://autospec.dev
MIT License

Generate Playwright tests based on instructions to AI #17

Closed: perryraskin closed this issue 3 months ago

perryraskin commented 3 months ago

It would be amazing to be able to reference a text file as instructions to give the AI to test. Then, based on these instructions, the AI generates actual Playwright code and does the testing!

zachblume commented 3 months ago

The direction that this project originally set out from is an agent that has the opportunity to interactively step through testing with visual and DOM feedback, rather than generating test code ahead of runtime.

@craigmulligan mentioned that he had also considered e2e test generation from scratch and that it led him to this project. My original thoughts were actually more along the lines of generating Playwright from prod/dev traffic, saving it as Playwright code, and then using AI to deduplicate those sessions to produce a consistent battery (the deduplication step seems like an unavoidable part of any generated or agent-based testing approach).

In any case, I don't think this project's purpose is over-determined, but it did set out from the viewpoint of an agent that gets to interactively step through QA the way coding agents get to interactively step through coding. This adds real-world feedback at each step to the model's decision making, rather than asking it to go from 0 to 100 in one shot. Humans don't generally write test code without having stepped through the flow in a browser, and my sense is that this says something about how hard it is to reason across the entire codebase/stack well enough to write a test up front (I think this is very similar to the question of who writes code without a compiler/linter/manual testing, which is the question coding agents seem to be answering).

I've opened #18 and #19, which I think capture some of your suggestion. Instead of proceeding directly from spec list -> generated code, they describe using a provided spec list -> to run the agent (skipping the discovery phase) -> and caching successful runs as Playwright code, which would let the agent skip redundant runs of those specific specs on future commits.
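Very roughly, the flow I have in mind looks something like this; a minimal sketch with illustrative names only, not the actual autospec internals:

```typescript
import { createHash } from "node:crypto";
import { existsSync } from "node:fs";
import { writeFile } from "node:fs/promises";

// Shape of a single agent run: did the spec pass, and what Playwright code
// reproduces it? (Illustrative only.)
type SpecRun = { passed: boolean; playwrightCode: string };

// For each provided spec, reuse a previously cached Playwright file if one
// exists; otherwise run the agent and, on success, cache the code it produced
// so future commits can skip the agent for that spec.
async function runSpecList(
  url: string,
  specs: string[],
  cacheDir: string,
  runAgentOnSpec: (url: string, spec: string) => Promise<SpecRun>, // hypothetical agent entry point
) {
  for (const spec of specs) {
    const key = createHash("sha256").update(spec).digest("hex").slice(0, 12);
    const cachedPath = `${cacheDir}/${key}.spec.ts`;
    if (existsSync(cachedPath)) continue; // already covered by generated Playwright code
    const run = await runAgentOnSpec(url, spec);
    if (run.passed) await writeFile(cachedPath, run.playwrightCode);
  }
}
```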

I'll leave this issue open. I think my questions for future thought are:

perryraskin commented 3 months ago

I'm not entirely sure but I think @lost-pixel has something in this arena that is open source.

I'm very new to testing, so I'm not sure I will have the best answer for you. I was very impressed once I ran this project using one of our project URLs at @coverdash. However, we are really looking for something that will help us write actual testing code that we can maintain and modify as we need. To me, that seems a lot better than an agent simply looking at a UI and guessing what steps should be taken.

But perhaps I'm misunderstanding how helpful this can be as an AI Agent, and I'm happy to learn more!

craigmulligan commented 3 months ago

My initial thought was that you could provide a prompt with the specs or user flows you'd like the agent to test and then have it carry those out by way of the Playwright API. But looking at autospec, I was pretty impressed by the generated specs.

I still think there may be a use case for user-supplied specs, but perhaps in conjunction with the agent generating some of its own.

Importantly, I don't think the agent should just generate Playwright code that you later execute. E2E tests are notoriously flaky, either due to changes in the UI or weird timing issues. Having an agent work through these issues will, I suspect, make them more robust, in the same way that a human QA would be compared to a hard-coded test suite.

There are some optimizations you could make, like caching the agent's responses; that way you only rely on the LLM when there is a change or timing issue in your UI, and you can then leave it to the agent to try to make its way through the rest of the test case. And of course you get automatic testing for new features without potentially having to do anything.
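A rough sketch of what I mean, with purely illustrative names (not the actual autospec API): replay cached actions while they still work, and only fall back to the LLM when one of them breaks.

```typescript
// A cached step is just the action the agent chose last time for that step.
type CachedAction = { kind: "click" | "type"; selector: string; text?: string };

// Replay the cached action for each step; if it fails (e.g. the UI changed or
// a selector no longer matches), ask the agent/LLM for a fresh action and
// update the cache so later runs stay cheap again.
async function runStep(
  stepDescription: string,
  cache: Map<string, CachedAction>,
  replay: (action: CachedAction) => Promise<boolean>, // hypothetical executor
  askAgent: (step: string) => Promise<CachedAction>,  // hypothetical LLM call
) {
  const cached = cache.get(stepDescription);
  if (cached && (await replay(cached))) return;       // cache hit: no LLM needed
  const fresh = await askAgent(stepDescription);      // cache miss or broken step
  if (await replay(fresh)) cache.set(stepDescription, fresh);
}
```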

Of course there are lots of other things to consider, but if you assume that a human is more reliable at testing a feature, then AI agents seem like a good approach for this sort of testing.

zachblume commented 3 months ago

> I'm not entirely sure but I think @lost-pixel has something in this arena that is open source.
>
> I'm very new to testing, so I'm not sure I will have the best answer for you. I was very impressed once I ran this project using one of our project URLs at @coverdash. However, we are really looking for something that will help us write actual testing code that we can maintain and modify as we need. To me, that seems a lot better than an agent simply looking at a UI and guessing what steps should be taken.
>
> But perhaps I'm misunderstanding how helpful this can be as an AI Agent, and I'm happy to learn more!

You're not misunderstanding! In fact, I think you are perhaps the first person to run it on a production URL :) -- I hadn't bothered yet because I was very focused on just getting it running in a laboratory setting on very simple examples like TodoMVC. If you would be down to share the logging output of the agent here (if you're concerned about accidentally sharing anything internal, perhaps review it thoroughly before posting), that'd be really useful.

Restating how I see this convo:

@perryraskin It would be great to see any of those logs, and I'm very happy to have you take a stab at a PR, or to do some prep work to make that a more accessible task to take on (the tickets I made were an attempt at scoping things down into bite-sized chunks for others to contribute to).

perryraskin commented 3 months ago

Sure thing, here's the log! combined.log

zachblume commented 3 months ago

Thanks for posting! I just closed #20 with -> #22

That should make progress on the other issues easier

zachblume commented 3 months ago

@perryraskin The run now ends by outputting a Playwright file (#30, closing #18).
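To illustrate the shape of the output only (this is not actual autospec output, and the URL and selectors are made up), a saved spec ends up as an ordinary Playwright test along these lines:

```typescript
import { test, expect } from "@playwright/test";

test("Get started button advances the form", async ({ page }) => {
  await page.goto("https://example.com/");                         // placeholder URL
  await page.getByRole("button", { name: "Get started" }).click(); // action the agent took
  await expect(page.getByText("Step 2")).toBeVisible();            // assertion the agent verified
});
```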

zachblume commented 3 months ago

@perryraskin Okay! #32 closes #19. You should now be able to do a full loop of reading a JSON array of specs, executing them, and saving them as Playwright code using npx autospecai. To see the flags and config, run npx autospecai --help. Let me know if you can get things working, and I'll wait for your feedback before closing this issue.
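The exact flag names are in npx autospecai --help, so treat this purely as an illustration of the shape of the input: the spec file is just a JSON array of plain-English specs, e.g.:

```json
[
  "Users can click the Get started button on the landing page to reach step 2 of the form",
  "Entering an invalid email on the email step shows a visible error message"
]
```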

perryraskin commented 3 months ago

Damn @zachblume you work fast! I will try this week and report back 🙌

perryraskin commented 3 months ago

Oh this is great, it looks like it worked! I see the test file and it makes sense.

Again, this might be me not knowing much about testing or how much you've accomplished in this repo, but:

How might I go about generating useful tests using autospec for a super complex dynamic form? For instance, the test I did was for the first page of the form, where it just has a title and a "Get started" button. So, assuming this form has a ton of logic and tons of different paths, how might I use this to make a test for a specific path? (And then for all the paths, too.)

For some background on this form: it's a single page React app, so I can't run autospec a bunch of times providing it different URLs. I would need to somehow tell it to click/type certain things in order so that the UI shows the next screen, and then run autospec again. Or something? 😄

zachblume commented 3 months ago

As described above, the agent does not go Specification -> Test code -> Run, but instead works through a decision-making loop with real feedback from the application. So I imagine you might provide it with a spec like:

> Users should be able to input valid emails, but invalid emails should show an error to the user in the multi-step form. The email input is on the third page of the form, you'll need to fill out the previous steps with sensible inputs to proceed to the point where you find an email field. This form is located on the page titled "form", you can find it in the navbar on the left.

You could also format it more formally if you like, and I bet the performance would be similar:

> Go to the form page.
> Type out and select sensible inputs to proceed to the next page.
> Now you're on step 2; again type out and select sensible inputs to proceed to the next page.
> Now you're on step 3; an invalid email entered into the email input should raise a visible error to the user.

Things are fairly agnostic right now; the agent is following this prompt:

...

  1. After the mapping and description phase, you'll be provided a spec that you wrote to focus on specifically, one at a time. You'll begin a loop executing actions in order to fulfill the spec. On each turn, you'll be provided a screenshot, a HTML dump, and the current mouse cursor position and other metadata. ...
  2. You have an API of actions you can take: type Action = { ... }

So there's no rigid input syntax required for instructing the agent, nor a rule-based system standing between the spec you write and the agent's execution.
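(Purely as a hypothetical illustration of the shape of that action API, and not the actual autospec Action type, which is elided above, something like this discriminated union would fit the description:)

```typescript
// Hypothetical sketch only; the real Action type in the prompt is elided above.
type Action =
  | { kind: "click"; selector: string }
  | { kind: "type"; selector: string; text: string }
  | { kind: "navigate"; url: string }
  | { kind: "markSpecPassed" }
  | { kind: "markSpecFailed"; reason: string };
```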

If you provide a bunch of specs through that input file, they will be executed in parallel. Every spec begins at the root URL and proceeds from there, so we may need to provide additional hints so the agent understands it may need to navigate around the site to get to the thing being discussed. Another deficiency of this file-input mode is that it skips the discovery phase, which maps out the site by clicking all the links and gives the agent a sitemap to help with this kind of decision-making.

Right now I'm focused on building out a web UI on top of the agent, and refactoring as I go to support that (e.g., accepting more human-in-the-loop intervention, running in Lambda, etc.), which I think will align with the kinds of questions you're asking. I'd love for you to keep trying this out and potentially use the web UI once it's looking a bit better. The other priority is building a good benchmark, and more feedback like the kind you're giving will help make that benchmark accurate.

zachblume commented 3 months ago

@perryraskin Additionally, I'm going to mark this issue as closed since we now write out Playwright tests. Please open more issues or contribute however you see fit!