xTerradon / hcaptcha-solver

Automated hCaptcha solver using binary image classification networks
https://pypi.org/project/hcaptcha-solver/
21 stars 6 forks

V1 needs more training(data). #17

Open ludwig7685 opened 11 months ago

ludwig7685 commented 11 months ago

If I understand it correctly, a website using hCaptcha can be set to one of three difficulty modes, plus Auto:
• Auto
• Easy
• Moderate
• Difficult
(screenshot: chrome_screenshot_1692459771513)

Also, not all hCaptchas get solved. Some, like the one in the screenshot below or the ones with meerkats, don't get solved. I think that's because the demo.hcaptcha website only shows old hCaptchas. (screenshot: chrome_screenshot_1692459589486)

I would set up a VM that runs 24/7 for about one month to scrape more V1 hCaptchas for the project. It would be completely free (Microsoft Azure). ↳ It could also "scrape" up-to-date hCaptchas from discord.com/registry (on the registration page you need to solve one to register, and those are up-to-date hCaptchas).

Also, wouldn't it be easier to collect hCaptchas with our own website? With GitHub Pages that should be easy to set up, right? ↳ With the 3 different difficulties, i.e. 3 websites. ↳ With a free hCaptcha account you can serve 1 million hCaptchas? I'm not really sure about the number, but it's big. Let me check.

EDIT: Their website says: "Free up to one million requests per month". That means one account can scrape 1 million hCaptchas per month, and once that limit is reached, no more until the month ends. But setting up a new hcaptcha.com account is so easy that reaching the 1 million is the harder part.
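
To put the quota in perspective, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and that each collected challenge counts as exactly one request. Both are my assumptions, not something hCaptcha states, and the names are made up for illustration.

```python
# Rough arithmetic for the free-plan quota quoted above.
MONTHLY_QUOTA = 1_000_000  # "Free up to one million requests per month"
DAYS_PER_MONTH = 30        # assumption: a 30-day month


def sustained_rate(quota: int = MONTHLY_QUOTA, days: int = DAYS_PER_MONTH) -> dict:
    """Return the steady request rate needed to use up `quota` in `days` days.

    Assumes one collected challenge equals one request (unverified).
    """
    per_day = quota / days
    per_hour = per_day / 24
    per_second = per_hour / 3600
    return {"per_day": per_day, "per_hour": per_hour, "per_second": per_second}


rates = sustained_rate()
print(f"{rates['per_day']:.0f} per day, "
      f"{rates['per_hour']:.0f} per hour, "
      f"{rates['per_second']:.2f} per second")
```

Under those assumptions a single collector would have to fetch a challenge roughly every 2.6 seconds around the clock to hit the limit, so one account is probably enough for a long time.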

Shaeikh commented 11 months ago

There are many captchas for different regions (I think) and for different sitekeys and so on: https://accounts.hcaptcha.com/demo?sitekey=60a46f6a-e214-4aa8-b4df-4386e68dfde4

Check its challenges, you might find new ones.

ludwig7685 commented 11 months ago

Sitekeys: the hCaptcha challenge behind the link you provided is set to the difficulty "Difficult".

I actually get the same hCaptchas on my own website (see my newest issue, last link; difficulty: Difficult).

So my conclusion is: I don't think hCaptchas are generated individually for every sitekey.

Regions: that could be possible!

When setting up a new account at hcaptcha.com, I saw that you can set a region, but not all regions worked for me, so I set the region to United States for all the websites I created for gathering new data.

Also, there are different plans at hcaptcha.com, and I don't know whether the biggest plan for companies has its own hCaptchas, separate from those of the free accounts.

Shaeikh commented 11 months ago

The paid plan is the Enterprise plan, which is used by big companies. Captcha-solving companies charge more for solving Enterprise-plan captchas.

xTerradon commented 11 months ago

Very good points you bring up. One big issue with this project is that it relies on data it has seen before, so whenever a new captcha is introduced we need the latest data in order to solve it. My idea was to just check in on the project every couple of days and let the collector run for a bit, but I think that's not really sustainable over the span of months or even years.
Doing an automated VM for scraping would be a wonderful idea! The code for scraping is available in the repo, so feel free to try out whatever you want with that. The labeling would still have to be done manually, but that does not really take long. Normally a model performs reasonably well with just a couple hundred labeled images.

Regarding the difficulties: I have not really found out what the difference between them is. In their official docs they simply state that they get harder to solve, which does not really tell us anything. But I agree that by using captchas from different difficulties, regions, and sitekeys we can get reasonable coverage of all possible captcha types to train on. We should definitely keep that in mind when setting up an automated collection system.

Shaeikh commented 11 months ago

I will try to automate it on a VM/VPS and tell you if there is a problem.

As for the labelling: I have labelled 1k images for different models, and the average time it took was 10 min (meaning a few hundred per 10 min, as that includes not only target images).