ypwhs / captcha_break

Captcha recognition
MIT License
2.73k stars 684 forks

Any way to make captcha_break more generic? #6

Closed eromoe closed 7 years ago

eromoe commented 7 years ago

I found that it can only recognize captchas generated by the Python captcha library. It did not work when I gave it a different style of 4-character captcha (letters and digits). Since it is hard to mock up a captcha in a different style every time, I wonder whether there is a more generic way to do this.

What I can imagine is:

  1. Use many different fonts (simple, but only the specific font used by the target captcha would help).
  2. Some preprocessing, such as converting RGB to greyscale first, to handle reversed-out images (white characters on a colored background). But this would also lose some features, because some good captchas use a different color for each character, each with only high (or maybe not so high) contrast against the surrounding area.
  3. Use some deep learning method to learn characters from an unlabeled captcha dataset, then label the learned features/patterns as A~Z and 0~9, then use those to build a model. I think this way is best, but I have no idea how to start. I have heard that deconvolution or some clustering methods can generate such patterns, but I am not very familiar with these techniques.
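
The preprocessing idea in point 2 could be sketched roughly as follows. This is only an illustration, not part of captcha_break: the helper names `to_gray` and `normalize_polarity` are hypothetical, the image is a plain nested list rather than a real image object, and the rule "the border is background" is an assumption that will not hold for every captcha. The luma weights are the common ITU-R BT.601 ones.

```python
def to_gray(rgb_rows):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luma values (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def normalize_polarity(gray_rows):
    """Invert the image when its border is dark, so the background ends up light.

    Assumption: border pixels belong to the background, not to a character.
    """
    border = (gray_rows[0] + gray_rows[-1]
              + [row[0] for row in gray_rows]
              + [row[-1] for row in gray_rows])
    if sum(border) / len(border) < 128:
        return [[255 - value for value in row] for row in gray_rows]
    return gray_rows


# A tiny 3x3 "reversed-out" image: one white character pixel on a dark blue background.
dark_bg, white = (0, 0, 128), (255, 255, 255)
image = [[dark_bg, dark_bg, dark_bg],
         [dark_bg, white, dark_bg],
         [dark_bg, dark_bg, dark_bg]]
normalized = normalize_polarity(to_gray(image))  # dark character on a light background
```

As noted in point 2, the per-character color information is gone after this step, so it only helps when contrast alone is enough to separate characters from the background.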

Could you give me some tips?

ypwhs commented 7 years ago

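Whatever captcha you want to recognize, prepare about 100,000 samples of it and train this model. If you want to build a general-purpose captcha recognizer, you need to prepare many kinds of captchas and then train on them. I tried Tencent's captcha, and with 100,000 samples the accuracy easily exceeded 90%.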
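
To make "prepare labeled samples and train" a little more concrete, here is a minimal sketch of how a labeled captcha string might be turned into training targets and how predictions might be scored. The 36-symbol alphabet, the fixed length of 4, and all function names are assumptions for illustration, and whether the accuracy quoted above is per captcha or per character is not stated in this thread.

```python
import string

CHARSET = string.digits + string.ascii_uppercase  # assumed alphabet: 0-9 and A-Z
CAPTCHA_LEN = 4                                   # assumed captcha length


def encode_label(text):
    """Return one one-hot vector (length 36) per character position."""
    if len(text) != CAPTCHA_LEN:
        raise ValueError(f"expected {CAPTCHA_LEN} characters, got {len(text)}")
    vectors = []
    for ch in text:
        vec = [0] * len(CHARSET)
        vec[CHARSET.index(ch)] = 1  # raises ValueError for symbols outside CHARSET
        vectors.append(vec)
    return vectors


def decode_prediction(scores):
    """Pick the highest-scoring symbol at each position and join them into a string."""
    return "".join(CHARSET[max(range(len(row)), key=row.__getitem__)]
                   for row in scores)


def captcha_accuracy(predicted, expected):
    """Fraction of captchas where every character matches (whole-string accuracy)."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)
```

For example, `decode_prediction(encode_label("A3X9"))` round-trips to the original string, and a model's per-position score rows can be passed to `decode_prediction` in the same way.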


eromoe commented 7 years ago

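But I don't have 100,000 labeled captchas (unlabeled ones would be easy to get). That is why I asked whether there is some generic method. Point 3 above is the approach I think might work: use some model to induce patterns, then manually label the induced patterns. But I lack the background and don't know how to do it.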

ypwhs commented 7 years ago

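Captcha-solving platforms make it easy to get large numbers of captchas: as long as you pay, you get labeled data. Generating captchas yourself is not only tiring, it also cannot exactly match the captchas you need to recognize. So I suggest finding a way to obtain more data rather than agonizing over how to generate a wide variety of captchas.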
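
Once labeled images are available, they still need to be loaded and split before training. The sketch below assumes, purely for illustration, that each file is named `<label>_<id>.png` (for example `A3X9_0001.png`); the real naming scheme depends on how the data was collected, and both function names are hypothetical.

```python
import random
from pathlib import Path


def load_labeled_paths(folder, suffix=".png"):
    """Return sorted (path, label) pairs, taking the label from the file name.

    Assumption: the label is the part of the file name before the first underscore.
    """
    pairs = []
    for path in sorted(Path(folder).glob(f"*{suffix}")):
        label = path.stem.split("_", 1)[0]
        pairs.append((path, label))
    return pairs


def split_dataset(pairs, val_fraction=0.1, seed=0):
    """Shuffle reproducibly and return (train, validation) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Keeping a held-out validation split of the real captchas is what makes an accuracy figure meaningful for the target site, as opposed to accuracy on self-generated images.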


eromoe commented 7 years ago

So labeled data is a must, then. I thought there was a way to recognize the characters automatically. Thanks!

ypwhs commented 7 years ago

The prerequisite for recognizing 100 million images is labeling 10,000 images.
