yeguixin / captcha_solver

Source code for ACM CCS 2018

small set #2

Closed an1018 closed 5 years ago

an1018 commented 5 years ago

@yeguixin How can you train the network well with only 500 real captchas? Is it because the network is simple, or do you use other techniques? Could you give me some suggestions? I look forward to your reply!

yeguixin commented 5 years ago

We first use a generator to synthesize the captchas. Actually, the generator is a traditional captcha generator. After synthesizing the captchas, we use a discriminator to distinguish the synthetic captchas from the real ones. To make sure the style of the synthetic captchas is similar to the real ones, we use 500 real captchas to tune the generator parameters. For details, you can refer to my paper: http://delivery.acm.org/10.1145/3250000/3243754/p332-ye.pdf?ip=148.88.244.92&id=3243754&acc=ACTIVE%20SERVICE&key=BF07A2EE685417C5%2EF52F20EBE5138950%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1544006378_04ab2419edba35c7875b03e4147dd03d

an1018 commented 5 years ago

Thanks for your reply. You mentioned using 500 real captchas to tune the generator parameters, so the training set is 500 captchas. It usually takes a lot of data to train a network well. I'm confused about how you can train the network well with just 500 images.

yeguixin commented 5 years ago

Actually, the training set is not only the 500 real captchas. It includes both real captchas and synthetic captchas generated by the captcha generator. Note that the captcha generator can produce captchas using initial parameters such as rotation, distortion, waving and so on. In fact, the more real captchas are used, the better the model will be. In our work, to quickly launch the attack, we used 500 real captchas, which we found performs well.
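To make the reply above concrete, here is a minimal sketch of how a small real set could be mixed with generator output. It is not the authors' code: `GeneratorParams`, `Sample`, `synthesize` and `build_training_set` are hypothetical names, the parameter values and the synthetic count are arbitrary, and the "rendering" is reduced to a parameter record so the example stays self-contained.

```python
import random
import string
from dataclasses import dataclass, field

ALPHABET = string.ascii_lowercase + string.digits  # assumed character set


@dataclass
class GeneratorParams:
    """Hypothetical knobs of a traditional captcha generator."""
    max_rotation_deg: float = 15.0  # per-character rotation range
    distortion: float = 0.3         # strength of geometric distortion
    wave_amplitude: float = 2.0     # strength of the "waving" effect


@dataclass
class Sample:
    label: str
    source: str                                       # "real" or "synthetic"
    render_spec: dict = field(default_factory=dict)   # stand-in for the image


def synthesize(params, rng, length=4):
    """Draw a random label and the per-sample rendering parameters."""
    label = "".join(rng.choice(ALPHABET) for _ in range(length))
    spec = {
        "rotations": [
            rng.uniform(-params.max_rotation_deg, params.max_rotation_deg)
            for _ in label
        ],
        "distortion": params.distortion,
        "wave_amplitude": params.wave_amplitude,
    }
    return Sample(label, "synthetic", spec)


def build_training_set(real_samples, params, n_synthetic, seed=0):
    """Combine the real captchas with freshly synthesized ones and shuffle."""
    rng = random.Random(seed)
    synthetic = [synthesize(params, rng) for _ in range(n_synthetic)]
    mixed = list(real_samples) + synthetic
    rng.shuffle(mixed)
    return mixed
```

In this sketch the 500 real samples are only a small fraction of the data; the generator supplies the rest, and its parameters are what the real samples would be used to tune.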

an1018 commented 5 years ago

Sorry to trouble you again. I wonder why the captcha generator uses only 500 captchas, and how, with such a small dataset, it can generate captchas similar to the real ones.

an1018 commented 5 years ago

@yeguixin Do you use transfer learning? Looking forward to your reply!

yeguixin commented 5 years ago

Hi, in our initial experiments, we used 500, 1000, 2000 and 5000 real captchas, respectively. We found little difference between the results. But with 500 real captchas the model overfits more easily. To prevent overfitting, we set the dropout lower than 0.5, according to the complexity of the captcha scheme. Also, we apply multi-scale transformations to the captcha images.
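The two anti-overfitting measures mentioned here can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: it assumes that "multi-scale transformation" means resizing each image to several scales, that the 0.5 bound refers to the drop probability, and that images are 2D lists of pixel values. The function names and the scale values are hypothetical.

```python
import random


def resize_nearest(image, scale):
    """Resize a 2D list image by `scale` using nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    return [
        [
            image[min(src_h - 1, int(r / scale))][min(src_w - 1, int(c / scale))]
            for c in range(dst_w)
        ]
        for r in range(dst_h)
    ]


def multi_scale(image, scales=(0.5, 1.0, 2.0)):
    """Return one resized copy of the image per scale (scales are arbitrary)."""
    return [resize_nearest(image, s) for s in scales]


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

For example, `dropout(values, 0.3, random.Random(0))` uses a rate below 0.5, and `multi_scale(img)` turns one captcha into three training views.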

I used transfer learning to tune the base solver, which is trained on synthetic captchas, because the synthetic captchas are not exactly the same as the real ones.
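The two-stage idea (train a base model on synthetic data, then fine-tune it on a few real samples) can be shown with a deliberately tiny toy. This is an assumption-laden sketch, not the solver from the paper: a 1-D linear model stands in for the network, the data are made up, and the "real" set differs from the synthetic one only by a constant offset.

```python
def mse(params, data):
    """Mean squared error of the model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, epochs):
    """Full-batch gradient descent, starting from the given parameters."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy stand-ins: plenty of synthetic data, few "real" points with a small shift.
synthetic = [(k / 100, 2.0 * (k / 100)) for k in range(-100, 101)]
real = [(k / 10, 2.0 * (k / 10) + 0.5) for k in range(-5, 6)]

# Stage 1: base model trained from scratch on synthetic data only.
base = train((0.0, 0.0), synthetic, lr=0.1, epochs=500)

# Stage 2: fine-tune the base model on the small real set.
tuned = train(base, real, lr=0.05, epochs=200)
```

The base model fits the synthetic data but carries a systematic error on the real points; starting the second stage from `base` instead of from zero lets the small real set correct only that gap.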