Captchas whose last two characters are identical always seem to be misrecognized

ypwhs / captcha_break

Captcha recognition

MIT License


Captchas whose last two characters are identical always seem to be misrecognized #60

Open devmonkeyx opened 4 years ago

devmonkeyx commented 4 years ago

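Hi. In the decode function, the logic appears to drop the last character whenever it equals the character before it. Doesn't that mean a captcha whose last two characters are identical can never be recognized correctly? For example, with 6666 the decoding goes like this:

a: -6-6-6--6
s: 666

The final output is 666.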

ypwhs commented 4 years ago

Yes, you found a bug. The original implementation was:

def decode(sequence):
    a = ''.join([characters[x] for x in sequence])
    s = ''.join([x for j, x in enumerate(a[:-1]) if x != characters[0] and x != a[j+1]])
    if len(s) == 0:
        return ''
    if a[-1] != characters[0] and s[-1] != a[-1]:
        s += a[-1]
    return s
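
The failure from the report can be reproduced with a small sketch. The `characters` string below (blank `-` at index 0 followed by the digits) and the name `decode_old` are assumptions made for illustration; the repository's actual alphabet may differ.

```python
# Assumption: blank '-' at index 0, then the ten digits.
characters = '-0123456789'

def decode_old(sequence):
    """The original decode quoted above, kept unchanged to show the bug."""
    a = ''.join([characters[x] for x in sequence])
    s = ''.join([x for j, x in enumerate(a[:-1])
                 if x != characters[0] and x != a[j + 1]])
    if len(s) == 0:
        return ''
    # The last character is only appended if it differs from the last kept
    # character, even when a blank separated the two.
    if a[-1] != characters[0] and s[-1] != a[-1]:
        s += a[-1]
    return s

# Indices spelling '-6-6-6--6' ('6' sits at index 7 in this alphabet).
sequence = [0, 7, 0, 7, 0, 7, 0, 0, 7]
print(decode_old(sequence))  # 666
```

The final `6` is dropped because the check compares it with `s[-1]`, the last kept character, rather than with the previous element of `a`.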

The correct way to write this function is to collapse consecutive duplicates first, and then remove the blanks:

def decode(sequence):
    a = ''.join([characters[x] for x in sequence])
    s = []
    last = None
    for x in a:
        if x != last:
            s.append(x)
            last = x
    s2 = ''.join([x for x in s if x != characters[0]])
    return s2

a = ['-', '6', '-', '6', '-', '6', '-', '-', '6']
s = ['-', '6', '-', '6', '-', '6', '-', '6']
s2 = '6666'
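
The same two steps (collapse repeats, then drop blanks) can also be sketched with `itertools.groupby`. This is an equivalent formulation offered for illustration, not the code committed to the repository; the `characters` string is again an assumed alphabet.

```python
from itertools import groupby

# Assumption: blank '-' at index 0, then the ten digits.
characters = '-0123456789'

def decode(sequence):
    """Greedy CTC-style decode: merge consecutive repeats, then strip blanks."""
    a = [characters[x] for x in sequence]
    collapsed = [key for key, _ in groupby(a)]   # step 1: merge repeats
    return ''.join(x for x in collapsed if x != characters[0])  # step 2

print(decode([0, 7, 0, 7, 0, 7, 0, 0, 7]))  # 6666
print(decode([7, 7, 0, 7]))                 # 66
```

The second call shows why the order matters: the two adjacent `6`s merge into one, while the `6` after the blank is kept as a separate character.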

AzureSkyHuHu commented 4 years ago

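tensor([9, 0, 9, 0, 9, 9], device='cuda:0')
&0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
['8', '&', '8', '&', '8']

This version still seems to have a problem. When I recognize 2222 here, the result is still 222. The sequence does not always seem to have the form ['-', '6', '-', '6', '-', '6', '-', '-', '6']. Or is the problem on my side?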

ypwhs commented 4 years ago

def decode(sequence):
    a = ''.join([characters[x] for x in sequence])
    s = []
    last = None
    for x in a:
        if x != last:
            s.append(x)
            last = x
    s2 = ''.join([x for x in s if x != characters[0]])
    return s2

characters = '&0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
sequence = [9, 0, 9, 0, 9]
decode(sequence)
# output: 888
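
One possible reading of the tensor reported above, sketched below: it contains only three separate runs of the non-blank index 9, because its last two elements are adjacent with no blank between them. If that reading is right, the fourth character is already missing from the model's prediction and no decode function could recover it. This is an interpretation rather than something confirmed in the thread, and the helper name `runs` is hypothetical.

```python
from itertools import groupby

characters = '&0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def runs(sequence):
    """Return (character, run_length) pairs for each run of equal indices."""
    return [(characters[k], len(list(g))) for k, g in groupby(sequence)]

reported = [9, 0, 9, 0, 9, 9]  # values of the tensor posted above
print(runs(reported))
# [('8', 1), ('&', 1), ('8', 1), ('&', 1), ('8', 2)]
```

Only three runs are non-blank, so the decoded string has three characters regardless of how the last run is handled.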

Doghole commented 3 years ago

Before you feed a label to the model, insert a separator between each pair of characters and at the beginning and end of the label. For example, if your label is 'ABCD' and your blank is '-' at index 0 of your characters, then after inserting the separator your label becomes '-A-B-C-D-'. Note that after the insertion your label_length is no longer 4 but 9 (4 characters and 5 blanks). When decoding a target sequence, change your code as follows:

# Utilss.py
def decode_target(sequence, characters):
    s = [characters[x] for x in sequence if x != 0]
    return s

# main.py
def decode_target(sequence):
    return ''.join([characters[x] for x in sequence if x != 0]).replace(' ', '')

Keep decode_output as it is.
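
The label padding described in this comment could look like the sketch below. The helper `insert_blanks` and the alphabet (blank `-` followed by uppercase letters) are hypothetical names chosen for illustration; only `decode_target` follows the code given in the comment.

```python
import string

# Assumption: blank '-' at index 0, then the uppercase letters.
characters = '-' + string.ascii_uppercase

def insert_blanks(label, blank='-'):
    """Put a blank between every pair of characters and at both ends."""
    return blank + blank.join(label) + blank

def decode_target(sequence):
    """Drop blank indices (0) and map the rest back to characters."""
    return ''.join([characters[x] for x in sequence if x != 0])

padded = insert_blanks('ABCD')
target = [characters.index(c) for c in padded]

print(padded, len(padded))    # -A-B-C-D- 9
print(decode_target(target))  # ABCD
```

With this scheme label_length has to be computed from the padded label (9 here), as the comment points out.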