A friend asked me to help her manipulate the results of an online poll. Being a challenge-lover, and not having done anything like this before, I gladly accepted.
chkCapPass(id), which, along with other related code, are listed below: (Comments are removed since they are in Chinese anyway.)
Combining this with the network tab and a manually sent vote, it’s easy to see that the process is as follows:
- Send a request to /vote/chkcapnum to see if a CAPTCHA should be displayed
- If so (in my tests, it seems to always be the case), get the CAPTCHA page from
- Fetch the CAPTCHA image from /captcha/chkcode (link from the previous step)
- Send a POST request to /vote/to_vote with the id and CAPTCHA string
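The steps above can be sketched roughly as follows. This is only an outline under assumptions: BASE_URL is a placeholder (the real site is hidden), the form field names "id" and "captcha" are guesses, and the CAPTCHA page step is left out because its URL is not given here. The helpers only build URLs and requests; nothing is sent.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Placeholder: the real site is not named in this post.
BASE_URL = "http://example.com"


def captcha_check_url(base_url=BASE_URL):
    """URL that tells us whether a CAPTCHA has to be solved first."""
    return urljoin(base_url, "/vote/chkcapnum")


def captcha_image_url(base_url=BASE_URL):
    """URL of the CAPTCHA image.

    In practice the exact link is taken from the CAPTCHA page, so it may
    carry extra parameters that are not shown here.
    """
    return urljoin(base_url, "/captcha/chkcode")


def build_vote_request(vote_id, captcha_text, base_url=BASE_URL):
    """Build (but do not send) the final POST request.

    The field names "id" and "captcha" are assumptions; the network tab
    shows the names the real form uses.
    """
    body = urlencode({"id": vote_id, "captcha": captcha_text}).encode("ascii")
    return Request(urljoin(base_url, "/vote/to_vote"), data=body, method="POST")
```

Passing the returned request to urllib.request.urlopen would send it. The server most likely ties the CAPTCHA to a session, so the cookies probably need to be carried across all of the steps.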
After understanding the requests, we need to programmatically solve the CAPTCHAs.
I tried to use Tesseract for the job. However, many of the Python wrappers depend on packages such as PIL or OpenCV, which aren’t that straightforward to install on Windows. (Asking my friend to install Linux in VirtualBox would, of course, be difficult.)
Thus, I decided to call the command-line program directly via subprocess.Popen. Dirty, but it works. However, another problem arose: Tesseract did not seem to recognize the characters at all.
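For reference, calling Tesseract through subprocess.Popen might look something like the sketch below. It assumes the tesseract executable is on the PATH; the function names are made up for this example. Tesseract writes its result to a text file named after the output base, so the text is read back from there.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Build the argument list for the tesseract command-line program."""
    return ["tesseract", image_path, output_base]


def run_tesseract(image_path, output_base):
    """Run tesseract and return the recognized text.

    Assumes tesseract is installed and on the PATH.
    """
    proc = subprocess.Popen(
        tesseract_command(image_path, output_base),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    proc.communicate()  # wait for the process to finish
    if proc.returncode != 0:
        raise RuntimeError("tesseract exited with code %d" % proc.returncode)
    # The result is written to <output_base>.txt
    with open(output_base + ".txt") as f:
        return f.read().strip()
```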
Then, I found a snippet that uses OpenCV to enhance the image before feeding it into Tesseract. However, as mentioned above, installing OpenCV on Windows seems to require jumping through a lot of hoops.
Thus, I launched GIMP, and used its “threshold” function. Essentially, this forces all pixels below a certain threshold into black, and those above into white. Sure enough, it worked like a charm.
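To make the idea concrete, here is a toy version that works on a single row of grayscale values (0 to 255). The cutoff of 128 is an arbitrary choice for illustration, and values exactly at the cutoff go to white here; GIMP may treat the boundary differently.

```python
def threshold(pixels, cutoff=128):
    """Map each grayscale value to pure black (0) or pure white (255)."""
    return [0 if value < cutoff else 255 for value in pixels]


row = [12, 90, 127, 128, 200, 255]
print(threshold(row))  # [0, 0, 0, 255, 255, 255]
```

Faint background noise and light strokes end up on one side of the cutoff and the dark characters on the other, which is presumably why the OCR has a much easier time afterwards.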
Of course, I didn’t want to call such a large program (in fact, I don’t even know whether GIMP has a CLI). So I turned to ImageMagick, one of the best-known command-line image processing tools.
After some trial and error, the following parameters seem to work best:
convert captcha.png -black-threshold 60% -white-threshold 40% captcha-mod.png
Also, it is worth mentioning that calling Tesseract like this:
tesseract captcha-mod.png captcha-txt digits
limits the results to digits, which can potentially increase the accuracy here.
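Put together, the two commands can be chained from Python along these lines. This is a sketch that assumes both convert and tesseract are on the PATH; the helper names are invented for the example, and clean_digits is an extra safety net of mine that drops any whitespace or stray characters left in the OCR output.

```python
import subprocess


def preprocess_command(src, dst):
    """ImageMagick call that pushes the image towards pure black and white."""
    return ["convert", src, "-black-threshold", "60%",
            "-white-threshold", "40%", dst]


def ocr_command(image_path, output_base):
    """Tesseract call restricted to digits via the "digits" config."""
    return ["tesseract", image_path, output_base, "digits"]


def clean_digits(raw_text):
    """Keep only the characters 0-9 from the OCR output."""
    return "".join(ch for ch in raw_text if ch in "0123456789")


def solve_captcha(src="captcha.png"):
    """Preprocess the image, run OCR, and return the digits found."""
    subprocess.check_call(preprocess_command(src, "captcha-mod.png"))
    subprocess.check_call(ocr_command("captcha-mod.png", "captcha-txt"))
    # Tesseract writes its result to captcha-txt.txt
    with open("captcha-txt.txt") as f:
        return clean_digits(f.read())
```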
After trying the code out, however, it seems like the server blocks your IP if you send requests too frequently. Of course, I could add some code that fetches a list of proxies from the Internet and rotates through them, but frankly I’m a bit lazy. So in the end I just added a simple time.sleep(30) in the loop.
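The throttled loop boils down to something like the sketch below. The vote_once callable stands in for the whole fetch, solve, and submit sequence and is assumed to return True on success; the sleep parameter is there only so the loop can be tried out without actually waiting.

```python
import time


def vote_repeatedly(vote_once, attempts, delay=30, sleep=time.sleep):
    """Call vote_once() `attempts` times, pausing `delay` seconds in between.

    Returns the number of calls that reported success.
    """
    successes = 0
    for i in range(attempts):
        if vote_once():
            successes += 1
        if i < attempts - 1:  # no need to wait after the last attempt
            sleep(delay)
    return successes
```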
The following is the resulting full code: (The website and vote id have been hidden/modified.)