yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Problem fetching password protected pages #39

Open Alive-and-Well opened 9 years ago

Alive-and-Well commented 9 years ago


I am trying to fetch password-protected pages (e.g. twitter.com). To do that I use "FormAuthInfo":

    authInfo = new FormAuthInfo("username", "password", "https://twitter.com/sessions", "session[username_or_email]", "session[password]");
    config.addAuthInfo(authInfo);

When I start my crawler I get the following output:

    [main] INFO edu.uci.ics.crawler4j.fetcher.PageFetcher - FORM authentication for: /sessions123
    [main] DEBUG edu.uci.ics.crawler4j.fetcher.PageFetcher - Successfully Logged in with user: username to: twitter.com

But the crawler doesn’t crawl the password protected site.

Is there a problem with the cookie that needs to be sent to the server?

SaiTejaswini commented 9 years ago

Any updates on this? I see the same problem.

Alive-and-Well commented 9 years ago

Hey, I think there is no real solution to this. The site is protected against cross-site request forgery (CSRF), which makes it impossible for the crawler to read it.
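
For anyone landing here later, a rough illustration of why CSRF protection gets in the way: such login forms usually carry a hidden, per-session token that has to be read from the login page and sent back together with the credentials, while the form login configured above only names a username field and a password field. The sketch below uses only the JDK and is not crawler4j API. The class name, the helper methods and the field name "authenticity_token" are assumptions made for illustration, and the regex is a simplification that expects the name attribute to come before value.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of crawler4j.
final class CsrfLoginSketch {

    // Returns the value of <input ... name="fieldName" ... value="...">, if present.
    // Simplification: assumes double-quoted attributes with name before value.
    static Optional<String> extractHiddenField(String html, String fieldName) {
        Pattern p = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(fieldName) + "\"[^>]*value=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Builds an application/x-www-form-urlencoded body from the given fields,
    // keeping the iteration order of the map.
    static String buildFormBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

The idea would be to fetch the login page first, call extractHiddenField on its HTML, and add the token to the map passed to buildFormBody alongside "session[username_or_email]" and "session[password]". The session cookie from that first request has to be sent with the login POST as well. Whether this is enough for a particular site is not something this thread confirms.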