xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Proxy information gets lost when using basic authentication #330

Open GoogleCodeExporter opened 9 years ago

What steps will reproduce the problem?
1. Set proxy settings in CrawlConfig
2. Add BasicAuthInfo to CrawlConfig
3. Try to crawl a site with basic authentication
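For reference, a minimal configuration that triggers the problem might look like the sketch below (host names, port, and credentials are placeholders; depending on the crawler4j version, the BasicAuthInfo constructor may declare a checked MalformedURLException):

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl");

    // Proxy settings that are later lost
    config.setProxyHost("proxy.example.com");
    config.setProxyPort(8080);

    // Basic authentication for the target site
    config.addAuthInfo(new BasicAuthInfo("user", "secret", "http://protected.example.com/"));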

What is the expected output? What do you see instead?
The crawler should crawl the URL and fetch the data.
Instead it cannot connect, because the configured proxy settings are no longer applied.

What version of the product are you using?
4.0

Please provide any additional information below.
The code in PageFetcher.java needs to be changed.

Currently, proxy information (and possibly other client settings) is lost when
performing basic authentication.

In the method PageFetcher.doBasicLogin(BasicAuthInfo authInfo), a new HttpClient is
created from scratch, which discards the proxy configured earlier:

    /**
     * BASIC authentication<br/>
     * Official example:
     * https://hc.apache.org/httpcomponents-client-ga/httpclient/examples/org/apache/http/examples/client/ClientAuthentication.java
     */
    protected void doBasicLogin(BasicAuthInfo authInfo) {
        HttpHost targetHost = new HttpHost(authInfo.getHost(), authInfo.getPort(), authInfo.getProtocol());
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(new AuthScope(targetHost.getHostName(), targetHost.getPort()),
                new UsernamePasswordCredentials(authInfo.getUsername(), authInfo.getPassword()));
        httpClient = HttpClients.custom().setDefaultCredentialsProvider(credsProvider).build();
    }
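One possible fix is to re-apply the proxy settings from the CrawlConfig when the new client is built. The following is an untested sketch, assuming PageFetcher keeps a reference to the CrawlConfig in a field named config (as it does in 4.0):

    protected void doBasicLogin(BasicAuthInfo authInfo) {
        HttpHost targetHost = new HttpHost(authInfo.getHost(), authInfo.getPort(), authInfo.getProtocol());
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(new AuthScope(targetHost.getHostName(), targetHost.getPort()),
                new UsernamePasswordCredentials(authInfo.getUsername(), authInfo.getPassword()));

        HttpClientBuilder clientBuilder = HttpClients.custom()
                .setDefaultCredentialsProvider(credsProvider);

        // Re-apply the proxy from CrawlConfig so it is not lost
        if (config.getProxyHost() != null) {
            if (config.getProxyUsername() != null) {
                // Proxy credentials, if configured
                credsProvider.setCredentials(
                        new AuthScope(config.getProxyHost(), config.getProxyPort()),
                        new UsernamePasswordCredentials(config.getProxyUsername(), config.getProxyPassword()));
            }
            clientBuilder.setProxy(new HttpHost(config.getProxyHost(), config.getProxyPort()));
        }

        httpClient = clientBuilder.build();
    }

Alternatively, PageFetcher could keep the HttpClientBuilder it used in its constructor and only add the credentials provider to it, so all previously configured settings survive.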

Original issue reported on code.google.com by wefwefw...@gmail.com on 6 Jan 2015 at 11:19