xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

crawler storage data size is increasing #123

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I am crawling a website and extracting the required information, but the problem is 
that the crawler storage data size keeps increasing. I want some kind of setting 
to control the crawler storage data. I only want to store the history of URLs the 
crawler has crawled; I do not want to store any other information.

If there is no such setting, could you please let me know in which Java class 
this storage happens and what kind of information it stores?

I was surprised that while crawling a website that has few videos and photos, 
and extracting only text information, the crawler storage data size was still 
more than 30 GB.

Please help me regarding this.

Thank you.

Original issue reported on code.google.com by b.like.no.other on 17 Feb 2012 at 10:20

GoogleCodeExporter commented 9 years ago
crawler4j only stores URLs for its internal processing. I assume any other data 
is collected by your own program; let me know if this is not the case. Which 
folder is 30 GB in size?

-Yasser

Original comment by ganjisaffar@gmail.com on 17 Feb 2012 at 4:55

GoogleCodeExporter commented 9 years ago
The frontier folder, which is the crawl storage folder.
After crawling each page I extract information from the HTML data and store it 
in a database.

Original comment by b.like.no.other on 18 Feb 2012 at 3:59

GoogleCodeExporter commented 9 years ago
The following is my visit function code, and yet the frontier folder size is 
around 30 GB:
    @Override
    public void visit(Page page) {
        // Only handle pages whose content was parsed as HTML.
        if (page.getParseData() instanceof HtmlParseData) {
            String url = page.getWebURL().getURL();
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            // getHtml() returns the full raw page source.
            String html = htmlParseData.getHtml();
            RawHtmlDataBean.add(url, html); // stores the data in a MySQL database
        }
    }

Thank you.

Original comment by b.like.no.other on 22 Feb 2012 at 4:05

GoogleCodeExporter commented 9 years ago
I am having the same issue. What is the solution?

Original comment by jeger...@gmail.com on 16 May 2014 at 4:20

GoogleCodeExporter commented 9 years ago
The problem is not in crawler4j. You are inserting web data into a DB, and it is 
large because the web is very large.

My advice is to insert only the text you actually need into the DB, not all the 
raw data, and to compress the text before inserting it; see the sketch below.
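A minimal sketch of both suggestions, adapting the visit method from the earlier 
comment. HtmlParseData.getText() is crawler4j's extracted plain text (as opposed 
to getHtml(), the raw source); CompressedTextBean is a hypothetical stand-in for 
your own persistence helper.

    // These imports go at the top of your crawler class.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String url = page.getWebURL().getURL();
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            // Store only the extracted plain text, not the full HTML source.
            String text = htmlParseData.getText();
            try {
                // Gzip the text in memory before inserting it.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                    gzip.write(text.getBytes(StandardCharsets.UTF_8));
                }
                // Hypothetical helper; store the bytes in a BLOB column.
                CompressedTextBean.add(url, buffer.toByteArray());
            } catch (IOException e) {
                // An in-memory gzip stream should not normally fail.
                e.printStackTrace();
            }
        }
    }

Plain text is usually a small fraction of the raw HTML, and gzip shrinks it 
further, so the database should grow far more slowly.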

Anyway, it is not a bug...

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:13

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:14