crawler4j only stores URLs for its internal processing. I assume any other data is collected by your own program; let me know if this is not the case. Which folder is 30 GB?
-Yasser
Original comment by ganjisaffar@gmail.com
on 17 Feb 2012 at 4:55
The frontier folder, which is the crawl storage folder.
After crawling each page I extract information from the HTML data and store it
in a database.
Original comment by b.like.no.other
on 18 Feb 2012 at 3:59
The following is my visit function code, and even so the frontier folder size is
around 30 GB:
@Override
public void visit(Page page) {
    if (page.getParseData() instanceof HtmlParseData) {
        String url = page.getWebURL().getURL();
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        RawHtmlDataBean.add(url, html); // stores the data in a MySQL database
    }
}
Thank you.
Original comment by b.like.no.other
on 22 Feb 2012 at 4:05
I am having the same issue. What is the solution?
Original comment by jeger...@gmail.com
on 16 May 2014 at 4:20
The problem is not in crawler4j: you are inserting WWW data into a DB, and it
is large because the WWW is very large.
My advice is to insert only the needed text into the DB, not all the raw data,
and also to try compressing the text before inserting it into the DB.
Anyway, it is not a bug...
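
For illustration, a minimal sketch of that advice, assuming the reporter's RawHtmlDataBean is adapted to accept a byte[] (e.g. stored in a BLOB column); it keeps only crawler4j's extracted text via getText() instead of the raw HTML, and gzips it before insertion:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class TextOnlyCrawler extends WebCrawler {

    // Gzip a string into a byte array suitable for a BLOB column.
    private static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String url = page.getWebURL().getURL();
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            // Keep only the extracted text, not the full raw HTML.
            String text = htmlParseData.getText();
            try {
                byte[] compressed = gzip(text);
                // Hypothetical overload of the reporter's own DAO; adapt as needed.
                RawHtmlDataBean.add(url, compressed);
            } catch (IOException e) {
                e.printStackTrace(); // replace with proper logging
            }
        }
    }
}

Storing gzipped extracted text instead of raw HTML usually shrinks each row considerably, since markup dominates most pages.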
Original comment by avrah...@gmail.com
on 11 Aug 2014 at 1:13
Not a bug or feature request
Original comment by avrah...@gmail.com
on 11 Aug 2014 at 1:14
Original issue reported on code.google.com by
b.like.no.other
on 17 Feb 2012 at 10:20