yahoo / FEL

Fast Entity Linker: a toolkit for training models that link entities in documents and queries to a knowledge base (Wikipedia).
Apache License 2.0

GC overhead limit when mining Wikipedia and extracting anchor text #18

Open shubhamagarwal92 opened 6 years ago

shubhamagarwal92 commented 6 years ago

Hi

I am following the steps provided here to train my model.

I have pre-processed the datapack, but when I run the "Build Data Structures and extract anchor text" step, the job fails with a GC overhead limit error.

(screenshot, 2018-05-29: console output ending in the GC overhead limit error)

I have increased the MAPRED and HADOOP memory to 15 GB and have also set -Dmapreduce.reduce.java.opts and -Dmapreduce.reduce.memory.mb.

My system has 8 cores and 32 GB of RAM, running Java 8. This is the command I am using:

hadoop \
jar target/FEL-0.1.0-fat.jar \
com.yahoo.semsearch.fastlinking.io.ExtractWikipediaAnchorText \
-Dmapreduce.map.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapreduce.reduce.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dyarn.app.mapreduce.am.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapred.job.map.memory.mb=15144 \
-Dmapreduce.map.memory.mb=15144 \
-Dmapreduce.reduce.memory.mb=15144 \
-Dmapred.child.java.opts="-Xmx15g" \
-Dmapreduce.map.java.opts='-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC' \
-Dmapreduce.reduce.java.opts="-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC" \
-input wiki/${WIKI_MARKET}/${WIKI_DATE}/pages-articles.block \
-emap wiki/${WIKI_MARKET}/${WIKI_DATE}/entities.map \
-amap wiki/${WIKI_MARKET}/${WIKI_DATE}/anchors.map \
-cfmap wiki/${WIKI_MARKET}/${WIKI_DATE}/alias-entity-counts.map \
-redir wiki/${WIKI_MARKET}/${WIKI_DATE}/redirects
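
One detail that is easy to miss: the -Dmapreduce.*.memory.mb and -Dmapreduce.*.java.opts flags above only size the YARN task containers. If the "GC overhead limit exceeded" message is printed by the hadoop jar command itself, before any task attempt fails, it is the local client JVM that is out of heap, and that JVM is sized through the HADOOP_CLIENT_OPTS environment variable instead. A minimal sketch, assuming a 12g heap fits comfortably on a 32 GB machine (the size is an assumption, not a setting from this thread):

# Raise the heap of the local JVM started by `hadoop jar`.
# 12g is an assumed value for a 32 GB machine.
export HADOOP_CLIENT_OPTS="-Xmx12g ${HADOOP_CLIENT_OPTS}"

If the error instead appears in the task logs, the container flags are the right lever, but -Xmx should stay below mapreduce.*.memory.mb (roughly 80% of it); the command above pairs -Xmx15g (15360 MB) with 15144 MB containers, leaving no headroom at all.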

Could you please suggest why this might be happening?

Pardon me, as I am a novice with Hadoop and Java.

shubhamagarwal92 commented 6 years ago

@aasish Could you please comment on how I should resolve this?

shubhamagarwal92 commented 6 years ago

FYI, I solved the issue with this shell script. The README needs to be updated.
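
The script itself is linked from the issue rather than quoted, so the exact fix is not reproduced here. As a hypothetical sketch of the kind of wrapper that resolves this class of error, raising the client heap and keeping the container heaps below the container size (all memory sizes and datapack values below are assumptions, not the author's actual script):

#!/usr/bin/env bash
# Hypothetical wrapper; the author's actual script is linked from the issue
# and may differ. All sizes and WIKI_* values are assumptions.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLIENT_OPTS="-Xmx12g"   # heap for the local `hadoop jar` JVM

WIKI_MARKET=en          # substitute your own datapack market/date
WIKI_DATE=20180501

hadoop \
jar target/FEL-0.1.0-fat.jar \
com.yahoo.semsearch.fastlinking.io.ExtractWikipediaAnchorText \
-Dmapreduce.map.memory.mb=15144 \
-Dmapreduce.reduce.memory.mb=15144 \
-Dmapreduce.map.java.opts="-Xmx12g" \
-Dmapreduce.reduce.java.opts="-Xmx12g" \
-input wiki/${WIKI_MARKET}/${WIKI_DATE}/pages-articles.block \
-emap wiki/${WIKI_MARKET}/${WIKI_DATE}/entities.map \
-amap wiki/${WIKI_MARKET}/${WIKI_DATE}/anchors.map \
-cfmap wiki/${WIKI_MARKET}/${WIKI_DATE}/alias-entity-counts.map \
-redir wiki/${WIKI_MARKET}/${WIKI_DATE}/redirects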