jobs in cluster mode fail using with-job-conf & mapred.child.java.opts

wri / forma-clj

The Forest Monitoring for Action (FORMA) project provides forest clearing alerts derived from MODIS satellite imagery every 16 days beginning in December 2005. FORMA is a project of World Resources Institute and was developed by the Center for Global Development.

Eclipse Public License 1.0


jobs in cluster mode fail using with-job-conf & mapred.child.java.opts #70

Closed robinkraft closed 11 years ago

robinkraft commented 12 years ago

In cluster mode (using EMR at least), for any query that uses with-job-conf to set the mapred.child.java.opts property, the job never actually starts. It tries to start a number of times, but ultimately fails. Changes to the setting directly in mapred-site.xml don't get picked up for some reason, so this setting has to be modified in hadoop-site.xml:

<property><name>mapred.child.java.opts</name><value>-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m</value></property>

Sample query to reproduce error

This appears to happen only in cluster mode, and it reproduces even on a single-instance EMR cluster. After building the uberjar, run (use 'cascalog.api) at the REPL, then run this:

;; Minimal repro: override the child JVM options for a trivial one-tuple query.
(with-job-conf {"mapred.child.java.opts" "-Xmx512"}
  (let [src [[1 2]]
        out-loc (hfs-seqfile "s3n://formaexperiments/test-with-job-conf" :sinkmode :replace)]
    (?<- out-loc [?a]
         (src ?a ?b))))

Things I've tried

The query I really want to run (forma/beta-gen) starts if you don't use with-job-conf, but eventually fails because the reducers run out of memory for big ecoregions in Brazil and Indonesia. For a smaller country like Malaysia, we don't need to modify the memory configuration, but we must be able to control the memory configuration in order to calculate the beta vectors.
The simple sample query above works without using with-job-conf.
It works with (with-job-conf {"mapred.map.tasks" 10} ...
It fails using Cascalog 1.9 AND Cascalog 1.9-wip with (with-job-conf {"mapred.child.java.opts" "-Xmx512"} ... and for several other memory configurations.
It works if conf/hadoop-site.xml is modified, in this case so that the max child process memory allocation is 1025m:

<property><name>mapred.child.java.opts</name><value>-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m</value></property>

As far as workarounds go it's not too bad, but it's definitely a pain.
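
One thing that may be worth ruling out, though nothing in this thread confirms it as the cause: the -Xmx512 in the sample query carries no unit suffix, so the JVM reads it as 512 bytes, a heap size HotSpot refuses to start with. A child JVM that dies at startup would be consistent with the "Task process exit with nonzero status of 1" errors in the logs. A job-level value also replaces the whole property, so the -Djava.library.path flag from the site file is no longer passed unless it is repeated. The sketch below builds the option string with explicit units; child-opts is a hypothetical helper, not part of Cascalog, and the default library path is simply the one from the hadoop-site.xml entry above.

```clojure
(require '[clojure.string :as str])

(defn child-opts
  "Builds a value for mapred.child.java.opts with explicit megabyte units.
  Hypothetical helper, not part of Cascalog. The default library path is
  the one from the hadoop-site.xml entry in this issue."
  ([max-mb] (child-opts max-mb "/home/hadoop/native"))
  ([max-mb library-path]
   {:pre [(pos? max-mb)]}
   ;; Repeat the library path, since a job-level value replaces the site default.
   (str/join " " [(str "-Djava.library.path=" library-path)
                  (str "-Xmx" max-mb "m")])))

(child-opts 512)
;; => "-Djava.library.path=/home/hadoop/native -Xmx512m"
```

With cascalog.api loaded at the cluster REPL, the repro would then start with (with-job-conf {"mapred.child.java.opts" (child-opts 512)} ...). Whether that changes the outcome on EMR is untested here.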

Sample error messages from the logs

(JOB_SETUP) 'attempt_201207101735_0006_m_000013_8' to tip task_201207101735_0006_m_000013, for tracker 'tracker_10.96.174.59:localhost/127.0.0.1:39641'
2012-07-10 18:04:44,941 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 23 on 9001): Removing task 'attempt_201207101735_0006_m_000013_7'
2012-07-10 18:04:47,946 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 43 on 9001): Error from attempt_201207101735_0006_m_000013_8: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

(JOB_CLEANUP) 'attempt_201207101735_0003_m_000010_17' to tip task_201207101735_0003_m_000010, for tracker 'tracker_10.96.174.59:localhost/127.0.0.1:39641'
2012-07-10 17:53:00,970 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 61 on 9001): Removing task 'attempt_201207101735_0003_m_000010_16'
2012-07-10 17:53:03,973 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 42 on 9001): Error from attempt_201207101735_0003_m_000010_17: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
2012-07-10 17:53:06,977 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 2 on 9001): Adding task

eightysteele commented 12 years ago

what AMI version are you seeing this on?

robinkraft commented 12 years ago

The one we've been using since April: 2.0.5. Hadoop version 0.20.205.

sritchie commented 12 years ago

Can you file a bug on the cascalog mailing list? Really good intel here.

robinkraft commented 12 years ago

Sure, I'll throw it up now.

eightysteele commented 12 years ago

+1

robinkraft commented 12 years ago

https://groups.google.com/forum/?fromgroups#!topic/cascalog-user/INNv167aLns

robinkraft commented 12 years ago

Update: today, for a job using with-job-conf that was burned into the uberjar, the memory config got picked up without any problem. In fact, it overrode the memory config in the hadoop-site.xml file. But I remember being able to use with-job-conf at the cluster repl in the past.

More investigations to come.

sritchie commented 12 years ago

Not if you wrapped a workflow

robinkraft commented 12 years ago

I was using with-job-conf within the workflow:

https://github.com/danhammer/forma-clj/blob/feature/cmr/src/clj/forma/hadoop/jobs/scatter.clj#L320

That code worked today. In the past, I'd modify the code locally and paste it into the cluster repl, then launch the workflow using (formarunner ,,,,,)

robinkraft commented 12 years ago

I've created a sample project to illustrate the issue, and have confirmed that this problem occurs independent of the FORMA project.

https://github.com/robinkraft/job-conf-error

I have no idea why the job mentioned in my previous comment worked as expected.

robinkraft commented 12 years ago

Dev environment:

AMI 2.0.5: Amazon Elastic MapReduce 2012-04-17-20-44-39 pvm/s3 (ami-3a914a53)
Hadoop 0.20.205
Cascalog 1.9
Clojure 1.4
Leiningen 1.7.1

eightysteele commented 12 years ago

@robinkraft nice leg work on this dude. Where we at now?

robinkraft commented 12 years ago

@eightysteele no word from the Cascalog mailing list, and the issue seems to be sporadic. But the sample project still has this issue for an incredibly simple query.

eightysteele commented 12 years ago

Lesse. Ping cascalog Irc? Also maybe search hadoop user group or irc? I'll look at your project that reproduces.

robinkraft commented 12 years ago

It may be a cascading thing. @danhammer didn't you say that Oscar had seen some weird behavior with Cascading 2.0?

eightysteele commented 12 years ago

Yeah ping the Cascading list too.

ndimiduk commented 11 years ago

Hi Robin,

I forked your example project and added printing of the JVM argument list. Running this in local mode shows "mapred.child.java.opts" is not respected, which makes sense given that everything runs in a single process. I don't have time just now, but I'll try this code on a (pseudo-)distributed cluster and see what happens.

-n
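
The JVM-argument check described above can be sketched as follows, assuming it is done through the standard java.lang.management API (the fork's actual code is not shown in this thread). jvm-args is a hypothetical name. Called inside a Cascalog operation on a cluster, it would report the flags of the child task JVM rather than those of the REPL.

```clojure
(import 'java.lang.management.ManagementFactory)

(defn jvm-args
  "Returns the flags the current JVM was started with, e.g. -Xmx512m.
  Hypothetical helper for illustration."
  []
  (vec (.getInputArguments (ManagementFactory/getRuntimeMXBean))))

;; Print one flag per line so they show up in the REPL or the task logs.
(doseq [arg (jvm-args)]
  (println arg))
```

In local mode everything runs in one process, so this list reflects only how the REPL itself was launched, which matches the observation that mapred.child.java.opts has no effect there.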

ndimiduk commented 11 years ago

I don't believe hadoop-1.x supports reconfiguration of child processes at all. Have you asked on the user@ list? You can easily cheat using multiple EMR clusters, but this strikes me as a heavy-handed, violent solution. It's that or configuring your cluster to handle the bottleneck at the expense of overall concurrency.

robinkraft commented 11 years ago

Closing. We've got stable cluster configs now, and it's not the end of the world just to modify hadoop-site.xml for the one quick step that requires tons of memory (i.e. estimating beta vectors).
