yuch7 / cwlexec

A new open source tool to run CWL workflows on LSF
Other
36 stars 8 forks source link

CWLEXEC fails with Hibernate exception for workflows with more than 20 steps #53

Open mscheremetjew opened 4 years ago

mscheremetjew commented 4 years ago

We do bump into this issue whenever we try to execute workflows in CWLEXEC with more than 20 steps, so 21 steps for instance. Also sometimes CWLEXEC hangs after it has reported the error. Could it be that sessions are not closed properly after each database transaction? We created a simple test workflow so it becomes easy to reproduce for you guys.

Here is the workflow: test-workflow.zip

This is the command we are running:

cwlexec -debug -L -p -w <work-dir> -o <output-dir> test-workflow.cwl

CWLEXEC reports the following and exits or sometimes just hangs:

17:14:04.579 default [pool-3-thread-20] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_20) was submitted. Job <6076314> is submitted to default queue <research-rh74>.
17:14:04.579 default [pool-3-thread-16] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_16) was submitted. Job <6076305> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-12] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_12) was submitted. Job <6076312> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-14] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_14) was submitted. Job <6076313> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-21] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_21) was submitted. Job <6076319> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-18] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_18) was submitted. Job <6076322> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-17] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_17) was submitted. Job <6076317> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-13] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_13) was submitted. Job <6076321> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-2] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_2) was submitted. Job <6076316> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-5] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_5) was submitted. Job <6076310> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-1] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_1) was submitted. Job <6076324> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-15] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_15) was submitted. Job <6076318> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-4] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_4) was submitted. Job <6076320> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-19] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_19) was submitted. Job <6076307> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-3] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_3) was submitted. Job <6076325> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-7] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_7) was submitted. Job <6076306> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-6] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_6) was submitted. Job <6076315> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-11] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_11) was submitted. Job <6076309> is submitted to default queue <research-rh74>.
17:14:04.581 default [pool-3-thread-10] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_10) was submitted. Job <6076311> is submitted to default queue <research-rh74>.
17:14:04.581 default [pool-3-thread-9] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_9) was submitted. Job <6076323> is submitted to default queue <research-rh74>.
17:14:04.584 default [pool-3-thread-8] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_8) was submitted. Job <6076308> is submitted to default queue <research-rh74>.
17:14:04.737 default [pool-3-thread-6] ERROR c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Failed to submit the step touch_6, The internal connection pool has reached its maximum size and no connection is currently available!
17:14:04.743 default [pool-3-thread-6] ERROR c.i.s.c.e.e.lsf.LSFBsubExecutorTask - The exception stacks:
org.hibernate.HibernateException: The internal connection pool has reached its maximum size and no connection is currently available!
    at org.hibernate.engine.jdbc.connections.internal.PooledConnections.poll(PooledConnections.java:82)
    at org.hibernate.engine.jdbc.connections.internal.DriverManagerConnectionProviderImpl.getConnection(DriverManagerConnectionProviderImpl.java:186)
    at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:35)
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:106)
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:136)
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getConnectionForTransactionManagement(LogicalConnectionManagedImpl.java:254)
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.begin(LogicalConnectionManagedImpl.java:262)
    at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.begin(JdbcResourceLocalTransactionCoordinatorImpl.java:214)
    at org.hibernate.engine.transaction.internal.TransactionImpl.begin(TransactionImpl.java:56)
    at org.hibernate.internal.AbstractSharedSessionContract.beginTransaction(AbstractSharedSessionContract.java:409)
    at com.ibm.spectrumcomputing.cwl.exec.service.CWLInstanceService.updateCWLProcessInstance(CWLInstanceService.java:83)
    at com.ibm.spectrumcomputing.cwl.exec.executor.lsf.LSFBsubExecutorTask.runStep(LSFBsubExecutorTask.java:108)
    at com.ibm.spectrumcomputing.cwl.exec.executor.lsf.LSFBsubExecutorTask.run(LSFBsubExecutorTask.java:56)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17:14:04.744 default [pool-3-thread-6] DEBUG c.i.s.c.e.e.lsf.LSFWorkflowRunner - broadcast event EXIT, touch_6
17:14:04.746 default [pool-3-thread-6] ERROR c.i.s.c.e.e.lsf.LSFWorkflowRunner - The workflow (test-wf) exited with <255>.
skeeey commented 4 years ago

@mscheremetjew , I think this because the concurrent database connections exceed the Hibernate connection pool size (default 20), as a temporary way, you may increase the pool size at https://github.com/IBMSpectrumComputing/cwlexec/blob/master/src/main/java/com/ibm/spectrumcomputing/cwl/exec/util/DatabaseManager.java#L84, e.g.

properties.put("hibernate.connection.pool_size", "50");

We may need to limit (e.g. use Semaphore) the database concurrent request in future at https://github.com/IBMSpectrumComputing/cwlexec/blob/master/src/main/java/com/ibm/spectrumcomputing/cwl/exec/service/CWLInstanceService.java#L80

adamsla commented 4 years ago

Does it make sense to make this a setting in the users flow somewhere? Is that supported in the Specification?

mr-c commented 4 years ago

Does it make sense to make this a setting in the users flow somewhere? Is that supported in the Specification?

There is not a specific configuration option for this in the CWL standards. CWLEXEC could detect the number of steps in class: Workflow and configure itself accordingly. Ideally site specific configuration is not in a CWL description. However, if it is needed, then a cwlexec:hibernate.connection_pool_size CWL directive could be implemented by CWLEXEC