Running 'trepctl status' while replicator is going online or offline can generate a NullPointerException

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. Set up master/slave replication with any MySQL version. 
2. Execute a script that repeatedly turns the replicator offline/online in the 
background. 
3. Simultaneously run 'trepctl status' repeatedly. 

What is the expected output?

Trepctl status return output that correctly shows the current replicator state, 
e.g., ONLINE, OFFLINE:NORMAL, ONLINE:SYCHRONIZING, etc.

What do you see instead?

From time to time the trepctl status command fails and the following error 
appears in the log: 

INFO   | jvm 1    | 2014/10/07 10:34:12 | 2014-10-07 10:34:12,133 [ - RMI TCP 
Connection(5)-10.90.81.210] ERROR management.OpenReplicatorManager Status 
operation failed
INFO   | jvm 1    | 2014/10/07 10:34:12 | java.lang.NullPointerException
INFO   | jvm 1    | 2014/10/07 10:34:12 |   at 
com.continuent.tungsten.replicator.datasource.SqlDataSource.releaseConnection(Sq
lDataSource.java:429)
INFO   | jvm 1    | 2014/10/07 10:34:12 |   at 
com.continuent.tungsten.replicator.extractor.mysql.MySQLExtractor.cleanUpDatabas
eResources(MySQLExtractor.java:1891)
INFO   | jvm 1    | 2014/10/07 10:34:12 |   at 
com.continuent.tungsten.replicator.extractor.mysql.MySQLExtractor.getCurrentReso
urceEventId(MySQLExtractor.java:1859)
INFO   | jvm 1    | 2014/10/07 10:34:12 |   at 
com.continuent.tungsten.replicator.extractor.ExtractorWrapper.getCurrentResource
EventId(ExtractorWrapper.java:149)
INFO   | jvm 1    | 2014/10/07 10:34:12 |   at 
com.continuent.tungsten.replicator.management.tungsten.TungstenPlugin.status(Tun
gstenPlugin.java:956)
INFO   | jvm 1    | 2014/10/07 10:34:12 |   at 
com.continuent.tungsten.replicator.management.OpenReplicatorManager.status(OpenR
eplicatorManager.java:2169)

What is the possible cause?

The MySQLExtractor.getCurrentResourceEventId() call is now using the extractor 
data source to create a connection.  During online/offline transitions the data 
source is in a partially initialized state, which leads to the 
NullPointerException. 

What is the proposed solution?

1.) As an immediate fix we should try trap the NullPointerException within the 
extractor call and return a default value, while printing a warning in the log. 
 This will prevent monitoring system failures.  

2.) Longer term we need to address the question of how to handle these calls.  
Status calls execute outside the state model and must be able to do so, since 
one of the reasons you check status is to monitor transitions.  One option is 
to add a latch mechanism so that data sources only answer these calls when the 
data source is fully online. However, this needs careful thought to ensure it 
is completely reliable. 

Additional information

...

Use labels and text to provide additional information.

Original issue reported on code.google.com by robert.h...@continuent.com on 13 Oct 2014 at 4:31

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2636.

Add trap for NPEs generated by out-of-band status calls that can allocated or 
release connections when data source is not fully configured.

Original comment by robert.h...@continuent.com on 13 Oct 2014 at 6:56

GoogleCodeExporter commented 9 years ago

I have checked in a temporary trap to catch out-of-band calls which will 
prevent unexpected status check failures on MySQL.  If you get such a failure, 
the following warning will appear in the log: 

INFO   | jvm 1    | 2014/10/13 11:53:47 | 2014-10-13 11:53:47,594 [ - RMI TCP 
Connection(4)-192.168.128.7] WARN  extractor.mysql.MySQLExtractor Unable to run 
SHOW MASTER STATUS to find log position; this can occur when service is going 
on/offline
INFO   | jvm 1    | 2014/10/13 11:53:47 | java.lang.NullPointerException
INFO   | jvm 1    | 2014/10/13 11:53:47 |   at 
com.continuent.tungsten.replicator.datasource.SqlDataSource.getConnection(SqlDat
aSource.java:427)
INFO   | jvm 1    | 2014/10/13 11:53:47 |   at 
com.continuent.tungsten.replicator.extractor.mysql.MySQLExtractor.getCurrentReso
urceEventId(MySQLExtractor.java:1832)
INFO   | jvm 1    | 2014/10/13 11:53:47 |   at 
com.continuent.tungsten.replicator.extractor.ExtractorWrapper.getCurrentResource
EventId(ExtractorWrapper.java:149)
INFO   | jvm 1    | 2014/10/13 11:53:47 |   at 
com.continuent.tungsten.replicator.management.tungsten.TungstenPlugin.status(Tun
gstenPlugin.java:956)
INFO   | jvm 1    | 2014/10/13 11:53:47 |   at 
com.continuent.tungsten.replicator.management.OpenReplicatorManager.status(OpenR
eplicatorManager.java:2169)

The OracleCDCExtractor does not have this problem as it does not implement the 
getCurrentResourceEventId() method.

Original comment by robert.h...@continuent.com on 13 Oct 2014 at 7:01

GoogleCodeExporter commented 9 years ago

Hi Robert,

If implementation is finished, please move this to QA state and assign a Tester 
label (Tester-Gilles?).

Original comment by linas.vi...@continuent.com on 14 Oct 2014 at 6:43

GoogleCodeExporter commented 9 years ago

This is fixed temporarily but will require a better fix when we implement 
external data source management.  I have also checked in a test in the internal 
Continuent MATs to demonstrate that this problem does not occur.

Original comment by robert.h...@continuent.com on 15 Oct 2014 at 4:09

Changed state: QA
Added labels: Tester-Giuseppe

GoogleCodeExporter commented 9 years ago

Tested using two parallel loops that restart the replicator and ask for status. 
No NPE detected. 
The errors that we get on screen are the following:

Operation failed: Unable to get proxy connection to bean

Fatal RMI communication error: Error unmarshaling return header; nested 
exception is:
    java.io.EOFException

Cannot get an RMI proxy for class 
com.continuent.tungsten.replicator.management.ReplicationServiceManager because 
of this exception: java.io.IOException: Connection closed

Robert, can you confirm that these three errors are expected?

Original comment by g.maxia on 22 Oct 2014 at 6:59

Added labels: Tester-Robert
Removed labels: Tester-Giuseppe

GoogleCodeExporter commented 9 years ago

Just to be clear this case should bring the replication services online and 
offline, not restart the process. Restarting the process will only rarely 
trigger the problem.  

When the service is restarting it is normal to get the errors you show.  The 
JMX call from trepctl to the replicator will fail if the replicator JVM either 
is not available or goes away while an operation is in progress, resulting in 
the highly civilized messages shown above.

Original comment by robert.h...@continuent.com on 22 Oct 2014 at 2:55

GoogleCodeExporter commented 9 years ago

Using a loop of online/offline, there are no errors on screen or in the error 
log.

Original comment by g.maxia on 22 Oct 2014 at 3:46

Changed state: Fixed
Added labels: Tester-Giuseppe
Removed labels: Tester-Robert

GoogleCodeExporter commented 9 years ago

By Jove I think we've got it.  Cheers, Robert

Original comment by robert.h...@continuent.com on 22 Oct 2014 at 4:21

zxs / tungsten-replicator

Running 'trepctl status' while replicator is going online or offline can generate a NullPointerException #1033