Timestamps are replicated incorrectly in hadoop when the master database timezone is higher than the slave

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. install replication to hadoop, using TR 3.0.1-30, with the same procedure 
defined in Issue-1068
with the difference that the master should have a  time zone higher than the 
slave. In our test, we set the master at +04:00, while the slave uses CET
2. insert several timestamps in the master
+----+---------------------------------------------------+------------+---------
-+---------------------+---------------------+
| id | message                                           | d          | t       
 | dt                  | ts                  |
+----+---------------------------------------------------+------------+---------
-+---------------------+---------------------+
|  1 | Near past date and times, autogenerated timestamp | 2014-06-01 | 
10:08:25 | 2014-06-01 10:08:25 | 2014-11-19 12:50:11 |
|  2 | Near past date and times, explicit timestamp      | 2014-05-31 | 
09:16:45 | 2014-05-31 09:16:45 | 2014-06-17 17:29:30 |
|  3 | Distant past date and times                       | 1982-10-24 | 
23:01:01 | 1982-10-24 23:01:01 | 1982-10-24 23:01:01 |
|  4 | Near future date and times                        | 2015-12-01 | 
11:11:11 | 2015-12-01 11:11:11 | 2015-10-24 11:11:11 |
|  5 | Distant future date and times                     | 2037-01-28 | 
10:10:10 | 2037-01-28 10:10:10 | 2037-01-28 10:10:10 |
|  6 | Leap year - correct date                          | 2000-02-29 | 
23:01:01 | 2000-02-29 23:01:01 | 2000-02-29 23:01:01 |
|  7 | DST - Spring - CET                                | 2014-03-30 | 
02:30:35 | 2014-03-30 02:30:35 | 2014-03-30 02:30:35 |
|  8 | DST - Spring - EDT                                | 2014-03-09 | 
02:30:35 | 2014-03-09 02:30:35 | 2014-03-09 02:30:35 |
|  9 | DST - Fall   - CET                                | 2014-10-26 | 
01:30:35 | 2014-10-26 01:30:35 | 2014-10-26 01:30:35 |
| 10 | DST - Fall   - EDT                                | 2014-11-02 | 
01:30:35 | 2014-11-02 01:30:35 | 2014-11-02 01:30:35 |
| 11 | Minimum values                                    | 1000-01-01 | 
00:00:00 | 1000-01-01 00:00:00 | 1970-01-01 12:00:01 |
| 12 | Maximum values                                    | 9999-12-31 | 
23:59:59 | 9999-12-31 23:59:59 | 2038-01-19 03:14:07 |
+----+---------------------------------------------------+------------+---------
-+---------------------+---------------------+

3. check the results in the slave

What is the expected output?

The timestamps in the CSV file should be the same as the master.

What do you see instead?

Most of the timestamps are 4 hours behind.
 hadoop fs -cat staging/batch/test/datetest/datetest*.csv |perl -pe 's/\x01/ | /g'
I | 3 | 1 | 2014-11-19 08:50:11.000 | 1 | Near past date and times, 
autogenerated timestamp | 2014-06-01 | 10:08:25 | 2014-06-01 10:08:25 | 
2014-11-19 08:50:11.000
I | 3 | 2 | 2014-11-19 08:50:11.000 | 2 | Near past date and times, explicit 
timestamp | 2014-05-31 | 09:16:45 | 2014-05-31 09:16:45 | 2014-06-17 
13:29:30.000
I | 3 | 3 | 2014-11-19 08:50:11.000 | 3 | Distant past date and times | 
1982-10-24 | 23:01:01 | 1982-10-24 23:01:01 | 1982-10-24 19:01:01.000
I | 3 | 4 | 2014-11-19 08:50:11.000 | 4 | Near future date and times | 
2015-12-01 | 11:11:11 | 2015-12-01 11:11:11 | 2015-10-24 07:11:11.000
I | 3 | 5 | 2014-11-19 08:50:11.000 | 5 | Distant future date and times | 
2037-01-28 | 10:10:10 | 2037-01-28 10:10:10 | 2037-01-28 06:10:10.000
I | 3 | 6 | 2014-11-19 08:50:11.000 | 6 | Leap year - correct date | 2000-02-29 
| 23:01:01 | 2000-02-29 23:01:01 | 2000-02-29 19:01:01.000
I | 3 | 7 | 2014-11-19 08:50:11.000 | 7 | DST - Spring - CET  | 2014-03-30 | 
02:30:35 | 2014-03-30 02:30:35 | 2014-03-29 22:30:35.000
I | 3 | 8 | 2014-11-19 08:50:11.000 | 8 | DST - Spring - EDT | 2014-03-09 | 
02:30:35 | 2014-03-09 02:30:35 | 2014-03-08 22:30:35.000
I | 3 | 9 | 2014-11-19 08:50:11.000 | 9 | DST - Fall   - CET  | 2014-10-26 | 
01:30:35 | 2014-10-26 01:30:35 | 2014-10-25 21:30:35.000
I | 3 | 10 | 2014-11-19 08:50:11.000 | 10 | DST - Fall   - EDT  | 2014-11-02 | 
01:30:35 | 2014-11-02 01:30:35 | 2014-11-01 21:30:35.000
I | 3 | 11 | 2014-11-19 08:50:11.000 | 11 | Minimum values | 1000-01-01 | 
00:00:00 | 1000-01-01 00:00:00 | 1970-01-01 08:00:01.000
I | 3 | 12 | 2014-11-19 08:50:11.000 | 12 | Maximum values | 9999-12-31 | 
23:59:59 | 9999-12-31 23:59:59 | 2038-01-18 23:14:07.000

What is the possible cause?

N/A

What is the proposed solution?

N/A

Additional information

Original issue reported on code.google.com by g.maxia on 6 Dec 2014 at 4:01

GoogleCodeExporter commented 9 years ago

Note: you can reproduce the same result using tungsten-sandbox with the 
fileapplier data source.

Original comment by g.maxia on 6 Dec 2014 at 4:02

GoogleCodeExporter commented 9 years ago

It looks to me as if timestamps are being emitted as GMT, which is correct 
behavior if there is no other indication provided.  Tungsten does not use the 
platform time zone any longer as that leads to inconsistent result when 
outputting data. 

To alter the CSV time zone try one of the following: 

1.) Set a different timezone in the services.properties.  This alters the 
replicator default time zone and changes all timestamps. 

2.) Set the replicator.applier.dbms.timezone to another time zone.  This will 
change the time zone used for timestamp formatting.  This is the recommended 
approach, since it does not alter the default time zone of the replicator. 

We will likely encounter anomalies even with #2 so the results will be very 
interesting. :)

Original comment by robert.h...@continuent.com on 6 Dec 2014 at 4:26

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I have tested more, and seen what the problem is.
First of all, there was a configuration issue. I was setting the time zone in 
the MySQL slaves, but that does not affect Hadoop.
Anyway, the discrepancy is that in the CSV file there is no notion of time 
zones, and thus we get the timezone from MySQL as GMT, without the conversion 
that MySQL protocol does automatically when reading values.

I tried modifying the replicator time_zone, but that affects all fields, not 
only timestamps.

Original comment by g.maxia on 7 Dec 2014 at 3:24

GoogleCodeExporter commented 9 years ago

The same behaviour is seen when replicating into Vertica and RedShift; what 
we're seeing here is the CSV is generated using GMT/UTC, even though we know it 
wasn't extracted or written into the master database in that timezone. 

We have two choices:

1) Document that CSV appliers all assume that we are writing UTC timestamps, 
even if they may be extracted in another timezone. 

2) Since this is common to all CSV appliers, I suggest we change the CSV 
applier so that it writes the CSV timestamp as the modified timezone value, not 
UTC. This way, the CSV will contain the correct date. 

(2) Would be my preferred solution, but I'm willing to be convinced otherwise.

Original comment by mc.br...@continuent.com on 9 Dec 2014 at 3:44

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 11 Dec 2014 at 2:43

Added labels: FixedIn-3.1.0
Removed labels: FixedIn-3.0.1

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 19 Jan 2015 at 2:18

Added labels: FixedIn-4.0.1
Removed labels: FixedIn-3.1.0

zxs / tungsten-replicator

Timestamps are replicated incorrectly in hadoop when the master database timezone is higher than the slave #1069