wgzhao / Addax

Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.
https://wgzhao.github.io/Addax/
Apache License 2.0

[Bug]: Exception with ORC-format data written via hdfswriter #1059

Closed: chen-rj88 closed this issue 3 months ago

chen-rj88 commented 4 months ago

What happened?

Writing ORC-format data via hdfswriter succeeds, but querying the table through Hive fails with "ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException". The same job works correctly in DataX. (See the diagnostic note after this report.)

Version

4.1.3 (Default)

OS Type

Linux (Default)

Java JDK Version

Oracle JDK 1.8.0

Relevant log output

No response
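(For reference, one quick check in a case like this is to dump the metadata of the ORC file Addax produced with Hive's orcfiledump utility; a malformed footer or schema usually shows up in the dump before Hive's split generation fails. The path below is the file the job log later in this thread shows being written:)

# Inspect the ORC file that Addax wrote; a broken footer or schema
# typically surfaces in this dump before Hive's split generation fails.
hive --orcfiledump /data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc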

wgzhao commented 4 months ago

Thanks for the report. Please provide the complete run log so the problem can be reproduced and located.

chen-rj88 commented 4 months ago

Hi, please see the attached file for the detailed execution log. The screenshot shows the result of querying via SQL after the job completed successfully. Looking forward to your reply. addax_oracle-hive-orc_20240711_150451_186301.log

[screenshot: err_msg]

wgzhao commented 4 months ago

Please provide the complete Addax execution log, and avoid screenshots where possible.

chen-rj88 commented 4 months ago

Here is the log output:

2024-07-11 15:04:51.673 [        main] INFO  VMInfo               - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2024-07-11 15:04:51.689 [        main] INFO  Engine               - 
{
    "setting":{
        "speed":{
            "channel":1
        },
        "errorLimit":{
            "record":0,
            "percentage":0.02
        }
    },
    "content":{
        "reader":{
            "name":"sqlserverreader",
            "parameter":{
                "username":"YjsySjdj",
                "password":"*****",
                "column":[
                    "[termcode]",
                    "[termname]",
                    "[year]",
                    "[term]",
                    "[weeks]",
                    "[begindate]",
                    "[enddate]",
                    "[currentxq]"
                ],
                "where":"termcode != '3'",
                "connection":[
                    {
                        "table":[
                            "dbo.py_tabterm"
                        ],
                        "jdbcUrl":[
                            "jdbc:sqlserver://10.100.98.16:1433;DatabaseName=gmis5;CurrentSchema=dbo;encrypt=false"
                        ]
                    }
                ]
            }
        },
        "writer":{
            "name":"hdfswriter",
            "parameter":{
                "defaultFS":"hdfs://10.100.98.94:8020",
                "fileType":"orc",
                "fieldDelimiter":"\u0001",
                "path":"/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709",
                "fileName":"ods_py_tabterm2",
                "writeMode":"truncate",
                "column":[
                    {
                        "name":"termcode",
                        "type":"string"
                    },
                    {
                        "name":"termname",
                        "type":"string"
                    },
                    {
                        "name":"year",
                        "type":"int"
                    },
                    {
                        "name":"term",
                        "type":"string"
                    },
                    {
                        "name":"weeks",
                        "type":"int"
                    },
                    {
                        "name":"begindate",
                        "type":"string"
                    },
                    {
                        "name":"enddate",
                        "type":"string"
                    },
                    {
                        "name":"currentxq",
                        "type":"string"
                    }
                ]
            }
        }
    }
}

2024-07-11 15:04:51.717 [        main] INFO  JobContainer         - The jobContainer begins to process the job.
2024-07-11 15:04:52.148 [       job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl [jdbc:sqlserver://10.100.98.16:1433;DatabaseName=gmis5;CurrentSchema=dbo;encrypt=false].
2024-07-11 15:04:52.188 [       job-0] INFO  OriginalConfPretreatmentUtil - The table [dbo.py_tabterm] has columns [termcode,termname,year,term,weeks,begindate,enddate,currentxq].
2024-07-11 15:04:52.427 [       job-0] WARN  NativeCodeLoader     - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-07-11 15:04:52.905 [       job-0] INFO  JobContainer         - The Reader.Job [sqlserverreader] perform prepare work .
2024-07-11 15:04:52.905 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] perform prepare work .
2024-07-11 15:04:53.053 [       job-0] INFO  JobContainer         - Job set Channel-Number to 1 channel(s).
2024-07-11 15:04:53.055 [       job-0] INFO  JobContainer         - The Reader.Job [sqlserverreader] is divided into [1] task(s).
2024-07-11 15:04:53.056 [       job-0] INFO  HdfsWriter$Job       - Begin splitting ...
2024-07-11 15:04:53.066 [       job-0] INFO  HdfsWriter$Job       - The split wrote files :[/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc]
2024-07-11 15:04:53.067 [       job-0] INFO  HdfsWriter$Job       - Finish splitting.
2024-07-11 15:04:53.067 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] is divided into [1] task(s).
2024-07-11 15:04:53.088 [       job-0] INFO  JobContainer         - The Scheduler launches [1] taskGroup(s).
2024-07-11 15:04:53.095 [ taskGroup-0] INFO  TaskGroupContainer   - The taskGroupId=[0] started [1] channels for [1] tasks.
2024-07-11 15:04:53.097 [ taskGroup-0] INFO  Channel              - The Channel set byte_speed_limit to -1, No bps activated.
2024-07-11 15:04:53.097 [ taskGroup-0] INFO  Channel              - The Channel set record_speed_limit to -1, No tps activated.
2024-07-11 15:04:53.106 [  reader-0-0] INFO  CommonRdbmsReader$Task - Begin reading records by executing SQL query: [select [termcode],[termname],[year],[term],[weeks],[begindate],[enddate],[currentxq] from dbo.py_tabterm where (termcode != '3')].
2024-07-11 15:04:53.112 [  writer-0-0] INFO  HdfsWriter$Task      - Begin to write file : [/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc]
2024-07-11 15:04:53.141 [  writer-0-0] INFO  HadoopShimsCurrent   - Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
2024-07-11 15:04:53.151 [  reader-0-0] INFO  CommonRdbmsReader$Task - Finished reading records by executing SQL query: [select [termcode],[termname],[year],[term],[weeks],[begindate],[enddate],[currentxq] from dbo.py_tabterm where (termcode != '3')].
2024-07-11 15:04:53.189 [  writer-0-0] INFO  PhysicalFsWriter     - ORC writer created for path: /data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: none buffer: 262144
2024-07-11 15:04:53.218 [  writer-0-0] INFO  WriterImpl           - ORC writer created for path: /data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc with stripeSize: 67108864 options: Compress: none buffer: 262144
2024-07-11 15:04:53.405 [  writer-0-0] INFO  HdfsWriter$Task      - Finish write
2024-07-11 15:04:56.104 [       job-0] INFO  AbstractScheduler    - The scheduler has completed all tasks.
2024-07-11 15:04:56.104 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] perform post work.
2024-07-11 15:04:56.106 [       job-0] INFO  HdfsHelper           - Begin to move file from [hdfs://10.100.98.94:8020/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa/ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc] to [ds=20240709].
2024-07-11 15:04:56.114 [       job-0] INFO  HdfsHelper           - Finish move file(s).
2024-07-11 15:04:56.114 [       job-0] INFO  HdfsHelper           - Begin to delete temporary dir [/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa] .
2024-07-11 15:04:56.119 [       job-0] INFO  HdfsHelper           - Finish deleting temporary dir [/data/hive/warehouse/gmc_ods.db/ods_py_tabterm2/ds=20240709/.94a5ee51_7494_4de4_9093_5ceca6589faa] .
2024-07-11 15:04:56.119 [       job-0] INFO  JobContainer         - The Reader.Job [sqlserverreader] perform post work.
2024-07-11 15:04:56.121 [       job-0] INFO  StandAloneJobContainerCommunicator - Total 43 records, 1979 bytes | Speed 659B/s, 14 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-07-11 15:04:56.122 [       job-0] INFO  JobContainer         - 
Job start  at             : 2024-07-11 15:04:51
Job end    at             : 2024-07-11 15:04:56
Job took secs             :                  4s
Average   bps             :              659B/s
Average   rps             :             14rec/s
Number of rec             :                  43
Failed record             :                   0
wgzhao commented 4 months ago

Judging from the log, there doesn't seem to be a problem. Could you please also provide the following information:

chen-rj88 commented 4 months ago

Got it. The details are as follows:

wgzhao commented 4 months ago

Using the data you provided, reading works without problems in my Hive environment. The steps were as follows:

Create an external table and add a partition:

CREATE EXTERNAL TABLE ods_py_tabterm2 (
    termcode  STRING,
    termname  STRING,
    year      INT,
    term      STRING,
    weeks     INT,
    begindate STRING,
    enddate   STRING,
    currentxq STRING
)
PARTITIONED BY (ds STRING)
STORED AS ORC
LOCATION '/tmp/ods_py_tabterm2';

ALTER TABLE ods_py_tabterm2 ADD IF NOT EXISTS PARTITION (ds='20240711');

Upload the ORC file to the corresponding partition directory.
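(A minimal sketch of this step with the standard HDFS CLI, using the table location and partition from the statements above and the file name from the job log:)

# Create the partition directory and copy in the ORC file Addax produced.
hdfs dfs -mkdir -p /tmp/ods_py_tabterm2/ds=20240711
hdfs dfs -put ods_py_tabterm2_20240711_150453_065_8tw5v1gu.orc /tmp/ods_py_tabterm2/ds=20240711/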

Run the query:

select * from ods_py_tabterm2 where ds = '20240711' limit 10;
INFO  : Compiling command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52): select * from ods_py_tabterm2 where ds = '20240711' limit 10
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:ods_py_tabterm2.termcode, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.termname, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.year, type:int, comment:null), FieldSchema(name:ods_py_tabterm2.term, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.weeks, type:int, comment:null), FieldSchema(name:ods_py_tabterm2.begindate, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.enddate, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.currentxq, type:string, comment:null), FieldSchema(name:ods_py_tabterm2.ds, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52); Time taken: 0.285 seconds
INFO  : Executing command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52): select * from ods_py_tabterm2 where ds = '20240711' limit 10
INFO  : Completed executing command(queryId=hive_20240711203322_704e35c0-ab5c-440b-b18d-8946b167ba52); Time taken: 0.015 seconds
INFO  : OK
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+
| ods_py_tabterm2.termcode  | ods_py_tabterm2.termname  | ods_py_tabterm2.year  | ods_py_tabterm2.term  | ods_py_tabterm2.weeks  | ods_py_tabterm2.begindate  | ods_py_tabterm2.enddate  | ods_py_tabterm2.currentxq  | ods_py_tabterm2.ds  |
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+
| 10                        | 2006-2007春学期              | 2007                  | 2                     | 20                     | 2007-03-05 00:00:00.0      | 2007-07-14 00:00:00.0    | 0                          | 20240711            |
| 11                        | 2007-2008秋学期              | 2007                  | 1                     | 20                     | 2007-08-26 00:00:00.0      | 2008-01-20 00:00:00.0    | 0                          | 20240711            |
| 12                        | 2007-2008春学期              | 2008                  | 2                     | 20                     | 2008-02-25 00:00:00.0      | 2008-07-12 00:00:00.0    | 0                          | 20240711            |
| 13                        | 2008-2009秋学期              | 2008                  | 1                     | 20                     | 2008-08-25 00:00:00.0      | 2009-01-09 00:00:00.0    | 0                          | 20240711            |
| 14                        | 2008-2009春学期              | 2009                  | 2                     | 20                     | 2009-02-16 00:00:00.0      | 2009-07-03 00:00:00.0    | 0                          | 20240711            |
| 15                        | 2009-2010秋学期              | 2009                  | 1                     | 21                     | 2009-08-23 00:00:00.0      | 2010-01-16 00:00:00.0    | 0                          | 20240711            |
| 16                        | 2009-2010春学期              | 2010                  | 2                     | 20                     | 2010-02-27 00:00:00.0      | 2010-07-11 00:00:00.0    | 0                          | 20240711            |
| 17                        | 2010-2011秋学期              | 2010                  | 1                     | 20                     | 2010-08-28 00:00:00.0      | 2011-01-16 00:00:00.0    | 0                          | 20240711            |
| 18                        | 2010-2011春学期              | 2011                  | 2                     | 20                     | 2011-02-19 00:00:00.0      | 2011-07-08 00:00:00.0    | 0                          | 20240711            |
| 19                        | 2011-2012秋学期              | 2011                  | 1                     | 20                     | 2011-08-29 00:00:00.0      | 2012-01-15 00:00:00.0    | 0                          | 20240711            |
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+
10 rows selected (0.357 seconds)

My Hive version is 3.1.0, while hdfswriter currently depends on Hadoop 3.0, so there may be a version-compatibility issue.

I'll see whether I can find a 2.1.1 environment and test there.

chen-rj88 commented 4 months ago

Hi, I just changed the Hadoop version that hdfswriter depends on to 2.7.1 and tried again, but the data still cannot be read correctly. Also, I'd like to ask whether this project has a discussion group; if so, I'd like to join. Thanks!
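(For anyone reproducing this step, rebuilding the plugin against a different Hadoop line would look roughly like the following. This assumes Addax's Maven build exposes the Hadoop version as a hadoop.version property and that the module lives under plugin/writer/hdfswriter; both names are assumptions, so check the project's pom.xml first:)

# Hypothetical rebuild of hdfswriter against Hadoop 2.7.1; the property
# name and module path are assumptions about the Addax build layout.
mvn clean package -pl plugin/writer/hdfswriter -am -Dhadoop.version=2.7.1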

wgzhao commented 3 months ago

The ORC file you provided passes the test under Hive 2.3.2.

wgzhao commented 3 months ago

The ORC file you provided also passes the test under Hive 2.1.0:

0: jdbc:hive2://localhost:10000> select version();
+--------------------------------------------------+--+
|                       _c0                        |
+--------------------------------------------------+--+
| 2.1.0 r9265bc24d75ac945bde9ce1a0999fddd8f2aae29  |
+--------------------------------------------------+--+
1 row selected (0.099 seconds)
0: jdbc:hive2://localhost:10000> select * from ods_py_tabterm2 limit 3;
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+--+
| ods_py_tabterm2.termcode  | ods_py_tabterm2.termname  | ods_py_tabterm2.year  | ods_py_tabterm2.term  | ods_py_tabterm2.weeks  | ods_py_tabterm2.begindate  | ods_py_tabterm2.enddate  | ods_py_tabterm2.currentxq  | ods_py_tabterm2.ds  |
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+--+
| 10                        | 2006-2007???              | 2007                  | 2                     | 20                     | 2007-03-05 00:00:00.0      | 2007-07-14 00:00:00.0    | 0                          | 20240711            |
| 11                        | 2007-2008???              | 2007                  | 1                     | 20                     | 2007-08-26 00:00:00.0      | 2008-01-20 00:00:00.0    | 0                          | 20240711            |
| 12                        | 2007-2008???              | 2008                  | 2                     | 20                     | 2008-02-25 00:00:00.0      | 2008-07-12 00:00:00.0    | 0                          | 20240711            |
+---------------------------+---------------------------+-----------------------+-----------------------+------------------------+----------------------------+--------------------------+----------------------------+---------------------+--+
3 rows selected (0.098 seconds)

The Chinese characters not displaying can be ignored; I don't have Chinese fonts installed.

My test environment is deployed with Docker; see https://github.com/big-data-europe/docker-hive/tree/2.1.0-postgresql-metastore for the deployment details.
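(A sketch of bringing that environment up, following the linked repository's README; the hive-server service name and beeline path are taken from that README and may differ between tags:)

# Clone the tagged branch and start the Hive stack.
git clone -b 2.1.0-postgresql-metastore https://github.com/big-data-europe/docker-hive.git
cd docker-hive
docker-compose up -d
# Open a beeline session inside the hive-server container.
docker-compose exec hive-server /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000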