wgzhao / Addax

Addax is a versatile open-source ETL tool that can seamlessly transfer data between various RDBMS and NoSQL databases, making it an ideal solution for data migration.
https://wgzhao.github.io/Addax/
Apache License 2.0

[Bug]: Excel read fails #1156

Closed: svea-vip closed this issue 1 month ago

svea-vip commented 1 month ago

What happened?

An Excel to HDFS Parquet job fails with a NoSuchMethodError. Is a package missing?

Version

4.1.7 (Default)

OS Type

Linux (Default)

Java JDK Version

Oracle JDK 1.8.0

Relevant log output

[INFO] 2024-10-08 15:35:42.236 +0800 -  -> 
      ___      _     _            
     / _ \    | |   | |           
    / /_\ \ __| | __| | __ ___  __
    |  _  |/ _` |/ _` |/ _` \ \/ /
    | | | | (_| | (_| | (_| |>  < 
    \_| |_/\__,_|\__,_|\__,_/_/\_\

    :: Addax version ::    (v4.1.7)

    2024-10-08 15:35:41.845 [        main] INFO  VMInfo               - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
    2024-10-08 15:35:41.881 [        main] INFO  Engine               - 
    {
        "setting":{
            "speed":{
                "channel":2
            }
        },
        "content":{
            "reader":{
                "name":"excelreader",
                "parameter":{
                    "column":[
                        {
                            "name":"UUID",
                            "type":"string"
                        },
                        {
                            "name":"time",
                            "type":"long"
                        },
                        {
                            "name":"id",
                            "type":"string"
                        }
                    ],
                    "encoding":"UTF-8",
                    "fieldDelimiter":",",
                    "header":true,
                    "path":[
                        "/tmp/in"
                    ],
                    "skipHeader":true
                }
            },
            "writer":{
                "name":"hdfswriter",
                "parameter":{
                    "column":[
                        {
                            "name":"UUID",
                            "type":"string"
                        },
                        {
                            "name":"time",
                            "type":"long"
                        },
                        {
                            "name":"id",
                            "type":"string"
                        }
                    ],
                    "compress":"SNAPPY",
                    "defaultFS":"hdfs://10.254.21.21:8020",
                    "fileName":"data",
                    "fileType":"parquet",
                    "hadoopConfig":{
                        "dfs.client.failover.proxy.provider.nameservice1":"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
                        "dfs.ha.namenodes.nameservice1":"namenode1,namenode2",
                        "dfs.namenode.rpc-address.nameservice1.namenode1":"namenode1:8020",
                        "dfs.namenode.rpc-address.nameservice1.namenode2":"namenode2:8020",
                        "dfs.nameservices":"nameservice1"
                    },
                    "path":"/user/hive/warehouse/external_path_ods.db/test-excelreader/dt=2024-10-07",
                    "writeMode":"overwrite"
                }
            }
        }
    }

    2024-10-08 15:35:41.925 [        main] INFO  JobContainer         - The jobContainer begins to process the job.
    2024-10-08 15:35:41.965 [       job-0] INFO  FileHelper           - Adding the file [/tmp/in/out.xlsx] as a candidate to be read.
    2024-10-08 15:35:41.966 [       job-0] INFO  ExcelReader$Job      - The number of files to read is: [1]
[INFO] 2024-10-08 15:35:43.237 +0800 -  -> 2024-10-08 15:35:42.594 [       job-0] WARN  NativeCodeLoader     - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] 2024-10-08 15:35:44.238 +0800 -  -> 2024-10-08 15:35:43.612 [       job-0] INFO  JobContainer         - The Reader.Job [excelreader] perform prepare work .
    2024-10-08 15:35:43.613 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] perform prepare work .
    2024-10-08 15:35:43.800 [       job-0] INFO  JobContainer         - Job set Channel-Number to 2 channel(s).
    2024-10-08 15:35:43.803 [       job-0] INFO  JobContainer         - The Reader.Job [excelreader] is divided into [1] task(s).
    2024-10-08 15:35:43.804 [       job-0] INFO  HdfsWriter$Job       - Begin splitting ...
    2024-10-08 15:35:43.819 [       job-0] INFO  HdfsWriter$Job       - The split wrote files :[/user/hive/warehouse/external_path_ods.db/test-excelreader/dt=2024-10-07/.81d7f9f2_de3d_4cb3_9ea1_c23494366ddf/data_20241008_153543_815_b0x9fr1q.parquet]
    2024-10-08 15:35:43.820 [       job-0] INFO  HdfsWriter$Job       - Finish splitting.
    2024-10-08 15:35:43.820 [       job-0] INFO  JobContainer         - The Writer.Job [hdfswriter] is divided into [1] task(s).
    2024-10-08 15:35:43.865 [       job-0] INFO  JobContainer         - The Scheduler launches [1] taskGroup(s).
    2024-10-08 15:35:43.879 [ taskGroup-0] INFO  TaskGroupContainer   - The taskGroupId=[0] started [1] channels for [1] tasks.
    2024-10-08 15:35:43.886 [ taskGroup-0] INFO  Channel              - The Channel set byte_speed_limit to -1, No bps activated.
    2024-10-08 15:35:43.886 [ taskGroup-0] INFO  Channel              - The Channel set record_speed_limit to -1, No tps activated.
    2024-10-08 15:35:43.903 [  reader-0-0] INFO  ExcelReader$Task     - The first row is skipped as a table header
    2024-10-08 15:35:43.904 [  reader-0-0] INFO  ExcelReader$Task     - begin read file /tmp/in/out.xlsx
    2024-10-08 15:35:43.940 [  writer-0-0] INFO  HdfsWriter$Task      - Begin to write file : [/user/hive/warehouse/external_path_ods.db/test-excelreader/dt=2024-10-07/.81d7f9f2_de3d_4cb3_9ea1_c23494366ddf/data_20241008_153543_815_b0x9fr1q.parquet]
    2024-10-08 15:35:43.983 [  writer-0-0] INFO  ParquetWriter        - Begin to write parquet file [/user/hive/warehouse/external_path_ods.db/test-excelreader/dt=2024-10-07/.81d7f9f2_de3d_4cb3_9ea1_c23494366ddf/data_20241008_153543_815_b0x9fr1q.parquet]
    2024-10-08T07:35:44.100Z reader-0-0 ERROR Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
    2024-10-08 15:35:44.177 [  reader-0-0] ERROR ReaderRunner         - Reader runner Received Exceptions:
    java.lang.NoSuchMethodError: org.apache.commons.io.input.BoundedInputStream.builder()Lorg/apache/commons/io/input/BoundedInputStream$Builder;
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:145)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:145)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:189)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:156)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:351)
        at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:64)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:315)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:289)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelHelper.open(ExcelHelper.java:63)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelReader$Task.startRead(ExcelReader.java:144)
        at com.wgzhao.addax.core.taskgroup.runner.ReaderRunner.run(ReaderRunner.java:82)
        at java.lang.Thread.run(Thread.java:745)
    Exception in thread "taskGroup-0" com.wgzhao.addax.common.exception.AddaxException: java.lang.NoSuchMethodError: org.apache.commons.io.input.BoundedInputStream.builder()Lorg/apache/commons/io/input/BoundedInputStream$Builder;
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:145)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:145)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:189)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:156)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:351)
        at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:64)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:315)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:289)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelHelper.open(ExcelHelper.java:63)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelReader$Task.startRead(ExcelReader.java:144)
        at com.wgzhao.addax.core.taskgroup.runner.ReaderRunner.run(ReaderRunner.java:82)
        at java.lang.Thread.run(Thread.java:745)

        at com.wgzhao.addax.common.exception.AddaxException.asAddaxException(AddaxException.java:66)
        at com.wgzhao.addax.core.taskgroup.TaskGroupContainer.start(TaskGroupContainer.java:188)
        at com.wgzhao.addax.core.taskgroup.runner.TaskGroupContainerRunner.run(TaskGroupContainerRunner.java:44)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.NoSuchMethodError: org.apache.commons.io.input.BoundedInputStream.builder()Lorg/apache/commons/io/input/BoundedInputStream$Builder;
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:145)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:145)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:189)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:156)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:351)
        at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:64)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:315)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:289)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelHelper.open(ExcelHelper.java:63)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelReader$Task.startRead(ExcelReader.java:144)
        at com.wgzhao.addax.core.taskgroup.runner.ReaderRunner.run(ReaderRunner.java:82)
        ... 1 more
[INFO] 2024-10-08 15:35:45.239 +0800 -  -> 2024-10-08 15:35:44.743 [  writer-0-0] INFO  CodecPool            - Got brand-new compressor [.snappy]
[INFO] 2024-10-08 15:35:47.160 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/lbx_source/9362483767904/15136168146176_1/3511/10756, processId:3805096 ,exitStatusCode:2 ,processWaitForStatus:true ,processExitValue:2
[INFO] 2024-10-08 15:35:47.162 +0800 - Send task execute result to master, the current task status: TaskExecutionStatus{code=6, desc='failure'}
[INFO] 2024-10-08 15:35:47.162 +0800 - Remove the current task execute context from worker cache
[INFO] 2024-10-08 15:35:47.162 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /tmp/dolphinscheduler/exec/process/lbx_source/9362483767904/15136168146176_1/3511/10756
[INFO] 2024-10-08 15:35:47.163 +0800 - Success clear the task execute file: /tmp/dolphinscheduler/exec/process/lbx_source/9362483767904/15136168146176_1/3511/10756
[INFO] 2024-10-08 15:35:47.240 +0800 -  -> 2024-10-08 15:35:46.896 [       job-0] ERROR JobContainer         - The scheduler failed to run.
    2024-10-08 15:35:46.899 [       job-0] INFO  StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 0.00%
    2024-10-08 15:35:47.080 [       job-0] ERROR Engine               - com.wgzhao.addax.common.exception.AddaxException: java.lang.NoSuchMethodError: org.apache.commons.io.input.BoundedInputStream.builder()Lorg/apache/commons/io/input/BoundedInputStream$Builder;
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:145)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:145)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:189)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:156)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:351)
        at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:64)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:315)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:289)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelHelper.open(ExcelHelper.java:63)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelReader$Task.startRead(ExcelReader.java:144)
        at com.wgzhao.addax.core.taskgroup.runner.ReaderRunner.run(ReaderRunner.java:82)
        at java.lang.Thread.run(Thread.java:745)

        at com.wgzhao.addax.common.exception.AddaxException.asAddaxException(AddaxException.java:66)
        at com.wgzhao.addax.core.job.scheduler.processinner.ProcessInnerScheduler.dealFailedStat(ProcessInnerScheduler.java:63)
        at com.wgzhao.addax.core.job.scheduler.AbstractScheduler.schedule(AbstractScheduler.java:107)
        at com.wgzhao.addax.core.job.JobContainer.schedule(JobContainer.java:440)
        at com.wgzhao.addax.core.job.JobContainer.start(JobContainer.java:128)
        at com.wgzhao.addax.core.Engine.start(Engine.java:62)
        at com.wgzhao.addax.core.Engine.entry(Engine.java:113)
        at com.wgzhao.addax.core.Engine.main(Engine.java:139)
    Caused by: java.lang.NoSuchMethodError: org.apache.commons.io.input.BoundedInputStream.builder()Lorg/apache/commons/io/input/BoundedInputStream$Builder;
        at org.apache.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:145)
        at org.apache.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.verifyZipHeader(ZipHelper.java:145)
        at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:189)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:156)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:351)
        at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:64)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:315)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:289)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelHelper.open(ExcelHelper.java:63)
        at com.wgzhao.addax.plugin.reader.excelreader.ExcelReader$Task.startRead(ExcelReader.java:144)
        at com.wgzhao.addax.core.taskgroup.runner.ReaderRunner.run(ReaderRunner.java:82)
        at java.lang.Thread.run(Thread.java:745)

    err:     exit status 2
    exit status 2
[INFO] 2024-10-08 15:35:47.241 +0800 - FINALIZE_SESSION
wgzhao commented 1 month ago

The issue is caused by a version conflict with commons-io. A temporary workaround is:

  1. Delete the commons-io-<version>.jar file in the $ADDAX_HOME/lib directory.
  2. Copy the commons-io-<version>.jar file from the $ADDAX_HOME/plugin/reader/excelreader/libs/ directory to the $ADDAX_HOME/lib directory.
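
Before swapping jars, it may help to see which commons-io versions are actually present in each directory. The following is a minimal Python sketch, not part of Addax: the function name `find_commons_io` is hypothetical, and it assumes the jars follow the usual `commons-io-<version>.jar` naming with a purely numeric version.

```python
import re
from pathlib import Path

# Matches file names such as commons-io-2.15.1.jar and captures the version.
# Assumption: the version consists of dot-separated numbers only.
JAR_PATTERN = re.compile(r"^commons-io-(\d+(?:\.\d+)*)\.jar$")


def find_commons_io(addax_home):
    """Map each commons-io jar under addax_home to its version tuple.

    Keys are paths relative to addax_home, written with forward slashes.
    """
    home = Path(addax_home)
    found = {}
    for jar in sorted(home.rglob("commons-io-*.jar")):
        match = JAR_PATTERN.match(jar.name)
        if match:
            version = tuple(int(part) for part in match.group(1).split("."))
            found[jar.relative_to(home).as_posix()] = version
    return found
```

Calling `find_commons_io(os.environ["ADDAX_HOME"])` would list every copy, so the one under `lib` can be compared with the one under `plugin/reader/excelreader/libs` before anything is deleted.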
svea-vip commented 1 month ago

I tested it: the jar under $ADDAX_HOME/plugin/reader/excelreader/libs/ is also version 2.15.1. It needs to be replaced with version 2.16.1.
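
The stack trace shows POI calling `BoundedInputStream.builder()`, which returns the nested class `BoundedInputStream$Builder`. One rough way to check whether a given commons-io jar can satisfy that call is to look for that class file inside the jar. The sketch below is a Python illustration under that assumption. The helper name `jar_has_builder` is hypothetical, and the presence of the class file is only a proxy for the method existing.

```python
import zipfile

# Class file for the return type named in the NoSuchMethodError.
BUILDER_ENTRY = "org/apache/commons/io/input/BoundedInputStream$Builder.class"


def jar_has_builder(jar_path):
    """Return True if the jar contains the BoundedInputStream$Builder class file."""
    with zipfile.ZipFile(jar_path) as jar:
        return BUILDER_ENTRY in jar.namelist()
```

A jar for which this returns False cannot provide `builder()`, so it would be expected to fail in the same way as the log above.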

wgzhao commented 1 month ago

The latest release, version 4.2.0, has resolved the issue.