wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License
48 stars 10 forks source link

Characterize and extract metadata step is failing #3394

Closed jtweed closed 5 years ago

jtweed commented 5 years ago

The characterize and extract metadata step in Archivematica is failing every time it is run.

I think it might be a service discovery issue, the logs say:

connect: Connection refused
Command FITS failed with exit status 230; stderr:
alexwlchan commented 5 years ago

This is appearing in the MCP client logs

alexwlchan commented 5 years ago

Found the issue.

This is from the docker-compose file I based our ECS task definition on:

NAILGUN_SERVER: "fits"
NAILGUN_PORT: "2113"

https://github.com/artefactual-labs/am/blob/9567e9578a85fd10657cb815fb2714dbb5caa333/compose/docker-compose.yml#L129-L130

And this is our ECS task definition:

MAILGUN_SERVER = "${local.fits_service_hostname}"
MAILGUN_PORT   = "2113"

https://github.com/wellcometrust/archivematica-infra/blob/d730ad7fb84569de8ee8ff25c980d01cd145998b/terraform/services.tf#L107-L108

🤦‍♀️

jtweed commented 5 years ago

This is still failing, but it is getting further than before:

recv: Connection reset by peer
Command FITS failed with exit status 227; stderr:
alexwlchan commented 5 years ago

Just before that is this line:

edu.harvard.hul.ois.fits.exceptions.FitsConfigurationException: /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/OfficeDocs-48d49662-7f44-4393-b271-24e7cf0f848e/objects/_datavibe-l__FW__job_vacancy.rtf does not exist or is not readable

so I wonder if I’m not mounting the shared volume in the fits container?

alexwlchan commented 5 years ago

Hmm, no, that is loaded in the ECS task definition. I’ll dig around some more.

helenst commented 5 years ago

It looks as if the environment variables are set up correctly in the MCP client container.

NAILGUN_SERVER=am-fits.archivematica NAILGUN_PORT=2113

Inside the fits docker container:

helenst commented 5 years ago

Running the ng-nailgun command manually on the mcp-client container and monitoring the connection attempt with netstat on the other end, it looks like it's able to locate and connect to fits container but the connection gets dropped almost straight away. If I simulate the connection with nc -v it keeps the connection open. So quite possibly something related to the higher level protocol...? I'm going to briefly step away from this one and see if any bright ideas come up :smile:

alexwlchan commented 5 years ago

@helenst Running your command without the output is possibly a little more informative:

$ ng-nailgun edu.harvard.hul.ois.fits.Fits -i /var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx

A fair chunk of XML pops out before that error occurs:

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.8.4" timestamp="3/7/19 5:51 PM">
  <identification status="CONFLICT">
    <identity format="Microsoft Word 2007+" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" toolname="FITS" toolversion="0.8.4">
      <tool toolname="file utility" toolversion="5.14" />
    </identity>
    <identity format="OpenDocument Text" mimetype="application/vnd.oasis.opendocument.text" toolname="FITS" toolversion="0.8.4">
      <tool toolname="Exiftool" toolversion="9.13" />
      <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
    </identity>
    <identity format="Office Open XML Document" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" toolname="FITS" toolversion="0.8.4">
      <tool toolname="Tika" toolversion="1.3" />
    </identity>
  </identification>
  <fileinfo>
    <lastmodified toolname="Exiftool" toolversion="9.13" status="SINGLE_RESULT">2019:02:01 12:06:48+00:00</lastmodified>
    <filepath toolname="OIS File Information" toolversion="0.2" status="SINGLE_RESULT">/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</filepath>
    <filename toolname="OIS File Information" toolversion="0.2" status="SINGLE_RESULT">AM_transcript__13_March_2018.docx</filename>
    <size toolname="OIS File Information" toolversion="0.2" status="SINGLE_RESULT">166280</size>
    <fslastmodified toolname="OIS File Information" toolversion="0.2" status="SINGLE_RESULT">1549022808000</fslastmodified>
  </fileinfo>
  <filestatus />
  <metadata />
  <toolOutput>
    <tool name="file utility" version="5.14">
      <fileUtilityOutput xmlns="">
        <rawOutput>Microsoft Word 2007+
application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary</rawOutput>
        <mimetype>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mimetype>
        <format>Microsoft Word 2007+</format>
      </fileUtilityOutput>
    </tool>
    <tool name="Exiftool" version="9.13">
      <exiftool xmlns="">
        <rawOutput>ExifToolVersion  9.13
FileName    AM_transcript__13_March_2018.docx
Directory   /var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects
FileSize    162 kB
FileModifyDate  2019:02:01 12:06:48+00:00
FileAccessDate  2019:03:06 14:39:37+00:00
FileInodeChangeDate 2019:03:06 14:39:37+00:00
FilePermissions rw-r--r--
Warning Install Archive::Zip to decode compressed ZIP information
FileType    ZIP
MIMEType    application/zip
ZipRequiredVersion  20
ZipBitFlag  0x0006
ZipCompression  Deflated
ZipModifyDate   1980:01:01 00:00:00
ZipCRC  0x4b278a6d
ZipCompressedSize   358
ZipUncompressedSize 1364
ZipFileName [Content_Types].xml</rawOutput>
        <ExifToolVersion>9.13</ExifToolVersion>
        <FileName>AM_transcript__13_March_2018.docx</FileName>
        <Directory>/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects</Directory>
        <FileSize>162 kB</FileSize>
        <FileModifyDate>2019:02:01 12:06:48+00:00</FileModifyDate>
        <FileAccessDate>2019:03:06 14:39:37+00:00</FileAccessDate>
        <FileInodeChangeDate>2019:03:06 14:39:37+00:00</FileInodeChangeDate>
        <FilePermissions>rw-r--r--</FilePermissions>
        <Warning>Install Archive::Zip to decode compressed ZIP information</Warning>
        <FileType>ZIP</FileType>
        <MIMEType>application/zip</MIMEType>
        <ZipRequiredVersion>20</ZipRequiredVersion>
        <ZipBitFlag>0x0006</ZipBitFlag>
        <ZipCompression>Deflated</ZipCompression>
        <ZipModifyDate>1980:01:01 00:00:00</ZipModifyDate>
        <ZipCRC>0x4b278a6d</ZipCRC>
        <ZipCompressedSize>358</ZipCompressedSize>
        <ZipUncompressedSize>1364</ZipUncompressedSize>
        <ZipFileName>[Content_Types].xml</ZipFileName>
      </exiftool>
    </tool>
    <tool name="NLNZ Metadata Extractor" version="3.4GA">
      <OPENOFFICE xmlns="">
        <METADATA>
          <FILENAME>AM_transcript__13_March_2018.docx</FILENAME>
          <SEPARATOR>/</SEPARATOR>
          <PARENT>/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects</PARENT>
          <CANONICALPATH>/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</CANONICALPATH>
          <ABSOLUTEPATH>/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</ABSOLUTEPATH>
          <PATH>/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</PATH>
          <FILE>true</FILE>
          <DIRECTORY>false</DIRECTORY>
          <FILELENGTH>166280</FILELENGTH>
          <HIDDEN>false</HIDDEN>
          <ABSOLUTE>true</ABSOLUTE>
          <URL>file:/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</URL>
          <URI>file:/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</URI>
          <READ>true</READ>
          <WRITE>true</WRITE>
          <EXTENSION>docx</EXTENSION>
          <MODIFIED>2019-02-01 12:06:48</MODIFIED>
          <DATE>20190201</DATE>
          <DATEPATTERN>yyyyMMdd</DATEPATTERN>
          <TIME>120648000</TIME>
          <TIMEPATTERN>HHmmssSSS</TIMEPATTERN>
          <TYPE>application/open-office-1.x</TYPE>
          <PID>null</PID>
          <OID>null</OID>
          <FID>null</FID>
          <PROCESSOR>unknown</PROCESSOR>
        </METADATA>
      </OPENOFFICE>
    </tool>
    <tool name="OIS File Information" version="0.2">
      <fits xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd">
        <fileinfo>
          <filepath>/var/archivematica/sharedDirectory/failed/SAACT_2433-7db2fb67-793e-4bb1-9748-fd4458bbe57d/objects/AM_transcript__13_March_2018.docx</filepath>
          <filename>AM_transcript__13_March_2018.docx</filename>
          <size>166280</size>
          <fslastmodified>1549022808000</fslastmodified>
        </fileinfo>
      </fits>
    </tool>
    <tool name="ffident" version="0.2">
      <ffidentOutput xmlns="">
        <shortName>ZIP</shortName>
        <longName>ZIP Format</longName>
        <group>archive</group>
        <mimetypes>
          <mimetype>application/zip</mimetype>
        </mimetypes>
        <fileExtensions>
          <extension>zip</extension>
        </fileExtensions>
      </ffidentOutput>
    </tool>
    <tool name="Tika" version="1.3">
      <metadata xmlns="">
        <field name="Content-Type">
         recv: Connection reset by peer

Possibly working out what’s meant to go in that tag would tell us something?

alexwlchan commented 5 years ago

It seems to go through a bunch of tools – file utility, Exiftool, NLNZ Metadata Extractor, OIS File Information, ffident, Tika, and then it falls over. What’s Tika? 🤔

helenst commented 5 years ago

Thanks for continuing the investigation. That's a good question...

helenst commented 5 years ago

Trying with a different file, it fails at a different point (during OIS File Info) . So I don't think tika related. Running on the fits machine with fits.sh is successful. Running same file from mcp-client via nailgun produces the failure. Maybe nailgun related.

helenst commented 5 years ago

With a really small file (echo 'test' > test.txt), sometimes the command succeeds. In the fits container logs, fits seems to always exit with code zero. So possibly nothing fits-related actually failing. At the moment my suspicions lie with nailgun doing something weird with the connection.

alexwlchan commented 5 years ago

I did find this one a while back which looked related: https://github.com/facebook/nailgun/issues/23

alexwlchan commented 5 years ago

It sounds like fits is closing stdout somewhere, and that causes ng-nailgun to close the connection. Possibly upgrading to a new version of nailgun would help?

alexwlchan commented 5 years ago

I’m going to get this running in a couple of local containers, then play with different versions to see if that makes a difference.

alexwlchan commented 5 years ago

Unfortunately there are no instructions in the facebook/nailgun repo for updating your version of nailgun. I tried downloading the JAR from the releases page, but couldn't get it to start, and there aren't any Debian packages for it.

The problem does seem to happen more often in AWS than locally (I haven't been able to reproduce it locally yet, even with the same containers and file) – probably the increased latency between ECS containers?

alexwlchan commented 5 years ago

Looks like it was indeed an outdated version of nailgun.

There’s a PR here which creates a Docker image with the newest version of the nailgun server, automatically publishes it to ECR, and uses it in Terraform: https://github.com/wellcometrust/archivematica-infra/pull/29

In limited testing, this seems to work correctly, and no longer exhibits the issue!

alexwlchan commented 5 years ago

I think this is fixed, but it was intermittent when I was trying to test it. I’ll leave it open and close if we haven’t seen it after a few days.

helenst commented 5 years ago

I've not seen this happen in any of the newer transfers. OK to close I think.