wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Address/correct MTE 1.3.0 validation errors #41

Closed wkiri closed 2 years ago

wkiri commented 2 years ago

Scott VanBommel used version 2.1.0 of the validate tool and identified some issues that need investigation. The full output is included below.


PDS Validate Tool Report

Configuration:
   Version                       2.1.0
   Date                          2022-01-27T13:32:31Z

Parameters:
   Targets                       [file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/]
   Rule Type                     pds4.bundle
   Severity Level                WARNING
   Recurse Directories           true
   File Filters Used             [*.xml, *.XML]
   Data Content Validation       on
   Product Level Validation      on
   Allow Unlabeled Files         false
   Max Errors                    100000
   Registered Contexts File      C:\PDS\Tools\Validate\bin..\resources\registered_context_products.json

Product Level Validation Results

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml 1 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/aliases.xml 2 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/collection_mer2_inventory.xml 3 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/components.xml 4 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/contains.xml 5 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/documents.xml 6 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/has_property.xml 7 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/mentions.xml 8 product validation(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/properties.xml 9 product validation(s) completed

FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml Begin Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv

PDS4 Bundle Level Validation Results

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mpf/collection_mpf_inventory.xml 1 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_phx/collection_phx_inventory.xml 2 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/document/collection_document_inventory.xml 3 integrity check(s) completed

PASS: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/bundle_mars_target_encyclopedia.xml

Summary:

  2 error(s)
  12 warning(s)

  Product Validation Summary:
    31         product(s) passed
    2          product(s) failed
    0          product(s) skipped

  Referential Integrity Check Summary:
    33         check(s) passed
    0          check(s) failed
    0          check(s) skipped

  Message Types:
    1            error.validation.internal_error
    1            error.validation.invalid_field_value
    9            warning.integrity.unreferenced_member
    3            warning.integrity.missing_context_reference

End of Report

wkiri commented 2 years ago

This error was rather interesting to track down:

FAIL: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.xml Begin Content Validation: file:/C:/Users/vanbommel/Desktop/mars_target_encyclopedia/data_mer/mer2/sentences.csv ERROR [error.validation.invalid_field_value] table 1, record 1884, field 3: The field value 'For example, Rayleigh fractional crystallization of Adirondack magma steadily increases incompatible element concentrations (K; ! D Kbulk " 0) and rapidly decreases compatible element concentrations (Ni; ! D Ni bulk >>1).' that starts with double quote should not contain double quote(s)

Here we have a double-quote that appears by itself, so it is not caught by this regular expression, which assumes balanced quoting: https://github.com/wkiri/MTE/blob/c59b66a3895dd96afd1218f4d79c61e345d59719/src/deliver_sqlite.py#L154
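To illustrate why a balanced-quote pattern misses this case, here is a minimal sketch. The pattern below is a hypothetical stand-in, not the regular expression actually used in deliver_sqlite.py. It only shows how a substitution that matches quote pairs leaves a lone double quote untouched.

```python
import re

# Hypothetical stand-in for a "balanced quotes" pattern; the real
# expression in deliver_sqlite.py may differ.
BALANCED = re.compile(r'"([^"]*)"')


def replace_balanced(text):
    """Rewrite each "..." pair as ''...'' and leave everything else alone."""
    return BALANCED.sub(r"''\1''", text)


# A properly paired quotation is rewritten.
paired = replace_balanced('the "Adirondack" class')

# A lone double quote (as in record 1884) has no partner, so nothing matches.
lone = replace_balanced('(K; ! D Kbulk " 0)')

print(paired)
print(lone)
```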

I think we should just replace all double quotes, not just balanced ones. In that case, we can use sentences_df.replace(regex='"', value="''", inplace=True) prior to writing out the CSV, so each double quote becomes two single quotes. This also avoids having to read the CSV back in and fix it in replace_internal_double_quote().

One outcome is that sentences (fields) which previously were enclosed in double quotes because they had a double-quote internally now are not quoted (because they only have single quotes). I think this is fine, but wanted to hear from @stevenlujpl .
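The quoting change can be reproduced with the standard library alone. This is a sketch, not the project's code: it uses Python's csv module with its default minimal quoting (the same default pandas uses in to_csv), and the sample sentence and the helper name to_csv_line are made up for illustration.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one row with csv's default minimal quoting (illustrative helper)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue().rstrip("\n")


# Made-up sentence containing a single, unbalanced double quote.
sentence = 'D Kbulk " 0 and Ni bulk >>1.'

# With the double quote present, the writer encloses the field in quotes
# and doubles the internal quote.
before = to_csv_line(["doc1", sentence])

# After replacing every double quote with two single quotes, nothing in
# the field requires quoting, so it is written bare.
after = to_csv_line(["doc1", sentence.replace('"', "''")])

print(before)
print(after)
```

A field that also contains a comma would still be enclosed in double quotes after the replacement, since the delimiter alone triggers quoting.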

Another outcome is that this function which relies on converting the CSV lines into a numpy array no longer behaves as expected, because numpy's array conversion isn't really smart enough to parse CSV content: https://github.com/wkiri/MTE/blob/c59b66a3895dd96afd1218f4d79c61e345d59719/src/generate_pds4_bundle.py#L182-L191
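A small sketch of the underlying problem, using a made-up CSV line: splitting on commas ignores CSV quoting, so a quoted field that contains commas is broken into several pieces, while a CSV-aware parser keeps it whole.

```python
import csv

# Made-up line whose third field is quoted because it contains commas.
line = 'doc1,42,"Adirondack, a basalt, was analyzed."'

# Naive splitting treats every comma as a field separator.
naive = line.split(",")

# The csv module honors the quoting and returns the intended three fields.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

Because the naive split yields a different number of pieces per line, the rows no longer line up into regular columns, and indexing a column by position stops being meaningful.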

However, by using pandas, we can simplify this to:

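import pandas as pd

# Compute the maximum field length (in bytes) for one column of the CSV lines.
def get_max_field_len(csv_lines, field_index):
    # Previously: content_2d_array = np.array([line.split(',') for line in csv_lines])
    content_df = pd.DataFrame(csv_lines)
    if len(content_df) == 1:
        # Only the header row is present, so there are no values to measure.
        max_len = 0
    else:
        # Skip the header row and take the longest value in the requested column.
        max_len = content_df[field_index][1:].astype(bytes).str.len().max()

    return max_len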
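For comparison, here is a standard-library sketch of the same idea. It is an alternative, not the code in the issue41-validate branch, and it rests on two assumptions: csv_lines is a list of raw CSV text lines with a header first, and the length wanted is the UTF-8 byte length of the parsed value (excluding any enclosing quotes). The name max_field_bytes is hypothetical.

```python
import csv


def max_field_bytes(csv_lines, field_index, encoding="utf-8"):
    """Return the longest encoded length of one column, skipping the header row."""
    rows = list(csv.reader(csv_lines))
    data_rows = rows[1:]  # first row is assumed to be the header
    if not data_rows:
        return 0
    return max(len(row[field_index].encode(encoding)) for row in data_rows)


# Made-up input: one quoted field with a comma, one field with a non-ASCII letter.
lines = ["id,sentence", '1,"Ni, K"', "2,caf\u00e9 au lait"]

longest = max_field_bytes(lines, 1)
header_only = max_field_bytes(["id,sentence"], 1)

print(longest, header_only)
```

Measuring the encoded bytes rather than the character count matters only when a value contains non-ASCII characters, as in the second sample row.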

@stevenlujpl Please browse the relevant commits (in branch issue41-validate). If you agree with this change, I think we can also remove the function replace_internal_double_quote().

wkiri commented 2 years ago

I meant to add, with this change the bundle no longer generates the error with validate 2.1.4.

wkiri commented 2 years ago

Resolved by using validate 2.1.4:

wkiri commented 2 years ago

WARNING [warning.integrity.unreferenced_member] Identifier 'urn:nasa:pds:mars_target_encyclopedia:data_mer2:aliases::1.0' is not a member of any collection within the given target

This warning (and similar ones) occurred because the collection inventory and its associated .xml label need to appear at the collection level (inside the data_mer directory) rather than nested in the data_mer/mer2 subdirectory. This is a little confusing, and I am not completely sure it will work to have two collection .xml files inside data_mer (for mer1 and for mer2). If not, we can rearrange our directory structure to have top-level data_mer1 and data_mer2.

For now, moving the mer2 collection inventory files up to data_mer resolves all remaining validate issues.
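The move itself could be scripted along these lines. This is only a sketch: the function name hoist_collection_files is hypothetical, the collection_* glob is an assumption about how the inventory files are named, and the .csv file name in the demo is assumed (only collection_mer2_inventory.xml appears in the validate output). The demo runs against a temporary directory, not the real bundle.

```python
import shutil
import tempfile
from pathlib import Path


def hoist_collection_files(collection_dir, subdir):
    """Move collection_* files from collection_dir/subdir up into collection_dir."""
    collection_dir = Path(collection_dir)
    moved = []
    for path in sorted((collection_dir / subdir).glob("collection_*")):
        target = collection_dir / path.name
        if target.exists():
            # Refuse to overwrite an existing file at the collection level.
            raise FileExistsError(target)
        shutil.move(str(path), str(target))
        moved.append(target.name)
    return moved


# Demo on a throwaway directory tree that mimics data_mer/mer2.
with tempfile.TemporaryDirectory() as tmp:
    mer2 = Path(tmp) / "data_mer" / "mer2"
    mer2.mkdir(parents=True)
    for name in ("collection_mer2_inventory.xml",
                 "collection_mer2_inventory.csv",  # assumed file name
                 "aliases.xml"):
        (mer2 / name).write_text("")

    moved = hoist_collection_files(Path(tmp) / "data_mer", "mer2")
    remaining = sorted(p.name for p in mer2.iterdir())

print(moved)
print(remaining)
```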

wkiri commented 2 years ago

Validate output (warnings about the manifest and md5 files are expected):

PDS Validate Tool Report                                                                                    

Configuration:                                                                                              
   Version                       2.1.4                                                                      
   Date                          2022-02-01T01:16:28Z                                                       

Parameters:                                                                                                 
   Targets                       [file:/home/wkiri/Research/MTE/git/pds4_bundle/bundle_v1.3.1/mars_target_encyclopedia/]
   Rule Type                     pds4.bundle                                                                
   Severity Level                WARNING                                                                    
   Recurse Directories           true                                                                       
   File Filters Used             [*.xml, *.XML]                                                             
   Data Content Validation       on                                                                         
   Product Level Validation      on                                                                         
   Allow Unlabeled Files         false                                                                      
   Max Errors                    100000                                                                     
   Registered Contexts File      /proj/mte/pds4_validation_tool/v2.1.4/resources/registered_context_products.json

[...]

  PASS: file:/home/wkiri/Research/MTE/git/pds4_bundle/bundle_v1.3.1/mars_target_encyclopedia/urn-nasa-pds-mars_target_encyclopedia.manifest
      WARNING  [warning.file.not_referenced_in_label]   File is not referenced by any label                 
        5 integrity check(s) completed                                                                      

  PASS: file:/home/wkiri/Research/MTE/git/pds4_bundle/bundle_v1.3.1/mars_target_encyclopedia/urn-nasa-pds-mars_target_encyclopedia.md5
      WARNING  [warning.file.not_referenced_in_label]   File is not referenced by any label                 
        6 integrity check(s) completed 

[...]

Summary:                                                                                                    

  0 error(s)                                                                                                
  2 warning(s)                                                                                              

  Product Validation Summary:                                                                               
    33         product(s) passed                                                                            
    0          product(s) failed                                                                            
    0          product(s) skipped                                                                           

  Referential Integrity Check Summary:                                                                      
    35         check(s) passed                                                                              
    0          check(s) failed                                                                              
    0          check(s) skipped                                                                             

  Message Types:                                                                                            
    2            warning.file.not_referenced_in_label                                                       

End of Report                                                                                               
Completed execution in 7826 ms

wkiri commented 2 years ago

Note: When we reply with the updated bundle, remember to emphasize the benefits of PDS using standard CSV handling of quotes instead of a custom format.
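One benefit that could be cited is that standard CSV quoting is lossless. The sketch below, using Python's csv module and a made-up sentence, shows that a field with internal double quotes survives a write/read round trip unchanged, whereas substituting two single quotes alters the text. The helper name round_trip is hypothetical.

```python
import csv
import io


def round_trip(fields):
    """Write one row with standard CSV quoting, then parse it back."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    line = buf.getvalue()
    return line, next(csv.reader(io.StringIO(line)))


# Made-up row with internal double quotes and a comma.
original = ["doc1", 'He said "basalt", then left.']

# Standard handling: internal quotes are doubled on write and restored on read.
line, recovered = round_trip(original)

# The workaround: the stored text no longer matches the source sentence.
lossy = [field.replace('"', "''") for field in original]

print(line.rstrip("\n"))
print(recovered == original, lossy == original)
```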