zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Password read from env. variable not obfuscated in the console log #705

Closed mehdi-infostrux closed 11 months ago

mehdi-infostrux commented 1 year ago

Describe the bug: The username and password are not obfuscated when they are printed to the console log.

To Reproduce Steps to reproduce the behavior:

  1. In the config file, when setting the Snowflake DB connection, use the $var$ syntax (e.g., $SF_USERNAME$, $SF_PASSWORD$) to define the credentials
  2. Save the config file with a .env suffix so environment variables can be read
  3. Run EMR create-cluster command passing in classifications --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"SF_USERNAME":"aws secretsmanager get-secret-value --secret-id zinggSnowflakeCreds --region us-east-2 --query SecretString --output text | jq -r \'.\"userName\"\'","SF_PASSWORD":"$SF_PASSWORD"}}]}]' \
  4. Go to the EMR management console, select the running cluster, wait for it to complete, then go to the steps and check the stderr
  5. At some point in the log you will see a line like this:
    2023-10-25 03:14:31,292 WARN util.PipeUtil: Reading Pipe [name=customers, format=net.snowflake.spark.snowflake, preprocessors=null, props={sfUrl=IJA16463-CFLDEV.snowflakecomputing.com, sfUser=the_actual_username, sfPassword=the_actual_password, sfDatabase=my_db, sfSchema=my_schema, sfRole=my_role, sfWarehouse=my_wh, dbtable=denormalized_table}, schema=null]

Expected behavior

I'd expect that line either not to be displayed at all, or at least to look like this:
2023-10-25 03:14:31,292 WARN util.PipeUtil: Reading Pipe [name=customers, format=net.snowflake.spark.snowflake, preprocessors=null, props={sfUrl=IJA16463-CFLDEV.snowflakecomputing.com, sfUser=the_actual_username, sfPassword=*******, sfDatabase=my_db, sfSchema=my_schema, sfRole=my_role, sfWarehouse=my_wh, dbtable=denormalized_table}, schema=null]

Run from AWS CloudShell

sonalgoyal commented 1 year ago

@gnanaprakash-ravi can you please edit the toString in Pipes.java so that, if any property name contains the string "password" (ignoring case), we obfuscate its value?
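
For illustration, the masking suggested here could look roughly like the sketch below. It is not the actual Zingg code: the class name `PropsMasker`, the method `maskSensitive`, and the `MASK` constant are hypothetical, and it assumes the pipe properties are available as a `Map<String, String>`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Zingg: masks sensitive property values
// before they are rendered by a toString() or written to a log.
public class PropsMasker {
    static final String MASK = "*******";

    // Returns a copy of props in which every value whose key contains
    // "password" (case-insensitive) is replaced by MASK.
    static Map<String, String> maskSensitive(Map<String, String> props) {
        Map<String, String> masked = new LinkedHashMap<>();
        if (props == null) {
            return masked;
        }
        for (Map.Entry<String, String> e : props.entrySet()) {
            String key = e.getKey();
            boolean sensitive = key != null
                    && key.toLowerCase(Locale.ROOT).contains("password");
            masked.put(key, sensitive ? MASK : e.getValue());
        }
        return masked;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("sfUser", "the_actual_username");
        props.put("sfPassword", "the_actual_password");
        props.put("sfDatabase", "my_db");
        // Print the masked copy instead of the raw map.
        System.out.println("props=" + maskSensitive(props));
    }
}
```

A toString that prints the masked copy never exposes the original value, and the original map stays untouched for the actual connection. Note that a key-name check like this only catches properties with "password" in the name; a username such as sfUser would still be printed.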

mehdi-infostrux commented 1 year ago

@gnanaprakash-ravi have you had a chance to look into this?

sonalgoyal commented 11 months ago

@mehdi-infostrux can you try using a log4j redactor like https://github.com/cloudera/logredactor ? Zingg gets different kinds of sensitive data depending on the data source, so it is best to handle this at the cluster infrastructure level rather than at the code level.
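
As a rough sketch of the general idea behind a log redactor (a regex search-and-replace applied to each log message before it is written), and not the logredactor API or its configuration format, something like the following would mask the line from the report. The class name `LogLineRedactor` and the pattern are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of regex-based log redaction; a real redactor
// would apply rules like this inside the logging pipeline.
public class LogLineRedactor {
    // Matches key=value pairs whose key contains "password" (any case).
    // The value is taken to run up to the next comma, closing brace, or
    // whitespace, which fits the props={...} layout in the reported log line.
    private static final Pattern PASSWORD_PAIR =
            Pattern.compile("(?i)(\\b\\w*password\\w*=)[^,}\\s]+");

    static String redact(String line) {
        // Keep the key (group 1) and replace only the value.
        return PASSWORD_PAIR.matcher(line).replaceAll("$1*******");
    }

    public static void main(String[] args) {
        String line = "WARN util.PipeUtil: Reading Pipe [name=customers, "
                + "props={sfUser=the_actual_username, "
                + "sfPassword=the_actual_password, sfDatabase=my_db}]";
        System.out.println(redact(line));
    }
}
```

Handling this in the logging layer covers every message regardless of which data source produced it, at the cost of maintaining patterns for each kind of secret that can show up.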

sonalgoyal commented 11 months ago

fixed in dd40675eca74853ad33abf100bab3144c0707d7a