zinggAI / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML
GNU Affero General Public License v3.0

Password read from env. variable not obfuscated in the console log #705

Closed mehdi-infostrux closed 9 months ago

mehdi-infostrux commented 11 months ago

Describe the bug

The username and password are not obfuscated when they are printed to the console log.

To Reproduce

Steps to reproduce the behavior:

  1. In the config file, when setting up the Snowflake DB connection, use the $var$ syntax (e.g. $SF_USERNAME$, $SF_PASSWORD$) to define the credentials
  2. Save the config file with a .env suffix so environment variables can be read
  3. Run EMR create-cluster command passing in classifications --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"SF_USERNAME":"aws secretsmanager get-secret-value --secret-id zinggSnowflakeCreds --region us-east-2 --query SecretString --output text | jq -r \'.\"userName\"\'","SF_PASSWORD":"$SF_PASSWORD"}}]}]' \
  4. Go to the EMR management console, select the running cluster, wait for it to complete, then go to the steps and check the stderr
  5. At some point in the log you will see a line like this:
    2023-10-25 03:14:31,292 WARN util.PipeUtil: Reading Pipe [name=customers, format=net.snowflake.spark.snowflake, preprocessors=null, props={sfUrl=IJA16463-CFLDEV.snowflakecomputing.com, sfUser=the_actual_username, sfPassword=the_actual_password, sfDatabase=my_db, sfSchema=my_schema, sfRole=my_role, sfWarehouse=my_wh, dbtable=denormalized_table}, schema=null]

Expected behavior

I'd expect that line either not to be displayed at all, or at least to look like this:
2023-10-25 03:14:31,292 WARN util.PipeUtil: Reading Pipe [name=customers, format=net.snowflake.spark.snowflake, preprocessors=null, props={sfUrl=IJA16463-CFLDEV.snowflakecomputing.com, sfUser=the_actual_username, sfPassword=*******, sfDatabase=my_db, sfSchema=my_schema, sfRole=my_role, sfWarehouse=my_wh, dbtable=denormalized_table}, schema=null]

Run from AWS CloudShell

sonalgoyal commented 11 months ago

@gnanaprakash-ravi can you please edit the toString in Pipes.java so that, if any property name contains the string "password" (ignoring case), we obfuscate its value?
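
A minimal sketch of that idea, assuming the pipe's props are held as a Map<String, String>. The class and method names (PropsMasker, maskSensitive) are hypothetical and are not taken from Zingg's code; the point is only that toString would print a masked copy instead of the raw map.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not Zingg's actual code: masks the value of any
// property whose key contains "password" (case-insensitive).
public class PropsMasker {

    private static final String MASK = "*******";

    // Returns a copy of props that is safe to log; the input map is not modified.
    public static Map<String, String> maskSensitive(Map<String, String> props) {
        Map<String, String> safe = new LinkedHashMap<>();
        if (props == null) {
            return safe;
        }
        for (Map.Entry<String, String> e : props.entrySet()) {
            String key = e.getKey();
            boolean sensitive = key != null
                    && key.toLowerCase(Locale.ROOT).contains("password");
            safe.put(key, sensitive ? MASK : e.getValue());
        }
        return safe;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("sfUser", "the_actual_username");
        props.put("sfPassword", "the_actual_password");
        props.put("sfDatabase", "my_db");
        // A toString() would append this masked copy rather than the raw props.
        System.out.println("props=" + maskSensitive(props));
    }
}
```

Note that this only catches keys that literally contain "password"; the username in the log line from the report would still be printed unless more key names are added to the check.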

mehdi-infostrux commented 10 months ago

@gnanaprakash-ravi have you had a chance to look into this?

sonalgoyal commented 9 months ago

@mehdi-infostrux can you try using a log4j redactor like https://github.com/cloudera/logredactor ? Zingg receives different kinds of sensitive data depending on the data source, so it is best to handle this at the cluster infrastructure level rather than at the code level.
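
To illustrate what redaction at the log level amounts to, here is a rough sketch that rewrites an already formatted log line with a regular expression. It does not use the logredactor API or its configuration format; the class name LogLineRedactor and the pattern are assumptions chosen for this example.

```java
import java.util.regex.Pattern;

// Illustrative only: applies one redaction rule to a rendered log line.
public class LogLineRedactor {

    // Matches key=value pairs whose key contains "password" (any case).
    // The value is assumed to end at a comma, closing brace/bracket or whitespace.
    private static final Pattern SECRET =
            Pattern.compile("(?i)(\\w*password\\w*=)[^,}\\]\\s]+");

    // Keeps the key and replaces the value with a fixed mask.
    public static String redact(String line) {
        return SECRET.matcher(line).replaceAll("$1*******");
    }

    public static void main(String[] args) {
        String line = "Reading Pipe [name=customers, props={sfUser=the_actual_username, "
                + "sfPassword=the_actual_password, sfDatabase=my_db}, schema=null]";
        System.out.println(redact(line));
    }
}
```

A rule like this works on any logger output regardless of which class produced it, but it depends on the text layout: a password that itself contains a comma or whitespace would only be partially masked.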

sonalgoyal commented 9 months ago

fixed in dd40675eca74853ad33abf100bab3144c0707d7a