wri-dssg-omdena / policy-data-analyzer

Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.

Create Dockerfile to isolate project dependencies #25

Closed dfhssilva closed 3 years ago

dfhssilva commented 3 years ago

Opening this PR to share what I have been doing on the spell checker and the project's Docker image. Don't merge it yet.

The Dockerfile can already build a working container, at least for the OCR pipeline. It would be nice if someone else could try it out and give some feedback.

Build the Docker image: `docker build -f Dockerfile -t policy_container .`

Run the image (class-object analogy) as a container: `docker run -p 8888:8888 policy_container`. This starts an `sh` shell inside the container that you can interact with. From there you should also be able to launch Jupyter and run the notebooks: `jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root`

Let me know if you have any questions. I will continue working on this and write up some instructions/documentation.

The Dockerfile is meant to ultimately isolate the entire project; this commit only covers the dependencies of the OCR pipeline.
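The Dockerfile itself isn't shown in this thread, but a minimal sketch of what such an OCR-focused image could look like (the `tesseract-ocr` package and the exact layout are my assumptions, not the PR's actual contents) is:

```dockerfile
# Hypothetical sketch -- the real Dockerfile lives in the PR, not this thread.
FROM ubuntu:18.04

# System dependencies; tesseract-ocr is an assumed OCR dependency.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies first so this layer is cached
# when only project code changes.
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8888
CMD ["sh"]
```

Copying `requirements.txt` before the rest of the project is a common layer-caching trick: rebuilding after a code-only change skips the slow `pip install` step.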

The requirements.txt was updated to include the necessary Python packages, and a .dockerignore file was added to avoid sending large or sensitive files and directories to the Docker daemon and into the image.
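The actual .dockerignore isn't listed here, but entries along these lines (illustrative, not the real file) would keep large and sensitive paths out of the build context:

```
# Illustrative .dockerignore entries -- the real file may differ
.git
.env
data/
**/.ipynb_checkpoints
```

Excluding `.env` is especially important, since it holds the S3 credentials discussed below and should never be baked into an image.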

dfhssilva commented 3 years ago

To run the docker container, use `docker run -ti -p 8888:8888 --env-file .env policy_container` and make sure you have a .env file in the project root directory. This file should contain key=value pairs with the environment variables used by the scripts and notebooks, such as S3_BUCKET and SECRET_KEY (needed to connect to Omdena's S3 bucket).
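For anyone running the scripts outside Docker, here is a minimal sketch of how such key=value pairs can be loaded with the standard library. The helper name `parse_env_file` is hypothetical and not part of the project; inside the container, `docker run --env-file` (or the python-dotenv package) does this for you.

```python
import os

def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and comments.

    Hypothetical helper for illustration only; mirrors what
    `docker run --env-file` expects.
    """
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Ignore blank lines, comments, and lines without '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

# Example: expose the values to scripts/notebooks, as the container
# does via --env-file:
# os.environ.update(parse_env_file(".env"))
```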

jordiplanescutxi commented 3 years ago

> To run the docker container use docker run -ti -p 8888:8888 --env-file .env policy_container and make sure you have a .env file in the project root directory. This file should have key=value pairs that hold the environment variables that will be held by the scripts and notebooks such as S3_BUCKET and SECRET_KEY (necessary to connect to Omdena's S3 bucket)

About the S3_BUCKET credentials: is this .env file kept away from external exposure? On Saturday evening AWS asked Omdena to change them; maybe they detected a leak. Let us talk about it.

dfhssilva commented 3 years ago

Hey @jordiplanascuchi @thefirebanks! My last commit produces a development environment we can use for this project. The Dockerfile builds an image, based on Ubuntu 18.04, with the dependencies we need (we can add more over time). We then bind-mount the project directory into the container, so files edited inside the container change locally as well. We keep using git locally as usual, and edits made locally show up right away in the container, where we can run them.

Build the image: `docker build -f Dockerfile -t policy_container .`

Create a container by running the image: `docker run -ti --rm -p 8888:8888 --mount source=$(pwd),target=/app,type=bind policy_container:latest`

Launch a Jupyter notebook from inside the container: `jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root`

I also explored a more complete approach for developing inside the container using VSCode. It requires installing the Remote-Containers extension; take a look at this article for a more in-depth explanation/reference. This second approach seems more practical to me (at least if you use VSCode regularly): it runs the Docker image from our Dockerfile and lets you run and edit the mounted files from VSCode inside the container, with all of its isolated dependencies. It also launches a Jupyter notebook server that we can access at any time during development. If you go this route, I highly suggest reading the article above to get used to how the process works. You can reach the notebook server running inside the container by going to 127.0.0.1:8888/ in your browser and using the password policydata.
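For reference, the Remote-Containers setup is driven by a `.devcontainer/devcontainer.json` file; a minimal sketch matching the setup described above (field values are illustrative, not our actual config) could be:

```json
{
  "name": "policy-data-analyzer",
  "build": { "dockerfile": "../Dockerfile" },
  "forwardPorts": [8888],
  "workspaceFolder": "/app",
  "workspaceMount": "source=${localWorkspaceFolder},target=/app,type=bind"
}
```

With this in place, VSCode rebuilds the image from our Dockerfile, mounts the repo at /app, and forwards port 8888 so the notebook server is reachable from the host.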

dfhssilva commented 3 years ago

Please try to set up the development environment as described in the previous comment and let me know whether you can reproduce it or if you run into any issues.

jordiplanescutxi commented 3 years ago

OK, if you are working on Windows, the run command should be something like `docker run -ti --rm -p 8888:8888 --mount type=bind,source="/C/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/policy-data-analyzer",target="/app" policy_container:latest`. Just change the absolute path to fit your system.

thefirebanks commented 3 years ago

@DavidSilva98 excellent job! Just tested the commands and they work for me. Two things:

Once that is solved, @jordiplanascuchi if everything is good, let's approve this and merge please!

dfhssilva commented 3 years ago

PR merged into master! Should we keep this branch open so we can try to implement JamSpell in the future?