Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US, and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning, and text analysis pipelines.
Evaluation code now stores results only optionally, and needs two parameters to do so instead of many. This is simpler because we use the embeddings for visualization and the predictions for the confusion matrix, which is why the two processes couldn't be fully decoupled. The implementation also follows code from Hugging Face and SBERT :)
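A minimal sketch of what that two-parameter design might look like. The class and parameter names (`ClassificationEvaluator`, `store_embeddings`, `store_predictions`) are illustrative assumptions, not the project's actual API; the pattern mirrors the SBERT-style evaluator callables described above.

```python
# Hypothetical sketch: an evaluator that optionally keeps artifacts.
# Two flags replace the previous many parameters:
#   store_embeddings  -> kept for later visualization
#   store_predictions -> kept for the confusion matrix
# Names are assumptions for illustration, not the real project API.
class ClassificationEvaluator:
    def __init__(self, store_embeddings=False, store_predictions=False):
        self.store_embeddings = store_embeddings
        self.store_predictions = store_predictions
        self.embeddings = None
        self.predictions = None

    def __call__(self, embeddings, predictions, labels):
        # Core metric (accuracy as a stand-in for the real metrics)
        correct = sum(p == y for p, y in zip(predictions, labels))
        accuracy = correct / len(labels)
        # Optionally retain artifacts for downstream analysis
        if self.store_embeddings:
            self.embeddings = embeddings    # e.g. 2D projection plots
        if self.store_predictions:
            self.predictions = predictions  # e.g. confusion matrix
        return accuracy
```

Because both artifacts hang off the same forward pass, storing them inside the evaluator (rather than in two separate pipelines) is what keeps the design coupled but cheap.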
Evaluation is now done on the dev set instead of the test set.
Added separate methods to evaluate the test set.
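A small sketch of the dev/test split described above, assuming hypothetical method names (`evaluate_dev`, `evaluate_test`); the actual project code may structure this differently:

```python
# Hypothetical sketch: dev-set evaluation for model selection during
# training, with a separate, explicit method for the final test-set run.
class Trainer:
    def __init__(self, model, dev_data, test_data):
        self.model = model          # callable: input -> predicted label
        self.dev_data = dev_data    # list of (input, label) pairs
        self.test_data = test_data

    def evaluate_dev(self):
        """Run repeatedly while tuning; never touches the test set."""
        return self._evaluate(self.dev_data)

    def evaluate_test(self):
        """Run once, after all decisions are made, for the final report."""
        return self._evaluate(self.test_data)

    def _evaluate(self, data):
        preds = [self.model(x) for x, _ in data]
        labels = [y for _, y in data]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

Keeping the test-set call in its own method makes it harder to accidentally leak test data into hyperparameter decisions.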
IMPORTANT:
Removed the JSON output file for tracking results, since the F1 score was already being stored in a CSV by the original evaluator code. Because we're going to add/modify the hyperparameters used for fine-tuning, I'll leave the design of the output results for whenever we tackle that issue. Accuracy, F1, and the current hyperparameters are still tracked and stored in the CSV file. However, this needs to change once we agree on which hyperparameters to tune and how to keep track of experiments.
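A sketch of the kind of CSV logging described above: one row per run holding accuracy, F1, and the current hyperparameters. The function name, column names, and file layout are assumptions for illustration, not the project's actual schema.

```python
import csv
import os

# Hypothetical sketch: append one row of metrics + hyperparameters per
# run to a CSV, writing the header only on first use. Column names and
# the flat "one column per hyperparameter" layout are assumptions.
def log_run(path, accuracy, f1, hyperparams):
    fieldnames = ["accuracy", "f1", *sorted(hyperparams)]
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow({"accuracy": accuracy, "f1": f1, **hyperparams})
```

A flat CSV is easy to diff and plot, but as the note says, once the hyperparameter set changes between experiments the fixed columns break down, which is why a proper experiment-tracking design is deferred.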