Open heinrichreimer opened 8 months ago
Attention: Patch coverage is 75.00000% with 49 lines in your changes missing coverage. Please review.
Project coverage is 89.68%. Comparing base (3554fc7) to head (6d05c24). Report is 20 commits behind head on main.
Closes #9
Fixes #6
Restructure crawling and parsing to store structured data in Elasticsearch indices instead of on the file system, and store WARCs in S3 instead of as raw files. The new storage backends should be flexible enough to allow re-parsing parts of the dataset without having to delete anything. The second key requirement is to be able to scale up massively by interacting only with standard ES/S3 APIs instead of having to mount a shared file system on all nodes.
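To illustrate the re-parsing requirement, here is a minimal sketch (not the code in this PR) of how a document could be keyed so that a later parse overwrites the earlier result instead of requiring a delete. All names here (`CapturedPage`, `WarcLocation`, `document_id`, `to_bulk_action`, the field names, and the ID namespace) are hypothetical. The sketch assumes the raw record stays in a WARC in S3 and the Elasticsearch document only points to it; the action dict follows the shape accepted by the bulk helpers of the official Python Elasticsearch client.

```python
import uuid
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical namespace so that IDs are stable across runs and nodes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example-crawl")


@dataclass(frozen=True)
class WarcLocation:
    """Pointer to a raw record inside a WARC object in S3 (assumed layout)."""
    bucket: str
    key: str
    offset: int
    length: int


@dataclass(frozen=True)
class CapturedPage:
    """Structured document for one capture; parsed fields may be filled later."""
    url: str
    timestamp: str
    warc: WarcLocation
    parsed_query: Optional[str] = None
    parser_version: Optional[str] = None


def document_id(url: str, timestamp: str) -> str:
    """Derive a deterministic ID from the capture's identity, not its parse."""
    return str(uuid.uuid5(NAMESPACE, f"{url}|{timestamp}"))


def to_bulk_action(index: str, page: CapturedPage) -> dict:
    """Build an 'index' action; re-sending the same ID replaces the document."""
    return {
        "_op_type": "index",
        "_index": index,
        "_id": document_id(page.url, page.timestamp),
        "_source": asdict(page),
    }
```

Because the ID depends only on the URL and timestamp, re-parsing the same capture with a newer parser yields an action with the same `_id`, so indexing it replaces the old parse in place while the WARC in S3 stays untouched.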