webis-de / archive-query-log

📜 The Archive Query Log.
https://tira.io/task/archive-query-log
MIT License

Elasticsearch storage backend #25

Open heinrichreimer opened 8 months ago

heinrichreimer commented 8 months ago

Restructure the crawling and parsing to store structured data in Elasticsearch indices instead of in the file system. Also store WARCs in S3 instead of as raw files. The new storage backends should be flexible enough to allow re-parsing parts of the dataset without having to delete anything. The second key requirement is the ability to scale up massively by interacting only with standard ES/S3 APIs instead of having to mount a shared file system on all nodes.
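One way to picture the "re-parse without deleting" requirement is sketched below. This is only an illustration, not the project's implementation: the index name, the field names, the `WarcLocation` and `ParsedQuery` classes, and the ID scheme are all assumptions made up for this example. Each parsed document carries a pointer to its raw WARC record in S3 plus the parser version that produced it, and the document ID is derived from both, so a new parser version writes new documents next to the old ones. The output is newline-delimited JSON in the shape the Elasticsearch bulk API accepts.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Iterable

# Hypothetical namespace for deterministic IDs; not taken from the project.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/aql-sketch")


@dataclass(frozen=True)
class WarcLocation:
    """Where a raw capture lives in S3 (all field names are assumptions)."""
    bucket: str
    key: str
    offset: int
    length: int


@dataclass(frozen=True)
class ParsedQuery:
    """A parsed record that points back to its raw WARC record."""
    capture_url: str
    query: str
    parser_version: str
    warc: WarcLocation


def document_id(capture_url: str, parser_version: str) -> str:
    """Derive a stable ID: re-running the same parser version overwrites its
    own output, while a new parser version gets a separate document."""
    return str(uuid.uuid5(NAMESPACE, f"{parser_version}:{capture_url}"))


def bulk_lines(index: str, docs: Iterable[ParsedQuery]) -> str:
    """Serialize documents as newline-delimited JSON for the ES bulk API."""
    lines = []
    for doc in docs:
        doc_id = document_id(doc.capture_url, doc.parser_version)
        # Action line first, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(asdict(doc)))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Because every node only needs to produce such request bodies and read WARC byte ranges from S3, no shared file system mount is involved in this sketch.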

codecov[bot] commented 8 months ago

Codecov Report

Attention: Patch coverage is 75.00000% with 49 lines in your changes missing coverage. Please review.

Project coverage is 89.68%. Comparing base (3554fc7) to head (6d05c24). Report is 20 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| archive_query_log/legacy/queries/parse.py | 19.04% | 17 Missing :warning: |
| archive_query_log/legacy/urls/iterable.py | 53.84% | 12 Missing :warning: |
| archive_query_log/legacy/results/parse.py | 56.52% | 10 Missing :warning: |
| archive_query_log/legacy/model/parse.py | 78.57% | 3 Missing :warning: |
| archive_query_log/legacy/\_\_init\_\_.py | 77.77% | 2 Missing :warning: |
| archive_query_log/legacy/download/iterable.py | 90.00% | 2 Missing :warning: |
| archive_query_log/legacy/model/\_\_init\_\_.py | 87.50% | 1 Missing :warning: |
| archive_query_log/legacy/services/\_\_init\_\_.py | 75.00% | 1 Missing :warning: |
| archive_query_log/legacy/util/text.py | 80.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main      #25       +/-   ##
===========================================
+ Coverage   54.00%   89.68%   +35.67%
===========================================
  Files          92       61       -31
  Lines        4607     2724     -1883
===========================================
- Hits         2488     2443       -45
+ Misses       2119      281     -1838
```


heinrichreimer commented 7 months ago

Closes #9

heinrichreimer commented 7 months ago

Fixes #6