zouzias / spark-lucenerdd

Spark RDD with Lucene's query and entity linkage capabilities
Apache License 2.0
124 stars 36 forks

Elasticsearch Snapshots? #435

Open matan129 opened 2 years ago

matan129 commented 2 years ago

Is your feature request related to a problem? Please describe. Nope. The use case is new, but kind of related to this project - I have an Elasticsearch cluster with large indices that are being snapshotted to S3. I was wondering if I could somehow leverage luceneRDD to load the data directly from S3; currently, I have Spark heavily query Elasticsearch, which puts a lot of strain on the cluster. Usually I just need a full dump of the data anyway, so I don't need sophisticated ES query capabilities when dumping the data from ES to Spark.

Describe the solution you'd like Ideally? sparkRDD.fromEs(<es_connection>). Jokes aside - basically, Elasticsearch snapshots are saved as "dumb dumps" of the Lucene index of every shard in the Elasticsearch index. I thought we might be able to parse these files with luceneRDD.

Describe alternatives you've considered N/A

zouzias commented 2 years ago

If you have your data in S3, you can read your data from S3 to a Spark DataFrame and then instantiate a LuceneRDD from your DataFrame. See for example here: https://github.com/zouzias/spark-lucenerdd-aws/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/aws/indexing/WikipediaIndexingExample.scala#L36
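As a minimal sketch of that suggestion (assumptions: a running Spark session with the spark-lucenerdd package on the classpath, and placeholder S3 path, field names, and query terms - not a copy of the linked example):

```scala
// Sketch only: the S3 path ("s3a://my-bucket/my-dump/") and the field
// names ("title") below are hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._ // implicit conversions for indexing

object S3ToLuceneRDDExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("S3ToLuceneRDD")
      .getOrCreate()

    // Read a Spark-readable dump (e.g. Parquet) from S3 ...
    val df = spark.read.parquet("s3a://my-bucket/my-dump/")

    // ... and index it with LuceneRDD, as in the linked Wikipedia example.
    val luceneRDD = LuceneRDD(df)

    // Example: a simple term query against one of the indexed fields.
    val results = luceneRDD.termQuery("title", "spark", topK = 10)
    results.foreach(println)

    spark.stop()
  }
}
```

This only works if the S3 data is in a format Spark can read (Parquet, JSON, CSV, etc.), which is the caveat raised in the PS below.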

Speaking of which, if you heavily batch query your Elasticsearch cluster from Spark, you can easily put a lot of pressure on ES.

Hope this helps.

PS. If your data are snapshotted using ES's internal snapshot representation, the above solution will not work. You need a copy of your data in a format that Spark can easily read. In the past, it was common practice to keep a backup of the ES indices to prevent data loss. Maybe these days things are more stable with ES.

matan129 commented 2 years ago

Hi, thanks for the response. The data in S3 is in Elasticsearch's own format, not something standard like Parquet.

AFAIK, ES's snapshot format is just Lucene files, so I was wondering if this library could be used to parse them.
