Open matan129 opened 2 years ago
**Is your feature request related to a problem? Please describe.**
Nope. The use case is new, but kind of related to this project - I have an Elasticsearch cluster with large indices that are being snapshotted to S3. I was wondering if I could somehow leverage `luceneRDD` to load the data directly from S3; currently, I have Spark heavily query Elasticsearch, which puts a lot of strain on the cluster. Usually I just need a full dump of the data anyway, so I don't need sophisticated ES query capabilities when dumping the data from ES to Spark.

**Describe the solution you'd like**
Ideally? `sparkRDD.fromEs(<es_connection>)`. Jokes aside - basically, Elasticsearch snapshots are saved as "dumb dumps" of the Lucene index of every shard in the Elasticsearch index. I thought we might be able to parse these files with `luceneRDD`.

**Describe alternatives you've considered**
N/A
If you have your data in S3, you can read your data from S3 to a Spark DataFrame and then instantiate a LuceneRDD from your DataFrame. See for example here: https://github.com/zouzias/spark-lucenerdd-aws/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/aws/indexing/WikipediaIndexingExample.scala#L36
Speaking of which, if you heavily batch query your Elasticsearch cluster from Spark, you can easily put a lot of pressure on ES.
Hope this helps.
PS. If your data are snapshotted using Elasticsearch's internal snapshot representation, the above solution will not work. You must have a copy of your data that you can easily read with Spark. In the past, it was common practice to keep a backup of ES indices to prevent data loss. Maybe these days things are more stable with ES.
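To make the suggestion above concrete, here is a minimal sketch of the "read from S3, then index with LuceneRDD" approach, in the spirit of the linked `WikipediaIndexingExample`. The S3 path, field name, and query value are hypothetical placeholders, and this assumes the data in S3 is a plain Spark-readable format (e.g. Parquet), not an ES snapshot:

```scala
import org.apache.spark.sql.SparkSession
import org.zouzias.spark.lucenerdd.LuceneRDD

object S3ToLuceneRDDExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-to-lucenerdd")
      .getOrCreate()

    // Read a plain copy of your data from S3.
    // NOTE: this does NOT work on Elasticsearch's internal snapshot format;
    // the S3 objects must be something Spark can read directly.
    val df = spark.read.parquet("s3a://my-bucket/my-export/") // hypothetical path

    // Build an in-memory Lucene index over the DataFrame.
    val luceneRDD = LuceneRDD(df)

    // Example: term query on a (hypothetical) field, top 10 hits.
    val results = luceneRDD.termQuery("title", "spark", 10)
    results.take(10).foreach(println)

    spark.stop()
  }
}
```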
Hi, thanks for the response. The data in S3 is Elasticsearch's own snapshot format, not something standard like Parquet.
AFAIK, ES's format is essentially Lucene index files, so I was wondering if this library could be used to parse them.
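For context on what parsing "just a Lucene index" would look like: if a shard's segment files could be reassembled into a plain on-disk Lucene directory (ES snapshots store them as chunked blobs with separate metadata, so this is the hard part), reading the stored documents back is straightforward with Lucene's own API. A hedged sketch, assuming a hypothetical local path to a restored shard index:

```scala
import java.nio.file.Paths
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.store.FSDirectory

object ReadLuceneShard {
  def main(args: Array[String]): Unit = {
    // Hypothetical path to a shard's index directory, already restored
    // from the snapshot into plain Lucene layout (segments_N, .cfs, ...).
    val dir = FSDirectory.open(Paths.get("/path/to/restored/shard/index"))
    val reader = DirectoryReader.open(dir)

    // Iterate over all live documents and print their stored fields.
    for (docId <- 0 until reader.maxDoc) {
      val doc = reader.document(docId)
      println(doc)
    }

    reader.close()
    dir.close()
  }
}
```

For ES documents specifically, the original JSON typically lives in the stored `_source` field of each Lucene document.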