weso / sparkwdsub

Spark processing of Wikidata subsets
MIT License

Experiments on AWS EMR #4

Open thewillyhuman opened 2 years ago

thewillyhuman commented 2 years ago

AWS EMR is the Amazon Web Services Elastic MapReduce service. It allows creating Spark clusters. Our aim was to run these experiments on the WESO Spark cluster, but unfortunately we ran out of RAM. This issue will track the history of the experiments on AWS EMR.

thewillyhuman commented 2 years ago

2021/09/02 11:41 - Built full Wikidata graph with Apache Spark GraphX

We were able to build a graph from the latest Wikidata dump (1.2 TiB of JSON).
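For context, the construction itself is standard GraphX: one vertex per Wikidata entity and one edge per item-valued statement. Below is a minimal sketch of the approach; the dump path and the helpers parseEntity and parseStatements are hypothetical stand-ins for the project's actual dump parser, not the real implementation.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object BuildWikidataGraph extends App {
  val spark = SparkSession.builder().appName("build-wikidata-graph").getOrCreate()
  val sc = spark.sparkContext

  // One JSON document per line in the Wikidata dump (illustrative path).
  val lines = sc.textFile("/path/to/wikidata-dump.json")

  // Hypothetical stand-ins for the real dump parser: parseEntity yields
  // (numericId, entityId); parseStatements yields one (source, target,
  // property) triple per item-valued statement on the line.
  def parseEntity(line: String): Option[(Long, String)] = ???
  def parseStatements(line: String): Seq[(Long, Long, String)] = ???

  val vertices = lines.flatMap(parseEntity(_))
  val edges = lines.flatMap(parseStatements(_)).map { case (src, dst, prop) =>
    Edge(src, dst, prop)
  }

  // GraphX assembles the graph from the two RDDs; the counts below are
  // what the job report prints.
  val graph = Graph(vertices, edges)
  println(s"Graph Edges: ${graph.numEdges}")
  println(s"Graph Vertices: ${graph.numVertices}")
}
```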

Output:

==========================================
JOB INFORMATION

Job Name: test_2
Job Date: 2021-09-02 05:11:45
Job Cores: 256
Job Mem: 1024 Gb
Job Time: 1402.883987493 seconds
==========================================
JOB RESULTS

Graph Edges: 631672396
Graph Vertices: 93946908

thewillyhuman commented 2 years ago

2021/09/02 11:44 - Problems executing validation on AWS EMR

One of the libraries that we are using in the validation step was compiled for Java 11. Unfortunately, AWS EMR ships with Java 8. This error has already been reported at https://github.com/weso/shex-s/issues/258.
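For anyone reproducing this: running Java 11 bytecode on a Java 8 JVM fails with java.lang.UnsupportedClassVersionError. A quick way to confirm which release a dependency was compiled for is to read the class-file major version (52 = Java 8, 55 = Java 11); a minimal sketch, taking the path of a class file extracted from the suspect jar as its argument:

```scala
import java.io.DataInputStream
import java.nio.file.{Files, Paths}

object ClassFileVersion extends App {
  // args(0): path to a .class file extracted from the suspect jar.
  val in = new DataInputStream(Files.newInputStream(Paths.get(args(0))))
  try {
    // Class files start with the 0xCAFEBABE magic number,
    // followed by the minor and major version numbers.
    require(in.readInt() == 0xCAFEBABE, "not a class file")
    val minor = in.readUnsignedShort()
    val major = in.readUnsignedShort()
    println(s"class-file version: $major.$minor (52 = Java 8, 55 = Java 11)")
  } finally in.close()
}
```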

thewillyhuman commented 2 years ago

2021/09/03 18:21 - Moving Wikidata dump to AWS S3

AWS EMR uses AWS S3 as its file system, so we need to move our data there before processing it with AWS EMR. The latest Wikidata dump occupies 1.2 TiB, and the AWS S3 ingest speed is limited to 50 MiB/s, which works out to an upload of about 7 hours. The upload was started this morning and is now about to finish. Once it is done I will run a small integrity check to validate the uploaded data, and then we will start the subset generation process on AWS EMR.

Note: it is very important to enable Transfer Acceleration on both S3 buckets; otherwise, the AWS CLI raises a 400 response error.
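For reference, the acceleration switch can also be flipped programmatically. A minimal sketch with the AWS SDK for Java v1 (the bucket name is illustrative; the region matches the cluster commands below):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{
  BucketAccelerateConfiguration,
  BucketAccelerateStatus,
  SetBucketAccelerateConfigurationRequest
}

object EnableTransferAcceleration extends App {
  val bucket = "example-bucket" // illustrative; repeat for both buckets

  // Plain client, used only to flip the bucket-level flag.
  val s3 = AmazonS3ClientBuilder.standard().withRegion("eu-west-3").build()
  s3.setBucketAccelerateConfiguration(
    new SetBucketAccelerateConfigurationRequest(
      bucket,
      new BucketAccelerateConfiguration(BucketAccelerateStatus.Enabled)
    )
  )

  // Clients that upload afterwards must opt in to the accelerate endpoint.
  val acceleratedS3 = AmazonS3ClientBuilder.standard()
    .withRegion("eu-west-3")
    .withAccelerateModeEnabled(true)
    .build()
}
```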

thewillyhuman commented 2 years ago

2021/09/04 02:42 - Wikidata dump moved to AWS S3

The upload of the dump and the integrity check finished after 8 hours. I had to re-upload the data because I had attached the wrong ACL policy, which made the dump unreadable. The dump is located at s3://weso/datasets/wikidata/dump_20210903.json.
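With the dump in place, jobs on EMR can address it directly by its s3:// URI (EMRFS maps the s3:// scheme to S3). A minimal sketch of a post-upload sanity check run as a Spark step; the object and app names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object DumpSanityCheck extends App {
  // On EMR, EMRFS resolves s3:// URIs, so the dump reads like any path.
  val spark = SparkSession.builder().appName("dump-sanity-check").getOrCreate()

  // Each line of the Wikidata JSON dump holds one entity document.
  val lines = spark.read.textFile("s3://weso/datasets/wikidata/dump_20210903.json")

  // A cheap post-upload check: count the entity lines.
  println(s"Lines in dump: ${lines.count()}")

  spark.stop()
}
```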

thewillyhuman commented 2 years ago

2021/09/04 11:03 - Creating AWS EMR Cluster

Starting the creation of an EMR cluster with the following resources:

Node Type         Quantity    Storage (GiB)    Cores    Memory (GiB)
Master            1           128              8        32
Main              32          128              8        32
Total resources   33          4224             264      1056

Command to replicate cluster:

aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications Name=Hadoop Name=Spark --ebs-root-volume-size 10 --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-1ea4ae65","EmrManagedSlaveSecurityGroup":"sg-0bd2ee0a7497ff38e","EmrManagedMasterSecurityGroup":"sg-0216072475cb6a2ba"}' --service-role EMR_DefaultRole --enable-debugging --release-label emr-6.3.0 --log-uri 's3n://aws-logs-787866851299-eu-west-3/elasticmapreduce/' --name 'WSub Cluster' --instance-groups '[{"InstanceCount":32,"BidPrice":"OnDemandPrice","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Name":"Principal - 2"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Maestro - 1"}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region eu-west-3
thewillyhuman commented 2 years ago

2021/09/04 11:03 - Creating second AWS EMR Cluster

Starting the creation of a second EMR cluster with the following resources:

Node Type         Quantity    Storage (GiB)    Cores    Memory (GiB)
Master            1           7600             32       244
Main              15          7600             32       244
Total resources   16          121600           512      3904

Output:

==========================================
JOB INFORMATION
Job Name: cities_subset_full_dump
Job Date: 2021-09-04 03:32:28
Job Cores: 512
Job Mem: 3904 Gb
Job Time: 9781.008920513 seconds
==========================================
JOB RESULTS

prefix wde: <http://www.wikidata.org/entity/>
Start = @<City>
<City> { wde:P31 @<CityCode> }
<CityCode> [ wde:Q515 ]
Result: 4484 lines.

thewillyhuman commented 2 years ago

2021/09/06 14:27 - Executing the first optimization on the 2014 dump on the WESO local cluster

Without the optimization, the run on the same cluster took 37 minutes; with the first optimization it took 42 minutes. The next step is to figure out the reason for the worse performance.