MIDACO Parallelization in Python (Spark)
MIDACO 6.0 Python Spark Gateway: midaco.py.txt
The above gateway uses Spark instead of Multiprocessing in order to execute several solution candidates in parallel.
MIDACO parallelization with Apache Spark is especially useful for (massive) parallelization on cluster- and cloud-computing systems which consist of individual virtual machines (VM). Such systems are provided, for example, by Amazon EC2, Google Cloud, IBM Cloud, Digital Ocean and many academic institutions.
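Both approaches do the same thing conceptually: an objective function is mapped over a list of solution candidates. The difference is that Multiprocessing is limited to the cores of a single machine, while Spark distributes the same map over all worker nodes of a cluster. A minimal sketch of the two variants (illustrative names only, not the actual gateway code):

    def problem_function(x):                # illustrative objective (sphere function)
        return sum(xi * xi for xi in x)

    candidates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

    # Multiprocessing: parallel evaluation on the cores of one machine
    from multiprocessing import Pool
    with Pool(4) as pool:
        values = pool.map(problem_function, candidates)

    # Spark: the same map, distributed over the worker nodes of a cluster
    from pyspark import SparkContext
    sc = SparkContext(appName="parallel-evaluation-sketch")
    values = sc.parallelize(candidates, len(candidates)).map(problem_function).collect()
    sc.stop()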
Set up a Spark Cluster (Linux)
Step 0.1 | Set up several virtual machines (VM), each with its own IP address, for example: IP-VM1 = 100.100.10.01, IP-VM2 = 100.100.10.02, IP-VM3 = 100.100.10.03. Ensure that each VM can access the others via SSH keys.
Step 0.2 | Download Spark: https://spark.apache.org/downloads.html For example: spark-2.3.0-bin-hadoop2.7.tgz (pre-built for Apache Hadoop)
Step 0.3 | Store a copy of the unzipped Spark folder on every VM. Name it, for example, "spark"
Step 0.4 | Select one VM (e.g. 100.100.10.01) as master node by executing the command: ./spark/sbin/start-master.sh --host 100.100.10.01
Step 0.5 | Set up each of the other VMs as a slave node by executing the command: ./spark/sbin/start-slave.sh spark://100.100.10.01:7077
Step 0.6 | The Spark cluster should now be up and running. Visiting the address 100.100.10.01:8080 in a web browser should show the Spark master web UI listing the connected worker nodes.
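Before running MIDACO, the cluster can be checked with a short PySpark smoke test. The following is a minimal sketch, assuming the example master address 100.100.10.01 from Step 0.4:

    # smoke_test.py - verify that the cluster accepts and executes jobs
    from pyspark import SparkContext

    sc = SparkContext(master="spark://100.100.10.01:7077", appName="cluster-smoke-test")
    total = sc.parallelize(range(1000), 8).sum()   # trivial job spread over 8 partitions
    print("sum of 0..999 =", total)                # expected: 499500
    sc.stop()

Submit it from the master node with: ./spark/bin/spark-submit smoke_test.py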
Running MIDACO on the Spark Cluster
Step 1 | Download the above MIDACO Python Spark gateway and remove the .txt extension
Step 2 | Download the appropriate library file (midacopy.dll or midacopy.so) here
Step 3 | Download an example (e.g. example.py) and remove the .txt extension
Step 4 | Execute MIDACO on the Spark cluster with a command like this: ./spark/bin/spark-submit --master spark://100.100.10.01:7077 example.py
Note: The advanced Text-I/O examples are particularly well suited for use with Spark.
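Depending on the cluster setup, the gateway and the MIDACO library file may also have to be available on the worker nodes. One way to ship them together with the job is spark-submit's --py-files and --files options (a sketch, assuming midaco.py and midacopy.so are located in the current working directory): ./spark/bin/spark-submit --master spark://100.100.10.01:7077 --py-files midaco.py --files midacopy.so example.py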
Screenshot of MIDACO running on a Spark cluster with 32 quad-core CPUs
Comprehensive MIDACO Spark step-by-step instructions
These are some preliminary bash scripts for a 36-machine Spark cluster:
Note that the Spark-relevant commands inside midaco.py.txt itself are minimal:
[Line 24] from pyspark import SparkContext
[Line 237] sc = SparkContext(appName="MIDACO-SPARK-PARALLEL")
[Line 254] rdd = sc.parallelize( A , p ).map(lambda x: problem_function(x))
[Line 256] B = rdd.take(p)
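Put together, the pattern is: MIDACO hands the gateway a block A of p solution candidates, the gateway distributes them over p partitions, evaluates them in parallel, and returns the p results B to MIDACO. The following is a self-contained toy version of that pattern; the sphere objective is only a stand-in for the user's actual problem function:

    from pyspark import SparkContext

    def problem_function(x):     # stand-in objective; the gateway evaluates the user's problem here
        return sum(xi * xi for xi in x)

    p = 4                                                    # candidates evaluated per parallel block
    A = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]     # p solution candidates

    sc = SparkContext(appName="MIDACO-SPARK-PARALLEL")
    rdd = sc.parallelize(A, p).map(lambda x: problem_function(x))
    B = rdd.take(p)              # the p evaluation results, in candidate order
    print(B)                     # [1.0, 13.0, 41.0, 85.0]
    sc.stop()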