MIDACO Parallelization in Python (Spark)

 

MIDACO 6.0 Python Spark Gateway

 

midaco.py.txt

 

The above gateway uses Spark instead of Multiprocessing

in order to execute several solution candidates in parallel.

 

MIDACO parallelization with Apache Spark is especially useful for (massive) parallelization on cluster- and cloud-computing systems that consist of individual virtual machines (VM). Such systems are provided, for example, by Amazon EC2, Google Cloud, IBM Cloud, Digital Ocean and many academic institutions.
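
As a rough illustration of the swap described above, the following sketch contrasts the two gateways (the toy problem_function is an assumption for illustration; the real gateway feeds candidate solutions from MIDACO's internal workspace): Multiprocessing evaluates a block of candidates on the cores of one machine, whereas Spark distributes the same map over all worker nodes of the cluster.

  from multiprocessing import Pool
  from pyspark import SparkContext

  def problem_function(x):
      # toy stand-in for an expensive black-box objective evaluation
      return sum(xi * xi for xi in x)

  if __name__ == "__main__":
      A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # block of p solution candidates
      p = len(A)

      # Multiprocessing gateway: parallel over the cores of a single machine
      with Pool(p) as pool:
          B_local = pool.map(problem_function, A)

      # Spark gateway: the same map, distributed over the whole cluster
      sc = SparkContext(appName="MIDACO-SPARK-PARALLEL")
      B_cluster = sc.parallelize(A, p).map(problem_function).take(p)
      sc.stop()

      assert B_local == B_cluster  # both gateways return identical results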

 

 

Set up a Spark Cluster (Linux)

 

Step 0.1

 Set up several virtual machines (VM), each with its own IP address, for example:

 IP-VM1 = 100.100.10.01, IP-VM2 = 100.100.10.02, IP-VM3 = 100.100.10.03

 Ensure that the VMs can access each other via SSH keys

Step 0.2

 Download Spark:  https://spark.apache.org/downloads.html

 For example:  spark-2.3.0-bin-hadoop2.7.tgz (pre-built for Apache Hadoop)

Step 0.3  Store a copy of the unpacked Spark folder on every VM and name it, for example, "spark"
Step 0.4

 Select one VM (e.g. 100.100.10.01) as the master node by executing the command:

 ./spark/sbin/start-master.sh --host 100.100.10.01

Step 0.5

 Set up each remaining VM as a slave node by executing the command:

 ./spark/sbin/start-slave.sh spark://100.100.10.01:7077

Step 0.6

 The Spark cluster should now be up and running. Visiting the address

 100.100.10.01:8080  in a web browser should display the Spark master web UI, listing all connected worker nodes
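
 As an optional sanity check beyond the web UI, a trivial PySpark job can be submitted against the new master to confirm that the worker nodes accept tasks. A minimal sketch (the file name check.py is arbitrary):

  # check.py -- trivial job to verify that the cluster executes tasks
  from pyspark import SparkContext

  sc = SparkContext(appName="cluster-check")
  print(sc.parallelize(range(100), 4).sum())  # should print 4950
  sc.stop()

 Submit it with:  ./spark/bin/spark-submit --master spark://100.100.10.01:7077 check.py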

  

Running MIDACO on the Spark Cluster

 

Step 1  Download the above MIDACO Python Spark gateway (midaco.py.txt) and remove the .txt extension
Step 2  Download the appropriate MIDACO library file (midacopy.dll on Windows, midacopy.so on Linux)
Step 3  Download an example problem (e.g. example.py) and remove the .txt extension
Step 4

 Execute MIDACO on the Spark cluster with a command like this:

  ./spark/bin/spark-submit --master spark://100.100.10.01:7077 example.py
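
 For orientation, a hypothetical minimal skeleton of such a script looks as follows (a sketch; the actual example.py contains the full MIDACO problem definition). Note that no master URL is hard-coded: the --master flag of spark-submit supplies it to the SparkContext.

  from pyspark import SparkContext

  if __name__ == "__main__":
      sc = SparkContext(appName="MIDACO-SPARK-PARALLEL")
      # ... define the optimization problem and call the MIDACO gateway here ...
      sc.stop()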

 

Note: The advanced Text-I/O examples are particularly well suited for use with Spark.

 

 

 Screenshot of MIDACO running on a Spark cluster with 32 quad-core CPUs

 

 

Comprehensive MIDACO Spark step-by-step instructions

These are some preliminary bash scripts for a 36-machine Spark cluster:

spark_setup_commands.sh

run_example_on_spark.sh

Note that the Spark-relevant commands inside midaco.py.txt itself are minimal:

[Line  24]   from pyspark import SparkContext
[Line 237]   sc = SparkContext(appName="MIDACO-SPARK-PARALLEL")
[Line 254]   rdd = sc.parallelize( A , p ).map(lambda x: problem_function(x))
[Line 256]   B = rdd.take(p)
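
For context, the same pattern as a self-contained sketch (with hypothetical placeholder definitions for problem_function, A and p, which in midaco.py come from the solver's internal workspace):

  from pyspark import SparkContext

  def problem_function(x):
      # placeholder: evaluates objectives/constraints of one solution candidate x
      return [sum(xi * xi for xi in x)]

  p = 4                                               # number of parallel evaluations
  A = [[float(i), float(i) + 1.0] for i in range(p)]  # block of p solution candidates

  sc = SparkContext(appName="MIDACO-SPARK-PARALLEL")
  rdd = sc.parallelize(A, p).map(lambda x: problem_function(x))
  B = rdd.take(p)                                     # collect the p evaluation results
  sc.stop()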