Distributed Computations

1 minute read

Comes into being after deploying distributed data storage

Scatter/Gather

Scatter the data to a lots of individual nodes where its processed and gather those results back together.

  • Data stored locally is the key

Spark : scatter/Gather rater than map-reduce

Map-reduce

  • Hadoop - legacy pattern

Apache Storm : event based processing rather than Batch processing.

Map reduce

mappers and reducers

1. **Map Phase:**

   +------------------------+      +------------------------+
   |        Input Data      | ---> |        Mapper          |
   +------------------------+      +------------------------+
                                  |   (Key, Value) Pairs    |
                                  +------------------------+
                                         |         |
                                         |         |
                                         |         |
                                  +------------------------+
                                  |     Shuffle & Sort      |
                                  +------------------------+
                                         |         |
                                         V         V
                                 +----------+  +----------+
                                 |   Key    |  |   Key    |
                                 | Partition|  | Partition|
                                 +----------+  +----------+
                                         |         |
                                         V         V
                                  +------------------------+
                                  |      Reducer           |
                                  +------------------------+
                                         |         |
                                         V         |
                                  +------------------------+
                                  |      Output Data       |
                                  +------------------------+

2. **Reduce Phase:**

   +------------------------+
   |     Intermediate      |
   |     Key-Value Pairs   |
   +------------------------+
              |
              V
   +------------------------+
   |        Reducer         |
   +------------------------+
   |    (Key, List of Values)|
   +------------------------+
              |
              V
   +------------------------+
   |      Output Data       |
   +------------------------+

Hadoop

Distributed Computing Framework

  • map reduce API
  • map reduce job management
  • HDFS (Hadoop distributed filesystem)
  • Enormous eco system
    • hbase, hive, pig, zoo keeper, mahaut, sqoop, flume

HDFS

  • files & directories
  • metadata management by a replicated master
  • files stored in large, immutable, replicated blocks