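1. Immutability of RDDs/DataFrames in Apache Spark
Immutability ensures consistency and fault tolerance in distributed environments. Because an RDD or DataFrame is never modified in place, every transformation produces a new dataset, and concurrent tasks can read the same data without locking. This simplifies parallel processing and caching. It also makes recovery possible: Spark records the chain of transformations (the lineage in the DAG), so if a partition is lost to a failure it can be recomputed by replaying those steps on the unchanged input.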
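The recovery idea can be sketched with a toy model in plain Python. This is not Spark code: the partition layout, the two lineage steps, and the `compute_partition` helper are all hypothetical, chosen only to show that an immutable source plus a recorded list of steps is enough to rebuild a lost result.

```python
from types import MappingProxyType

# Read-only source partitions, standing in for data in durable storage.
source = MappingProxyType({0: (1, 2), 1: (3, 4), 2: (5, 6)})

# "Lineage": an ordered, fixed sequence of pure per-partition steps.
lineage = (
    lambda rows: tuple(r * 10 for r in rows),
    lambda rows: tuple(r + 1 for r in rows),
)

def compute_partition(pid):
    """Rebuild one partition by replaying the lineage on the unchanged source."""
    rows = source[pid]
    for step in lineage:
        rows = step(rows)
    return rows

computed = {pid: compute_partition(pid) for pid in source}
original = computed[1]

del computed[1]                        # simulate losing the node that held partition 1
computed[1] = compute_partition(1)     # recompute only the lost partition
recovered = computed[1]
```

Since neither the source nor the steps can change, the recomputed partition is guaranteed to match the one that was lost.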
2. "count" in groupBy as Transformation vs. Action
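The name count refers to two different operations depending on what it is called on. After groupBy, count is a method on the grouped data and returns a new DataFrame with one row per group and a count column; this is a transformation, so nothing runs until an action is triggered. Called directly on a DataFrame or RDD, count is an action: it triggers a job and returns a single number to the driver.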
3. Laziness of Transformations in Spark
Transformations are lazy: calling one only records a step in the execution plan. Nothing is computed until an action is triggered, and by then Spark can see the whole plan and optimize it, for example by pipelining consecutive steps into one stage or by skipping work whose result is never used.
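As a minimal sketch of this behavior, the toy class below records transformations and runs them only when an action is called. It is plain Python, not the Spark API: `LazyDataset`, `traced_double`, and the `calls` list are hypothetical names used for illustration.

```python
class LazyDataset:
    """Toy stand-in for a lazy dataset: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection of input records
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a NEW dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: the recorded plan is executed here, in a single pass.
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


calls = []

def traced_double(x):
    calls.append(x)   # side effect used only to observe when work happens
    return x * 2

ds = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 4)
work_before_action = len(calls)   # no records processed yet
result = ds.collect()             # the action triggers execution
work_after_action = len(calls)
```

Building `ds` does no work at all. The map and filter run together in one pass inside `collect`, which is a simplified picture of how narrow transformations get pipelined.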
4. Partition Skew in Apache Spark
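Partition skew occurs when data is distributed very unevenly across partitions, typically because a few keys account for most of the rows. The tasks that process the oversized partitions run far longer than the rest and hold up the whole stage. Mitigation involves techniques like repartitioning on a better-distributed key, salting hot keys so they spread over several partitions, using a custom partitioner, or, in Spark 3.x, enabling adaptive query execution so that skewed join partitions are split automatically.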
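The effect of salting can be shown with a small simulation. This is a toy model in plain Python under stated assumptions: integer keys, a simple modulo partitioner, four partitions, and a made-up dataset in which key 7 is "hot". A real hash partitioner would hash the composite (key, salt) value instead of adding the two.

```python
from collections import Counter

NUM_PARTITIONS = 4
SALT_BUCKETS = 4

# Hypothetical skewed dataset: key 7 appears 1000 times, four other keys 10 times each.
keys = [7] * 1000 + [0, 1, 2, 3] * 10

def partition_plain(key):
    # Toy partitioner: every row with the same key lands in the same partition.
    return key % NUM_PARTITIONS

def partition_salted(key, salt):
    # Toy salted partitioner: the salt spreads one key over several partitions.
    return (key + salt) % NUM_PARTITIONS

plain = Counter(partition_plain(k) for k in keys)

# Rotate the salt per row so the hot key is split into SALT_BUCKETS sub-keys.
salted = Counter(
    partition_salted(k, i % SALT_BUCKETS) for i, k in enumerate(keys)
)

largest_plain = max(plain.values())     # size of the biggest partition without salting
largest_salted = max(salted.values())   # size of the biggest partition with salting
```

Without salting, one partition holds almost every row; with salting, the rows are spread almost evenly. The trade-off is that an aggregation over salted keys needs a second step that combines the partial results per original key.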
5. Normal Join vs. Broadcast Join
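A normal (shuffle) join, such as a sort-merge join, repartitions both datasets by the join key so that matching rows end up on the same node; this moves data across the network and can be expensive. A broadcast join instead sends a full copy of the smaller dataset to every executor, so the larger dataset is joined in place without being shuffled. It only works when the smaller side fits comfortably in memory; Spark chooses it automatically for tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default), and it can also be requested with a broadcast hint.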
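The difference in data movement can be sketched with a toy simulation in plain Python. Nothing here is Spark API: the table contents, `NUM_EXECUTORS`, and both join functions are hypothetical, and the shuffle version counts every row as crossing the network, which is an upper bound.

```python
from collections import defaultdict

NUM_EXECUTORS = 3

# Hypothetical large table: 100 (key, order_id) rows on each executor.
orders_by_executor = [
    [(i % 3 + 1, f"o{e}_{i}") for i in range(100)] for e in range(NUM_EXECUTORS)
]
# Hypothetical small lookup table.
countries = {1: "DE", 2: "FR", 3: "JP"}

def broadcast_join(large_parts, small):
    """Copy the small table to every executor; large rows never move."""
    shipped = len(small) * len(large_parts)
    joined = [
        (key, order, small[key])
        for part in large_parts
        for key, order in part
        if key in small
    ]
    return joined, shipped

def shuffle_join(large_parts, small, num_parts):
    """Repartition both sides by key, then join partition by partition."""
    large_buckets = defaultdict(list)
    small_buckets = defaultdict(dict)
    shipped = 0
    for part in large_parts:
        for key, order in part:
            large_buckets[key % num_parts].append((key, order))
            shipped += 1          # upper bound: assume the row changes node
    for key, country in small.items():
        small_buckets[key % num_parts][key] = country
        shipped += 1
    joined = [
        (key, order, small_buckets[p][key])
        for p in range(num_parts)
        for key, order in large_buckets[p]
        if key in small_buckets[p]
    ]
    return joined, shipped

b_rows, b_shipped = broadcast_join(orders_by_executor, countries)
s_rows, s_shipped = shuffle_join(orders_by_executor, countries, NUM_EXECUTORS)
```

Both strategies produce the same joined rows, but the broadcast version ships only a few copies of the small table, while the shuffle version moves rows in proportion to the size of the large one.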
6. Serialization Issue in Spark
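Serialization converts objects into a byte format for storage or transmission; Spark relies on it to ship tasks to executors, shuffle data between nodes, and cache data in serialized form. Issues typically show up in two ways: a job fails because a task references an object that cannot be serialized (the familiar "Task not serializable" error), or a job runs slowly because the default Java serialization produces large payloads during shuffles. Solutions include implementing the Serializable interface for custom classes, keeping non-serializable objects such as connections out of task closures, and switching to a more efficient serializer like Kryo through the spark.serializer setting.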
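The first kind of failure can be reproduced in miniature with Python's standard pickle module. This is an analogy and not Spark's JVM serializer: the `try_serialize` helper and the two state dictionaries are hypothetical, with a thread lock standing in for any resource that cannot be turned into bytes.

```python
import pickle
import threading

def try_serialize(obj):
    """Return (payload, None) on success or (None, error) if obj cannot be serialized."""
    try:
        return pickle.dumps(obj), None
    except (TypeError, pickle.PicklingError) as exc:
        return None, exc

# State a task might capture by accident: plain data plus a live lock.
bad_state = {"weight": 3, "lock": threading.Lock()}
bad_payload, bad_error = try_serialize(bad_state)

# Fix: ship only the plain data and recreate the resource on the receiving side.
good_state = {"weight": 3}
good_payload, good_error = try_serialize(good_state)

restored = pickle.loads(good_payload)
restored["lock"] = threading.Lock()   # rebuilt locally, never sent over the wire
```

The same pattern applies to Spark tasks: send the data needed to rebuild a resource, and create the resource itself on the executor.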
7. Overview of Apache Spark Concepts
Various Apache Spark concepts include Resilient Distributed Dataset (RDD), DataFrame, Dataset, transformations, actions, directed acyclic graph (DAG), stages, tasks, cluster manager, driver program, executors, shuffling, Catalyst optimizer, distributed caching, broadcast variables, and more.
8. Cluster vs. Cluster Manager
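In distributed computing, a cluster is a group of interconnected computers or nodes that work together, while a cluster manager is the software that allocates and coordinates resources (CPU, memory) across those nodes. Spark itself does not own the machines; it asks a cluster manager, such as its built-in standalone manager, YARN, or Kubernetes, for the resources to launch executors.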
9. Driver Memory Allocation in Spark
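When submitting a Spark application, driver memory is specified with properties like spark.driver.memory and spark.driver.memoryOverhead. The first sets the driver's JVM heap and can also be passed as the --driver-memory flag of spark-submit. The second is additional non-heap memory requested for the driver in cluster mode; it covers things like JVM overheads, interned strings, and other native allocations. On resource managers such as YARN and Kubernetes, the driver's container is sized from the sum of the two.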
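How the two settings add up can be estimated with a few lines of Python. This is a rough sketch, assuming the documented default for JVM jobs in which the overhead is 10% of driver memory with a minimum of 384 MiB; the function name and its parameters are hypothetical, and other memory components are ignored.

```python
MIN_OVERHEAD_MIB = 384          # documented minimum for the default overhead
DEFAULT_OVERHEAD_FACTOR = 0.10  # assumed default factor for JVM jobs

def driver_container_mib(driver_memory_mib, overhead_mib=None,
                         factor=DEFAULT_OVERHEAD_FACTOR):
    """Estimate the memory requested for the driver container, in MiB."""
    if overhead_mib is None:
        # No explicit spark.driver.memoryOverhead: derive it from the heap size.
        overhead_mib = max(int(driver_memory_mib * factor), MIN_OVERHEAD_MIB)
    return driver_memory_mib + overhead_mib

small_driver = driver_container_mib(2048)          # overhead falls back to the minimum
large_driver = driver_container_mib(8192)          # overhead is 10% of the heap
explicit = driver_container_mib(4096, overhead_mib=1024)
```

For a small heap the 384 MiB minimum dominates, so a 2 GiB driver asks for noticeably more than 2 GiB from the cluster manager.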