Spark performance tuning guidelines

Srinivasa Rao • April 21, 2020

Spark tuning for high performance

1 Introduction

This document outlines Spark performance tuning guidelines and explains how to configure them when running Spark jobs. Following them helps Spark perform well and avoids resource bottlenecks. Each property and setting should be adjusted so that resources are used correctly for the specific system setup. Apache Spark computes in memory by design, so cluster resources (CPU, memory and so on) can easily become bottlenecks.

2 Spark application execution modes

Spark jobs can run in three modes: local, client and cluster.

2.1 Local Mode

Local mode runs both the driver and the executors on a single node. In this mode, the partitions are processed by multiple threads in parallel. The number of threads can be controlled by the user when submitting a job (for example, --master local[4] uses four threads). This mode is meant only for a single-node setup, for example while learning Spark.

Eg:

$ spark-submit --master local <other parameters>

2.2 Client mode

In the client mode, the driver process runs on the client node (that is, the edge or gateway node) on which the job was submitted. The client node provides resources, such as memory, CPU, and disk space to the driver program, but the executors run on the cluster nodes and they are maintained by the cluster manager, such as YARN.

[Figure: client mode. E1, E2 and E3 are executors.]

Eg:

$ spark-submit \
    --master yarn \
    --deploy-mode client \
    <other parameters>

2.3 Cluster mode

Cluster mode is similar to client mode, except that the driver process runs on one of the cluster's worker machines, and the cluster manager is responsible for both the driver and the executor processes. This makes it easier to run multiple applications at the same time, because the cluster manager distributes the driver load across the cluster.

This mode is the most common and recommended mode for running the Spark applications in production environments.

[Figure: cluster mode. E1, E2 and E3 are executors.]

Eg:

$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    <other parameters>

Always run production Spark jobs in YARN cluster mode.

3 Spark execution options

The following resource allocation parameters can be set when submitting Spark jobs from clients such as spark-submit, Talend and so on. If you don't pass any of them, Spark takes the values from spark-defaults.conf, which usually allocates minimal resources that may not suit every Spark job.

These parameters should be set at the job level at submission time, because different jobs need different kinds and amounts of resources depending on their complexity, the number of jobs running in parallel at that time and so on. No single setting suits every purpose. Take the following into consideration when setting these parameters.

The number of concurrent jobs running (or needing to run) at that time.

The resources allocated to the Spark YARN queue, and thereby the amount of resources available to Spark.

The amount of data the Spark job processes.

Whether the Spark job uses wide transformations such as reduceByKey, aggregateByKey and so on.

The amount of shuffling that takes place.

Whether the job caches any data in memory.

spark-submit offers many more parameters; this section explains the ones that are critical and have considerable impact on Spark applications.

3.1 Spark driver memory

The amount of memory to use for the driver process. The default is 1g. When you use accumulators, or bring a lot of data back to the driver with actions such as collect() or take(N), you might need to increase it to an appropriate size.

spark.driver.memory (from configuration file) or --driver-memory (from spark-submit)

3.2 Spark driver core

Number of cores used by the driver process. Default is 1 core. It's a good idea to allocate 2 to 4 based on cluster size and complexity of the code.

spark.driver.cores (from configuration file) or --driver-cores (from spark-submit)

3.3 Spark number of executors

This is one of the critical parameters and may need some calculation. It is the number of executors Spark launches for a submitted job. The parameter applies to the cluster as a whole, not to each node.

There are multiple ways to calculate how many executors you need. In the fat executor approach, you set one executor per node; in the tiny executor approach, you set one executor per CPU core. In my experience the right answer lies somewhere in between: in general, allocate 2 to 5 cores to each executor.

For example, if you have 30 nodes in the cluster (assume each node has 8 cores allocated to the YARN queue that you use to submit Spark jobs), setting 30 executors gives 1 executor per node with 8 cores each. Setting 240 executors gives 8 executors per node with 1 core each.

The number of executors determines parallelism, because tasks run within executors. Parallelism comes with overheads, and sometimes the overheads outweigh its benefits. Especially when your job does a lot of shuffling or uses wide transformations, it's a good idea to go with smaller partitions. The right number also depends on the complexity of the Spark code and the number of jobs running concurrently.

I would generally go with 2 to 5 cores for each executor based on availability, and leave 1 to 2 cores on each node unallocated for other tasks.

Below are some examples on how to calculate the number of executors:

Number of Nodes: 30

Number of cores on each node: 32

Amount of RAM: 64GB

Spark YARN queue: 30%, taken here as approximately 10 cores and 21 GB of RAM available on each node, for a total of 300 cores (30 nodes * 10) and 630 GB of memory (30 * 21 GB).

Scenario I

Number of parallel spark jobs = 1

Number of cores per executor = 4

Number of executors = (300 - 60)/4 = 60 (leaving 2 cores on each node for other purposes)

That will make 2 executors for each node.

Scenario II

Number of parallel spark jobs = 3

Number of cores per executor = 4

Number of executors for each job = ((300 - 60)/4)/3 = 60/3 = 20 (leaving 2 cores unused on each node for other purposes). This gives up to 2 executors per node, shared by 3 jobs instead of the single job in Scenario I. Each Spark job gets a total of 20 executors.

Scenario III

Number of parallel spark jobs = 3

Number of cores per executor = 3

Number of executors for each job = ((300 - 30)/3)/3 = 90/3 = 30 (leaving 1 core unused on each node for other purposes). In this case there are 3 executors on each node, but with 3 jobs running, one executor on each node is allocated to each job. Each Spark job gets a total of 30 executors.

Scenarios II and III above assume that all 3 Spark jobs require similar resources and have similar complexity. You can allocate different cores and different numbers of executors to each Spark job as long as the total stays within the resources allocated to that queue. If the total exceeds the available resources, subsequent Spark jobs will wait for resources to be released.

You can use the Spark UI and the Spark application logs to see how resources are utilized, and adjust the settings iteratively.

spark.executor.instances (from configuration file) or --num-executors (from spark-submit)
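The arithmetic behind the three scenarios can be sketched in a few lines of Python. This is only an illustration of the calculation described above; the function name and its parameters are made up for this example and are not part of any Spark API.

```python
def executors_per_job(nodes, cores_per_node, reserved_cores_per_node,
                      cores_per_executor, parallel_jobs=1):
    """Return (executors for each job, executors on each node).

    cores_per_node is the number of cores the YARN queue gets on each node,
    and reserved_cores_per_node is what is left unallocated for other work.
    """
    usable_cores_per_node = cores_per_node - reserved_cores_per_node
    # An executor cannot span nodes, so size each node first.
    executors_on_node = usable_cores_per_node // cores_per_executor
    total_executors = nodes * executors_on_node
    return total_executors // parallel_jobs, executors_on_node


if __name__ == "__main__":
    # 30 nodes, about 10 queue cores per node, as in the example above.
    print("Scenario I:  ", executors_per_job(30, 10, 2, 4, parallel_jobs=1))
    print("Scenario II: ", executors_per_job(30, 10, 2, 4, parallel_jobs=3))
    print("Scenario III:", executors_per_job(30, 10, 1, 3, parallel_jobs=3))
```

The first value of each result is what you would pass to --num-executors for one job; the second is how many executors land on each node across all jobs.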

3.4 Spark executor memory

The amount of memory allocated to each executor process. The default is 1g. There are many things to consider when setting this, such as how much data you are processing, how much shuffling takes place and so on.

For example, if you have 300 GB available on the cluster (30 nodes * 10 GB) for the YARN queue you use to run Spark jobs, and you have decided to run 30 executors based on the above calculations, you may set this parameter to 10g (300/30), assuming only one Spark job is running at that time.

spark.executor.memory (from configuration file) or --executor-memory (from spark-submit)
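As a rough sketch of the division described above, the helper below splits the queue's memory evenly across executors. The function name is hypothetical, and the model is deliberately simple: on YARN each executor container also requests some memory overhead on top of this value, which is not modelled here.

```python
def executor_memory_setting(queue_memory_gb, num_executors):
    """Split the queue's memory evenly and format it for --executor-memory."""
    if num_executors <= 0:
        raise ValueError("num_executors must be positive")
    # Round down to whole gigabytes so the total never exceeds the queue.
    per_executor_gb = queue_memory_gb // num_executors
    return f"{per_executor_gb}g"


if __name__ == "__main__":
    # 30 nodes * 10 GB shared by 30 executors of a single job.
    print(executor_memory_setting(30 * 10, 30))
    # The same queue shared by 3 jobs with 30 executors each.
    print(executor_memory_setting(30 * 10, 3 * 30))
```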

3.5 Spark executor cores

The number of cores allocated to each executor. The default is 1 on YARN. See the examples above for how to arrive at the right number.

spark.executor.cores (from configuration file) or --executor-cores (from spark-submit)

eg:

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --executor-cores 3 \
    --num-executors 30 \
    --driver-memory 5g \
    --driver-cores 2 \
    <other parameters>
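If jobs are launched from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch: the function name and the application file my_job.py are placeholders invented for this example, and the flags are the spark-submit options discussed in this section.

```python
import shlex


def build_spark_submit(app, master="yarn", deploy_mode="cluster",
                       executor_memory="10g", executor_cores=3,
                       num_executors=30, driver_memory="5g", driver_cores=2):
    """Return the spark-submit command as an argument list."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        "--driver-memory", driver_memory,
        "--driver-cores", str(driver_cores),
        app,  # the application file goes last, after all options
    ]


if __name__ == "__main__":
    # my_job.py is a hypothetical application file.
    print(shlex.join(build_spark_submit("my_job.py")))
```

Keeping the command as a list makes it easy to pass to subprocess.run without shell quoting problems.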

4 Compression

Compressing data at various levels improves overall performance. It reduces storage space, speeds up I/O, reduces network costs and improves memory usage, at the cost of some computation overhead.

Spark can transparently read files that are compressed using LZ4, Snappy and GZIP codecs.

Here are some properties that can be set within spark code and also at the cluster level which will improve the performance of spark jobs.

spark.sql.inMemoryColumnarStorage.compressed=true

When set to true, Spark SQL automatically selects a compression codec for each column based on statistics of the data.

spark.io.compression.codec=snappy

Specifies the codec used to compress internal data such as RDD partitions, event logs, broadcast variables and shuffle outputs.

spark.broadcast.compress=true

Whether to compress broadcast variables before sending them. Generally a good idea.

spark.sql.inMemoryColumnarStorage.batchSize=10000

The size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.

5 Parallelism and Partitions

Spark automatically sets the number of “map” tasks to run on each input file according to its size. For distributed “reduce” operations such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions. Sometimes you may need to use repartition or coalesce to change the number of partitions.

You can also set the spark.default.parallelism parameter to change the default. In general, we recommend 2 to 3 tasks per CPU core in your cluster, and up to 4 partitions per core can work well.

It is important to know that tasks and executors are different. Tasks are units of work that run as threads within an executor. The number of executors multiplied by the number of cores per executor determines how many tasks can run at the same time.

You can also set the number of partitions within the transformations.

val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)

For a narrow operation such as “read”, the number of partitions is roughly (size of the file / amount of data you want to process in each partition).

For “reduce” operations with a lot of shuffling, start with X equal to the number of partitions used when reading the file (the map tasks), then keep multiplying X by 1.5 until performance stops improving.
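The partition rules of thumb in this section reduce to simple arithmetic, sketched below. The function names, the 128 MB target size and the 10 GB file are assumptions chosen for illustration only; pick values that match your own data.

```python
import math


def read_partitions(file_size_mb, target_partition_mb):
    """Partitions for a read: file size divided by the desired partition size."""
    return max(1, math.ceil(file_size_mb / target_partition_mb))


def partitions_for_cores(total_cores, partitions_per_core=3):
    """Partitions suggested by the tasks-per-core rule of thumb."""
    return total_cores * partitions_per_core


def shuffle_partition_candidates(initial_partitions, steps=4, factor=1.5):
    """Values of X to try for a shuffle-heavy stage, growing by 1.5 each time."""
    candidates = []
    x = initial_partitions
    for _ in range(steps):
        candidates.append(x)
        x = math.ceil(x * factor)
    return candidates


if __name__ == "__main__":
    initial = read_partitions(10 * 1024, 128)   # hypothetical 10 GB input file
    print(initial)
    print(partitions_for_cores(240))            # e.g. 60 executors * 4 cores
    print(shuffle_partition_candidates(initial))
```

Each candidate would be tried as the numPartitions argument of the shuffle operation, stopping once the run time no longer improves.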

6 Storage format

Each file format has pros and cons and each output type needs to support a unique set of use-cases. For each output type, we chose the file format that maximizes the pros and minimizes the cons.

The following file formats are used for each output type.

Ingested data: SequenceFiles provide efficient writes for blob data.

Intermediate data: Avro offers rich schema support and more efficient writes than Parquet, especially for blob data.

Final output: Combination of Parquet, Avro, and JSON files

Imagery metadata: Parquet is optimized for efficient queries and filtering.

Imagery: Avro is better optimized for binary data than Parquet and supports random access for efficient joins.

Aggregated metadata: JSON is efficient for small record counts distributed across a large number of files and is easier to debug than binary file formats.

7 Shuffling

Transformations such as reduce, groupBy, reduceByKey and sortBy use shuffling. Shuffling is an expensive operation. Where possible, use reduceByKey instead of groupByKey. Sometimes sorting datasets before reduce operations also helps.

Increase the level of parallelism to reduce OutOfMemory errors, since each task then works on a smaller slice of the data.

Using broadcasting can greatly reduce the size of each serialized task and the cost of launching a job over the cluster. If your tasks use a large object from the driver program, such as a static lookup table, consider turning it into a broadcast variable.

You can set the default number of shuffle partitions within Spark code, in spark-defaults.conf, or at the cluster level.

spark.sql.shuffle.partitions=20 configures the number of partitions to use when shuffling data for joins or aggregations (the default is 200).

8 Broadcasting

8.1 Broadcast hints

The BROADCAST hint guides Spark to broadcast each specified table when joining them with another table or view. When Spark decides the join methods, the broadcast hash join (i.e., BHJ) is preferred, even if the statistics is above the configuration spark.sql.autoBroadcastJoinThreshold value. When both sides of a join are specified, Spark broadcasts the one having the lower statistics. Note Spark does not guarantee BHJ is always chosen, since not all cases (e.g. full outer join) support BHJ. When the broadcast nested loop join is selected, we still respect the hint.

eg.:

from pyspark.sql.functions import broadcast

broadcast(spark.table("src")).join(spark.table("records"), "key").show()

8.2 Broadcast Variables

The broadcast variables are only copied once to each executor and then tasks can use that copy for computation. This improves both run time and storage space. Broadcasting large variables and some small datasets improve overall performance of the spark code.

Broadcast variables are read-only. Spark also provides accumulators, which are equivalent to counters in MapReduce. Accumulators are write-only from the point of view of tasks (only the driver can read their value) and can be used to gather statistics such as the number of bad records in the input file while running your application.

Eg:

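// Small lookup table that every task needs
val lookup = Map("This" -> "frequent", "is" -> "frequent", "my" -> "moderate", "file" -> "rare")
// Ship one read-only copy to each executor
val broadcastLookup = sc.broadcast(lookup)
// Counter for bad records; the accumulator v1 API (sc.accumulator) was removed in Spark 3.0
val acc = sc.longAccumulator("badRecords")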

9 In-memory Caching

Spark SQL can cache tables in memory so that they can be reused.

spark.catalog.cacheTable("tableName")

Spark SQL will scan only the required columns and will automatically tune compression to minimize memory usage and GC pressure. Remember to uncache the table with spark.catalog.uncacheTable("tableName") once it is no longer required.

10 Data Serialization in Spark

Serialization is the process of converting an in-memory object to another format that can be stored in a file or sent over the network. It plays an important role in the performance of any distributed application: formats that are slow to serialize or that consume a large number of bytes slow down the computation. Apache Spark provides two serialization libraries:

· Java serialization

· Kryo serialization

Java serialization – By default, Spark serializes objects using Java's ObjectOutputStream framework, which works with any class that implements java.io.Serializable. The performance of serialization can be controlled more closely by extending java.io.Externalizable. Java serialization is flexible but slow, and leads to large serialized formats for many classes.

Kryo serialization – Spark can also use the Kryo library to serialize objects. Kryo is faster and more compact than Java serialization, but it does not support all Serializable types. For best performance, register the classes you use in advance. You can switch to Kryo by initializing your job with a SparkConf and calling:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Use the registerKryoClasses method to register your own classes with Kryo. If your objects are large, you may also need to increase the spark.kryoserializer.buffer config. The value should be large enough to hold the largest object you want to serialize.

//Scala

// Fail fast if an unregistered class is serialized
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[FirstCustomClass], classOf[SecondCustomClass]))

11 Spark Driver

Because the Spark driver is responsible for DAG construction and task scheduling, the driver itself sometimes becomes the bottleneck. You can also configure Spark to use dynamic executor allocation, which lets Spark add or remove executors on the fly. If dynamic allocation is enabled and your job no longer needs some executors, for example after an operation such as coalesce(), the driver frees those resources. Compared to traditional static allocation, where resources have to be reserved up front, this brings better resource utilization and can be useful in multi-tenant environments. You can enable it by setting the following properties for your application:

spark.dynamicAllocation.enabled = true

spark.dynamicAllocation.executorIdleTimeout = 2m

spark.dynamicAllocation.minExecutors = n (set minimum number of executors)

spark.dynamicAllocation.maxExecutors = m (set maximum number of executors)
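One way to keep these four properties consistent is to generate them from a small helper that checks the bounds first. This is a sketch only: the function names are made up, and the values 2 and 20 in the usage lines are placeholders for n and m.

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout="2m"):
    """Build the dynamic allocation properties listed above as a dict."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf):
    """Render a properties dict as repeated --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # 2 and 20 stand in for the n and m placeholders above.
    print(" ".join(to_submit_args(dynamic_allocation_conf(2, 20))))
```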

You can also configure the memory allocated to the driver program by setting a flag for --driver-memory.

12 Other Performance tuning considerations

12.1 Caching Level

In the case of caching, we need to be careful when choosing the storage level. The default caching level for RDDs is MEMORY_ONLY, which keeps the data in memory. If the RDD is too big to fit in memory, Spark caches as many partitions as it can and recomputes the remaining partitions each time they are needed. You will want to avoid this if your RDD is expensive to recompute. In such cases, use MEMORY_AND_DISK as the storage level: it writes the partitions that don't fit to disk and reads them back when they are needed, without recomputing them.

Other storage levels such as MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2 and MEMORY_AND_DISK_2 are used less often but may improve performance in certain use cases.
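The difference between the two main storage levels can be pictured with a toy model. The sketch below is not how Spark's block manager works internally (it ignores eviction order, serialization and unified memory); it only shows where each partition would end up when the cache is too small. The function name and the sizes are invented for this example.

```python
def plan_cache(partition_sizes_mb, memory_budget_mb, storage_level="MEMORY_ONLY"):
    """Return where each partition ends up under a simplified caching model."""
    if storage_level not in ("MEMORY_ONLY", "MEMORY_AND_DISK"):
        raise ValueError("this sketch only models MEMORY_ONLY and MEMORY_AND_DISK")
    # Partitions that do not fit are recomputed or written to disk.
    fallback = "recompute" if storage_level == "MEMORY_ONLY" else "disk"
    placement = []
    used_mb = 0
    for size in partition_sizes_mb:
        if used_mb + size <= memory_budget_mb:
            used_mb += size
            placement.append("memory")
        else:
            placement.append(fallback)
    return placement


if __name__ == "__main__":
    sizes = [100, 100, 100, 100]  # four partitions of 100 MB each
    print(plan_cache(sizes, 250, "MEMORY_ONLY"))
    print(plan_cache(sizes, 250, "MEMORY_AND_DISK"))
```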

12.2 Enabling Tungsten

The goal of Project Tungsten is to improve the efficiency of CPU and memory for Spark applications. Three of the main optimizations are the following:

Memory management: This allows Spark to manage memory by eliminating the overhead of Java virtual machine (JVM) objects that reduces memory consumption and garbage collection (GC) overhead.

Binary processing: One of the aims of Project Tungsten is to process the data in binary format itself, instead of JVM objects, which are heavier in size. This binary format is known as the UnsafeRow format.

Code generation: Using this feature, Spark uses optimization of structured APIs to directly generate the bytecode for your code. This can bring lots of advantages when you write large queries.

Project Tungsten is also home to other optimizations for operations such as sorting, joins and shuffles. Older releases (Spark 1.5) let you enable or disable Tungsten with the spark.sql.tungsten.enabled configuration; in current releases it is enabled by default.

12.3 Task scheduler

It is the responsibility of the task scheduler to schedule tasks on executors based on the available cores and data locality. By default, each task is assigned a single core. You can change this with spark.task.cpus; set it above 1 only when an individual task can actually use multiple cores.

12.4 Catalyst optimizer

If you are using structured APIs such as SQL, DataFrames and Datasets, the Catalyst optimizer creates a logical query plan for your transformations, optimizes it, and then converts the optimized logical plan into a physical query plan.

12.5 Join Performance

Spark SQL provides a variety of joins including Shuffle hash join, Broadcast hash join and Cartesian join.

If the size of both the tables is large and they both are bucketed/partitioned on the same joining field, then shuffle hash join can best suit your needs. It works best when the data is evenly distributed based on the key field and there are enough unique values for that key field to achieve the necessary parallelism. If one of the tables is small in size, you can use broadcast join, which caches the small data on each machine and avoids shuffle.
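The size-based choice described above can be written down as a small decision function. This is a simplification for illustration: Spark's planner also looks at the join type, hints and other settings. The function name is hypothetical; the 10 MB constant is Spark's documented default for spark.sql.autoBroadcastJoinThreshold.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def suggest_join(left_bytes, right_bytes,
                 threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Suggest a join strategy from the estimated sizes of the two tables."""
    if min(left_bytes, right_bytes) <= threshold_bytes:
        # One side is small enough to copy to every executor, avoiding a shuffle.
        return "broadcast hash join"
    # Both sides are large, so the data is shuffled by the join key.
    return "shuffle hash join"


if __name__ == "__main__":
    gb, mb = 1024 ** 3, 1024 ** 2
    print(suggest_join(5 * gb, 2 * mb))
    print(suggest_join(5 * gb, 3 * gb))
```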

12.6 Data Locality

Data locality can have a major impact on performance. It means code and data stay together. Shipping code is much easier than shipping data as data is much larger than code. Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

· PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible

· NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes

· NO_PREF data is accessed equally quickly from anywhere and has no locality preference

· RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch

· ANY data is elsewhere on the network and not in the same rack

You can tune how long Spark waits for a better locality level before falling back to the next one using the spark.locality.wait configuration options.

12.7 Code generation

Spark can generate bytecode for your queries on the fly. In early Spark 1.x releases, Spark compiles each query to Java bytecode when spark.sql.codegen is set to true, and the property defaults to false; for large queries, setting it to true can improve performance. In Spark 2.0 and later, whole-stage code generation is enabled by default and is controlled by spark.sql.codegen.wholeStage.

12.8 Speculative execution

In the Spark UI you will often see some tasks running slower than others. When spark.speculation is enabled, Spark identifies slow-running tasks and launches copies of them on other nodes so that the job completes sooner. Whichever copy finishes first is used, and Spark kills the other.

#Python

conf.set("spark.speculation","true")
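The idea of "identifying slow-running tasks" can be sketched as a comparison against the median duration of tasks that have already finished. This is a simplified model with invented names, not Spark's scheduler code; the 1.5 multiplier is an illustrative value, and in Spark the corresponding knobs are the spark.speculation.multiplier and spark.speculation.quantile properties.

```python
from statistics import median


def speculation_candidates(running_secs, finished_secs, multiplier=1.5):
    """Return ids of running tasks that look slow compared to finished ones.

    running_secs maps task id to elapsed seconds; finished_secs is a list of
    durations of tasks that have already completed in the same stage.
    """
    if not finished_secs:
        return []  # nothing to compare against yet
    threshold = multiplier * median(finished_secs)
    return [task_id for task_id, elapsed in running_secs.items()
            if elapsed > threshold]


if __name__ == "__main__":
    finished = [10, 12, 11]              # seconds
    running = {"t7": 40, "t8": 13}       # hypothetical task ids
    print(speculation_candidates(running, finished))
```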

12.9 Disk

Even though Spark applications benefit largely from memory, the disk also plays an important role. Spark uses disk space to store shuffle data temporarily. In Spark standalone mode and on Mesos, this location can be configured with the SPARK_LOCAL_DIRS variable. In YARN mode, Spark inherits YARN's local directories. Spark jobs can benefit a lot from using SSDs for these directories.

12.10 Tuning Data Structures

The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this:

Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.

Avoid nested structures with a lot of small objects and pointers when possible.

Consider using numeric IDs or enumeration objects instead of strings for keys.

If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers be four bytes instead of eight. You can add these options in spark-env.sh.

12.11 Choosing the right transformations

One way to optimize a Spark application is to write efficient code. Choosing transformations carefully to avoid unnecessary shuffling improves performance: the less data is shuffled, the shorter the execution time. For example, reduceByKey() and groupByKey() can achieve the same results, but reduceByKey() is more efficient because the two process data differently. groupByKey() shuffles all the data across the nodes based on the keys, whereas reduceByKey() first aggregates the data locally on each node and only then transfers the partial results to other nodes based on the key. As a result, reduceByKey() shuffles less data than groupByKey().

In some cases, when the data is highly skewed, you might face issues with groupByKey() because all the records with the same key are sent to a single node.

Consider using filter, sortBy or orderBy before reduce functions, which may improve performance.
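The difference between groupByKey() and reduceByKey() described above can be made concrete with a small pure-Python simulation. It does not use Spark; each inner list stands in for one partition, and the functions (all named for this example only) count how many records would have to cross the network under each approach.

```python
def shuffled_records_group_by_key(partitions):
    """groupByKey style: every (key, value) record is shuffled."""
    return sum(len(partition) for partition in partitions)


def shuffled_records_reduce_by_key(partitions):
    """reduceByKey style: one partial result per key per partition is shuffled."""
    return sum(len({key for key, _ in partition}) for partition in partitions)


def reduce_by_key(partitions, func):
    """Combine locally inside each partition, then merge the partial results."""
    merged = {}
    for partition in partitions:
        local = {}
        for key, value in partition:          # local (map-side) aggregation
            local[key] = func(local[key], value) if key in local else value
        for key, value in local.items():      # merge after the "shuffle"
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


if __name__ == "__main__":
    data = [
        [("a", 1), ("a", 2), ("b", 3)],   # partition 0
        [("a", 4), ("b", 5), ("b", 6)],   # partition 1
    ]
    print(shuffled_records_group_by_key(data))
    print(shuffled_records_reduce_by_key(data))
    print(reduce_by_key(data, lambda x, y: x + y))
```

The gap between the two counts grows with the number of repeated keys per partition, which is why the saving is largest on data with few distinct keys.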

13 Delta lakes

Delta Lake is an open source storage layer, initiated by Databricks, that brings ACID capabilities to Spark APIs.

It offers features such as ACID transactions and row-level data modifications (update, insert and delete) for Spark DataFrames. In effect, this makes the tables behind Spark DataFrames mutable.


"About Author"

The author has extensive experience in Big Data technologies and has worked in the IT industry for over 25 years in various capacities after completing his BS and MS in computer science and data science respectively. He is a certified cloud architect and holds several certifications from Microsoft and Google. Please contact him at srao@unifieddatascience.com with any questions.
