How To Build Data Pipelines for a Multi-Cloud Environment
It’s not uncommon for data engineers to be given the job of building reliable data pipelines via Hadoop/Spark clusters.
Let’s start by clarifying what we are talking about. A data pipeline is simply a way of describing a series of data processing steps. In theory, when building a data pipeline, it’s essential to understand the business objective to be achieved by using the pipeline. From there, the developers need to know what, where, and how data is collected. Using that information, the pipeline has to be designed to eliminate errors and overcome bottlenecks or latency. And it has to produce information in a format that can be used to make timely decisions.
In practice, the first step is to identify all the data sources that are to be used. These may be IoT (Internet of Things) devices, mobile apps, web apps, and so on. The next stage is to ‘ingest’ the data into the pipeline. The data may arrive as files, BLOBs, or streams. A Binary Large OBject (BLOB) is a collection of binary data stored as a single entity, usually graphics, audio, or other multimedia. Streaming data is data that is generated continuously by different sources. A series of steps then follows to produce the final required output. Each step produces an output that becomes the input to the next step, and it is sometimes necessary to insert buffer storage between steps. To complicate matters, the steps are not necessarily all sequential: some can run in parallel with others, and some parts of a pipeline are executed in a time-sliced fashion. In addition, data produced in one pipeline may be used by other pipelines.
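The idea that each step's output becomes the next step's input can be sketched in a few lines of Python. This is a minimal illustration only: the step names (`validate`, `enrich`, `filter_hot`), the sensor records, and the temperature threshold are all hypothetical and stand in for whatever processing a real pipeline performs.

```python
from functools import reduce
from typing import Callable

# A step takes a list of records and returns a new list of records.
Step = Callable[[list[dict]], list[dict]]


def validate(records: list[dict]) -> list[dict]:
    """Drop records that do not carry a numeric reading."""
    return [r for r in records if isinstance(r.get("reading"), (int, float))]


def enrich(records: list[dict]) -> list[dict]:
    """Add a derived field (Celsius to Fahrenheit, purely illustrative)."""
    return [{**r, "reading_f": r["reading"] * 9 / 5 + 32} for r in records]


def filter_hot(records: list[dict]) -> list[dict]:
    """Keep only readings above a hypothetical threshold of 80 F."""
    return [r for r in records if r["reading_f"] > 80]


def run_pipeline(records: list[dict], steps: list[Step]) -> list[dict]:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda data, step: step(data), steps, records)


raw = [
    {"device": "sensor-1", "reading": 21.5},
    {"device": "sensor-2", "reading": None},  # rejected by validate
    {"device": "sensor-3", "reading": 30.0},
]
result = run_pipeline(raw, [validate, enrich, filter_hot])
print(result)
```

Because every step shares the same shape (records in, records out), steps can be reordered, tested in isolation, or replaced without touching the rest of the chain.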
The steps in a pipeline may perform tasks such as data aggregation, augmentation, enrichment, extraction, filtering, grouping, loading, transformation, and validation, as well as running various algorithms against the data. The data may be stored in a data lake before the next stage. The analysis or computation stage is where software such as Hadoop or Spark is used to produce useful information, which is then stored in a data warehouse.
The final stage is presentation, where the insights derived from the data analysis are presented, often through dashboards, to make business decisions.
Data Aggregation
Data aggregation is the name given to the process of collecting data from different sources, combining it, processing it, and then presenting it in a summarized format, which can then be used for data analysis. The input data needs to be accurate, and there must be enough of it so that the analysis results are relevant. The process makes the next stage – the data analysis – much quicker. Data aggregation can also be used as a way to anonymize personally identifiable information. For example, individuals’ salaries may be listed in the data collected initially, but an average salary is shown in the summarized data. The data can then be analyzed by the right software.
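The salary example can be illustrated with a short Python sketch. The records, field names, and the `aggregate_salaries` function are made up for illustration; the point is only that the summary contains a headcount and an average per department, not any individual's salary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: one row per individual, as initially collected.
records = [
    {"name": "A", "department": "Engineering", "salary": 90000},
    {"name": "B", "department": "Engineering", "salary": 110000},
    {"name": "C", "department": "Sales", "salary": 60000},
    {"name": "D", "department": "Sales", "salary": 80000},
]


def aggregate_salaries(rows: list[dict]) -> dict[str, dict]:
    """Summarize salaries per department, dropping names and individual values."""
    by_department = defaultdict(list)
    for row in rows:
        by_department[row["department"]].append(row["salary"])
    return {
        department: {"headcount": len(salaries), "average_salary": mean(salaries)}
        for department, salaries in by_department.items()
    }


summary = aggregate_salaries(records)
print(summary)
```

Note that an average over a very small group can still reveal individual values, so a real pipeline would probably enforce a minimum group size before publishing a summary.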
Examples of data aggregation software include:
- Datorama and Domo: used for business intelligence
- Funnel.io and Improvado: used by marketers
- Stitch: a cloud-first, developer-focused platform for rapidly moving data
Data Analysis
Using Hadoop or Spark (both from Apache), it’s possible to analyze the data produced by the data aggregator. Many other products are available for data analysis, but these two are, perhaps, the best known. Hadoop analyzes the data using its MapReduce programming model, which divides the analysis task into smaller parts that can then be processed in parallel across a cluster. Hadoop also has its own file system, the Hadoop Distributed File System (HDFS). Spark was designed to be faster and easier to use than Hadoop’s MapReduce, in large part by keeping working data in memory. From the end users’ point of view, the two perform a similar function.
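To make the MapReduce idea concrete, here is a plain-Python sketch of its three phases (map, shuffle, reduce) applied to a word count. It does not use Hadoop itself and runs in a single process; the function names and sample lines are illustrative assumptions. In a real cluster, the map and reduce calls would be distributed across many machines.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["spark and hadoop", "hadoop stores data in hdfs", "spark reads hdfs"]

# Each line could be mapped on a different worker; here it is done in a loop.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The task splits cleanly because each line can be mapped independently and each key can be reduced independently, which is what lets the work be spread over a cluster.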
Spark, for example, can connect to Hadoop HDFS, MySQL, Elasticsearch, Kafka, Redis, MongoDB, Cassandra, Apache HBase, and more. And Spark can process data using Java, Scala, SQL, Python, and R. This illustrates the flexibility of what can be done to analyze Big Data.
The thing to bear in mind with this type of data pipeline is that there can be lots of data and it is being created frequently, and there may be different formats of data. This is what is sometimes called volume, velocity, and variety. As the volume and velocity increase, so the data pipeline needs to scale to handle the increasing load – so that the data can be processed as quickly as possible, and analysts can make decisions based on the output in as near real-time as possible.
Best Practices for Architecting Data Pipelines
In terms of best practice, certain questions need to be answered about the data pipeline being created. These include: How much data processing is involved? What types of data processing are required? What rate of data needs to be handled? Will streaming data be used? Will data only come from on-premise? Will data come from the cloud? Will the data produced need to go to the cloud or just on-premise? Will data need to move from one cloud to another?
Obviously, there needs to be some way to monitor the pipeline’s operation to ensure data integrity and identify problems such as network congestion or the final destination of the data being offline. Also, there needs to be some way to alert the IT team if a problem occurs.
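A very small monitoring wrapper might look like the following sketch. Everything here is an assumption made for illustration: the `StepMonitor` class, its thresholds, and the idea of collecting alerts in a list. A production pipeline would more likely send alerts to a paging or messaging system and use the monitoring tools of its orchestration platform.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepMonitor:
    """Runs pipeline steps and records alerts for failures, slowness, or lost records."""

    max_seconds: float
    alerts: list[str] = field(default_factory=list)

    def run(self, name: str, step: Callable[[list], list], records: list,
            expect_same_count: bool = True) -> list:
        started = time.perf_counter()
        try:
            output = step(records)
        except Exception as exc:
            # For example, the destination of the data being offline.
            self.alerts.append(f"{name}: failed with {exc!r}")
            raise
        elapsed = time.perf_counter() - started
        if elapsed > self.max_seconds:
            self.alerts.append(f"{name}: took {elapsed:.2f}s (limit {self.max_seconds}s)")
        if expect_same_count and len(output) != len(records):
            self.alerts.append(
                f"{name}: record count changed from {len(records)} to {len(output)}"
            )
        return output


def drop_empty(records: list) -> list:
    return [r for r in records if r]


def load_to_offline_target(records: list) -> list:
    raise ConnectionError("destination offline")


monitor = StepMonitor(max_seconds=5.0)
cleaned = monitor.run("drop_empty", drop_empty, [{"id": 1}, {}, {"id": 2}])
try:
    monitor.run("load", load_to_offline_target, cleaned)
except ConnectionError:
    pass  # the alert has already been recorded for the IT team

print(monitor.alerts)
```

Steps that are expected to drop records, such as filters, can pass `expect_same_count=False` so that only unexpected losses raise an alert.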
Multi-cloud Platforms
When it comes to cloud data warehouses that are designed with scalability in mind, there are several choices, including Amazon Redshift, Azure SQL Data Warehouse (now part of Azure Synapse Analytics), and Google BigQuery.
The next question is why your organization would want to migrate to a multi-cloud data pipeline environment. As organizations grow their use of data, they may find different teams using subsets of the data as input for different applications, or needing it in different formats. In addition, different cloud solutions may, historically, have been used for different functions within a company. At the time, this would have been done to leverage the particular strengths of each cloud provider.
However, this can create ‘cloud silos’ of data. Creating a multi-cloud pipeline allows data to be taken from one cloud provider and worked on before being loaded onto a different cloud provider. This enables organizations to use cloud-specific tooling and overcome any restrictions they may face from a specific provider. It’s also worth considering whether vendors offer incentives to store data on their platforms or whether costs become prohibitive if data sizes grow too large.
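The pattern of taking data from one provider, working on it, and loading it onto another can be sketched against a provider-neutral interface. The `ObjectStore` protocol, the `InMemoryStore` class, and the `transfer` function below are hypothetical; in practice each store would wrap the relevant provider's own storage SDK, which is deliberately not shown here.

```python
import json
from typing import Callable, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface that each cloud's storage adapter would implement."""

    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a real cloud object store, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def transfer(source: ObjectStore, target: ObjectStore, key: str,
             transform: Callable[[list[dict]], list[dict]]) -> None:
    """Extract from one store, transform in between, and load into another."""
    records = json.loads(source.read(key))
    target.write(key, json.dumps(transform(records)).encode("utf-8"))


cloud_a, cloud_b = InMemoryStore(), InMemoryStore()
cloud_a.write("orders.json", json.dumps([{"id": 1, "amount": "10.50"}]).encode("utf-8"))

# Illustrative transformation: convert amounts from strings to numbers.
transfer(cloud_a, cloud_b, "orders.json",
         lambda rows: [{**row, "amount": float(row["amount"])} for row in rows])

loaded = json.loads(cloud_b.read("orders.json"))
print(loaded)
```

Keeping the pipeline logic behind a small interface like this is one way to limit how much code depends on any single provider.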
According to Gartner, most organizations choose to work with multiple cloud providers, and they do so for many different reasons. For example, where an organization wants to use cloud services in different localities worldwide, it may find it hard to have all its requirements met by a single provider. For such an organization, moving to multi-cloud is an obvious decision. Gartner’s survey showed that 81% of public cloud users reported using two or more cloud providers.
Reducing Data Latency
In any multi-cloud environment, data moving from one system to another in the pipeline can hit bottlenecks, which cause latency. These can get worse as the volume of data increases. Latency depends on the efficiency of the message queue, the stream computing, and the databases used for storing computation results. Breaking the data up into smaller pieces and processing those pieces in parallel speeds up production of the output and reduces latency.
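Breaking data into smaller pieces and processing them concurrently can be sketched with Python's standard `concurrent.futures` module. The chunk size, worker count, and the trivial `process_chunk` function are illustrative assumptions. Threads suit work that waits on the network or on storage; for CPU-heavy work in Python, a process pool or a framework such as Spark would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator


def chunked(items: list[int], size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(chunk: list[int]) -> int:
    """Placeholder for real per-chunk work; here it just sums the values."""
    return sum(chunk)


def process_in_parallel(items: list[int], chunk_size: int, workers: int = 4) -> int:
    """Process chunks concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunked(items, chunk_size)))
    return sum(partials)


total = process_in_parallel(list(range(1, 101)), chunk_size=10)
print(total)
```

The result is the same as processing the whole list at once; only the elapsed time changes, which is what matters for latency.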
Business intelligence applications are time-sensitive, so keeping latency as low as possible is important in order to provide timely data for decision-making. Whatever solution to the latency problem is used, it must scale as the data volume and velocity grow.
If only a subset of the data is required with exceptionally low latency, then just that part of the data might be streamed elsewhere. It can then be computationally efficient to work on that subset alone and deliver the output to, for example, the required dashboard.
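Splitting off a latency-critical subset can be as simple as routing each event to one of two paths. In this sketch, the event shape, the `route` function, and the rule that failed payments are urgent are all hypothetical; the fast path would feed the dashboard directly, while the rest goes through normal batch processing.

```python
from typing import Callable


def route(events: list[dict],
          is_urgent: Callable[[dict], bool]) -> tuple[list[dict], list[dict]]:
    """Split events into a low-latency path and a regular batch path."""
    fast_path, batch_path = [], []
    for event in events:
        (fast_path if is_urgent(event) else batch_path).append(event)
    return fast_path, batch_path


events = [
    {"id": 1, "type": "page_view"},
    {"id": 2, "type": "payment_failed"},
    {"id": 3, "type": "page_view"},
    {"id": 4, "type": "payment_failed"},
]

# Hypothetical rule: only failed payments need near real-time handling.
fast, batch = route(events, lambda event: event["type"] == "payment_failed")
print(len(fast), "urgent events;", len(batch), "batched events")
```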
In a cloud environment, analytics platforms like Hadoop and Spark are available as managed services. These make it possible for users to work on data in near real-time, to scale their systems as required, and, of course, to have much less overhead in terms of maintaining servers. Managed services can be expensive and may lead to the use of specific tools that aren’t portable to other cloud providers, but a multi-cloud pipeline, as mentioned earlier, can overcome some of these limitations.
Where the data pipeline doesn’t run only in the cloud but also at the edge, the arrangement is described as an edge hybrid, and it’s necessary to have a speedy and reliable Internet connection. In this situation, it’s important to minimize the dependencies between the two systems: the one running at the edge and the one running in the cloud environment. Each dependency can affect the reliability and latency of the whole environment, so keeping dependencies to a minimum keeps the added latency to a minimum.
Conclusion
It’s important to understand the requirements of the data pipeline when first designing it, and it’s important that it can scale to handle any increase in data being fed into it. Breaking down tasks into smaller parts allows some to be prioritized, reducing any latency so that output to, for example, dashboards can be in near real-time. And using multi-cloud environments can maximize the use of tooling on each platform to achieve the desired results.