Recently I attended a two week data engineering workshop which targeted mainly data ingestion, processing and workflow management.I got the chance to step into the shoes of a data engineer for these two weeks, and, it was no piece of cake.
In the workshop we were introduced to the tools of the big data world and then we were given a problem statement, which we were expected to solve using them.The option of selection of the tools, and coming up with creative visualizations was left to us.Lets have a glimpse of what approaches, challenges and learnings we gathered during the workshop.
The problem statement:
We will be given 2 types of data — data which will be fed to us via batches mostly using a relational database like PostgreSQL and data which will be coming in live streams preferably Kafka streams.The stream data here is mainly the row updates to the batch data.We would need to ingest and process this data(both batch and stream simultaneously) using the given aggregations/logic and then provide appropriate visualizations.An additional point being, the quantum of data we would be dealing with is around hundreds of millions of records.
For the first use-case i.e for batch data, we wrote Spark jobs using Scala(since all of us weren’t comfortable coding in python :P).This worked well for us since, creating the dataframes and then exporting it to elastic search sort of went smooth without many hiccups.
The problem arose when we had to deal with streaming data too.We started off using a similar approach, mainly because we didn’t want to duplicate the code we had written earlier.We used Spark’s structured streaming and everything would have gone as per expectation, if only the world would have been nicer and we had one single aggregation logic in our problem statement.But since that was not the case,we were hit with reality when we got to know that Spark structured streaming doesn’t support multiple aggregations. We immediately changed our approach, and used Kafka connect instead.We had to duplicate the logic while using Kafka connect, but it worked, so we were happy. Finally we pushed the data to elastic search and visualized it using Kibana.
The major challenge that we faced was joining batch and stream data.Since we had to join and execute multiple aggregations on the resultant data, Spark’s structured streaming couldn’t help us.This kind of consumed time as it led to the change in approach and design.Another challenge was the version mismatch between Java and Scala.You have to find the sweet spot of the versions for the Spark job to run peacefully.
- Choose the correct file format:-The file format that you choose for your use-case is very crucial in optimizing the performance.We had several joins in between columns hence we went with parquet(columnar file format).If your use-case is row-based you could go with avro.
- Aptly partition your data for faster joins.
- Cleanse your data well:-Look for anomalies, nulls and negative values
- Always have a backup of raw data.
- Have tests for the jobs you run, ScalaTest/MiniCluster can be used for this
Most importantly, what I realized was, there is no single mantra to solve many of these data related problems. Whether to go with ETL or ELT, or choosing lambda architecture over kappa ,or simply deciding which tool to use, all these questions can only be answered if we understand the data well and also have a strong grasp over the requirement.
Thats’s it I had to share.May the data be with you :)