
Design Efficient Data Pipelines Using Python and Clustering 


Hi, I’m Shriyam. I’m currently working as a Data Engineer II, where I work with my stakeholders to ensure the right data is available for their reporting requirements. This work includes setting up data pipelines, maintaining data quality and ensuring the timely availability of data. I have been working in the data domain for my entire career, so choosing this program was an upskilling decision and a step toward moving from my current data engineering role into a data science role.

Working as a data engineer, I have come across a few use cases where I can apply data science concepts. Python and clustering in particular have been instrumental in helping me design efficient pipelines that ensure the timely availability of data and also enhance data quality. One example is introducing data slices in one of our datasets and grouping its rules according to business rules. For this particular use case, I had to use classification to tag each rule to a specific slice. The hydration of the dataset was causing an issue in the production environment, so users were not able to receive data for reporting from it. Although the reports were not critical initially, they were necessary for quarter-end reporting.

The tool I used was Python, and the technique was multi-label KNN classification. Since I had five slices that I wanted to tag the rules to, a binary classification technique was not going to be sufficient, so I went with multi-label classification. I also wanted the rules within a particular slice to be functionally grouped based on their usage, performance and scope. Keeping these requirements in mind, I arrived at KNN classification. The steps involved were as follows (see the sketch after this list):

  1. Preparing the data: creating a dataset with all the relevant metadata for the rules
  2. Creating dummy variables for all string-valued features
  3. Applying the algorithm to this dataset
  4. Repeating the classification multiple times and evaluating performance after each output
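
A minimal sketch of how these steps could look in Python with scikit-learn is shown below. The column names (rule_id, usage, performance, scope), the five slice labels and the CSV source are illustrative assumptions, not the actual production schema.

```python
# Hypothetical sketch of the multi-label KNN workflow: column names, slice
# labels and the CSV source are illustrative assumptions only.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

slice_cols = ["slice_1", "slice_2", "slice_3", "slice_4", "slice_5"]

# Step 1: dataset with the rules' metadata plus 0/1 indicator columns for the
# slices of the already-labelled rules
rules = pd.read_csv("rule_metadata.csv")  # rule_id, usage, performance, scope, ...
features = rules.drop(columns=["rule_id"] + slice_cols)

# Step 2: dummy variables for every string-valued feature
X = pd.get_dummies(features, columns=features.select_dtypes("object").columns)

# Multi-label target: one indicator column per slice
y = rules[slice_cols]

# Step 3: apply KNN (KNeighborsClassifier natively supports multi-label targets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Step 4: evaluate, then adjust the feature set / k and repeat
pred = knn.predict(X_test)
print("micro-averaged F1:", f1_score(y_test, pred, average="micro"))
```

In practice, step 4 means adjusting the feature set and the number of neighbours across several iterations and re-checking the evaluation score each time.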

The main challenge was picking the variables correctly so that the KNN algorithm would return a classification grounded in business logic rather than arbitrary groupings.
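One way to make that variable selection systematic, continuing the sketch above (and reusing its assumed rules, y and slice_cols), is to cross-validate the classifier on different candidate feature subsets and keep the one that scores best; the candidate column names here are again hypothetical.

```python
# Hypothetical feature-subset search, reusing rules, y and slice_cols from the
# sketch above; candidate column names are assumptions for illustration.
from itertools import combinations

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidate_features = ["usage", "performance", "scope", "owner_team", "run_frequency"]

best_score, best_subset = 0.0, None
for r in range(2, len(candidate_features) + 1):
    for subset in combinations(candidate_features, r):
        # Rebuild the dummy-encoded matrix for just this subset of columns
        X_sub = pd.get_dummies(rules[list(subset)])
        scores = cross_val_score(
            KNeighborsClassifier(n_neighbors=5), X_sub, y,
            cv=5, scoring="f1_micro",
        )
        if scores.mean() > best_score:
            best_score, best_subset = scores.mean(), subset

print("best feature subset:", best_subset, "score:", round(best_score, 3))
```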

As this was a data engineering problem, there were no analytical insights as such. However, the new slices suggested by the KNN algorithm were very efficient and reduced the overall processing time from 70 minutes to just under 25 minutes, roughly a 64% improvement in the performance of the dataset, making it more scalable. The solution I provided was the Rule Slice Map, generated as an output of the KNN classification exercise (a sketch of producing it appears below). I also suggested that the broader team look at data science concepts as solutions to some of the problems we face in the data engineering space.
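A hedged sketch of turning the classifier output into that Rule Slice Map might look like the following; it again continues the earlier sketch, and the id column and output file name are assumptions.

```python
# Hypothetical sketch of building the Rule Slice Map from the predictions,
# reusing knn, X, rules and slice_cols from the earlier sketch.
import pandas as pd

pred = pd.DataFrame(knn.predict(X), columns=slice_cols, index=rules["rule_id"])

# One row per rule, listing every slice the rule was assigned to
rule_slice_map = (
    pred.apply(lambda row: [s for s in slice_cols if row[s] == 1], axis=1)
        .rename("slices")
        .reset_index()
)
rule_slice_map.to_csv("rule_slice_map.csv", index=False)
```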

The hydration of the dataset in question was a big blocker for quarter-end reporting and had been worked on for a long time, but slicing the dataset based on the KNN classification results helped us achieve successful and efficient hydration. This also demonstrates that data engineering problems can be solved using data science concepts.

This exercise gave me a chance to apply the knowledge gained in the program modules to a real-world problem and solve it efficiently and quickly.

