Python, Apache Beam, Relational Database (MySQL, PostgreSQL), Struggle and The Solution

Background

We all (well, most of us) are fans of Python programming because of how little development effort this language demands compared to many others.

Motivation

There is a good chance you will end up working with either MySQL or PostgreSQL on every third or fourth project, though how deeply you engage with these databases varies a lot. This is where my story starts.

Why my Apache Beam pipeline was not scaling

The Bad Approach

Since there was no IO connector available to read data from an AWS RDS database, specifically PostgreSQL or MySQL, I wrote a ParDo function that created its own connection to RDS, and inside the DoFn I read the data with a SQL query. Because of how ParDo works, or rather because of how I was reading from RDS inside it, this was not the correct way to read data from a source in Apache Beam or in any other scalable data-processing pipeline: the entire read happened in a single step that the runner could not parallelise.
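To make the problem concrete, here is a minimal sketch of what that first attempt looked like, assuming psycopg2 as the driver and made-up connection details, table name, and query; the pymysql equivalent for MySQL would look the same:

```python
import apache_beam as beam
import psycopg2  # assumed driver; pymysql would be the MySQL equivalent


class ReadFromRDS(beam.DoFn):
    """Anti-pattern: the whole query runs inside one DoFn invocation."""

    def __init__(self, host, dbname, user, password, query):
        self.host = host
        self.dbname = dbname
        self.user = user
        self.password = password
        self.query = query

    def process(self, element):
        # One element, one connection, one full result set: nothing here
        # gives the runner a chance to spread the read across workers.
        conn = psycopg2.connect(host=self.host, dbname=self.dbname,
                                user=self.user, password=self.password)
        with conn, conn.cursor() as cur:
            cur.execute(self.query)
            for row in cur:
                yield row
        conn.close()


with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        | "Seed" >> beam.Create([None])  # a single dummy element
        | "Read" >> beam.ParDo(ReadFromRDS(
            host="my-rds-host", dbname="mydb", user="me",
            password="secret", query="SELECT * FROM my_table"))
    )
```

Because the whole table is pulled through a single dummy element, only one worker ever does the reading, no matter how large the cluster is.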

The Right Approach

As I dived deep into how the built-in IO connectors are implemented in Apache Beam, I learned that writing a proper IO connector for Apache Beam is not trivial. The approach I settled on works as follows:

  1. Import the package in your Python Apache Beam pipeline.
  2. If the user passes a table name, generate the SQL query:
     SELECT * FROM TABLE_NAME
  3. Find the number of records the query will return:
     SELECT COUNT(1) FROM TABLE_NAME
  4. Generate paginated SQL queries using the pagination features that MySQL and PostgreSQL provide. Based on the batch size passed by the user, this step generates total_records/batch_size paginated queries of the form:
     SELECT * FROM TABLE_NAME LIMIT BATCH_SIZE OFFSET ((PAGE_NUM - 1) * BATCH_SIZE)
  5. Distribute these paginated SQL queries to the Apache Beam workers.
  6. Perform the reading and further processing on the distributed workers.
  7. The workflow is sketched below.
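Here is a minimal sketch of that workflow, again assuming psycopg2 as the driver and illustrative connection details, table name, and batch size. Beam's built-in Reshuffle transform is used to break fusion so the paginated queries actually land on different workers:

```python
import apache_beam as beam
import psycopg2  # assumed driver; pymysql would be the MySQL equivalent

# Hypothetical connection settings, table, and batch size for illustration.
DB = dict(host="my-rds-host", dbname="mydb", user="me", password="secret")
TABLE = "my_table"
BATCH_SIZE = 10000


def count_records(unused):
    """Find how many records the table holds."""
    conn = psycopg2.connect(**DB)
    with conn, conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(1) FROM {TABLE}")
        total = cur.fetchone()[0]
    conn.close()
    return total


def build_paginated_queries(total_records):
    """Generate total_records / batch_size paginated SQL queries."""
    pages = (total_records + BATCH_SIZE - 1) // BATCH_SIZE
    for page in range(pages):
        yield (f"SELECT * FROM {TABLE} "
               f"LIMIT {BATCH_SIZE} OFFSET {page * BATCH_SIZE}")


class ReadPage(beam.DoFn):
    """Each worker executes one paginated query, so the read scales out."""

    def process(self, query):
        conn = psycopg2.connect(**DB)
        with conn, conn.cursor() as cur:
            cur.execute(query)
            for row in cur:
                yield row
        conn.close()


with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        | "Seed" >> beam.Create([None])
        | "Count" >> beam.Map(count_records)
        | "Paginate" >> beam.FlatMap(build_paginated_queries)
        | "Reshuffle" >> beam.Reshuffle()  # spread the queries across workers
        | "ReadPage" >> beam.ParDo(ReadPage())
    )
```

With this structure, the only serial work is counting the records and generating the query strings; the expensive part, actually reading the rows, is fanned out across however many workers the runner provides.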
