Python, Apache Beam, Relational Databases (MySQL, PostgreSQL): The Struggle and the Solution
Most of us are fans of Python because of how little development effort the language demands.
Apache Beam is an SDK for developing data processing pipelines for batch and streaming data. When it comes to practical, real-world use of the Apache Beam SDK, however, we often run into the limits of its built-in connectors when a particular type of source has to be processed.
The Apache Beam SDK for Python supports only a limited set of database connectors: Google BigQuery, Google Cloud Datastore, Google Cloud Bigtable (write), and MongoDB. The real world also depends on MySQL and PostgreSQL, which are widely used relational databases across all domains and at all levels of software development.
Apache Beam also provides a guide for developing your own IO connector, but writing one is not that easy. You need to take care of many things, such as distributing your queries across the Apache Beam workers and collecting the records. Most importantly, you need to design the connector so that your fellow developers can call it easily and specify the table or SQL query to read from.
There is a good chance that you will have to work with either MySQL or PostgreSQL on every third or fourth project, although how heavily you use these databases can vary. Here is where my story starts.
I was working on a project for a customer whose databases were on AWS and who mainly used Redshift. Redshift can be used with the PostgreSQL connector as well. As far as I know, Redshift is based on PostgreSQL.
When I started developing my Apache Beam pipeline and tried to read data from the Amazon RDS database, my Apache Beam Dataflow pipeline struggled to scale to multiple workers.
Why the Apache Beam pipeline was not scaling
The bad approach
Since no IO connector was available to read from an AWS RDS database, or from PostgreSQL or MySQL in particular, I wrote a ParDo function: I created my connection to RDS and, in the DoFn, read from RDS based on a SQL query. This is not the correct way to read from a data source in Apache Beam or in any other scalable data processing framework. Beam parallelizes work across the elements of a collection, so a DoFn that runs one whole query for a single input element does all of the reading on one worker, however many workers are available.
The right approach
As I dug into how the built-in IO connectors are coded in Apache Beam, I learned that writing an IO connector in Apache Beam is not so easy.
But I ended up writing a new IO connector anyway, to read from PostgreSQL and MySQL databases. The code is available at https://github.com/yesdeepakverma/pysql-beam and can also be installed with the command
pip install pysql-beam
How to use this package
1. Install the package using the pip install pysql-beam command
2. Import the package in your Python Apache Beam pipeline
3. Create a PTransform object
4. Define the pipeline options
5. Use the PTransform in your pipeline
How this works behind the scenes
- The user passes a table name or a SQL query
- If the user passes a table name, the SQL query is generated as
  - SELECT * FROM TABLE_NAME
- Find the number of records the query will return
  - SELECT COUNT(1) FROM TABLE_NAME
- Generate paginated SQL queries using the pagination feature available in MySQL and PostgreSQL
  - Based on the batch size passed by the user, this step generates total_records/batch_size (rounded up) paginated SQL queries
  - SELECT * FROM TABLE_NAME LIMIT BATCH_SIZE OFFSET ((PAGE_NUM-1)*BATCH_SIZE)
- These paginated SQL queries are then distributed to the Apache Beam workers
- Reading and processing are then performed on the distributed workers
- Workflow is explained below
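The steps above can be sketched in plain Python. This is only a minimal illustration of the idea, not the actual pysql-beam code: the helper names (table_to_query, count_query, build_paginated_queries) and the table name orders are made up for the example. The sketch assumes the base query is wrapped in a subquery so that a user-supplied SQL query can be paginated the same way as a table, and it rounds the page count up so the last, partially filled page is kept.

```python
import math


def table_to_query(table_name):
    """Turn a table name into the base query (hypothetical helper)."""
    return f"SELECT * FROM {table_name}"


def count_query(base_query):
    """Build the query that counts the rows the base query returns."""
    return f"SELECT COUNT(1) FROM ({base_query}) AS count_source"


def build_paginated_queries(base_query, total_records, batch_size):
    """Split one query into independent LIMIT/OFFSET pages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Round up so a final, partially filled page is not dropped.
    num_pages = math.ceil(total_records / batch_size)
    queries = []
    for page_num in range(1, num_pages + 1):
        offset = (page_num - 1) * batch_size
        # LIMIT before OFFSET is accepted by both MySQL and PostgreSQL.
        queries.append(
            f"SELECT * FROM ({base_query}) AS page_source "
            f"LIMIT {batch_size} OFFSET {offset}"
        )
    return queries


# 25 records with a batch size of 10 give three pages (10, 10 and 5 rows).
queries = build_paginated_queries(
    table_to_query("orders"), total_records=25, batch_size=10
)
for query in queries:
    print(query)
```

Each generated string is an independent unit of work, which is what allows different pages to be handed to different workers. In practice you would probably also want an ORDER BY on a unique column, since LIMIT/OFFSET pages are only stable when the row order is fixed.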
This Python package solves the problem that arises when you read from a database inside a ParDo function and your data pipeline is unable to scale. The solution scales your pipeline based on the batch size you pass when building it.
I hope this solves the long-standing problem of reading from SQL databases in Python Apache Beam pipelines.
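To see how the batch size sets the number of independent reads, here is a small sketch that uses an in-memory SQLite database as a stand-in for MySQL or PostgreSQL (an assumption made purely so the example runs without a database server). The users table, the read_page helper and the sample rows are all hypothetical and are not part of pysql-beam.

```python
import sqlite3


def read_page(conn, batch_size, page_num):
    """Read one page of rows; each call is an independent unit of work."""
    offset = (page_num - 1) * batch_size
    # ORDER BY keeps the page boundaries stable between queries.
    sql = "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?"
    return conn.execute(sql, (batch_size, offset)).fetchall()


# SQLite stands in for MySQL/PostgreSQL here (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 8)],  # 7 sample rows
)

batch_size = 3
total = conn.execute("SELECT COUNT(1) FROM users").fetchone()[0]
num_pages = -(-total // batch_size)  # ceiling division

# In a pipeline, each page could be read by a different worker.
pages = [read_page(conn, batch_size, p) for p in range(1, num_pages + 1)]
all_rows = [row for page in pages for row in page]

print([len(page) for page in pages])  # prints [3, 3, 1]
```

A smaller batch size produces more pages and therefore more units of work to spread across workers, at the cost of more queries against the database.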
Note: Support for MSSQL is coming soon (thanks to jac2130 for adding support for the MSSQL database). For the updated code, please refer to the GitHub repo at https://github.com/yesdeepakverma/pysql-beam.git.