Solving Data Silos

Deepak Verma
5 min read · Jun 18, 2021

In a modern data system, it is unrealistic to assume that all of your data lives in a single database or data warehouse where you can query it efficiently and generate insights from it.

In a typical scenario, data sits in different systems until we build manageable, fault-tolerant data pipelines to ingest it into a common place like BigQuery, Redshift, or Snowflake for analysis.

Building and managing these pipelines often creates technical overhead of its own. Even if you automate and orchestrate them, you end up adopting another tool like Airflow, which is yet another overhead and requires manual inspection to make sure it is running smoothly. Often a dedicated team ends up managing these systems, so technical resources are wasted on unproductive work.

One way to explain this problem and how we approached it is the STAR method.

STAR Method depiction — Situation, Task, Action and Result

Situation

While working with one of our customers, the customer wanted to quickly analyze data in order to answer many ad-hoc queries. The data was huge and there was no pattern to the queries. The marketing team would come up with any random question, and we had to send them a response as soon as possible.

The data was spread across multiple data stores: AWS Aurora and AWS S3. The data in S3 was refreshed every month and contained old data as well as new data. The files were gzip-compressed CSVs, each more than 1 TB in compressed form, with more than 100 such files in S3. It was hard for Aurora to keep up with the size of these S3 files; loading and running ETL of this size on Aurora was a nightmare that would kill performance and take hours before we could even run the query.

Task

So the challenge was to query the data efficiently and produce results faster. As the data was in multiple locations, we did not want to build any unnecessary ETL jobs. These are the capabilities we were looking for:

  1. Query data from multiple systems without bringing it into a single data warehouse.
  2. Support SQL-like queries.
  3. Join data across multiple data stores.
  4. Query terabytes or petabytes of data within seconds or minutes.

Action

After evaluating multiple products and solutions, we finally settled on AWS Athena. With Athena, there is no need for complex ETL jobs to prepare your data for analysis, which makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

Your data can live in different locations: blob storage like S3, relational databases like MySQL, PostgreSQL, or Redshift, or NoSQL stores like MongoDB, DynamoDB, and even Redis. It can also be in different formats like CSV, text, Parquet, ORC, JSON, and Avro.

With Athena Federated Query, you can query data in S3, data in PostgreSQL, and data in MongoDB at the same time, in a single query, and join across them. This avoids building unnecessary data pipelines and bringing all the data into a common place when you do not want to, or do not have the technical expertise or bandwidth to do so.
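As an illustration, here is a minimal sketch of such a federated join run through boto3. The catalog name aurora matches the data source we configure later in this post, while the Glue database, table, and column names are hypothetical placeholders, not from the original setup.

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical federated query: join a Glue-catalogued S3 table with a table
# in Aurora PostgreSQL, exposed through the "aurora" data source.
QUERY = """
SELECT s3_tbl.event_id,
       s3_tbl.amount,
       pg_tbl.customer_name
FROM   "AwsDataCatalog"."analytics_db"."events" AS s3_tbl
JOIN   "aurora"."public"."customers"            AS pg_tbl
       ON s3_tbl.customer_id = pg_tbl.customer_id
LIMIT  100
"""

# Kick off the query; Athena writes the result set to the S3 output location.
run = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://YOUR_STAGING_BUCKET/athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```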

Athena runs on top of Presto. Read here about the Athena SQL query limitations.

For a data source like S3, Athena manages the schema in a catalog such as the AWS Glue Data Catalog or the Athena data catalog. This is a schema-on-read mechanism: your files can have 50 columns while you define only 10 of them in one catalog table, then build another table from the same source that exposes 15 columns, and grant access to these catalog tables to different sets of users.
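As a rough sketch of that schema-on-read idea (the database, table, column names, and bucket path below are made up), two external tables can point at the same pipe-delimited files in S3 while each declares only the leading columns it cares about:

```python
# Two hypothetical external tables over the same pipe-delimited files in S3.
# The raw files may carry many more columns; for delimited text, a table with
# fewer columns simply maps the leading fields, and access can be granted per table.

FINANCE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics_db.orders_finance (
    order_id   string,
    order_date string,
    amount     double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://YOUR_DATA_BUCKET/orders/'
"""

MARKETING_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics_db.orders_marketing (
    order_id    string,
    order_date  string,
    amount      double,
    channel     string,
    campaign_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://YOUR_DATA_BUCKET/orders/'
"""
# Each DDL statement can be run from the Athena console, or via
# athena.start_query_execution() as in the boto3 sketch above.
```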

So how does it work?

Architecture Diagram

Steps:

Create Athena Data Source For Aurora Postgres DB

  1. Go to Athena and create a new data source (choose "Query a data source" and select PostgreSQL).
  2. Create a new Lambda from there and fill in the details below.
  3. Spill bucket: {YOUR_STAGING_BUCKET}. (This can be any staging/temp bucket you want to use; keep the Lambda and the S3 bucket in the same region.)
  4. Lambda function name: AthenaJDBCConnectorLambda
  5. Default connection string: postgres://jdbc:postgresql://{DATABASE_HOST}:5432/postgres?user={USERNAME}&password={PASSWORD}
  6. Alternatively, you can configure the above connection string to be fetched from Secrets Manager.
  7. Security group id: {SECURITY_GROUP_ID} (should have access to RDS and S3; change this to a valid security group)
  8. Subnet id: {SUBNET_ID} (change this to a valid subnet id)
  9. Once the Lambda is created, select it from the drop-down menu (creating the Lambda may take some time; behind the scenes AWS runs a CloudFormation stack set).
  10. Specify a catalog name, like Aurora, RDS, Postgres, or your app name; say we name it aurora.
  11. Create the data source.
  12. Go to the Lambda function AthenaJDBCConnectorLambda and add a new environment variable with the details below (see the sketch after this list).
  13. catalogname_connection_string : database connection string. For example, aurora_connection_string: postgres://jdbc:postgresql://{DATABASE_HOST}:5432/postgres?user={USERNAME}&password={PASSWORD}
  14. The above string needs the actual username and password; there are ways to pull them from Secrets Manager, but let's keep it like this for the time being.
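If you prefer to script step 12 instead of using the console, here is a rough boto3 sketch. The function name AthenaJDBCConnectorLambda matches the one created above; the host, user, and password values are placeholders you would substitute (or, better, resolve from Secrets Manager).

```python
import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "AthenaJDBCConnectorLambda"

# update_function_configuration replaces the whole environment block,
# so merge the new variable with whatever is already configured.
current = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
env_vars = current.get("Environment", {}).get("Variables", {})

env_vars["aurora_connection_string"] = (
    "postgres://jdbc:postgresql://{DATABASE_HOST}:5432/postgres"
    "?user={USERNAME}&password={PASSWORD}"  # placeholders, not real credentials
)

lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    Environment={"Variables": env_vars},
)
```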

Create Data Source For S3

  1. Create a new data source for the S3 bucket as well.
  2. Select the Glue catalog, then select "add a table and schema information manually".
  3. Select to create a new database or choose an existing one.
  4. Specify the table name, for example {TABLE_NAME}.
  5. Specify the bucket location. If your CSV files are in a folder, specify the path up to that folder; apart from the CSV files (or your data source files), no other files should be there.
  6. Select the data format: choose Text with a custom delimiter (field delimiter -> other, then specify | (the pipe character)); leave the other defaults unless you want to change them.
  7. Add the columns based on the schema.
  8. Click on Create table.
  9. This generates a SQL statement and lands you on the Athena dashboard (a sketch of what the generated DDL looks like follows this list); from the left panel, select the database you specified in step 3.
  10. Now click on Run query.
  11. Repeat steps 1 to 10 for the other S3 tables.
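For reference, the DDL generated in step 9 looks roughly like the sketch below; the database name, table name, columns, and bucket path are placeholders standing in for your own values from steps 3 to 7. Athena can read gzip-compressed text files directly, so the .csv.gz files need no special handling here.

```python
# Hypothetical shape of the CREATE TABLE statement Athena generates in step 9
# for pipe-delimited text files stored under an S3 prefix.
S3_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics_db.my_s3_table (
    event_id    string,
    customer_id string,
    event_ts    string,
    amount      double
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://YOUR_DATA_BUCKET/path/to/csv/folder/'
"""
# Run it from the Athena console, or via athena.start_query_execution()
# as in the earlier sketch.
```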

Once the above steps are done, you are ready to query data from Postgres and S3 from within the Athena console. You can also run Athena queries from Lambda and schedule them however you want.
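As a final sketch (the query, names, and output path are illustrative, not from the original setup), a small Lambda handler can kick off an Athena query and let the results land in S3; wiring it to an EventBridge schedule gives you the periodic runs mentioned above.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical scheduled report: aggregate S3 events joined with customer
# data from the Aurora catalog, leaving the output in S3.
REPORT_QUERY = """
SELECT pg_tbl.customer_name,
       count(*)           AS events,
       sum(s3_tbl.amount) AS total_amount
FROM   "AwsDataCatalog"."analytics_db"."events" AS s3_tbl
JOIN   "aurora"."public"."customers"            AS pg_tbl
       ON s3_tbl.customer_id = pg_tbl.customer_id
GROUP  BY pg_tbl.customer_name
"""

def handler(event, context):
    # Fire and forget: Athena writes the result set to the output location.
    run = athena.start_query_execution(
        QueryString=REPORT_QUERY,
        ResultConfiguration={"OutputLocation": "s3://YOUR_STAGING_BUCKET/reports/"},
    )
    return {"QueryExecutionId": run["QueryExecutionId"]}
```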

Result

Using AWS Athena, we were able to query data from Aurora and S3 and perform joins without writing any ETL jobs or building a data warehouse. AWS Athena is a pay-as-you-go service, so it costs you nothing when you are not using it.

