Data marching one by one

Moving Data around in Google Cloud using Dataflow and Pub/Sub

The goal of this article is simple. There is a common pattern that I've found quite useful: applications need to publish data, often large amounts of it, and you want to store it in the cloud for analysis or machine learning purposes. Google Cloud is quite powerful, but it definitely takes some getting used to; getting everything configured properly takes time to learn, and there are gotchas everywhere. This article will hopefully let you copy, paste, and get started.

Here is what we are going to build:

  1. An App Engine app that will continuously create some data and publish it as a file to GCS.
  2. A Cloud Function that will be triggered by new files uploaded to a GCS bucket.
  3. The Cloud Function will read the newly created file and then publish the data to Google Pub/Sub.
  4. A streaming Dataflow job that will read the data from the Pub/Sub subscription and then write it into a BigQuery table.

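Step 1 needs nothing special: it is just ordinary Python writing a file to a GCS bucket. Here is a minimal sketch, with a placeholder bucket name and a made-up "name,value" record format rather than the exact code from the repo linked below:

    # Minimal sketch of the data producer: generate some records and upload
    # them as one file to GCS. "my-data-bucket" is a placeholder.
    import datetime
    import random

    from google.cloud import storage

    def publish_batch(bucket_name="my-data-bucket"):
        # Generate a few fake "name,value" records.
        lines = [f"sensor-{i},{random.random():.4f}" for i in range(100)]

        # Upload them as a single timestamped file; each new file is what
        # fires the Cloud Function trigger in step 2.
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(
            f"incoming/{datetime.datetime.utcnow():%Y%m%d-%H%M%S}.csv"
        )
        blob.upload_from_string("\n".join(lines))
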
Here are the steps for prepping GCP and creating the proper resources. Most should be fairly straightforward:

  • Create the Pub/Sub topic.
  • Create the Pub/Sub subscription.
  • Create a Google Cloud Storage bucket for storing data.
  • Create another Google Cloud Storage bucket for running the Dataflow job. Within this bucket, we need a folder for staging and another folder for temp.
  • Create a BigQuery dataset and table. You can create the dataset from the Google Cloud Console UI. For creating the table, though, the simplest way is to use a schema JSON file with the command line. Here's the command:
    bq mk --table project_id:dataset.table schema.json
  • Here's a sample schema JSON file:
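    A minimal example, assuming the two-column "name,value" records used elsewhere in this article (adjust the column names and types for your own data):

    [
      {"name": "name", "type": "STRING", "mode": "REQUIRED"},
      {"name": "value", "type": "FLOAT", "mode": "NULLABLE"}
    ]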

Next we will look at the key elements of each component. You can also find all the code here:
https://github.com/ronald-leung/gcp

Cloud Function to Pub/Sub

  • Create a new Cloud Function and select Cloud Storage as the trigger. To get the function to run on new files, select the Finalize/Create event type. Give the function a proper name.
  • Ideally your code should live in a git repo and be deployed with a Cloud Build YAML file, but for a simple function the inline editor is fine. The code should still be checked into git, and I'd still use a code editor to make sure everything is spaced properly; when you are done, just copy and paste it into the inline editor. Check that the function name in the code matches the entry point configured in the function's settings.
  • The Python function does just a few things: read the file from Google Cloud Storage that triggered the function, parse the content line by line, and publish each line to the Pub/Sub topic. A minimal sketch follows.
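Here is a minimal sketch of such a function, assuming a placeholder project "my-project" and topic "my-topic", with process_new_file as the entry point (the code in the repo may differ):

    # main.py -- minimal sketch; "my-project", "my-topic" and the entry-point
    # name "process_new_file" are placeholders.
    from google.cloud import pubsub_v1, storage

    PROJECT_ID = "my-project"   # placeholder
    TOPIC_ID = "my-topic"       # placeholder

    storage_client = storage.Client()
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

    def process_new_file(event, context):
        """Triggered by a google.storage.object.finalize event on the bucket."""
        # Read the file that triggered this function.
        blob = storage_client.bucket(event["bucket"]).blob(event["name"])
        contents = blob.download_as_text()

        # Publish each non-empty line as its own Pub/Sub message, and wait
        # for the publishes to complete before the function returns.
        futures = [
            publisher.publish(topic_path, data=line.encode("utf-8"))
            for line in contents.splitlines()
            if line.strip()
        ]
        for future in futures:
            future.result()

The function's requirements.txt needs google-cloud-storage and google-cloud-pubsub, and the entry point configured in the console must match the function name (process_new_file here).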

Streaming Dataflow job

  • This is the more complicated component of the pipeline, but it is still a very small amount of Python code. Hopefully the provided sample gives you a head start, and you can customize it for your own data. The main pipeline code is sketched right after this list.
  • The pipeline simply reads from the Pub/Sub subscription, decodes the text data and constructs a JSON-style record, and then writes that record into BigQuery. The text-to-JSON transform is part of the sketch below.
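Here is a minimal sketch of the streaming pipeline, with placeholder names for the subscription and BigQuery table and the same "name,value" record format as before; the to_row function is the text-to-JSON transform:

    # pipeline.py -- minimal sketch; the subscription, table, and record
    # format are placeholders for your own names and data.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"  # placeholder
    TABLE = "my-project:my_dataset.my_table"                            # placeholder

    def to_row(message):
        """Decode a Pub/Sub payload ("name,value") into a dict for BigQuery."""
        name, value = message.decode("utf-8").split(",")
        return {"name": name, "value": float(value)}

    def run():
        # Runner, project, region, and the staging/temp folders created
        # earlier are passed on the command line, e.g. --runner=DataflowRunner
        # --staging_location=gs://... --temp_location=gs://...
        options = PipelineOptions(streaming=True, save_main_session=True)
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
                | "TextToJson" >> beam.Map(to_row)
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    TABLE,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                )
            )

    if __name__ == "__main__":
        run()

With streaming=True the job keeps running and writes rows into BigQuery as messages arrive on the subscription.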

Summary

That is it! These are all the components you need to build a streaming pipeline that moves data from GCS through Pub/Sub and Dataflow into BigQuery.
