Preparing For And Taking (and passing) The Google Cloud Professional Data Engineer Exam

Here are my notes on the preparation I did and the contents of the Google Data Engineer exam (taken June 2019).

The exam was updated at the end of March 2019 and there is little information online on preparing for the new version. The main changes seem to be that they have added more machine learning questions (plus questions on newly added APIs and offerings), beefed up BigQuery a bit, and removed the "case studies" - specific use cases you were expected to be familiar with before the exam. I have also heard that the old version of the exam asked BigQuery SQL questions, whereas I got none of these (a shame, as we have a lot of SQL experience between us - BigQuery syntax is pretty similar to GoogleSQL!).

They have also added questions on Cloud Composer (an Airflow wrapper) and the Data Loss Prevention API (an API for finding PII or other sensitive information that may have made its way into your data by accident).

My previous experience

I have very little previous "Cloud" experience. I do have many years of (now slightly outdated) software development experience, plus years of working, albeit in a roundabout and incomplete way, with release and infrastructure for enterprise software (at Oracle) - the stuff I worked on was released to Oracle Cloud, but that knowledge wasn't really that useful here. I also have years of designing software to solve data problems (bizarrely, so many of the pipeline design problems we went back and forth over at my old company 15 years ago are concepts that are still relevant - batch/streaming/windowing, etc.!)

I also have a strong background in maths, a fair amount of self-taught machine learning experience (very useful, as I didn't need to learn all this stuff), good knowledge of RDBMS principles plus their practical application (no previous experience with columnar or NoSQL databases whatsoever, though I know the principles), and a general idea of how software fits together, absorbed over the years. I have some experience with data visualization (Tableau, Plx dashboards (lol)), and Dataprep (Google's data quality/visual data cleansing tool) is very, very similar to the product I worked on at Oracle for years and years (except Dataprep is better - oh dear...).

I also knew the UI/Console from Raybeam work with App Engine (though only a very limited part of it!)

If you are a new graduate/relatively inexperienced, passing this exam will be possible, but I think it will probably take you longer than me to study for it. 

Preparation

There is so much to know.  And everything is known as Google Data XX or Google Big YY (my kids have suggested a new service called Google BigPoo <eye roll>). 

Coursera - We get free access to the Coursera Google Cloud courses. I took all five Coursera "data" courses (I did not quite finish the ML one as it was a bit crap and repetitive). I dabbled with the "Preparing for the Exam" one but did not find it very good, and it had not been updated for the latest version of the exam (it still had whole sections on the case studies, etc.). Honestly, I didn't really like the Coursera courses - so much listening to a dude talking - I much prefer reading and doing in order to learn. But they are useful and give a lot of fairly in-depth info. And the Qwiklabs attached to them are great, as they get you doing the stuff you have just fallen asleep listening to...

Qwiklabs - We also get enough free credits on Qwiklabs to do as many labs as you are likely to want. Qwiklabs are really good. You could try to do the exam without having hands-on experience, but I would not recommend it. There are not many exam questions on how to use the console itself, or on the specific API calls you need to use, etc. - but hands-on experience is so useful for just getting your head round everything. A list of some of the labs that I did:

IAM

GSP190 for IAM roles

  • Teaches the concepts of custom roles and predefined roles

GSP064 for basic roles

Security/Keys

GSP499 User Authentication: Identity Aware Proxy

GSP079 Getting started with Cloud KMS

BigQuery

GSP410 Creating Permanent Tables and Access-Controlled Views in BigQuery

  • Teaches you how to create tables in BigQuery, with an emphasis on things that can go wrong/errors

  • Discusses creating views and controlling access to them.

https://cloud.google.com/bigquery/docs/views-intro
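
To make the idea concrete, here's a minimal sketch (not from the lab itself) of creating a view and then authorizing it against its source dataset, using the google-cloud-bigquery Python client - all project, dataset and table names here are made up:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    # Create the view with standard SQL DDL.
    client.query("""
        CREATE OR REPLACE VIEW reporting.daily_orders AS
        SELECT order_date, SUM(amount) AS total
        FROM raw_data.orders
        GROUP BY order_date
    """).result()

    # Authorize the view on the source dataset, so people who can query the
    # view do not need direct access to raw_data.orders.
    source_dataset = client.get_dataset("my-project.raw_data")
    view = client.get_table("my-project.reporting.daily_orders")
    entries = list(source_dataset.access_entries)
    entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
    source_dataset.access_entries = entries
    client.update_dataset(source_dataset, ["access_entries"])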

GSP414 Creating Date-Partitioned Tables in BigQuery

  • Teaches you how to create date-partitioned tables

  • Why queries then process less data

  • Teaches you to auto-expire old data
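
Something like the following covers the same ground as the lab - a hedged sketch with made-up table and column names, using the BigQuery Python client and standard SQL DDL:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Partitioned by day on event_ts; partitions older than 60 days auto-expire.
    client.query("""
        CREATE TABLE IF NOT EXISTS logs.events
        (event_ts TIMESTAMP, user_id STRING, payload STRING)
        PARTITION BY DATE(event_ts)
        OPTIONS (partition_expiration_days = 60)
    """).result()

    # Filtering on the partitioning column means only matching partitions are
    # scanned, which is why the query processes less data.
    job = client.query("""
        SELECT COUNT(*) AS n FROM logs.events
        WHERE DATE(event_ts) = '2019-06-01'
    """)
    job.result()
    print(job.total_bytes_processed)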

GSP071 BigQuery Qwik Start - Command Line

  • A nice little lab on creating BigQuery datasets from the command line

GSP413 Creating a data warehouse through joins and unions

Maybe not as useful for someone who is familiar with GoogleSQL. It does have some useful bits:

  • Pinning projects from elsewhere (e.g. a public dataset)

  • Looking at the Natural Language API

  • Querying tables with suffixes (can use _TABLE_SUFFIX)
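
As a quick illustration of the _TABLE_SUFFIX point, here's a small sketch (mine, not the lab's) of a wildcard query over the public NOAA GSOD dataset, which has one table per year:

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT _TABLE_SUFFIX AS year, COUNT(*) AS readings
        FROM `bigquery-public-data.noaa_gsod.gsod*`
        WHERE _TABLE_SUFFIX BETWEEN '1990' AND '1995'
        GROUP BY year
        ORDER BY year
    """
    for row in client.query(sql).result():
        print(row.year, row.readings)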

GSP292 Analyzing Financial Time Series using BigQuery and Cloud Datalab

  • Another Qwiklab where you fire up Datalab and just run through a notebook

  • Useful to see how you can run BigQuery from Datalab using %%bq commands - pretty useful stuff!

  • You can run SQL commands on BigQuery through this and read the results into a pandas dataframe, from where you can do visualization
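
If you don't fancy firing up Datalab, the same idea works with the plain BigQuery Python client - this is not the lab's %%bq magic, just a rough equivalent:

    from google.cloud import bigquery

    client = bigquery.Client()
    df = client.query("""
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """).to_dataframe()

    print(df.head())  # from here you can plot with pandas/matplotlib, as in the lab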

BigTable

GSP099 BigTable Qwik Start - Command Line

  • Create a BigTable instance using the console

  • Access it using cbt on the command line, adding a table, a column family and one row to the table
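
The Python equivalent of those cbt steps looks roughly like this - a sketch only, assuming the google-cloud-bigtable client and an instance that already exists; all names are made up:

    from google.cloud import bigtable
    from google.cloud.bigtable import column_family

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("my-instance")

    # Create a table with one column family.
    table = instance.table("my-table")
    table.create()
    table.column_family("cf1", column_family.MaxVersionsGCRule(1)).create()

    # Write a single row.
    row = table.direct_row(b"r1")
    row.set_cell("cf1", b"c1", b"hello")
    row.commit()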

GSP098 BigTable Qwik Start - HBase Shell

  • Does the above example from GSP099 using the HBase shell

GSP142 Using OpenTSDB to Monitor Time-Series Data on Cloud Platform

  • I did this one as it involved BigTable and I was trying to revise BigTable

  • In reality it was not that useful for that - it's more of an architecture-type project, firing up various things in Kubernetes and gluing them together

  • It’s quite involved and probably useful if you want to know Kubernetes better

Others

GSP285 Streaming IoT Kafka to Google Cloud Pub/Sub

  • This one is quite involved and quite difficult to understand what is going on!

GSP403 How to Build a BI Dashboard with Google Data Studio and BigQuery

GSP089 Stackdriver: Qwik Start

  • A nice intro to Stackdriver, but not data-product-centric at all.

GSP283 Cloud Composer: Copying BigQuery data across different locations

  • An intro to Cloud Composer (an Airflow wrapper) - there's a minimal DAG sketch below
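
To give a flavour of what Composer actually runs, here is a minimal Airflow DAG sketch (not the lab's code - the bucket, dataset and schedule are all made up) that exports a BigQuery table and copies the export to another bucket:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG("copy_bq_data",
             start_date=datetime(2019, 6, 1),
             schedule_interval="@daily") as dag:

        # Export a BigQuery table to Cloud Storage as Avro files.
        export = BashOperator(
            task_id="export_table",
            bash_command="bq extract --destination_format AVRO "
                         "mydataset.mytable gs://src-bucket/export/part-*.avro",
        )

        # Copy the exported files to a bucket in another location.
        copy = BashOperator(
            task_id="copy_to_dest",
            bash_command="gsutil -m cp gs://src-bucket/export/* gs://dest-bucket/export/",
        )

        export >> copy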

GSP430 Creating a Transformation Pipeline with Cloud Dataprep

  • Very useful hands-on work with Dataprep

Linux Academy - Linux Academy is great. Unfortunately they haven't yet updated their Data Engineer course to the latest exam content, but by the time any of you get to do it, most likely they will have (the update is in progress at the moment). Listen to the lectures and pay attention every time he says "This may be in the exam" (hint: it may be in the exam :-)). There is also a very helpful Slack group that I only found shortly before the exam. The exam course instructor is on it, he is super nice, and very responsive. I am recommending to Wes/Bob that Raybeam pay for Linux Academy membership for 1-2 months for anyone who is serious about doing one of the exams - it's the best resource out there, I think!

A couple of “cheat sheets” that I found useful:

https://github.com/ml874/Data-Engineering-on-GCP-Cheatsheet/blob/master/data_engineering_on_GCP.pdf

https://www.slideshare.net/GuangXu5/gcp-data-engineer-cheatsheet

This guy's blog post - basically one of the only things I could find online about actual experience of the new version of the exam:

https://deploy.live/blog/google-cloud-certified-professional-data-engineer/

Machine Learning Crash Course from Google + Google's ML glossary (do the test questions on the crash course, they may be useful, ahem). From the crash course, I only read the articles and did the quiz questions; I did not bother with the videos.

Google's documentation! I've spent so long on this site that my eyes have gone skewiff. (Honestly, preparing for this exam has not been good for my eyesight!) The docs in general are really good.

In retrospect, Google has loads of hands-on tutorials (referenced throughout the docs) that look a bit more in-depth than Qwiklabs - it may be possible to do these by firing up a Qwiklabs account but following these tutorials instead.

Lily's notes - Lily has written much more organized notes than I have (I do have notes, but they would need some tidying up before I could share them). They don't cover the whole syllabus, as she hasn't finished preparing yet, but those she has done are very thorough - I am sure she will share access with anyone who wants them!

Test exams

Warning: if you google for example questions, you will get loads of fairly dodgy sites offering test exams. They apparently all have very similar questions and often wrong answers. For test exams, I would suggest sticking with the official Google practice test, Coursera's test, and the Linux Academy test (be aware that the Linux Academy test is much easier than the real one, as some of the questions are things like "What technology is Dataflow based on?" rather than situational - you won't get anything as easy as this in the exam!)

How long it took me to prepare

It's taken me 2 months from start to finish. I'm on quite an unstressful project at the moment - I haven't been doing much revision in actual work time (occasionally a bit on a Friday, perhaps); it has mostly been after work - but my current project is easy, doesn't take up much headspace, and I am not working long hours on it at the moment. If your project is stressful, it will take you longer than this (unless you have stacks of Google Cloud experience already).

In addition, I have spent WAY more time on it than I expected in the last month. For the last three weeks I have worked on it most evenings and sometimes before work, and I spent most of the final weekend working on it too. Just to say, it's not a quick job (and I don't think I breezed through the exam either - more details below).

The exam

So the exam is 2 hours long and multiple choice, with 50 questions. For some questions you have to select 2+ answers (from 4 or 5 options); if there's more than one answer, it always tells you how many. You have the chance to mark specific questions to review later, and you can look through your answers as many times as you like after you've made a first pass at all the questions.

When I did test exams, they took me about 30 minutes. In the real exam, it took me 1hr 15 for a first pass, and then I spent the last 45 minutes looking over the questions. I was still debating one last question with 40 seconds to go! Hardly any of the questions are straightforward. A lot of the questions were based on concepts that I felt I was solid on, but they would try to test you on some small detail. A lot had answers that sounded feasible but that I think were made up :-) And almost all the questions had some complicated scenario (think: "The client has 50 Kafka servers connected to 100,000 IoT devices, each emitting 20 records per second. The responses are passed as JSON via a hybrid cloud into AWS. They are having performance problems with the Kafka servers and would like to move to Google Cloud - what cloud-native solution would you suggest?" - this is a made-up example, but feasible) - and after about 25 of these your brain begins to fry.

It was harder than I expected. I was not sure whether I had passed (I was hoping to feel fairly confident I had passed when I clicked the submit button, and I really wasn't!) You find out whether you passed or failed straight away (but you don't get a mark). Below is my brain dump of the questions I got (without sharing specifics). I do not think you will be able to pass this exam by blagging it/vaguely knowing what all the services are and applying logic/common sense - you will need to know things in depth.

Exam brain dump

These were the areas I was tested on most (I may have forgotten some that I found easy!):

Storage

  • Lots of questions on where to store things: BigQuery vs Cloud Storage, Nearline, Coldline, regional/multi-regional (is there a trade-off between performance and availability here?)

Kafka

  • Know the existence of the Kafka Pub/Sub connector

  • Kafka was mentioned a lot, mostly with regard to migrations though, so you didn't actually need to know much about it in the end. I just got a wee panic in my chest every time I saw "Kafka", as I know little about it!

Mention of IoT (but you didn't really need to know much about it, just that there is an IoT service that can sit in front of Pub/Sub)

Pub/sub

  • Monitoring

  • Knowing why things may go wrong (e.g. subscriber not processing fast enough)

BigQuery

  • Oh so much BigQuery

  • No syntax questions

  • Streaming data - when to do that, and how to make data available ASAP (there's a small sketch after this list)

  • Partitioning tables, including how to make queries process less data

  • Clustering tables (clustering is new to BigQuery so make sure you know it, the questions were actually really easy on this)

  • How to verify your BigQuery migration (I didn’t know this!)

  • Storing data in BigQuery vs outside

  • BigQuery point in time snapshots.

  • Authorized views (of course - they are obsessed with authorized views!)

  • Lots on updating tables and merging tables (in general BigQuery is not good at updates, so you should avoid them where possible)
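
Here's a hedged sketch pulling a few of these together - a partitioned and clustered table created with DDL, then rows made available almost immediately via the streaming API (the table and column names are invented):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Clustering currently requires the table to be partitioned as well;
    # cluster on the columns you filter on most often.
    client.query("""
        CREATE TABLE IF NOT EXISTS analytics.page_views
        (view_ts TIMESTAMP, country STRING, url STRING)
        PARTITION BY DATE(view_ts)
        CLUSTER BY country, url
    """).result()

    # Streaming inserts are queryable within seconds (at extra cost), unlike
    # batch loads, which are free but slower to land.
    errors = client.insert_rows_json(
        "analytics.page_views",
        [{"view_ts": "2019-06-01 12:00:00", "country": "GB", "url": "/home"}],
    )
    print(errors)  # an empty list means all rows were accepted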

BigTable 

  • Keys (easy question)

  • Monitoring to automatically scale

  • Multi-cluster use cases, including single-cluster vs multi-cluster replication

Dataflow

  • Side inputs, side outputs, handling errors (see the sketch below)

  • Know Apache Beam, but not in detail (e.g. know ParDo and all the steps, etc.). No code as such.
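
For the Beam side of things, here's a minimal sketch (Python SDK, runs locally on the direct runner) of a ParDo with a main output and an error side output - the sort of pattern the questions allude to; it's purely illustrative:

    import apache_beam as beam
    from apache_beam import pvalue


    class ParseRecord(beam.DoFn):
        def process(self, line):
            try:
                user, value = line.split(",")
                yield {"user": user, "value": int(value)}
            except ValueError:
                # Bad rows go to a tagged side output instead of failing the job.
                yield pvalue.TaggedOutput("errors", line)


    with beam.Pipeline() as p:
        results = (
            p
            | beam.Create(["alice,10", "bob,oops", "carol,7"])
            | beam.ParDo(ParseRecord()).with_outputs("errors", main="parsed")
        )
        results.parsed | "PrintGood" >> beam.Map(print)
        results.errors | "PrintBad" >> beam.Map(print)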

Machine learning

  • Feature crosses

  • L1/L2 Regularization

  • Test/Train overfit

  • Nothing on Tensorflow itself

  • The ML APIs, e.g. AutoML Vision vs the Vision API, etc.

  • BigQuery ML - productionizing (see the sketch after this list)

  • TPUs
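
And a hedged BigQuery ML sketch that touches a couple of the points above (regularization, and "productionizing" by serving predictions straight from BigQuery); the dataset, model and column names are all made up:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a logistic regression in BigQuery itself, with L2 regularization
    # to damp overfitting.
    client.query("""
        CREATE OR REPLACE MODEL marketing.churn_model
        OPTIONS (model_type = 'logistic_reg',
                 l2_reg = 0.1,
                 input_label_cols = ['churned']) AS
        SELECT tenure_months, monthly_spend, churned
        FROM marketing.customers
    """).result()

    # Productionizing can be as simple as scheduling this query - predictions
    # are served straight out of BigQuery, no model export needed.
    preds = client.query("""
        SELECT * FROM ML.PREDICT(MODEL marketing.churn_model,
          (SELECT tenure_months, monthly_spend FROM marketing.new_customers))
    """).to_dataframe()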

Data loss Prevention API

  • One question on this, but it was dead easy

DataPrep

  • How to run Dataprep jobs on a schedule (i.e. know that it's Dataflow behind the scenes)

  • Basically know what Dataprep is and when to use it.

Transfer Appliance, etc.

  • Questions on how to migrate into the cloud. Swot up on Transfer Appliance, etc., as these are easy points.

Dataproc

  • Preemptible workers (graceful shutdown)

  • Scaling your clusters

  • Nowt on Hadoop framework (Hive, Pig, etc)

  • SSD rather than HDD for improving performance

IAM

  • Hard one on IAM across an organization

  • Easier ones on IAM within individual services - know the quirks of the IAM roles for each service

  • Know about service accounts

Cloud SQL

  • I didn't get any questions on Cloud SQL apart from an easy one where the answer was "Cloud SQL"

Cloud Spanner

  • As above

Datastore

  • I had a very confusing question about backing up Datastore (with two answers to select)

gsutil rsync - know it and when to use it

Cloud Composer vs Cloud Scheduler vs cron jobs on Compute Engine

  • Learn when to use which (I don’t think I got all of these right!)

How to monitor a Compute Engine-hosted database (fluentd)

Data Studio/Datalab

  • Nothing on Data Studio or Datalab for me! (But I am sure there are potential questions in the question bank)

Key Management/security

  • Nothing directly on this either

Conclusion - is it worth taking the exam?

Why take the exam?  

Several reasons:

  • The most obvious one - Raybeam need certified cloud practitioners in order to be an official Google (and AWS) partner

  • You’ll learn a lot about Google’s data offerings, and you’ll get a breadth of knowledge that you are unlikely to get from working on real projects (just knowing what is available is useful)

  • It’s kind of fun until the last week when you have to start a bit of a cram fest

Will it make me a google cloud expert?

In a word, no.  

What it will do is let you know what is out there and the advantages and disadvantages of each offering. It should help you design and recommend full cloud solutions (note the word help - I'm not sure it makes you qualified to do this end to end).

But you won’t be an expert until you’ve done this for real, probably several times!  

If you already have cloud experience, particularly with Google products, I would still (or even more so) recommend the exam, because you inevitably will not have worked with all the offerings covered in the exam, and it will give you a broad knowledge of what's out there. Note, however, that Google (and the other cloud providers) are constantly changing and adding new services - so you will need to stay on top of it to stay current (hence why the certification is only valid for two years).

Good luck, and may the odds be ever in your favour (sorry, the kids are into The Hunger Games at the moment :-))
