GCP Professional Data Engineer Prep.
Preparing for the GCP Professional Data Engineer exam? Don’t know where to start? This post is the GCP Professional Data Engineer Certification Study Guide (with links to each objective in the exam domain).
I have curated a detailed list of articles from the Google documentation and other blogs for each objective of the Google Cloud Certified Professional Data Engineer exam. Please share this post within your circles so it can help others prepare for the exam.
GCP Professional Data Engineer Course Material
GCP Professional Data Engineer Practice Test
GCP Professional Data Engineer Other Materials
Check out all the other GCP certificate study guides
Full Disclosure: Some of the links in this post are affiliate links. I receive a commission when you purchase through them.
Section 1. Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
Mapping storage systems to business requirements
Cloud storage options
Data modeling
Schema and data model
Data model
Introduction to data models in Cloud Datastore
Trade-offs involving latency, throughput, transactions
The trade-off between high throughput and low latency
Optimize database service
Distributed systems
Distributed systems in Google Cloud
Schema design
Schema design for time-series data
Schema design best practices
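A recurring point in the schema-design articles above is avoiding hotspots in time-series tables: a Bigtable row key that starts with a raw timestamp funnels all writes to one tablet. A minimal sketch of the usual fix, leading with a high-cardinality field and reversing the timestamp (the sensor naming and timestamp ceiling here are invented for illustration):

```python
import datetime

# Hypothetical helper: build a Bigtable-style row key for time-series data.
# Leading with the sensor ID spreads writes across tablets; the reversed
# timestamp makes the most recent events sort first within each sensor.
MAX_TS_MILLIS = 10**13  # arbitrary ceiling for millisecond timestamps

def make_row_key(sensor_id: str, event_time: datetime.datetime) -> str:
    millis = int(event_time.timestamp() * 1000)
    reversed_ts = MAX_TS_MILLIS - millis
    return f"{sensor_id}#{reversed_ts:013d}"

key = make_row_key(
    "sensor-42",
    datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc),
)
```

Because later events map to smaller reversed timestamps, a prefix scan on `sensor-42#` returns newest rows first.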
1.2 Designing data pipelines. Considerations include:
Data publishing and visualization (e.g., BigQuery)
Visualizing BigQuery data in a Jupyter notebook
Visualize BigQuery data using Data Studio
Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)
Create a batch processing job on GCP Dataflow
Building Batch Data Pipelines on GCP
Coding a batch processing pipeline with Dataflow & Apache Beam
Run an Apache Spark batch workload
Hadoop ecosystem in GCP
Build a Dataflow pipeline: PubSub to Cloud Storage
Streaming pipelines with Scala and Kafka on GCP
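The Beam and Dataflow articles above all build on the same mental model: a pipeline is a chain of element-wise transforms (FlatMap/Map) followed by grouping and aggregation (GroupByKey/Combine). As a rough, framework-free sketch of what a word-count batch pipeline computes (plain Python standing in for Beam here, so none of these names are Beam's actual API):

```python
from collections import Counter

def word_count(lines):
    """Framework-free stand-in for a Beam-style batch word count:
    read -> FlatMap(split) -> Map(lowercase) -> CombinePerKey(sum)."""
    words = (word.lower() for line in lines for word in line.split())
    return dict(Counter(words))

counts = word_count(["To be or not to be", "that is the question"])
```

In real Beam the same stages become `beam.FlatMap`, `beam.Map`, and `beam.combiners.Count.PerElement()`, and the runner (Dataflow, Spark, or the local DirectRunner) decides how the work is parallelized.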
Online (interactive) vs. batch predictions
Online versus batch prediction
Job automation and orchestration (e.g., Cloud Composer)
Automating infrastructure with Cloud Composer
Choose Cloud Composer for service orchestration
1.3 Designing a data processing solution. Considerations include:
Choice of infrastructure
Processing large-scale data
System availability and fault tolerance
Breaking down Cloud SQL’s 3 fault tolerance mechanisms

Compute Engine Service Level Agreement (SLA)
Use of distributed systems
Google Distributed Cloud
Capacity planning
Manage capacity and quota
Capacity management with load balancing
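Capacity planning, as the articles above describe it, usually reduces to simple arithmetic: take peak load, divide by per-instance throughput, and add headroom for spikes and instance failures. A hypothetical sizing helper (the numbers and the 30% default margin are made up for illustration):

```python
import math

def instances_needed(peak_qps: float, qps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Provision for peak load plus a safety margin, rounded up
    to a whole instance."""
    return math.ceil(peak_qps * (1 + headroom) / qps_per_instance)

n = instances_needed(peak_qps=12000, qps_per_instance=800)
```

Here 12,000 QPS with 30% headroom is 15,600 QPS, which at 800 QPS per instance needs 20 instances; an autoscaler then handles the gap between this ceiling and actual load.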
Hybrid cloud and edge computing
What is Hybrid Cloud?
Announcing Google Distributed Cloud Edge and Hosted (Google Cloud Blog)
Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
Building batch data pipelines on GCP
What is Pub/Sub?
Serverless computing solutions
At least once, in-order, and exactly once, etc., event processing
Exactly-once processing in Google Cloud Dataflow
At least once delivery
Ordering messages
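Pub/Sub guarantees at-least-once delivery, so a consumer can see the same message more than once; the standard way to get effectively-exactly-once processing is idempotent handling keyed on a message ID. A minimal sketch of that consumer-side pattern (the message format is invented for illustration, and the seen-ID set would have to be durable state in production):

```python
class IdempotentConsumer:
    """Processes each message ID at most once, so redelivery is harmless."""

    def __init__(self):
        self.seen_ids = set()   # in production: durable store, not memory
        self.processed = []

    def handle(self, message_id: str, payload: str) -> bool:
        if message_id in self.seen_ids:
            return False        # duplicate delivery: ack and skip
        self.seen_ids.add(message_id)
        self.processed.append(payload)
        return True

consumer = IdempotentConsumer()
consumer.handle("m1", "order created")
consumer.handle("m1", "order created")  # redelivered duplicate is dropped
consumer.handle("m2", "order shipped")
```

Dataflow's exactly-once mode does essentially this for you (plus deterministic checkpointing); with raw Pub/Sub subscribers, the deduplication is your job.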
1.4 Migrating data warehousing and data processing. Considerations include:
Awareness of current state and how to migrate a design to a future state
The four phases of a data center migration to the cloud
Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
Transfer service for on-premises data overview
Transfer appliance
Overview of on-premises to GCP migration
Validating a migration
Verify a migration
Amazon link (affiliate)
Section 2. Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems. Considerations include:
Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
What Cloud Bigtable is good for
Key features of Cloud Spanner
Cloud SQL use cases
BigQuery key features
Datastore overview
Memorystore features
Storage costs and performance
Cloud storage pricing
Performance optimization
Life cycle management of data
Data lifecycle
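The data-lifecycle article above maps directly onto Cloud Storage lifecycle rules: transition objects to colder storage classes as they age, then delete them. A sketch of such a policy, built as the JSON structure a bucket's lifecycle configuration accepts (the 30-day and 365-day thresholds are arbitrary examples, not recommendations):

```python
import json

# Example lifecycle policy: move objects to Nearline after 30 days,
# delete them after a year. Thresholds are illustrative only.
lifecycle_policy = {
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "SetStorageClass",
                           "storageClass": "NEARLINE"},
                "condition": {"age": 30},
            },
            {
                "action": {"type": "Delete"},
                "condition": {"age": 365},
            },
        ]
    }
}

policy_json = json.dumps(lifecycle_policy, indent=2)
```

The same document could be saved to a file and applied with `gsutil lifecycle set` against a bucket.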
2.2 Building and operationalizing pipelines. Considerations include:
Data cleansing
Google Cloud Dataprep: Prepare data of any size
Batch and streaming
Streaming pipelines
Working with data pipelines
Transformation
Creating a data transformation pipeline with Cloud Dataprep
Data acquisition and import
Real-time CDC replication into BigQuery
Best practices for importing and exporting data
Integrating with new data sources
What is data integration?
2.3 Building and operationalizing processing infrastructure. Considerations include:
Provisioning resources
Resource manager
Monitoring pipelines
Using monitoring for Dataflow pipelines
Using the Dataflow monitoring interface
Adjusting pipelines
Setting pipeline options
Testing and quality control
Test GCP Dataflow pipeline
Testing Dataflow pipelines with Cloud Spanner Emulator
Section 3. Operationalizing machine learning models
3.1 Leveraging pre-built ML models as a service. Considerations include:
ML APIs (e.g., Vision API, Speech API)
Detect labels in an image by using client libraries
Detect text in images
Transcribe speech to text by using the Cloud Console
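Under the client libraries linked above sits a plain REST call: you POST an `images:annotate` request whose body carries base64-encoded image bytes and a list of feature types. A sketch that builds such a request body with only the standard library (no network call is made here, and the image bytes are fake):

```python
import base64
import json

def build_annotate_request(image_bytes: bytes,
                           feature_type: str = "LABEL_DETECTION",
                           max_results: int = 5) -> str:
    """Build the JSON body for a Vision API images:annotate request."""
    body = {
        "requests": [
            {
                "image": {
                    # Image content travels as base64 text inside the JSON.
                    "content": base64.b64encode(image_bytes).decode("ascii")
                },
                "features": [
                    {"type": feature_type, "maxResults": max_results}
                ],
            }
        ]
    }
    return json.dumps(body)

request_body = build_annotate_request(b"\x89PNG fake bytes", "TEXT_DETECTION")
```

Swapping the feature type to `LABEL_DETECTION` or `DOCUMENT_TEXT_DETECTION` is all it takes to switch tasks, which is why the client libraries share so much surface area.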
Customizing ML APIs (e.g., AutoML Vision, AutoML Natural Language)
Label images by using AutoML Vision
AutoML natural language API tutorial
Conversational experiences (e.g., Dialogflow)
Dialogflow quickstart
3.2 Deploying an ML pipeline. Considerations include:
Ingesting appropriate data
Introduction to loading data
Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
Training overview
Automated Model retraining with Kubeflow pipelines
Use Dataproc, BigQuery, and Apache Spark ML
Continuous evaluation
Continuous evaluation overview
3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
Distributed vs. single machine
Distributed training structure
Use of edge compute
Bringing intelligence to the edge with Cloud IoT
Hardware accelerators (e.g., GPU, TPU)
Using GPUs for training models in the cloud
Using TPUs to train your model
3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
Machine Learning glossary
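Of the glossary terms above, evaluation metrics come up most on the exam: precision, recall, and F1 all fall out of the confusion-matrix counts. A quick worked example in plain Python:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp)
    # Recall: of all actual positives, how many did we find?
    recall = tp / (tp + fn)
    # F1 is the harmonic mean, penalizing imbalance between the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

With 80 true positives, 20 false positives, and 40 false negatives, precision is 0.8, recall is 2/3, and F1 lands between them at 8/11 (about 0.727).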
Impact of dependencies of machine learning models
Data dependencies
Common sources of error (e.g., assumptions about data)
Assumptions of common Machine Learning models
Section 4. Ensuring solution quality
4.1 Designing for security and compliance. Considerations include:
Identity and access management (e.g., Cloud IAM)
Identity and Access Management
Overview of IAM
Data security (encryption, key management)
Encryption at rest in Google Cloud
Encryption in Transit in Google Cloud
Cloud Key Management Service deep dive
Ensuring privacy (e.g., Data Loss Prevention API)
Cloud Data Loss Prevention
Cloud Data Loss Prevention (DLP) API client library
Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
HIPAA compliance on Google Cloud Platform
COPPA compliance
FedRAMP marketplace
GDPR and Google Cloud
4.2 Ensuring scalability and efficiency. Considerations include:
Building and running test suites
Using Cloud Build as a test runner
Pipeline monitoring (e.g., Cloud Monitoring)
Monitoring your Dataflow pipelines
Assessing, troubleshooting, and improving data representations and data processing infrastructure
Troubleshooting service infrastructure
Global infrastructure
Resizing and autoscaling resources
gcloud compute disks resize
Resizing a cluster
Autoscaling groups of instances
4.3 Ensuring reliability and fidelity. Considerations include:
Performing data preparation and quality control (e.g., Dataprep)
A peek into data preparation using Google Cloud Dataprep
Improve data quality for ML and analytics with Cloud Dataprep
Verification and monitoring
Validating data at scale for machine learning
Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Disaster recovery scenarios for data
Restartable jobs
Breaking down Cloud SQL’s 3 fault tolerance mechanisms
Choosing between ACID, idempotent, eventually consistent requirements
Balancing Strong and Eventual Consistency with Datastore
4.4 Ensuring flexibility and portability. Considerations include:
Mapping to current and future business requirements
Best practices for enterprise organizations
Designing for data and application portability (e.g., multicloud, data residency requirements)
Meet data residency requirements with Google Cloud
Hybrid and multi-cloud patterns and practices
Data staging, cataloging, and discovery
What is Data Catalog?
This brings us to the end of the GCP Professional Data Engineer Study Guide.
What do you think? Let me know in the comments section if I have missed anything. Also, I’d love to hear from you about how your preparation is going!
In case you are preparing for other GCP certification exams, check out the GCP study guide for those exams.
Follow Me to Receive Updates on GCP Exams
Want to be notified as soon as I post? Subscribe to the RSS feed or leave your email address in the subscribe section. Share the article on your social networks using the links below so it can benefit others.