Preparing for the GCP Professional Data Engineer exam? Don't know where to start? This post is the GCP Professional Data Engineer Certification Study Guide, with links to each objective in the exam domain. I have curated a detailed list of articles from the Google documentation and other blogs for each objective of the Google Cloud Platform certified Professional Data Engineer exam. Please share the post within your circles so it helps them prepare for the exam.

- GCP Professional Data Engineer Course Material
- GCP Professional Data Engineer Practice Test
- GCP Professional Data Engineer Other Materials
- Check out all the other GCP certification study guides

Full Disclosure: Some of the links in this post are affiliate links. I receive a commission when you purchase through them.

Section 1. Designing data processing systems

1.1 Selecting the appropriate storage technologies. Considerations include:

Mapping storage systems to business requirements
- Cloud storage options

Data modeling
- Schema and data model
- Data model
- Introduction to data models in Cloud Datastore

Trade-offs involving latency, throughput, transactions
- The trade-off between high throughput and low latency
- Optimize database service

Distributed systems
- Distributed systems in Google Cloud

Schema design
- Schema design for time-series data
- Schema design best practices

1.2 Designing data pipelines.
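Before the link list for this objective, here is a minimal pure-Python sketch of the read, transform, aggregate, write shape that Dataflow/Beam batch pipelines follow. It has no Beam dependency, and all names are mine rather than from any GCP API:

```python
def run_batch_pipeline(lines):
    """Mimic the shape of a Beam/Dataflow batch word-count pipeline:
    a flattening transform (like ParDo), a grouped aggregation (like
    GroupByKey + Combine), and a deterministic sink."""
    # "ParDo"-like step: flatten each line into lowercase words
    words = (w.lower() for line in lines for w in line.split())
    # "GroupByKey + Combine"-like step: count occurrences per word
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # "sink": emit results in a deterministic order
    return sorted(counts.items())

run_batch_pipeline(["the quick fox", "the lazy dog"])
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In a real Dataflow job each step would be a PTransform and the runner would parallelize it across workers; the logical flow of the data is the same.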
Considerations include:

Data publishing and visualization (e.g., BigQuery)
- Visualizing BigQuery data in a Jupyter notebook
- Visualize BigQuery data using Data Studio

Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)
- Create a batch processing job on GCP Dataflow
- Building Batch Data Pipelines on GCP
- Coding a batch processing pipeline with Dataflow & Apache Beam
- Run an Apache Spark batch workload
- Hadoop ecosystem in GCP
- Build a Dataflow pipeline: Pub/Sub to Cloud Storage
- Streaming pipelines with Scala and Kafka on GCP

Online (interactive) vs. batch predictions
- Online versus batch prediction

Job automation and orchestration (e.g., Cloud Composer)
- Automating infrastructure with Cloud Composer
- Choose Cloud Composer for service orchestration

1.3 Designing a data processing solution. Considerations include:

Choice of infrastructure
- Processing large-scale data

System availability and fault tolerance
- Breaking down Cloud SQL's 3 fault tolerance mechanisms
- Compute Engine Service Level Agreement (SLA)

Use of distributed systems
- Google Distributed Cloud

Capacity planning
- Manage capacity and quota
- Capacity management with load balancing

Hybrid cloud and edge computing
- What is Hybrid Cloud?
- Announcing Google Distributed Cloud Edge and Hosted (Google Cloud Blog)

Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
- Building batch data pipelines on GCP
- What is Pub/Sub?
- Serverless computing solutions

At least once, in-order, and exactly once, etc., event processing
- Exactly-once processing in Google Cloud Dataflow
- At-least-once delivery
- Ordering messages

1.4 Migrating data warehousing and data processing.
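One of the 1.4 objectives, validating a migration, often comes down to comparing cheap summaries of the source and destination tables. Here is a hedged pure-Python sketch; the fingerprint scheme is my own illustration, not a GCP tool:

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint of a table: hash each row, then XOR
    the digests together, so source and destination can be compared
    cheaply even if the two systems return rows in different orders."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        acc ^= int.from_bytes(digest, "big")
    # return the row count too: XOR alone cannot detect duplicated pairs
    return len(rows), acc

source = [(1, "alice"), (2, "bob")]
dest   = [(2, "bob"), (1, "alice")]     # same data, different order
table_fingerprint(source) == table_fingerprint(dest)   # True
```

In practice you would also compare per-column aggregates (counts, sums, min/max) on both sides, which is what most migration-validation checklists recommend.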
Considerations include:

Awareness of current state and how to migrate a design to a future state
- The four phases of a data center migration to the cloud

Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
- Transfer service for on-premises data overview
- Transfer Appliance
- Overview of on-premises to GCP migration

Validating a migration
- Verify a migration

Amazon link (affiliate)

Section 2. Building and operationalizing data processing systems

2.1 Building and operationalizing storage systems. Considerations include:

Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
- What Cloud Bigtable is good for
- Key features of Cloud Spanner
- Cloud SQL use cases
- BigQuery key features
- Datastore overview
- Memorystore features

Storage costs and performance
- Cloud Storage pricing
- Performance optimization

Life cycle management of data
- Data lifecycle

2.2 Building and operationalizing pipelines. Considerations include:

Data cleansing
- Google Cloud Dataprep: Prepare data of any size

Batch and streaming
- Streaming pipelines
- Working with data pipelines

Transformation
- Creating a data transformation pipeline with Cloud Dataprep

Data acquisition and import
- Real-time CDC replication into BigQuery
- Best practices for importing and exporting data

Integrating with new data sources
- What is data integration?

2.3 Building and operationalizing processing infrastructure. Considerations include:

Provisioning resources
- Resource Manager

Monitoring pipelines
- Using monitoring for Dataflow pipelines
- Using the Dataflow monitoring interface

Adjusting pipelines
- Setting pipeline options

Testing and quality control
- Test GCP Dataflow pipeline
- Testing Dataflow pipelines with Cloud Spanner Emulator

Section 3. Operationalizing machine learning models

3.1 Leveraging pre-built ML models as a service.
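The pre-built ML APIs covered below return predictions with confidence scores, and a common first processing step is thresholding on that score. A small sketch follows; the response shape only loosely mirrors what a label-detection API returns, and the field names are illustrative:

```python
def filter_labels(annotations, min_score=0.7):
    """Keep only the predictions the model is reasonably confident about.
    Each annotation is a dict with a 'description' and a confidence
    'score' in [0, 1] (illustrative shape, not the exact API schema)."""
    return [a["description"] for a in annotations if a["score"] >= min_score]

response = [
    {"description": "cat", "score": 0.98},
    {"description": "mammal", "score": 0.83},
    {"description": "sofa", "score": 0.41},
]
filter_labels(response)   # ['cat', 'mammal']
```

Choosing the threshold is itself a product decision: a stricter cutoff trades recall for precision, which connects to the evaluation metrics covered in objective 3.4.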
Considerations include:

ML APIs (e.g., Vision API, Speech API)
- Detect labels in an image by using client libraries
- Detect text in images
- Transcribe speech to text by using the Cloud Console

Customizing ML APIs (e.g., AutoML Vision, AutoML text)
- Label images by using AutoML Vision
- AutoML Natural Language API tutorial

Conversational experiences (e.g., Dialogflow)
- Dialogflow quickstart

3.2 Deploying an ML pipeline. Considerations include:

Ingesting appropriate data
- Introduction to loading data

Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
- Training overview
- Automated model retraining with Kubeflow Pipelines
- Use Dataproc, BigQuery, and Apache Spark ML

Continuous evaluation
- Continuous evaluation overview

3.3 Choosing the appropriate training and serving infrastructure. Considerations include:

Distributed vs. single machine
- Distributed training structure

Use of edge compute
- Bringing intelligence to the edge with Cloud IoT

Hardware accelerators (e.g., GPU, TPU)
- Using GPUs for training models in the cloud
- Using TPUs to train your model

3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:

Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
- Machine Learning Glossary

Impact of dependencies of machine learning models
- Data dependencies

Common sources of error (e.g., assumptions about data)
- Assumptions of common machine learning models

Section 4. Ensuring solution quality

4.1 Designing for security and compliance.
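A recurring idea in the data-security links below is envelope encryption, the pattern Cloud KMS supports: data is encrypted with a data-encryption key (DEK), and the DEK is itself encrypted ("wrapped") by a key-encryption key (KEK) held in the key management service. A toy pure-Python sketch of the flow, with XOR standing in for a real cipher (never use XOR for actual encryption):

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for a real cipher: XOR with a repeating key.
    Illustration only; a real system would use AES via Cloud KMS."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def envelope_encrypt(plaintext: bytes, kek: bytes):
    """Encrypt the data with a fresh DEK, then wrap the DEK with the KEK.
    Only the wrapped DEK is stored alongside the ciphertext."""
    dek = secrets.token_bytes(32)
    return xor_bytes(plaintext, dek), xor_bytes(dek, kek)

def envelope_decrypt(ciphertext: bytes, wrapped_dek: bytes, kek: bytes):
    dek = xor_bytes(wrapped_dek, kek)   # the KMS unwraps the DEK...
    return xor_bytes(ciphertext, dek)   # ...then the DEK decrypts the data

kek = secrets.token_bytes(32)
ct, wrapped = envelope_encrypt(b"patient record", kek)
envelope_decrypt(ct, wrapped, kek)   # b'patient record'
```

The benefit of the pattern is that the KEK never leaves the KMS and each object gets its own DEK, so rotating the KEK only requires re-wrapping small keys, not re-encrypting all the data.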
Considerations include:

Identity and access management (e.g., Cloud IAM)
- Identity and Access Management
- Overview of IAM

Data security (encryption, key management)
- Encryption at rest in Google Cloud
- Encryption in transit in Google Cloud
- Cloud Key Management Service deep dive

Ensuring privacy (e.g., Data Loss Prevention API)
- Cloud Data Loss Prevention
- Cloud Data Loss Prevention (DLP) API client library

Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children's Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
- HIPAA compliance on Google Cloud Platform
- COPPA compliance
- FedRAMP Marketplace
- GDPR and Google Cloud

4.2 Ensuring scalability and efficiency. Considerations include:

Building and running test suites
- Using Cloud Build as a test runner

Pipeline monitoring (e.g., Cloud Monitoring)
- Monitoring your Dataflow pipelines

Assessing, troubleshooting, and improving data representations and data processing infrastructure
- Troubleshooting service infrastructure
- Global infrastructure

Resizing and autoscaling resources
- gcloud compute disks resize
- Resizing a cluster
- Autoscaling groups of instances

4.3 Ensuring reliability and fidelity. Considerations include:

Performing data preparation and quality control (e.g., Dataprep)
- A peek into data preparation using Google Cloud Dataprep
- Improve data quality for ML and analytics with Cloud Dataprep

Verification and monitoring
- Validating data at scale for machine learning

Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
- Disaster recovery scenarios for data
- Restartable jobs
- Breaking down Cloud SQL's 3 fault tolerance mechanisms

Choosing between ACID, idempotent, eventually consistent requirements
- Balancing strong and eventual consistency with Datastore

4.4 Ensuring flexibility and portability.
Considerations include:

Mapping to current and future business requirements
- Best practices for enterprise organizations

Designing for data and application portability (e.g., multicloud, data residency requirements)
- Meet data residency requirements with Google Cloud
- Hybrid and multi-cloud patterns and practices

Data staging, cataloging, and discovery
- What is Data Catalog?

This brings us to the end of the GCP Professional Data Engineer Study Guide. What do you think? Let me know in the comments section if I have missed anything. Also, I would love to hear how your preparation is going! In case you are preparing for other GCP certification exams, check out the GCP study guides for those exams.

Follow Me to Receive Updates on GCP Exams

Want to be notified as soon as I post? Subscribe to the RSS feed or leave your email address in the subscribe section. Share the article on your social networks with the links below so it can benefit others.

How do I prepare for the Google Professional Data Engineer certification?

Review the exam overview and follow the learning path. Prepare for the exam by exploring online training, in-person classes, hands-on labs, and other resources from Google Cloud; take a webinar, and use the additional resources for in-depth discussions on the concepts and critical components of Google Cloud.

How do I prepare for the Google Cloud Professional Data Engineer exam?

To get well prepared for the exam, I encourage you to complete the official Data Engineer course videos and read about the best practices for GCP products, followed by the ML Crash Course provided by Google. By combining your studies with your hands-on experience, you should be ready to pass the exam. Good luck!
How difficult is the Google Cloud Data Engineer certification?

It's not an easy certification. Earning it means you don't just understand what all of Google's cloud-based data processing and analysis tools and utilities do; it means you know how to incorporate them into plans, deploy them on your networks, and manage their operations after they're up and running.
How long does it take to prepare for the GCP Data Engineer exam?

It could take you one to three months of preparation, depending on your experience and background in cloud data engineering. There are five courses in the Specialization, including Google Cloud Platform Big Data and Machine Learning Fundamentals and Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform.