GCP DATA
GCP Practice Questions: Storage, Databases, and Big Data
You are designing a relational data repository on Google Cloud to grow as needed. The data will be transactionally consistent and added from any location in the world. You want to monitor and adjust node count for input traffic, which can spike unpredictably. What should you do?
What is the difference between a deep and wide neural network? What would you use a deep AND wide neural network for? (Choose all that apply)
Your company is planning the infrastructure for a new large-scale application that will need to store over 100 TB (up to a petabyte) of data in NoSQL format, with low-latency read/write access and high-throughput analytics. Which storage option should you use?
For this question, refer to the MJTelco case study here: https://cloud.google.com/certification/guides/data-engineer/casestudy-mjtelco. MJTelco needs to develop their machine learning model to control topology definitions. There are a large number of possible configurations to achieve the best results. What components of their machine learning model would they adjust to account for increased complexity? (Choose two answers)
You currently have a Bigtable instance you've been using for development, running as a development instance type and using HDDs for storage. You are ready to upgrade your development instance to a production instance for increased performance. You also want to upgrade your storage to SSDs, as you need maximum performance for your instance. What should you do?
Which of these is NOT a data format that Dataprep can use?
In Cloud ML Engine, what does the CUSTOM tier allow you to configure? Choose the best answer.
You need to run analytical queries using SQL syntax against data formatted in JSON format. What should you do? Choose the best answer.
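A common pattern behind this question is loading newline-delimited JSON into BigQuery with schema auto-detection and then querying it with standard SQL. A minimal sketch using the google-cloud-bigquery Python client; the bucket path, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load newline-delimited JSON from Cloud Storage, letting BigQuery infer the schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",  # hypothetical source path
    "my_dataset.events",             # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

# The JSON data is now queryable with ANSI SQL.
rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.events").result()
```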
As part of your backup plan, you set up regular snapshots of Compute Engine instances that are running. You want to be able to restore these snapshots using the fewest possible steps for replacement instances. What should you do?
You have a project using BigQuery. You want to list all BigQuery jobs for that project. You want to set this project as the default for the bq command-line tool. What should you do?
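With the CLI, this comes down to setting the default project (e.g. gcloud config set project, or project_id in ~/.bigqueryrc) and then listing jobs with bq ls -j. The equivalent with the Python client, sketched with a hypothetical project ID:

```python
from google.cloud import bigquery

# Passing the project explicitly plays the role of the CLI's default project.
client = bigquery.Client(project="my-project")  # hypothetical project ID

# Iterate over recent BigQuery jobs in the project.
for job in client.list_jobs():
    print(job.job_id, job.job_type, job.state)
```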
What open source software is Datalab based on?
Your CI/CD pipeline process is shown in the diagram. Which GCP services should you use in boxes 1, 2, and 3?
[Image: CI/CD pipeline diagram with boxes 1, 2, and 3; not included]
What is the open source equivalent to Cloud Pub/Sub?
You are migrating your existing data center environment to Google Cloud Platform. You have a 1 petabyte Storage Area Network (SAN) that needs to be migrated. What GCP service will this data map to?
Which of the following statements on BigQuery are true?
You want to display aggregate view counts for your YouTube channel data in Data Studio. You want to see the video titles and view counts summarized over the last 30 days. You also want to segment the data by the Country Code using the fewest possible steps. What should you do?
What is the process of loading Cloud SQL data into BigQuery for analysis?
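The usual process is to export the Cloud SQL table to CSV in Cloud Storage (via the Cloud SQL export feature), then load that file into BigQuery. A sketch of the load step; the URI and table name are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes the export wrote a header row
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/cloudsql-export.csv",  # hypothetical export location
    "analytics.orders",                    # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```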
You are selecting a streaming service for log messages that must include final result message ordering as part of building a data pipeline on Google Cloud. You want to stream input for 5 days and be able to query the most recent message value. You will be storing the data in a searchable repository. How should you set up the input messages?
You need to estimate the annual cost of running a BigQuery query that is scheduled to run nightly. What should you do?
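One way to do this is a dry run: BigQuery reports the bytes the query would scan without executing it, and you multiply by the on-demand price and 365 nights. A sketch assuming the long-standing $5-per-TiB on-demand rate (check current pricing) and a hypothetical table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dry run: validates the query and returns bytes scanned, at no cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT * FROM my_dataset.nightly_stats", job_config=job_config)

tib_scanned = job.total_bytes_processed / 2**40
annual_cost = tib_scanned * 5.00 * 365  # assumed $5/TiB on-demand rate
print(f"~{tib_scanned:.3f} TiB per run, ~${annual_cost:,.2f} per year")
```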
You have a mission-critical database running on an instance on Google Compute Engine. You need to automate a database backup once per day to another disk. The database must remain fully operational and functional, with no downtime. How can you best perform an automated backup of the database with minimal downtime and minimal costs?
Which of the following statements are true?
You are building a data pipeline on Google Cloud. You need to prepare source data for a machine-learning model. This involves quickly deduplicating rows from three input tables and also removing outliers from data columns where you do not know the data distribution. What should you do?
What happens if you do not maintain separate test and training data for your learning model?
You want more control over the configuration of your Cloud ML Engine cluster. Which scaling tier would you choose?
Your team has decided to use Datalab for interactive machine learning exercises. You want your team members to share their work and progress with each other. How do you accomplish this?
Your organization has migrated their Hadoop workloads to Cloud Dataproc. To fully take advantage of the cloud, you want to decouple your Hadoop storage and compute, and be able to destroy your cluster when compute is complete in order to save costs while preserving your data. What should you do?
Which of the following statements on Cloud Bigtable are true?
How would you best connect your Dataflow pipeline to Bigtable for output?
Your company plans to migrate a multi-petabyte data set to the cloud. The data set must be available 24 hours a day. Your business analysts have experience only with using a SQL interface. How should you store the data to optimize it for ease of analysis?
Which of the following statements are true?
You have 250,000 devices which produce a JSON device status event every 10 seconds. You want to capture this event data for outlier time series analysis. What should you do?
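At this scale (250,000 devices at one event per 10 seconds is roughly 25,000 events per second) the typical ingestion front door is Cloud Pub/Sub, with Dataflow downstream writing to a time-series store such as Bigtable. A minimal publisher sketch; project, topic, and payload are illustrative:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "device-status")  # hypothetical names

# One JSON status event, as each device would emit every 10 seconds.
event = {"device_id": "dev-42", "ts": "2020-01-01T00:00:00Z", "temp_c": 21.5}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("published message", future.result())  # blocks until the server acks
```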
What is the purpose of hyperparameters in a machine learning training model?
Why do you want to train a machine learning model locally before training on cloud resources? (Choose all that apply)
For this question, refer to the JencoMart case study: https://cloud.google.com/certification/guides/cloud-architect/casestudy-jencomart. JencoMart has decided to migrate user profile storage to Google Cloud Datastore and the application servers to Google Compute Engine (GCE). During the migration, the existing infrastructure will need access to Datastore to upload the data. What service account key-management strategy should you recommend?
You regularly use prefetch caching with a Data Studio report to visualize the results of BigQuery queries. You want to minimize service costs. What should you do?
What are two benefits of using denormalized data in BigQuery? (Choose two)
Which of these is not true for IAM on Cloud Pub/Sub?
You need to correct streaming messages that arrive out of order due to latency. Which Google Cloud service would you use to resolve this?
Choose two best practices for creating more efficient queries and saving costs.
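Two practices that commonly come up: select only the columns you need (BigQuery bills by the columns scanned) and restrict scans with a partition filter. A sketch, assuming a hypothetical ingestion-time day-partitioned table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Avoid SELECT *: name the columns, and prune partitions in the WHERE clause.
sql = """
    SELECT user_id, event_type
    FROM my_dataset.events  -- hypothetical ingestion-time partitioned table
    WHERE _PARTITIONDATE BETWEEN '2020-01-01' AND '2020-01-07'
"""
for row in client.query(sql).result():
    print(row.user_id, row.event_type)
```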
For this question, refer to the TerramEarth case study. TerramEarth has equipped unconnected trucks with servers and sensors to collect telemetry data. Next year they want to use the data to train machine learning models. They want to store this data in the cloud while reducing costs. What should they do?
You are working on a project with two compliance requirements. The first requirement states that your developers should be able to see the Google Cloud Platform billing charges for only their projects. The second requirement states that your finance team members can set budgets and view the current charges for all projects in the organization. The finance team should not be able to view the project contents. You want to set permissions. What should you do?
Your company wants to reduce cost on infrequently accessed data by moving it to the cloud. The data will still be accessed approximately once a month to refresh historical charts. In addition, data older than 5 years is no longer needed. How should you store and manage the data?
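This maps naturally to a Cloud Storage bucket with lifecycle rules: transition objects to a colder storage class once they become infrequently accessed, and delete anything older than five years. A sketch with the google-cloud-storage client; the bucket name is hypothetical:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-archive-bucket")  # hypothetical bucket

# Move objects to Coldline after 30 days; delete them after roughly 5 years.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=30)
bucket.add_lifecycle_delete_rule(age=5 * 365)
bucket.patch()  # persist the updated lifecycle configuration
```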
Your infrastructure runs on another cloud and includes a set of multi-TB enterprise databases that are backed up nightly both on-premises and also to that cloud. You need to create a redundant backup to Google Cloud. You are responsible for performing scheduled monthly disaster recovery drills. You want to create a cost-effective solution. What should you do?
What open source software is Cloud Pub/Sub most similar to?
You have a streaming Dataflow pipeline that you need to shut down. You want data already in the pipeline to finish and be sent to output before shutting down. Which shutdown option should you use to complete the shutdown process?
What is a difference between example (training) data and test data?
You have data stored in a Cloud Storage bucket and also in a BigQuery dataset. You need to secure the data and provide 3 different types of access levels for your Google Cloud Platform users: administrator, read/write, and read-only. You want to follow Google-recommended practices. What should you do?
You are building an application that needs to convert recorded customer service calls into text format, and will then examine call transcripts to determine customer sentiment. What is the most time effective method of doing this?
You are building a data pipeline on Google Cloud. You need to select services that will host a deep neural network machine-learning model also hosted on Google Cloud. You also need to monitor and run jobs that could occasionally fail. What should you do?
You are planning the design of your Bigtable table, which will be used to collect speed limit data on highways. You anticipate needing to query by highway name, mile marker, and timestamp of the measurement taken. How should you design your schema in order to maximize efficiency, query all necessary data, and avoid hotspots in the row key?
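A common pattern for this kind of query is a composite row key ordered from most- to least-general (highway, then mile marker, then a reversed timestamp) so recent readings sort first within a prefix and writes spread across highways rather than hotspotting on time. A sketch of key construction and a write, with hypothetical instance and table names:

```python
import sys
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("speed-instance").table("speed-limits")  # hypothetical

def row_key(highway: str, mile_marker: float, ts: float) -> bytes:
    # Reverse timestamp so the newest measurement sorts first within a prefix;
    # zero-padding the mile marker keeps lexicographic order numeric.
    reverse_ts = sys.maxsize - int(ts * 1000)
    return f"{highway}#{mile_marker:07.2f}#{reverse_ts}".encode("utf-8")

row = table.direct_row(row_key("I-80", 123.4, time.time()))
row.set_cell("measurements", "speed_mph", b"65")
row.commit()
```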
You need to create a model that predicts stock prices given a variety of factors. What type of problem are we modeling for?
Your application is hosted across multiple regions and consists of both relational database data and static images. Your database has over 10 TB of data. You want to use a single storage repository for each data type across all regions. Which two products would you choose for this task? (Choose two)
You want to optimize the performance of an accurate, real-time, weather-charting application. The data comes from 50,000 sensors sending 10 readings a second, in the format of a timestamp and sensor reading. Where should you store the data?
Your company’s architecture is shown in the diagram. You want to keep data in sync across Region 1 and Region 2. Which product should you use?
[Image: architecture diagram spanning Region 1 and Region 2; not included]
Which of the following statements are true?
You are developing an application on Google Cloud that will label famous landmarks in users’ photos. You are under competitive pressure to develop the predictive model quickly. You need to keep service costs low. What should you do?
Your company is forecasting a sharp increase in the number and size of Apache Spark and Hadoop jobs being run on your local datacenter. You want to utilize the cloud to help you scale this upcoming demand with the least amount of operations work and code change.Which product should you use?
You need to extract an address field from a multi-column element using Dataflow. Which mechanism is able to help with this task?
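In Dataflow/Apache Beam this is a job for a ParDo (or its Map convenience wrapper) that pulls the field out of each element. A minimal, runnable Beam Python sketch with a hypothetical element shape:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([{"name": "Ada", "address": "1 Main St"}])
        # ParDo/Map applies a function to every element; here it extracts one field.
        | "ExtractAddress" >> beam.Map(lambda row: row["address"])
        | "Print" >> beam.Map(print)
    )
```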
As part of a complex rollout, you have hired a third party developer consultant to assist with creating your Dataflow processing pipeline. The data that this pipeline will process is very confidential, and the consultant cannot be allowed to view the data itself. What actions should you take so that they have the ability to help build the pipeline but cannot see the data it will process?
For this question, refer to the Dress4Win case study https://cloud.google.com/certification/guides/cloud-architect/casestudy-dress4win. As part of their new application experience, Dress4Win allows customers to upload images of themselves. The customer has exclusive control over who may view these images. Customers should be able to upload images with minimal latency and also be shown their images quickly on the main application page when they log in. Which configuration should Dress4Win use?
For future phases, Dress4Win is looking at options to deploy data analytics to the Google Cloud. Which option meets their business and technical requirements?
What is a deep neural network?
You are designing storage for CSV files and using an I/O-intensive custom Apache Spark transform as part of deploying a data pipeline on Google Cloud. You are using ANSI SQL to run queries for your analysts. You want to support complex aggregate queries and reuse existing code. How should you transform the input data?
You host structured data for analysis for multiple clients in BigQuery. For organizational purposes, you need to store all of the different clients' data in a single project. You also need to be able to give your clients the ability to query their own data without having access to other clients' data. How can you best achieve this?
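One workable approach is a dataset per client within the single project, granting each client READER access on only their own dataset. A sketch of adding a dataset-level access entry with the Python client; project, dataset, and email are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project
dataset = client.get_dataset("client_a_data")   # hypothetical per-client dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@client-a.example.com",  # hypothetical client user
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # applies only to this dataset
```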
What happens to your Bigtable data when a Bigtable node suffers a critical failure?
You need to choose a structured storage option for storing very large amounts of data with the following properties and requirements: the data has a single key, and you need very low latency. Which solution should you choose?
Which of these statements do not apply to preemptible worker nodes on Cloud Dataproc?
For this question, refer to the MJTelco case study here: https://cloud.google.com/certification/guides/data-engineer/casestudy-mjtelco. In order to protect live customer data, MJTelco needs to maintain separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers. What is the best practice for isolating these environments while at the same time maintaining operability?
Your company collects and stores security camera footage in Google Cloud Storage. Within the first 30 days, footage is processed regularly for threat detection, object detection, trend analysis, and suspicious behavior detection. You want to minimize the cost of storing all the data. How should you store the videos?
You have a long-running, streaming Dataflow pipeline that you need to shut down. You do not need to preserve data currently in the processing pipeline and need it shut down as soon as possible. Which shutdown option should you use to complete the shutdown process?
Your customer is moving their storage product to Google Cloud Storage (GCS). The data contains personally identifiable information (PII) and sensitive customer information. What security strategy should you use for GCS?
It is a best practice to locate applications close to where data lives, irrespective of regulatory constraints. (True/False)
Which of these is not a valid BigQuery data format?
For this question, refer to the Mountkirk Games case study (https://cloud.google.com/certification/guides/cloud-architect/casestudy-mountkirkgames): Mountkirk Games needs to set up their game backend database. Based on their requirements, which storage service best fits their needs?
Which of the following statements are true?
How can you connect to the web interface of a Dataproc cluster? (Choose two)
If the reads and writes are not evenly distributed in a Bigtable database, performance can take a hit. (True/False)
How can you set up your Dataproc environment to use BigQuery as an input and output source?
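Dataproc clusters can read from and write to BigQuery through the BigQuery connector for Spark/Hadoop. A PySpark sketch using the spark-bigquery connector, assuming the connector JAR is available on the cluster; table and bucket names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-io").getOrCreate()

# Read an input table directly from BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.events")  # hypothetical table
    .load()
)

counts = df.groupBy("event_type").count()

# Write results back to BigQuery, staging through a GCS bucket.
(
    counts.write.format("bigquery")
    .option("table", "my-project.my_dataset.event_counts")
    .option("temporaryGcsBucket", "my-staging-bucket")  # hypothetical bucket
    .save()
)
```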
Which of the following is a GCP Machine Learning service?
You are upgrading your existing (development) Cloud Bigtable instance for use in your production environment. The instance contains a large amount of data that you want to make available for production immediately. You need to design for fastest performance. What should you do?
Which of these is NOT a type of trigger that applies to Dataflow?
Cloud Dataflow fully automates the management of processing resources. (True/False)
You created a job which runs daily to import highly sensitive data from an on-premises location to Cloud Storage. You also set up a streaming data insert into Cloud Storage via a Kafka node that is running on a Compute Engine instance. You need to encrypt the data at rest and supply your own encryption key. Your key should not be stored in the Google Cloud. What should you do?
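Since the key must not live in Google Cloud, this scenario points at customer-supplied encryption keys (CSEK), where you pass your own AES-256 key with each request. A sketch with the google-cloud-storage client; the key is generated inline purely for illustration, and the bucket and paths are hypothetical:

```python
import os
from google.cloud import storage

# Customer-supplied key: 32 random bytes, kept outside Google Cloud in practice.
encryption_key = os.urandom(32)

client = storage.Client()
bucket = client.bucket("my-secure-bucket")  # hypothetical bucket

# Upload with the key; Google stores only the encrypted object.
blob = bucket.blob("daily-import.csv", encryption_key=encryption_key)
blob.upload_from_filename("/data/daily-import.csv")  # hypothetical local path

# Reading it back requires supplying the same key.
blob = bucket.blob("daily-import.csv", encryption_key=encryption_key)
data = blob.download_as_bytes()
```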
What types of Bigtable row keys can lead to hotspotting? (Choose all that apply)
Your company is developing a next-generation pet collar that collects biometric information to assist potentially millions of families with promoting healthy lifestyles for their pets. Each collar will push 30 KB of biometric data in JSON format every 2 seconds to a collection platform that will process and analyze the data, providing health trending information back to the pet owners and veterinarians via a web portal. Management has tasked you to architect the collection platform, ensuring the following requirements are met: 1. Provide the ability for real-time analytics of the inbound biometric data. 2. Ensure processing of the biometric data is highly durable, elastic, and parallel. 3. The results of the analytic processing should be persisted for data mining. Which architecture outlined below will meet the initial requirements for the platform?
You are developing an application that will only recognize and tag specific business to business product logos in images. What is the best method to accomplish this task?
Which of these open source technologies is the direct equivalent to Google BigQuery?
What IAM role do you need to grant to service accounts for Dataproc workloads, while offering the smallest scope of permissions?
In machine learning, what is the difference between test and training data?
You want to use an open source framework for constructing unified batch and data stream pipelines. Which open source framework should you choose?
For this question, refer to the MJTelco case study here: https://cloud.google.com/certification/guides/data-engineer/casestudy-mjtelco. MJTelco is streaming telemetry data into BigQuery for long-term storage (2 years) and analysis, at the rate of about 100 million records per day. They need to be able to run queries against certain time periods of data without incurring the costs of querying all available records. What two options would you recommend for doing so? (Choose all that apply)
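Time-partitioned tables are the standard mechanism for limiting scans to time windows. A sketch that creates a day-partitioned table with the Python client; project, dataset, and schema are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.telemetry.records",  # hypothetical table ID
    schema=[
        bigquery.SchemaField("device_id", "STRING"),
        bigquery.SchemaField("reading", "FLOAT"),
        bigquery.SchemaField("ts", "TIMESTAMP"),
    ],
)
# Partition by day on the ts column so queries scan only the dates they need.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="ts",
)
client.create_table(table)
```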
Which Hadoop ecosystem service is best replaced by BigQuery?
Your organization is making the move to Google Cloud. You need to bring your existing big data processing workflows to the cloud without having to re-train employees on new products. Your organization uses the Apache Hadoop ecosystem for big data processing. Which Google Cloud managed service would your workflow move to?
What types of jobs does Cloud Dataproc support? (Choose all that apply)
You are building storage for files for a data pipeline on Google Cloud. You want to support JSON files. The schema of these files will occasionally change. Your analyst teams will use running aggregate ANSI SQL queries on this data. What should you do?
You are designing storage for event data as part of building a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying individual values over time windows. Which storage service and schema design should you use?
You have been asked to select the storage system for the click-data of your company's large portfolio of websites. This data is streamed in from a custom website analytics package at a typical rate of 6,000 clicks per minute, with bursts of up to 8,500 clicks per second. It must be stored for future analysis by your data science and user experience teams. Which storage infrastructure should you choose?
When using the GCP Cloud Pub/Sub service, the system that receives the message from the Publishing Forwarder and ensures delivery to subscribers is the Subscribing Forwarder. (True/False)
Your application has a large international audience and runs stateless virtual machines within a managed instance group across multiple locations. One feature of the application lets users upload files and share them with other users. Files must be available for 30 days; after that, they are removed from the system entirely. Which storage solution should you choose?
Which of the following statements are true?
You have a very large table with many columns that are not immediately relevant to your non-IT team members. You want to reduce the amount of irrelevant column data available in your table in order to keep from confusing team members who need to run queries against it. What is a valid method of achieving this task?
Your BigQuery table needs to be accessed by team members who are not proficient in technology. You want to simplify the columns they need to query to avoid confusion. How can you do this while preserving all of the data in your table?
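Both of the previous two questions point at the same mechanism: a view that exposes only the relevant columns while the underlying table keeps all of the data. A sketch with hypothetical project, dataset, and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.reporting.orders_simple")  # hypothetical view ID
view.view_query = """
    SELECT order_id, customer_name, total
    FROM `my-project.raw.orders`  -- full table stays intact underneath
"""
client.create_table(view)  # creates the view, not a copy of the data
```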
For this question, refer to the Dress4Win case study https://cloud.google.com/certification/guides/cloud-architect/casestudy-dress4win. You want to ensure Dress4Win’s sales and tax records remain available for infrequent viewing by auditors for at least 10 years. Cost optimization is your top priority. Which cloud services should you choose?
For this question, refer to the TerramEarth case study: https://cloud.google.com/certification/guides/cloud-architect/casestudy-terramearth. TerramEarth's 20 million vehicles are scattered around the world. Based on the vehicle's location, its telemetry data is stored in a Google Cloud Storage (GCS) regional bucket (US, Europe, or Asia). The CTO has asked you to run a report on the raw telemetry data to determine why vehicles are breaking down after 100K miles. You want to run this job on all the data. What is the most cost-effective way to run this job?
For this question, refer to the JencoMart case study. JencoMart wants to move their User Profiles database to Google Cloud Platform. Which Google database should they use?
Which of these actions can you not perform with the BigQuery Web UI?
Choose all Cloud Pub/Sub features that can have access controlled via IAM roles. (Choose all that apply)
Your company is making the move to Google Cloud and has chosen to use a managed database service to reduce overhead. Your existing database is used for a product catalog that provides real-time inventory tracking for a retailer. Your database is 500 GB in size. The data is semi-structured and does not need full atomicity. You are looking for a truly no-ops/serverless solution. What storage option should you choose?
What are wide neural networks good for, compared to deep neural networks?
To run a local training job using the Google Cloud SDK, what command would you run?
What other Google Cloud service does Dataprep use to complete the process of transforming data?
Your BigQuery dataset contains 1500 tables. When conducting a query, you are limited to a maximum of 1000 tables that you can query at once. You need to query data across all 1500 tables. What should you do?
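Sharded tables that share a naming prefix are commonly queried through a wildcard table narrowed with a _TABLE_SUFFIX filter (with consolidation into a single partitioned table as the longer-term fix). A sketch, assuming hypothetical date-suffixed shards like events_20200101:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Wildcard over sharded tables, narrowed with _TABLE_SUFFIX.
sql = """
    SELECT event_type, COUNT(*) AS n
    FROM `my-project.logs.events_*`  -- hypothetical shard prefix
    WHERE _TABLE_SUFFIX BETWEEN '20200101' AND '20200131'
    GROUP BY event_type
"""
for row in client.query(sql).result():
    print(row.event_type, row.n)
```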
What is Cloud Dataprep?
For this question, refer to the Mountkirk Games case study (https://cloud.google.com/certification/guides/cloud-architect/casestudy-mountkirkgames): Mountkirk Games needs to build out their streaming data analytics pipeline to feed from their game backend application. Which GCP services, in which order, will achieve this?
You are creating a machine learning model to predict the likelihood of fraud from credit card transaction data. What type of learning model problem is this?
For this question, refer to the Flowlogistic case study here: https://cloud.google.com/certification/guides/data-engineer/casestudy-flowlogistic. Flowlogistic's Kafka server cluster has been unable to scale to the demands of their data ingest needs. How can they migrate this functionality to Google Cloud to be able to scale for future growth?
For this question, refer to the TerramEarth case study: https://cloud.google.com/certification/guides/cloud-architect/casestudy-terramearth. Operational parameters such as oil pressure are adjustable on each of TerramEarth's vehicles to increase their efficiency, depending on their environmental conditions. Your primary goal is to increase the operating efficiency of all 20 million cellular and unconnected vehicles in the field. How can you accomplish this goal? (Select one)
Your company wants to try out the cloud with low risk. They want to archive approximately 100 TB of their log data to the cloud and test the analytics features available to them there, while also retaining that data as a long-term disaster recovery backup. Which two steps should they take? (Choose two answers)