Excellent Professional-Data-Engineer PDF Dumps With 100% TestInsides Exam Passing Guaranteed [Sep-2021]
100% Pass Your Professional-Data-Engineer Google Certified Professional Data Engineer Exam at First Attempt with TestInsides
NEW QUESTION 29
You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally. Because large parts of the globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?
- A. Edge TPUs as sensor devices for storing and transmitting the messages.
- B. A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.
- C. An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.
- D. Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
Answer: C
NEW QUESTION 30
You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? (Choose two.)
- A. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.
- B. Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.
- C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.
- D. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.
- E. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.
Answer: A,C
NEW QUESTION 31
You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?
- A. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
- B. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
- C. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
- D. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
Answer: B
NEW QUESTION 32
Your neural network model is taking days to train. You want to increase the training speed. What can you do?
- A. Subsample your test dataset.
- B. Subsample your training dataset.
- C. Increase the number of input features to your model.
- D. Increase the number of layers in your neural network.
Answer: B
Explanation:
Reference: https://towardsdatascience.com/how-to-increase-the-accuracy-of-a-neural-network-9f5d1c6f407d
NEW QUESTION 33
Each analytics team in your organization is running BigQuery jobs in their own projects. You want to enable each team to monitor slot usage within their projects. What should you do?
- A. Create a log export for each project, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric
- B. Create a Stackdriver Monitoring dashboard based on the BigQuery metric slots/allocated_for_project
- C. Create an aggregated log export at the organization level, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric
- D. Create a Stackdriver Monitoring dashboard based on the BigQuery metric query/scanned_bytes
Answer: C
NEW QUESTION 34
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- A. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- B. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- C. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
- D. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- E. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
Answer: D
NEW QUESTION 35
Which of these is NOT a way to customize the software on Dataproc cluster instances?
- A. Set initialization actions
- B. Log into the master node and make changes from there
- C. Modify configuration files using cluster properties
- D. Configure the cluster using Cloud Deployment Manager
Answer: D
Explanation:
You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can use the --properties option of the dataproc command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
NEW QUESTION 36
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
- No interaction by the user on the site for 1 hour
- Has added more than $30 worth of products to the basket
- Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a fixed-time window with a duration of 60 minutes.
- B. Use a global window with a time based trigger with a delay of 60 minutes.
- C. Use a session window with a gap time duration of 60 minutes.
- D. Use a sliding time window with a duration of 60 minutes.
Answer: C
Explanation:
A session window closes for a user after that user has been inactive for 60 minutes, so the pipeline can send one message per user at that point. Session windows work well for capturing activity on a per-user basis.
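The session-window idea can be sketched in plain Python, without the Beam SDK. This is only an illustration under stated assumptions: the `sessionize` helper and the sample events are hypothetical, and timestamps are plain seconds. A new session starts when the gap since the user's previous event is at least the gap duration.

```python
from collections import defaultdict

GAP_SECONDS = 60 * 60  # one-hour inactivity gap, taken from the question


def sessionize(events, gap=GAP_SECONDS):
    """Group (user, timestamp) events into per-user sessions.

    Hypothetical helper: a new session starts whenever the time since
    the user's previous event is at least `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] >= gap:
                user_sessions.append([ts])    # inactivity gap reached: new session
            else:
                user_sessions[-1].append(ts)  # still within the current session
        sessions[user] = user_sessions
    return sessions


# Made-up sample data: "ann" returns after more than an hour of inactivity.
events = [("ann", 0), ("ann", 1200), ("bob", 300), ("ann", 9000)]
result = sessionize(events)
```

Each closed session is the point at which the remaining rules (basket value above $30, no completed transaction) would be checked for that user.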
NEW QUESTION 37
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?
- A. Rowkey: date#data_point
Column data: device_id
- B. Rowkey: data_point
Column data: device_id,date
- C. Rowkey: device_id
Column data: date, data_point
- D. Rowkey: date
Column data: device_id,data_point
- E. Rowkey: date#device_id
Column data: data_point
Answer: B
NEW QUESTION 38
Cloud Dataproc charges you only for what you really use with _____ billing.
- A. week-by-week
- B. hour-by-hour
- C. minute-by-minute
- D. month-by-month
Answer: C
Explanation:
One of the advantages of Cloud Dataproc is its low cost. Dataproc charges for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period.
Reference: https://cloud.google.com/dataproc/docs/concepts/overview
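The billing rule quoted in the explanation (per-minute billing with a ten-minute minimum) can be written out as a small calculation. This is a sketch, not the actual billing logic: the `billed_minutes` function is hypothetical, and rounding partial minutes up is an assumption that the explanation does not state.

```python
import math


def billed_minutes(runtime_seconds, minimum_minutes=10):
    """Return the billable minutes for a cluster run.

    Assumption: partial minutes round up. The ten-minute minimum is the
    figure quoted in the explanation above.
    """
    minutes = math.ceil(runtime_seconds / 60)
    return max(minutes, minimum_minutes)


short_job = billed_minutes(2 * 60)   # a 2-minute job is billed at the minimum
long_job = billed_minutes(61 * 60)   # a 61-minute job is billed per minute
```

Under hour-by-hour billing the 61-minute job would be charged as two full hours, which is the difference the question is pointing at.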
NEW QUESTION 39
Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?
- A. Preemptible workers cannot use persistent disk.
- B. Preemptible workers cannot store data.
- C. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
- D. A Dataproc cluster cannot have only preemptible workers.
Answer: B,D
Explanation:
The following rules will apply when you use preemptible workers with a Cloud Dataproc cluster:
* Processing only: Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
* No preemptible-only clusters: To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
* Persistent disk size: As a default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
* The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms
NEW QUESTION 40
Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve reliability of the pipeline (incl. being able to reprocess all failing data).
What should you do?
- A. Add a try... catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.
- B. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
- C. Add a try... catch block to your DoFn that transforms the data, extract erroneous rows from logs.
- D. Add a try... catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.
Answer: A
NEW QUESTION 41
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks.
She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?
- A. Grant the user access to Google Cloud Shell.
- B. Host a visualization tool on a VM on Google Compute Engine.
- C. Run a local version of Jupyter on the laptop.
- D. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
Answer: A
NEW QUESTION 42
You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:
What is the most likely cause of the delay for this query?
- A. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
- B. Users are running too many concurrent queries in the system
- C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
- D. The [myproject:mydataset.mytable] table has too many partitions
Answer: B
NEW QUESTION 43
You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?
- A. In the Stackdriver logging admin interface, enable a log sink export to BigQuery.
- B. Make a call to the Stackdriver API to list all logs, and apply an advanced filter.
- C. Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.
- D. In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.
Answer: A
NEW QUESTION 44
Which methods can be used to reduce the number of rows processed by BigQuery?
- A. Putting data in partitions; using the LIMIT clause
- B. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
- C. Splitting tables into multiple tables; putting data in partitions
- D. Splitting tables into multiple tables; using the LIMIT clause
Answer: C
Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day.
If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
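The difference between LIMIT and partitioning can be illustrated with a toy model in plain Python. This is not how BigQuery is implemented. The `scan_with_limit` and `scan_partition` helpers and the sample table are hypothetical, and they only mirror the behaviour described in the explanation: LIMIT truncates the result after a full scan, while a partition filter reads only the matching partition.

```python
DAYS = ("2021-09-01", "2021-09-02", "2021-09-03")

# Toy table: 1,000 made-up rows per day.
table = [{"day": day, "value": i} for day in DAYS for i in range(1000)]

# The same rows, organised into one partition per day.
partitions = {day: [row for row in table if row["day"] == day] for day in DAYS}


def scan_with_limit(rows, limit):
    """Model a full scan: every row is read, only `limit` rows are returned."""
    scanned = 0
    result = []
    for row in rows:
        scanned += 1
        if len(result) < limit:
            result.append(row)
    return result, scanned


def scan_partition(parts, day):
    """Model a partition-pruned scan: only the requested day's rows are read."""
    rows = parts.get(day, [])
    return list(rows), len(rows)


limited_rows, limited_scanned = scan_with_limit(table, 10)
pruned_rows, pruned_scanned = scan_partition(partitions, "2021-09-02")
```

In this model the LIMIT query returns 10 rows but still reads all 3,000, while the partition query reads only the 1,000 rows for the requested day.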
NEW QUESTION 45
Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?
- A. An hourly watermark
- B. An event time trigger
- C. A processing time trigger
- D. The withAllowedLateness method
Answer: C
Explanation:
When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window.
Processing time triggers: these triggers operate on the processing time, the time when the data element is processed at any given stage in the pipeline.
Event time triggers: these triggers operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event time-based.
Reference: https://beam.apache.org/documentation/programming-guide/#triggers
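The distinction between the two time domains can be shown with a small plain-Python sketch that does not use the Beam SDK. The `hourly_counts` helper, the field names, and the sample elements are all hypothetical. The same elements are bucketed per hour twice, once by event time and once by the time they entered the pipeline.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def hourly_counts(elements, time_field):
    """Count elements per hour bucket, using the chosen time field."""
    counts = Counter()
    for element in elements:
        counts[element[time_field] // SECONDS_PER_HOUR] += 1
    return dict(counts)


# Made-up elements: the second one is produced in hour 0 but only
# reaches the pipeline in hour 1.
elements = [
    {"event_time": 100, "processing_time": 200},
    {"event_time": 3500, "processing_time": 3700},
    {"event_time": 3700, "processing_time": 3800},
]

by_event_time = hourly_counts(elements, "event_time")
by_processing_time = hourly_counts(elements, "processing_time")
```

The delayed element is counted in hour 0 when grouped by event time and in hour 1 when grouped by processing time, which is why the question's wording ("the time when the data entered the pipeline") points to processing time.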
NEW QUESTION 46
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?
- A. Change the processing job to use Google Cloud Dataproc instead.
- B. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
- C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
- D. Manually start the Cloud Dataflow job each morning when you get into the office.
Answer: C
NEW QUESTION 47
You are planning to use Google's Dataflow SDK to analyze customer data such as that displayed below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?
- A. ParDo
- B. Sink API
- C. Source API
- D. Data extraction
Answer: A
Explanation:
In the Google Cloud Dataflow SDK, you can use ParDo to extract only the customer name from each element in your PCollection.
Reference: https://cloud.google.com/dataflow/model/par-do
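The element-wise processing that ParDo performs can be mimicked in plain Python. This sketch does not use the Dataflow or Beam SDK: `ExtractCustomerName` and `par_do` are hypothetical stand-ins for a DoFn and the ParDo transform, and an ordinary list stands in for the PCollection.

```python
class ExtractCustomerName:
    """Stand-in for a DoFn: emits the customer name from one CSV line."""

    def process(self, element):
        name, _, _address = element.partition(",")
        yield name.strip()


def par_do(collection, fn):
    """Stand-in for ParDo: apply fn.process to every element independently."""
    for element in collection:
        yield from fn.process(element)


# Sample records from the question.
lines = ["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"]
names = list(par_do(lines, ExtractCustomerName()))
```

Each input line is handled on its own and may emit zero or more outputs, which is why this kind of per-element extraction fits ParDo rather than a source or sink.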
NEW QUESTION 48
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
* No interaction by the user on the site for 1 hour
* Has added more than $30 worth of products to the basket
* Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a session window with a gap time duration of 60 minutes.
- B. Use a fixed-time window with a duration of 60 minutes.
- C. Use a sliding time window with a duration of 60 minutes.
- D. Use a global window with a time based trigger with a delay of 60 minutes.
Answer: A
NEW QUESTION 49
Your United States-based company has created an application for assessing and responding to user actions. The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements:
* Single global endpoint
* ANSI SQL support
* Consistent access to the most up-to-date data
What should you do?
- A. Implement Cloud Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.
- B. Implement BigQuery with no region selected for storage or processing.
- C. Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.
- D. Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.
Answer: D
NEW QUESTION 50
You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?
- A. Use Cloud Dataflow to write summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.
- B. Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
- C. Change the data pipeline to use BigQuery for storing stock trades, and update your application.
- D. Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
Answer: D
Explanation:
A timestamp at the start of the row key causes a bottleneck, because rows are stored in sorted key order and all new writes land in the same key range.
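Why a leading timestamp concentrates load can be seen by sorting a few keys in plain Python, since Bigtable stores rows in lexicographic key order. The key formats, helper names, and sample trades below are hypothetical and chosen only for illustration. The second format, which leads with a non-timestamp field, is shown purely for contrast.

```python
def timestamp_first_key(symbol, ts):
    """Key that leads with the trade time (the layout the question describes)."""
    return f"{ts}#{symbol}"


def field_first_key(symbol, ts):
    """Contrast layout (an assumption): lead with a non-timestamp field."""
    return f"{symbol}#{ts}"


def prefix_scan(sorted_keys, prefix):
    """Return the keys sharing a prefix, as a contiguous range scan would."""
    return [key for key in sorted_keys if key.startswith(prefix)]


# Made-up trades as (symbol, epoch seconds).
trades = [("GOOG", 1630000000), ("AAPL", 1630000001),
          ("GOOG", 1630000002), ("MSFT", 1630000003)]

ts_keys = sorted(timestamp_first_key(s, t) for s, t in trades)
field_keys = sorted(field_first_key(s, t) for s, t in trades)

newest_ts_key = ts_keys[-1]                  # newest write is always the last key
goog_rows = prefix_scan(field_keys, "GOOG#")
```

With timestamp-first keys, every new trade sorts to the end of the table, so writes keep hitting the same key range. With a different leading field, writes spread across the key space.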
NEW QUESTION 51
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time.
This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
- A. Implement clustering in BigQuery on the ingest date column.
- B. Implement clustering in BigQuery on the package-tracking ID column.
- C. Re-create the table using data partitioning on the package delivery date.
- D. Tier older data onto Cloud Storage files, and leverage extended tables.
Answer: B
NEW QUESTION 52
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
- A. Use a row key of the form <timestamp>.
- B. Use a row key of the form <timestamp>#<sensorid>.
- C. Use a row key of the form >#<sensorid>#<timestamp>.
- D. Use a row key of the form <sensorid>.
Answer: C
NEW QUESTION 53
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud.
You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?
- A. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
- B. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
- C. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
- D. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
Answer: A