Olete.in – MCQs, Mock Tests & Government Job Prep | Databricks MCQ Question Set 1

Which one of the following statements about Azure Databricks is not true?

1. It is Apache Spark based analytics platform
2. It helps to extract, transform and load the data
3. Visualization of data is not possible with it
4. All of the above

✅ Correct Answer: 3

To which one of the following sources does Azure Databricks connect to collect streaming data?

1. Kafka
2. Azure Data Lake
3. CosmosDB
4. None of the above

✅ Correct Answer: 1

Which one of the following is a Databricks concept?

1. Workspace
2. Authentication and authorization
3. Data Management
4. All of the above

✅ Correct Answer: 4

Which of the following ensures data reliability even after termination of a cluster in Azure Databricks?

1. Databricks Runtime
2. Databricks File System
3. Dashboards
4. Workspace

✅ Correct Answer: 2

Choose the correct option with respect to ETL operations on data in Azure Databricks.

1. For loading of data, data is moved from Databricks to a data warehouse
2. For loading of data, blob storage is used
3. Blob storage serves as a temporary storage
4. All of the above

✅ Correct Answer: 4

Which one of the following is incorrect regarding the Workspace concept in Azure Databricks?

1. It manages ETL operations of data
2. It can store notebooks, libraries and dashboards
3. It is the root folder of Azure Databricks
4. None of the above

✅ Correct Answer: 1

Which of the following Azure datasources can be connected to Azure Databricks?

1. Azure Blob Storage
2. Azure Data Warehouse
3. Azure CosmosDB
4. All of the above

✅ Correct Answer: 4

Streaming data can be captured by?

1. Kafka
2. Event Hubs
3. Both 1 and 2
4. None of the above

✅ Correct Answer: 3

Authentication and authorization in Databricks can be managed for:

1. User, Group, Access Control List
2. User, Group
3. Access Control List
4. Group, Access Control List

✅ Correct Answer: 1

Which one of the following is a set of components that run on clusters of Azure Databricks?

1. Databricks File System
2. Databricks Runtime
3. CosmosDB
4. Azure Data Lake

✅ Correct Answer: 2

Spark was initially started by ______ at UC Berkeley AMPLab in 2009.

1. Mahek Zaharia
2. Matei Zaharia
3. Doug Cutting
4. Stonebraker

✅ Correct Answer: 2

______ is a component on top of Spark Core.

1. Spark Streaming
2. Spark SQL
3. RDDs
4. All of the Mentioned

✅ Correct Answer: 2

Spark SQL provides a domain-specific language to manipulate ___________ in Scala, Java, or Python.

1. Spark Streaming
2. Spark SQL
3. RDDs
4. All of the Mentioned

✅ Correct Answer: 3

_______ leverages Spark Core's fast scheduling capability to perform streaming analytics.

1. MLlib
2. Spark Streaming
3. GraphX
4. RDDs

✅ Correct Answer: 2

____ is a distributed machine learning framework on top of Spark.

1. MLlib
2. Spark Streaming
3. GraphX
4. RDDs

✅ Correct Answer: 1

Given a dataframe df, select the code that returns its number of rows:

1. df.take('all')
2. df.collect()
3. df.count()
4. df.numRows()

✅ Correct Answer: 3

Users can easily run Spark on top of Amazon’s _____

1. Infosphere
2. EC2
3. EMR
4. None of the mentioned

✅ Correct Answer: 2

Which of the following can be used to launch Spark jobs inside MapReduce?

1. SIM
2. SIMR
3. SIR
4. RIS

✅ Correct Answer: 2

Which of the following languages is not supported by Spark?

1. Java
2. Pascal
3. Scala
4. Python

✅ Correct Answer: 2

Spark is packaged with higher level libraries, including support for _________ queries.

1. SQL
2. C
3. C++
4. None of the mentioned

✅ Correct Answer: 1

Spark includes a collection of over ________ operators for transforming data and familiar data frame APIs for manipulating semi-structured data.

1. 50
2. 60
3. 70
4. 80

✅ Correct Answer: 4

Given a DataFrame df that includes a number of columns among which a column named quantity and a column named price, complete the code below such that it will create a DataFrame including all the original columns and a new column revenue defined as quantity*price:

1. df.withColumnRenamed("revenue", expr("quantity*price"))
2. df.withColumn(revenue, expr("quantity*price"))
3. df.withColumn("revenue", expr("quantity*price"))
4. df.withColumn(expr("quantity*price"), "revenue")

✅ Correct Answer: 3

Spark is engineered from the bottom-up for performance, running ______ faster than Hadoop by exploiting in memory computing and other optimizations.

1. 100x
2. 150x
3. 200x
4. None of the mentioned

✅ Correct Answer: 1

Spark powers a stack of high-level tools including Spark SQL, MLlib for _____

1. regression models
2. statistics
3. machine learning
4. reproductive research

✅ Correct Answer: 3

For a multiclass classification problem, which algorithm is not a solution?

1. Naive Bayes
2. Random Forests
3. Logistic Regression
4. Decision Trees

✅ Correct Answer: 4

Which of the following is a tool provided by the Machine Learning Library (MLlib)?

1. Persistence
2. Utilities like linear algebra, statistics
3. Pipelines
4. All of the above.

✅ Correct Answer: 4

Which of the following is true for Spark core?

1. It is the kernel of Spark
2. It enables users to run SQL / HQL queries on the top of Spark.
3. It is the scalable machine learning library which delivers efficiencies
4. Improves the performance of iterative algorithm drastically.

✅ Correct Answer: 1

Given a DataFrame df that has some null values in the column created_date, find the code below that sorts rows in ascending order based on the column created_date, with null values appearing last.

1. orderBy(asc_nulls_last("created_date"))
2. sort(asc_nulls_last("created_date"))
3. orderBy(col("created_date").asc_nulls_last())
4. orderBy(col("created_date"), ascending=True)

✅ Correct Answer: 3
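
To make the ordering in this question concrete, here is a minimal plain-Python sketch of what "ascending, nulls last" means. It does not use Spark: the `rows` list and its dates are made-up sample data, `None` stands in for a null, and `asc_nulls_last_key` is a hypothetical helper name.

```python
# Toy illustration only: plain Python, not the Spark API.
# `rows` and its values are hypothetical sample data; None plays the role of null.
rows = [
    {"id": 1, "created_date": "2021-03-01"},
    {"id": 2, "created_date": None},
    {"id": 3, "created_date": "2020-12-15"},
]


def asc_nulls_last_key(row):
    value = row["created_date"]
    # Tuples compare element by element: False (non-null) sorts before
    # True (null), and the second element orders non-null values ascending.
    return (value is None, value or "")


sorted_rows = sorted(rows, key=asc_nulls_last_key)
ordered_ids = [row["id"] for row in sorted_rows]
print(ordered_ids)  # [3, 1, 2]
```

The row with the missing date ends up at the end even though the remaining rows are in ascending order, which is the behavior the question asks for.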

Which of the following is true for Spark MLlib?

1. Provides an execution platform for all the Spark applications
2. It is the scalable machine learning library which delivers efficiencies
3. Enables powerful interactive and data analytics applications across live streaming data
4. All of the above

✅ Correct Answer: 2

Which of the following is true for RDD?

1. We can operate Spark RDDs in parallel with a low-level API
2. RDDs are similar to the table in a relational database
3. It allows processing of a large amount of structured data
4. It has built-in optimization engine

✅ Correct Answer: 1

An RDD is fault-tolerant and immutable.

1. True
2. False
3. Both
4. None of the mentioned

✅ Correct Answer: 1

The read operation on RDD is

1. Fine-grained
2. Coarse-grained
3. Either fine-grained or coarse-grained
4. Neither fine-grained nor coarse-grained

✅ Correct Answer: 3

The write operation on RDD is

1. Fine-grained
2. Coarse-grained
3. Either fine-grained or coarse-grained
4. Neither fine-grained nor coarse-grained

✅ Correct Answer: 2

Which one of the following commands does NOT trigger an eager evaluation?

1. df.collect()
2. df.take()
3. df.show()
4. df.join()

✅ Correct Answer: 4

Which one of the following commands triggers an eager evaluation?

1. df.filter()
2. df.select()
3. df.show()
4. df.limit()

✅ Correct Answer: 3
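
The difference between lazy transformations (such as filter, select, or join) and eager actions (such as show or collect) can be sketched with a small toy model. This is plain Python, not the Spark API: the `LazyFrame` class and the `is_even` helper are hypothetical names invented for illustration, and real Spark plans are far more elaborate.

```python
# Toy model of lazy evaluation in plain Python; this is not the Spark API.
# LazyFrame and its methods are hypothetical, named to mirror the quiz options.
class LazyFrame:
    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps, not yet executed

    def filter(self, predicate):
        # Transformation: returns a new object with one more recorded step.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, func):
        # Transformation: also only records the step.
        return LazyFrame(self._rows, self._plan + (("select", func),))

    def collect(self):
        # Action: only now is the recorded plan actually run.
        out = list(self._rows)
        for kind, fn in self._plan:
            if kind == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out


calls = []


def is_even(x):
    calls.append(x)  # lets us observe when the predicate really executes
    return x % 2 == 0


lazy = LazyFrame([1, 2, 3, 4]).filter(is_even).select(lambda x: x * 10)
calls_before_action = len(calls)  # 0: nothing has run yet
result = lazy.collect()
calls_after_action = len(calls)   # 4: the predicate ran once per row
print(calls_before_action, result, calls_after_action)
```

Chaining `filter` and `select` does no work on the data; the predicate is first invoked when `collect` is called.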

Is it possible to mitigate stragglers in RDD?

1. Yes
2. No
3. Both
4. None of the mentioned

✅ Correct Answer: 1

Fault Tolerance in RDD is achieved using

1. Immutable nature of RDD
2. DAG (Directed Acyclic Graph)
3. Lazy-evaluation
4. None of the above

✅ Correct Answer: 2
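
The idea behind this answer is that each RDD remembers how it was derived from its parents, so lost data can be recomputed by replaying that lineage. The following is a minimal plain-Python sketch of that idea, not Spark code: `ToyRDD` and all of its attributes are hypothetical names, and the "failure" is simulated by clearing cached results.

```python
# Toy model of lineage-based recovery; not Spark code. All names are hypothetical.
class ToyRDD:
    def __init__(self, source=None, parent=None, func=None):
        self.source = source  # base data for a root dataset
        self.parent = parent  # lineage: which dataset this one derives from
        self.func = func      # lineage: how it was derived
        self.cached = None    # materialized data, which may be lost

    def map(self, func):
        # Records the dependency instead of computing anything.
        return ToyRDD(parent=self, func=func)

    def compute(self):
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = [self.func(x) for x in self.parent.compute()]
        return self.cached


base = ToyRDD(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
plus_one = squared.map(lambda x: x + 1)

first = plus_one.compute()

# Simulate losing materialized data, for example because a worker failed.
plus_one.cached = None
squared.cached = None

recovered = plus_one.compute()  # rebuilt by replaying the recorded lineage
print(first, recovered)
```

Because the inputs are never modified and every step is recorded, the recomputed result matches the original one.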

What is an action in Spark RDD?

1. The ways to send result from executors to the driver
2. Takes RDD as input and produces one or more RDD as output.
3. Creates one or many new RDDs
4. All of the above

✅ Correct Answer: 1

The shortcomings of Hadoop MapReduce were overcome by Spark RDD through

1. Lazy-evaluation
2. DAG
3. In-memory processing
4. All of the above

✅ Correct Answer: 4

In which language is Spark developed?

1. Java
2. Scala
3. Python
4. R

✅ Correct Answer: 2

Which of the following is NOT an action?

1. foreach()
2. printSchema()
3. first()
4. reduce()

✅ Correct Answer: 2

Which of the following is an action?

1. foreach()
2. printSchema()
3. cache()
4. sort()

✅ Correct Answer: 1

Which of the following is a transformation?

1. foreach()
2. flatMap()
3. save()
4. count()

✅ Correct Answer: 2

Which of the following is not a component of the Spark Ecosystem?

1. Sqoop
2. GraphX
3. MLlib
4. BlinkDB

✅ Correct Answer: 1

Which of the following algorithms is not present in MLlib?

1. Streaming Linear Regression
2. Streaming KMeans
3. Tanimoto distance
4. None of the above

✅ Correct Answer: 3

Which of the following is not a feature of Spark?

1. Supports in-memory computation
2. Fault-tolerance
3. It is cost-efficient
4. Compatible with other file storage systems

✅ Correct Answer: 3

Which of the following is the reason Spark is faster than MapReduce?

1. DAG execution engine and in-memory computation
2. Support for different language APIs like Scala, Java, Python and R
3. RDDs are immutable and fault-tolerant
4. None of the above

✅ Correct Answer: 1

Which of the following statements are NOT true for broadcast variables?

1. Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task.
2. A custom broadcast class can be defined by extending org.apache.spark.utilbroadcastV2 in Java or Scala or pyspark.Accumulatorparams in Python.
3. It is a way of updating a value inside a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way.
4. It provides a mutable variable that a Spark cluster can safely update on a per-row basis.

✅ Correct Answer: 2, 3, 4

Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task.

1. True
2. False
3. Can’t Specify
4. None of the mentioned

✅ Correct Answer: 1

Broadcast variables are ______ and lazily replicated across all nodes in the cluster when an action is triggered.

1. mutable
2. immutable
3. both
4. None of the above

✅ Correct Answer: 2
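
The two properties tested in these broadcast questions, "cached once per machine" and "immutable", can be illustrated with a rough plain-Python sketch. This is a toy cost model rather than Spark code: the lookup table, the machine count, and the tasks-per-machine figure are made-up assumptions, and `MappingProxyType` only stands in for a read-only shared value.

```python
import pickle
from types import MappingProxyType

# Toy cost model in plain Python; not Spark code. All numbers are assumptions.
lookup = {i: str(i) for i in range(1000)}  # hypothetical read-only lookup table
payload_size = len(pickle.dumps(lookup))

num_machines = 4
tasks_per_machine = 50

# Without broadcasting: the table is serialized along with every single task.
bytes_per_task_shipping = payload_size * num_machines * tasks_per_machine

# With broadcasting: the table is sent once per machine and cached there.
bytes_broadcast = payload_size * num_machines

savings_factor = bytes_per_task_shipping // bytes_broadcast

# A read-only view stands in for the immutability of a broadcast value.
shared = MappingProxyType(lookup)
try:
    shared[0] = "changed"
    write_allowed = True
except TypeError:
    write_allowed = False

print(savings_factor, write_allowed)
```

In this model the saving grows with the number of tasks that run on each machine, which is why broadcasting pays off for large shared lookup data.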