🧪 High Performance Computing (HPC) MCQ Quiz Hub
High Performance Computing (HPC) MCQ Set 1
Choose a topic to test your knowledge and improve your High Performance Computing (HPC) skills
1. A CUDA program is comprised of two primary components: a host and a _____.
gpu kernel
cpu kernel
os
none of the above
2. The kernel code is identified by the ________ qualifier with a void return type.
__host__
__global__
__device__
void
3. Calling a kernel is typically referred to as _________.
kernel thread
kernel initialization
kernel termination
kernel invocation
4. The BlockPerGrid and ThreadPerBlock parameters are related to the ________ model supported by CUDA.
host
kernel
thread abstraction
None of the above
5. _______ is callable from the device only.
__host__
__global__
__device__
None of the above
6. ____ is callable from the host.
__host__
__global__
__device__
none of the above
7. ______ is callable from the host.
__host__
__global__
__device__
None of the above
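For reference, questions 2 and 5 through 7 all turn on the three CUDA function-type qualifiers. A minimal illustrative sketch (the function names here are hypothetical, not part of the quiz):

// __global__: a kernel; runs on the device, callable from the host.
__global__ void my_kernel(void) { }

// __device__: runs on the device, callable from device code only.
__device__ int add_on_device(int a, int b) { return a + b; }

// __host__: runs on the host, callable from the host (the default for plain C functions).
__host__ void host_helper(void) { }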
8. CUDA supports ____________ in which code in a single thread is executed by all other threads.
thread division
thread termination
thread abstraction
None of the above
9. In CUDA, a single invoked kernel is referred to as a _____.
block
thread
grid
None of the above
10. A grid is comprised of ________ of threads.
block
bunch
host
None of the above
11. A block is comprised of multiple _______.
threads
bunch
host
None of the above
12. A solution to the problem of representing parallelism in an algorithm is
cud
pta
cda
cuda
13. Host code in a CUDA application cannot reset a device.
true
false
all
None of These
14. Any condition that causes a processor to stall is called as _____.
hazard
page fault
system error
none of the above
15. The time lost due to branch instruction is often referred to as _____.
latency
delay
branch penalty
None of the above
16. The ___ method is used in centralized systems to perform out-of-order execution.
scorecard
scoreboarding
optimizing
redundancy
17. The computer cluster architecture emerged as an alternative to ____.
isa
workstation
supercomputers
distributed systems
18. NVIDIA CUDA Warp is made up of how many threads?
512
1024
312
32
19. Out-of-order execution of instructions is not possible on GPUs.
true
false
20. CUDA supports programming in ....
c or c++ only
java, python, and more
c, c++, third party wrappers for java, python, and more
pascal
21. FADD, FMAD, FMIN, FMAX are ----- supported by Scalar Processors of NVIDIA GPU.
32-bit ieee floating point instructions
32-bit integer instructions
both
none of the above
22. Each streaming multiprocessor (SM) of CUDA hardware has ------ scalar processors (SP).
1024
128
512
8
23. Each NVIDIA GPU has ------ Streaming Multiprocessors
8
1024
512
16
24. CUDA provides ------- warp and thread scheduling. Also, the overhead of thread creation is on the order of ----.
programming-overhead, 2 clock
zero-overhead, 1 clock
64, 2 clock
32, 1 clock
25. Each warp of GPU receives a single instruction and “broadcasts” it to all of its threads. It is a ---- operation.
simd (single instruction multiple data)
simt (single instruction multiple thread)
sisd (single instruction single data)
sist (single instruction single thread)
26. Limitations of CUDA Kernel
recursion, call stack, static variable declaration
no recursion, no call stack, no static variable declarations
recursion, no call stack, static variable declaration
no recursion, call stack, no static variable declarations
27. What is Unified Virtual Machine?
it is a technique that allows both the cpu and the gpu to read from a single virtual machine, simultaneously.
it is a technique for managing separate host and device memory spaces.
it is a technique for executing device code on host and host code on device.
it is a technique for executing general purpose programs on device instead of host.
28. _____ became the first language specifically designed by a GPU Company to facilitate general purpose computing on ____.
python, gpus.
c, cpus.
cuda c, gpus.
java, cpus.
29. The CUDA architecture consists of --------- for parallel computing kernels and functions.
risc instruction set architecture
cisc instruction set architecture
zisc instruction set architecture
ptx instruction set architecture
30. CUDA stands for --------, designed by NVIDIA.
common union discrete architecture
complex unidentified device architecture
compute unified device architecture
complex unstructured distributed architecture
31. The host processor spawns multithreaded tasks (or kernels, as they are known in CUDA) onto the GPU device. State true or false.
true
false
32. The NVIDIA G80 is a ---- CUDA core device, the NVIDIA G200 is a ---- CUDA core device, and the NVIDIA Fermi is a ---- CUDA core device
28, 256, 512
32, 64, 128
64, 128, 256
256, 512, 1024
33. NVIDIA 8-series GPUs offer -------- .
50-200 gflops
200-400 gflops
400-800 gflops
800-1000 gflops
34. IADD, IMUL24, IMAD24, IMIN, IMAX are ----------- supported by Scalar Processors of NVIDIA GPU.
32-bit ieee floating point instructions
32-bit integer instructions
both
none of the above
35. CUDA Hardware programming model supports: a) fully general data-parallel architecture; b) General thread launch; c) Global load-store; d) Parallel data cache; e) Scalar architecture; f) Integer and bit operations
a,c,d,f
b,c,d,e
a,d,e,f
a,b,c,d,e,f
36. In the CUDA memory model, the following memory types are available: a) Registers; b) Local Memory; c) Shared Memory; d) Global Memory; e) Constant Memory; f) Texture Memory.
a, b, d, f
a, c, d, e, f
a, b, c, d, e, f
b, c, e, f
37. What is the CUDA C equivalent of this standard C program: int main(void) { printf("Hello, World! "); return 0; }
int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; }
__global__ void kernel( void ) { } int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; }
__global__ void kernel( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; }
_global__ int main ( void ) { kernel <<<1,1>>>(); printf("hello, world!\n"); return 0; }
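Formatted for readability, option (b) above, the conventional CUDA C counterpart of the plain C program, reads as follows (a sketch assuming stdio.h is included for printf):

#include <stdio.h>

// An empty kernel: __global__ marks it as device code launchable from the host.
__global__ void kernel(void) { }

int main(void) {
    kernel<<<1, 1>>>();          // launch one block containing one thread
    printf("hello, world!\n");   // ordinary host code
    return 0;
}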
38. Which function runs on the device (i.e., the GPU): a) __global__ void kernel (void ) { } b) int main ( void ) { ... return 0; }
a
b
both a,b
39. If a is a host variable and dev_a is a device (GPU) variable, select the correct statement to allocate memory for dev_a:
cudaMalloc( &dev_a, sizeof( int ) )
malloc( &dev_a, sizeof( int ) )
cudaMalloc( (void**) &dev_a, sizeof( int ) )
malloc( (void**) &dev_a, sizeof( int ) )
40. If a is a host variable and dev_a is a device (GPU) variable, select the correct statement to copy input from variable a to variable dev_a:
memcpy( dev_a, &a, size );
cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
memcpy( (void*) dev_a, &a, size );
cudaMemcpy( (void*) &dev_a, &a, size, cudaMemcpyDeviceToHost );
41. What does the triple angle bracket mark in a statement inside the main function indicate?
a call from host code to device code
a call from device code to host code
less than comparison
greater than comparison
42. What makes CUDA code run in parallel?
__global__ indicates parallel execution of code
main() function indicates parallel execution of code
kernel name outside the triple angle brackets indicates execution of the kernel n times in parallel
first parameter value inside the triple angle brackets (n) indicates execution of the kernel n times in parallel
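Questions 39 through 42 fit together in a single host-side sequence. The sketch below is illustrative only; the kernel name add and the size N are hypothetical and not part of the quiz:

#include <stdio.h>
#define N 16

// Hypothetical kernel: each block increments one element of the array.
__global__ void add(int *a) {
    int i = blockIdx.x;              // one block per element
    if (i < N) a[i] += 1;
}

int main(void) {
    int a[N] = {0};
    int *dev_a;

    cudaMalloc((void**)&dev_a, N * sizeof(int));                    // question 39: allocate device memory
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);  // question 40: copy host to device
    add<<<N, 1>>>(dev_a);                                           // questions 41 and 42: host-to-device call, N parallel blocks
    cudaMemcpy(a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);

    printf("a[0] = %d\n", a[0]);
    return 0;
}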
43. In ___________, the number of elements to be sorted is small enough to fit into the process's main memory.
internal sorting
internal searching
external sorting
external searching
44. _____________ algorithms use auxiliary storage (such as tapes and hard disks) for sorting because the number of elements to be sorted is too large to fit into memory.
internal sorting
internal searching
external sorting
external searching
45. ____ can be comparison-based or noncomparison-based.
searching
sorting
both a and b
none of the above
46. The fundamental operation of comparison-based sorting is ________.
compare-exchange
searching
sorting
swapping
47. The performance of quicksort depends critically on the quality of the ______.
non-pivot
pivot
center element
len of array
48. The main advantage of ______ is that its storage requirement is linear in the depth of the state space being searched.
bfs
dfs
a and b
None of the above
49. ___ algorithms use a heuristic to guide search.
bfs
dfs
a and b
none of the above
50. Graph search involves a closed list, where the major operation is a _______
sorting
searching
lookup
None of the above
51. Breadth First Search is equivalent to which of the traversal in the Binary Trees?
pre-order traversal
post-order traversal
level-order traversal
in-order traversal
52. What is the time complexity of Breadth First Search? (V – number of vertices, E – number of edges)
O(V + E)
O(V)
O(E)
O(V*E)
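For question 52, the standard adjacency-list argument is worth recalling: every vertex is enqueued and dequeued at most once and every edge is examined at most a constant number of times, so the running time of Breadth First Search is O(V + E).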
53. Which of the following is not an application of Breadth First Search?
when the graph is a binary tree
when the graph is a linked list
when the graph is a n-ary tree
when the graph is a ternary tree
54. In BFS, how many times is a node visited?
once
twice
equivalent to the indegree of the node
thrice
55. Which of the following is not a stable sorting algorithm in its typical implementation?
insertion sort
merge sort
quick sort
bubble sort
56. Which of the following is not true about comparison-based sorting algorithms?
the minimum possible time complexity of a comparison-based sorting algorithm is O(n log n) for a random input array
any comparison-based sorting algorithm can be made stable by using position as a criterion when two elements are compared
counting sort is not a comparison-based sorting algorithm
heap sort is not a comparison-based sorting algorithm
57. Mathematically, efficiency is
e=s/p
e=p/s
e*s=p/2
e=p+e/e
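For questions 57 and 58, the usual definitions from parallel performance analysis are: speedup S = Ts / Tp, efficiency E = S / p = Ts / (p · Tp), and cost C = p · Tp, also called the processor-time product. As a quick worked example, a program that takes 100 s serially and 20 s on 8 processors has S = 5 and E = 5/8 = 0.625.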
58. The cost of a parallel system is sometimes referred to as the ____.
work
processor-time product
both
None of the above
59. In the scaling characteristics of parallel programs, Ts (the serial runtime) is ____.
increase
constant
decreases
none
60. Speedup tends to saturate and efficiency _____ as a consequence of Amdahl’s law.
increase
constant
decreases
none
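The behaviour in question 60 follows from Amdahl's law: if a fraction f of the work is inherently serial, the speedup on p processing elements is bounded by S(p) = 1 / (f + (1 - f)/p) ≤ 1/f, so speedup saturates while efficiency E = S/p keeps decreasing as p grows.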
61. Speedup obtained when the problem size is _______ linearly with the number of processing elements.
increase
constant
decreases
depend on problem size
62. The n × n matrix is partitioned among n processors, with each processor storing a complete ___ of the matrix.
row
column
both
depend on processor
63. Cost-optimal parallel systems have an efficiency of ___.
1
n
logn
complex
64. The n × n matrix is partitioned among n² processors such that each processor owns a _____ element.
n
2n
single
double
65. How many basic communication operations are used in matrix-vector multiplication?
1
2
3
4
66. The DNS algorithm of matrix multiplication uses ____.
1d partition
2d partition
3d partition
both a,b
67. In pipelined execution, the steps include ____.
normalization
communication
elimination
all
68. The cost of the parallel algorithm is higher than the sequential run time by a factor of __.
3/2
2/3
3*2
2/3+3/2
69. The load imbalance problem in parallel Gaussian elimination can be alleviated by using a ____ mapping.
acyclic
cyclic
both
none
70. A parallel algorithm is evaluated by its runtime as a function of
the input size
the number of processors
the communication parameters
all
71. For a problem consisting of W units of work, p__W processors can be used optimally.
<=
>=
<
>
72. C(W)__Θ(W) for optimality (necessary condition).
>
<
<=
equals
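For questions 63 and 72: a parallel system is cost-optimal when its cost grows at the same asymptotic rate as the best sequential work, i.e. C(W) = p · Tp = Θ(W), which is equivalent to an efficiency of E = W / (p · Tp) = Θ(1).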
73. Many interactions in practical parallel programs occur in ____ patterns.
well-defined
zig-zag
reverse
straight
74. Efficient implementation of basic communication operations can improve ____.
performance
communication
algorithm
all
75. Efficient use of basic communication operations can reduce ____.
development effort
software quality
both
none
76. Group communication operations are built using ____ messaging primitives.
point-to-point
one-to-all
all-to-one
none
77. One processor has a piece of data that it needs to send to everyone; this is ____.
one-to-all
all-to-one
point-to-point
all of the above
78. The dual of one-to-all broadcast is ____.
all-to-one reduction
one-to-all reduction
point-to-point reduction
none
79. Data items must be combined piece-wise and the result made available at
target processor finally
target variable finally
both (a) and (b)
None of these