Latest [Oct 25, 2022] Databricks Databricks-Certified-Professional-Data-Scientist Real Exam Dumps PDF [Q82-Q106]



Databricks-Certified-Professional-Data-Scientist Practice Test Questions Updated 140 Questions


Databricks Databricks-Certified-Professional-Data-Scientist Exam Syllabus Topics:

Topic / Details
Topic 1
  • A complete understanding of the basics of machine learning
  • In-sample vs. out-of-sample data
Topic 2
  • An intermediate understanding of the steps in the machine learning lifecycle
  • Model training, selection, and production
Topic 3
  • A complete understanding of basic machine learning algorithms and techniques
  • Unsupervised techniques like K-means and PCA

 

NEW QUESTION 82
Spam filtering of emails is an example of

  • A. 2 and 3 are correct
  • B. Unsupervised learning
  • C. Clustering
  • D. 1 and 3 are correct
  • E. Supervised learning

Answer: E

Explanation:
Clustering is an example of unsupervised learning. The clustering algorithm finds groups within the data without being told what to look for upfront. This contrasts with classification, an example of supervised machine learning, which is the process of determining to which class an observation belongs. A common application of classification is spam filtering. With spam filtering we use labeled data to train the classifier: e-mails marked as spam or ham.
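As a rough illustration of training a classifier on labeled e-mails, the sketch below implements a tiny multinomial Naive Bayes spam filter with the Python standard library only. The four training messages, the `train` and `classify` names, and the add-one smoothing are assumptions made for this example; they are not part of the exam material.

```python
import math
from collections import Counter

def train(labeled_messages):
    """Count words per class from (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    class_counts = Counter()  # label -> number of messages
    for text, label in labeled_messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]  # 0 for unseen words
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data: each message is marked "spam" or "ham".
training = [
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify("claim your cash prize", model))
print(classify("monday meeting notes", model))
```

The labels supplied with the training messages are what make this supervised learning: without them, the same word counts could only be clustered, not classified.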

 

NEW QUESTION 83
Scenario: Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.
Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. Which of the following methods will the boss use to estimate the probability that Bob drove to work?

  • A. Linear regression
  • B. None of the above
  • C. Naive Bayes
  • D. Random decision forests

Answer: C

Explanation:
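Bayes' theorem (also known as Bayes' rule) is a useful tool for calculating conditional probabilities. Here it gives P(car | late) = P(late | car) P(car) / P(late), where P(late) is the sum of P(late | mode) P(mode) over the three modes of transportation.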
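To make the calculation concrete, here is a minimal sketch that applies Bayes' theorem to the numbers given in the scenario (equal priors of 1/3 and the three lateness probabilities). The `posterior` helper name is an assumption for the example.

```python
from fractions import Fraction

def posterior(priors, likelihoods, hypothesis):
    """Bayes' theorem: P(H | E) = P(E | H) P(H) / sum over h of P(E | h) P(h)."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return priors[hypothesis] * likelihoods[hypothesis] / evidence

# Values taken from the scenario: equal priors and P(late | mode) for each mode.
priors = {mode: Fraction(1, 3) for mode in ("car", "bus", "train")}
p_late = {
    "car": Fraction(50, 100),
    "bus": Fraction(20, 100),
    "train": Fraction(1, 100),
}

p_car_given_late = posterior(priors, p_late, "car")
print(p_car_given_late, float(p_car_given_late))
```

With these inputs the posterior works out to 50/71, roughly 0.70, so being late raises the boss's belief that Bob drove from 1/3 to about 70%.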

 

NEW QUESTION 84
Select the correct statement which applies to logistic regression

  • A. All 1, 2 and 3 are correct
  • B. May have low accuracy
  • C. Computationally inexpensive, easy to implement, knowledge representation easy to interpret
  • D. Only 1 and 3 are correct
  • E. Works with Numeric values

Answer: A

Explanation:
Logistic regression
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values

 

NEW QUESTION 85
Which of the following describes a true property of the logistic regression method?

  • A. It handles missing values well.
  • B. It works well with discrete variables that have many distinct values.
  • C. It works well with variables that affect the outcome in a discontinuous way.
  • D. It is robust with redundant variables and correlated variables.

Answer: D

 

NEW QUESTION 86
You are designing a recommendation engine for a website. The engine generates more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user. What kind of recommendation engine is this?

  • A. Content-based filtering
  • B. Logistic Regression
  • C. Naive Bayes classifier
  • D. Collaborative filtering

Answer: D

Explanation:
Another aspect of collaborative filtering systems is the ability to generate more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to be of similar taste to a given user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user.

 

NEW QUESTION 87
You are working as a data science consultant for a gaming company. You have a three-member team, and all other stakeholders, such as the project managers, project sponsors and data team, are from the company itself.
During a discussion, the project manager asks when you will be able to tell him that the model you are using is robust enough. After which step can you answer this question?

  • A. Model planning
  • B. Operationalize
  • C. Model building
  • D. Data Preparation
  • E. Discovery

Answer: C

Explanation:
To answer whether the model you are building is robust enough, you need answers to at least the following questions:
- Is the model performing as expected on the test data?
- Is the hypothesis defined in the initial phase being tested?
- Do we need more data?
- Are the domain experts convinced by the model?
All of these can be answered once you have built the model and tested it on the test data sets. Hence the correct option is Model building.

 

NEW QUESTION 88
Digit recognition is an example of ...

  • A. Unsupervised learning
  • B. Clustering
  • C. None of the above
  • D. Classification

Answer: D

Explanation:
Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. Digit recognition, once again, is a common example of classification learning. More generally, classification learning is appropriate for any problem where deducing a classification is useful and the classification is easy to determine. In some cases, it might not even be necessary to give pre-determined classifications to every instance of a problem if the agent can work out the classifications for itself. This would be an example of unsupervised learning in a classification context.

 

NEW QUESTION 89
A bio-scientist is working on the analysis of cancer cells. To identify whether a cell is cancerous, hundreds of tests with small variations have been run, each giving a yes/no result. Given the test results for a sample of healthy and cancerous cells, which of the following techniques would you use to determine whether a cell is healthy?

  • A. Linear regression
  • B. Identification Test
  • C. Naive Bayes
  • D. Collaborative filtering

Answer: C

Explanation:
In this problem you have been given high-dimensional independent variables (yes/no test results) and you have to predict one of two classes (healthy or cancerous). Any of the following techniques can be applied to this kind of problem: support vector machines, Naive Bayes, logistic regression, random decision forests. Of these, only Naive Bayes appears among the options.

 

NEW QUESTION 90
Clustering is a type of unsupervised learning with the following goals

  • A. 2 and 3
  • B. 1 and 2
  • C. Not to maximize a utility function
  • D. Find similarities in the training data
  • E. Maximize a utility function

Answer: A

Explanation:
One type of unsupervised learning is called clustering. In this type of learning, the goal is not to maximize a utility function, but simply to find similarities in the training data.
The assumption is often that the clusters discovered will match reasonably well with an intuitive classification.
For instance, clustering individuals based on demographics might result in a clustering of the wealthy in one group and the poor in another. Clustering can be useful when there is enough data to form clusters (though this turns out to be difficult at times) and especially when additional data about members of a cluster can be used to produce further results due to dependencies in the data.

 

NEW QUESTION 91
A researcher is interested in how variables such as GRE (Graduate Record Exam) scores, GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
Above is an example of

  • A. Maximum likelihood estimation
  • B. Logistic Regression
  • C. Recommendation system
  • D. Linear Regression
  • E. Hierarchical linear models

Answer: B

Explanation:
Logistic regression
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values
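The reason logistic regression suits a binary response such as admit/don't admit is that it passes a linear score through the logistic (sigmoid) function to obtain a probability between 0 and 1. The sketch below shows only that prediction step. The coefficient values are hypothetical and chosen for illustration (they are not fitted to any data), and `admit_probability` is an assumed helper name.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def admit_probability(gre, gpa, prestige_rank, coef):
    """Linear score of the predictors, squashed into a probability."""
    z = (coef["intercept"]
         + coef["gre"] * gre
         + coef["gpa"] * gpa
         + coef["rank"] * prestige_rank)
    return sigmoid(z)

# Hypothetical coefficients, for illustration only (not estimated from data).
coef = {"intercept": -4.0, "gre": 0.002, "gpa": 0.8, "rank": -0.5}

p_low = admit_probability(gre=700, gpa=3.0, prestige_rank=2, coef=coef)
p_high = admit_probability(gre=700, gpa=3.9, prestige_rank=2, coef=coef)
print(f"GPA 3.0: {p_low:.3f}, GPA 3.9: {p_high:.3f}")
```

In practice the coefficients would be estimated from the admissions data, typically by maximum likelihood.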

 

NEW QUESTION 92
Select the correct statement which applies to Supervised learning

  • A. We ask the machine to learn from our data when we specify a target variable.
  • B. Instead of telling the machine "Predict Y for our data X", we're asking "What can you tell me about X?"
  • C. It reduces the machine's task to only divining some pattern from the input data to get the target variable

Answer: A,C

Explanation:
Supervised learning asks the machine to learn from our data when we specify a target variable.
This reduces the machine's task to only divining some pattern from the input data to get the target variable.
In unsupervised learning we don't have a target variable as we did in classification and regression.
Instead of telling the machine "Predict Y for our data X", we're asking "What can you tell me about X?"
Things we ask the machine to tell us about X may be "What are the six best groups we can make out of X?" or "What three features occur together most frequently in X?"

 

NEW QUESTION 93
You are working in an e-commerce organization, where you are designing and evaluating a recommender system. Which of the following metrics will always have the largest value?

  • A. Sum of Errors
  • B. Information is not good enough.
  • C. Mean Absolute Error
  • D. Both 1 and 2
  • E. Root Mean Square Error

Answer: B

 

NEW QUESTION 94
RMSE measures the error of a predicted

  • A. Numerical value
  • B. Both numerical and categorical values
  • C. Categorical value

Answer: A
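RMSE is built from numeric differences between predicted and actual values, which is why it applies to numerical predictions such as ratings. The following minimal sketch computes RMSE next to the mean absolute error (MAE) for comparison. The rating values are made up for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square the errors, average, take the root."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 4.0]

print(f"RMSE = {rmse(actual, predicted):.4f}")
print(f"MAE  = {mae(actual, predicted):.4f}")
```

Because the errors are squared before averaging, RMSE penalizes large individual errors more heavily than MAE and is never smaller than MAE on the same data.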

 

NEW QUESTION 95
Refer to Exhibit

In the exhibit, the x-axis represents the derived probability of a borrower defaulting on a loan. Also in the exhibit, the pink represents borrowers that are known to have not defaulted on their loan, and the blue represents borrowers that are known to have defaulted on their loan. Which analytical method could produce the probabilities needed to build this exhibit?

  • A. Discriminant Analysis
  • B. Logistic Regression
  • C. Association Rules
  • D. Linear Regression

Answer: B

 

NEW QUESTION 96
Suppose there are three events. Which formula must always be equal to P(E1|E2,E3)?

  • A. P(E1,E2,E3)P(E1)/P(E2,E3)
  • B. P(E1,E2,E3)/P(E2,E3)
  • C. P(E1,E2|E3)P(E2|E3)P(E3)
  • D. P(E1,E2,E3)P(E2)P(E3)
  • E. P(E1,E2|E3)P(E3)

Answer: B

Explanation:
This is an application of conditional probability: P(E1,E2) = P(E1|E2)P(E2), so P(E1|E2) = P(E1,E2)/P(E2). Applying the same rule with the joint event (E2,E3) as the condition gives P(E1|E2,E3) = P(E1,E2,E3)/P(E2,E3). If the events are A and B respectively, P(A|B) is read as "the probability of A given B"; it is sometimes also written PB(A). When both A and B are categorical variables, a conditional probability table is typically used to represent the conditional probability.
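The identity can be checked numerically on any joint distribution. The sketch below uses a hypothetical joint distribution over three binary events (the weights are arbitrary) and exact fractions, so the comparison has no rounding error.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over (E1, E2, E3); weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {
    outcome: Fraction(w, total)
    for outcome, w in zip(product([False, True], repeat=3), weights)
}

def prob(predicate):
    """Probability of the set of outcomes satisfying the predicate."""
    return sum(p for outcome, p in joint.items() if predicate(*outcome))

p_e1_e2_e3 = prob(lambda e1, e2, e3: e1 and e2 and e3)
p_e2_e3 = prob(lambda e1, e2, e3: e2 and e3)
p_e3 = prob(lambda e1, e2, e3: e3)

# Option B: P(E1,E2,E3) / P(E2,E3)
p_e1_given_e2_e3 = p_e1_e2_e3 / p_e2_e3

# Option E: P(E1,E2|E3) P(E3) collapses to the joint P(E1,E2,E3) instead.
option_e = (p_e1_e2_e3 / p_e3) * p_e3

print(p_e1_given_e2_e3, option_e)
```

For this particular distribution option B evaluates to 2/3, while option E gives the joint probability 2/9, so the two are not interchangeable.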

 

NEW QUESTION 97
You have modeled a dataset with 5 independent variables, A, B, C, D and E, which do not depend on each other. Variables A, B and C are continuous, and variables D and E are discrete (mixed mode).
You now have to compute the expected value of one variable, say A. Which of the following computations would you prefer?

  • A. Generalization
  • B. Integration
  • C. Differentiation
  • D. Transformation

Answer: B

Explanation:
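Because A is a continuous variable, its expected value is obtained by integration: E[A] is the integral of a·f(a) over the range of A, where f is the probability density of A. For a discrete variable such as D or E, the integral is replaced by a sum over the possible values.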
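As a small illustration of computing an expected value by integration, the sketch below approximates E[X] = ∫ x·f(x) dx with the midpoint rule. The density f(x) = 2x on [0, 1] is a hypothetical choice for the example (its exact expected value is 2/3), and `expected_value` is an assumed helper name.

```python
def expected_value(pdf, lower, upper, steps=10_000):
    """Approximate E[X] = integral of x * pdf(x) dx using the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th slice
        total += x * pdf(x) * width
    return total

# Hypothetical density for a continuous variable: f(x) = 2x on [0, 1].
def pdf(x):
    return 2.0 * x

estimate = expected_value(pdf, 0.0, 1.0)
print(f"E[X] is approximately {estimate:.6f} (exact value is 2/3)")
```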

 

NEW QUESTION 98
Feature hashing is the approach in which "SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and shoehorning the training data into vectors of that size". With large vectors or with multiple locations per feature, what is the effect of feature hashing?

  • A. Is a problem with accuracy as well as hard to understand what classifier is doing
  • B. It is hard to understand what classifier is doing
  • C. Is a problem with accuracy
  • D. It is easy to understand what classifier is doing

Answer: B

Explanation:
FEATURE HASHING
SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and shoehorning the training data into vectors of that size. This approach is known as feature hashing. The shoehorning is done by picking one or more locations by using a hash of the name of the variable for continuous variables or a hash of the variable name and the category name or word for categorical, text-like, or word-like data.
This hashed feature approach has the distinct advantage of requiring less memory and one less pass through the training data, but it can make it much harder to reverse engineer vectors to determine which original feature mapped to a vector location. This is because multiple features may hash to the same location. With large vectors or with multiple locations per feature, this isn't a problem for accuracy but it can make it hard to understand what a classifier is doing.
An additional benefit of feature hashing is that the unknown and unbounded vocabularies typical of word-like variables aren't a problem.
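The description above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular library: the `bucket` and `hash_features` names, the use of MD5 as a stable hash, the vector size of 16 and the feature names are all assumptions made for the example.

```python
import hashlib

def bucket(name, size, probe=0):
    """Map a feature name (plus a probe number) to a stable vector index."""
    digest = hashlib.md5(f"{name}#{probe}".encode("utf-8")).hexdigest()
    return int(digest, 16) % size

def hash_features(continuous, categorical, size=16, probes=1):
    """Encode features into a fixed-size vector without a vocabulary pass."""
    vector = [0.0] * size
    # Continuous variables: hash the variable name, add the value.
    for name, value in continuous.items():
        for probe in range(probes):
            vector[bucket(name, size, probe)] += value
    # Categorical or word-like variables: hash "name=category", add 1.
    for name, category in categorical.items():
        for probe in range(probes):
            vector[bucket(f"{name}={category}", size, probe)] += 1.0
    return vector

vec = hash_features(
    continuous={"age": 42.0},
    categorical={"country": "NZ", "word": "hello"},
    size=16,
    probes=2,
)
print(vec)
```

Two different features may land in the same index, and nothing in the vector records which ones did, which is why the hashed representation is hard to reverse engineer even when accuracy is unaffected.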

 

NEW QUESTION 99
You are working on a problem where you have to predict whether a claim is valid or not. You find that most of the invalid claims have spelling errors as well as corrections in the manually filled claim forms, compared with the honest claims. Which of the following techniques is suitable for finding out whether a claim is valid or not?

  • A. Random Decision Forests
  • B. Logistic Regression
  • C. Naive Bayes
  • D. Any one of the above

Answer: D

Explanation:
In this problem you have been given high-dimensional independent variables like texts, corrections, test results etc. and you have to predict either valid or not valid (one of two classes). So any of the following techniques can be applied to this problem: support vector machines, Naive Bayes, logistic regression, random decision forests.

 

NEW QUESTION 100
Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:
In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%.
Select the correct statement

  • A. Precision is low, which means the classifier is predicting positives poorly
  • B. 2 and 3
  • C. Precision is low, which means the classifier is predicting positives best
  • D. 1 and 3
  • E. problem domain has a major impact on the measures that should be used to evaluate a classifier within it

Answer: B

Explanation:
In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%. In this case, Precision is low, which means the classifier is predicting positives poorly. However, the three other measures seem to suggest that this is a good classifier. This just goes to show that the problem domain has a major impact on the measures that should be used to evaluate a classifier within it, and that looking at the 4 simple cases presented is not sufficient.
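The four measures can be reproduced from confusion-matrix counts. The matrix itself is not reproduced on this page, so the counts below are an assumption: they are one set of values consistent with 600 positives out of 11,100 instances and with the quoted percentages.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Standard measures derived from the four confusion-matrix cells."""
    return {
        "precision": tp / (tp + fp),      # predicted positives that are correct
        "recall": tp / (tp + fn),         # actual positives that were found
        "specificity": tn / (tn + fp),    # actual negatives that were found
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Assumed counts (the original matrix is missing): 600 actual positives,
# 10,500 actual negatives, chosen to match the quoted rates.
metrics = classifier_metrics(tp=500, fp=500, fn=100, tn=10_000)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With heavily imbalanced classes like these, accuracy and specificity are dominated by the large negative class, which is how they can look good while half of the predicted positives are wrong.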

 

NEW QUESTION 101
You have data for 1000 patients, with height and age, where age is in years and height is in meters. You want to create clusters using these two attributes, and you want age and height to have nearly equal effect when creating the clusters. What can you do?

  • A. You will be converting each height value to centimeters
  • B. You will be taking square root of height
  • C. You will be dividing both age and height with their respective standard deviation
  • D. You will be adding height with the numeric value 100

Answer: A,C

Explanation:
When you look at the data, age in years takes values like 50, 60, 70, 90 years etc. When calculating the distance from a centroid, the maximum possible difference can be 90-0, and its square is 8100.
Height in meters differs by at most about 2-0.5 (1.5) meters, and its square is only 2.25. So age has far more effect than height. To bring height to the same level, you can convert it into centimeters: values then range up to about 200 centimeters, so squared differences in height become comparable in magnitude to those in age.
Another approach is to divide each value by the standard deviation of its attribute, e.g. age / sd(age), which gives a value without a unit. The choice of units then no longer affects the clustering.
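The standard-deviation approach can be sketched as follows. The ages and heights are made-up sample values, and `scale_by_std` is an assumed helper name; the point is that after scaling, both attributes have a spread of 1 regardless of their original units.

```python
import statistics

def scale_by_std(values):
    """Divide each value by the (population) standard deviation."""
    sd = statistics.pstdev(values)
    return [v / sd for v in values]

# Hypothetical patient data, for illustration only.
ages = [25, 40, 60, 90]                  # years
heights_m = [1.50, 1.65, 1.80, 1.95]     # meters
heights_cm = [h * 100 for h in heights_m]

scaled_ages = scale_by_std(ages)
scaled_heights_m = scale_by_std(heights_m)
scaled_heights_cm = scale_by_std(heights_cm)

print(statistics.pstdev(scaled_ages), statistics.pstdev(scaled_heights_m))
print(scaled_heights_m)
print(scaled_heights_cm)  # matches the meters version up to rounding
```

Scaling by the standard deviation gives the same result whether height was recorded in meters or centimeters, which is what makes it unit-independent.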

 

NEW QUESTION 102
What is the best way to ensure that the k-means algorithm will find a good clustering of a collection of vectors?

  • A. Only consider values of k larger than log(N), where N is the number of observations in the data set
  • B. Choose the initial centroids so that they all lie along different axes
  • C. Choose the initial centroids so that they are far away from each other
  • D. Run at least log(N) iterations of Lloyd's algorithm, where N is the number of observations in the data set

Answer: C

Explanation:
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
This question is about the properties that make k-means an effective clustering heuristic, which primarily deal with ensuring that the initial centers are far away from each other. This is how modern k-means algorithms like k-means++ guarantee that with high probability Lloyd's algorithm will find a clustering within a constant factor of the optimal possible clustering for each k.
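The idea of spreading out the initial centers can be sketched with k-means++ style seeding, where each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. This is a simplified sketch of the seeding step only (no Lloyd iterations). The sample points and the `kmeans_pp_init` name are assumptions for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids, favoring points far from those already chosen."""
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [
            min(squared_distance(p, c) for c in centroids) for p in points
        ]
        # Already-chosen points have weight 0, so they are not picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Hypothetical 2-D points forming three loose groups.
points = [
    (0.0, 0.0), (0.1, 0.2),
    (5.0, 5.0), (5.1, 4.9),
    (10.0, 0.0), (9.8, 0.3),
]
centroids = kmeans_pp_init(points, k=3, rng=random.Random(0))
print(centroids)
```

The selection is randomized, so distant points are only more likely, not certain, to be chosen; Lloyd's algorithm would then refine these starting centroids.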

 

NEW QUESTION 103
Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. Which of the following will you use to calculate the probability that it will rain on the day of Marie's wedding?

  • A. Random Decision Forests
  • B. Logistic Regression
  • C. All of the above
  • D. Naive Bayes

Answer: D

Explanation:
The sample space is defined by two mutually exclusive events: it rains or it does not rain. Additionally, a third event occurs when the weatherman predicts rain. You should consider Bayes' theorem when the following conditions exist.
* The sample space is partitioned into a set of mutually exclusive events {A1, A2, ..., An}.
* Within the sample space, there exists an event B, for which P(B) > 0.
* The analytical goal is to compute a conditional probability of the form P(Ak | B).
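Plugging the numbers from the question into Bayes' theorem gives the worked result below. Here R denotes "it rains" and F denotes "the weatherman forecasts rain"; the symbols and the assumption of a 365-day year, so that P(R) = 5/365, are introduced for this illustration.

```latex
\[
P(R \mid F)
= \frac{P(F \mid R)\,P(R)}{P(F \mid R)\,P(R) + P(F \mid \neg R)\,P(\neg R)}
= \frac{0.9 \cdot \tfrac{5}{365}}{0.9 \cdot \tfrac{5}{365} + 0.1 \cdot \tfrac{360}{365}}
= \frac{4.5}{40.5}
= \frac{1}{9} \approx 0.111
\]
```

Even with a forecast of rain, the chance of rain is only about 11%, because rainy days are so rare that false alarms outnumber correct forecasts.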

 

NEW QUESTION 104
Select the statements that correctly apply to Naive Bayes

  • A. Works with nominal values
  • B. Sensitive to how the input data is prepared
  • C. Works with a small amount of data

Answer: A,B,C

 

NEW QUESTION 105
Refer to the exhibit.

You are building a decision tree. In this exhibit, four variables are listed with their respective values of info-gain.
Based on this information, on which attribute would you expect the next split to be in the decision tree?

  • A. Credit Score
  • B. Income
  • C. Age
  • D. Gender

Answer: A
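A decision tree typically splits next on the attribute with the highest information gain, which is presumably how the exhibit leads to the answer above. The exhibit's values are not reproduced on this page, so the sketch below computes information gain on a small hypothetical data set instead; the records and attribute names are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after the split."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical loan records; not the values from the exhibit.
rows = [
    {"credit": "good", "gender": "F", "default": "no"},
    {"credit": "good", "gender": "M", "default": "no"},
    {"credit": "bad", "gender": "F", "default": "yes"},
    {"credit": "bad", "gender": "M", "default": "yes"},
]

gains = {a: information_gain(rows, a, "default") for a in ("credit", "gender")}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

In this toy data the credit attribute separates defaulters perfectly while gender carries no information, so the split falls on credit.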

 

NEW QUESTION 106
......
