#assignment 6: comparison of classification methods
Classification Methods
Classification is a general process related to categorization, the process in which ideas and objects are recognized, differentiated, and understood.
A method is a programmed procedure that
is defined as part of a class and
included in any object of that class. A class (and
thus an object) can have more than one method. A method in an object can only
have access to the data known to that object, which ensures data integrity
among the set of objects in an application. A method can be re-used in multiple
objects.
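A minimal sketch of this idea (the class name and fields below are invented purely for illustration):

```python
# Sketch: a method defined on a class and shared by every object of that class.
class Counter:
    def __init__(self):
        self._count = 0  # data known only to this object

    def increment(self):
        # The method can only touch this object's own data.
        self._count += 1
        return self._count

a, b = Counter(), Counter()
a.increment()
a.increment()
print(a.increment())  # 3 -- each object keeps its own state
print(b.increment())  # 1 -- the same method, re-used by another object
```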
Big data is a term that describes the
large volume of data – both structured and unstructured – that inundates a
business on a day-to-day basis. But it’s not the amount of data that’s
important. It’s what organizations do with the data that matters. Big data can
be analyzed for insights that lead to better decisions and strategic business
moves.
Now I will compare the following classification methods.
·
Naïve Bayes Classifier
The Naive Bayes Classifier
technique is based on the so-called Bayesian theorem and is particularly suited
when the dimensionality of the inputs is high. Despite its simplicity, Naive
Bayes can often outperform more sophisticated classification methods.

To demonstrate the concept of Naïve Bayes classification, consider an example in which objects are classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., to decide which class label they belong to, based on the currently existing objects.
Since there are twice as many
GREEN objects as RED, it is reasonable to believe that a new case (which hasn't
been observed yet) is twice as likely to have membership GREEN rather than RED.
In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
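As a minimal sketch of this step (the counts of 40 GREEN and 20 RED objects are assumptions made here for illustration; the example only states that GREEN objects are twice as numerous as RED ones):

```python
# Minimal sketch: computing prior probabilities from class counts.
# The counts (40 GREEN, 20 RED) are illustrative assumptions.
counts = {"GREEN": 40, "RED": 20}
total = sum(counts.values())

priors = {label: n / total for label, n in counts.items()}
print(priors)  # roughly {'GREEN': 0.67, 'RED': 0.33}

# Before any evidence about a new case is seen, the prior alone
# favours the more common class, GREEN.
```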
The first weakness is that the Naive Bayes classifier makes a very strong assumption about the shape of your data distribution, i.e. that any two features are independent given the output class. Because of this, the result can be (potentially) very bad, hence a "naive" classifier. This is not as terrible as people generally think, because the NB classifier can be optimal even if the assumption is violated (see the seminal paper by Domingos and Pazzani, or the later work by Zhang), and its results can be good even in the case of sub-optimality.
As for its advantages: to assess them fully, we would need to look at the alternatives it is being compared to.
Naive Bayes
is part of the family of Probabilistic Models: “One of their
advantages besides being a simple yet powerful model, is that they return not
only the prediction but also the degree of certainty, which can be very
useful.”
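A short sketch of this property, assuming scikit-learn's GaussianNB as the implementation (the post does not name a library, and the toy 2-D data below is invented for illustration): the classifier returns not only a prediction but also a probability for each class.

```python
# Sketch using scikit-learn's GaussianNB on invented toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training set: four GREEN objects and two RED objects (GREEN is twice as common).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 0.9],
              [3.0, 3.1], [3.2, 2.9]])
y = np.array(["GREEN", "GREEN", "GREEN", "GREEN", "RED", "RED"])

clf = GaussianNB().fit(X, y)

new_case = np.array([[2.0, 2.0]])
print(clf.predict(new_case))        # predicted class label
print(clf.predict_proba(new_case))  # degree of certainty for each class (order: clf.classes_)
```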
·
Decision Tree
Decision Trees are excellent tools for
helping you to choose between several courses of action.
They provide a highly effective
structure within which you can lay out options and investigate the possible
outcomes of choosing those options. They also help you to form a balanced
picture of the risks and rewards associated with each possible course of
action.
Advantages and disadvantages
Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:
• Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
• Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
• Allow the addition of new possible scenarios.
• Help determine worst, best and expected values for different scenarios.
• Use a white box model. If a given result is provided by the model, the explanation for that result can be traced directly through the tree's decision rules, as illustrated in the sketch below.
• Can be combined with other decision techniques.
Disadvantages of decision trees:
• For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.
• Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
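As a hedged illustration of the white-box property (the post does not specify a library; scikit-learn's DecisionTreeClassifier and the built-in Iris dataset are assumptions made here), a decision tree can be trained and its rules read back directly:

```python
# Sketch: training a decision tree and printing its rules as readable text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# White-box property: the learned decision rules can be inspected directly.
print(export_text(clf, feature_names=list(iris.feature_names)))

# Predict the class of a new flower measurement (values are illustrative).
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))
```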
UNSUPERVISED METHOD:
SenseClusters (an
adaptation of the K-means clustering algorithm)
In this section we will try
to understand the K-means clustering algorithm that has been used in
SenseClusters. Clustering is the process in which we divide the available data
instances into a given number of sub-groups. These sub-groups are called
clusters, and hence the name “Clustering”. To put it simply, the K-means
algorithm outlines a method to cluster a particular set of instances into K
different clusters, where K is a positive integer. It should be noted that the K-means clustering algorithm requires the user to specify the number of clusters; it cannot determine the number of clusters by itself. However, SenseClusters now provides the facility of automatically identifying the number of clusters that the data may contain.
The K-means clustering
algorithm starts by placing K centroids as far away from each other as possible
within the available space. Then each of the available data instances is assigned to a particular centroid, depending on a metric such as Euclidean distance, Manhattan distance, or Minkowski distance. The position of the centroid is recalculated every time an instance is added to the cluster, and this continues until all the instances are grouped into the final required number of clusters. Since recalculating the cluster centroids may alter the cluster membership, the cluster memberships are also verified once the position of a centroid changes. This process continues until there is no further change in the cluster membership and as little change as possible in the positions of the centroids.
The initial position of the
centroids is thus very important since this position affects all the future
steps in the K-means clustering algorithm. Hence, it is always advisable to
keep the cluster centers as far away from each other as possible. If there are too many clusters, then clusters that closely resemble each other and lie in each other's vicinity are merged. If there are too few clusters, then clusters that are too big and may contain two or more sub-groups of different data instances are divided. The K-means clustering algorithm is thus a simple-to-understand, fairly intuitive method by which we can divide the available data into sub-categories.
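As a brief sketch of the clustering loop described above (SenseClusters itself is a Perl-based package; scikit-learn's KMeans and the 2-D toy data with K = 2 are assumptions used here only to illustrate the algorithm):

```python
# Sketch: K-means clustering with K specified by the user, as described above.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 1.8],   # one loose group
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.2]])  # another loose group

# init="k-means++" spreads the initial centroids apart, matching the advice
# to keep the starting cluster centers as far from each other as possible.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels)               # cluster membership of each instance
print(km.cluster_centers_)  # final centroid positions
```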