# Assignment 6: Comparison of Classification Methods

Classification Methods

Classification is a general process related to categorization, the process by which ideas and objects are recognized, differentiated, and understood.

A method is a programmed procedure that is defined as part of a class and included in any object of that class. A class (and thus an object) can have more than one method. A method in an object can only have access to the data known to that object, which ensures data integrity among the set of objects in an application. A method can be re-used in multiple objects.

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Now I will compare the following classification methods.

·      Naïve Bayes Classifier

The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.
To demonstrate the concept of Naïve Bayes classification, consider a simple example in which objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., to decide which class label they belong to, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.
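As a hedged illustration of the prior probability idea (the counts of 40 GREEN and 20 RED objects below are made up, chosen only to match the 2:1 ratio described above), the short Python sketch below computes priors from class counts:

```python
# A minimal sketch of computing prior probabilities from class counts.
# The 40/20 split is a hypothetical example consistent with "twice as many GREEN as RED".
from collections import Counter

labels = ["GREEN"] * 40 + ["RED"] * 20

counts = Counter(labels)
total = sum(counts.values())

# Prior probability of each class = class count / total number of observations
priors = {cls: n / total for cls, n in counts.items()}
print(priors)  # {'GREEN': 0.666..., 'RED': 0.333...}
```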

The first weakness is that the Naive Bayes classifier makes a very strong assumption about the shape of your data distribution, namely that any two features are independent given the output class. Because of this, the results can (potentially) be very bad, hence the name "naive" classifier. This is not as terrible as people generally think, because the NB classifier can be optimal even if the assumption is violated (see the seminal paper by Domingos and Pazzani, or the later work by Zhang), and its results can be good even in the case of sub-optimality.
Advantages: to judge these fairly, we would need to look at the alternatives Naive Bayes is being compared to.
Naive Bayes is part of the family of Probabilistic Models: “One of their advantages besides being a simple yet powerful model, is that they return not only the prediction but also the degree of certainty, which can be very useful.”
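As a minimal sketch of that point (scikit-learn and the toy data below are my own assumptions, not part of the original text), a Naive Bayes model such as GaussianNB returns both a predicted class and per-class probabilities, i.e. a degree of certainty:

```python
# Illustrative sketch: Naive Bayes returns a prediction plus a degree of certainty.
# Assumes scikit-learn is installed; the toy data is invented for this example.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 2-D points: class 0 clustered near (0, 0), class 1 near (5, 5)
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)

new_point = np.array([[4.5, 4.7]])
print(clf.predict(new_point))        # predicted class label
print(clf.predict_proba(new_point))  # per-class probabilities (the "degree of certainty")
```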

·      Decision Tree
Decision Trees are excellent tools for helping you to choose between several courses of action.
They provide a highly effective structure within which you can lay out options and investigate the possible outcomes of choosing those options. They also help you to form a balanced picture of the risks and rewards associated with each possible course of action.

Advantages and disadvantages

Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:

   Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
   Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
   Allow the addition of new possible scenarios.
   Help determine worst, best and expected values for different scenarios.
   Use a white box model: when the model produces a given result, the reasoning behind it can be inspected and explained step by step (see the sketch after this list).
   Can be combined with other decision techniques.
Disadvantages of decision trees:
   For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.
   Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
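To make the "white box" advantage concrete, here is a minimal, illustrative sketch (assuming scikit-learn and its bundled iris data set, neither of which is mentioned above) that trains a small decision tree and prints its rules as readable text:

```python
# Illustrative sketch: a decision tree is a white-box model whose learned
# rules can be printed and read directly, unlike a black-box model.
# Assumes scikit-learn; uses its built-in iris sample data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the tree's decision rules as plain text
print(export_text(tree, feature_names=list(iris.feature_names)))
```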

UNSUPERVISED METHOD:

SenseClusters (an adaptation of the K-means clustering algorithm)

In this section we will try to understand the K-means clustering algorithm that has been used in SenseClusters. Clustering is the process in which we divide the available data instances into a given number of sub-groups. These sub-groups are called clusters, hence the name "clustering". To put it simply, the K-means algorithm outlines a method to cluster a particular set of instances into K different clusters, where K is a positive integer. It should be noted that the K-means clustering algorithm requires the user to specify the number of clusters; it cannot identify the number of clusters by itself. However, SenseClusters now has the facility of automatically identifying the number of clusters that the data may comprise.

The K-means clustering algorithm starts by placing K centroids as far away from each other as possible within the available space. Then each of the available data instances is assigned to a particular centroid, depending on a metric like Euclidean distance, Manhattan distance, Minkowski distance, etc. The position of the centroid is recalculated every time an instance is added to the cluster, and this continues until all the instances are grouped into the final required number of clusters. Since recalculating the cluster centroids may alter the cluster membership, the cluster memberships are also verified once the position of a centroid changes. This process continues until there is no further change in the cluster membership, and there is as little change in the positions of the centroids as possible.
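The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions (Euclidean distance, randomly chosen initial centroids, invented toy data), not the actual SenseClusters implementation:

```python
# Hand-written sketch of the K-means loop: assign each point to its nearest
# centroid, recompute the centroids, and repeat until memberships stop changing.
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)

    for iteration in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if iteration > 0 and np.array_equal(new_labels, labels):
            break  # memberships are stable, so the algorithm has converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data with two obvious groups
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
print(kmeans(data, k=2))
```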


The initial position of the centroids is thus very important, since this position affects all the future steps in the K-means clustering algorithm. Hence, it is always advisable to keep the cluster centers as far away from each other as possible. If there are too many clusters, then clusters that closely resemble each other and are in the vicinity of each other are merged. If there are too few clusters, then clusters that are too big and may contain two or more sub-groups of different data instances are divided. The K-means clustering algorithm is thus a simple-to-understand, fairly intuitive method by which we can divide the available data into sub-categories.
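The text above does not describe how SenseClusters chooses the number of clusters automatically. As a hedged, generic illustration only, one common way to pick K is to try several values and keep the one with the best silhouette score (assuming scikit-learn; the toy data is the same made-up set as in the previous sketch):

```python
# Illustrative sketch: pick K by comparing silhouette scores for several values.
# This is a generic heuristic, not the SenseClusters procedure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, silhouette_score(data, labels))
# Keep the K with the highest score (here K = 2 separates the two groups cleanly).
```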
