Monday, September 19, 2022

Natural Language Processing with Disaster Tweets

Binary Classification Problem

Contents: 

Abstract:

Data Processing:

Data Discovery:

Missing Data Handling:

Model:

Decision Tree Model to Predict the Values of Class Attribute (target) on the Test set:

Validation Technique

Holdout Cross Validation NBTree Learner - J48 - Multi Layer Perceptron on Training Data set:

Holdout Cross Validation NBTree Learner - J48 - Multi Layer Perceptron on Test Data set:

Data Transformation - Column filtering:

Best learner for the positive class attribute:

Conclusion:

References

Abstract:

Twitter has become an important communication channel in times of emergency.

The ubiquity of smartphones enables people to announce an emergency they are observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g., disaster relief organisations and news agencies). But it is not always clear whether a person’s words are actually announcing a disaster. Twitter users often use words whose meaning is quite clear to a human but much less clear to a machine.

In this work we build a machine learning model to predict which tweets describe a real disaster and which do not.

Data Processing:

The dataset is available on the Kaggle platform and consists of 10,876 tweets split into a training dataset and a test dataset. It was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website.

After downloading the data and saving it to local disk, the following steps were carried out to understand the dataset:

Data Discovery: 

train.csv - the training dataset (7,613 rows).

Columns:

  • id - an integer-type unique identifier for each tweet

  • text - a string attribute containing the text of the tweet published by the user

  • location - a string attribute containing the location the tweet was sent from (has some missing data)

  • keyword - a string attribute containing a particular keyword from the tweet (has some missing data)

  • target - the binary class attribute, an integer present in train.csv only; it denotes whether a tweet is about a real disaster (1) or not (0) and has to be predicted for the test dataset

(id, text, location, keyword) are explanatory attributes.

Counting the values of the prediction (class) attribute (target), which takes the values (0,1), we obtained 57.03% for (0) and 42.97% for (1).
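The class distribution above can be reproduced in a few lines of pandas. This is a hedged sketch (the report itself uses KNIME); the small DataFrame below stands in for the real train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv: 4 of 7 rows are class 0, 3 are class 1.
train = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "target": [0, 0, 0, 0, 1, 1, 1],
})

# Share of each class, as a percentage of the training rows.
dist = train["target"].value_counts(normalize=True) * 100
print(dist.round(2))
```

On the real training data this yields the 57.03% / 42.97% split reported above.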

test.csv - the dataset to which the prediction column has to be applied (3,263 rows).

Columns:

  • id - a unique identifier for each tweet

  • text - the text of the tweet

  • location - the location the tweet was sent from (could be blank)

  • keyword - a particular keyword from the tweet (could be blank)

Missing Data Handling: 

If we take our considerations of the training dataset:

  • text has (0) missing data

  • location has (2534) missing data

  • keyword has (61) missing data

  • target has (0) missing data
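The per-column missing-value counts above can be computed with pandas. A minimal sketch, assuming the column names of the Kaggle files; the toy frame stands in for the real data.

```python
import pandas as pd

# Toy stand-in for train.csv with some missing entries.
train = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["fire downtown", "all good", "flood warning"],
    "location": ["NY", None, None],
    "keyword": ["fire", None, "flood"],
    "target": [1, 0, 1],
})

# Number of missing values in each column.
missing = train.isna().sum()
print(missing)
```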

Notice that our class attribute (target) has no missing values, which means the decision model will not lose prediction quality if we ignore the incomplete attributes. Keeping (location and keyword) would, however, affect the prediction result, so the right decision is to remove (location and keyword) from the training dataset.

Regarding missing data in the test dataset:

  • text has (0) missing data

  • location has (1106) missing data

  • keyword has (26) missing data

We have decided to remove the attributes with missing values (location and keyword), so that the remaining attributes (including id and the predicted target column) contain no missing values in either table.

Note: this handling method cannot be represented in PMML 4.2.

Model:

Decision Tree Model to Predict the Values of Class Attribute (target) on the Test set:

After the first step of data discovery, a decision tree model is an efficient technique for predicting the values of the class attribute (target) on the test dataset.

As noted, the dataset is already divided into train and test files, but the test data do not include the class attribute (target) that marks tweets as real or fake according to the machine's understanding. That is why we predict the values of the class attribute (target), so that the test dataset is complete. We consider this step the starting point of the whole work.

We obtained predicted values of the target attribute for (2157) observations, together with the corresponding attribute table.
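The prediction step can be sketched with scikit-learn's decision tree (the report itself uses KNIME's decision tree nodes); the tiny arrays below are illustrative stand-ins for the train and test tables, with id as the only explanatory attribute.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative training data: id is the single explanatory attribute,
# target is the binary class attribute.
X_train = [[1], [2], [3], [4], [5], [6]]
y_train = [0, 0, 0, 1, 1, 1]

# Illustrative test rows for which target must be predicted.
X_test = [[2], [5]]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# The predicted column, named target_predicted as in the report.
target_predicted = tree.predict(X_test)
print(list(target_predicted))  # → [0, 1]
```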

The decision tree view below shows the values of the predicted column, obtained by setting the class column to the target column on the training dataset, the number of records per node to (2), and (target_predicted) as the name of the predicted column. The root node contains (4342) elements out of (7613) in total, with the class attribute split as (57%) for (0) and (43%) for (1). At the next level of the tree, if id<=1548 then (70.8%) of (0) and (39.2%) of (1) are obtained on the left side of the root node, and so on.

The pie chart shows the distribution of the (target_predicted) attribute: (62.95%) for (0) and (37.05%) for (1). There is an increase in (0) relative to (1) compared with the (target) column in the training set, which means that, according to the model, a tweet in the test set has a 62.95% probability of being fake, higher than in the training dataset.

Validation Technique

Holdout Cross Validation NBTree Learner - J48 - Multi Layer Perceptron on Training Data set:

The main purpose of the holdout cross-validation technique is to validate the performance of the model used to predict the class attribute (target) on the test set.

Three learners were trained to validate the performance: NBTree, J48 and MultiLayerPerceptron.

This step starts by partitioning the training data into a sub-training set and a test set: partition 1 is set to 67% of the total number of observations in the training dataset, using stratified sampling on the target column, with the random seed set to (2222).
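The partitioning step can be sketched with scikit-learn's `train_test_split` (a stand-in for KNIME's Partitioning node): 67% sub-train, stratified on the target column, seed 2222. The toy id/target arrays below are illustrative.

```python
from sklearn.model_selection import train_test_split

# Illustrative data with roughly the 57% / 43% class shares of the report.
ids = list(range(100))
targets = [0] * 57 + [1] * 43

# 67% sub-training partition, stratified on target, random seed 2222.
X_sub, X_hold, y_sub, y_hold = train_test_split(
    ids, targets,
    train_size=0.67,
    stratify=targets,
    random_state=2222,
)
print(len(X_sub), len(X_hold))  # → 67 33
```

Stratification keeps the 0/1 proportions of the target column approximately equal in both partitions.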

After the learning procedure we noticed that an accuracy of (0.61) was achieved by two of the learners (NBTree and J48), while a lower accuracy (0.5) was obtained by the MultiLayerPerceptron learner. As the box plot below shows, NBTree and J48 perform better than MultiLayerPerceptron on the training dataset.

Accuracy and Error:

The accuracy and error values are calculated from the confusion matrix counts (TP, TN, FP, FN) of each learner, using the equations below.

Acc = (TP + TN) / (TP + TN + FP + FN)

Err = (FP + FN) / (TP + TN + FP + FN)
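A small worked example of these two formulas, using hypothetical confusion-matrix counts chosen so the accuracy matches the (0.61) reported for NBTree and J48:

```python
# Hypothetical confusion-matrix counts (illustrative, not from the report).
TP, TN, FP, FN = 30, 31, 20, 19

# Accuracy: share of correctly classified observations.
acc = (TP + TN) / (TP + TN + FP + FN)

# Error: share of misclassified observations (equals 1 - accuracy).
err = (FP + FN) / (TP + TN + FP + FN)

print(round(acc, 2), round(err, 2))  # → 0.61 0.39
```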

Box plot: accuracy values on the training data (view provided by KNIME).

Line plot: error values on the training data (matplotlib/Jupyter).

Holdout Cross Validation NBTree Learner - J48 - Multi Layer Perceptron on Test Data set:

By applying the same evaluation method used on the training data to the test dataset, we obtained different accuracy results for the different learners: (0.66) for NBTree, (0.76) for J48 and (0.63) for the MultiLayerPerceptron learner.

Box plot: accuracy values on the test data (view provided by KNIME).

Line plot: error values on the training and test data (matplotlib/Jupyter).

Holdout Cross Validation NBTree Learner - J48 - Multilayer Perceptron - comparison of accuracy values between Training set and Test set:

Collecting all accuracy values on the train and test data, we can conclude that the model performs better on the test data, achieving (76%) with the J48 learner.

The box plots below show the accuracy values achieved on the training and test data; the model performs better on the test set when using the J48 learner.

Box plot: accuracy values on the training and test data (views provided by KNIME).

Line plot: error values on the training and test data (matplotlib/Jupyter).


Data Transformation - Column filtering:

As we are developing a predictive model, some attributes have to be transformed according to what the model requires, either among the explanatory attributes or the class attribute. In our case the predicted class is binary, which is accepted by all three learners (NBTree, J48, MultiLayerPerceptron), but the string-type attributes cannot be handled by these learning methods. For that reason we decided to exclude all string attributes from the training dataset: using the Column Filter node of the KNIME software we removed (keyword, location and text) from the evaluation, leaving only (id) as an explanatory attribute and (target) as the class attribute. This procedure is applied to the training set only; the test set keeps all its attributes, since this does not affect the performance of the model.
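The column-filtering step can be sketched in pandas (a stand-in for KNIME's Column Filter node); the toy frame below mimics the real columns.

```python
import pandas as pd

# Toy stand-in for train.csv with all original columns.
train = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["fire", None, "flood"],
    "location": ["NY", None, "LA"],
    "text": ["fire downtown", "all good", "flood warning"],
    "target": [1, 0, 1],
})

# Drop the string attributes, keeping only the explanatory attribute (id)
# and the class attribute (target).
filtered = train.drop(columns=["keyword", "location", "text"])
print(list(filtered.columns))  # → ['id', 'target']
```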

Best learner for the positive class attribute:

If we consider (1) as the positive class value and (0) as the negative class value, we can select the learner most capable of the prediction task. This is necessary because accuracy alone cannot guarantee the best learner, as a learner may score poorly on the positive class despite a high overall accuracy.

By comparing precision, recall and F-measure, we decided to select J48 as the best classifier, as it achieves the greatest F-measure value (55) compared with the NBTree and MLP classifiers.

Below are the values of precision, recall and F-measure for the positive class.
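These positive-class metrics can be reproduced with scikit-learn. A hedged sketch on illustrative labels (the actual values come from KNIME's scorer on the real data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative true and predicted labels; class 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Precision: TP / (TP + FP); Recall: TP / (TP + FN);
# F-measure: harmonic mean of precision and recall.
prec = precision_score(y_true, y_pred, pos_label=1)
rec = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)
print(prec, rec, f1)
```

Here TP=2, FP=1, FN=2, giving precision 2/3, recall 1/2 and F-measure 4/7.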

Conclusion:

This analysis was computed on data collected at a certain time, which means it may not be valid at all times.

As the dataset was downloaded from the Kaggle platform, we did not have much to do in terms of data cleansing and data structuring.

In this analysis we used only three different learners to validate the performance of the models, but there are other ways to perform a better validation and to select the most practical model for achieving the best prediction result.

If we take the J48 learner as the better-performing method for predicting the (0,1) values of the class column, we could say that more than half of the tweets mentioning disasters are understood by the machine as not referring to a real disaster.

References 
