Tuesday, September 27, 2022

Life Expectancy 1990 - 2019:

Content

Abstract: 1

Introduction 1

Data Exploration 1

Missing Data Handling: 2

Comparison in each 10 years 2

Comparison in each year 1990 – 2019: 3

Conclusion 3

Reference

https://www.fiverr.com/share/942dQj

Abstract:

The data of life expectancy has been collected and integrated to have a clear answer of the question, in which place around the world life expectancy increases, and people well survive in certain places? So, the maximum number of years an individual human species can live, is the calculation method to respond to this question.

Key Word: Life – Expectancy – Rate – world

Introduction

The data set include (186 country) content region and states of countries all under evaluation of life expectation rate, the sample of data include some region as well, this study gives us a high vision of the availability of basic necessities of life such as food, health and education etc.., as well as the number of survivals is over increase or in the contrary increasing the number of mortality or even if it’s the situation of the life expectation is stable?

By computing some statistics, the questions could be clearer and more satisfying.

Data Exploration

As a matter of fact, data interrogation and integration were the most important challenges to be handled so in the next we will show we have achieved this problem.

Data collection:

Data has been collected from the (Kaggle) website (https://www.kaggle.com/chrisrarig/life-expectancy/notebook) and it contains 1969 observations and 34 attributes with missing values.

All attributes of the dataset were of (String) type, which could be difficult to compute statistical operations. We’ve taken the decision to give each attribute with a numerical type (integer) and we leave the other attributes as (string) type as mentioned next.

The attribute with string type: ((Country_Code, Level, Region, Country))

The attribute with Double type: age average statistics ((years from 1990 to 2019)).

The data we collected from the (186) countries, according to the observations for each country we can see from the data visualised below, in the upper left side we have the United State which is had the maximum number of observations, and gradually from the top to down in right side we can notice that the minimum observation in which Kuwait has obtained.

The table was showing the data of 1990 but we have obtained the same situation for years to 2019. We can conclude that data are treated equally in all years of under study and this diagram below is showing the treatment of years in which data were collected.

Missing Data Handling:

There are some missing values on the data set in a certain time from 1990 to 1998 as maximum in (Angola, Azerbaijan, Belarus, Bhutan, Bosnia and Herzegovina, Burkina Faso etc.) the table below is showing top 20 missing values.

Operations:

By computing these operations I’ve obtain an optimal result these steps have been computed to fill missing values

Giving (median) values for the attribute of integer types, calculating all values from the observation which is related to.
Other attributes like (Country_Code, Level, Region, Country) where having string type of data are without missing values
We have calculated the media for all years from 1990 to 2019

https://public.tableau.com/app/profile/khalid2173/viz/HumanlifeExpectancy1990-2019/HumanlifeExpectancy1990-2019

After we have had data cleaned, the next step was to present these data to be more readable and understandable. If we take a look at the map above, in 1990 some countries like (Italy, Australia, Canada, United States of America, UK and Scandinavian countries), the age average was high, the reason why the map showing a dark green colour people in these countries have a high expectation to live long.

On the other hand if we keep our attention in Africa, we can conclude that average ratios were the lowest, in particular and because of war in Rwanda and Sierra leone, they obtain the minimum averages in which 33.42 is signed in Rwanda and 38.81 in Sierra leone.

By the continuity of years the situation is getting better, and according to our individual notice, the interventions of medical assistance and political accords are becoming more effective to improve the quality life of people in what can increase the average rate of ages. In the next slide we can show how average rates in each country under evaluations in 2019.

The map which is below is presenting the improvement of the ages average in 2019. In 1990 we have taken Rwanda as an example of low rate of rate average, but in 2019 the country under is jump to 69.48 as a media of life age but we still have (Chad, Somalia, cot d’ivoire, Lesotho, Sierra leone, Guinea Bissau and South Sudan) which are signing a minimum number of average rates around 50-60 as media.

Anyhow, we can conclude that as a minimum achieved in 1990 compared of minimum in 2019, we can summaries the averages of ages are getting more better and somehow stable, and we can keep our attention to the African Country which is signing the lowest rate as we can see at the map of age average 2019.

https://public.tableau.com/app/profile/khalid2173/viz/HumanlifeExpectancy1990-2019/HumanlifeExpectancy1990-2019

Comparison in each 10 years

In this part we’ll show the difference of the increasing age average in every 10 years.

So, we can conclude that the lowest age averages had been signed in 1990 which was (56.483), by the increasing of the in each year with un stable value, until the last year of the study in which we obtain a highest value of age average (73.150).

Comparison in each year 1990 – 2019:

All previous steps were a prelude to this step of the presentation, which will appear in the following:

In this presentation we can take our attention to the area of confidence interval in which we decided to be 95% for the total of the sample obtained. So, we can conclude that a 95% in 29 years starting from 1990 to 2019 life expectancy ranging between (68.554-70.495)

Conclusion

As long as there is a quaite suffusion of diagrams that could be used to demonstrate a good visual reading of age average during the years under studies, but we have taken our decision to demonstrate the map in above to give a clearer understanding to the user by computing the latitude and longitude which were been calculated taking the country names.

Some statistical valuation to verify the graphic presentation has been used by a sample of 4 people, by calculating visibility, consistency, flexibility and how the presentation is understood.

Reference

https://www.tableau.com/it-it/trial/download-tableau?utm_campaign_id=2017049&utm_campaign=Prospecting-CORE-ALL-ALL-ALL-ALL&utm_medium=Paid+Search&utm_Source=Bing&utm_language=IT&utm_country=SOEUR&kw=tableau&adgroup=CTX-Brand-Core-IT-E&adused=&matchtype=e&placement=&&msclkid=8226553606bc107a370a373c30f9091a&gclid=8226553606bc107a370a373c30f9091a&gclsrc=3p.ds
https://www.kaggle.com/chrisrarig/life-expectancy/notebook
https://www.knime.com/
XLSTAT | Statistical Software for Excel
https://public.tableau.com/app/profile/khalid2173/viz/HumanlifeExpectancy1990-2019/HumanlifeExpectancy1990-2019

Monday, September 19, 2022

Natural Language Processing with Disaster Tweets

Binary Classification Problem

Contains:

Abstract:

Data Processing:

Data Discovery:

Missing Data Handling:

Model:

Decision Tree Model to Predict the Values of Class Attribute (target) on the Test set:

Validation Technique

Holdout Cross Validation BNTree Learner - J48 - Multi Layer Perceptron on Training Data set:

Holdout Cross Validation BNTree Learner - J48 - Multi Layer Perceptron on Test Data set:

Data Transformation - Column filtering:

Best learner for the class attribute:

Conclusion:

References

Abstract:

Twitter has become an important communication channel in times of emergency.

The ubiquitous Ness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e., disaster relief organisations and news agencies). But, it’s not always clear whether a person’s words are actually announcing a disaster or not. Twitter users explicitly use words that can be quite clear to the human, which is less clear to a machine.

In the work space we have to build a machine learning model to predict which tweets are a real disaster and which are not.

Data Processing:

dataset is available in kaggle platform, and consist of 10776 tweets that were almost classified tran dataset and test dataset, dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’.

after downloading the data and saved in the local disc, some process has been done to understand the dataset:

Data Discovery:

train.csv - training dataset (7613) rows.

Columns:

id - is an integer type unique identifier for each tweet
text - is string type and it contains the text of the tweet that has been published by the user.
location - is string type and contain the location where the tweet was sent from (have some missing data)
keyword - is string type contains a particular keyword from the tweet (have some missing data)
target - is spatial category (binary) an integer type in train.csv only in which has to be predicted and applied on the test dataset, it’s denoting whether a tweet is about a real disaster (1) or not (0)

(id, text, location, keyword) are explanatory attribute

The count of the prediction or class attribute (target) which has (0,1) values. we have obtained 57.03% for (0) and 42.97% for (1).

test.csv in which we need to apply the prediction column (3263) rows.

Columns:

id - a unique identifier for each tweet
text - the text of the tweet
location - the location the tweet was sent from (could be blank)
keyword - a particular keyword from the tweet (could be blank)

Missing Data Handling:

If we take our considerations of the training dataset:

text has (0) missing data
location has (2534) missing data
keyword has(61) missing data
target has (0) missing data

notice that our class attribute (target ) has no missing values, which means that the decision model will not reduce the quality of the prediction if we do not take other attributes into consideration, so we would have to decide to not take any action on the training dataset, but if we take the in our considerations (location and keyword) it will effect on the prediction result so the right decision the to remove (location, keyword and keyword) from the dataset.

regarding missing data of the test dataset,

text has (0) missing data
location has (1106) missing data,
keyword has (26) missing data

we have decided to remove attributes of missing values (location and keyword). so that we will have (id, and target column predicted) with no missing value in all attribute’s tables.

note: this handling method can not be represented PMML 4.2

Model:

Decision Tree Model to Predict the Values of Class Attribute (target) on the Test set:

After the first step of data discovery, a decision model technique is an efficient way to predict the values of the class attribute (target) on the test dataset.

As we know the data set is already divided into different files as train and test data, but the test data doesn't include the class attribute (target) which has the reals and fakes word according to the understanding of the machine. the reason why we have decided to predict the values of the class attribute (target) so that we have all data on the test dataset. We consider this step as the starting point of the total work.

we have obtained (2157) observations of target attribute predicted and the exact attributes table.

The decision tree view below shows values of the predicted column by setting the class column to the target column on the training dataset, numbers for record (2) and by giving (target_predicted) as name of the column predicted. The total element of the root nod is (4342) and (7613) is the totale of the specific values of the class attribute which have (57%) for (0) and (43%) for (1) values. In the next decision tree, id<=1548 then (70.8%) of (0) and (39.2%) of (1) will be obtained on the second left side of the root node and so on.

The pie chart shows the count of (target_predicted) attribute (62.95%) of (0) and (37.05) of (1), there is an increase on the (0) in respect to (1) values compared with the (target) column in the train set, what means the major part of the languages has be used by twitter users has 62,95 probability to be fake compared by the training dataset.

Validation Technique

Holdout Cross Validation BNTree Learner - J48 - Multi Layer Perceptron on Training Data set:

The main purpose of the cross-validation holdout technique is to validate performance of the model that was used to predict the class attribute (target) column on the test set.

Three Learner methodology has computed to validate the performance (NBTree, J48, MultiLayerPerceptron)

this step started by partitioning the train data it to sub-train set and test, so that partition 1 is set to be 67% of the total number of observations on the training dataset, and by using stratified sampling method of the target column, and random seed is setted to be (2222).

After the learning procedure we have noticed that (0.61) accuracy was achieved by the two different (NBTree, J48) Learners and low accuracy value (0.5) signed by the MultiLayer Perceptron learner. That is what we can see on the result of the box plot below, NBTree and J48 are performing better than MultiLayerPerceptron on the training dataset.

Accuracy and Error:

The values of accuracy and error are calculated by using the confusion matrix values of (TP, TN, FP, FN) on each learning method. By applying this equation below.

Acc= TP+TNTP+TN+FP+FN

Err= FP+FNTP+TN+FP+FN

this box plot displays Accuracy values on train data, Views provided by KNIME

<matplotlib.lines.Line2D at 0x1b348cd49e8>] plot_pythn_jupyter line plot displays Error values on train

<matplotlib.lines.Line2D at 0x1b348cd49e8>]scatter plot_pythn_jupyter

Holdout Cross Validation BNTree Learner - J48 - Multi Layer Perceptron on Test Data set:

By applying the same learning evaluation method used on the training data to the test dataset we have obtained different accuracy results regarding the different learners method (0.66) signed for NBTree, (0.76) J48 and (0.63) signed for MultiLayer Perceptron learner.

box plot displays accuracy values on test data,

Views provided by KNIME.

line plot displays Error values on train and test data,

<matplotlib.lines.Line2D at 0x1b348cd49e8>]line plot_pythn_jupyter

Holdout Cross Validation BNTree Learner - J48 - Multilayer Perceptron - comparison of accuracy values between Training set and Test set:

By collecting all accuracy values of the train and test data, we can conclude that the model is performing better on the test data by achieving (76%) if we use J48 learner.

The box plots below show the accuracy values achieved on the training data and test, so the model is performing better on the test set by computing J48 learner.

box plot displays accuracy values on train data and test data - Views provided by KNIME

<matplotlib.lines.Line2D at 0x1b348cd49e8>]line plot_pythn_jupyter displays Error values on train data and test data

Data Transformation - Column filtering:

As we are developing a model to predict a value some attribute has to be transformed according to the model required either on the explanatory or on the class attribute, in our case as we are predicting a type of binary and that is almost accepted by the learners (NBTree, J48, MultiLayerPerceptron) but the other attributes of string type cannot be handled by the learning method, the reason why we have decided to exclude all strings type from the training dataset, by using column filter of knime software we remove (keyword, location and text) from the evaluation method and remain only (ID) as an explanatory attribute and (target) as a class attribute, this procedure is computed on the train set and the test set is still have all attributes because it cannot effect on the performance of the model.

Best learner for the positive class attribute:

if we have considered (1) as the value of the positive class and (0) is the value of a negative class. we could have the ability to select the most capable learner to compute the prediction task. That is because the accuracy values can’t be guaranteed to choose the best learner, in which is signing low accuracy value on positive class.

By comparing (precision, recall and F-means) we have taken a decision to select J48 as the best classifier to predict. by achieving the gretter value of F-means (55) compared by BNTree and MLP classifiers.

Below are the values of (precision, recall and F-means) for the positive classes.

Conclusion:

This analysis is computed on data that were collected in acertain time, which means that it could be not valid in all times.

As the data set is downloaded from kaggle platform, the advantage of that is that we don’t have much to do on the part of data cleanser and data structure procedures.

In this analysis we’ve used only three different learners to validate the performance of the models, but there are more other ways to perform a better validation to select the best practical model to achieve the best prediction result.

If we take our consideration of J48 learner as a better performer method to predict values of (0,1) on the class column, we could say that more than half percent of twitter disasters are wrongly understood by the machine.

References

Thursday, September 1, 2022

You will get Developer to Convert HTML Template Into WordPress Theme 1

You will get Developer to Convert HTML Template Into WordPress Theme part 2

World of Violence: Protesters 2021 - Machine Learning project

World of Violence: Protesters 2021 - Machine Learning

Abstract

In 2021 millions of people will need humanitarian assistance and protection according to the number of violence that has been showed during the analysis stages of Protesters and Armed Conflict Location & Event. the gold of the analysis to show the events that achieve the most frequency rate among other events.

Key words: protesters – rioters – alshabab – violence

Contente

Introduction

1. Data Exploration

2. Preprocessing

• Sampling Techniques

• Missing Data Handling

• Variable Transformation

• Transformation of categorical variables

3. Models

• Cross Validation techniques

• Feature Filtering

• Holdout Evaluation

• K-folds Cross Validation Evaluation

• Comparison the accuracy result of Holdout and k-folds Cross validations

• Prototype Base Algorithm (K-Means)

• Hierarchical Clustering & DBSCAN (DistMatrex)

conclusion

References

Introduction

Events that are going through the world (protesting) creates an emergence, deterioration of the global economy and increases of the poverty rates of people. The reason why I decided to take initiative to conduct this scientific research to the main result of knowing the most effective frequencies and the probability of increasing events activities of the world.

Data has collected by including all reported political violence and protest events across Africa, the Middle East, Latin America & the Caribbean, East Asia, South Asia, Southeast Asia, Central Asia & the Caucasus, Europe, and the United States of America.

According to the groups that belong to the owners of the event, these events include military coups, wars, and also include demonstrations and people protesting which istake the highest number of events (145) 2021.

The dataset has been interrogated by using the (API). real time data in json format, then I’ve transformed the documentational json data to (Excel sheet) relational to be more readable. The data contains a set of (82 row and 13 table attribute). All table attributes are set have Number (double) a part of the country attribute which is set to have string type(figure1).

Figure1 statistic result of data

1. Data Exploration

If we look at the figure below that shows violence statistics, we'll find that the Peaceful demonstrations in public squares takes the highest level among all types of existing violence.

Figur2 box plot visualizing the high value of protesters

(Protesters, max) =145 of frequencies

And we can note that there is other two levels of violence are taking a high-level compering with the others type of violence.

(Police/milit, max) = 9 of frequencies (Unidentified armed Group, max) = 9 of frequencies

2. Preprocessing

In order to make data easier to read, I've implemented some strategies which include sampling techniques.

• Sampling Techniques

From the total numbers of row that are include (82 absolute rows) I have made the decision to make the sampling test to (Relative 10%) using Bootstrap sampling Appending count occurrences to control the duplicate values. So, the table of the dataset has split to two partitions the train set witch I'll use them as a training set, and the rest of the rows to use as a test set of data.

I keep my attentions to the (country attribute) and the attribute of the most frequent events (protesters attribute) to be the used to compared, by filtering these attribute tables using (column filter).

• Missing Data Handling

The set of the data has been distributed according to the country evet type, so If I do any statistical manipulations (line interpolation, max, most frequent value …etc.), I will not get a perfect analysis result, so that the missing values were treated by adding the value zero to all the record of a missing values using (Missing Value node).

• Variable Transformation

To have a given minimum and maximum value for the protesters, I’ve taken the decision to transform all values related by using the normalizer node, setting the method to be Min- Max normalization, this choice is showing the less value of (0) has been signed in Nigeria, by the other side the maximum value has signed in Sweden, and the most other countries (Armenia, Slovenia, South Sudan and Venezuela) have signed (0.1) according to trading dataset.

Figure3 data normalizing

• Transformation of categorical variables

as the dataset contain a categorical character so it useful to normalize the protesters attribute. do after divide each number of protesters column (10%) I’ve obtain either 0 or 1 of al values of the protesters attribute.

The progress was partitioning the data in to two as the first part to use as training set and the other part as data set. Then I select only the column of interest to compute the normalization metho by setting (min-max Normalization) and setting the value of the min to (0.0) and the max value to (1.0).

Lastly, I take the decision to compare the different between the normalization result for all data and the protesters attribute. In the line box below, there is some statistic values are visualized (min, max, standard deviation and variance).

Figure4 line plot (statistic of the normalization values)

3. Models

Different Modeling Methods were been used to calculate predicting value.

Cross Validation techniques

The dataset has split to two parts (training data and the test data) using partitioning node of knime platform, I have made the decision to give the first part of the data (67%) from all the dataset in a random selection, so it'll be useful to circular guarantee and setting the random seed to (123). And this partitioning has taken action two times to complete the different parts of the cross validation. In particular (holdout and k-folds classifiers techniques) where the computed to compare the value of accuracy between all classifiers under studies.

Feature Filtering

To have more clear data and to reduce data dimensionality, I have decided to compute filter selection procedure (uni-Variate) by uploading the dataset under analysis. data partitioned to two, (67%) for the training set and (33%) to the test dataset and by choosing draw random and random seed (123).

To compute a decision tree classifier, I use knime node (AttributeSelectedClassifier (3.7))

To proceed with a uni-variate the evaluator had set at (CfsSubsetEval)

choose (J48) as classifier.

Country attribute is set to be target column.

(AttributeSelectedClassifier) has linked to weka predictor node to compute the classification to the rest of the dataset.

Holdout Evaluation

To compute the accuracy value to know how classifier is preformed, I have partitioned the data in to sub training set and test set, actually I’ve given the training data set the major part of the data, in particular it has (70%) and the remaining (30%) has been given to the test data set, as long as there is a different classifier to calculate the accuracy value, I’ve taken the decision to choose the (NBTree classifier), to obtain an accuracy over the test data set I choose (weka predictor inducer) to query over the test data set.

below the box plot (figure5) show the accuracy result of 0.0 witch as a maximum and 0.0 minimum possibility, which is mean that there is a (missing values) on training dataset so, high generalization error has achieved using holdout evaluation method.

Figure5 box plot - Holdout Cross Validation result

K-folds Cross Validation Evaluation

To ensure that each row is included to the training data set, in this case it could be better to partition data in to disjoin sub set with a number of rows and sequence iteration. So, the x-partitioner is needed to complete this task, practically, I’ve set the number of records of validation to (10), random sampling method was preferred to be used and 2222 as a random seed. According to these decision takes I’ve notice that error value achieved are reduced compared with a holdout validation accuracy value.

below the box plot (figure5) show the accuracy result of 0.52 which as a maximum and 0.0 as minimum possibility of K-folds Cross Validation.

Figure5 box plot – K-folds Cross Validation result

Comparison the accuracy result of Holdout and k-folds Cross validations

The first consideration is that all the accuracy estimation procedures are achieved by the two different classifiers, have the same result, because the different classifiers have applied at the same part of data (test set) to calculate the confidence interval.

The accuracy value has obtained by computing the classifier to the tow partitions (A, B)

second consideration is that holdout and k-folds estimations are a point, actually they present only the generalization errors during validations.

In the line plot (figure6) which is showing below the result of the accuracy value computed by the two different modulization with an interpolating of the missing values.

I can conclude that accuracy value obtain by classification model K-Folds is achieve better result (0.52) comparing with Holdout cross validation which is achieved (0)

comparing the two classification models is statistically significant.

Fugure6

Prototype Base Algorithm (K-Means)

For proximity measures and to give more meaningful of group of objects, I have taken the decision to compute K-mean cluster algorithm to the dataset which are including missing values, in particular I handled them by given a value of (0) to the double missing values and the value (non) to the string attribute, these procedures will make the k

mean cluster result more coherent.

Initially I have started by to a random initialization of the k-mean cluster centroid (hierarchical clustering) and the statistic random. All attributes are chosen to compute the k-mean clustering algorithm and the number of clusters are sets to be (5). clusters

with the count of number of attributes related to, as follow

1. Cluster number 0 contain 12 observations 2. Cluster number 1 contain 10 observations 3. Cluster number 2 contain 10 observations

4. Cluster number 3 contain 9 observations 5. Cluster number 4 contain 9 observations

To improve the readability of these clusters I have groups the rows of a table by the unique values by selecting the group columns. A row is created for each unique set of values of the selected group column. The remaining columns are aggregated based on the specified aggregation settings (mean). The output table contains one row for each unique value. These conditional box plot bellow showing the clustering k-mean result.

part and the rest of the data which have (32) as second part, with draw randomly and setting the random seed to 777.

▪ Normalization: normalizing data by setting z Score normalizations to be the method of normalizing data.

▪ Distance matrix calculate: by using Euclidean distance

▪ Hierarchical clustering: the set of linkage type is average linkage

▪ Hierarchical clustering view: in below the (figure10) show the amount of distance between clusters under evaluation.

▪ DBSCAN

As we can see the conditional box plot at the axes lines, we have the five cluster at (x) axe and the protesters values resulting at the(y) axe. I notice that the value of the protesters is different from each cluster to another.

Actually, we can select different attribute during the set of the box plot to see different result for other Events.

As I compute the cluster algorithm to the first partition of the data se I have to compute the cluster algorithm to the second part of the data. By assigning data of the second partition.

Hierarchical Clustering & DBSCAN (DistMatrex)

To understand a different matrix distance between classes, hierarchical clustering is required in particular the distances of the observations point.

The prosses has started by uploading the dataset of violence and events, then some operations have done they mentioned in bellow

▪ Column filtering: filtering all attribute and I considered only the cases where I want to compute the hierarchical.

▪ Missing values: The next step is to handle the missing values by giving values of (0) for double and (non) for strings.

▪ Partitioning: partitioning data to two parts, so I have given absolute value of (50) for the first

As we can see from the dendrogram the distance view in the (x) which is presenting the number of clusters and in the (y) axes which is presenting the numbers of error achieved. So, that could be the useful to choose the perfect number of clusters and that could be (47) as best number from the two clusters under comparison.

Figure11

The pie chart is showing the percentage of the BSCAN technique. after has setting the epsilon to (0.5) and minimum point to (3), the summary table resulting (2) cluster with (18) records of noises points.

conclusion

This analysis is the result of knowing the most violent event type in the world and their probability to increase.

According to what I notice that all or most of events are achieved by the citizens are peaceful protest, rather and the probability to increase is high.

The dataset is quiet enough to have the answer the question.

All models presented having a mean to facilitated the readability of the data and quite useful to take better decision. Although there a missing value in the data set and the process. the missing values do not affect much if I handled them by the most effective technique.