Thursday, September 1, 2022

World of Violence: Protesters 2021 - Machine Learning project


Abstract 

In 2021, millions of people needed humanitarian assistance and protection, as reflected by the amount of violence observed while analyzing the Protesters and Armed Conflict Location & Event data. The goal of the analysis is to show which events achieve the highest frequency rate among all event types.

Keywords: protesters – rioters – Al-Shabaab – violence

Contents

Introduction 

1. Data Exploration 

2. Preprocessing 

Sampling Techniques 

Missing Data Handling 

Variable Transformation 

Transformation of categorical variables 

3. Models 

Cross Validation techniques 

Feature Filtering 

Holdout Evaluation 

K-folds Cross Validation Evaluation 

Comparison of the Accuracy Results of Holdout and K-Folds Cross Validation

Prototype-Based Algorithm (K-Means)

Hierarchical Clustering & DBSCAN (Distance Matrix)

Conclusion

References 

Introduction 

Events taking place around the world, such as protests, create emergencies, deteriorate the global economy, and increase poverty rates. That is why I decided to take the initiative to conduct this research: to identify the most frequent event types and the probability that event activity will increase around the world.

The data was collected to include all reported political violence and protest events across Africa, the Middle East, Latin America & the Caribbean, East Asia, South Asia, Southeast Asia, Central Asia & the Caucasus, Europe, and the United States of America.

Grouped by the actors behind each event, the data covers military coups and wars, and also demonstrations and people protesting, which take the highest number of events (145) in 2021.

The dataset was queried through the API, which returns real-time data in JSON format. I then transformed the document-oriented JSON data into a relational Excel sheet to make it more readable. The data contains 82 rows and 13 attributes. All attributes have the type Number (double), apart from the country attribute, which has the type String (Figure 1).
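The same JSON-to-table step can be reproduced in Python. The sketch below is illustrative only: the endpoint URL and the "data" key are placeholders, not the actual API used here.

    import requests
    import pandas as pd

    # Placeholder endpoint; substitute the real API URL and parameters.
    url = "https://example.com/api/events"
    payload = requests.get(url, timeout=30).json()

    # Flatten the document-oriented JSON into a relational table
    # (assumes the records sit under a "data" key).
    df = pd.json_normalize(payload["data"])

    # Export to an Excel sheet for readability, mirroring the step above.
    df.to_excel("events_2021.xlsx", index=False)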

Figure 1: statistical summary of the data

1. Data Exploration 

If we look at the figure below, which shows the violence statistics, we find that peaceful demonstrations in public squares reach the highest level among all recorded event types.

Figure 2: box plot visualizing the high value of the protesters attribute

(Protesters, max) = 145 frequencies

We can also note that two other event types reach a relatively high level compared with the remaining types of violence.

(Police/milit, max) = 9 frequencies

(Unidentified armed group, max) = 9 frequencies
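These maxima can be verified with a few lines of pandas; the file and column names follow the earlier sketch and are assumptions about the sheet's layout.

    import pandas as pd

    df = pd.read_excel("events_2021.xlsx")

    # Numeric event-type columns only; "country" is the lone string column.
    numeric = df.select_dtypes("number")

    # Maximum frequency reached by each event type across all countries.
    print(numeric.max().sort_values(ascending=False))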

2. Preprocessing 

To make the data easier to work with, I implemented some strategies, starting with sampling techniques.

Sampling Techniques 

From the total number of rows (82 absolute rows), I decided to draw a relative 10% sample using bootstrap sampling, appending occurrence counts to keep control of duplicated values. The dataset table was then split into two partitions: the rows I will use as a training set, and the remaining rows to use as a test set.

I focused on the country attribute and on the attribute of the most frequent event type (the protesters attribute) as the columns to compare, filtering the table down to these attributes with a Column Filter.
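A rough pandas equivalent of this sampling and filtering step could look as follows; the 10% bootstrap draw and the two attributes come from the text, while the seed and the "Protesters" column spelling are assumptions.

    import pandas as pd

    df = pd.read_excel("events_2021.xlsx")

    # Bootstrap sample: 10% of the rows, drawn with replacement.
    sample = df.sample(frac=0.10, replace=True, random_state=123)

    # Append occurrence counts to keep control of duplicated rows.
    sample["count"] = sample.groupby(sample.index).cumcount() + 1

    # Rows never drawn into the sample serve as the test set.
    test = df.drop(index=sample.index.unique())

    # Column filter: keep only the attributes under comparison.
    filtered = sample[["country", "Protesters", "count"]]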

Missing Data Handling 

The data is distributed according to country and event type, so any statistical imputation (linear interpolation, maximum, most frequent value, etc.) would not give a sound analysis result. The missing values were therefore treated by assigning the value zero to every record with a missing value, using the Missing Value node.
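In pandas this zero-filling policy takes a couple of lines; a minimal sketch, assuming the same Excel sheet as above:

    import pandas as pd

    df = pd.read_excel("events_2021.xlsx")

    # Zero-fill the numeric attributes, mirroring the Missing Value node.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(0)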

Variable Transformation 

To give the protesters attribute a fixed minimum and maximum, I decided to transform all related values using the Normalizer node, with the method set to Min-Max normalization. With this choice, the lowest value (0) is assigned to Nigeria, while the maximum value is assigned to Sweden; most of the other countries (Armenia, Slovenia, South Sudan, and Venezuela) are assigned around 0.1 according to the training dataset.

Figure 3: data normalization result
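The Min-Max rescaling behind Figure 3 corresponds to (x - min) / (max - min); a scikit-learn sketch of the KNIME Normalizer setting, with the column name assumed:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_excel("events_2021.xlsx").fillna(0)

    # Rescale into [0, 1]: the country with the smallest count maps
    # to 0 (Nigeria here) and the largest maps to 1 (Sweden here).
    scaler = MinMaxScaler(feature_range=(0.0, 1.0))
    df["Protesters_norm"] = scaler.fit_transform(df[["Protesters"]]).ravel()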

Transformation of categorical variables

Since the dataset contains categorical characteristics, it is useful to normalize the protesters attribute. After applying the 10% threshold to the values of the protesters column, I obtained either 0 or 1 for all values of the protesters attribute.

The process partitions the data into two parts, the first to use as a training set and the other as a test set. Then I select only the column of interest and compute the normalization method, setting it to Min-Max Normalization with the minimum value at 0.0 and the maximum value at 1.0.
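Read as a threshold on the normalized values, the 0/1 transformation could be sketched like this, continuing from the normalization snippet above (the 10% cut-off is my interpretation of the text):

    # Values at or above the 10% mark become 1, everything below becomes 0.
    df["Protesters_bin"] = (df["Protesters_norm"] >= 0.10).astype(int)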

Lastly, I decided to compare the normalization results for the whole dataset with those for the protesters attribute alone. In the line plot below, some statistics are visualized (min, max, standard deviation, and variance).

Figure 4: line plot (statistics of the normalized values)

3. Models 

Different modeling methods were used to calculate the predicted values.

Cross Validation techniques 

The dataset was split into two parts (training data and test data) using the Partitioning node of the KNIME platform. I decided to assign the first part 67% of the whole dataset, selected at random, with the random seed set to 123 to guarantee reproducibility. This partitioning was applied twice to supply the different parts of the cross validation. In particular, the holdout and k-folds techniques were computed to compare the accuracy values of all classifiers under study.
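In scikit-learn terms, the Partitioning node's settings translate roughly to the following; the feature/target layout is an assumption:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_excel("events_2021.xlsx").fillna(0)
    X, y = df.drop(columns=["country"]), df["country"]

    # 67% training / 33% test, drawn at random with seed 123
    # so the split is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.67, random_state=123)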

Feature Filtering 

To obtain cleaner data and reduce its dimensionality, I decided to run a univariate feature-filtering procedure on the dataset under analysis. The data is partitioned into two parts, 67% for the training set and 33% for the test set, drawn at random with random seed 123.

To compute a decision tree classifier, I use the KNIME node AttributeSelectedClassifier (3.7).

To proceed with the univariate selection, the evaluator is set to CfsSubsetEval.

J48 is chosen as the classifier.

The country attribute is set as the target column.

The AttributeSelectedClassifier is linked to a Weka Predictor node to compute the classification on the rest of the dataset.
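scikit-learn has no CfsSubsetEval or J48, so the sketch below substitutes a univariate ANOVA filter in front of a decision tree to illustrate the same select-then-classify idea; k=5 and the column layout are assumptions.

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_excel("events_2021.xlsx").fillna(0)
    X, y = df.drop(columns=["country"]), df["country"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.67, random_state=123)

    # Univariate filter (stand-in for CfsSubsetEval) feeding a decision
    # tree (stand-in for J48), echoing AttributeSelectedClassifier.
    model = make_pipeline(SelectKBest(f_classif, k=5),
                          DecisionTreeClassifier(random_state=123))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))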

Holdout Evaluation 

To compute the accuracy value and see how the classifier performs, I partitioned the data into a training subset and a test subset. I gave the training set the major part of the data, in particular 70%, with the remaining 30% going to the test set. Since there are several classifiers available for calculating the accuracy value, I decided to choose the NBTree classifier; to obtain the accuracy over the test set, I used the Weka Predictor node to query the test data.

The box plot below (Figure 5) shows an accuracy of 0.0 as both the maximum and the minimum, which means there are missing values in the training dataset; a high generalization error is therefore obtained with the holdout evaluation method.

Figure 5: box plot – holdout validation result
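scikit-learn has no NBTree either, so a plain decision tree stands in for it in this holdout sketch (70/30 split as described above, X and y as in the earlier snippets):

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Holdout: a single 70% training / 30% test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.70, random_state=123)

    clf = DecisionTreeClassifier(random_state=123).fit(X_train, y_train)

    # One accuracy estimate over the held-out 30%.
    print(accuracy_score(y_test, clf.predict(X_test)))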

K-folds Cross Validation Evaluation 

To ensure that every row is eventually included in the training data, it is better to partition the data into disjoint subsets with a fixed number of rows and iterate over them in sequence. The X-Partitioner node is used for this task: practically, I set the number of validations to 10, preferred random sampling as the method, and used 2222 as the random seed. With these settings, I noticed that the error achieved is reduced compared with the holdout validation accuracy.

The box plot below (Figure 5) shows an accuracy of 0.52 as the maximum and 0.0 as the minimum for the k-folds cross validation.

Figure 5: box plot – k-folds cross validation result
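The X-Partitioner settings translate roughly to a 10-fold cross validation in scikit-learn (same stand-in classifier as above):

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # 10 disjoint folds, shuffled with seed 2222, so every row appears
    # in the training data in 9 of the 10 iterations.
    cv = KFold(n_splits=10, shuffle=True, random_state=2222)
    scores = cross_val_score(DecisionTreeClassifier(random_state=123),
                             X, y, cv=cv)
    print(scores.min(), scores.max(), scores.mean())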

Comparison of the Accuracy Results of Holdout and K-Folds Cross Validation

The first consideration is that the accuracy estimation procedures run with the two different classifiers give the same result, because the classifiers were applied to the same part of the data (the test set) to calculate the confidence interval.

The accuracy value is obtained by applying the classifier to the two partitions (A, B).

The second consideration is that the holdout and k-folds estimations are point estimates: they present only the generalization errors observed during validation.

The line plot below (Figure 6) shows the accuracy values computed by the two different models, with the missing values interpolated.

I can conclude that the accuracy obtained by the k-folds classification model (0.52) is a better result than the one achieved by holdout cross validation (0.0).

The difference between the two classification models is statistically significant.

Figure 6

Prototype-Based Algorithm (K-Means)

For proximity measures, and to give groups of objects more meaning, I decided to apply the k-means clustering algorithm to the dataset, which includes missing values. In particular, I handled them by assigning the value 0 to the missing doubles and the value "none" to the string attribute; these procedures make the k-means clustering result more coherent.

Initially, I started with a random initialization of the k-means cluster centroids and a static random seed. All attributes are used to compute the k-means clustering algorithm, and the number of clusters is set to 5. The clusters, with the count of observations belonging to each, are as follows (a Python sketch follows the list):

1. Cluster 0 contains 12 observations
2. Cluster 1 contains 10 observations
3. Cluster 2 contains 10 observations
4. Cluster 3 contains 9 observations
5. Cluster 4 contains 9 observations
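A minimal scikit-learn version of this step, assuming the zero-filled numeric attributes X from the earlier snippets:

    import numpy as np
    from sklearn.cluster import KMeans

    # Random centroid initialization, 5 clusters, and a fixed seed to
    # stand in for the static random behaviour described above.
    km = KMeans(n_clusters=5, init="random", n_init=10, random_state=123)
    labels = km.fit_predict(X)

    # Observation counts per cluster (cf. the list above).
    print(np.bincount(labels))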

To improve the readability of these clusters, I grouped the rows of the table by the unique values of the selected group columns. A row is created for each unique set of values of the selected group column, and the remaining columns are aggregated according to the specified aggregation setting (mean). The output table contains one row for each unique value. The conditional box plot below shows the k-means clustering result.
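The GroupBy step is a one-line aggregation in pandas, reusing the labels from the k-means sketch:

    # One row per cluster; the remaining columns aggregated by their mean.
    summary = X.assign(cluster=labels).groupby("cluster").mean()
    print(summary)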

As we can see on the axes of the conditional box plot, the five clusters sit on the x axis and the resulting protesters values on the y axis. I notice that the value of the protesters attribute differs from one cluster to another.

We can also select a different attribute when configuring the box plot to see the result for other event types.

Having computed the clustering algorithm on the first partition of the data, I then have to apply it to the second partition as well, by assigning the data of the second partition to the clusters.

Hierarchical Clustering & DBSCAN (Distance Matrix)

To understand the distance matrix between classes, hierarchical clustering is required, in particular to examine the distances between the observation points.

The process starts by loading the dataset of violence and events; the operations mentioned below are then performed.

Column filtering: filtering the attributes, keeping only those on which I want to compute the hierarchical clustering.

Missing values: the next step handles the missing values by assigning 0 to the doubles and "none" to the strings.

Partitioning: partitioning the data into two parts. I gave an absolute count of 50 rows to the first part and the remaining 32 rows to the second, drawn at random with the random seed set to 777.

Normalization: normalizing the data with z-score normalization as the method.

Distance matrix calculation: using the Euclidean distance.

Hierarchical clustering: the linkage type is set to average linkage.

Hierarchical clustering view: below, Figure 10 shows the amount of distance between the clusters under evaluation (a SciPy sketch of the pipeline follows).
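Under the stated settings (z-score normalization, Euclidean distances, average linkage), the pipeline might be sketched as follows, reusing the numeric matrix X from earlier:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import pdist
    from scipy.stats import zscore

    # z-score normalization of the filtered, zero-filled attributes.
    Z = zscore(X.to_numpy(), axis=0)

    # Euclidean distances between every pair of observations.
    dists = pdist(Z, metric="euclidean")

    # Average-linkage hierarchical clustering and the dendrogram view
    # used to inspect inter-cluster distances (cf. Figure 10).
    tree = linkage(dists, method="average")
    dendrogram(tree)
    plt.show()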

As we can see from the dendrogram distance view, the x axis presents the number of clusters and the y axis presents the error achieved. This makes it useful for choosing the best number of clusters, which here could be 47 from the two clusterings under comparison.

Figure 11

DBSCAN

The pie chart shows the proportions resulting from the DBSCAN technique. After setting epsilon to 0.5 and the minimum number of points to 3, the summary table results in 2 clusters with 18 records as noise points.
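With those parameters (epsilon 0.5, minimum 3 points), the scikit-learn equivalent is below, reusing the z-scored matrix Z from the previous sketch:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # eps and min_samples match the settings reported above.
    db = DBSCAN(eps=0.5, min_samples=3).fit(Z)

    # Label -1 marks noise; every other label is a cluster id.
    n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    n_noise = int(np.sum(db.labels_ == -1))
    print(n_clusters, "clusters,", n_noise, "noise points")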

Conclusion

This analysis set out to identify the most frequent violent event type in the world and its probability of increasing.

From what I noticed, all or most of the events carried out by citizens are peaceful protests, and their probability of increasing is high.

The dataset is sufficient to answer that question.

All the models presented offer a means to facilitate the readability of the data and are quite useful for making better decisions. Although there are missing values in the dataset and in the process, they do not affect the result much when handled with the most effective technique.


