Thursday, September 1, 2022

World of Violence: Protesters 2021 - Machine Learning project


Abstract 

In 2021, millions of people needed humanitarian assistance and protection, as reflected by the amount of violence observed while analyzing the Protesters and Armed Conflict Location & Event data. The goal of the analysis is to show which events achieve the highest frequency rate among all event types.

Keywords: protesters – rioters – Al-Shabaab – violence

Contents

Introduction 

1. Data Exploration 

2. Preprocessing 

Sampling Techniques 

Missing Data Handling 

Variable Transformation 

Transformation of categorical variables 

3. Models 

Cross Validation techniques 

Feature Filtering 

Holdout Evaluation 

K-folds Cross Validation Evaluation 

Comparison of the Accuracy Results of Holdout and K-Folds Cross Validation

Prototype-Based Algorithm (K-Means)

Hierarchical Clustering & DBSCAN (Distance Matrix)

Conclusion

References 

Introduction 

Events taking place around the world, such as protests, create emergencies, deteriorate the global economy, and increase poverty rates. That is why I decided to take the initiative to conduct this research: to identify the most frequent event types and the probability that event activity will increase around the world.

The data was collected to include all reported political violence and protest events across Africa, the Middle East, Latin America & the Caribbean, East Asia, South Asia, Southeast Asia, Central Asia & the Caucasus, Europe, and the United States of America.

Grouped by the actors behind each event, the data covers military coups and wars, and also demonstrations and people protesting, which take the highest number of events (145) in 2021.

The dataset was queried through the API, which returns real-time data in JSON format. I then transformed the document-oriented JSON data into a relational Excel sheet to make it more readable. The data contains 82 rows and 13 attributes. All attributes have the type Number (double), apart from the country attribute, which has the type String (Figure 1).
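The same JSON-to-table step can be reproduced in Python. The sketch below is illustrative only: the endpoint URL and the "data" key are placeholders, not the actual API used here.

    import requests
    import pandas as pd

    # Placeholder endpoint; substitute the real API URL and parameters.
    url = "https://example.com/api/events"
    payload = requests.get(url, timeout=30).json()

    # Flatten the document-oriented JSON into a relational table
    # (assumes the records sit under a "data" key).
    df = pd.json_normalize(payload["data"])

    # Export to an Excel sheet for readability, mirroring the step above.
    df.to_excel("events_2021.xlsx", index=False)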

Figure 1: statistical summary of the data

1. Data Exploration 

If we look at the figure below, which shows the violence statistics, we find that peaceful demonstrations in public squares reach the highest level among all recorded event types.

Figure 2: box plot visualizing the high value of the protesters attribute

(Protesters, max) = 145 frequencies

We can also note that two other event types reach a relatively high level compared with the remaining types of violence.

(Police/milit, max) = 9 frequencies

(Unidentified armed group, max) = 9 frequencies
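These maxima can be verified with a few lines of pandas; the file and column names follow the earlier sketch and are assumptions about the sheet's layout.

    import pandas as pd

    df = pd.read_excel("events_2021.xlsx")

    # Numeric event-type columns only; "country" is the lone string column.
    numeric = df.select_dtypes("number")

    # Maximum frequency reached by each event type across all countries.
    print(numeric.max().sort_values(ascending=False))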

2. Preprocessing 

To make the data easier to work with, I implemented some strategies, starting with sampling techniques.

Sampling Techniques 

From the total number of rows (82 absolute rows), I decided to draw a relative 10% sample using bootstrap sampling, appending occurrence counts to keep control of duplicated values. The dataset table was then split into two partitions: the rows I will use as a training set, and the remaining rows to use as a test set.

I focused on the country attribute and on the attribute of the most frequent event type (the protesters attribute) as the columns to compare, filtering the table down to these attributes with a Column Filter.
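A rough pandas equivalent of this sampling and filtering step could look as follows; the 10% bootstrap draw and the two attributes come from the text, while the seed and the "Protesters" column spelling are assumptions.

    import pandas as pd

    df = pd.read_excel("events_2021.xlsx")

    # Bootstrap sample: 10% of the rows, drawn with replacement.
    sample = df.sample(frac=0.10, replace=True, random_state=123)

    # Append occurrence counts to keep control of duplicated rows.
    sample["count"] = sample.groupby(sample.index).cumcount() + 1

    # Rows never drawn into the sample serve as the test set.
    test = df.drop(index=sample.index.unique())

    # Column filter: keep only the attributes under comparison.
    filtered = sample[["country", "Protesters", "count"]]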

Missing Data Handling 

The data is distributed according to country and event type, so any statistical imputation (linear interpolation, maximum, most frequent value, etc.) would not give a sound analysis result. The missing values were therefore treated by assigning the value zero to every record with a missing value, using the Missing Value node.
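In pandas this zero-filling policy takes a couple of lines; a minimal sketch, assuming the same Excel sheet as above:

    import pandas as pd

    df = pd.read_excel("events_2021.xlsx")

    # Zero-fill the numeric attributes, mirroring the Missing Value node.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(0)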

Variable Transformation 

To give the protesters attribute a fixed minimum and maximum, I decided to transform all related values using the Normalizer node, with the method set to Min-Max normalization. With this choice, the lowest value (0) is assigned to Nigeria, while the maximum value is assigned to Sweden; most of the other countries (Armenia, Slovenia, South Sudan, and Venezuela) are assigned around 0.1 according to the training dataset.

Figure 3: data normalization result
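The Min-Max rescaling behind Figure 3 corresponds to (x - min) / (max - min); a scikit-learn sketch of the KNIME Normalizer setting, with the column name assumed:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_excel("events_2021.xlsx").fillna(0)

    # Rescale into [0, 1]: the country with the smallest count maps
    # to 0 (Nigeria here) and the largest maps to 1 (Sweden here).
    scaler = MinMaxScaler(feature_range=(0.0, 1.0))
    df["Protesters_norm"] = scaler.fit_transform(df[["Protesters"]]).ravel()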

Transformation of categorical variables

Since the dataset contains categorical characteristics, it is useful to normalize the protesters attribute. After applying the 10% threshold to the values of the protesters column, I obtained either 0 or 1 for all values of the protesters attribute.

The process partitions the data into two parts, the first to use as a training set and the other as a test set. Then I select only the column of interest and compute the normalization method, setting it to Min-Max Normalization with the minimum value at 0.0 and the maximum value at 1.0.
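Read as a threshold on the normalized values, the 0/1 transformation could be sketched like this, continuing from the normalization snippet above (the 10% cut-off is my interpretation of the text):

    # Values at or above the 10% mark become 1, everything below becomes 0.
    df["Protesters_bin"] = (df["Protesters_norm"] >= 0.10).astype(int)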

Lastly, I decided to compare the normalization results for the whole dataset with those for the protesters attribute alone. In the line plot below, some statistics are visualized (min, max, standard deviation, and variance).

Figure 4: line plot (statistics of the normalized values)

3. Models 

Different modeling methods were used to calculate the predicted values.

Cross Validation techniques 

The dataset was split into two parts (training data and test data) using the Partitioning node of the KNIME platform. I decided to assign the first part 67% of the whole dataset, selected at random, with the random seed set to 123 to guarantee reproducibility. This partitioning was applied twice to supply the different parts of the cross validation. In particular, the holdout and k-folds techniques were computed to compare the accuracy values of all classifiers under study.
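In scikit-learn terms, the Partitioning node's settings translate roughly to the following; the feature/target layout is an assumption:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_excel("events_2021.xlsx").fillna(0)
    X, y = df.drop(columns=["country"]), df["country"]

    # 67% training / 33% test, drawn at random with seed 123
    # so the split is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.67, random_state=123)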

Feature Filtering 

To obtain cleaner data and reduce its dimensionality, I decided to run a univariate feature-filtering procedure on the dataset under analysis. The data is partitioned into two parts, 67% for the training set and 33% for the test set, drawn at random with random seed 123.

To compute a decision tree classifier, I use the KNIME node AttributeSelectedClassifier (3.7).

To proceed with the univariate selection, the evaluator is set to CfsSubsetEval.

J48 is chosen as the classifier.

The country attribute is set as the target column.

The AttributeSelectedClassifier is linked to a Weka Predictor node to compute the classification on the rest of the dataset.
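scikit-learn has no CfsSubsetEval or J48, so the sketch below substitutes a univariate ANOVA filter in front of a decision tree to illustrate the same select-then-classify idea; k=5 and the column layout are assumptions.

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_excel("events_2021.xlsx").fillna(0)
    X, y = df.drop(columns=["country"]), df["country"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.67, random_state=123)

    # Univariate filter (stand-in for CfsSubsetEval) feeding a decision
    # tree (stand-in for J48), echoing AttributeSelectedClassifier.
    model = make_pipeline(SelectKBest(f_classif, k=5),
                          DecisionTreeClassifier(random_state=123))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))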

Holdout Evaluation 

To compute the accuracy value and see how the classifier performs, I partitioned the data into a training subset and a test subset. I gave the training set the major part of the data, in particular 70%, with the remaining 30% going to the test set. Since there are several classifiers available for calculating the accuracy value, I decided to choose the NBTree classifier; to obtain the accuracy over the test set, I used the Weka Predictor node to query the test data.

The box plot below (Figure 5) shows an accuracy of 0.0 as both the maximum and the minimum, which means there are missing values in the training dataset; a high generalization error is therefore obtained with the holdout evaluation method.

Figure 5: box plot – holdout validation result
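scikit-learn has no NBTree either, so a plain decision tree stands in for it in this holdout sketch (70/30 split as described above, X and y as in the earlier snippets):

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Holdout: a single 70% training / 30% test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.70, random_state=123)

    clf = DecisionTreeClassifier(random_state=123).fit(X_train, y_train)

    # One accuracy estimate over the held-out 30%.
    print(accuracy_score(y_test, clf.predict(X_test)))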

K-folds Cross Validation Evaluation 

To ensure that every row is eventually included in the training data, it is better to partition the data into disjoint subsets with a fixed number of rows and iterate over them in sequence. The X-Partitioner node is used for this task: practically, I set the number of validations to 10, preferred random sampling as the method, and used 2222 as the random seed. With these settings, I noticed that the error achieved is reduced compared with the holdout validation accuracy.

The box plot below (Figure 5) shows an accuracy of 0.52 as the maximum and 0.0 as the minimum for the k-folds cross validation.

Figure 5: box plot – k-folds cross validation result
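The X-Partitioner settings translate roughly to a 10-fold cross validation in scikit-learn (same stand-in classifier as above):

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # 10 disjoint folds, shuffled with seed 2222, so every row appears
    # in the training data in 9 of the 10 iterations.
    cv = KFold(n_splits=10, shuffle=True, random_state=2222)
    scores = cross_val_score(DecisionTreeClassifier(random_state=123),
                             X, y, cv=cv)
    print(scores.min(), scores.max(), scores.mean())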

Comparison of the Accuracy Results of Holdout and K-Folds Cross Validation

The first consideration is that the accuracy estimation procedures run with the two different classifiers give the same result, because the classifiers were applied to the same part of the data (the test set) to calculate the confidence interval.

The accuracy value is obtained by applying the classifier to the two partitions (A, B).

The second consideration is that the holdout and k-folds estimations are point estimates: they present only the generalization errors observed during validation.

The line plot below (Figure 6) shows the accuracy values computed by the two different models, with the missing values interpolated.

I can conclude that the accuracy obtained by the k-folds classification model (0.52) is a better result than the one achieved by holdout cross validation (0.0).

The difference between the two classification models is statistically significant.

Figure 6

Prototype-Based Algorithm (K-Means)

For proximity measures, and to give groups of objects more meaning, I decided to apply the k-means clustering algorithm to the dataset, which includes missing values. In particular, I handled them by assigning the value 0 to the missing doubles and the value "none" to the string attribute; these procedures make the k-means clustering result more coherent.

Initially, I started with a random initialization of the k-means cluster centroids and a static random seed. All attributes are used to compute the k-means clustering algorithm, and the number of clusters is set to 5. The clusters, with the count of observations belonging to each, are as follows (a Python sketch follows the list):

1. Cluster 0 contains 12 observations
2. Cluster 1 contains 10 observations
3. Cluster 2 contains 10 observations
4. Cluster 3 contains 9 observations
5. Cluster 4 contains 9 observations
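A minimal scikit-learn version of this step, assuming the zero-filled numeric attributes X from the earlier snippets:

    import numpy as np
    from sklearn.cluster import KMeans

    # Random centroid initialization, 5 clusters, and a fixed seed to
    # stand in for the static random behaviour described above.
    km = KMeans(n_clusters=5, init="random", n_init=10, random_state=123)
    labels = km.fit_predict(X)

    # Observation counts per cluster (cf. the list above).
    print(np.bincount(labels))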

To improve the readability of these clusters, I grouped the rows of the table by the unique values of the selected group columns. A row is created for each unique set of values of the selected group column, and the remaining columns are aggregated according to the specified aggregation setting (mean). The output table contains one row for each unique value. The conditional box plot below shows the k-means clustering result.
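The GroupBy step is a one-line aggregation in pandas, reusing the labels from the k-means sketch:

    # One row per cluster; the remaining columns aggregated by their mean.
    summary = X.assign(cluster=labels).groupby("cluster").mean()
    print(summary)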

As we can see on the axes of the conditional box plot, the five clusters sit on the x axis and the resulting protesters values on the y axis. I notice that the value of the protesters attribute differs from one cluster to another.

We can also select a different attribute when configuring the box plot to see the result for other event types.

Having computed the clustering algorithm on the first partition of the data, I then have to apply it to the second partition as well, by assigning the data of the second partition to the clusters.

Hierarchical Clustering & DBSCAN (Distance Matrix)

To understand the distance matrix between classes, hierarchical clustering is required, in particular to examine the distances between the observation points.

The process starts by loading the dataset of violence and events; the operations mentioned below are then performed.

Column filtering: filtering the attributes, keeping only those on which I want to compute the hierarchical clustering.

Missing values: the next step handles the missing values by assigning 0 to the doubles and "none" to the strings.

Partitioning: partitioning the data into two parts. I gave an absolute count of 50 rows to the first part and the remaining 32 rows to the second, drawn at random with the random seed set to 777.

Normalization: normalizing the data with z-score normalization as the method.

Distance matrix calculation: using the Euclidean distance.

Hierarchical clustering: the linkage type is set to average linkage.

Hierarchical clustering view: below, Figure 10 shows the amount of distance between the clusters under evaluation (a SciPy sketch of the pipeline follows).
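Under the stated settings (z-score normalization, Euclidean distances, average linkage), the pipeline might be sketched as follows, reusing the numeric matrix X from earlier:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import pdist
    from scipy.stats import zscore

    # z-score normalization of the filtered, zero-filled attributes.
    Z = zscore(X.to_numpy(), axis=0)

    # Euclidean distances between every pair of observations.
    dists = pdist(Z, metric="euclidean")

    # Average-linkage hierarchical clustering and the dendrogram view
    # used to inspect inter-cluster distances (cf. Figure 10).
    tree = linkage(dists, method="average")
    dendrogram(tree)
    plt.show()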

As we can see from the dendrogram distance view, the x axis presents the number of clusters and the y axis presents the error achieved. This makes it useful for choosing the best number of clusters, which here could be 47 from the two clusterings under comparison.

Figure 11

DBSCAN

The pie chart shows the proportions resulting from the DBSCAN technique. After setting epsilon to 0.5 and the minimum number of points to 3, the summary table results in 2 clusters with 18 records as noise points.
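With those parameters (epsilon 0.5, minimum 3 points), the scikit-learn equivalent is below, reusing the z-scored matrix Z from the previous sketch:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # eps and min_samples match the settings reported above.
    db = DBSCAN(eps=0.5, min_samples=3).fit(Z)

    # Label -1 marks noise; every other label is a cluster id.
    n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    n_noise = int(np.sum(db.labels_ == -1))
    print(n_clusters, "clusters,", n_noise, "noise points")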

Conclusion

This analysis set out to identify the most frequent violent event type in the world and its probability of increasing.

From what I noticed, all or most of the events carried out by citizens are peaceful protests, and their probability of increasing is high.

The dataset is sufficient to answer that question.

All the models presented offer a means to facilitate the readability of the data and are quite useful for making better decisions. Although there are missing values in the dataset and in the process, they do not affect the result much when handled with the most effective technique.


