Citation
Machine learning algorithms and natural language processing techniques for crime prediction with geo-tagged tweets

Material Information

Title:
Machine learning algorithms and natural language processing techniques for crime prediction with geo-tagged tweets
Creator:
Alsalman, Alanoud
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
2018
Language:
English

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science
Committee Chair:
Altman, Tom
Committee Members:
Biswas, Ashis
He, Liang

Notes

Abstract:
Twitter is one of the top 10 online social networks in the world. Many studies have been conducted using Twitter as a data source; however, not many have deployed the tweets' spatial metadata, such as geo-locations or place names. This research focuses on collecting and analyzing tweets containing location metadata to increase the accuracy of predicting crime types within the neighborhood boundaries of the city of Chicago. Our prediction model is a combination of machine learning and natural language processing techniques. The data used in this model are Chicago's official crime data and the geo-tagged Chicago tweets that we collected. The results are based on three experiments: (1) a baseline model using five classification algorithms with Chicago's official crime data as input to predict crime types; (2) a model using the same classification algorithms but with both the Chicago crime data and the geo-tagged tweets; and (3) ensemble learning applied to the previous two models. As a result, we observed an increase in the performance of the models after adding tweet features as inputs. Our approaches achieve an accuracy as high as 96% in predicting crime categories.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Alanoud Alsalman. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Full Text
MACHINE LEARNING ALGORITHMS AND NATURAL LANGUAGE PROCESSING TECHNIQUES
FOR CRIME PREDICTION WITH GEO-TAGGED TWEETS
by
ALANOUD ALSALMAN
B.S., Qassim University, 2009
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Computer Science Program
2018


© 2018
ALANOUD ALSALMAN
ALL RIGHTS RESERVED


This thesis for the Master of Science degree by Alanoud Alsalman has been approved for the Computer Science Program by
Tom Altman
Ashis Biswas
Liang He
Date: July 20, 2018


Alsalman, Alanoud. (M.S., Computer Science Program)
Machine Learning Algorithms and Natural Language Processing Techniques for Crime Prediction
With Geo-Tagged Tweets
Thesis directed by Professor Tom Altman
ABSTRACT
Twitter is one of the top 10 online social networks in the world. Many studies have been conducted using Twitter as a data source; however, not many have deployed the tweets' spatial metadata, such as geo-locations or place names. This research focuses on collecting and analyzing tweets containing location metadata to increase the accuracy of predicting crime types within the neighborhood boundaries of the city of Chicago. Our prediction model is a combination of machine learning and natural language processing techniques. The data used in this model are Chicago's official crime data and the geo-tagged Chicago tweets that we collected. The results are based on three experiments: (1) a baseline model using five classification algorithms with Chicago's official crime data as input to predict crime types; (2) a model using the same classification algorithms but with both the Chicago crime data and the geo-tagged tweets; and (3) ensemble learning applied to the previous two models. As a result, we observed an increase in the performance of the models after adding tweet features as inputs. Our approaches achieve an accuracy as high as 96% in predicting crime categories.
The form and content of this abstract are approved. I recommend its publication.
Approved: Tom Altman


DEDICATION
I dedicate this work to my dear husband who supported me throughout my entire master studies. I also dedicate it to my mother, father, sisters and brother who supported me throughout my life and motivated me to pursue my graduate education.


ACKNOWLEDGMENTS
This work was supported by the Saudi Arabian Cultural Mission in the United States and by my employers at Qassim University. I want to thank them for giving me this great opportunity to finish my master's degree in the United States. Special thanks to my advisor, Professor Tom Altman; without his invaluable advice and guidance throughout the thesis work, I would not have achieved my goals. I would also like to thank Professor Ashis Biswas for his guidance.


TABLE OF CONTENTS
CHAPTER
I. PRELIMINARIES
Introduction
Hypothesis
Objectives
Problem Statement
Proposed Solution
Limitations
Contributions
II. BACKGROUND
Data Mining
Prediction of Crimes
Social Media Analysis
Literature Review
III. DATA
Datasets
Data Collection Process
IV. METHODS
Data Preprocessing
Feature Extraction
Exploratory Data Analysis
Tweets Corpus
Predictive Model
V. RESULTS AND DISCUSSION
Experiments
Discussion
VI. CONCLUSION AND FUTURE WORK
Conclusion
Future Work
REFERENCES
APPENDIX
A. Tweet Downloader Code
B. Data Integration Code


CHAPTER I
PRELIMINARIES
Introduction
Social media has proven to be a valuable research asset in many fields, such as business and sociology. Recently, some approaches were used to deploy social media in the field of criminology. In this research, Twitter was chosen as the social media source of data to predict the category of crimes occurring in the city of Chicago, IL.
This research uses geo-tagged tweets from Twitter to our advantage in the prediction process. In addition, it explores and applies natural language processing techniques and machine learning algorithms to the crime prediction task and identifies the best prediction accuracy for the study. Furthermore, this work extracts features from both the Twitter data and the Chicago city crimes data and uses them as inputs to our predictive models.
The performances of five classification algorithms are then compared, and this work uses the ensemble learning max voting technique, which takes the prediction results of these five classification algorithms as inputs and determines the best model accuracy. In addition, the performance of our model is analyzed before and after adding Twitter data features as inputs in conjunction with the Chicago city crimes data features.
This study is an interdisciplinary approach between the fields of computer science and criminal justice and introduces an approach for predicting crimes that can help law enforcement agencies prevent crimes faster than traditional crimes prediction approaches.


Hypothesis
By analyzing both the tweets from Twitter and the data concerning Chicago crimes, a crimes prediction model with greater performance can be achieved.
Objectives
This work shows that Twitter text data can be an asset in predicting crimes. It creates a prediction model that, by using Twitter data features along with historical crime data features, can serve as an efficient and accurate prediction tool that police officers may use on a daily basis to prevent crimes by deploying their limited patrols in a smarter and more optimal manner.
Problem Statement
The field of criminology, which is the scientific study of crimes and criminal characteristics [1], is essential for any country's defense system. The task of solving crimes is a difficult and time-consuming task, and the success rate of this task relies on timing and accuracy. There is also the problem of scarce police resources, such as patrols, that must be used in the smartest and the most effective way possible. Consequently, this work assists criminal analysts with their prediction tasks by harnessing Twitter data to obtain better prediction results with the help of data mining techniques.
The advantage of using data mining in the analysis of crimes is that it helps solve crimes and trace criminals faster. The large amount of data in the criminology field makes it a suitable field to apply data mining algorithms.


Proposed Solution
Online social media users tend to share their everyday life activities. In addition, they also share their experiences and what they spotted during their day such as fights and sounds of something breaking. These online posts sometimes carry valuable information about a crime scene or a crime that just occurred. If these online feeds, such as Twitter posts, contain temporal data about the incident they witnessed and spatial information, then the police could incorporate the data into their crime-prediction process.
This study filters Twitter posts, called tweets, based on the boundaries of the city of Chicago to investigate whether adding the textual information carried in these posts can positively affect the performance of predicting crimes. The geo-tagged tweets were specifically studied to detect signs of criminal activities based on tweet features found within the boundaries of the city.
The solution to the problem of accurately and efficiently predicting crimes around the area consists mainly of examining these tweets through statistical analysis and using them as inputs to our prediction models. The model also uses samples of Chicago crime records, open data downloaded from the Chicago Data Portal website [2], as inputs.
Five classification techniques were used in the first experiment based on the features from the historical recordings of Chicago crimes: logistic regression, Gaussian naive Bayes, k-nearest neighbor, support vector machine, and artificial neural networks. Then, the second experiment took both the historical crimes features and the tweet features as inputs to the same classifiers. The last experiment used the ensemble learning's max voting technique to determine the best predictors among these five classifiers.


For our performance analysis, this work compares the results of the prediction model when the tweets are used alongside the Chicago crimes data as inputs and when they are not. In addition, the performances of the five classification models are compared with the results of the ensemble learning max voting model.
Limitations
It is crucial to mention that our model does not serve as a standalone tool. The final decision of the prediction can only be determined by the crimes analysts themselves. Our model is merely a tool to speed up the process of predicting crimes because it is impossible to reach a 100% accuracy of prediction.
Another limitation is that the standard Twitter API allows developers to collect only a free sample of 1% of public Twitter feeds. Thus, the geo-tagged tweets collected in this study make up 1% of all geo-tagged tweets in Chicago.
Contributions
This research increased the performance of crime category prediction in three of the six chosen classification models simply by integrating the tweet text features into our crimes dataset. In addition, adding tweet text features as inputs resulted in a higher prediction accuracy rate than using crime features alone.
In addition, 417.5 MB of raw tweet data (119,196 tweets) were obtained using the Twitter REST API. The collected tweets were tagged with geo-locations within a 15-mile radius of the center of the city of Chicago, and they were posted during the months of November and December 2017.


This data cannot be published to the public due to Twitter's developer policies and agreements on publishing downloadable Twitter content [3]. Similarly, Twitter posts had to be downloaded in this work because no downloadable public data about Chicago's tweets was found anywhere online.
Very few studies researching crimes are based on the geo-locations of tweets. Our approach explores the advantages and uses of the geo metadata attached to a tweet object, and this study is an addition to the few studies that use this strategy.


CHAPTER II
BACKGROUND
Data Mining
Data mining is the science of extracting useful knowledge from many data sources to predict or classify future outcomes. Applications of data mining have spread to many fields, such as biology and psychology, but few data mining studies have been conducted in the field of criminology. The top applications of data mining in criminology are the following:
• Tracking criminals in society using their online feeds.
• Predicting criminal suspects, also known as criminal identification.
• Predicting criminal activities, such as crime times, locations, and types.
• Predicting patterns of crimes, e.g., serial killers and hate crimes.
Machine learning (ML), deep learning, and natural language processing (NLP) are all data mining techniques, and they have some overlap. For instance, ML is sometimes used for NLP tasks, as shown in Figure (1).
Figure (1): Venn diagram showing the overlap between ML and NLP [4]


The next section provides some background information about machine learning algorithms and natural language processing techniques.
Machine Learning
ML is a branch of artificial intelligence concerned with using historical data to extract useful information to predict future events. Machine learning has received much attention in recent years due to its support for big data and its application in a wide range of disciplines, such as marketing, bioengineering, banking, and criminology.
ML algorithms are divided into many types, and the two most common types of ML are supervised learning algorithms and unsupervised learning algorithms. These algorithms use
different approaches to implement a machine learning model of prediction, and they are shown
in Figure (2).
[Figure: unsupervised learning groups and interprets data based only on input data (clustering); supervised learning develops a predictive model based on both input and output data (classification, regression)]
Figure (2): Machine learning types [5]
Supervised Learning Algorithms
In this method of machine learning, the prediction task is performed on a training set with known labels, also known as target values. The labels are the true prediction values of the dataset, also known as the ground truth values. The labels can be categorical, discrete, or continuous values, and they are compared with the predicted values, which are the
outputs of the prediction model. Figure (3) shows the steps of a supervised learning algorithm.
There are two types of supervised learning algorithms: classification and regression. Classification algorithms are used for classification problems where the output is a category or a discrete value. In contrast, regression algorithms are used for regression problems where the output to predict is a real numeric or a continuous value.
Figure (3): Supervised learning model [6]
A classification algorithm is a two-step task. This algorithm must perform a training task first, which trains the prediction model with the training dataset using the train labels. After training the model, a prediction or a classification can be performed on the test dataset. The test dataset samples are treated as inputs to predict the test labels as outputs of the prediction model. The goal of a classification algorithm is to separate the data into categories.
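To make the two steps concrete, the following minimal sketch shows the train-then-predict pattern using scikit-learn, the library used later in this thesis; the classifier choice and the arrays X_train, y_train, and X_test are hypothetical.

    from sklearn.neighbors import KNeighborsClassifier

    # Step 1: train the classifier on samples with known labels.
    clf = KNeighborsClassifier()
    clf.fit(X_train, y_train)

    # Step 2: predict the labels of unseen test samples.
    y_pred = clf.predict(X_test)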


A regression algorithm follows the same steps as a classification algorithm. However, the goal of this algorithm is to fit the outputs to the best-fitting values. The best fit is determined by calculating some type of error and minimizing that error over the regression task.
Unsupervised Learning Algorithms
In this method, the labels are unknown for the dataset used in the prediction model.
The most popular type of unsupervised learning algorithms is the clustering algorithm, which is used for clustering problems where data points are clustered as groups with similar features. Figure (4) shows the steps of an unsupervised learning algorithm.
[Figure: training texts, documents, or images are converted to feature vectors and fed to an unsupervised learning model; a new input's feature vector yields a likelihood, a cluster ID, or a better representation]
Figure (4): Unsupervised learning model [6]
An unsupervised learning clustering algorithm is only a one-step task. Only one dataset
is given as input to the model, instead of having two datasets such as the training and test
datasets. Then, the outputs of the model are given as clusters or groups of data points, and each cluster represents one of the predicted labels.
Natural Language Processing
Natural language processing (NLP) is often used for processing textual data. NLP is also a branch of artificial intelligence that is responsible for making computers understand and process natural human languages, and NLP is frequently used in sentiment analysis as a preprocessing step. NLP is mainly used for lexical analysis tasks, such as part of speech (POS) tagging, named entity recognition (NER), and topic modeling (TM).
In this study, some NLP techniques are used to preprocess the data of the tweets.
Prediction of Crimes
Introduction
Merriam-Webster defines a crime as an illegal act for which someone can be punished by the government [7], and criminology is the scientific field of identifying crimes and criminal characteristics.
Crime analysis is the process of exploring the behavior of crimes, detecting crimes, and finding correlations between the crimes and the criminals. Within that process, many types of crime prediction techniques have emerged. The two major types of crime prediction are crime hotspots prediction and crime pattern prediction. Crime hotspot prediction measures the density rate of a specific crime in a specific boundary of a location, while crime pattern prediction is the process of predicting the type of crimes in a specific area and time.


Crime hotspot prediction
The police often use more patrols in the areas that are more crime-prone, and these crime-prone areas with high rates of crime are called crime hotspots. How do they know about these places? Some police officers are keenly aware of the areas with higher crime rates in the city. However, there are other ways to identify these spots besides patrolling. Law enforcement employees who do not patrol, such as crime analysts, can detect these trends by two methods:
1. Using maps and geographic systems.
2. Using statistical tests.
These are based on the information available at the National Institute of Justice [8].
1. Maps:
Specialists create density maps to monitor crimes and show where crimes occur within the boundaries of a given city. These maps show the blocks with the most crimes as well. Analysts use geographic information systems (GIS) to visualize the crime hotspots by combining street maps, the data about crime and public disorder, and data about other features, such as stores and bus stops. The GIS shows these spots as grids with colors that identify the severity or concentration of the crimes in each cell of the grid.
Crime maps are categorized into the following:
• Points: They convey the exact location of a crime hotspot, which is an exact place where crimes usually occur on a regular basis.


Figure (5): Points in a grid map
• Street segment: It is a line that shades an entire street location, meaning that the places along that street are crime-prone places.


Figure (6): Street segment in a grid map
• Shaded area: A shaded area is a shaded cell on the grid of the city map that shows the distribution of crimes. Each cell could represent a district, a suburb, or a cluster of blocks. If that cell is shaded, then it requires police attention, as it indicates that the crime concentration in this cell is high.
Figure (7): Shaded area in a grid map
• Density Surface: It is represented as color gradients in a map. This shows the crimes concentration by showing the inner dark areas of the gradient as high-risk areas, whereas the outer lighter colors are lower risk areas.


Figure (8): Density surface on a grid map
2. Statistical tests:
This is the use of computer software to analyze crime data and geographical data to identify crime hotspots.
Other crimes predictions
In addition to the hot spot prediction of crimes, which mostly relies on clustering, there is also crime type prediction, which helps predict what type of crime will occur in a specific time and place. This study is focused on this type of prediction, and this prediction type is addressed in the methods section of this study.
Social Media Analysis
Introduction
Nowadays, online social media is being deployed to analyze and predict crimes. The police and law enforcement units are collecting data from online interactions in social media, such as Twitter, Facebook, and personal blogs. They try to follow gang members, criminal organizations or suspected terrorists.
Twitter
Twitter is one of the top 10 microblogging websites in the world with over 12 TB of data generated daily, according to Alexa's ranking website [9]. Its main form of interaction between users is called a "tweet". A tweet is a 140-character message that gets posted and shared either
publicly or privately in the Twitter space. On November 7, 2017, Twitter launched 280-character tweets [10]. Users of all ages and nationalities post about 500 million tweets per day, and the latest character increase allowed users to double the length of their published thoughts and opinions.
Twitter Developer APIs
Twitter provides application program interfaces (APIs) for developers to manage Twitter data. These APIs connect to the Twitter server via HTTP operations. There are two APIs available on Twitter, the REST API and the Streaming API [11]. To use these APIs, the programmer or developer must first establish authentication with the server using OAuth credentials. OAuth is "An open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications" [12]. There are two types of OAuth credentials: application-only authentication and application-user authentication. Each user must register an application through their Twitter account, for which they have unique credentials.
The HTTP methods used in Twitter's API are GET and POST. GET is simply responsible for fetching data, while POST is responsible for sending data. Both GET and POST use Twitter API resources, such as GET search/tweets and POST statuses/update. This study involved searching and collecting tweets. Thus, only GET search/tweets operation was needed.
1. REST API: It allows the programmer to search terms or get tweets filtered based on specific parameters. REST API does not provide live streaming data, but it is useful for analyzing historical data.


Figure (9): Twitter REST API [13]
2. Streaming API: It lets the programmer receive live streaming tweet data per request and keeps sending it until stopped. This is useful when one wants to run analytics over live campaigns on Twitter rather than historic ones.
Figure (10): Twitter Streaming API [14]


Twitter analysis applications
Twitter data applications are increasing. Some of the most notable applications for this data are as follows:
• Extracting Flu-related tweets to predict the flu epidemic trend in a region [15].
• Stock market prediction [16].
• Predicting the results of elections [17].
Literature Review
Crimes Prediction
Besides traditional techniques for crime prediction, there are also new research papers on using data mining techniques for the same purpose. A survey paper, "Survey on Crime Analysis and Prediction Using Data Mining Techniques," was published in 2017 [18]. In this paper, the authors split crime analysis procedures into two types based on the data mining approaches used: crime prediction by classification techniques and finding crime clusters by clustering techniques.
This study uses classification techniques to predict the categories of crimes based on spatial and temporal features within a given location.
Sharma [19] proposed a crime prediction method, a tool called Z-Crime, that can be used for detecting suspicious emails by enhancing the ID3 decision tree algorithm. His enhanced method produces a faster and better decision tree: he improved the tree's feature selection method based on an improved information entropy. Thus, the resulting algorithm is a combination of the improved ID3 decision tree algorithm and the improved feature selection method.


Another method proposed by Hamdy et al. [20] is based on social media users' interactions
and mobile usage data, such as locations and call logs. They used social feeds and mobile data to predict suspicious behaviors and movements of individuals. Furthermore, their new model called the Criminal Act Detection and Identification Model can help crime analysts make better decisions via a sequence of inference rules to perform behavioral analyses.
Agarwal et al. [21] developed a tool for crime analysis using the k-means clustering algorithm. They used it to predict crime rates based on a spatial distribution of crime data and were able to predict homicide rates in a yearly pattern.
Twitter Prediction
The most notable paper that employs Twitter data in crime prediction is Gerber's "Predicting Crime Using Twitter and Kernel Density Estimation" [22]. His approach was based on geo-tagged tweets, and he used kernel density estimation to predict crime types in various areas of Chicago. The results showed improvements in 19 out of 25 crime types when tweet features were added to the model.
Then, Xinyu Chen et al. in "Crime Prediction Using Twitter Sentiment and Weather" [23] also cited Gerber's work and used it as a benchmark model of prediction. They based their approach on sentiment polarity of the tweets and some weather data. Their results have surpassed the results of the benchmark model of choice, Gerber's Kernel Density Estimation model.
Our approach is inspired mainly by Gerber's approach. However, instead of using clustering algorithms, classification algorithms are used in this work. Time series types of data are still used, but the timestamps are considered as merely features used in the classification task.


CHAPTER III
DATA
Datasets
The data that served as inputs to the prediction model are the Chicago Data Portal crimes dataset for the year 2017 and the tweets collected from Twitter during November and December 2017. A description of each is provided in the following sections:
Chicago Crimes Dataset 2017
This dataset was semi-structured as a comma-separated value (CSV) file. It was easily transformed into a structured pandas data frame object during the preprocessing stage. Before preprocessing, the crime records for the months of November and December numbered 40,583; after preprocessing, 39,225 remained. After the data integration process, that number was reduced further to 24,462.
November and December tweets of 2017
Tweets in their raw form are considered unstructured data. However, since they were downloaded as JavaScript Object Notation (JSON) files and converted into CSV files, they became semi-structured data. Further transformations performed during the preprocessing stage turned the tweets dataset into structured data in a pandas data frame format. In total, 121,853 raw tweets were collected; after preprocessing, 114,095 tweets remained. That number was reduced further by the integration process to 24,462 tweets to match the number of crimes.


First: Twitter Data
All Twitter data consisted of several data objects and their attributes contained in a JSON text format. These data objects are the following:
• Tweet Objects
• User Objects
• Entities Objects
• Extended Entities Objects
• Geospatial Objects
This study is interested in the tweet text and geospatial objects only.
Tweet objects
Twitter's main data object is a Tweet. A tweet is textual content that is being shared on the Twitter microblogging website, and it can also contain images, URLs, and GPS locations. A tweet also contains many metadata contents, such as the tweet's time of publication, its owner, location, retweets count, and other information related to the tweet.
Geospatial objects
There are two types of geographical metadata:
1. Tweet location (also known as a geo-tagged tweet): A tweet is called a geo-tagged tweet when it is associated with a location. A tweet location can be either an exact Point location or a Twitter Place with a bounding box. These locations are represented as a set of longitude and latitude coordinates. About 1-2% of all tweets are geo-tagged using one of the methods, and these geospatial objects are saved as a data dictionary in a JSON format type of file.


2. User location: The location of the user's hometown is found in their public profile, but
this work is only interested in studying geo-tagged tweets.
Second: Chicago Crime Data
The city of Chicago, IL, was chosen as the city whose crime data would be used to evaluate our prediction model. Chicago has long been a center of attention for crime-related topics because it scores high in crime and never stays a low-crime-rate city for long. The reasons for choosing it as the crime city of our prediction approach are the following:
• Chicago has always had a higher than average crime rate among cities in the United States, especially for violent crimes such as homicide.
• US cities witnessed an increase in murder rates in 2016, and almost half of that increase occurred in Chicago.
• Availability of large amounts of publicly available (high quality) crime datasets.
In 2016, Chicago witnessed a horrific spike of 50 percent in the murder rate from its previous year, and it was ranked the 24th most dangerous city in the United States [24].
Third: Community Areas
The last data needed were shapefile data from all 77 community areas of Chicago. This file contains the geometrical coordinates of points that form a multipolygon area on the map for each one of these community areas of Chicago. As mentioned earlier, the tweets that are located within a 15-mile radius from the center point of Chicago were downloaded. The Chicago city center point has a latitude of 41.881832 and a longitude of -87.623177. This shapefile was needed to determine the origin of each tweet among the 77 community areas.


Data Collection Process
Twitter's developer policy prohibits the distribution of data downloaded from Twitter as a data source. Consequently, no historical tweets were available for download from within the boundaries of Chicago. Twitter strictly states on its Developer Terms page, under the Redistribution of Twitter Content term, that developers who have downloaded Twitter content may not share more than 50,000 tweets with any group or individual, nor make the collected data publicly available [25].
For the previously stated reasons, a downloading schedule was used to collect the tweets manually.
The Downloading Time Schedule
A Tweet Downloader program written in Python using the Twitter REST API and the Tweepy library [Appendix A] was used. The downloader was run on Monday and Tuesday of each week starting from November 1, 2017, until December 31, 2017. The downloading time took up to a day and a half to download the previous week's tweets, that is, about 36 straight hours. The downloader stores the tweets backward starting from the most recent tweet going toward the oldest tweet of the previous week.
The training window was the two-month period from November 1, 2017, to December 31, 2017. The first tweet collected was posted at 10:11:12 a.m. Chicago time, and the last tweet was posted exactly at midnight on December 31, 2017.
Building the Downloader
To build the downloader program, the following steps were followed:
1. Create a Twitter account by signing up.


2. Create an application by registering for one connected to your Twitter account.
3. Once the application is created, obtain the consumer key, consumer secret, access token, and access token secret. Store these in a safe place, such as a confidential text file.
4. Now that these credentials are stored on a computer in a JSON file format, they can be used to authenticate a connection to the Twitter data source, either directly through Twitter's APIs or indirectly through helper libraries such as the one used here, Tweepy, a Python module that provides access to Twitter's RESTful API methods (a sketch follows this list).
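A minimal sketch of steps 3 and 4, assuming Tweepy 3.x; the file name twitter_credentials.json and its key names are hypothetical, not the exact Appendix A code:

    import json
    import tweepy

    # Load the credentials saved in step 3; application-only
    # authentication uses only the consumer key and secret.
    with open("twitter_credentials.json") as f:
        creds = json.load(f)

    # Application-only authentication, as used in this work, raises the
    # search rate limit to 450 requests per 15-minute window.
    auth = tweepy.AppAuthHandler(creds["consumer_key"], creds["consumer_secret"])

    # wait_on_rate_limit makes Tweepy sleep until the next window
    # whenever the rate limit is reached.
    api = tweepy.API(auth, wait_on_rate_limit=True)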
Registering a Twitter Application
To retrieve tweets through our Tweet Downloader program, either the User Authentication keys and secrets or the Application Authentication keys and secrets can be used. The Application Authentication was used because more tweets can be downloaded per given window of time. The rate limit for the Application Authentication method is 450 requests per 15-minute interval, while the User Authentication method allows only 180 requests per interval. This rate limit applies to the search/tweets operation, which was used for filtering and gathering tweets. Consequently, the Application Authentication can run 270 more requests per interval than the User Authentication, which is the main reason an application was created and its authentication used for collecting tweets. If the rate limit is reached before the window ends, the downloader waits for the subsequent window to start.


Collecting Geo-Tagged Tweets
To collect tweets that contain location data within the Chicago area, the following information is needed:
• The coordinates (longitude and latitude) of Chicago.
• Understanding of the geographic information provided as metadata within the tweets, such as geocode, reverse geocode, coordinates, bounding box, and place.
• Search criteria to filter the tweets to download. Our criteria were tweets located within the longitude and latitude box of Chicago and published between November 1 and December 31, 2017 (a sketch follows this list).
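A sketch of such a filtered search using Tweepy's Cursor over the GET search/tweets endpoint; the empty query string (relying on the geocode filter alone) and the helper name are assumptions, not the exact Appendix A code:

    import tweepy

    # Chicago city-center coordinates with the 15-mile collection radius,
    # in the geocode format expected by GET search/tweets.
    CHICAGO_GEOCODE = "41.881832,-87.623177,15mi"

    def collect_geo_tagged(api, max_tweets=1000):
        """Collect tweets near Chicago; 'api' is the authenticated
        tweepy.API object from the earlier sketch."""
        tweets = []
        # REST search only reaches roughly the previous week of tweets,
        # hence the weekly downloading schedule described above.
        for status in tweepy.Cursor(api.search, q="", geocode=CHICAGO_GEOCODE,
                                    count=100).items(max_tweets):
            # Keep only tweets carrying explicit geo metadata.
            if status.coordinates or status.place:
                tweets.append(status._json)
        return tweets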


CHAPTER IV
METHODS
Data Preprocessing
Two different datasets were used: the crime data and the Twitter data. The crime dataset was already cleaned and semi-structured, and it needed only a small amount of preprocessing to obtain the final dataset. In contrast, the collected Twitter data was raw.
First: Tweets
After collecting raw twitter data, the data was cleaned and prepared for analysis and prediction. The steps taken to preprocess the data and obtain a cleaned dataset ready for analysis are the following:
1. Converting the raw tweets from a JSON format data to a Pandas data frame data structure.
2. Dropping duplicate tweets, keeping the oldest copy of each, which allows for the exclusion of spam, advertisements, and retweets from the original tweets.
3. Dropping samples with null features.
4. Changing the time zone. Each tweet's posting time is stored in its metadata in the GMT time zone; it needed to be converted to the US Central time zone, i.e., Chicago's time zone.
5. After setting the time zone, the tweets were filtered to keep only those published between 00:00:00 November 1, 2017, and 23:59:59 December 31, 2017 (a sketch of these steps follows).
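A minimal pandas sketch of steps 1 through 5, assuming the raw JSON records carry 'text', 'created_at', and 'geo' fields; the field names are assumptions:

    import pandas as pd

    def preprocess_tweets(raw_records):
        df = pd.DataFrame(raw_records)                        # step 1: JSON -> data frame
        df = df.sort_values("created_at")
        df = df.drop_duplicates(subset="text", keep="first")  # step 2: keep oldest copy
        df = df.dropna(subset=["text", "created_at", "geo"])  # step 3: drop null features
        # Step 4: timestamps arrive in GMT; convert to Chicago time.
        df["created_at"] = (pd.to_datetime(df["created_at"], utc=True)
                              .dt.tz_convert("America/Chicago"))
        # Step 5: keep only tweets posted in November-December 2017.
        start = pd.Timestamp("2017-11-01 00:00:00", tz="America/Chicago")
        end = pd.Timestamp("2017-12-31 23:59:59", tz="America/Chicago")
        return df[(df["created_at"] >= start) & (df["created_at"] <= end)]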


After those steps, the preprocessing of the tweets was finished, resulting in 114,095 tweets in total instead of the 121,853 unprocessed tweets originally collected. The next step was the extraction of new features and the dropping of unwanted columns.
Second: Crimes
The Chicago Data Portal website had two datasets of crimes: one held the homicide records from 2001 to the present, and the other held all other types of crimes recorded from 2001 to the present. The crimes were filtered to the year 2017 and then downloaded through an interface provided on the website. After downloading the two 2017 crime datasets from the Chicago Data Portal, the data were semi-structured. However, some preprocessing steps were still needed:
1. Reading the CSV file and converting the data into a pandas data frame object.
2. Concatenating the two datasets, the crimes and the homicides, into one dataset; and then sorting the samples by date/time.
3. Filtering the crimes and storing only the crimes which were recorded during the months of November and December in 2017.
After these steps, the crimes dataset was preprocessed and contained 40,583 records of crimes. However, after the data integration process, that number was reduced to 24,462 records of crimes. The next step was extracting new features and dropping unwanted columns.


Feature Extraction
Our original datasets had many different features. Many of them had to be discarded; only the tweets' texts, time features, and some of the place features were kept. Each dataset underwent numerous feature extractions and column drops.
First: Tweets
Based on the tweet's time of posting, several time-related features were extracted: the day of the week, the hour, a time-of-day bin (morning, afternoon, evening, or night), and the month (11 or 12); the date was maintained as a date-time format object.
Binning the time of day was based on dividing the 24 hours of the day into the following four parts:
• Morning: from 6:00 a.m. to 11:59 a.m.
• Afternoon: from 12:00 p.m. to 5:59 p.m.
• Evening: from 6:00 p.m. to 11:59 p.m.
• Night: from 12:00 a.m. to 5:59 a.m.
After extracting those time features, their values were converted from categorical to numerical values (a sketch follows).
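A sketch of the binning and conversion; the numeric codes (0 = night through 3 = evening) are an assumed encoding, since the thesis does not specify the exact mapping:

    import pandas as pd

    def time_bin(hour):
        """Map an hour of the day (0-23) to a numeric time-of-day bin."""
        if hour < 6:
            return 0   # night: 12:00 a.m. - 5:59 a.m.
        if hour < 12:
            return 1   # morning: 6:00 a.m. - 11:59 a.m.
        if hour < 18:
            return 2   # afternoon: 12:00 p.m. - 5:59 p.m.
        return 3       # evening: 6:00 p.m. - 11:59 p.m.

    def add_time_features(df):
        """'created_at' is assumed to be a timezone-aware datetime column."""
        df["hour"] = df["created_at"].dt.hour
        df["week_day"] = df["created_at"].dt.dayofweek + 1  # 1 = Monday ... 7 = Sunday
        df["month"] = df["created_at"].dt.month
        df["times_number"] = df["hour"].map(time_bin)
        return df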
The most important features that were already present in the raw tweets are the time and date of when the tweet was created, the coordinate features, and the tweet text. Thus, the missing values (based on encountering one of these three features as missing) were handled by deleting that sample from the data. Most of the samples did contain the time and text. Consequently, only the samples that were missing the geo/location coordinates were deleted.


There were several coordinates systems represented within the tweet data, and only the
geo-coordinates of that tweet were kept. Then, the longitude and latitude features were extracted. There was no point in having multiple coordinate features, and only one coordinate system and its features were needed.
Finally, the original raw tweets data had 176 features. That number was reduced to only three columns. Next, seven new features were extracted. Therefore, the following 10 features in our tweets dataset were obtained: the text, created_at, date, time, hour, day, month, geo.coordinates, longitude, latitude. Only those features were needed for the data integration process. Afterward, only the text was needed to extract textual features.
In the end, the most important information from our tweets dataset was the text. WEKA, a data mining software, was used to preprocess the text, and this is explained further in the tweets corpus section. The other features in the tweets helped correlate the tweets data with the data concerning the crimes. Then, both the tweets and the crimes datasets were combined into one dataset with the same features. However, when predicting crimes without using the features extracted from the tweet text, which were 1,000 features, these features were simply excluded from our full dataset.
Second: Crimes
From the original features, some temporal features of the crimes dataset were extracted; these were exactly the same as the ones extracted for the tweets dataset, i.e., the five temporal features.
There were several spatial features in the dataset such as districts, wards, blocks, location description, and community areas. Since the shapefile for the community areas of Chicago was
already obtained, the community area feature was kept along with the location points, latitude, and longitude features, and the other location features were deleted.
The most important feature of the crimes dataset was the primary type feature, i.e., the exact crime committed in that recorded sample. This feature serves as the target value for our prediction process, and it was stored as string categories. They were then converted to numerical values for use in our predictive model and its evaluation.
The original crimes data from the Chicago Data Portal website consisted of 22 features, and the number of original features was reduced to nine. Then, five temporal features that were extracted from the data were added. Consequently, the total number of features in the crimes dataset is 14.
After integrating the crimes and tweets datasets into one full dataset, nine features were extracted from crimes and 1,000 features were extracted from the tweets. The nine features are represented in Table (1):
Table (1): Descriptions of crime features
Week_day: An integer representing the day of the week, from 1 (Monday) to 7 (Sunday).
Time: The hour of the day in 24-hour format.
Times Number: The numeric time bin of the day, as explained earlier.
Domestic: A Boolean value indicating whether the incident was domestic-related.
Arrest: A Boolean value indicating whether an arrest was made.
Beat: The beat number where the incident occurred. A beat is a small police geographic area; each beat has a dedicated police car.
Community Area: A number from 1 to 77 indicating the community area.
Longitude: The longitude of the crime location.
Latitude: The latitude of the crime location.


Exploratory Data Analysis
This section presents some exploratory data analysis to explain both our Chicago tweets and crimes datasets.
First: Tweets
Figure (11): Number of tweets posted during each day of November and December 2017
Figure (11) shows a line plot of the number of tweets posted on each day of November and December 2017 within a 15-mile radius of the center of Chicago. December 11, Christmas Eve, and Christmas Day had the lowest numbers of tweets posted, while there were spikes in the number of tweets on November 11 and 12 and on December 1 and 31.


Figure (12): Number of tweets posted on each day of the week during November and December 2017
Next, a histogram of the number of tweets per day of the week is plotted in Figure (12). The figure shows that most tweets were posted on Saturdays, while fewer were posted on Mondays and Tuesdays. Most of the tweets were posted on weekend days.


Figure (13): Scatter plot of all collected tweets within a 15-mile radius of the Chicago center point
Figure (13) shows a scatter plot of tweets within a 15-mile radius around the center point of Chicago. Most of the points are in the metro area of Chicago.
Second: Crimes
There are 30 categories of crimes in the crimes dataset. These categories are the target classes that are predicted in this work; predicting crime types is, therefore, a multi-class problem. These crime types and their frequencies are represented in Table (2). Some of the frequencies are very low, whereas others are very high, indicating that this data is not random.
Table (2): Crimes frequencies per category of crime
Frequencies
THEFT 0.240985
BATTERY 0.185330
CRIMINAL DAMAGE 0.109365
ASSAULT 0.072653
DECEPTIVE PRACTICE 0.066195
OTHER OFFENSE 0.063938
BURGLARY 0.048679
ROBBERY 0.044803
MOTOR VEHICLE THEFT 0.043068
NARCOTICS 0.043022
CRIMINAL TRESPASS 0.025672
WEAPONS VIOLATION 0.017630
OFFENSE INVOLVING CHILDREN 0.008344
CRIM SEXUAL ASSAULT 0.005811
PUBLIC PEACE VIOLATION 0.005652
INTERFERENCE WITH PUBLIC OFFICER 0.004091
SEX OFFENSE 0.003546
PROSTITUTION 0.002775
HOMICIDE 0.002514
ARSON 0.001682
KIDNAPPING 0.000726
GAMBLING 0.000722
LIQUOR LAW VIOLATION 0.000715
STALKING 0.000673
INTIMIDATION 0.000578
OBSCENITY 0.000310
CONCEALED CARRY LICENSE VIOLATION 0.000261
NON-CRIMINAL 0.000136
OTHER NARCOTIC VIOLATION 0.000042
PUBLIC INDECENCY 0.000038
HUMAN TRAFFICKING 0.000030
NON-CRIMINAL (SUBJECT SPECIFIED) 0.000011
The plots in Figure (14) show the number of crimes per category. Some of these crimes occurred only once or twice during our two-month period.


Figure (14): Crimes numbers per category of crime.


These crimes and their frequencies in each of the 77 communities of Chicago are
represented in Table (3).
Table (3): Crime frequencies per community area
Community Area  Counts  Freqs
1  631  0.015548
2  587  0.014464
3  521  0.012838
4  320  0.007885
5  200  0.004928
6  773  0.019047
7  776  0.019121
8  2078  0.051204
9  33  0.000813
10  197  0.004854
11  153  0.003770
12  105  0.002587
13  157  0.003869
14  424  0.010448
15  555  0.013676
16  421  0.010374
17  218  0.005372
18  39  0.002193
19  702  0.017298
20  199  0.004904
21  402  0.009906
22  715  0.017618
23  1187  0.029249
24  1215  0.029939
25  2326  0.057315
26  721  0.017766
27  643  0.015844
28  1421  0.035015
29  1353  0.033339
30  649  0.015992
31  382  0.009413
32  1798  0.044304
33  294  0.007244
34  177  0.004361
35  422  0.010398
36  95  0.002341
37  126  0.003105
38  584  0.014390
39  262  0.006456
40  462  0.011384
41  261  0.006431
42  540  0.013306
43  1377  0.033930
44  983  0.024222
45  191  0.004706
46  663  0.016337
47  46  0.001133
48  189  0.004657
49  1001  0.024666
50  160  0.003943
51  267  0.006579
52  192  0.004731
53  552  0.013602
54  198  0.004879
55  94  0.002316
56  289  0.007121
57  131  0.003228
58  367  0.009043
59  132  0.003253
60  236  0.005815
61  656  0.016164
62  144  0.003548
63  341  0.008403
64  123  0.003031
65  294  0.007244
66  810  0.019959
67  949  0.023384
68  851  0.020969
69  1030  0.025380
70  377  0.009290
71  1052  0.025922
72  121  0.002982
73  465  0.011458
74  88  0.002168
75  336  0.008279
76  226  0.005569
77  402  0.009906


In addition, Figure (15) shows the density of total crimes per community area represented as a scatter plot. Larger circles indicate higher crime rates in that area, and the three largest circles indicate the areas where crimes occurred most frequently. Two of these areas are in the metro area of Chicago. Area 25, Austin, had the highest crime rate in Chicago, followed by area 32, the Loop, and area 28, the Near West Side.
Figure (15): Crime density per community area as scatter plot density circles
Figure (16) shows a line plot of the number of crimes that occurred on each day of November and December in 2017. An interesting observation is that on December 25, the number of crimes was significantly low, and the subsequent days up until the end of the year also had low crime counts compared to the other days. Finally, on Thanksgiving, November 23, the number of crimes was slightly higher than during the days following Christmas Day, but that day still had a low crime rate. In conclusion, crimes in Chicago are less likely to happen during holidays.
Figure (16): Number of crimes reported during every day of November and December in 2017
Tweets Corpus
Dealing with boundaries
Before preprocessing the tweet text, community area names and numbers were assigned to the tweets. This was performed by checking the longitude and latitude of each tweet to determine whether it falls within a given community area of Chicago (a sketch of this check follows). Among the 114,095 tweets collected within the 15-mile radius of Chicago, 105,008 geo-tagged tweets fell within the boundaries of Chicago's community areas.
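A sketch of this point-in-polygon check using geopandas and shapely; the shapefile path and the 'area_numbe' and 'community' column names follow the Chicago Data Portal export but should be treated as assumptions:

    import geopandas as gpd
    from shapely.geometry import Point

    areas = gpd.read_file("chicago_community_areas.shp")

    def assign_community_area(longitude, latitude):
        """Return the (number, name) of the community area containing
        the point, or (None, None) if it lies outside all 77 areas."""
        point = Point(longitude, latitude)
        for _, row in areas.iterrows():
            # Each community area is a (multi)polygon geometry.
            if row.geometry.contains(point):
                return int(row["area_numbe"]), row["community"]
        return None, None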


Text preprocessing
The texts of the 105,008 tweets were used as a corpus to perform some text preprocessing techniques on them. Each tweet text served as a document in the corpus.
For preprocessing the tweet texts, WEKA version 3.8.2 was used [26]. WEKA is a popular data mining software package written in Java and developed by the University of Waikato, and it can be used through either the command line or the GUI. It is a fast and efficient way of preprocessing text.
First, the corpus was converted from a CSV format to an attribute-relation file format (ARFF), which is the data format suitable for WEKA. Second, the ARFF file was uploaded into WEKA and its StringToWordVector filter was used to filter the text using several text preprocessing techniques as parameters to the filter. The parameters set are the following:
• Stemming: Transforms each word to a crude heuristic root form, chopping it down to a shorter string (e.g., 'are' becomes 'ar').
• Stop word removal: The process of removing English stop words, i.e., commonly used words that can be ignored (e.g., 'the', 'a', 'an').
• Lower casing: This is the process of converting all the letters in the words to lower case letters.
• Words to keep: Since including every single word of the texts of the tweets would result in a large data file, this option was set to 1,000. Consequently, the 1,000 most common words were used in the corpus.
Finally, the filter was applied, resulting in 1,000 useful words serving as features that were extracted from each tweet in our corpus. These features contain either number 1 that indicates
the presence of a word or 0 that indicates the absence of a word for each corresponding sample row. Thus, a features and samples matrix of size 105,008 times 1,000 was constructed. Once the tweets had been tokenized and filtered, they were saved as a CSV file. Then, they were converted into a pandas data frame object format with the words features as the columns and the samples as the rows.
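For readers without WEKA, an equivalent binary bag-of-words matrix can be sketched with scikit-learn's CountVectorizer; this is a substitute for WEKA's StringToWordVector, not the tool actually used, it omits the stemming step, and tweet_texts is assumed to be the list of 105,008 tweet strings:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Lowercasing, English stop word removal, the 1,000 most common
    # words, and binary=True for the 1/0 presence features above.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english",
                                 max_features=1000, binary=True)
    word_matrix = vectorizer.fit_transform(tweet_texts)  # 105,008 x 1,000

    # Words as columns, tweet samples as rows.
    features = pd.DataFrame(word_matrix.toarray(),
                            columns=vectorizer.get_feature_names_out())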
Data Integration
There was a challenge when trying to fuse the crimes data with the tweets text features. The problem was that there were many more tweets than crimes samples. Thus, a naive approach for the data integration of these two datasets was developed [Appendix B]. The approach involved finding the tweets that were posted within a specific community area and time bin, resulting in a general, but faster, correlation with a wide range of crimes recorded.
After integrating the crimes with the tweets data, the number of resulting samples was 24,462. This number is much smaller than the 39,225 crime records and the 105,008 tweets because crimes were often active in different areas and time bins than the geo-tagged tweets, and vice versa.
After combining the two datasets, the crimes and the tweets, into one dataset, there were a total of 1,009 features and 24,462 samples to study (a sketch of the integration follows).
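A sketch of the naive integration idea described above (the actual code is in Appendix B); the column names 'community_area', 'times_number', and 'crime_id' are assumptions:

    import pandas as pd

    def integrate(crimes, tweets):
        """Join crimes and tweets that share a community area and a
        time bin, keeping one matched tweet per crime record."""
        merged = crimes.merge(tweets, on=["community_area", "times_number"],
                              how="inner", suffixes=("_crime", "_tweet"))
        return merged.drop_duplicates(subset="crime_id")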
Predictive Model
Train and test datasets
After preprocessing the data and obtaining our final prepared dataset, the data was first split into two parts: a 20% validation set and an 80% training and testing set. On the remaining 80% of the data, 10-fold cross-validation splitting was performed using a Python library
called scikit-learn [27]. Thus, the training and test datasets were split into 10 equal partitions,
called folds, with shuffling, and each fold was used in our training and prediction tasks during model tuning to determine which parameters yielded the best testing accuracy score. Then, the best-tested parameters were used to perform predictions on the validation set (a sketch of this splitting scheme follows).
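A sketch of this splitting scheme with scikit-learn; the random seed is an arbitrary assumption, and X and y are assumed to be the feature matrix and crime-category labels as arrays:

    from sklearn.model_selection import KFold, train_test_split

    # 20% held out for validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.20, shuffle=True, random_state=42)

    # 10 shuffled folds on the remaining 80% for model tuning.
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    for train_idx, test_idx in kfold.split(X_train):
        # Each fold trains on 9 partitions and tests on the tenth.
        X_tr, X_te = X_train[train_idx], X_train[test_idx]
        y_tr, y_te = y_train[train_idx], y_train[test_idx]
        # ... fit and score each candidate model here ...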
Our experiments
Next, our experiments were performed. Our baseline experiment used the crime data features as inputs to five classification models: logistic regression (LR), naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM), and artificial neural network (ANN). Our second experiment used the crime features along with the extracted tweet features as inputs to the same five models. The third experiment was not a standalone experiment; it can be seen as a sixth model used in both of the previous experiments. It uses ensemble learning (EL), which is a supervised learning technique just like the other five classification models, and it was included as a sixth classifier in both experiments one and two.
The models' parameters
Before running the experiments, the 10-fold cross-validation split with shuffling was used to tune and test the parameters of each model. The following are the models and the best parameters chosen for prediction accuracy (a consolidated sketch follows the model descriptions):
Logistic regression
Logistic regression is a linear model. The significant parameter is C, the inverse of regularization strength; the scikit-learn library provides a regularized logistic regression model. C was set to 0.1, 1.0, and 10, and the results were best when C was equal to 1.0. The strategy used for this model is the default one-vs-rest scheme.
Gaussian naive Bayes
Gaussian naive Bayes is a probabilistic classifier based on Bayes' theorem. No parameters needed tuning here.
K-nearest neighbor
K-nearest neighbor is used both for classification, which is the objective of this work, and for regression. The most significant parameter is k, the number of neighbors. The best k could be determined by grid search; however, a common heuristic is to set k to the square root of the number of samples in the current dataset. Our cross-validation splits had 17,613 samples, whose square root is about 133. This model was trained and tested with lower and higher k values, but it had the best results when k was 133.
Support vector machines
There was not sufficient time to test the full range of parameters for SVM, so a small number of parameters was tested to keep the run time manageable. The penalty parameter C, also called the misclassification parameter, was set to 1, 5, and 10 on the radial basis function (RBF) kernel, and the best results were obtained when C was equal to 5. The gamma parameter was left at its default value of auto, which sets gamma to 1/n_features. In the first experiment there are 9 features, so gamma equals 0.1111, whereas in the second experiment there are 1,009 features, so gamma equals 0.00099.
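A sketch of the chosen configuration:

from sklearn.svm import SVC

# RBF kernel with the best-performing misclassification penalty C=5;
# gamma='auto' sets gamma to 1/n_features (1/9 in experiment one,
# 1/1009 in experiment two).
svm = SVC(C=5, kernel='rbf', gamma='auto')
svm.fit(X_train, y_train)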
Artificial neural network
ANN is a deep learning technique that can be used for classification tasks. In our study, this model was trained and tested with several parameter settings: one, two, three, or four hidden layers with 10, 20, 30, 40, 50, 60, or 100 neurons each; initial learning rates of 0.01, 0.1, and 0.2; and maximum iterations of 200, 400, 500, 700, and 1,000. The rectified linear unit (ReLU) was chosen as the activation function. The best parameters for this model were three hidden layers of 30, 60, and 100 neurons, respectively, an initial learning rate of 0.01, and a maximum of 500 iterations.
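A sketch of the best configuration using scikit-learn's MLPClassifier:

from sklearn.neural_network import MLPClassifier

# Three hidden layers of 30, 60, and 100 ReLU neurons, initial learning
# rate 0.01, and at most 500 iterations.
ann = MLPClassifier(hidden_layer_sizes=(30, 60, 100), activation='relu',
                    learning_rate_init=0.01, max_iter=500)
ann.fit(X_train, y_train)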
Ensemble learning
Ensemble learning is a collection of models working together to give one output, and there are many ensemble learning techniques, such as bagging, boosting, stacking, and voting [28]. In our study, the max voting technique, also called hard voting, was used: it compares the target value predicted for each sample by the different classification models and chooses the value with the most votes. There were 30 categories of crimes, which were the classes in the target of the dataset to be predicted.
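A sketch of hard voting over the five classifiers defined above (nb is the parameter-free Gaussian naive Bayes model):

from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
# Each fitted model casts one vote per sample; the majority class wins.
ensemble = VotingClassifier(
    estimators=[('lr', lr), ('nb', nb), ('knn', knn), ('svm', svm), ('ann', ann)],
    voting='hard')
ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(X_val)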
Model evaluation metrics
After acquiring the results of each experiment, the results were compared using the following classification evaluation methods:
Accuracy score
It is the ratio of correctly predicted samples: the number of true values, both true positives and true negatives, over the total number of samples. This is a reasonable scoring metric for our task because it rewards correct predictions across all classes. The formula for this measure is:
(true positive + true negative) / (all samples).
Confusion matrix
A generic example of a confusion matrix is shown in Table (4):
Table (4): Interpretation of the confusion matrix

                        Predicted Positive       Predicted Negative
Actual Positive         True Positive (TP)       False Negative (FN)
Actual Negative         False Positive (FP)      True Negative (TN)
The confusion matrix is one of the most frequently used visualizations of performance in supervised learning algorithms. It can be considered as some sort of visualization of the classification report.
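In scikit-learn, the matrix can be computed directly from the validation labels and a model's predictions; this sketch reuses the y_val and y_pred names from above:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes; the diagonal
# holds the correctly classified counts discussed in the next chapter.
cm = confusion_matrix(y_val, y_pred)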
Classification report
The classification report is a way to represent the metric scores for each crimes type in the target label. The classification report is also a type of descriptive representation of the confusion matrix. Our classification problem is a multi-class, single label problem. Thus, the classification report represents several scoring metrics for each one of these classes.
The classification report represents the following four types of metrics:
Precision: This is the number of true positive samples divided by the number of all samples predicted as positive. In other words, it is the ratio of correct positive predictions to all positive predictions.
The formula is:
true positive / (true positive + false positive).
Recall: This is also called the true positive rate. It is the ratio of correctly predicted positive samples to all actual positive samples, i.e., the number of true positives divided by the number of true positives and false negatives.
The formula is:
true positive / (true positive + false negative).
Both precision and recall are well suited to biased class distributions because they focus on the performance on positive samples rather than negative ones.
F1-score: This is the harmonic mean of the precision and the recall. Thus, this measure considers both false positives and false negatives, and it is usually more informative in prediction tasks with uneven class distributions, as is the case with our data.
The formula is:
2 • (precision • recall) / (precision + recall).
Support: This is not an evaluation metric itself but supportive information for comparing scores. The support is the number of actual samples of each class, i.e., the actual number of crimes in each category.
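All four of these metrics, together with the accuracy score, can be obtained in two calls (a sketch, again reusing y_val and y_pred):

from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(y_val, y_pred))          # overall accuracy
print(classification_report(y_val, y_pred))   # per-class precision, recall, f1-score, support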
CHAPTER V
RESULTS AND DISCUSSION
Experiments
The first experiment was our baseline experiment. It only took the crimes features as input features, and its results were compared with our second experiment once the tweets features were added. The second experiment used both the crimes' nine features and the 1,000 text features extracted from the tweets, which were called terms.
This section first shows the results of each model for each crime type. Then, the overall scores of the models across all crime categories are compared. Lastly, the per-crime scores of the best model are examined.
First experiment
Logistic regression
The logistic regression overall performance on the validation set had an accuracy of 61%. The precision, recall, and F1-scores are shown in Table (5). The diagonal of the confusion matrix in Figure (17) shows that this model worked well at classifying the CRIMINAL DAMAGE type of crime: it predicted 1,483 samples of that crime as true positives out of 1,515 actual positives, mispredicting only 32 samples. That crime category had the largest number of samples in the validation set, so it likely had a better chance of being predicted correctly. In addition, this model failed to recognize crime types with few samples.
Figure (17): Confusion matrix for the logistic regression model of the first experiment
Table (5): Classification report for the logistic regression model of the first experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.67     0.83       0.74       540
CRIMINAL DAMAGE                           0.54     0.98       0.69      1515
BATTERY                                   0.72     0.18       0.29       351
THEFT                                     0.00     0.00       0.00       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.77     0.77       0.77       271
ROBBERY                                   0.88     0.82       0.85       267
MOTOR VEHICLE THEFT                       0.00     0.00       0.00       223
NARCOTICS                                 0.66     0.57       0.61       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.68     0.47       0.56       769
OTHER OFFENSE                             0.00     0.00       0.00       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.54     0.95       0.69       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    0.47     0.13       0.20        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.53     0.61       0.53      4893
Gaussian naive Bayes
The overall accuracy of this model was 86%. Based on the diagonal of the confusion matrix in Figure (18), this model predicted the true positives well. It especially predicted class 1, CRIMINAL DAMAGE, and class 10, DECEPTIVE PRACTICE, well, with F1-scores of 99% and 81%, respectively. Table (6) shows that these two types of crimes also had the highest counts within this set. NB also predicted classes with low crime counts well; for instance, it predicted 100% of the true positives of PROSTITUTION with only six samples available in our validation set.
Figure (18): Confusion matrix for the naive Bayes model of the first experiment
Table (6): Classification report for the naive Bayes model of the first experiment

                                     precision   recall   f1-score   support
ASSAULT                                   1.00     0.99       1.00       540
CRIMINAL DAMAGE                           1.00     0.98       0.99      1515
BATTERY                                   1.00     0.93       0.97       351
THEFT                                     1.00     0.95       0.98       222
BURGLARY                                  0.62     0.50       0.56        30
WEAPONS VIOLATION                         0.86     0.30       0.44       271
ROBBERY                                   1.00     0.99       0.99       267
MOTOR VEHICLE THEFT                       1.00     0.96       0.98       223
NARCOTICS                                 0.44     0.78       0.56       131
CRIMINAL TRESPASS                         0.96     0.96       0.96        23
DECEPTIVE PRACTICE                        0.73     0.89       0.81       769
OTHER OFFENSE                             0.43     0.25       0.32       294
OFFENSE INVOLVING CHILDREN                1.00     0.50       0.67         4
CRIM SEXUAL ASSAULT                       1.00     1.00       1.00       111
SEX OFFENSE                               0.57     0.17       0.27        23
PUBLIC PEACE VIOLATION                    0.98     1.00       0.99        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              1.00     1.00       1.00         6
HOMICIDE                                  0.56     1.00       0.72         9
KIDNAPPING                                0.83     0.94       0.88        16
ARSON                                     0.06     0.80       0.12         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.02     0.33       0.04         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.89     0.86       0.86      4893
K-nearest neighbor
The overall accuracy of KNN was 31%, a poor performance. Based on the diagonal of the confusion matrix in Figure (19), most predictions fell in the CRIMINAL DAMAGE category, possibly because CRIMINAL DAMAGE crimes were the most numerous in our validation set. In addition, this model failed to predict any category other than CRIMINAL DAMAGE, ASSAULT, and DECEPTIVE PRACTICE; in other words, it failed to predict any class other than the three major classes in this set.
Figure (19): Confusion matrix for the k-nearest neighbor model of the first experiment
Table (7): Classification report for the k-nearest neighbor model of the first experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.23     0.04       0.06       540
CRIMINAL DAMAGE                           0.33     0.92       0.49      1515
BATTERY                                   0.00     0.00       0.00       351
THEFT                                     0.00     0.00       0.00       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.00     0.00       0.00       271
ROBBERY                                   0.00     0.00       0.00       267
MOTOR VEHICLE THEFT                       0.00     0.00       0.00       223
NARCOTICS                                 0.00     0.00       0.00       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.20     0.17       0.18       769
OTHER OFFENSE                             0.00     0.00       0.00       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.00     0.00       0.00       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    0.00     0.00       0.00        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.16     0.31       0.19      4893
Support vector machines
The overall accuracy score was 32%, 1% better than the KNN performance but still poor. The confusion matrix in Figure (20) shows that SVM produced results similar to KNN, with similar prediction scores even for the same crime categories. The only difference was that this model predicted fewer values in categories other than the three major ones, based on the classification report in Table (8).
Figure (20): Confusion matrix for the support vector machines model of the first experiment
Table (8): Classification report for the support vector machines model of the first experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.81     0.06       0.10       540
CRIMINAL DAMAGE                           0.31     0.96       0.47      1515
BATTERY                                   0.93     0.11       0.19       351
THEFT                                     0.31     0.04       0.06       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.67     0.01       0.01       271
ROBBERY                                   1.00     0.02       0.04       267
MOTOR VEHICLE THEFT                       0.32     0.05       0.09       223
NARCOTICS                                 0.00     0.00       0.00       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.42     0.04       0.07       769
OTHER OFFENSE                             0.29     0.02       0.04       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.00     0.00       0.00       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    0.00     0.00       0.00        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.46     0.32       0.20      4893
Artificial neural network
The overall accuracy score for ANN was 48%. Thus far, the best model was NB with an accuracy of 86%, followed by logistic regression with 61% and ANN with 48%. This model also predicted the high-frequency crime types well, but it performed poorly on types with few samples, as seen in the confusion matrix in Figure (21) and the classification report in Table (9).
Figure (21): Confusion matrix for the artificial neural networks model of the first experiment
Table (9): Classification report for the artificial neural networks model of the first experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.61     0.54       0.57       540
CRIMINAL DAMAGE                           0.46     0.90       0.60      1515
BATTERY                                   0.80     0.32       0.46       351
THEFT                                     0.29     0.01       0.02       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.43     0.71       0.54       271
ROBBERY                                   0.32     0.38       0.35       267
MOTOR VEHICLE THEFT                       0.00     0.00       0.00       223
NARCOTICS                                 0.50     0.02       0.03       131
CRIMINAL TRESPASS                         0.40     0.09       0.14        23
DECEPTIVE PRACTICE                        0.61     0.37       0.46       769
OTHER OFFENSE                             1.00     0.02       0.03       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.52     0.14       0.23       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    0.00     0.00       0.00        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.50     0.48       0.41      4893
Ensemble learning
Ensemble learning used the predictions of the five previous models and set the prediction result to the most voted value for each sample in the validation set; this is the max voting technique. The overall performance of this model was 52%, better than the accuracy scores of KNN and SVM but worse than those of logistic regression, NB, and ANN. This model's scores were very similar to the ANN scores, based on the confusion matrix in Figure (22) and the classification report in Table (10).
Figure (22): Confusion matrix for the ensemble learning model of the first experiment
Table (10): Classification report for the ensemble learning model of the first experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.81     0.94       0.87       540
CRIMINAL DAMAGE                           0.41     0.99       0.58      1515
BATTERY                                   0.98     0.17       0.30       351
THEFT                                     0.00     0.00       0.00       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.86     0.29       0.44       271
ROBBERY                                   0.98     0.31       0.47       267
MOTOR VEHICLE THEFT                       0.00     0.00       0.00       223
NARCOTICS                                 0.56     0.18       0.27       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.79     0.31       0.45       769
OTHER OFFENSE                             0.00     0.00       0.00       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       1.00     0.24       0.39       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    1.00     0.06       0.12        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.56     0.52       0.44      4893
Accuracy score and classification report scores for all five models and the ensemble learning model

Table (11) shows that the model with the best accuracy score was the naive Bayes classifier at 86%, followed by logistic regression at 61%. The worst two models in this experiment were the k-nearest neighbor classifier and the support vector machines, with accuracy scores of 31% and 32%, respectively.
Table (11): Performance metrics of the models for the first experiment

Model        Accuracy   Precision   Recall   F1-score
LR               0.61        0.53     0.61       0.53
NB               0.86        0.89     0.86       0.86
KNN              0.31        0.16     0.31       0.19
SVM              0.32        0.46     0.32       0.20
ANN              0.48        0.50     0.48       0.41
Ensemble         0.52        0.56     0.52       0.44
Second experiment
Logistic regression
Similar to the previous experiment, logistic regression predicted the crimes with the highest occurrences better than those with the lowest. However, the accuracy score in the second experiment, after adding the 1,000 tweets features as inputs to the model, improved significantly: it is now 81%, compared to 61% in the previous experiment when only the crimes features were used. Figure (23) and Table (12) show detailed scores for each category of crime.
Figure (23): Confusion matrix for the logistic regression model for the second experiment
Table (12): Classification report for the logistic regression model for the second experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.99     1.00       1.00       540
CRIMINAL DAMAGE                           0.95     1.00       0.98      1515
BATTERY                                   0.86     0.90       0.88       351
THEFT                                     0.96     0.42       0.59       222
BURGLARY                                  0.50     0.07       0.12        30
WEAPONS VIOLATION                         0.92     0.99       0.96       271
ROBBERY                                   0.95     1.00       0.97       267
MOTOR VEHICLE THEFT                       0.31     0.21       0.25       223
NARCOTICS                                 0.90     0.82       0.86       131
CRIMINAL TRESPASS                         1.00     0.13       0.23        23
DECEPTIVE PRACTICE                        0.59     0.83       0.69       769
OTHER OFFENSE                             0.42     0.28       0.33       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.69     0.81       0.74       111
SEX OFFENSE                               0.60     0.13       0.21        23
PUBLIC PEACE VIOLATION                    0.35     0.17       0.23        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.33     0.06       0.11        16
ARSON                                     1.00     0.20       0.33         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              1.00     0.29       0.44         7
avg / total                               0.80     0.81       0.80      4893
Gaussian naive Bayes
NB also scored higher in this experiment than in the first one: it reached an accuracy of 96%, compared to 86% previously, a 10% increase. Figure (24) and Table (13) show detailed scores for each crime category.
Figure (24): Confusion matrix for the naive Bayes model for the second experiment
Table (13): Classification report for the naive Bayes model for the second experiment

                                     precision   recall   f1-score   support
ASSAULT                                   1.00     1.00       1.00       540
CRIMINAL DAMAGE                           1.00     1.00       1.00      1515
BATTERY                                   0.91     1.00       0.95       351
THEFT                                     1.00     0.84       0.91       222
BURGLARY                                  0.79     0.37       0.50        30
WEAPONS VIOLATION                         0.91     1.00       0.95       271
ROBBERY                                   1.00     0.97       0.98       267
MOTOR VEHICLE THEFT                       1.00     1.00       1.00       223
NARCOTICS                                 1.00     0.98       0.99       131
CRIMINAL TRESPASS                         1.00     0.26       0.41        23
DECEPTIVE PRACTICE                        0.90     0.96       0.93       769
OTHER OFFENSE                             0.87     0.80       0.83       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       1.00     0.95       0.98       111
SEX OFFENSE                               0.64     1.00       0.78        23
PUBLIC PEACE VIOLATION                    1.00     0.87       0.93        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              1.00     1.00       1.00         6
HOMICIDE                                  1.00     1.00       1.00         9
KIDNAPPING                                0.94     1.00       0.97        16
ARSON                                     0.71     1.00       0.83         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              1.00     0.67       0.80         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.54     1.00       0.70         7
avg / total                               0.96     0.96       0.96      4893
K-nearest neighbor
This model's overall accuracy score was 31%, identical to its performance in the first experiment.
Figure (25): Confusion matrix for the k-nearest neighbor model for the second experiment
Table (14): Classification report for the k-nearest neighbor model for the second experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.25     0.04       0.07       540
CRIMINAL DAMAGE                           0.34     0.90       0.49      1515
BATTERY                                   0.00     0.00       0.00       351
THEFT                                     0.00     0.00       0.00       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.00     0.00       0.00       271
ROBBERY                                   0.00     0.00       0.00       267
MOTOR VEHICLE THEFT                       0.00     0.00       0.00       223
NARCOTICS                                 0.00     0.00       0.00       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.20     0.20       0.20       769
OTHER OFFENSE                             0.00     0.00       0.00       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.00     0.00       0.00       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    0.00     0.00       0.00        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.16     0.31       0.19      4893
Support vector machines
The overall accuracy of SVM in this experiment was 27%, which was 5% lower than its score in the first experiment. Adding the tweets features alongside the crimes features did not improve this model.
Figure (26): Confusion matrix for the support vector machines model for the second experiment
Table (15): Classification report for the support vector machines model for the second experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.19     0.16       0.18       540
CRIMINAL DAMAGE                           0.38     0.59       0.46      1515
BATTERY                                   0.12     0.09       0.10       351
THEFT                                     0.12     0.08       0.10       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.21     0.09       0.13       271
ROBBERY                                   0.15     0.10       0.12       267
MOTOR VEHICLE THEFT                       0.11     0.07       0.09       223
NARCOTICS                                 0.09     0.03       0.05       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.23     0.24       0.23       769
OTHER OFFENSE                             0.12     0.09       0.10       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.34     0.15       0.21       111
SEX OFFENSE                               0.33     0.09       0.14        23
PUBLIC PEACE VIOLATION                    0.00     0.00       0.00        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.23     0.27       0.24      4893
Artificial neural network
ANN scored an accuracy of 47% in this experiment, whereas it scored 48% in the previous one. This model did not improve; instead, it showed a slight decrease in performance when the 1,000 text features were added as inputs.
Figure (27): Confusion matrix for the artificial neural networks model for the second experiment
Table (16): Classification report for the artificial neural networks model for the second experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.75     0.10       0.17       540
CRIMINAL DAMAGE                           0.39     0.95       0.55      1515
BATTERY                                   0.00     0.00       0.00       351
THEFT                                     0.00     0.00       0.00       222
BURGLARY                                  0.00     0.00       0.00        30
WEAPONS VIOLATION                         0.42     0.20       0.27       271
ROBBERY                                   0.50     0.00       0.01       267
MOTOR VEHICLE THEFT                       0.00     0.00       0.00       223
NARCOTICS                                 0.21     0.10       0.14       131
CRIMINAL TRESPASS                         0.00     0.00       0.00        23
DECEPTIVE PRACTICE                        0.32     0.39       0.35       769
OTHER OFFENSE                             0.00     0.00       0.00       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.00     0.00       0.00       111
SEX OFFENSE                               0.00     0.00       0.00        23
PUBLIC PEACE VIOLATION                    0.00     0.00       0.00        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                0.00     0.00       0.00        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.31     0.38       0.27      4893
Ensemble learning
The overall accuracy score of ensemble learning in the second experiment was 60%, 8% higher than the accuracy of the same model in our first experiment. Thus, the model improved after using the tweets features.
Figure (28): Confusion matrix for the ensemble learning model for the second experiment
Table (17): Classification report for the ensemble learning model for the second experiment

                                     precision   recall   f1-score   support
ASSAULT                                   0.91     0.74       0.81       540
CRIMINAL DAMAGE                           0.48     1.00       0.65      1515
BATTERY                                   0.96     0.28       0.43       351
THEFT                                     1.00     0.13       0.23       222
BURGLARY                                  1.00     0.03       0.06        30
WEAPONS VIOLATION                         0.88     0.63       0.73       271
ROBBERY                                   0.97     0.27       0.43       267
MOTOR VEHICLE THEFT                       1.00     0.14       0.25       223
NARCOTICS                                 1.00     0.43       0.60       131
CRIMINAL TRESPASS                         1.00     0.04       0.08        23
DECEPTIVE PRACTICE                        0.71     0.65       0.68       769
OTHER OFFENSE                             0.71     0.09       0.15       294
OFFENSE INVOLVING CHILDREN                0.00     0.00       0.00         4
CRIM SEXUAL ASSAULT                       0.93     0.35       0.51       111
SEX OFFENSE                               1.00     0.09       0.16        23
PUBLIC PEACE VIOLATION                    1.00     0.05       0.09        63
INTERFERENCE WITH PUBLIC OFFICER          0.00     0.00       0.00         2
PROSTITUTION                              0.00     0.00       0.00         6
HOMICIDE                                  0.00     0.00       0.00         9
KIDNAPPING                                1.00     0.12       0.22        16
ARSON                                     0.00     0.00       0.00         5
LIQUOR LAW VIOLATION                      0.00     0.00       0.00         2
CONCEALED CARRY LICENSE VIOLATION         0.00     0.00       0.00         1
INTIMIDATION                              0.00     0.00       0.00         3
OBSCENITY                                 0.00     0.00       0.00         1
OTHER NARCOTIC VIOLATION                  0.00     0.00       0.00         2
PUBLIC INDECENCY                          0.00     0.00       0.00         2
NON-CRIMINAL                              0.00     0.00       0.00         7
avg / total                               0.74     0.60       0.55      4893
Accuracy score and classification report scores for all five models and the ensemble learning model

Table (18) shows that the model with the best accuracy score in this experiment was again the naive Bayes model, just like in the first experiment, followed by logistic regression at 81% and ensemble learning at 60%.

Comparing the ensemble learning results in the first and second experiments, with accuracies of 52% and 60%, respectively, ensemble learning was not the best model in terms of accuracy, but it was also not the worst: in both experiments it had a middle accuracy score. The best model for our crimes classification task was Gaussian naive Bayes.
Table (18): Performance metrics for the second experiment

Model        Accuracy   Precision   Recall   F1-score
LR               0.81        0.80     0.81       0.80
NB               0.96        0.96     0.96       0.96
KNN              0.31        0.16     0.31       0.19
SVM              0.27        0.23     0.27       0.24
ANN              0.47        0.31     0.38       0.27
Ensemble         0.60        0.74     0.60       0.55
Discussion
Comparing the results of the two experiments
Table (19) shows the accuracy scores of the two experiments side by side. There was a significant improvement in the logistic regression model. Its accuracy score increased by 20% in the second experiment compared to the first one. There was also an increase in the performance of NB and ensemble learning models by 10% and 8%, respectively. In contrast, there was a slight decrease in the performance of the SVM and ANN models. Their accuracy scores decreased by 5% and 1%, respectively. Lastly, there was no change in the performance of the KNN model. It stayed the same at an accuracy score of 31%.
Table (19): A comparison of the accuracy scores of the first and second experiments

Model                        Experiment 1 accuracy   Experiment 2 accuracy   Rate change
Logistic Regression                           0.61                    0.81         +0.20
Gaussian Naive Bayes                          0.86                    0.96         +0.10
K-Nearest Neighbor                            0.31                    0.31          0.00
Support Vector Machines                       0.32                    0.27         -0.05
Artificial Neural Network                     0.48                    0.47         -0.01
Ensemble Learning                             0.52                    0.60         +0.08
CHAPTER VI
CONCLUSION AND FUTURE WORK
Conclusion
Comparing the accuracy scores of both experiments side by side shows that the second experiment surpasses the first with the logistic regression, NB, and ensemble learning models, by increases of 20%, 10%, and 8% in accuracy, respectively. Furthermore, the decreases in the accuracy scores of the SVM and ANN models between the two experiments are small, at only 5% and 1%, respectively.
This result supports our hypothesis: adding the 1,000 features collected from the Chicago tweets to our models, alongside the nine features of the crimes data, significantly increased the performance of predicting the category of crimes.
Overall, an accuracy as high as 96% was achieved using the Gaussian naive Bayes classifier with the nine crimes features and the 1,000 features from the geo-tagged tweets as inputs to the model.
Future Work
Our approach could be improved by finding more advanced ways to integrate the tweets with the crimes. In addition, time series analysis could be used to predict the time and place of upcoming crimes, rather than just their categories; such an approach would involve clustering techniques.
REFERENCES
[1] "Definition of CRIMINOLOGY." [Online]. Available: https://www.merriam-webster.com/dictionary/criminology. [Accessed: 01-Jul-2018].
[2] Chicago. (2018). City of Chicago | Data Portal | City of Chicago | Data Portal, [online] Available at: https://data.cityofchicago.org/ [Accessed 14 Jul. 2018].
[3] "Developer Policy." [Online]. Available: https://developer.twitter.com/en/developer-terms/policy.html. [Accessed: 01-Jul-2018].
[4] "Natural Language Processing vs. Machine Learning vs. Deep Learning. Dr. Rutu Mulkar-Mehta." [Online]. Available: https://rutumulkar.com/blog/2016/NLP-ML. [Accessed: 01-Jul-2018].
[5] "What Is Machine Learning? | How It Works, Techniques & Applications." [Online]. Available: https://www.mathworks.com/discovery/machine-learning.html. [Accessed: 01-Jul-2018].
[6] "supervised learning - Morgan Polotan." [Online]. Available: https://morganpolotan.wordpress.com/tag/supervised-learning/. [Accessed: 01-Jul-2018].
[7] "Definition of CRIME." [Online]. Available: https://www.merriam-webster.com/dictionary/crime. [Accessed: 02-Jul-2018].
[8] "How to Identify Hot Spots," National Institute of Justice. [Online]. Available: https://www.nij.gov/topics/law-enforcement/strategies/hot-spot-policing/pages/identifying.aspx. [Accessed: 29-Mar-2018].
[9] "Twitter.com Traffic, Demographics and Competitors - Alexa." [Online].
Available: https://www.alexa.com/siteinfo/twitter.com. [Accessed: 01-Apr-2018].
[10] "Twitter officially expands its character count to 280 starting today," 07-Nov-2017. [Online]. Available: http://social.techcrunch.com/2017/11/07/twitter-officially-expands-its-character-count-to-280-starting-today/. [Accessed: 01-Apr-2018].
[11] "Building With the Twitter API: Getting Started." [Online]. Available: https://code.tutsplus.com/tutorials/building-with-the-twitter-api-getting-started--cms-22192. [Accessed: 02-Apr-2018].
[12] "OAuth Community Site." [Online]. Available: https://oauth.net/. [Accessed: 02-Apr-2018].
72


[13] "Building With the Twitter API: Getting Started." [Online].
Available: https://code.tutsplus.com/tutorials/building-with-the-twitter-api-getting-started--cms-22192. [Accessed: 02-Apr-2018].
[14] S. Sola, "Playing with Twitter Streaming API," Sergio Sola, 23-Nov-2016. [Online]. Available: https://medium.com/@ssola/playing-with-twitter-streaming-api-b1f8912e50b0. [Accessed: 02-Apr-2018].
[15] A. Culotta, "Towards detecting influenza epidemics by analyzing Twitter messages," Proceedings of the First Workshop on Social Media Analytics - SOMA 10, 2010.
[16] H. Alostad and H. Davulcu, "Directional prediction of stock prices using breaking news on Twitter," Web Intelligence, vol. 15, no. 1, pp. 1-17, 2017.
[17] L. Wang and J. Q. Gan, "Prediction of the 2017 French election based on Twitter data analysis," 2017 9th Computer Science and Electronic Engineering (CEEC), 2017.
[18] H. B. F. David and A. Suruliandi, "Survey On Crime Analysis And Prediction Using Data Mining Techniques," ICTACT Journal on Soft Computing, vol. 7, no. 3, pp. 1459-1466, Jan. 2017.
[19] M. Sharma, "Z - CRIME: A data mining tool for the detection of suspicious criminal activities based on decision tree," 2014 International Conference on Data Mining and Intelligent Computing (ICDMIC), 2014.
[20] E. Hamdy, A. Adi, A. E. Hassanien, O. Hegazy, and T.-H. Kim, "Criminal Act Detection and Identification Model," 2015 Seventh International Conference on Advanced Communication and Networking (ACN), 2015.
[21] J. Agarwal, R. Nagpal, and R. Sehgal, "Crime Analysis using K-Means Clustering," International Journal of Computer Applications, vol. 83, no. 4, pp. 1-4, 2013.
[22] M. S. Gerber, "Predicting crime using Twitter and kernel density estimation," Decision Support Systems, vol. 61, pp. 115-125, 2014.
[23] X. Chen, Y. Cho, and S. Y. Jang, "Crime prediction using Twitter sentiment and weather," 2015 Systems and Information Engineering Design Symposium, 2015.
[24] J. Sanburn and D. Johnson, "Chicago's Deadly 2016: See It in 3 Charts," Time, 17-Jan-2017. [Online]. Available: http://time.com/4635049/chicago-murder-rate-homicides/. [Accessed: 15-Jul-2018].
73


[25] "More on restricted use cases — Twitter Developers." [Online].
Available: https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases. [Accessed: 31-Mar-2018].
[26] Cs.waikato.ac.nz, 2018. [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf. [Accessed: 20-Jul-2018].
[27] "scikit-learn: machine learning in Python — scikit-learn 0.19.2 documentation," Scikit-learn.org, 2018. [Online]. Available: http://scikit-learn.org/stable/. [Accessed: 16-Jul-2018].
[28] "A Comprehensive Guide to Ensemble Learning (with Python codes)," Analytics Vidhya, 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/. [Accessed: 24-Jul-2018].
APPENDIX A
Tweet Downloader Code
# Importing libraries
import sys
import psutil
import time
import os
import jsonpickle
import simplejson
import json
# from HTMLParser import HTMLParser
from IPython.display import display
import tweepy
from tweepy.error import TweepError
from tweepy import OAuthHandler, AppAuthHandler
from requests.exceptions import ConnectionError


# Our tweets downloader main class of our program
class TweetDownloader(object):

    # --- Authentication information (put your own) --- #
    consumer_key = '**************'
    consumer_secret = '**************'
    access_token = '**************'
    access_secret = '**************'
    # -------------------------------------------------- #

    # --- Our search variables --- #
    # The geocode for Chicago, IL at a radius of 15 miles
    geoLocation = "41.833584450000004,-87.67181069718896,15mi"
    # We leave this empty because we want to collect all of Chicago tweets
    searchQuery = " "  # "place:1d9a5370a355ab0c"
    # This is the max count the API permits, so 100 tweets * 450 queries
    # (the rate limit per 15-minute window) = 45,000 tweets can be collected
    tweetsPerQ = 100
    # The maximum number of tweets we want to collect; if we leave items()
    # empty, it is supposed to collect all tweets up to 7 days old
    maxTweets = 2300
    # ---------------------------- #

    def __init__(self, old=None):
        # a counter for how many tweets we collected
        self.tweetsCount = 0
        self.oldest = old
        self.api = None
        self.tweet = None

    def auth_api(self):
        # Pass our application authentication information to Tweepy's AppAuthHandler
        auth = AppAuthHandler(self.consumer_key, self.consumer_secret)
        # Creating a Twitter API wrapper using tweepy;
        # wait_on_rate_limit will let us know when we reach the rate limit
        # and how much time is left
        self.api = tweepy.API(auth, retry_count=5, retry_delay=10,
                              retry_errors=set([401, 404, 408, 500, 503, 504]),
                              wait_on_rate_limit=True,
                              wait_on_rate_limit_notify=True)
        # Error handling
        if not self.api:
            print("Problem connecting to API")

    def print_tweet(self):
        # Earlier versions also printed the tweet text, user, place,
        # coordinates, and geo fields
        print("ID:", self.tweet.id, "\nCreated at:", self.tweet.created_at)

    def print_progress(self):
        # Display how many tweets we have collected
        print("\nDownloaded {0} tweets".format(self.tweetsCount))
        # Display some information about the tweet
        self.print_tweet()
        # You can check how many queries you have left using the
        # rate_limit_status() method
        print("Remaining rate limit:",
              self.api.rate_limit_status()['resources']['search']['/search/tweets']['remaining'])

    def process_tweet(self, outputFile):
        # Verify the tweet has specific place info before writing
        if self.tweet.geo is not None:
            # Write the tweet in our output file
            outputFile.write(jsonpickle.encode(self.tweet._json, unpicklable=False) + '\n')
            outputFile.flush()
            self.tweetsCount = self.tweetsCount + 1
            # print information about our download progress
            self.print_progress()
            # update the id of the oldest tweet less one
            self.oldest = self.tweet.id - 1

    def collect_tweets(self, x):
        # Create the output file name
        outputFileName = "tweets-Nov(" + str(x) + ").json"
        print("\nDOWNLOADING TO FILE:", outputFileName, "\n\n")
        # Open a JSON text file to save the tweets to
        with open(output_path + outputFileName, 'w') as outputFile:
            while True:
                # on data:
                try:
                    # Collecting the tweets with parameters set to our search criteria.
                    # An earlier run used until='2017-11-02' with
                    # since_id=927687033463758849 (the id from the first
                    # tweets collected on Nov 6, last file).
                    if self.oldest is not None:
                        tweets = tweepy.Cursor(self.api.search, q=self.searchQuery,
                                               lang='en', since='2017-11-19',
                                               max_id=self.oldest,
                                               geocode=self.geoLocation,
                                               since_id=932421169491464192,
                                               count=self.tweetsPerQ).items()
                    else:
                        tweets = tweepy.Cursor(self.api.search, q=self.searchQuery,
                                               lang='en', since='2017-11-19',
                                               geocode=self.geoLocation,
                                               since_id=932421169491464192,
                                               count=self.tweetsPerQ).items()
                    # no more tweets found and so no more will be collected
                    if not any(tweets):
                        print("No more tweets found\n")
                        break
                    # this means that we have reached the limit of our search criteria
                    if self.tweetsCount > self.maxTweets:
                        print("Max tweets of {} reached\n".format(self.maxTweets))
                        break
                    # processing the collected tweets
                    for self.tweet in tweets:
                        self.process_tweet(outputFile)
                # on error:
                except (ConnectionError, TweepError) as e:
                    print("\nERROR HAPPENED\n{0}\nTRYING TO RECONNECT...\n".format(e))
                    time.sleep(180)
                    self.auth_api()
        # on finishing:
        print("FINISHED DOWNLOADING TO FILE: {}.".format(outputFileName))
        print("Downloaded {} tweets".format(self.tweetsCount))


def main():
    global output_path
    output_path = "/home/alan/Desktop/ThesisGithub/Software/Output/Good Outputs :)/week4 (Nov20-27)/"
    oldestID = 934548384635002881 - 1  # ID - 1
    oldestLIST = []
    # i range is always changing depending on output files names
    for i in range(25, 19, -1):
        myDownloader = TweetDownloader(oldestID)
        myDownloader.auth_api()
        myDownloader.collect_tweets(i)
        oldestID = myDownloader.oldest
        oldestLIST.append(oldestID + 1)
    print("\n\n**** { FINISHED DOWNLOADING ALL FILES } ****\n")
    print("Oldest IDs for each file: ", oldestLIST)


main()


APPENDIX B
Data Integration Code
import pandas as pd

path = "/home/alan/Desktop/ThesisGithub/Software/Data/Chicago/Cleaned Data/final preprocessing/"

arff = pd.read_csv(path + "after-weka.csv")

tweets = pd.read_csv(path + "TWEETS.csv")
tweets = tweets.reset_index()
tweets = tweets.drop(['index'], axis=1)
tweets = tweets.drop(["Unnamed: 0"], axis=1)

crimes = pd.read_csv(path + "Chicago Crimes Dataset - Processed4 - drop na.csv")
crimes = crimes.reset_index()
crimes = crimes.drop(['index'], axis=1)
crimes = crimes.drop(["Unnamed: 0"], axis=1)


# Select the tweet rows (tokens) and crime rows that fall in the same
# community area and time bin
def masking(tokens, tweets, crimes, time, area):
    mask = tweets['Community Number'] == area
    comm1 = tokens[mask]
    mask = comm1['Times'] == time
    time1 = comm1[mask]

    mask = crimes['Community Area'] == area
    comm2 = crimes[mask]
    mask = comm2['Times Number'] == time
    time2 = comm2[mask]
    return time1, time2


full = pd.DataFrame()

for area in range(1, 78):
    for time in range(1, 5):
        area2 = float(area)
        t1, t2 = masking(arff, tweets, crimes, time, area2)
        # keep only as many tweet rows as there are crime rows
        if len(t1) > len(t2):
            t1 = t1[:len(t2)]
        t1 = t1.reset_index()
        t1 = t1.drop(['index'], axis=1)
        # t1 = t1.drop(["Unnamed: 0"], axis=1)
        t1['helper'] = t1.index
        t2 = t2.reset_index()
        t2 = t2.drop(['index'], axis=1)
        # t2 = t2.drop(["Unnamed: 0"], axis=1)
        print("Area = ", area, " Time bin = ", time)
        print("t1 = ", len(t1), " t2 = ", len(t2))
        t2['helper'] = t2.index
        # merge the tweet features and the crime features row by row
        tt = pd.merge(t1, t2, on='helper')
        print("t1 + t2 =", len(tt))
        # tt = pd.concat([t1, t2], axis=0, join_axes=[t1.index])
        # two = t2.join(t1[:len(t2)])
        full = full.append(tt)
        print("full=", len(full))

full.to_csv(path + "full_crimes_tweets.csv")


Full Text

PAGE 1

MACHINE LEARNING ALGORITHMS AND NATURAL LANGUAGE PROCESSING TECHNIQUES FOR CRIME PREDICTION WITH GEO TAGGED TWEETS by ALANOUD ALSALMAN B.S., Qassim University, 2009 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Computer Science Program 2018

PAGE 2

ii © 2018 ALANOUD ALSALMAN ALL RIGHTS RESERVED

PAGE 3

iii This thesis for the Master of Science degree by Alanoud Alsalman Has been approved for the Computer Science Program by Tom Altman Ashis Biswas Liang He Date: July 20, 2018

PAGE 4

iv Alsalman, Alanoud. (M.S., Computer Science Program) Machine Learning Algorithms and Natural Language Processing Techniques for Crime Prediction With Geo Tagged Twee ts Thesis directed by Professor Tom Altman ABSTRACT Twitter is one of the top 10 online social networks in the world. Many studies have been metadata, such as the tweets geo locations or names of places. This research f ocuses on collecting and analyzing tweet posts containing location metadata to increase the accuracy of predicting crimes types within the neighborhood boundaries of the city of Chicago. Our prediction model is a combination of machine learning and natural language processing tagged Chicago tweets that we have collected. The results are based on three experiments: (1) A model for using five classification algorithms wi crime types, as our baseline model. (2) A model for using the same classification algorithms, but with both the Chicago crime data and the geo tagged tweets. (3) Using ensemble learning on the previous t wo models. As a result, we gained an increase in performance of the three models after adding tweets features as inputs. Our approaches achieve an accuracy as high as 96% in predicting crime categories. The form and content of this abstract are approved. I recommend its publication. Approved: Tom Altman

PAGE 5

v DEDICATION I dedicate this work to my dear husband who supported me throughout my entire master studies. I also dedicate it to my mother, father, sisters and brother wh o supported me throughout my life and motivated me to pursue my graduate education.

PAGE 6

vi ACKNOWLEDGMENTS This work was supported by the Saudi Arabian Cultural Mission in the United States and by my employers at the Qassim University. I want to thank them for giving me this great Professor Tom Altman. Without his invaluable advice and guidance throughout the thesis work, o thank Professor Ashis Biswas for his guidance as well.

PAGE 7

vii TABLE OF CONTENTS CHAPTER I. 2 II. . III. . Data Collecti IV. . . Exploratory Data Analysis

PAGE 8

viii . .. .......... 3 8 V. VI. 2 APPENDIX A. . 5 B. 5

PAGE 9

1 CHAPTER I PR E LIMINARIES Introduction Social media has proven to be a valuable research asset in many fields, such as business and sociology. Recently, some approaches were used to deploy social media in the field of criminology. In this research, Twitter was chosen as the social media source of data to predict the category of crime s occurring in the city of Chicago , IL . This research uses the geo tagged tweets from Twitter data to our advantage in the prediction process. In addition, this research explore s and use s some n atural language processing techniques and machine learning algorithms for the crime s prediction task and identif ies the best prediction accuracy for the study. Furthermore, this work uses Twitter data and data from Chicago city crime s data to extract features and use s t hem as inputs for our predictive model s . T he performance of five classification algorithms are then compared, and this work uses the max voting technique that takes the prediction results of th ese five classification algorithms as inp uts and decides the best model accuracy. In addition, t he performance of our model is analyzed before and after adding Twitter data features as input s in conj unction with the Chicago city crimes data features in our model . This study is an interdisciplinary approach between the field s of computer science and c riminal justice and introduce s an approach for predicting crime s that can help law enforcement agencies prevent crimes faster than traditional crime s prediction approaches.

PAGE 10

2 Hypothesis B y analyzing both the tweets from Twitter and the data concerning Chicago crimes , a crimes prediction model with greater performance can be achieved. Objectives This work proves that Twitter text data can be a n asset in predicting crime s and create s a prediction model that, by using Twitter data features along with historical crimes data features , can efficiently and accurately be used as a prediction tool that police officers may use on a daily basis to prevent crimes by deploying their limited patrols in a smarter and more optimal manner . Problem Statement The field of criminology, which is the scientific study of crimes and criminal characteristics [1] time c onsuming task , and t he success rate of this task relies on timing and accuracy. There is also the problem of scarce police resources, such as patrols , that must be used in the smart est and the most effective way possible. Consequently , this work assists criminal analysts with their prediction task s by harnessing Twitter d ata to obtain better prediction results with the help o f data mining techniques. The advantage of using data mining in the analysis of crimes is that it helps solv e crimes and trac e criminals faster. The large amount of data in the criminology field make s it a suitable field to ap ply data mining algorithms .

PAGE 11

3 Proposed Solution Online social media users tend to share their everyday life activities. In addition , they also share their experiences and what they spotted during their day such as fights and sounds of something breaking. These online posts sometimes carry valuable information about a crime scene or a crime that just occurred . If these online feeds, such as Twitter posts , contain temporal data about the incident they witnessed and spatial information, then the police could incorporate the data into their crime prediction process. T his study filter s Twitter posts , called tweets, based on the boundaries of the city of Chicago to investigate whether adding the text ual information carried in these posts can positively affect the performance of predicting crime s . T he geo tagged tweets were specifically studied to detect signs of criminal activities based on tweet features found with in the boundaries of the city. The solu tion to the problem of accurately and efficiently predicting crimes around the area is mainly examining these tweets by perform ing statistical analysis and using them as inputs in our model s of prediction. This model also use s data samples of Chicago cr imes , which are open data downloaded from the Chicago Data Portal website [ 2 ] , as inputs. F ive classification techniques were used in the first experiment based on the features from the historical recordings of Chicago crimes : logistic regression, G aussian naïve B ayes, k nearest neighbor, support vector machine, and artificial neural networks. Then , the second experiment took both the historical crimes features and the tweet features as inputs to the same classifiers. T he last experiment used the ensemble le predictors among these five classifiers.

PAGE 12

4 F or our performance analysis, results of the prediction model when the tweets are used with the Chicago crimes data as inputs to our model and when the tweets were not used as inputs are compared in this work . In addition, the performance s of the five different classification models are compared with the results of ensemble learning max voting model . Limitations It is crucial to mention that our model does not serve as a standalone tool. The final decision of the prediction can only be determined by the crimes analyst s themselves. Our model is merely a tool to speed up the process of predicting crimes because it i s impossible to reach a 100 % accuracy of prediction . Another limitation is that the standard Twitter API allows developers to only collect a free sample of 1% of public Thus , the collected geo tagged tweets in this study make up 1% of a ll geotagged tweets in Chicago . Contributions T his research increase d the performance of the crimes category prediction in three of the chosen six classification models simply by integrating the tweets text features to our crimes dataset. In addition, adding tweets text features as inputs result ed in higher prediction accuracy rate than only using crimes features . In addition, 417.5 MB of raw tweets (119 , 196 tweets) data were obtained using the Twitter REST API. The tweets collected were tagged with the geo locations within a radius of 15 miles from the center of the city of Chicago , and t hey were posted during the months of November and December in 2017.

PAGE 13

5 T his data cannot be published to the public due to Twitter's developer policies and agreements on publishing downloadable Twitter contents [3 ] . Similarly, Twitter posts were downloaded were downloaded in this work since no downloadable public tweets was found anywhere online. Very few studies researching crimes are based on the geolocations of tweets. Our approach explores the advantages and uses of the geo metadata attached to a tweet object and t his study is an addition to the few studies that use this strategy.

PAGE 14

6 CHAPTER II BACKGROUND Data Mining Data m ining is the science of extracting useful knowledge from many data sources to predict or classify future outcomes. Applications of data mining have spread to many fields, such as biology and psychology, but few data mining studies have been conducted in the field of criminology. The top applications of data mining in criminology are the following : Tracking criminals in the society using their online feeds. Predicting criminal suspects, also known as criminal identification . Predicting criminal activities , such as predict ing crimes time, location, and type . Predicting patterns of crime s , e.g., serial killers and hate crime . Machine l earning (ML) , deep learning, and natural language p rocessing (NLP) are all data m ining techniques , and t hey have some overlap. For instance, ML is sometimes used for NLP tasks, as shown in Figure ( 1 ). Figure ( 1 ) : Venn diagram showi ng the overlap between ML and NLP [4]

The next section provides some background information about machine learning algorithms and natural language processing techniques.

Machine Learning

ML is a branch of artificial intelligence concerned with using historical data to extract useful information to predict future events. Machine learning has received much attention in recent years due to its support for big data and its application in a wide range of disciplines, such as marketing, bioengineering, banking, and criminology. ML algorithms are divided into many types, and the two most common are supervised learning algorithms and unsupervised learning algorithms. These algorithms use different approaches to implement a machine learning prediction model, as shown in Figure (2).

Figure (2): Machine learning types [5]

Supervised Learning Algorithms

In this method of machine learning, the prediction task is performed on a training set with known labels, also known as target values. The labels are the true prediction values of the dataset, also known as the ground truth values.

The labels can be categorical, discrete, or continuous values, and they are compared with the predicted values, which are the outputs of the prediction model. Figure (3) shows the steps of a supervised learning algorithm. There are two types of supervised learning algorithms: classification and regression. Classification algorithms are used for classification problems where the output is a category or a discrete value. In contrast, regression algorithms are used for regression problems where the output to predict is a real numeric or continuous value.

Figure (3): Supervised learning model [6]

A classification algorithm is a two-step task. The algorithm must first perform a training task, which trains the prediction model on the training dataset using the training labels. After training the model, a prediction or classification can be performed on the test dataset. The test dataset samples are treated as inputs to predict the test labels as outputs of the prediction model. The goal of a classification algorithm is to separate the data into categories.
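As a hedged, minimal sketch of this two-step flow using scikit-learn (the library used later in this thesis) and a built-in toy dataset rather than the crime data:

# Step 1: train on labeled samples; Step 2: predict labels for unseen samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                  # features and known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                        # training task
y_pred = model.predict(X_test)                     # classification task
print(accuracy_score(y_test, y_pred))              # predicted vs. true labels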

A regression algorithm follows the same steps as a classification algorithm. However, the goal of this algorithm is to fit the output data to the best-fitting labels. The best fit is determined by calculating some type of error and obtaining the least error possible from the regression task.

Unsupervised Learning Algorithms

In this method, the labels of the dataset used in the prediction model are unknown. The most popular type of unsupervised learning algorithm is the clustering algorithm, which is used for clustering problems where data points are grouped into clusters with similar features. Figure (4) shows the steps of an unsupervised learning algorithm.

Figure (4): Unsupervised learning model [6]

An unsupervised learning clustering algorithm is only a one-step task.

Only one dataset is given as input to the model, instead of two datasets such as the training and test datasets. The outputs of the model are given as clusters or groups of data points, and each cluster represents one of the predicted labels.

Natural Language Processing

Natural language processing (NLP) is often used for processing textual data. NLP is also a branch of artificial intelligence that is responsible for making computers understand and process natural human languages, and NLP is frequently used in sentiment analysis as a preprocessing step. NLP is mainly used for lexical analysis tasks, such as part-of-speech (POS) tagging, named entity recognition (NER), and topic modeling (TM). In this study, some NLP techniques are used to preprocess the tweet data.

Prediction of Crimes

Introduction

The definition of crime by Merriam-Webster is an illegal act for which someone can be punished by the government [7], and criminology is the scientific field for identifying crimes and criminal characteristics. Crime analysis is the process of exploring the behavior of crimes, detecting crimes, and finding correlations between the crimes and the criminals. Within that process, many types of crime prediction techniques have emerged. The two major types of crime prediction are crime hotspot prediction and crime pattern prediction. Crime hotspot prediction measures the density rate of a specific crime within a specific boundary of a location, while crime pattern prediction is the process of predicting the type of crimes in a specific area and time.

Crime Hotspot Prediction

The police often use more patrols in areas that are more crime-prone, and these crime-prone areas with high crime rates are called crime hotspots. How do they know about these places? Some police officers are keenly aware of the areas with higher crime rates in the city. However, there are other ways than patrolling to identify these spots. Law enforcement employees who do not patrol, such as crime analysts, can detect these trends by two methods: (1) using maps and geographic systems, and (2) using statistical tests. These methods are based on the information available from the National Institute of Justice [8].

1. Maps: Specialists create density maps to monitor crimes and show where crimes occur within the boundaries of a given city. These maps also show the blocks with the most crimes. Analysts use geographic information systems (GIS) to visualize crime hotspots by combining street maps, data about crime and public disorder, and data about other features, such as stores and bus stops. The GIS shows these spots as grids with colors that identify the severity or concentration of the crimes in each cell of the grid. Crime maps are categorized into the following:

Points: They convey the exact location of a crime hotspot, which is an exact place where crimes usually occur on a regular basis.

Figure (5): Points in a grid map

Street segment: A line that shades an entire street location, meaning that the places along that street are crime-prone.

Figure (6): Street segment in a grid map

Shaded area: A shaded cell on the grid of the city map that shows the distribution of crimes. Each cell could represent a district, a suburb, or a cluster of blocks. If a cell is shaded, then it requires police attention, as it indicates that the crime concentration in that cell is high.

Figure (7): Shaded area in a grid map

Density surface: Represented as color gradients on a map. This shows the crime concentration: the inner, darker areas of the gradient are high-risk areas, whereas the outer, lighter colors mark lower-risk areas.

Figure (8): Density surface on a grid map

2. Statistical tests: The use of computer software to analyze crime data and geographical data to identify crime hotspots.

Other Crime Predictions

In addition to hotspot prediction, which mostly relies on clustering, there is also crime type prediction, which helps predict what type of crime will occur at a specific time and place. This study focuses on this type of prediction, and it is addressed in the methods section of this study.

Social Media Analysis

Introduction

Nowadays, online social media is being used to analyze and predict crimes. Police and law enforcement units collect data from online interactions on social media, such as Twitter, Facebook, and personal blogs. They try to follow gang members, criminal organizations, or suspected terrorists.

Twitter

Twitter is one of the top 10 microblogging websites in the world, with over 12 TB of data generated daily, according to Alexa's ranking website [9]. Its main form of interaction between users is called a "tweet". A tweet is a 140-character message that gets posted and shared either publicly or privately in the Twitter space.

On November 7, 2017, Twitter launched the use of 280-character tweets [10]. Users of all ages and nationalities post about 500 million tweets per day, and the latest character increase allowed users to double the length of their published thoughts and opinions.

Twitter Developer APIs

Twitter provides application program interfaces (APIs) for developers to manage Twitter data. These APIs connect to the Twitter server via HTTP operations. There are two APIs available on Twitter, the REST API and the Streaming API [11]. To use these APIs, the programmer or developer must first establish authentication with the server using OAuth credentials. OAuth is "An open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications" [12]. There are two types of OAuth credentials: application-only authentication and application-user authentication. Each user must register an application through his or her Twitter account, for which he/she has unique credentials.

The HTTP methods used in Twitter's API are GET and POST. GET is simply responsible for fetching data, while POST is responsible for sending data. Both GET and POST use Twitter API resources, such as GET search/tweets and POST statuses/update. This study involved searching for and collecting tweets; thus, only the GET search/tweets operation was needed. Both APIs are described below, followed by a short code sketch.

1. REST API: It allows the programmer to search terms or get tweets filtered based on specific parameters. The REST API does not provide live streaming data, but it is useful for analyzing historical data.

Figure (9): Twitter REST API [13]

2. Streaming API: It lets the programmer receive live streaming tweet data per request, and the API keeps sending data until stopped. This is useful when one wants to run analytics over live campaigns on Twitter rather than historical ones.

Figure (10): Twitter Streaming API [14]
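The following is a hedged sketch of both access styles using Tweepy, the library used in Appendix A; the key strings are placeholders, and the calls reflect the pre-4.0 Tweepy interface (newer versions rename api.search to api.search_tweets):

import tweepy

# User-context OAuth; the streaming endpoints require user authentication
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# 1. REST API: a GET search/tweets request over recent historical tweets
for tweet in api.search(q='Chicago', count=100):
    print(tweet.created_at, tweet.text)

# 2. Streaming API: live tweets are pushed until the connection is stopped
class PrintListener(tweepy.StreamListener):
    def on_status(self, status):          # called once per incoming tweet
        print(status.text)

stream = tweepy.Stream(auth=auth, listener=PrintListener())
stream.filter(track=['chicago'])          # live keyword filter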

Twitter Analysis Applications

Twitter data applications are increasing. Some of the most notable applications for this data are as follows:

- Extracting flu-related tweets to predict the flu epidemic trend in a region [15].
- Stock market prediction [16].
- Predicting the results of elections [17].

Literature Review

Crime Prediction

Beyond traditional techniques for crime prediction, there are also new research papers on using data mining techniques for the same purpose. A survey paper, "Survey on Crime Analysis and Prediction Using Data Mining Techniques," was published in 2017 [18]. In this paper, the authors split crime analysis procedures into two types based on the data mining approaches used: crime prediction by classification techniques and finding crime clusters by clustering techniques. This study uses classification techniques to predict the categories of crimes based on spatial and temporal features within a given location.

There is a method for crime prediction proposed by Sharma [19], a tool called Z-CRIME, that can be used for detecting suspicious emails by enhancing an ID3 decision tree algorithm. His enhanced method produces a faster and better decision tree. He improved the feature selection method of the tree based on an improved information entropy. Thus, the resulting algorithm is a combination of the improved ID3 decision tree algorithm and the improved feature selection method.

Another method, proposed by Hamdy et al. [20], is based on social media users' interactions and mobile usage data, such as locations and call logs. They used social feeds and mobile data to predict suspicious behaviors and movements of individuals. Their model, called the Criminal Act Detection and Identification Model, can help crime analysts make better decisions via a sequence of inference rules to perform behavioral analyses. Agarwal et al. [21] developed a tool for crime analysis using the k-means clustering algorithm. They used it to predict crime rates based on a spatial distribution of crime data and were able to predict homicide rates on a yearly pattern.

Twitter Prediction

The most notable work in this area is by Gerber [22]. His approach was based on using geo-tagged tweets, and he used kernel density estimation to predict crime types in various areas of Chicago. The results showed improvements in 19 out of 25 crime types. Then, Xinyu Chen et al. [23] cited Gerber's work and used it as a benchmark model of prediction. They based their approach on the sentiment polarity of the tweets and some weather data. Their results improved on the benchmark model. Our approach differs in that, rather than density estimation or clustering algorithms, classification algorithms are used in this work. Time-series types of data are still used, but the timestamps are treated merely as features in the classification task.

CHAPTER III

DATA

Datasets

The acquired data that served as inputs to the prediction model are the Chicago Data Portal crimes dataset for the year 2017 and the tweets collected for November and December 2017. A description of each is provided in the following sections.

Chicago Crimes Dataset 2017

This dataset was semi-structured as a comma-separated value (CSV) file. It was easily transformed into a structured pandas data frame object during the preprocessing stage of programming. Before preprocessing, the crime records for the months of November and December numbered 40,583. After preprocessing, that number was reduced to 39,225. Then, after the data integration process, the number was reduced further to 24,462.

November and December Tweets of 2017

Tweets in their raw form are considered unstructured data. However, since they were downloaded as JavaScript Object Notation (JSON) files and converted into CSV files, they became semi-structured data. Further transformations were performed during the preprocessing stage of coding, which transformed the tweets dataset into structured data in a pandas data frame format. In total, 121,853 raw tweets were collected. After preprocessing, 114,095 tweets remained. That number was reduced further after the integration process to 24,462 tweets to match the number of crimes.

First: Twitter Data

All Twitter data consist of several data objects and their attributes contained in a JSON text format. These data objects are the following:

- Tweet objects
- User objects
- Entities objects
- Extended entities objects
- Geospatial objects

This study is interested in the tweet text and geospatial objects only.

Tweet objects

Twitter's main data object is a tweet. A tweet is textual content that is shared on the Twitter microblogging website, and it can also contain images, URLs, and GPS locations. A tweet also contains many metadata contents, such as the tweet's time of publication, its owner, location, retweet count, and other information related to the tweet.

Geospatial objects

There are two types of geographical metadata:

1. Tweet location (also known as a geo-tagged tweet): A tweet is called geo-tagged when it is associated with a location. A tweet location can be either an exact Point location or a Twitter Place with a bounding box. These locations are represented as sets of longitude and latitude coordinates. About 1-2% of all tweets are geo-tagged using one of these methods, and these geospatial objects are saved as a data dictionary in a JSON format file.

2. User location: The location of the user's hometown, found in their public profile. However, this work is only interested in studying geo-tagged tweets.

Second: Chicago Crime Data

The city of Chicago, IL, was chosen as the city whose crime data would be used to evaluate our prediction model. Chicago has long been a center of attention for crime-related topics because it scores high in crime and never stays a low-crime-rate city for long. The reasons for choosing it as the crime city of our prediction approach are the following:

- Chicago has always had a higher average rate of crime than the United States overall, especially in violent crimes, such as homicide. US cities witnessed an increase in murder rates in 2016, but almost half of the murders occurred in Chicago.
- Availability of large amounts of publicly available (high-quality) crime datasets.
- In 2016, Chicago witnessed a horrific spike of 50 percent in the murder rate from the previous year, and it was ranked the 24th most dangerous city in the United States [24].

Third: Community Areas

The last data needed were shapefile data for all 77 community areas of Chicago. This file contains the geometric coordinates of points that form a multipolygon area on the map for each of these community areas. As mentioned earlier, the tweets located within a 15-mile radius of the center point of Chicago were downloaded. The Chicago city center point has a latitude of 41.881832 and a longitude of -87.623177. This shapefile was needed to determine the origin of each tweet among the 77 community areas.
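A minimal sketch of how a tweet's coordinates can be matched to one of the 77 community-area polygons; geopandas and shapely are assumed library choices (the thesis does not name its geometry tooling), and the file and column names are hypothetical:

import geopandas as gpd
from shapely.geometry import Point

areas = gpd.read_file('chicago_community_areas.shp')   # hypothetical path

def community_of(tweet):
    if not tweet.get('coordinates'):                   # not geo-tagged
        return None
    lon, lat = tweet['coordinates']['coordinates']     # GeoJSON order: [lon, lat]
    hits = areas[areas.contains(Point(lon, lat))]
    return hits.iloc[0]['area_number'] if len(hits) else None  # hypothetical column

# The Chicago center point given above
print(community_of({'coordinates': {'coordinates': [-87.623177, 41.881832]}}))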

Data Collection Process

Twitter's developer policy prohibits the distribution of any data downloaded from Twitter as a data source. Consequently, no historical tweets were available for download from within the boundaries of Chicago. Twitter strictly states on its Developer Terms page, under the Redistribution of Twitter Content term, that it does not allow developers who previously downloaded Twitter content to share more than 50,000 tweets with any group or individual. Developers are also not allowed to make the collected data publicly available [25]. For these reasons, a downloading schedule was used to collect the tweets manually.

The Downloading Time Schedule

A Tweet Downloader program written in Python using the Twitter REST API and the Tweepy library [Appendix A] was used. The downloader was run on Monday and Tuesday of each week, starting from November 1, 2017, until December 31, 2017. Downloading the previous week's tweets took up to a day and a half, that is, about 36 straight hours. The downloader stores the tweets backward, starting from the most recent tweet and going toward the oldest tweet of the previous week. The training window was a two-month period from November 1, 2017, to December 31, 2017. The first tweet collected was posted at 10:11:12 a.m. Chicago time, and the last tweet collected was posted exactly at midnight, December 31, 2017.

Building the Downloader

To build the downloader program, the following steps were followed:

1. Create a Twitter account by signing up.

2. Create an application by registering one connected to your Twitter account.

3. Once the application is created, get the consumer key, consumer secret, access token, and access token secret. Store these in a safe place or in a confidential text file.

4. Now that these properties are stored on a computer in a JSON file format, they can be used to authenticate a connection to the Twitter data source. This authentication occurs directly using Twitter's APIs or indirectly by using helper libraries such as the one we used, Tweepy, a Python module that provides access to Twitter's RESTful API methods.

Registering a Twitter Application

To retrieve tweets through our Tweet Downloader program, either the user authentication keys and secrets or the application authentication keys and secrets can be used. Application authentication was used because more tweets can be downloaded per given window of time. The rate limit for the number of requests allowed per 15-minute interval for the application authentication method is 450 requests, while the user authentication method allows only 150 requests per window. This rate limit applies only to the search/tweets operation, which was used for filtering and gathering tweets. Consequently, application authentication can run 300 more requests than user authentication, which is the main reason an application was created and its authentication was used for collecting tweets. If the rate limit is reached before the window ends, then the program waits for the subsequent window to start.
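The quota check itself is a single call, as the Appendix A downloader performs after each batch; `api` here is an authenticated tweepy.API object:

status = api.rate_limit_status()
remaining = status['resources']['search']['/search/tweets']['remaining']
print('Requests left in this 15-minute window:', remaining)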

Collecting Geo-Tagged Tweets

To collect tweets that contain location data within the Chicago area, the following information is needed (a short sketch follows this list):

- The coordinates (longitude and latitude) of Chicago.
- An understanding of the geographic information provided as metadata within the tweets, such as geocode, reverse geocode, coordinates, bounding box, and place.
- Search criteria to filter the tweets we wanted to download. Our search criteria were to collect tweets located within the longitude and latitude box of Chicago and published between November 1 and December 31, 2017.
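A hedged sketch of these criteria, close to the Appendix A downloader: an application-authenticated client paging through tweets within 15 miles of the Chicago center point (key strings are placeholders):

import tweepy

auth = tweepy.AppAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

geocode = '41.8336,-87.6718,15mi'            # latitude,longitude,radius
for tweet in tweepy.Cursor(api.search, q=' ',
                           geocode=geocode, count=100).items(2300):
    if tweet.geo is not None:                # keep only geo-tagged tweets
        print(tweet.id, tweet.created_at)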

CHAPTER IV

METHODS

Data Preprocessing

Two different datasets were used: the crime data and the Twitter data. The crime dataset was already cleaned and semi-structured, and it needed only a small amount of preprocessing to obtain the final dataset. In contrast, the collected Twitter data were raw.

First: Tweets

After collecting the raw Twitter data, the data were cleaned and prepared for analysis and prediction. The steps taken to preprocess the data and obtain a cleaned dataset ready for analysis are the following (a condensed sketch follows this list):

1. Converting the raw tweets from JSON-format data to a pandas data frame structure.

2. Dropping duplicate tweets, keeping the oldest tweet of each duplicate group; this excludes spam, advertisements, and retweets of the original tweets.

3. Dropping samples with null features.

4. Changing the time zone. The tweets are stored with their own metadata, one category of which is the tweet's time of posting. The time is saved in the GMT time zone and needed to be changed to the US Central time zone, i.e., Chicago's time zone.

5. After setting the time zone, the tweets were filtered to obtain only those published between 00:00:00 November 1, 2017, and 23:59:59 December 31, 2017.
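A condensed pandas sketch of steps 1-5; `raw` stands for the list of tweet dictionaries loaded from the downloader's JSON output, and the column names are assumptions:

import pandas as pd

df = pd.json_normalize(raw)                                  # step 1
# step 2: the downloader stores newest first, so the last duplicate
# of each text is the oldest occurrence
df = df.drop_duplicates(subset='text', keep='last')
df = df.dropna(subset=['geo.coordinates'])                   # step 3
df['created_at'] = (pd.to_datetime(df['created_at'], utc=True)
                      .dt.tz_convert('US/Central'))          # step 4
start = pd.Timestamp('2017-11-01 00:00:00', tz='US/Central')
end = pd.Timestamp('2017-12-31 23:59:59', tz='US/Central')
df = df[df['created_at'].between(start, end)]                # step 5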

After those steps, the preprocessing of the tweets was finished. The preprocessing resulted in 114,095 tweets in total, down from the 121,853 unprocessed tweets originally collected. The next step was the extraction of new features and the dropping of unwanted columns.

Second: Crimes

The Chicago Data Portal website had two datasets of crimes: one held the homicide records from 2001 to the present, and the other held all other types of crimes recorded from 2001 to the present. The crimes were filtered to the year 2017 and then downloaded through the export interface provided on the website. After downloading the two 2017 crime datasets from the Chicago Data Portal, the data were semi-structured. However, some preprocessing steps were still needed (a short sketch follows this list):

1. Reading the CSV file and converting the data into a pandas data frame object.

2. Concatenating the two datasets, the crimes and the homicides, into one dataset, and then sorting the samples by date/time.

3. Filtering the crimes and keeping only those recorded during the months of November and December 2017.

After these steps, the crimes dataset was preprocessed and contained 40,583 records of crimes. However, after the data integration process, that number was reduced to 24,462 records. The next step was extracting new features and dropping unwanted columns.
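A short sketch of these three steps; the file names are illustrative, and 'Date' is assumed to be the date column in the portal's CSV export:

import pandas as pd

crimes = pd.read_csv('chicago_crimes_2017.csv')              # step 1
homicides = pd.read_csv('chicago_homicides_2017.csv')

full = pd.concat([crimes, homicides], ignore_index=True)     # step 2
full['Date'] = pd.to_datetime(full['Date'])
full = full.sort_values('Date')

full = full[full['Date'].dt.month.isin([11, 12])]            # step 3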

Feature Extraction

Our original datasets had many different features. Many of them had to be discarded, and only the useful ones were kept. Each dataset underwent numerous feature extractions and droppings.

First: Tweets

Based on the tweet's time of posting, several time-related features were extracted, such as the day of the week, the time, the binned time of day (morning, afternoon, evening, and night), and the month (11 or 12); the date was maintained as a datetime object. Binning the time of day was based on dividing the 24 hours of the day into the following four parts (a sketch of this binning follows the list):

- Morning: from 6:00 a.m. to 11:59 a.m.
- Afternoon: from 12:00 p.m. to 5:59 p.m.
- Evening: from 6:00 p.m. to 11:59 p.m.
- Night: from 12:00 a.m. to 5:59 a.m.

After extracting those time features, their values were converted from categorical to numerical values. The most important features already present in the raw tweets are the time and date of when the tweet was created, the coordinate features, and the tweet text. Thus, missing values (when one of these three features was missing) were handled by deleting that sample from the data. Most of the samples did contain the time and text; consequently, only the samples missing the geo/location coordinates were deleted.
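A sketch of the binning and the other temporal extractions, assuming the preprocessed tweets frame from the earlier sketch:

def time_bin(hour):
    if 6 <= hour < 12:
        return 1          # morning
    if 12 <= hour < 18:
        return 2          # afternoon
    if 18 <= hour < 24:
        return 3          # evening
    return 4              # night (midnight to 5:59 a.m.)

df['hour'] = df['created_at'].dt.hour
df['week_day'] = df['created_at'].dt.dayofweek + 1   # 1 = Monday ... 7 = Sunday
df['month'] = df['created_at'].dt.month              # 11 or 12
df['times_number'] = df['hour'].apply(time_bin)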

There were several coordinate systems represented within the tweet data, and only the geo coordinates of each tweet were kept; the longitude and latitude features were then extracted from them. There was no point in having multiple coordinate features, so only one coordinate system and its features were needed.

Finally, the original raw tweets data had 176 features. That number was reduced to only three columns. Next, seven new features were extracted. Therefore, the following 10 features were obtained in our tweets dataset: text, created_at, date, time, hour, day, month, geo.coordinates, longitude, and latitude. Only those features were needed for the data integration process. Afterward, only the text was needed to extract textual features. In the end, the most important information from our tweets dataset was the text. WEKA, a data mining software package, was used to preprocess the text, as explained further in the tweets corpus section.

The other features in the tweets helped correlate the tweets data with the crimes data. Both the tweets and the crimes datasets were then combined into one dataset with the same features. However, when predicting crimes without using the features extracted from the tweet text (1,000 features), these features were simply excluded from our full dataset.

Second: Crimes

From the original features, some temporal features of the crimes dataset were extracted; they were exactly the same as the five temporal features extracted for the tweets dataset. There were several spatial features in the dataset, such as districts, wards, blocks, location description, and community areas.

Since the shapefile for the community areas of Chicago was already obtained, the community area feature was used along with the location points feature, the latitude feature, and the longitude feature, and the other location features were deleted. The most important feature of the crimes dataset was the primary type feature, i.e., the exact crime committed in that recorded sample. This feature serves as the target value for our prediction process, and it was stored as string categories. These were then converted to numerical values for use in our predictive model and its evaluation.

The original crimes data from the Chicago Data Portal website consisted of 22 features, and the number of original features was reduced to nine. Then, the five temporal features extracted from the data were added. Consequently, the total number of features in the crimes dataset is 14. After integrating the crimes and tweets datasets into one full dataset, nine features came from the crimes and 1,000 features came from the tweets. The nine crime features are presented in Table (1):

Table (1): Descriptions of crimes features

  Week_day       - An integer that represents the day of the week, starting with 1 as Monday and ending with 7 as Sunday.
  Time           - The hour of the day in a 24-hour format.
  Times Number   - The time bin of the day in numbers, as explained earlier.
  Domestic       - A Boolean value that indicates whether an incident was domestic-related.
  Arrest         - A Boolean value that indicates whether an arrest was made.
  Beat           - The beat number where the incident occurred. A beat is a small police geographic area; each beat has a dedicated police car.
  Community Area - A number from 1 to 77 that indicates the community area.
  Longitude      - The longitude of the location of the crime.
  Latitude       - The latitude of the location of the crime.

Exploratory Data Analysis

This section presents some exploratory data analysis to explain both our Chicago tweets and crimes datasets.

First: Tweets

Figure (11): Number of tweets posted during every day of November and December 2017

Figure (11) shows a line plot of the number of tweets posted on each day of November and December 2017 within a 15-mile radius of the center of Chicago. December 11, Christmas Eve, and Christmas Day had the lowest numbers of tweets posted, while there were spikes in the number of tweets on November 11 and 12 and December 1 and 31.

Figure (12): Number of tweets posted during each day of the week during November and December 2017

Next, a histogram of the number of tweets per day of the week is plotted in Figure (12). The figure shows that most tweets were posted on Saturdays, while fewer were posted on Mondays and Tuesdays. Most of the tweets were posted on weekend days.

Figure (13): Scatter plot of all collected tweets within a 15-mile radius of the Chicago center point

Figure (13) shows a scatter plot of tweets within a 15-mile radius around the center point of Chicago. Most of the points are in the metro area of Chicago.

Second: Crimes

There are 30 categories of crimes in the crimes dataset. These categories are the target classes predicted in this work; predicting crime types is, therefore, a multi-class problem. These crime types and their frequencies are presented in Table (2).

Some of the type frequencies are very low, whereas others are very high, indicating that this data is not random.

Table (2): Crime frequencies per category of crime

The plots in Figure (14) show the number of crimes per category. Some of these crimes occurred only once or twice during our two-month period.

Figure (14): Crime counts per category of crime

These crimes and their frequencies in each of the 77 community areas of Chicago are presented in Table (3).

Table (3): Crime frequencies per community area

In addition, Figure (15) shows the density of total crimes per community area, represented as a scatter plot. Larger circles indicate higher crime rates in an area, and the three largest circles indicate the areas where crimes occurred most frequently. Two of these areas are in the metro area of Chicago. Area 25, West Town, had the highest crime rate in Chicago, followed by area 32, South Lawndale, and area 28, East Garfield Park.

Figure (15): Crime density per community area as scatter plot density circles

Figure (16) shows a line plot of the number of crimes that occurred on each day of November and December 2017. An interesting observation was that on December 25, the number of crimes was significantly low. The subsequent days, up until the end of the year, also had low crime counts compared to the other days.

Finally, on Thanksgiving, which was November 22, the number of crimes was slightly higher than during the days following Christmas Day; however, that day still had a low crime rate. In conclusion, crimes in Chicago are less likely to happen during holidays.

Figure (16): Number of crimes reported during every day of November and December in 2017

Tweets Corpus

Dealing with boundaries

Before preprocessing the tweet text, community area names and numbers were assigned to the tweets. This was performed by checking the longitude and latitude of each tweet to determine whether it belonged to a certain community area in Chicago. Among the 114,095 tweets collected within the 15-mile radius of Chicago, there were 105,008 geo-tagged tweets within the boundaries of the 77 community areas.

Text preprocessing

The texts of the 105,008 tweets were used as a corpus on which to perform text preprocessing; each tweet text served as a document in the corpus. For preprocessing the tweet texts, WEKA version 3.8.2 was used [26]. WEKA is a popular data mining software package in Java, developed by the University of Waikato, and it can be used either through the command line or the GUI. It is a fast and efficient way of preprocessing text.

First, the corpus was converted from CSV format to the attribute-relation file format (ARFF), the data format suitable for WEKA. Second, the ARFF file was loaded into WEKA, and its StringToWordVector filter was used to process the text, with several text preprocessing techniques set as parameters to the filter. The parameters set are the following:

- Stemming: transforms each word or verb back to a crude heuristic root form, which ends with the word being chopped to a shorter stem.
- Stop-word removal: the process of removing English stop words, which are commonly used words that can be ignored.
- Lowercasing: the process of converting all the letters in the words to lowercase.
- Words to keep: since including every single word from the tweet texts would result in a large data file, this option was set to 1,000. Consequently, the 1,000 most common words in the corpus were used.

Finally, the filter was applied, resulting in 1,000 useful words serving as features extracted for each tweet in our corpus.

Each of these features contains either a 1, indicating the presence of a word, or a 0, indicating the absence of a word, for each corresponding sample row. Thus, a feature-by-sample matrix of size 105,008 by 1,000 was constructed. Once the tweets had been tokenized and filtered, they were saved as a CSV file and then converted into a pandas data frame with the word features as the columns and the samples as the rows.

Data Integration

There was a challenge when trying to fuse the crimes data with the tweet text features: there were many more tweets than crime samples. Thus, a naïve approach for the data integration of these two datasets was developed [Appendix B]. The approach involved finding the tweets that were posted within a specific community area and time bin, resulting in a general, but faster, correlation with a wide range of recorded crimes. After integrating the crimes with the tweets data, the number of resulting samples was 24,462. The number of samples was much smaller than the 39,225 crime records and the 105,008 tweets because crimes were often more active in different areas and time bins than the geo-tagged tweets, and vice versa. After combining the two datasets, the crimes and the tweets, into one dataset, there were a total of 1,009 features and 24,462 samples to study.
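The thesis performed the vectorization in WEKA; the sketch below is only an approximate scikit-learn equivalent (stop-word removal, lowercasing, a 1,000-word vocabulary, binary presence features; stemming is omitted), followed by a rough stand-in for the naive integration on community area and time bin. Column names are assumptions, and the actual Appendix B logic may differ:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vec = CountVectorizer(max_features=1000, stop_words='english',
                      lowercase=True, binary=True)      # 1/0 word presence
terms = pd.DataFrame(vec.fit_transform(tweets['text']).toarray(),
                     columns=vec.get_feature_names_out(),
                     index=tweets.index)
tweets_full = pd.concat([tweets, terms], axis=1)

# Pair each crime with tweets posted in the same community area and
# time bin (shown only to illustrate the join keys; a real pass would
# also deduplicate the resulting pairs)
merged = crimes.merge(tweets_full,
                      on=['community_area', 'times_number'],
                      how='inner')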

Predictive Model

Train and test datasets

After preprocessing the data and obtaining our final prepared dataset, the data were first split into two parts: a 20% validation set and an 80% training-and-testing set. On the remaining 80% of the data, 10-fold cross-validation splitting was performed using a Python library called scikit-learn [27]. Thus, the training and test data were split into 10 equal partitions, called folds, with shuffling, and each fold was used in our training and prediction tasks during parameter tuning to determine which parameters resulted in the best testing accuracy score. Then, the best tested parameters for each model were used to perform predictions on the validation set.

Our experiments

Next, our experiments were performed. Our baseline experiment used the crimes data features as inputs to five classification models: logistic regression (LR), naïve Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM), and artificial neural network (ANN). Our second experiment used the crime features along with the extracted tweet features as inputs to the same five models. The third experiment was not a standalone experiment; it can be seen as a sixth model used in both of the previously mentioned experiments. It is called ensemble learning (EL), which is also a supervised learning technique, just like the other five classification models. Thus, this experiment was included as a sixth classifier in both experiments one and two.

The model parameters

Before running the experiments, the 10-fold cross-validation split and shuffle was used to tune and test the parameters of each model. The following are the models and the best parameters chosen for prediction accuracy (a sketch of this protocol follows the parameter descriptions):

Logistic regression

Logistic regression is a linear model. The significant parameter is called C, which is the inverse of the regularization strength; the scikit-learn library provides a regularized logistic regression model.

C was set to 0.1, 1.0, and 10, and the results were best when C was equal to 1.0. The strategy used for this model is the default one-vs-rest scheme.

Gaussian naïve Bayes

Gaussian naïve Bayes is a probabilistic classifier based on Bayes' theorem. No parameters are needed here.

K-nearest neighbor

K-nearest neighbor is used both for classification, which is the objective of this work, and for regression. The most significant parameter is k, the number of neighbors. The best k can be determined by performing a grid search; however, a common heuristic for a good k value is the square root of the dataset size. Our cross-validation splits currently had 17,613 samples, whose square root is about 133. This model was trained and tested with lower and higher k values, but it had the best results when k was 133.

Support vector machines

There was not sufficient time to test the full range of parameters for SVM; a small number of parameters were tested so the runs would not take hours. The penalty parameter C, also called the misclassification parameter, was set to 1, 5, and 10 on a nonlinear kernel called the radial basis function (RBF). The best results were obtained when C was equal to 5. The gamma parameter is set to auto by default, which means setting gamma to 1/n_features. In the first experiment we have 9 features, so gamma is 0.1111; in the second experiment we have 1,009 features, so gamma is 0.00099.
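A sketch of the protocol and the first-experiment models with the reported best parameters; X and y stand for the prepared feature matrix and crime-type labels:

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.2)  # 20% validation
cv = KFold(n_splits=10, shuffle=True)

models = {
    'LR': LogisticRegression(C=1.0),              # one-vs-rest by default
    'NB': GaussianNB(),                           # no parameters to tune
    'KNN': KNeighborsClassifier(n_neighbors=133), # ~sqrt(17,613)
    'SVM': SVC(C=5, kernel='rbf', gamma='auto'),  # gamma = 1/n_features
}
for name, model in models.items():
    scores = cross_val_score(model, X_dev, y_dev, cv=cv)
    print(name, scores.mean())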

Artificial neural network

ANN is a deep learning technique that can be used for classification tasks. In our study, we trained and tested this model with several parameter settings: one, two, three, or four hidden layers with 10, 20, 30, 40, 50, 60, or 100 neurons each; initial learning rates of 0.01, 0.1, and 0.2; and maximum iterations of 200, 400, 500, 700, and 1,000. We chose the rectified linear unit (ReLU) as our activation function. The best parameters for this model were three hidden layers of 30, 60, and 100 neurons, respectively, an initial learning rate of 0.01, and a maximum of 500 iterations.

Ensemble learning

Ensemble learning is a collection of models working together to give one output result, and there are many types of ensemble learning techniques, such as bagging, boosting, stacking, voting, and many others [28]. In our study, the max-voting technique was used, also called hard voting, which compares the target value predicted for each sample across the different classification models and chooses the one with the maximum number of votes. There were 30 categories of crimes, which were the classes in the target of the dataset that needed to be predicted.

Model evaluation metrics

After acquiring the results of each experiment, the results were compared using the following classification evaluation methods (a sketch combining the ANN, the ensemble, and these metrics follows the metric descriptions):

Accuracy score

The accuracy score is the ratio of correctly predicted samples. In other words, it is the number of true values, both true positives and true negatives, over the total number of samples.

By calculating that, the accuracy score of the classification is obtained. This is a relatively good scoring metric, especially since the true values, not the negatives, are what we want. The formula for this measure is:

accuracy = (true positives + true negatives) / (all samples)

Confusion matrix

A generic example of a confusion matrix is shown in Table (4):

Table (4): Interpretation of the confusion matrix

                      Predicted Positive        Predicted Negative
  Actual Positive     True Positive (TP)        False Negative (FN)
  Actual Negative     False Positive (FP)       True Negative (TN)

The confusion matrix is one of the most frequently used visualizations of performance for supervised learning algorithms. It can be considered a kind of visualization of the classification report.

Classification report

The classification report is a way to present the metric scores for each crime type in the target label. It is also a descriptive representation of the confusion matrix. Our classification problem is a multi-class, single-label problem; thus, the classification report presents several scoring metrics for each one of these classes. The classification report presents the following four types of metrics:

Precision: The number of true positive samples divided by the number of all predicted positive samples. In other words, it is the ratio of correct positive predictions to all positive predictions. The formula is:

precision = true positives / (true positives + false positives)

Recall: Also called the true positive rate, recall is the ratio of correctly predicted positive samples to all actual positive samples. It is the number of true positives divided by the number of true positives plus false negatives. The formula is:

recall = true positives / (true positives + false negatives)

Both precision and recall are good for evaluating biased class distributions because they focus more on the performance on positive samples than on negative ones.

F1 score: The harmonic mean of precision and recall; this measure considers both false positives and false negatives, and it is usually more informative in prediction tasks with an uneven class distribution, as is the case with our data. The formula is:

F1 = 2 * (precision * recall) / (precision + recall)

Support: This is not an evaluation metric itself but supportive information for comparing scores. The support is the number of actual samples for each class, i.e., the actual number of crimes in each category.
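A sketch tying the remaining pieces together: the ANN with the reported best parameters, the hard-voting ensemble over all five classifiers, and the metrics above; `models` and the data splits are carried over from the previous sketch:

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

models['ANN'] = MLPClassifier(hidden_layer_sizes=(30, 60, 100),
                              activation='relu',
                              learning_rate_init=0.01, max_iter=500)

ensemble = VotingClassifier(estimators=list(models.items()),
                            voting='hard')               # max voting
ensemble.fit(X_dev, y_dev)
y_pred = ensemble.predict(X_val)

print(accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))   # precision, recall, F1, support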

CHAPTER V

RESULTS AND DISCUSSION

Experiments

The first experiment was our baseline experiment. It took only the crime features as input features, and its results were compared with our second experiment, in which the tweet features were added. The second experiment used both the nine crime features and the 1,000 tweet text features, which were called terms. This section first shows the results of each model for each crime type; then the overall scores of the models for predicting all crime categories are compared; lastly, the per-crime scores of the best model are compared.

First experiment

Logistic regression

The overall performance of logistic regression on the validation set was an accuracy of 61%. The precision, recall, and F1 scores are shown in Table (5). Observing the diagonal of the confusion matrix in Figure (17) shows that this model worked well at classifying the CRIMINAL DAMAGE type of crime: it predicted 1,483 samples of that crime as true positives out of a total of 1,515 actual positives, thus mispredicting only 32 samples. That crime category had the largest number of samples in the validation set; perhaps it had a better chance of being predicted correctly because there were many samples of this type. In addition, this model failed to detect any crime type with few samples.

Figure (17): Confusion matrix for the logistic regression model of the first experiment

Table (5): Classification report for the logistic regression model of the first experiment

Gaussian naïve Bayes

The overall accuracy of this model was 86%. Based on the diagonal of the confusion matrix in Figure (18), this model predicted the true positives well. It especially predicted class 1, CRIMINAL DAMAGE, and class 10, DECEPTIVE PRACTICE, well, with F1 scores of 99% and 81%, respectively. Table (6) shows that these two types of crimes also had the highest counts within this set. NB also predicted classes with lower numbers of crimes well; for instance, it predicted 100% of the true positives of PROSTITUTION with only six samples available in our validation set.

Figure (18): Confusion matrix for the naïve Bayes model of the first experiment

Table (6): Classification report for the naïve Bayes model of the first experiment

K-nearest neighbor

The overall accuracy of KNN was 31%, a poor accuracy score. Based on the diagonal of the confusion matrix in Figure (19), most of the predictions fell into the CRIMINAL DAMAGE category, possibly because the largest number of crimes in our validation set were CRIMINAL DAMAGE crimes. In addition, this model failed to predict any category other than CRIMINAL DAMAGE, ASSAULT, and DECEPTIVE PRACTICE. In other words, it failed to predict any class other than the three major classes in this set.

Figure (19): Confusion matrix for the k-nearest neighbor model of the first experiment

Table (7): Classification report for the k-nearest neighbor model of the first experiment

Support vector machines

The overall performance accuracy score was 32%, which was 1% better than the KNN performance but still poor. The confusion matrix in Figure (20) shows that SVM produced results similar to KNN; the prediction scores were similar even for the same crime categories. The only difference was that this model predicted fewer values in categories other than the three major ones, based on the classification report in Table (8).

Figure (20): Confusion matrix for the support vector machines model of the first experiment

Table (8): Classification report for the support vector machines model of the first experiment

Artificial neural network

The overall accuracy score for ANN was 53%. Thus far, the best model was NB with an accuracy of 86%, followed by logistic regression with an accuracy of 61%, and now ANN with 53% accuracy. This model also predicted well for crime types with a high number of crimes, but it had poor prediction performance on types with a low number of crimes, as seen in the confusion matrix in Figure (21) and the classification report in Table (9).

Figure (21): Confusion matrix for the artificial neural networks model of the first experiment

Table (9): Classification report for the artificial neural networks model of the first experiment

Ensemble learning

Ensemble learning used the predictions of all five previous models and set the prediction result to the most-voted value for each sample in the validation set; this technique is called max voting. The overall performance of this model was 52%, which was better than the accuracy scores of KNN and SVM but worse than the scores of logistic regression, NB, and ANN. This model's scores were very similar to the ANN scores, based on the results of the confusion matrix in Figure (22) and the classification report in Table (10).

Figure (22): Confusion matrix for the ensemble learning model of the first experiment

Table (10): Classification report for the ensemble learning model of the first experiment

Accuracy score and classification report scores for all five models and the ensemble learning model

Table (11) shows that the model with the best accuracy score was the naïve Bayes classifier, with a score of 86%, followed by logistic regression with an accuracy of 61%. The worst two models in this experiment were the k-nearest neighbor classifier and the support vector machines, with accuracy scores of 31% and 32%, respectively.

Table (11): Performance metrics for the first experiment for the models

  Model      Accuracy   Precision   Recall   F1 score
  LR         0.61       0.53        0.61     0.53
  NB         0.86       0.89        0.86     0.86
  KNN        0.31       0.16        0.31     0.19
  SVM        0.32       0.46        0.32     0.20
  ANN        0.48       0.50        0.48     0.41
  Ensemble   0.52       0.56        0.52     0.44

Second experiment

Logistic regression

As in the previous experiment, logistic regression predicted the crimes with the highest occurrences better than those with the lowest. However, the accuracy score in the second experiment, after adding the 1,000 tweet features as inputs to the model, improved significantly: it is now 81%, compared to 61% in the previous experiment when only the crime features were used as inputs. Figure (23) and Table (12) show detailed scores for each category of crime.

Figure (23): Confusion matrix for the logistic regression model for the second experiment

Table (12): Classification report for the logistic regression model for the second experiment

Gaussian naïve Bayes

NB also scored a higher accuracy in this experiment than in the first: here it has an accuracy of 96%, compared with 86% previously, a 10% increase in the accuracy score of this model. Figure (24) and Table (13) show detailed scores for each crime category.

Figure (24): Confusion matrix for the naïve Bayes model for the second experiment

Table (13): Classification report for the naïve Bayes model for the second experiment

K-nearest neighbor

The overall accuracy of KNN was 31%, identical to the model's performance in the first experiment.

Figure (25): Confusion matrix for the k-nearest neighbor model for the second experiment

Table (14): Classification report for the k-nearest neighbor model for the second experiment

Support vector machines

The overall accuracy performance of SVM was 47%. It was 15% higher than the previous score in the first experiment. This was a big improvement after adding the tweet features into the model along with the crime features.

Figure (26): Confusion matrix for the support vector machines model for the second experiment

Table (15): Classification report for the support vector machines model for the second experiment

Artificial neural network

ANN scored an accuracy of 47% in this experiment, whereas it scored 53% in the previous experiment. This model did not improve; instead, it showed a decrease in performance when the 1,000 text features were added as inputs.

Figure (27): Confusion matrix for the artificial neural networks model for the second experiment

Table (16): Classification report for the artificial neural networks model for the second experiment

Ensemble learning

The overall accuracy score of ensemble learning in the second experiment was 60%, which was 8% higher than the accuracy of the same model in our first experiment. Thus, the model improved after using the tweet features.

Figure (28): Confusion matrix for the ensemble learning model for the second experiment

Table (17): Classification report for the ensemble learning model for the second experiment

Accuracy score and classification report scores for all five models and the ensemble learning model

Table (18) shows that the model with the best accuracy score in this experiment was again the naïve Bayes model, just as in the first experiment. It was followed by logistic regression with a score of 81% and ensemble learning with a score of 60%. Comparing the ensemble learning results in the first and second experiments, with accuracies of 52% and 60%, respectively, ensemble learning was not the best model in terms of accuracy, but it was also not the worst.

The ensemble learning model in both experiments had a middle accuracy score. In contrast, the best model for our crime classification task was Gaussian naïve Bayes.

Table (18): Performance metrics for the second experiment

  Model      Accuracy   Precision   Recall   F1 score
  LR         0.81       0.80        0.81     0.80
  NB         0.96       0.96        0.96     0.96
  KNN        0.31       0.16        0.31     0.19
  SVM        0.27       0.23        0.27     0.24
  ANN        0.47       0.31        0.38     0.27
  Ensemble   0.60       0.74        0.60     0.55

Discussion

Comparing the results of the two experiments

Table (19) shows the accuracy scores of the two experiments side by side. There was a significant improvement in the logistic regression model: its accuracy score increased by 20% in the second experiment compared to the first. There were also increases in the performance of the NB and ensemble learning models, by 10% and 8%, respectively. In contrast, there were slight decreases in the performance of the SVM and ANN models, whose accuracy scores dropped by 5% and 1%, respectively. Lastly, there was no change in the performance of the KNN model; it stayed at an accuracy score of 31%.

Table (19): A comparison of the accuracy scores of the first and second experiments

  Model                      Experiment 1   Experiment 2   Rate change
  Logistic Regression        0.61           0.81           +0.20
  Gaussian Naïve Bayes       0.86           0.96           +0.10
  K-Nearest Neighbor         0.31           0.31            0.00
  Support Vector Machines    0.32           0.27           -0.05
  Artificial Neural Network  0.48           0.47           -0.01
  Ensemble Learning          0.52           0.60           +0.08

CHAPTER VI

CONCLUSION AND FUTURE WORK

Conclusion

Comparing the accuracy scores of both experiments side by side shows that the accuracy of our second experiment surpasses that of our first experiment for the logistic regression, NB, and ensemble learning models, with increases of 20%, 10%, and 8%, respectively. Furthermore, the decreases in accuracy between the two experiments for the SVM and ANN models are very low, at only 5% and 1%, respectively. This result supports our hypothesis: adding the 1,000 features collected from the tweets of Chicago to the nine crime features significantly increased the performance of predicting the category of crimes. Overall, an accuracy as high as 96% was achieved using the Gaussian naïve Bayes classifier with the nine crime features and the 1,000 features from the geo-tagged tweets as inputs to the model.

Future Work

Our approach could be improved by finding more advanced ways to integrate the tweets with the crimes. In addition, time-series analysis could be used to predict the time and place of upcoming crimes, rather than just predicting their numbers of occurrences per category. That approach would involve working with clustering techniques.

REFERENCES

[1] "Criminology," Merriam-Webster. [Online]. Available: https://www.merriam-webster.com/dictionary/criminology. [Accessed: 01-Jul-2018].

[2] City of Chicago, "City of Chicago Data Portal," 2018. [Online]. Available: https://data.cityofchicago.org/. [Accessed: 14-Jul-2018].

[3] "Developer Policy," Twitter. [Online]. Available: https://developer.twitter.com/en/developer-terms/policy.html. [Accessed: 01-Jul-2018].

[4] R. Mulkar, "Machine Learning vs. Deep Learning," Dr. Rutu Mulkar ML. [Accessed: 01-Jul-2018].

[5] "Machine Learning," MathWorks. [Online]. Available: https://www.mathworks.com/discovery/machine-learning.html. [Accessed: 01-Jul-2018].

[6] "Supervised learning." [Online]. Available: https://morganpolotan.wordpress.com/tag/supervised-learning/. [Accessed: 01-Jul-2018].

[7] "Crime," Merriam-Webster. [Online]. Available: https://www.merriam-webster.com/dictionary/crime. [Accessed: 02-Jul-2018].

[8] "Identifying Hot Spots," National Institute of Justice. [Online]. Available: https://www.nij.gov/topics/law-enforcement/strategies/hot-spot-policing/pages/identifying.aspx. [Accessed: 29-Mar-2018].

[9] "twitter.com Traffic Statistics," Alexa. [Online]. Available: https://www.alexa.com/siteinfo/twitter.com. [Accessed: 01-Apr-2018].

[10] "Twitter officially expands its character count to 280 starting today," TechCrunch, 07-Nov-2017. [Online]. Available: http://social.techcrunch.com/2017/11/07/twitter-officially-expands-its-character-count-to-280-starting-today/. [Accessed: 01-Apr-2018].

[11] "Building With the Twitter API: Getting Started," Envato Tuts+. [Online]. Available: https://code.tutsplus.com/tutorials/building-with-the-twitter-api-getting-started--cms-22192. [Accessed: 02-Apr-2018].

[12] "OAuth." [Online]. Available: https://oauth.net/. [Accessed: 02-Apr-2018].

[13] "Building With the Twitter API: Getting Started," Envato Tuts+. [Online]. Available: https://code.tutsplus.com/tutorials/building-with-the-twitter-api-getting-started--cms-22192. [Accessed: 02-Apr-2018].

[14] S. Sola, "Playing with Twitter Streaming API," Medium, 23-Nov-2016. [Online]. Available: https://medium.com/@ssola/playing-with-twitter-streaming-api-b1f8912e50b0. [Accessed: 02-Apr-2018].

[15] A. Culotta, "Towards detecting influenza epidemics by analyzing Twitter messages," in Proceedings of the First Workshop on Social Media Analytics (SOMA '10), 2010.

[16] Web Intelligence, vol. 15, no. 1, pp. 1-17, 2017.

[17] In 2017 9th Computer Science and Electronic Engineering Conference (CEEC), 2017.

[18] H. B. F. David and A. Suruliandi, "Survey on crime analysis and prediction using data mining techniques," ICTACT Journal on Soft Computing, vol. 7, no. 3, pp. 1459-1466, Jan. 2017.

[19] M. Sharma, "Z-CRIME: A data mining tool for the detection of suspicious criminal activities based on decision tree," in 2014 International Conference on Data Mining and Intelligent Computing (ICDMIC), 2014.

[20] E. Hamdy, A. Adl, A. E. Hassanien, O. Hegazy, and T. Kim, "Criminal act detection and identification model," in 2015 Seventh International Conference on Advanced Communication and Networking (ACN), 2015.

[21] J. Agarwal et al., "Crime analysis using k-means clustering," International Journal of Computer Applications, vol. 83, no. 4, pp. 1-4, 2013.

[22] M. S. Gerber, "Predicting crime using Twitter and kernel density estimation," Decision Support Systems, vol. 61, pp. 115-125, 2014.

[23] X. Chen, Y. Cho, and S. Y. Jang, "Crime prediction using Twitter sentiment and weather," in 2015 Systems and Information Engineering Design Symposium, 2015.

[24] Time, 17-Jan-2017. [Online]. Available: http://time.com/4635049/chicago-murder-rate-homicides/. [Accessed: 15-Jul-2018].

PAGE 82

74 [25] Twit Available: https://developer.twitter.com/en/developer terms/more on restricted use cases . [Accessed: 31 Mar 2018]. [ 26 ] Cs. waikato.ac.nz, 2018. [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf. [Accessed: 20 Jul 2018]. [27] "scikit learn: machine learning in Python scikit learn 0.19.2 documentation ", Scikit learn.org , 2018. [Online]. Available: http://scikit learn.org/stable/. [Accessed: 16 Jul 2018]. [28] M. Learning and A. codes), "A Comprehensive Guide to Ensemble Learning (with Python codes)", Analytics Vidhya, 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/2018 /06/comprehensive guide for ensemble models/. [Accessed: 24 Jul 2018].

PAGE 83

APPENDIX A

Tweet Downloader Code

# Importing the libraries we need
import time

import jsonpickle
import tweepy
from tweepy import AppAuthHandler
from tweepy.error import TweepError
from requests.exceptions import ConnectionError


# The main tweet-downloader class of our program
class TweetDownloader(object):

    # --- Authentication information (put your own) --- #
    consumer_key = '**************'
    consumer_secret = '**************'
    access_token = '**************'
    access_secret = '**************'
    # -------------------------------------------------- #

    # ------------- Our search variables --------------- #
    # The geocode for Chicago, IL at a radius of 15 miles
    geoLocation = "41.833584450000004,-87.67181069718896,15mi"
    # We leave the query empty because we want to collect all Chicago tweets;
    # a place filter such as "place:1d9a5370a355ab0c" could be used instead
    searchQuery = " "
    # 100 tweets per query is the maximum count the API permits, so
    # 100 tweets * 450 queries (the rate limit per 15-minute window)
    # = 45,000 tweets can be collected
    tweetsPerQ = 100
    # The maximum number of tweets we want to collect per file;
    # if items() is left empty, it is supposed to collect all
    # tweets up to 7 days old
    maxTweets = 2300
    # --------------------------------------------------- #

    def __init__(self, old=None):
        # A counter for how many tweets we have collected
        self.tweetsCount = 0
        self.oldest = old
        self.api = None
        self.tweet = None

    def auth_api(self):
        # Pass our application authentication information to Tweepy's AppAuthHandler
        auth = AppAuthHandler(self.consumer_key, self.consumer_secret)
        # Create a Twitter API wrapper using Tweepy;
        # wait_on_rate_limit will let us know when we reach the rate
        # limit and how much time is left
        self.api = tweepy.API(auth, retry_count=5, retry_delay=10,
                              retry_errors=set([401, 404, 408, 500, 503, 504]),
                              wait_on_rate_limit=True,
                              wait_on_rate_limit_notify=True)
        # Error handling
        if not self.api:
            print("Problem connecting to API")

    def print_tweet(self):
        # Display identifying information about the current tweet
        print("ID: ", self.tweet.id, "\nCreated at: ", self.tweet.created_at)

    def print_progress(self):
        # Display how many tweets we have collected
        print("\nDownloaded {0} tweets".format(self.tweetsCount))
        # Display some information about the tweet
        self.print_tweet()
        # Check how many queries are left using the rate_limit_status() method
        print("Remaining rate limit: ",
              self.api.rate_limit_status()['resources']['search']['/search/tweets']['remaining'])

    def process_tweet(self, outputFile):
        # Verify the tweet has specific place (geo) info before writing;
        # optionally, tweets whose place type is "neighborhood" could also be kept
        if self.tweet.geo is not None:
            # Write the tweet to our output file, one JSON object per line
            outputFile.write(jsonpickle.encode(self.tweet._json, unpicklable=False) + '\n')
            outputFile.flush()
            self.tweetsCount = self.tweetsCount + 1
            # Print information about our download progress
            self.print_progress()
            # Update the id of the oldest tweet, less one
            self.oldest = self.tweet.id - 1

    def collect_tweets(self, x):
        # Create the output file name
        outputFileName = "tweets Nov(" + str(x) + ").json"
        print("\n-----------------------------------------------"
              "\nDOWNLOADING TO FILE: ", outputFileName,
              "\n-----------------------------------------------\n\n")
        # Open a JSON text file to save the tweets to
        with open(output_path + outputFileName, 'w') as outputFile:
            while True:
                # On data:
                try:
                    # Collect the tweets with parameters set to our search criteria
                    # (an earlier run used until='2017-11-02' and
                    # since_id=927687033463758849, the id from the first
                    # tweets collected on Nov 6, last file)
                    if self.oldest is not None:
                        tweets = tweepy.Cursor(self.api.search, q=self.searchQuery,
                                               lang='en', since='2017-11-19',
                                               max_id=self.oldest,
                                               geocode=self.geoLocation,
                                               since_id=932421169491464192,
                                               count=self.tweetsPerQ).items()
                    else:
                        tweets = tweepy.Cursor(self.api.search, q=self.searchQuery,
                                               lang='en', since='2017-11-19',
                                               geocode=self.geoLocation,
                                               since_id=932421169491464192,
                                               count=self.tweetsPerQ).items()
                    # No more tweets were found, so no more will be collected
                    if not any(tweets):
                        print("No more tweets found\n")
                        break
                    # We have reached the limit of our search criteria
                    if self.tweetsCount > self.maxTweets:
                        print("Max tweets of {} reached\n".format(self.maxTweets))
                        break
                    # Process the tweets collected above
                    for self.tweet in tweets:
                        self.process_tweet(outputFile)
                # On error:
                except (ConnectionError, TweepError) as e:
                    print("\nERROR HAPPENED\n{0}\nTRYING TO RECONNECT...\n".format(e))
                    time.sleep(180)
                    self.auth_api()
        # On finishing:
        print("\n-----------------------------------------------")
        print("FINISHED DOWNLOADING TO FILE: {}.".format(outputFileName))
        print("Downloaded {} tweets".format(self.tweetsCount))
        print("-----------------------------------------------\n\n")


def main():
    global output_path
    output_path = "/home/alan/Desktop/ThesisGithub/Software/Output/Good Outputs :)/week 4 (Nov20-27)/"
    oldestID = 934548384635002881 - 1  # ID - 1
    oldestLIST = []
    # The i range changes depending on the output file names
    for i in range(25, 19, -1):
        myDownloader = TweetDownloader(oldestID)
        myDownloader.auth_api()
        myDownloader.collect_tweets(i)
        oldestID = myDownloader.oldest
        oldestLIST.append(oldestID + 1)
    print("\n\n**** { FINISHED DOWNLOADING ALL FILES } ****\n")
    print("Oldest IDs for each file: ", oldestLIST)


main()
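As a usage note for the downloader above: each output file contains one JSON-encoded tweet per line, so it can be loaded back into a table for the preprocessing stage. The following is a minimal sketch, assuming pandas; the file name is one of the names the downloader generates.

# A minimal sketch (assuming pandas) of reading back one of the downloaded
# files, which holds one JSON-encoded tweet per line. The file name follows
# the "tweets Nov(x).json" pattern used by the downloader above.
import pandas as pd

tweets = pd.read_json("tweets Nov(25).json", lines=True)
# The spatial metadata used later in the thesis lives in these columns
print(tweets[["id", "created_at", "geo", "place"]].head())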


APPENDIX B

Data Integration Code

import pandas as pd

path = "/home/alan/Desktop/ThesisGithub/Software/Data/Chicago/Cleaned Data/final preprocessing/"

# The tweet token features exported after the Weka processing step
arff = pd.read_csv(path + "after weka.csv")

# The collected geo-tagged tweets
tweets = pd.read_csv(path + "TWEETS.csv")
tweets = tweets.reset_index()
tweets = tweets.drop(['index'], axis=1)
tweets = tweets.drop(["Unnamed: 0"], axis=1)

# The preprocessed Chicago crimes dataset
crimes = pd.read_csv(path + "Chicago Crimes Dataset Processed4 drop na.csv")
crimes = crimes.reset_index()
crimes = crimes.drop(['index'], axis=1)
crimes = crimes.drop(["Unnamed: 0"], axis=1)


def masking(tokens, tweets, crimes, time, area):
    # Build the mask from the tweets' community number and apply it to the
    # token features, which are row-aligned with the tweets
    mask = tweets['Community Number'] == area
    comm1 = tokens[mask]
    mask = comm1['Times'] == time
    time1 = comm1[mask]
    # Select the crime records for the same community area and time bin
    mask = crimes['Community Area'] == area
    comm2 = crimes[mask]
    mask = comm2['Times Number'] == time
    time2 = comm2[mask]
    return time1, time2


full = pd.DataFrame()
for area in range(1, 78):
    for time in range(1, 5):
        area2 = float(area)
        t1, t2 = masking(arff, tweets, crimes, time, area2)
        # Trim the tweet-feature rows to match the number of crime records
        if len(t1) > len(t2):
            t1 = t1[:len(t2)]
        t1 = t1.reset_index()
        t1 = t1.drop(['index'], axis=1)
        # Use the row position as a join key
        t1['helper'] = t1.index
        t2 = t2.reset_index()
        t2 = t2.drop(['index'], axis=1)
        print("Area = ", area, " Time bin = ", time)
        print("t1 = ", len(t1), " t2 = ", len(t2))
        t2['helper'] = t2.index
        # Merge the tweet features and crime records row by row
        tt = pd.merge(t1, t2, on='helper')
        print("t1 + t2 =", len(tt))
        full = full.append(tt)
        print("full=", len(full))

full.to_csv(path + "full_crimes_tweets.csv")