
pet_forecast: time-series forecasting of online petitions using Bayes

(for a description of the code, see the bottom)

Data & Exploratory Analysis

The data was gathered between July 2007 and April 2015 from the website epetitions.direct.gov.uk using an automated script. Apart from a steep initial increase, the number of created petitions shows a steady growth rate over the observed period. The site was maintained by the government until 30 March 2015 and subsequently replaced by a newer one (petition.parliament.uk). In total, our data set includes 60,950 online petitions, of which 255 attracted more than 10,000 signatures and 41 more than 100,000 signatures. Altogether, 15,201,511 signatures were collected.


The data and motivation come from research conducted at the Oxford Internet Institute, specifically Yasseri et al. and Hale et al. Note: I had access to the data while working on this project, and the figures were created at that time.

Signatures

The time-series data contains signatures for all petitions at an hourly resolution. Additionally, we have each petition's main text, its author, the corresponding government department, and its opening and closing dates. As petitions are usually open for one year, we have approximately 365 × 24 = 8,760 data points per petition time-series. In the distribution plot of total signatures below, we can confirm that the distribution follows a multi-scale power-law. We were also able to verify, on a much larger data set than in Yasseri et al., that the signature curves change shape at both critical values of 10,000 and 100,000 signatures. However, the change at 100,000 is far more drastic, meaning that petitions rarely grow much further after reaching this mark. This is because the goal of any online petition is to reach 100,000 signatures, the threshold that prompts a parliamentary debate.
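
As a rough illustration of this check (not the original analysis code; the file and column names below are hypothetical), the distribution can be inspected by plotting the empirical complementary CDF of total signatures on log-log axes, where power-law behaviour shows up as an approximately straight line and the two thresholds as kinks:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per petition with its final signature count
# (file name and column name are placeholders, not the original data format).
totals = pd.read_csv("petition_totals.csv")["total_signatures"].to_numpy()

# Empirical complementary CDF: fraction of petitions with at least x signatures.
x = np.sort(totals)
ccdf = 1.0 - np.arange(len(x)) / len(x)

plt.loglog(x, ccdf, marker=".", linestyle="none")
plt.axvline(1e4, color="grey", linestyle="--", label="10,000 signatures")
plt.axvline(1e5, color="black", linestyle="--", label="100,000 signatures")
plt.xlabel("total signatures per petition")
plt.ylabel("fraction of petitions with at least x signatures")
plt.legend()
plt.show()
```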

Petition text

Before a petition is launched, there is no information about whether it is going to be successful. Moreover, as the shapes of the cumulative signature curves vary greatly, it is even harder to tell what the value will be at a given time. To support our adaptive forecasting algorithm, introduced further below, we aim to pre-classify petitions based solely on a priori information. Specifically, we assume that most information about a petition can be extracted from its text description and that 'similar' petitions tend to follow similar curves during the observation period. Hence, we categorise a petition based on its description text and subsequently run our prediction algorithm with parameters trained on petitions of that category.

More concretely, we tokenize the input text and use the bag-of-words model, in which a text simply corresponds to a collection of words. For simplicity, we only include unigrams (1-grams) and exclude higher-order $n$-grams, i.e. terms consisting of $n$ words in a particular order. Each text description is assigned a sparse vector whose rows correspond to the different words occurring across all descriptions. This vector is cleaned of common English stop words such as 'the' or 'and', as well as words common in any petition such as 'petition' or 'signature'. Next, we apply the tf-idf statistic to these high-dimensional vectors, a frequently used weighting scheme in information retrieval.
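
A minimal sketch of this vectorisation step, assuming scikit-learn-style tooling (the example descriptions and the extra stop words are illustrative; in practice the full set of ~61,000 description texts is used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Placeholder petition description strings.
descriptions = [
    "We the undersigned petition the government to fund free school meals",
    "Petition for a second referendum on EU membership",
    "Stop the proposed changes to disability benefits",
]

# English stop words plus terms common to virtually every petition.
stop_words = list(ENGLISH_STOP_WORDS) + ["petition", "signature", "signatures"]

# Unigram bag-of-words weighted by tf-idf; yields a sparse document-term matrix.
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=stop_words)
X = vectorizer.fit_transform(descriptions)  # shape: (n_petitions, n_terms)
print(X.shape, vectorizer.get_feature_names_out()[:10])
```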

(Figure: 2-D PCA projection of the tf-idf vectors, coloured by cluster.)

We then cluster these tf-idf vectors with $K$-means. Using $K=4$, we obtain three quite distinct clusters with silhouette scores above zero and are left with one cluster in the middle. However, the 'curse' of high-dimensional space (Bellman) results in very large overall distances (and hence low silhouette scores), and the PCA projection may hide important features of the data's topology. Hence, we check the detected clusters by creating word clouds (pictures of words scaled according to their tf-idf score) of the centroid vectors.
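
A sketch of this clustering and diagnostic step, again assuming scikit-learn and reusing the tf-idf matrix `X` from the sketch above (the original implementation may differ):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

K = 4
kmeans = KMeans(n_clusters=K, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)  # X: sparse tf-idf matrix from above

# Mean silhouette score; values barely above zero reflect the large pairwise
# distances typical of very high-dimensional tf-idf space.
print("mean silhouette:", silhouette_score(X, labels))

# 2-D PCA projection purely for visualisation; it can hide structure present
# in the full-dimensional data.
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("tf-idf vectors, K-means clusters (K = 4)")
plt.show()
```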

From the word-cloud figures we can confirm that clusters 0, 1 and 3 represent distinct topics, namely children/education, taxes/benefits and the Scottish/EU referendums, while cluster 2 seems to contain rather unspecific terms, as could already be presumed from the PCA plot above.
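
The word clouds themselves can be produced directly from the centroid tf-idf weights; below is a sketch using the third-party `wordcloud` package (an assumed choice, not necessarily what the original code used), reusing `vectorizer` and `kmeans` from the sketches above:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

terms = vectorizer.get_feature_names_out()  # vocabulary from the tf-idf step

fig, axes = plt.subplots(1, K, figsize=(4 * K, 3))
for k, ax in enumerate(axes):
    # Each centroid is a tf-idf-weighted term vector; scale words by that weight.
    weights = {t: w for t, w in zip(terms, kmeans.cluster_centers_[k]) if w > 0}
    cloud = WordCloud(background_color="white").generate_from_frequencies(weights)
    ax.imshow(cloud, interpolation="bilinear")
    ax.set_title(f"cluster {k}")
    ax.axis("off")
plt.show()
```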

Example petition time-series


Twitter Users


Model

Results


Code