Using the Pushshift API to Collect Gab Data
URL: https://gab.pushshift.io/
Introduction
This document details the use of the Pushshift API for working with Gab data. Gab recently moved to a new platform built on the Mastodon framework. Pushshift now ingests all new content posted to Gab and is working backwards to update earlier posts. The index is near real-time and updates approximately every 30 seconds. Most fields in the new JSON structure are included in the Elasticsearch mapping, so they can be searched, filtered, and aggregated.
Getting Started
There are three primary methods for querying data: a URI-based search (easily used from a web browser), a GET request with a data body, and a hybrid request in which the data body is passed as a JSON string via a source parameter. Below are basic examples of the same request made with each of the three methods.
Objective: Fetch up to 100 posts with the search term “Trump” and sort by most recent:
URI Search:
This method is the most straightforward and can be done from any web browser. The query for this method is:
https://gab.pushshift.io/search/?q=body:trump&sort=created_at:desc&size=100
GET Request with Data Body:
The GET request method uses a JSON string and is generally used by scripts or programming languages such as Python. The request looks like this:
curl -H "Content-Type: application/json" -XGET https://gab.pushshift.io/search -d '{"query":{"match":{"body":"trump"}},"size":100,"sort":{"created_at":"desc"}}'
There is also a third “hybrid” method, which makes any query that is possible with a data-body GET request available through a URI search. It relies on two parameters: “source” and “source_content_type”.
URI Hybrid Request:
This method is similar to the first type but also uses a JSON string similar to the second type. The JSON string is passed as the value for the “source” parameter. The request looks like this:
https://gab.pushshift.io/search/?source_content_type=application/json&source={"query":{"match":{"body":"trump"}},"size":100,"sort":{"created_at":"desc"}}
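When the hybrid URL is built from a script, the JSON string can be generated and percent-encoded rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_hybrid_url` is hypothetical and not part of the Pushshift API.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://gab.pushshift.io/search/"

def build_hybrid_url(body, base_url=BASE_URL):
    """Return a URI search URL carrying `body` as a JSON string in `source`."""
    params = {
        "source_content_type": "application/json",
        # Compact separators keep the URL short before it is percent-encoded.
        "source": json.dumps(body, separators=(",", ":")),
    }
    return base_url + "?" + urlencode(params)

query = {
    "query": {"match": {"body": "trump"}},
    "size": 100,
    "sort": {"created_at": "desc"},
}
url = build_hybrid_url(query)
print(url)
```

The printed URL is fully percent-encoded, so it is longer and less readable than the hand-written form above, but it can be passed to any HTTP client without further escaping.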
The type of request to use depends on how complicated the query is and how the request is made (from a script or manually in a web browser). For simple searches, a URI search request is usually the easiest, but that method does not support advanced query options such as aggregations.
For most of the following example queries, the URI search method is used where possible. If the query is too complicated for a URI search, the hybrid method is used.
Basic Python 3 Script Example:
import json
import sys

import requests

def query_es(term):
    """Fetch up to 100 of the most recent posts matching term."""
    base_url = 'https://gab.pushshift.io/search'
    headers = {'Content-Type': 'application/json'}
    # Request body in Elasticsearch query DSL
    q = {
        'query': {'match': {'body': term}},
        'sort': {'created_at': 'desc'},
        'size': 100,
    }
    # The body must be serialized to a JSON string before it is sent
    r = requests.get(base_url, headers=headers, data=json.dumps(q))
    if r.status_code == 200:
        return r.json()
    else:
        # Exit with the server's error message
        sys.exit(r.text)

result = query_es('president|trump')
if 'hits' in result:
    # Each hit wraps the post object in its _source field
    for obj in result['hits']['hits']:
        post = obj['_source']
        print(post)
Learning by Examples:
This section will show many examples of how to make different types of queries. The URLs below do not have URL encodings for characters such as spaces to help make them easier to read. Generally they can still be copied and pasted into a browser without the encodings.
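Scripts and command-line tools are often stricter than browsers about unencoded characters. As a minimal sketch, assuming only the Python standard library, a readable URL from this section could be percent-encoded as follows; the helper name `encode_search_url` and the choice of characters left unencoded are illustrative assumptions.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_search_url(readable_url):
    """Percent-encode the query part of a human-readable search URL."""
    parts = urlsplit(readable_url)
    # Leave the characters that structure the query untouched; spaces,
    # double quotes, and braces are percent-encoded.
    encoded_query = quote(parts.query, safe="=&:/()[]|,")
    return urlunsplit(parts._replace(query=encoded_query))

readable = ("https://gab.pushshift.io/search/"
            "?q=body:iran AND account.acct.keyword:a&sort=created_at:desc")
print(encode_search_url(readable))
```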
Get the latest 1,000 posts:
https://gab.pushshift.io/search/?sort=created_at:desc&size=1000
Get the latest 100 posts that mention “president” or “trump”:
https://gab.pushshift.io/search/?q=body:(president|trump)&sort=created_at:desc&size=100
Get the latest 100 posts from a specific account that mention a term:
https://gab.pushshift.io/search/?q=body:iran AND account.acct.keyword:a&sort=created_at:desc&size=100
Create an aggregation showing the top authors for posts mentioning trump (previous 24 hours):
https://gab.pushshift.io/search/?source_content_type=application/json&source={"aggs":{"author":{"terms":{"field":"account.acct.keyword","size":100}}}}&size=0&q=created_at:[now-25h TO now] AND body:trump
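The result of a terms aggregation like this one still has to be unpacked by the caller. The sketch below assumes the endpoint returns the standard Elasticsearch response layout, in which the buckets sit under `aggregations.<name>.buckets` with `key` and `doc_count` entries. The helper name `top_authors` is hypothetical, and the sample response is a hand-written placeholder, not real Gab data.

```python
def top_authors(response, agg_name="author"):
    """Return (account, post_count) pairs from a terms aggregation response."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]

# Hand-written placeholder response, not real Gab data.
sample_response = {
    "aggregations": {
        "author": {
            "buckets": [
                {"key": "example_user_1", "doc_count": 12},
                {"key": "example_user_2", "doc_count": 7},
            ]
        }
    }
}

for account, count in top_authors(sample_response):
    print(account, count)
```

In practice the dictionary would come from the JSON body of the response to the request above rather than from a literal.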
Create a time histogram aggregation for posts mentioning president or trump (previous 24 hours). In this example, a 25-hour range is used so that the 24th hourly window is complete.
https://gab.pushshift.io/search/?source_content_type=application/json&source={"aggs":{"time":{"date_histogram":{"field":"created_at","interval":"hour"}}}}&size=0&q=created_at:[now-25h TO now] AND body:(president|trump)
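Because the query window rarely starts on an hour boundary, the first and last hourly buckets usually cover only part of an hour. The following sketch shows one way to keep only the complete buckets. It assumes the standard Elasticsearch date histogram layout, in which each bucket reports its start time as epoch milliseconds in `key`. The helper name `complete_hourly_buckets` is hypothetical, and the sample buckets and counts are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def complete_hourly_buckets(buckets, window_start, window_end):
    """Keep only hourly buckets that lie entirely inside the query window."""
    one_hour = timedelta(hours=1)
    kept = []
    for bucket in buckets:
        # Assumed layout: the bucket start is epoch milliseconds in "key".
        start = datetime.fromtimestamp(bucket["key"] / 1000, tz=timezone.utc)
        if start >= window_start and start + one_hour <= window_end:
            kept.append((start, bucket["doc_count"]))
    return kept

# Placeholder buckets for a window from 10:30 to 13:30 UTC; counts are made up.
window_start = datetime(2020, 1, 1, 10, 30, tzinfo=timezone.utc)
window_end = datetime(2020, 1, 1, 13, 30, tzinfo=timezone.utc)
first_key = int(datetime(2020, 1, 1, 10, 0, tzinfo=timezone.utc).timestamp() * 1000)
sample_buckets = [
    {"key": first_key + i * 3_600_000, "doc_count": count}
    for i, count in enumerate([3, 8, 5, 2])
]

for start, count in complete_hourly_buckets(sample_buckets, window_start, window_end):
    print(start.isoformat(), count)
```

With the placeholder window, the 10:00 and 13:00 buckets are dropped because each covers only half an hour of the window.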
Gab Post Object Fields:
Below is a complete listing of all known Post fields. Fields that are searchable via the API are highlighted in light green.
Field Name | Type | Description |
id | Integer | ID of the post |
body | String | Text body of post |
created_at | Date | Creation time of post |
retrieved_utc | Epoch | Time post was first ingested |
updated_utc | Epoch | Time post was last updated |
reblogs_count | Integer | Number of reblogs |
replies_count | Integer | Number of replies |
in_reply_to_account_id | Integer | User ID of the post replied to |
in_reply_to_id | Integer | ID of the post replied to |
favourites_count | Integer | Number of favourites for the post |
sensitive | Boolean | If the post is sensitive |
visibility | Boolean | If the post is visible |
content | String | HTML version of body field |
language | String | Language detected for post |
media_attachments | Object | Media attachments for post |
Account Object | | |
account.acct | String | Account name that made the post |
account.avatar | String | Link to account avatar |
account.header | String | Header graphic for the account |
account.note | String | Description for the account (HTML) |
account.note_text | String | Description for the account (Text) |
account.created_at | Date | Creation Date of the account |
account.id | Long | ID for the account |
account.url | String | URL link for the account |
account.is_pro | Boolean | Whether the account has PRO status |
account.is_verified | Boolean | Is the account verified |
account.bot | Boolean | Is the account a known bot |
account.locked | Boolean | Is the account locked |
account.display_name | String | Display Name for the account |
account.followers_count | Integer | Number of followers for the account |
account.following_count | Integer | Number of accounts being followed |
account.statuses_count | Integer | Number of posts made by account |
account.username | String | Username of the account |
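The dotted names in the table suggest that the account fields are nested inside each post under an `account` key. Under that assumption, a small helper can read any listed field by its dotted name. The sketch below is illustrative only: `get_field` is a hypothetical helper and the sample post is a hand-written placeholder, not real Gab data.

```python
def get_field(post, dotted_name, default=None):
    """Look up a dotted field name such as 'account.acct' in a nested post."""
    value = post
    for part in dotted_name.split("."):
        # Stop early if the path leaves the nested dictionaries.
        if not isinstance(value, dict) or part not in value:
            return default
        value = value[part]
    return value

# Hand-written placeholder post; the values are not real Gab data.
sample_post = {
    "id": 1,
    "body": "example text",
    "account": {"acct": "example_user", "followers_count": 10},
}

print(get_field(sample_post, "account.acct"))
print(get_field(sample_post, "account.is_pro", default=False))
```

Returning a default for missing fields avoids a KeyError when a field is absent from a given post.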
Elasticsearch Gab Mapping with Settings:
https://gab.pushshift.io/ (Will show mapping with settings)