Using the Pushshift API to Collect Gab Data

URL: https://gab.pushshift.io/

Introduction

This document describes how to use the Pushshift API to work with Gab data. Gab recently moved to a new platform built on the Mastodon framework. Pushshift now ingests all new content posted to Gab and is also working backwards to update earlier posts. The index is near real-time and refreshes approximately every 30 seconds. Most fields in the new JSON structure are included in the Elasticsearch mapping, so they can be searched, filtered, and aggregated.

Getting Started

There are three primary methods for querying data: a URI-based search (easily used from a web browser), a GET request with a data body, and a hybrid request in which the data body is passed as a JSON string via a source parameter. Below are examples of a very basic request made with each of the three methods.

Objective: Fetch up to 100 posts that contain the search term “Trump”, sorted by most recent:

URI Search:

This method is the most straightforward and can be done from any web browser. The query for this method is:

https://gab.pushshift.io/search/?q=body:trump&sort=created_at:desc&size=100
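The same URI can also be assembled in a script. The following is a minimal sketch that uses only the Python standard library; the helper name build_search_url is hypothetical and is not part of the Pushshift API.

```python
from urllib.parse import urlencode

BASE_URL = "https://gab.pushshift.io/search/"

def build_search_url(query, sort="created_at:desc", size=100):
    """Return a URI-search URL for the given query string (hypothetical helper)."""
    params = {"q": query, "sort": sort, "size": size}
    # urlencode percent-encodes reserved characters such as ':' and spaces
    return BASE_URL + "?" + urlencode(params)

url = build_search_url("body:trump")
print(url)
```

The encoded form is equivalent to the readable URL shown above; browsers apply the same kind of encoding automatically.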

GET Request with Data Body:

The GET request method sends the query as a JSON string and is generally used from scripts written in programming languages such as Python. The request looks like this:

curl -H "Content-Type: application/json" -XGET https://gab.pushshift.io/search -d '{"query":{"match":{"body":"trump"}},"size":100,"sort":{"created_at":"desc"}}'

There is also a third “hybrid” method that makes any query possible with a GET request body available through a URI search, using two parameters: “source” and “source_content_type”.

URI Hybrid Request:

This method is similar to the first type but also uses a JSON string similar to the second type. The JSON string is passed as the value for the “source” parameter. The request looks like this:

https://gab.pushshift.io/search/?source_content_type=application/json&source={"query":{"match":{"body":"trump"}},"size":100,"sort":{"created_at":"desc"}}
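When the hybrid URL is built in a script, the JSON string has to be serialized and percent-encoded. The following is a minimal sketch using only the Python standard library; the helper name build_hybrid_url is hypothetical.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://gab.pushshift.io/search/"

def build_hybrid_url(body):
    """Serialize a query body into the "source" parameter (hypothetical helper)."""
    params = {
        "source_content_type": "application/json",
        # Compact separators keep the resulting URL short
        "source": json.dumps(body, separators=(",", ":")),
    }
    return BASE_URL + "?" + urlencode(params)

body = {
    "query": {"match": {"body": "trump"}},
    "size": 100,
    "sort": {"created_at": "desc"},
}
url = build_hybrid_url(body)

# Round-trip check: decoding the URL yields the original query body
decoded = json.loads(parse_qs(urlsplit(url).query)["source"][0])
print(decoded == body)
```

The round-trip check is only a local sanity test of the encoding; it does not contact the API.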

The type of request to use depends on how complicated the query is and how the request is made (from a script or manually in a web browser). For simple searches, a URI search request is usually the easiest, but that method does not support advanced query options such as aggregations.

For most of the following example queries, the URI search method is used where possible. If the query is too complicated for a URI search, the hybrid method is used.

Basic Python 3 Script Example:

import json
import sys

import requests

def query_es(term):
    """Return the parsed JSON response for posts whose body matches term."""
    base_url = 'https://gab.pushshift.io/search'
    headers = {'Content-Type': 'application/json'}
    # Build the Elasticsearch query body as a plain dict
    q = {
        'query': {'match': {'body': term}},
        'sort': {'created_at': 'desc'},
        'size': 100,
    }
    # Serialize to JSON; passing the dict directly would form-encode it
    r = requests.get(base_url, headers=headers, data=json.dumps(q))
    if r.status_code == 200:
        return r.json()
    # Print the error body to stderr and exit with a non-zero status
    sys.exit(r.text)

result = query_es('president|trump')
if 'hits' in result:
    for obj in result['hits']['hits']:
        post = obj['_source']
        print(post)
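The loop above prints each post in full. The following is a minimal sketch of pulling out a few fields instead. It assumes that the account fields listed later in this document (for example account.acct) are nested under an "account" key inside each post; the function name summarize_posts and the sample data are hypothetical.

```python
def summarize_posts(result):
    """Yield (id, created_at, account name, body) for each hit in a response."""
    for hit in result.get("hits", {}).get("hits", []):
        post = hit["_source"]
        # .get() guards against fields that are absent from a post
        account = post.get("account", {})
        yield (
            post.get("id"),
            post.get("created_at"),
            account.get("acct"),
            post.get("body"),
        )

# Hand-written sample that mimics the response shape; not real Gab data
sample = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "id": 1,
                    "created_at": "2019-07-04T12:00:00.000Z",
                    "account": {"acct": "example_user"},
                    "body": "hello",
                }
            }
        ]
    }
}
rows = list(summarize_posts(sample))
print(rows)
```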

Learning by Examples:

This section shows examples of how to make different types of queries. To keep them easy to read, the URLs below are not URL-encoded for characters such as spaces. In general, they can still be copied and pasted into a browser as they are.

Get the latest 1,000 posts:

https://gab.pushshift.io/search/?sort=created_at:desc&size=1000

Get the latest 100 posts that mention “president” or “trump”:

https://gab.pushshift.io/search/?q=body:(president|trump)&sort=created_at:desc&size=100

Get the latest 100 posts from a specific account that mention a term:

https://gab.pushshift.io/search/?q=body:iran AND account.acct.keyword:a&sort=created_at:desc&size=100
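For scripts that send a data body rather than a URI search, a similar request can be expressed with a standard Elasticsearch bool query. The following is a minimal sketch; the function name account_term_query is hypothetical, and the body has not been verified against gab.pushshift.io.

```python
import json

def account_term_query(term, account, size=100):
    """Build a query body for posts by one account that mention a term."""
    return {
        "query": {
            "bool": {
                # Full-text match on the post body
                "must": [{"match": {"body": term}}],
                # Exact match on the account name via the keyword sub-field
                "filter": [{"term": {"account.acct.keyword": account}}],
            }
        },
        "sort": {"created_at": "desc"},
        "size": size,
    }

body = account_term_query("iran", "a")
print(json.dumps(body))
```

The resulting JSON string can be sent as the data body of a GET request or passed through the “source” parameter of a hybrid request.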

Create an aggregation showing the top authors for posts mentioning trump (previous 24 hours):

https://gab.pushshift.io/search/?source_content_type=application/json&source={"aggs":{"author":{"terms":{"field":"account.acct.keyword","size":100}}}}&size=0&q=created_at:[now-25h TO now] AND body:trump
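The following is a minimal sketch of reading the result of this aggregation in Python. It assumes the response follows the standard Elasticsearch layout, with buckets under aggregations.author.buckets, each holding a "key" and a "doc_count"; the function name top_authors and the sample data are hypothetical.

```python
def top_authors(response, agg_name="author"):
    """Return (account name, post count) pairs from a terms aggregation."""
    buckets = (
        response.get("aggregations", {})
        .get(agg_name, {})
        .get("buckets", [])
    )
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]

# Hand-written sample in the assumed response shape; not real Gab data
sample = {
    "aggregations": {
        "author": {
            "buckets": [
                {"key": "user_a", "doc_count": 12},
                {"key": "user_b", "doc_count": 7},
            ]
        }
    }
}
authors = top_authors(sample)
print(authors)
```

The agg_name argument must match the name given to the aggregation in the request ("author" in the URL above).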

Create a time histogram aggregation for posts mentioning president or trump (previous 24 hours). In this example, a 25-hour range is used so that the 24th hourly window is complete.

https://gab.pushshift.io/search/?source_content_type=application/json&source={"aggs":{"time":{"date_histogram":{"field":"created_at","interval":"hour"}}}}&size=0&q=created_at:[now-25h TO now] AND body:(president|trump)
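The following is a minimal sketch of turning the histogram result into timestamped counts. It assumes the standard Elasticsearch response layout, in which each bucket under aggregations.time.buckets carries a "key" in epoch milliseconds and a "doc_count"; the function name hourly_counts and the sample data are hypothetical.

```python
from datetime import datetime, timezone

def hourly_counts(response, agg_name="time"):
    """Return (UTC datetime, post count) pairs from a date_histogram."""
    buckets = (
        response.get("aggregations", {})
        .get(agg_name, {})
        .get("buckets", [])
    )
    return [
        # Bucket keys are assumed to be epoch milliseconds
        (datetime.fromtimestamp(b["key"] / 1000, tz=timezone.utc), b["doc_count"])
        for b in buckets
    ]

# Hand-written sample in the assumed response shape; not real Gab data
sample = {
    "aggregations": {
        "time": {
            "buckets": [
                {"key": 1562241600000, "doc_count": 40},
                {"key": 1562245200000, "doc_count": 55},
            ]
        }
    }
}
counts = hourly_counts(sample)
for when, count in counts:
    print(when.isoformat(), count)
```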

Gab Post Object Fields:

Below is a complete listing of all known Post fields. Fields that are searchable via the API are highlighted in light green.

Field Name                Type     Description

id                        Integer  ID of the post
body                      String   Text body of the post
created_at                Date     Creation time of the post
retrieved_utc             Epoch    Time the post was first ingested
updated_utc               Epoch    Time the post was last updated
reblogs_count             Integer  Number of reblogs
replies_count             String   Number of replies
in_reply_to_account_id    Integer  User ID of the post replied to
in_reply_to_id            Integer  ID of the post replied to
favourites_count          Integer  Number of favourites for the post
sensitive                 Boolean  Whether the post is sensitive
visibility                Boolean  Whether the post is visible
content                   String   HTML version of the body field
language                  String   Language detected for the post
media_attachments         Object   Media attachments for the post

Account Object

account.acct              String   Account name that made the post
account.avatar            String   Link to the account avatar
account.header            String   Header graphic for the account
account.note              String   Description for the account (HTML)
account.note_text         String   Description for the account (text)
account.created_at        Date     Creation date of the account
account.id                Long     ID for the account
account.url               String   URL link for the account
account.is_pro            Boolean  Whether the account has PRO status
account.is_verified       Boolean  Whether the account is verified
account.bot               Boolean  Whether the account is a known bot
account.locked            Boolean  Whether the account is locked
account.display_name      String   Display name for the account
account.followers_count   Integer  Number of followers for the account
account.following_count   Integer  Number of accounts being followed
account.statuses_count    Integer  Number of posts made by the account
account.username          String   Username of the account

Elasticsearch Gab Mapping with Settings:

https://gab.pushshift.io/  (Will show mapping with settings)