
 APS 360: Artificial Neural Networks in Detecting Photo Manipulation

 Final Report

Group 18

Chris Palumbo - 1003019048

Ita Zaporozhets - 1003196864

Merih Atasoy - 1003285170

Word Count: 1953 (Body) + 43 (Appendix) = 1996

Penalty = 0%

1.0 Goal and Motivation

With photo-altering tools readily available to the public, it is becoming harder to distinguish real images from fake ones. People create altered photos that go on to be seen (and ultimately believed) by internet users. Photo-manipulation scandals can be dangerous, especially given the popularity of social media platforms. The issue falls under the popular term “fake news”, and Photoshop is among the biggest contributors. Recent scandals range from forged political campaign material [1] to fake UFO sightings [2].

Past methods for spotting image manipulation rely on metadata analysis, error-level analysis, and filtering [3]. However, these approaches require a human to analyze the results and make the final prediction. More recent research has turned towards artificial intelligence to aid in this decision-making process [3]. Hence, the goal of this project was to develop a neural network model that predicts whether an image has been doctored.

2.0 Data 

The data used is the public Reddit Photoshop Battle dataset, which originates from a contest in which Reddit users post manipulated images. Members of the University of Basel summarized the data statistics in a report and provided a GitHub repository for downloading the labelled photos [4]. The dataset covers diverse photoshop techniques, such as splicing (combining parts of 2 different images) and copy-move (rearranging parts of the same image). Furthermore, some photos are deliberately made to look “obviously photoshopped” for humour.

2.1 Data Collection

The entire 40GB dataset comprises approximately 11,000 original and over 100,000 photoshopped images. Photo heights range from 136 to 20,000 pixels, as shown by the histogram in Appendix A. The data was downloaded using a script published on GitHub [4]. Due to resource constraints, only about half was obtained. The data is labelled well: the original images are named with a unique code, and the photoshopped versions include the original image code and the derivative number.

2.2 Data Cleaning

Because the full set of images is an overwhelming amount of data for the available computational resources, it was necessary to reduce the amount of image data used for training. Resizing options included padding, scaling, and random cropping. Scaling changes the noise level of an image, which can hurt training because the model is expected to learn quality discrepancies (such as noise) between different parts of a photoshopped image. Cropping was chosen over padding because a large padded border relies on the model learning that the padded region has nothing to do with the classification problem [5]. Moreover, padding grows all images to the maximum size, whereas cropping shrinks the input to the minimum. Therefore, center cropping was implemented to decrease input size and computation time.

A set of images in the most frequent size range, 600-1100 pixels, was selected to minimize the amount of cropping. Each image was center-cropped to 600 by 600 pixels. To avoid negative subsampling, only 1 photoshopped version of each original was used. Each training batch contained both the original and the photoshopped version of an image, so that the photoshopped components could be compared against the ground truth. The two versions were concatenated on disk and split within the training loop. As anticipated, the photoshopped features were cropped out of some images; these images were manually removed.
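The center crop described above comes down to a small piece of arithmetic. The sketch below is illustrative only: the function name `center_crop_box` is hypothetical, and it simply returns a (left, upper, right, lower) box of the kind accepted by Pillow's `Image.crop`.

```python
def center_crop_box(width, height, size=600):
    """Return the (left, upper, right, lower) box of a centered size x size crop."""
    if width < size or height < size:
        raise ValueError("image is smaller than the requested crop")
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)


# Example: a 1000 x 800 image keeps its central 600 x 600 region.
box = center_crop_box(1000, 800)
print(box)  # (200, 100, 800, 700)
```

Any manipulated region that lies entirely outside this box is lost, which is why some cropped pairs had to be removed by hand.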

3.0 Overall Software Structure

A comprehensive diagram of the software structure can be seen below in Figure 1. The structure can be summarized as follows:

Figure 1. Overall Software Structure

3.1 Preprocessing: Filters

The literature suggests that applying filters to the images can help bring out hidden pixel information and therefore improve performance. Two filters were implemented: a high-pass (HP) filter which sharpens an image to highlight noisiness, and a low-pass (LP) filter which reveals edges [6][7]. A comparison of an unfiltered, HP-filtered, and LP-filtered image can be seen below in Figures 2, 3, and 4, respectively.



Figure 2. Unfiltered Sample             Figure 3. HP-Filtered Sample      Figure 4. LP-Filtered Sample
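As a rough illustration of how such filters operate, the sketch below applies a 3x3 kernel to a small grayscale image. The report does not state which kernels were used, so `HP_KERNEL` (a common sharpening kernel) and `LP_KERNEL` (a box average) are assumptions, and `apply_kernel` is a hypothetical helper rather than the team's code.

```python
# Assumed kernels; the report does not list the ones actually used.
HP_KERNEL = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]  # common sharpening kernel
LP_KERNEL = [[1 / 9] * 3 for _ in range(3)]         # 3x3 box average


def apply_kernel(image, kernel):
    """Filter a 2D grayscale image (list of rows) with a 3x3 kernel.

    Border pixels are copied unchanged to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    out = [list(row) for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3)
                for i in range(3)
            )
    return out


# A single bright pixel on a dark background.
spike = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
sharpened = apply_kernel(spike, HP_KERNEL)  # centre value is amplified
blurred = apply_kernel(spike, LP_KERNEL)    # centre value is averaged with its neighbours
```

Both kernels sum to 1, so a perfectly flat region passes through unchanged and only local variation is amplified or smoothed.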

3.2 Preprocessing: ResNet-18

Publicly available models such as ResNet, VGGNet, and AlexNet have been used for image feature extraction [10]. Each comes with pretrained parameters along with additional fully-connected classifier layers that have been used for image classification problems. The team recognized the similarities between these problems and photoshop detection, and decided to use a pretrained ResNet-18 model, as ResNet was the standout model [11]. As will be discussed in the following section, this was a key step in beating the base model.

ResNet-18 is a convolutional architecture that uses kernels of various sizes and pooling. A detailed image of the architecture is shown in Figure 5 and includes the following key components: the first 2D convolution layer uses a 7x7 kernel, while the next 16 layers use 3x3 kernels with various stride and padding combinations. Each layer uses a ReLU activation function along with 2x2 max-pooling. Finally, the output of the model is a [512x13x13] tensor.

Figure 5. ResNet18 Architecture [12]

For efficient computation, every image in all sets was forward passed through ResNet once, and the resulting features were saved to disk to be readily available.

3.3 Classification Model

Instead of using ResNet’s fully-connected layer, the team customized new layers for classification. This is because ResNet is trained to classify the subject in a photo, which is not the goal of this model.

Using a grid-search described in Appendix B, a model consisting of 5 fully-connected layers with 0.1 dropout was selected. Figure 6 shows the model diagram including layer size.

Figure 6. Final Classification Model
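A minimal PyTorch sketch of such a classifier is shown below. The layer sizes in `hidden` are hypothetical, the class name `PhotoshopClassifier` is invented for illustration, and the two-logit output is an assumption chosen to pair with a cross-entropy loss.

```python
import torch
from torch import nn


class PhotoshopClassifier(nn.Module):
    """Five fully-connected layers with dropout on flattened ResNet features."""

    def __init__(self, in_features=512 * 13 * 13, hidden=(128, 64, 32, 16), p_drop=0.1):
        super().__init__()
        layers = []
        sizes = (in_features, *hidden)  # hidden sizes are placeholders
        for n_in, n_out in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(n_in, n_out), nn.ReLU(), nn.Dropout(p_drop)]
        # Assumed output: two logits, one per class.
        layers.append(nn.Linear(sizes[-1], 2))
        self.net = nn.Sequential(*layers)

    def forward(self, features):
        # Flatten [batch, 512, 13, 13] to [batch, 512 * 13 * 13].
        return self.net(features.flatten(start_dim=1))


model = PhotoshopClassifier()
logits = model(torch.randn(4, 512, 13, 13))
print(logits.shape)  # torch.Size([4, 2])
```

Because the ResNet features are precomputed, only these fully-connected layers need to be trained.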

4.0 Training, Validation, and Test

Although research into using ANNs to identify altered photos is increasing, the team was unable to find published studies on similar datasets with reported accuracies. As a base model, a simple CNN structure was implemented first. The base model was created with 2 convolutional layers and 1 fully-connected layer; however, it overfit immediately and was unable to reach a validation accuracy above 53%. This became the benchmark for the ResNet model.

4.1 Training

Training was conducted on Google Colab for faster computation. The dataset was copied to create 3 versions: unfiltered, high-pass filtered, and low-pass filtered. ResNet features were extracted for each version, and 3 data loader functions were created to load these features and compare filtering effects. The grid search showed that the high-pass filter worked best, so it was used in the final model.

In each batch of images, the ResNet tensors representing the images are split into their original and derivative components, giving a total number of data points of twice the batch size n. These tensors are concatenated into a single batch. The labels are generated as [0, 0, ..., 0, 1, 1, ..., 1] (n zeros followed by n ones), since the concatenate function always combines the split images in this pattern. The ResNet tensors and their corresponding labels are then passed as inputs to the model.
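The batching pattern can be sketched in plain Python as follows. The helper name `build_batch` is hypothetical, strings stand in for the ResNet tensors, and treating label 0 as "original" is an assumption based on the ordering described above.

```python
def build_batch(pairs):
    """Turn n (original, derivative) feature pairs into 2n inputs and labels.

    Originals come first and derivatives second, so the labels are n zeros
    followed by n ones. Treating 0 as "original" is an assumption.
    """
    originals = [orig for orig, _ in pairs]
    derivatives = [deriv for _, deriv in pairs]
    inputs = originals + derivatives
    labels = [0] * len(originals) + [1] * len(derivatives)
    return inputs, labels


inputs, labels = build_batch([("o1", "d1"), ("o2", "d2"), ("o3", "d3")])
print(labels)  # [0, 0, 0, 1, 1, 1]
```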

In addition to model parameters, including the number of layers and dropout, different values of the following were compared using a grid search: batch size, learning rate, learning rate decay factor, and weight decay. The search selected a learning rate of 0.0001 with a decay of 0.05, a weight decay of 0.1, and a batch size of 32. The full results can be found in Appendix B.
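A grid search of this kind can be sketched with `itertools.product`. The candidate values below are hypothetical apart from the selected ones reported above, and `toy_evaluate` is a stand-in for training the model and returning its best validation accuracy.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return (best_score, best_config)."""
    names = list(grid)
    best_score, best_config = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


# Hypothetical candidate values; only the selected ones come from the report.
grid = {
    "batch_size": [16, 32, 64],
    "learning_rate": [1e-3, 1e-4],
    "lr_decay": [0.05, 0.5],
    "weight_decay": [0.01, 0.1],
}


def toy_evaluate(config):
    # Stand-in for "train the model and return the best validation accuracy".
    return -abs(config["learning_rate"] - 1e-4) - abs(config["batch_size"] - 32)


score, config = grid_search(grid, toy_evaluate)
print(config["batch_size"], config["learning_rate"])  # 32 0.0001
```

The number of training runs is the product of the list lengths (24 in this sketch), which is why the search space has to stay small.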

The model takes the input and applies 5 fully-connected layers to classify each image, returning a list of predicted labels. The cross-entropy loss function was chosen because it suits classification problems like this one; it combines the Log Softmax and NLL loss functions. Cross-entropy loss increases as the predicted probability diverges from the ground-truth label, which is the desired effect for this problem. The standard SGD (stochastic gradient descent) optimizer was used to update the weights, with the parameters chosen by the grid search.
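To make the composition of Log Softmax and NLL concrete, the sketch below computes the cross-entropy loss for a single example in plain Python. The function names are illustrative and are not the PyTorch API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy as NLL applied to log-softmax outputs."""
    return nll(log_softmax(logits), target)


undecided = cross_entropy([0.0, 0.0], 1)  # equals log(2): both classes equally likely
correct = cross_entropy([0.0, 4.0], 1)    # confident and right: small loss
wrong = cross_entropy([4.0, 0.0], 1)      # confident and wrong: large loss
print(correct < undecided < wrong)  # True
```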


After every epoch, the validation accuracy is computed. If it decreases for 2 epochs in a row (the tolerance level), the learning rate decays by a factor of 0.005. A tolerance of 2 epochs was chosen because the model is able to overfit the training set in under 30 epochs, so a low tolerance is appropriate to avoid accuracy decay over further epochs. This was implemented using PyTorch’s ReduceLROnPlateau scheduler after observing a lot of volatility in the training and validation curves, in order to improve smoothness and accuracy. The effect can be seen in Figures 7 and 9 below.

                    Figure 7. Accuracy Curve                                Figure 8. Loss Curve

Figure 9. Accuracy Curve Before LR Decay
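The decay rule described in this section can be written out in a few lines of plain Python. This mirrors the rule as stated in the text and is not PyTorch's scheduler, which compares each epoch against the best value seen so far rather than against the previous epoch. The function name and the accuracy history are made up for illustration, and the factor is left as a parameter.

```python
def decay_on_plateau(val_accuracies, lr, factor, tolerance=2):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` whenever validation accuracy has
    dropped for `tolerance` consecutive epochs.
    """
    schedule = []
    drops = 0
    previous = None
    for accuracy in val_accuracies:
        schedule.append(lr)
        if previous is not None and accuracy < previous:
            drops += 1
        else:
            drops = 0
        if drops >= tolerance:
            lr *= factor
            drops = 0
        previous = accuracy
    return schedule


# Made-up validation accuracies, for illustration only.
history = [0.60, 0.65, 0.64, 0.63, 0.66, 0.67]
schedule = decay_on_plateau(history, lr=1e-4, factor=0.05)
print(schedule)  # the rate drops after the two consecutive decreases
```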

The chosen model’s training, validation, and test accuracies were 91%, 74%, and 70%, respectively. The test accuracy being close to the validation accuracy suggests that the model has learned valuable features and is not overfitted to the idiosyncrasies of the training set. Additionally, it beat the 50% accuracy of the base model. Comparisons of the different model types are summarized in Table 1 below.

Table 1. Comparison for different model types, accuracies based on 30 epochs. ResNet (LR Decay) is the selected final model.

5.0 Ethical Issues

Ethics is a very important consideration for projects in the developing field of artificial intelligence. Our model addresses ethics by incorporating beneficence and justice. It is beneficial in helping people identify what is real or fake, which limits the amount of harmful information spread online. The model could also be used to help expose those who spread false information on social platforms.

However, because our photoshop detection model cannot guarantee 100% accuracy, false negatives and false positives can have serious consequences. A false negative predicts that an image is real when it is actually fake, and a false positive predicts that an image is fake when it is actually real. Because these cases exist, users should treat the model only as an aid in deciding on the authenticity of a photo, rather than treating the predicted classification as definitive.

6.0 Key Learnings

Upon completing this project, the team learned that achieving the overall goal was not an easy task, as was also apparent from the intermediate training and validation accuracies. One difficulty was that the baseline model could not outperform guessing and was unable to achieve a validation accuracy over 50%. Instead of spending a lot of effort researching potential methods to fix this issue, the team could have saved a lot of time by attempting ResNet features from the start.

Additionally, more time could have been invested in data cleaning rather than rushing to start building a model. Even after rigorous cleaning, the team realized that certain images still had their photoshopped components cropped out, and that a more careful review of the data could have provided better initial results.

7.0 References

[1] D. Eisinger, "The 15 Biggest Photoshop Scandals of All Time: 2. Bush Campaign Ads Duplicate Crowds", Complex, 2019. [Online]. Available: [Accessed: 21-Feb-2019].

[2] A. Wienstein, "Bloomberg - Are you a robot?", 2019. [Online]. Available: [Accessed: 23-Feb-2019].

[3] C. Abode, "Spotting Image Manipulation with AI | Adobe Blog", Adobe Blog, 2019. [Online]. Available: [Accessed: 24-Feb-2019].

[4] "dbisUnibas/PS-Battles", GitHub, 2019. [Online]. Available: [Accessed: 24-Feb-2019].

[5] (2019). · Making neural nets uncool again. [online] Available at: [Accessed 7 Apr. 2019].

[6] D. Kim, "Image Manipulation Detection using Convolutional Neural Network", 2019. [Online]. Available: [Accessed: 24-Feb-2019].

[7] J. Bunk, "Detection and Localization of Image Forgeries using Resampling Features and Deep Learning", 2019. [Online]. Available: [Accessed: 24-Feb-2019].

[8] "Pretrained ResNet-18 convolutional neural network - MATLAB resnet18", 2018. [Online]. Available: [Accessed: 17-Mar-2019].

[9] "ImageNet", 2016. [Online]. Available: [Accessed: 17-Mar-2019].

[10] Das, S. (2019). CNN Architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and more ….. [online] Medium. Available at: [Accessed 7 Apr. 2019].

[11] Prakash, J. (2019). Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image…. [online] Medium. Available at: [Accessed 7 Apr. 2019].

[12] M. Hasan, “Proposed Modified ResNet-18 architecture for Bangla HCR,” ResearchGate, Dec-2017. [Online]. Available: [Accessed: Apr-2019].

Appendix A

Figure 10. Histogram of data image size

Appendix B

The grid search was conducted for each filtering option: unfiltered, LP, and HP. The best validation accuracy of each run was recorded, under the assumption that training could be stopped early at that point.

Table 2. Grid Search Results for HP-filter