Today I will show you how to scrape mobile phone specs from a website. We will use standard data scraping methods to fetch the pages and extract the data.

Website: epey.com
Data: Mobile phone models and their specs.

Preparing the Toolbox Before Coding

I need to install these libraries and apps to follow this tutorial. They are the tools in my toolbox.

Libraries:

  • Guzzle (HTTP client for PHP, installed with Composer)

For data storage:

  • MySQL (for creating a database to keep data)
  • JSON (sometimes used to keep parameters for our scraper)

Answering Questions to Create a Strategy

These questions are from

  1. Is blabla.com/sitemap.xml available? (no)
    https://www.epey.com/sitemap.xml was not found.
  2. Does the robots.txt file contain a URL to a sitemap file? (yes)
    https://www.epey.com/sitemap/urun.xml
  3. Is there a paginated listing page that covers all the pages that have data? (yes)
    https://www.epey.com/akilli-telefonlar/
  4. Is there a security or authorization system in the way when we try to access the data? (no)
    The data is public and we can access it directly.
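
Question 2 can also be checked in code. The following is a minimal sketch, assuming the robots.txt content has already been downloaded into a string. The function name `extractSitemapUrls` and the sample content are illustrative assumptions, not part of the original script.

```php
<?php
// Sketch: collect the "Sitemap:" entries from the text of a robots.txt file.
// extractSitemapUrls() is a hypothetical helper name chosen for this example.
function extractSitemapUrls(string $robotsTxt): array
{
    $urls = [];

    // \R matches any line ending (\n, \r\n or \r).
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // The directive name is matched case-insensitively;
        // the URL is the first run of non-space characters after the colon.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $match)) {
            $urls[] = $match[1];
        }
    }

    return $urls;
}

// Hypothetical robots.txt content; only the Sitemap URL comes from the article.
$sample = "User-agent: *\nDisallow: /admin/\nSitemap: https://www.epey.com/sitemap/urun.xml\n";

print_r(extractSitemapUrls($sample));
```

An empty result would mean the answer to question 2 is "no", and the listing page from question 3 would be the fallback.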

With these results, we can use two methods:

  • First, we can get the list of product pages from the sitemap file.
  • Second, we can build the list by walking the paginated listing page.

Accessing the list through the sitemap is the easier way, so I will use this method.
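
As a rough sketch of the sitemap method, the function below pulls every `<loc>` value out of a sitemap document using PHP's DOM extension. It assumes the XML has already been downloaded into a string. The name `extractSitemapLocations` and the sample URLs are hypothetical, and the real structure of urun.xml has not been verified here.

```php
<?php
// Sketch: read all <loc> entries from a sitemap (or sitemap index) XML string.
// extractSitemapLocations() is a hypothetical helper name for this example.
function extractSitemapLocations(string $sitemapXml): array
{
    // DOMDocument::loadXML() rejects an empty string, so return early.
    if (trim($sitemapXml) === '') {
        return [];
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($sitemapXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return [];
    }

    $locations = [];
    foreach ($dom->getElementsByTagName('loc') as $node) {
        $locations[] = trim($node->textContent);
    }

    return $locations;
}

// Hypothetical sitemap content; the URLs are placeholders, not real product pages.
$sample = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>https://example.com/phones/model-a.html</loc></url>'
    . '<url><loc>https://example.com/phones/model-b.html</loc></url>'
    . '</urlset>';

print_r(extractSitemapLocations($sample));
```

Each returned URL would then be one page to request and parse for specs.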

Preparation for Getting Data

First I will create a database table to save this data. Here is the schema of my MySQL table.

CREATE TABLE `mobile_phone_specs` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `manufacturer` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `model` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `data` text COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
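
To make the role of the `data` column concrete, here is a minimal sketch of how one scraped phone could be turned into a row for this table. Storing the specs as a JSON string in `data` is an assumption made for this example, and the helper name `buildPhoneRow`, the constant `INSERT_SQL`, and the sample values are all hypothetical.

```php
<?php
// Sketch: prepare named parameters for an INSERT into mobile_phone_specs.
// Assumption: the scraped specs are stored as one JSON document in `data`.
const INSERT_SQL = 'INSERT INTO `mobile_phone_specs` '
    . '(`id`, `name`, `manufacturer`, `model`, `data`) '
    . 'VALUES (:id, :name, :manufacturer, :model, :data)';

function buildPhoneRow(int $id, string $name, string $manufacturer, string $model, array $specs): array
{
    return [
        ':id' => $id,
        ':name' => $name,
        ':manufacturer' => $manufacturer,
        ':model' => $model,
        // Keep non-ASCII characters readable and fail loudly on bad input.
        ':data' => json_encode($specs, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR),
    ];
}

// Hypothetical sample values, not real scraped data.
$row = buildPhoneRow(1, 'Example Phone X', 'Example', 'X', [
    'screen_size' => '6.1 inch',
    'battery' => '4000 mAh',
]);

print_r($row);

// With a PDO connection in $pdo (connection details omitted), the row
// could be saved with: $pdo->prepare(INSERT_SQL)->execute($row);
```

Using a prepared statement with named parameters keeps scraped text from being interpreted as SQL.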

Then I will create a PHP file to scrape the data.

<?php

//Report every kind of PHP error
error_reporting(E_ALL);

//Show errors in the output while developing
ini_set('display_errors', 1);

//Include libraries which we installed with composer
require 'vendor/autoload.php';

//Parameters
$domain = 'https://epey.com/';

//Create a client for HTTP requests
$client = new \GuzzleHttp\Client([
    'base_uri' => $domain, //Base URL that relative request paths are resolved against
    'timeout' => 3.0 //Stop waiting if there is no answer after this many seconds
]);

//Handle exceptions
try {

    //Get the response object from the client
    $response = $client->get('/');

    //Print the response body (the HTML of the page)
    print $response->getBody()->getContents();

} catch (\GuzzleHttp\Exception\ConnectException $exception) {

    //Stop if the connection itself fails (for example DNS failure or timeout)
    die('could not connect to host : ' . $exception->getMessage());

} catch (\GuzzleHttp\Exception\ClientException $exception) {

    //Print the error message for 4xx responses such as 403 Forbidden
    die($exception->getMessage());

}

This code produces:

Client error: `GET https://epey.com/` resulted in a `403 Forbidden` response:

This means the site does not let us in as a bot. They are protecting themselves with Cloudflare, so we need to either bypass this security system or find a vulnerability.
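
To tell a Cloudflare block apart from an ordinary 403, one option is to look at the response headers. The sketch below assumes the usual `Server: cloudflare` and `CF-RAY` headers that Cloudflare adds to responses. The function name `looksLikeCloudflareBlock` and the sample header values are hypothetical.

```php
<?php
// Sketch: guess whether a 403 response came from Cloudflare.
// $headers uses the shape returned by a PSR-7 getHeaders() call:
// header name => list of values (a plain string value is accepted too).
function looksLikeCloudflareBlock(int $statusCode, array $headers): bool
{
    // Header names are case-insensitive, so normalize them first.
    $normalized = [];
    foreach ($headers as $name => $values) {
        $normalized[strtolower($name)] = strtolower(implode(', ', (array) $values));
    }

    $servedByCloudflare = isset($normalized['cf-ray'])
        || strpos($normalized['server'] ?? '', 'cloudflare') !== false;

    return $statusCode === 403 && $servedByCloudflare;
}

// Hypothetical headers for illustration; the CF-RAY value is made up.
$headers = [
    'Server' => ['cloudflare'],
    'CF-RAY' => ['0000000000000000-IST'],
];

var_dump(looksLikeCloudflareBlock(403, $headers));
```

Inside the `ClientException` catch block, the status code and headers are available through `$exception->getResponse()->getStatusCode()` and `$exception->getResponse()->getHeaders()`.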

While I was trying to bypass Cloudflare, I found a vulnerability on the website that let me access all the latest data. I then decided to report it to the company.

The vulnerability was this: they forgot to hide the ns4.epey.com domain behind Cloudflare. When I explored it, I found a directory listing vulnerability: all directories and files were publicly listed. Browsing the folders, I found all the current data exported by their software.

The file timestamps were from that same day, so all the data was up to date.

Sorry guys, this time it was too easy. I decided to exchange this information for a “thank you”. Let it be that way this time. I cannot publish this vulnerability.

10 May 2021 Update:

The owner of the company gave me a gift for this information. Thanks to epey.com for this approach!

See you on the next scraping journey!