
Today I will show you how to scrape mobile phone specs from a website and extract the data.
Website: epey.com
Data: Mobile phone models and their specs.
Preparing the Toolbox Before Starting to Code
I need to install these libraries and apps to follow this tutorial. They are the tools in my toolbox.
Libraries:
- Composer (for installing PHP packages)
- Guzzle (for making HTTP requests like a browser)
- PHP Simple HTML DOM Parser (for extracting data)
- PHP-MySQLi-Database-Class (for talking with MySQL server)
For data storage:
- MySQL (for creating a database to keep data)
- JSON (sometimes to keep parameters for our scraper)
Answering Questions to Create a Strategy
These questions are from
- Is blabla.com/sitemap.xml available? (no)
  https://www.epey.com/sitemap.xml is not found.
- Does the robots.txt file have a URL to the sitemap.xml file? (yes)
  https://www.epey.com/sitemap/urun.xml
- Is there a page with pagination that lists all the pages that have data? (yes)
  https://www.epey.com/akilli-telefonlar/
- Is there a security or authorization system in the way when we try to access the data? (no)
  Data is public and we can access it directly.
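The robots.txt check can be automated. Here is a minimal sketch in plain PHP that pulls every "Sitemap:" line out of a robots.txt body. The function name findSitemapUrls and the sample content are my assumptions for illustration; they are not taken from epey.com.

```php
<?php
/**
 * Return every URL declared with a "Sitemap:" directive in a robots.txt body.
 * Hypothetical helper, for illustration only.
 */
function findSitemapUrls(string $robotsTxt): array
{
    $urls = [];
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Drop trailing comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        // Match the directive name case-insensitively and capture the URL.
        if (preg_match('/^sitemap:\s*(\S+)/i', $line, $m)) {
            $urls[] = $m[1];
        }
    }
    return $urls;
}

// Made-up robots.txt content, not the real file of any site.
$sample = "User-agent: *\nDisallow: /admin/\nSitemap: https://www.example.com/sitemap/products.xml\n";
print_r(findSitemapUrls($sample));
```

In a real run the robots.txt body would come from an HTTP request instead of a hard-coded string.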
With these results, we can use two methods:
- First, we can get the list from the sitemap file.
- Second, we can get the list from the listing page.
Getting the list from the sitemap is the easier way, so I will use this method.
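To show what reading a sitemap looks like, here is a small sketch using PHP's built-in SimpleXML. It assumes the file is a standard urlset document as described by the sitemaps.org protocol; I have not verified the real layout of the epey.com sitemap, and the function name extractSitemapUrls and the sample URLs are hypothetical.

```php
<?php
/**
 * Extract every <loc> URL from a sitemap <urlset> document.
 * Hypothetical helper; assumes a standard sitemaps.org urlset.
 */
function extractSitemapUrls(string $sitemapXml): array
{
    // Collect libxml errors instead of emitting warnings on malformed XML.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($sitemapXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return [];
    }

    $urls = [];
    foreach ($xml->url as $entry) {
        $urls[] = trim((string) $entry->loc);
    }
    return $urls;
}

// Made-up sitemap content, for illustration only.
$sample = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/phones/model-a.html</loc></url>
  <url><loc>https://www.example.com/phones/model-b.html</loc></url>
</urlset>
XML;

print_r(extractSitemapUrls($sample));
```

If the file turns out to be a sitemap index instead, the same loop would need to read the sitemap entries and fetch each child file.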
Preparation for Getting Data
First I will create a database table to save this data. Here is the schema of my MySQL table.
Then I will create a PHP file to scrape data.
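A request of this kind can be sketched with Guzzle as follows. This is a minimal sketch and not the exact script: the function name fetchPage is hypothetical, and I assume Guzzle was installed with Composer and vendor/autoload.php is already loaded. It relies on Guzzle's default behavior of throwing a ClientException for 4xx responses.

```php
<?php
/**
 * Fetch a page body, or return null when the server answers with a 4xx error.
 *
 * Hypothetical helper. $client is expected to be a GuzzleHttp\Client; it is
 * left untyped here so that a test double can be passed in as well.
 */
function fetchPage($client, string $url): ?string
{
    try {
        $response = $client->request('GET', $url);
        return (string) $response->getBody();
    } catch (\GuzzleHttp\Exception\ClientException $e) {
        // Guzzle raises ClientException for 4xx responses by default.
        echo $e->getMessage(), PHP_EOL;
        return null;
    }
}
```

Calling it would look like fetchPage(new \GuzzleHttp\Client(['timeout' => 10]), 'https://epey.com/'). Passing the client in keeps the function easy to try against a fake client without touching the network.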
This code results in:
Client error: `GET https://epey.com/` resulted in a `403 Forbidden` response:
This means the site refuses access to bots. They are protecting themselves with Cloudflare, so we need to either bypass this security system or find a vulnerability.
Sorry to Say That
While I was trying to bypass Cloudflare, I found a vulnerability on the website that let me access all the latest data. I decided to report it to the company.
The vulnerability was: “they forgot to hide the ns4.epey.com domain behind Cloudflare.” When I explored it, I found a directory listing vulnerability: all directories and files were publicly listed. Browsing the folders, I found all the current data exported by their software.
The file timestamps were from that same day, so all the data was up to date.
Sorry guys, this time it was too easy. I decided to exchange this information for a “thank you”. Let it be that way this time. I cannot publish this vulnerability.
If you need other data, you can get it from my GitHub profile!
https://github.com/ilyasozkurt/mobilephone-brands-and-models
10 May 2021 Update:
The owner of the company gave me a gift for this information. Thanks to epey.com for this approach!
See you on the next scraping journey!