In this article, I would like to share how to create a data scraping strategy, and how to think while you're creating one.
What is Data Scraping?
Scraping is a technique used to extract information from websites that do not have an API available. It can be done using programming skills, third-party applications, or browser extensions.
Data Scraping Tools
The key thing is to use the programming language you know best. That knowledge gives you more room to create the best strategy. For me, that is PHP, so I'll base this tutorial on PHP.
Software Programming Languages
- Any other programming language that lets you access the internet.
Here are some applications that I use while scraping basic data.
You can share the applications you use to extract data in the comments.
How to Create a Scraping Strategy?
I will show you how to create a scraping strategy for a website. Sometimes there will be security systems like session-based access, cookie-based access, and so on. We will try to bypass them to extract the data.
When you start thinking about extracting data, you need to think more widely about how to get it. Here are the steps for creating a strategy:
- Is blabla.com/sitemap.xml available?
- Does the robots.txt file have a URL to the sitemap.xml file?
- Is there a page that paginates through all the pages that have data?
- Is there a security or authorization system in place while we're trying to access the data?
These are the most important criteria for creating a strategy.
1 – Is blabla.com/sitemap.xml Available?
We check this URL first because the sitemap file has a list of the site's pages. (You can check my sitemap as an example here.) Sometimes the sitemap file will not be there; in that case, we try the next step.
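As a quick sketch of this first check (shown in Python for brevity, though the tutorial targets PHP; building the URL and testing the status code port directly):

```python
import urllib.request
import urllib.error


def sitemap_url(base_url: str) -> str:
    """Build the conventional sitemap location for a site."""
    return base_url.rstrip("/") + "/sitemap.xml"


def sitemap_is_available(base_url: str) -> bool:
    """Return True if <site>/sitemap.xml answers with HTTP 200."""
    try:
        with urllib.request.urlopen(sitemap_url(base_url), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False


# Example: sitemap_is_available("https://blabla.com")
```

If this returns False, we move on to the robots.txt check below.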
2 – Does the robots.txt File Have a URL to the sitemap.xml File?
Sometimes we cannot access the sitemap file through blabla.com/sitemap.xml. For SEO, robots.txt files should have a link to the sitemap.xml file. But sometimes, for security purposes, there may be a directory password or an FTP password protecting the sitemap files. So if we could not find the sitemap.xml file directly, we'll look at the robots.txt file.
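Robots.txt declares sitemaps with `Sitemap:` lines, so this lookup is a simple line scan (a Python sketch; the same string handling works in PHP):

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect every 'Sitemap:' entry declared in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the first colon, so the URL's own "://" survives.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

Fetch blabla.com/robots.txt, pass its body to this function, and you either get sitemap URLs to follow or an empty list, which sends you to the pagination method below.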
2.1 – If we can access the sitemap.xml file
We will search it for the list of pages that contain our target data. If we find this list, we can move on to the extraction step.
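A sitemap is plain XML where each page sits in a `<loc>` element, so pulling the page list out is a short parsing job (a Python sketch using the standard sitemap namespace; PHP's SimpleXML does the same):

```python
import xml.etree.ElementTree as ET

# Standard namespace from the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def page_urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Pull every <loc> entry (one per page) out of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```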
2.2 – If we can not access the sitemap.xml file
When you cannot access or find the sitemap.xml file, you need a different method to get a list of pages. Continue reading for that method.
3 – Is There a Page That Has Pagination for All Pages That Have Data?
Imagine we're reading a book. Every page holds some data for us. What do we need to do? The answer: read the first page, extract its data, then go to the next. Repeat this until the book ends. This is the algorithm we use for scraping.
If we find a page with pagination, like a book, we can fetch all the pages one by one and extract our data. So surf the website and try to find a page that has pagination.
When you find one, think of it as having found the book's list of pages.
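The book-reading loop above can be sketched like this (Python for brevity; the `{page}` template and the `page` query parameter are assumptions, so adapt them to how the real site paginates):

```python
def paginated_urls(url_template: str, last_page: int) -> list[str]:
    """Turn one paginated listing into a URL per 'page of the book'.

    url_template must contain a {page} placeholder, e.g.
    "https://blabla.com/products?page={page}" -- a hypothetical pattern.
    """
    return [url_template.format(page=n) for n in range(1, last_page + 1)]


# Walk the book page by page:
# for url in paginated_urls("https://blabla.com/products?page={page}", 50):
#     html = fetch(url)        # fetch() stands in for your HTTP client
#     extract_data(html)       # ...and extract_data() for your parser
```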
4 – Is There a Security or Authorization System While We’re Scraping Data?
Sometimes we cannot access the sitemap.xml files or the pages with the target data, and we need to find a way in. For this step, you should know how HTTP works and how the browser talks to the server.
When you log in to a website, the browser stores a cookie for that site to keep the session ID the server sent. If you can capture or save this cookie and attach it to your requests, you can simulate that login process. Requests sent with that cookie can access pages and data as if you were logged in.
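A minimal sketch of replaying a saved session cookie (Python for brevity; `PHPSESSID` is PHP's default session cookie name, which is an assumption here, so check the real cookie name in your browser's dev tools):

```python
import urllib.request


def logged_in_request(url: str, session_id: str) -> urllib.request.Request:
    """Attach a saved session ID so the server treats us as logged in.

    'PHPSESSID' is assumed; replace it with the target site's actual
    session cookie name.
    """
    req = urllib.request.Request(url)
    req.add_header("Cookie", f"PHPSESSID={session_id}")
    return req


# Usage (network call left commented out):
# req = logged_in_request("https://blabla.com/members-only", "abc123")
# with urllib.request.urlopen(req) as resp:
#     html = resp.read().decode()
```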
I will continue this topic with a worked example. If you want to see one about scraping data, you can visit Scrapping Epey.com Data with PHP #1.
Thanks for reading.
Hello I’m İlyas Özkurt. I am a software developer who has been working on websites for 10 years. My first studies in this field were during my high school years. Now I work as a software developer at 6harf and am responsible for all web projects there.