How To Scrape a Website
Web scraping is used by almost every industry to extract and analyze data from the internet. Companies use collected data to come up with new business strategies and products. Your data is valuable. Unless you are taking steps to protect your privacy, companies are using your data to make money.
If big business is doing it, why don’t you do it too? Learning how to scrape a website can help you find the best deal, gather leads for your business, and even help you find a new job.
Use a Web Scraping Service
The quickest and simplest way to gather data from the internet is to use a professional web scraping service. If you need to collect large amounts of data, a service like Scrapinghub might be a good fit. Such services offer large-scale, easy-to-use online data collection.
If you are looking for something on a smaller scale, ParseHub is worth a look for scraping a few websites. All users begin with a free 200-page plan, which requires no credit card and can be upgraded later through a tiered pricing system.
Web Scraping App
For a quick, free, and convenient way of scraping websites, the Web Scraper Chrome Extension is a great choice.
There is a bit of a learning curve, but the developer has provided fantastic documentation and tutorial videos. Web Scraper is among the simplest and best tools for small-scale data collection, offering more in its free tier than most.
Use Microsoft Excel To Scrape a Website
For something a little more familiar, Microsoft Excel offers a basic web scraping feature. To try it out, open a new Excel workbook, and select the Data tab. Click From Web in the toolbar, and follow the instructions in the wizard to start the collection.
From there, you have several options for saving the data into your spreadsheet. Check out our guide to web scraping with Excel for a full tutorial.
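If you would rather script this step, the same idea (pull a table out of a page and save it in a form a spreadsheet can open) can be sketched in Python. This is a minimal illustration using only the standard library, not what Excel does internally. The names TableParser, table_to_csv, and TABLE_HTML are made up for this example, and the sample table is invented data.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, if any
        self._cell = None   # text pieces of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Invented sample page so the sketch runs without a network connection.
TABLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>$899</td></tr>
  <tr><td>Phone</td><td>$499</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(TABLE_HTML))
```

Writing the result to a .csv file gives you something Excel can open directly.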
If you are familiar with the Python programming language, Scrapy is the perfect library for you. It allows you to set up custom “spiders,” which crawl websites to extract information. You can then use the information gathered in your programs, or export it to a file.
The Scrapy tutorial covers everything from basic web scraping through to professional-level, scheduled, multi-spider information gathering. Learning how to use Scrapy to scrape a website isn’t just a useful skill for your own needs. Developers who know how to use Scrapy are in high demand, which could lead to a whole new career.
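To make the idea of a spider concrete, here is a rough sketch of the crawling loop: start at one page, collect its links, and visit each new one in turn. It deliberately does not use Scrapy, so that it runs with nothing installed; Scrapy handles this loop for you, along with scheduling and exporting. LinkParser, crawl, and the three-page FAKE_SITE are hypothetical names and data invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start, fetch):
    """Visit every page reachable from `start`, breadth first.

    `fetch` is any function that takes a URL and returns its HTML,
    or None when the page does not exist.
    Returns the URLs in the order they were visited.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: skip it
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Invented three-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "/": '<a href="/jobs">Jobs</a> <a href="/deals">Deals</a>',
    "/jobs": '<a href="/">Home</a> <a href="/jobs/editor">Editor</a>',
    "/deals": '<a href="/">Home</a>',
}

if __name__ == "__main__":
    print(crawl("/", FAKE_SITE.get))
```

Passing the fetch function in as an argument keeps the crawl logic separate from the network code, so you could swap the in-memory dictionary for a real HTTP request later.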
Beautiful Soup is a Python library for web scraping. It’s similar to Scrapy but has been around for much longer. Many users find Beautiful Soup easier to use than Scrapy.
It’s not as fully featured as Scrapy, but for most use cases, it’s the perfect balance between functionality and ease of use for Python programmers.
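The core job of a library like Beautiful Soup is pulling specific pieces of text out of a page's HTML. The sketch below shows that idea using Python's built-in html.parser module instead, so it runs without installing anything; Beautiful Soup offers a much friendlier interface for the same task. ClassTextParser, extract_text, and the sample job listing are invented for illustration, and the parser assumes the elements you want contain plain text with no nested tags.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collects the text of every element that carries a given CSS class.

    Assumes matching elements hold plain text only (no nested tags).
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.results = []
        self._parts = None  # text pieces while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._parts is not None:
            self.results.append("".join(self._parts).strip())
            self._parts = None


def extract_text(html, css_class):
    """Return the text of each element in `html` with class `css_class`."""
    parser = ClassTextParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented job listing so the sketch runs offline.
JOBS_HTML = """
<ul>
  <li><span class="title">Junior Editor</span> <span class="city">Leeds</span></li>
  <li><span class="title featured">Data Analyst</span> <span class="city">Cardiff</span></li>
</ul>
"""

if __name__ == "__main__":
    print(extract_text(JOBS_HTML, "title"))
    print(extract_text(JOBS_HTML, "city"))
```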
Use a Web Scraping API
Even if you are comfortable writing your web scraping code yourself, you still need to run it locally. This is fine for small operations, but as your data collection scales up, it will use up precious bandwidth, potentially slowing down your network.
Using a web scraping API offloads some of the work to a remote server, which you access via code. There are several options, ranging from fully featured, professionally priced services like Dexi to simpler, stripped-back services like ScraperAPI.
Both cost money to use, but ScraperAPI offers 1000 free API calls, so you can try the service before committing to payment.
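In practice, calling a scraping API usually means sending it your API key and the address of the page you want, then reading back the HTML it fetched for you. The sketch below shows the general shape of such a call. The endpoint and the parameter names (api_key, url, render) are placeholders made up for this example, not the real interface of Dexi or ScraperAPI, so check your provider's documentation for the actual values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only. Replace it with the
# address given in your provider's documentation.
API_ENDPOINT = "https://api.scraper.example/v1/fetch"


def build_request_url(api_key, target_url, render_js=False):
    """Build the URL that asks the remote service to fetch `target_url`.

    The parameter names used here are assumptions, not a real API.
    """
    params = {"api_key": api_key, "url": target_url}
    if render_js:
        params["render"] = "true"
    # urlencode escapes the target URL so it survives as a single parameter.
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_via_api(api_key, target_url):
    """Ask the remote service for the page and return its HTML as text."""
    request_url = build_request_url(api_key, target_url)
    with urlopen(request_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only prints the request URL; no network call is made here.
    print(build_request_url("YOUR_KEY", "https://example.com/deals?page=2"))
```

The heavy lifting (downloading the page, retrying, rotating addresses) happens on the provider's servers, which is what keeps the load off your own network.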
Use IFTTT To Scrape a Website
IFTTT is a powerful automation tool. You can use it to automate almost anything, including data collection and web scraping.
One of the huge benefits of IFTTT is its integration with many web services. A basic example using Twitter could look something like this:
- Sign in to IFTTT and select Create
- Select Twitter on the service menu
- Select New Search From Tweet
- Enter a search term or hashtag, and click Create Trigger
- Choose Google Sheets as your action service
- Select Add Row to Spreadsheet and follow the steps
- Click Create Action
In just a few short steps, you have created an automatic service that records tweets matching a search term or hashtag, along with the username and the time of posting.
With so many options for connecting online services, IFTTT, or one of its alternatives, is the perfect tool for simple data collection by scraping websites.
For iOS users, the Shortcuts app is a great tool for linking and automating your digital life. While you might be familiar with its integration between your calendar, contacts, and maps, it is capable of much more.
In a detailed post, Reddit user u/keveridge outlines how to use regular expressions with the Shortcuts app to get detailed information from websites.
Regular expressions allow much more fine-grained searching and can work across multiple files to return only the information you need.
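To see what a regular expression does, here is a small Python sketch (the Shortcuts app has its own way of entering patterns, so this only illustrates the idea). It pulls dollar prices out of text and applies the same pattern to several pages at once. The pattern, the function names, and the sample pages are all invented for the example.

```python
import re

# A dollar sign, one or more digits, and optionally a two-digit cents part.
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")


def find_prices(text):
    """Return every dollar amount found in `text` as a float."""
    return [float(amount) for amount in PRICE_PATTERN.findall(text)]


def find_prices_in_pages(pages):
    """Apply the same pattern to several pages, keyed by page name."""
    return {name: find_prices(text) for name, text in pages.items()}


# Invented page snippets so the sketch runs offline.
PAGES = {
    "laptops.html": "<li>Laptop now $899.99 (was $1049)</li>",
    "cases.html": "<li>Free case worth $19.50 included</li>",
}

if __name__ == "__main__":
    print(find_prices_in_pages(PAGES))
```

Everything that does not match the pattern is ignored, which is how a regular expression returns only the information you need.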
Use Tasker for Android To Search The Web
If you are an Android user, there are no simple options to scrape a website. You can use the IFTTT app with the steps outlined above, but Tasker might be a better fit.
Tasker is available for $3.50 on the Play Store, and many view it as IFTTT’s older sibling. It has a vast array of options for automation. These include custom web searches, alerts when data on selected websites changes, and the ability to download content from Twitter.
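Tasker is set up through its own app rather than through code, but the idea behind a "page changed" alert is simple enough to sketch in Python: keep a fingerprint of the page from the last visit and compare it with the current one. ChangeWatcher and fingerprint are hypothetical names for this illustration, and the page content is passed in directly so the example runs offline.

```python
import hashlib


def fingerprint(content):
    """Return a short, fixed-length summary of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class ChangeWatcher:
    """Remembers the last fingerprint seen for each URL."""

    def __init__(self):
        self._last = {}

    def check(self, url, content):
        """Return True if `content` differs from the previous check.

        The first check of a URL returns False, because there is
        nothing to compare against yet.
        """
        new = fingerprint(content)
        old = self._last.get(url)
        self._last[url] = new
        return old is not None and old != new


if __name__ == "__main__":
    watcher = ChangeWatcher()
    url = "https://example.com/deal"
    print(watcher.check(url, "Price: $10"))  # first visit
    print(watcher.check(url, "Price: $10"))  # unchanged
    print(watcher.check(url, "Price: $9"))   # changed
```

Storing a fingerprint instead of the whole page keeps the saved state small, at the cost of only knowing that something changed, not what.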
While not a traditional web scraping method, automation apps can provide much of the same functionality as professional web scraping tools without needing to learn how to code or pay for an online data gathering service.
Automated Web Scraping
Whether you want to gather information for your business or make your life more convenient, web scraping is a skill worth learning.
The information you gather, once properly sorted, will give you much greater insight into the things that interest you, your friends, and your business clients.