Extract Text from PDF and Image Files
Have a PDF document that you would like to extract all the text out of? What about image files of a scanned document that you want to convert into editable text? These are some of the most common issues I’ve seen at the workplace when working with files.
In this article, I’ll talk about several different ways you can go about trying to extract text from a PDF or from an image. Your extraction results will vary depending on the type and quality of the text in the PDF or image. Also, your results will vary depending on the tool you use, so it’s best to try out as many of the options below as possible to get the best results.
Extract Text from Image or PDF
The simplest and quickest way to start is to try an online PDF text extractor service. These are normally free and can give you exactly what you are looking for without having to install anything on your computer. Here are two that I have used with very good to excellent results:
ExtractPDF
ExtractPDF is a free tool to grab images, text and fonts out of a PDF file. The only limitation is that the max size for the PDF file is 10 MB. That’s a bit small, so if you have a bigger file, try some of the other methods below. Choose your file and then click the Send file button. The results are normally very fast and you should see a preview of the text when you click on the Text tab.
It is also a nice added benefit that it extracts images out of the PDF file too, just in case you need those! Overall, the online tool works great, but I have run into a couple of PDF docs that give me funny output. The text is extracted just fine, but for some reason it’ll have a line break after each word! Not a huge problem for a short PDF file, but certainly an issue for files with lots of text. If that happens to you, try the next tool.
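If you would rather keep the ExtractPDF output, the one-word-per-line problem can also be cleaned up after the fact. Here is a minimal Python sketch, assuming you have saved the extracted text to a file and that real paragraph breaks show up as blank lines (if they don't, everything will be merged into one paragraph). The function name `rejoin_words` is my own and is not part of any of these tools.

```python
import re


def rejoin_words(text: str) -> str:
    """Collapse single line breaks into spaces; keep blank lines as paragraph breaks."""
    # Normalise Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (possibly containing spaces) is treated as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, join every whitespace-separated word with one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "The\nquick\nbrown\nfox\n\nNext\nparagraph\n"
    print(rejoin_words(sample))
```

To use it on a real file, read the text with `open(path, encoding="utf-8").read()`, pass it to `rejoin_words`, and write the result back out.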
Online OCR
Online OCR usually worked for the documents that didn’t convert properly with ExtractPDF, so it’s a good idea to try both services to see which one gives you better output. Online OCR also has some nicer features that can prove handy if you have a large PDF file and only need to convert the text on a few pages rather than the whole document.
The first thing you want to do is go ahead and create a free account. It’s a bit annoying, but if you don’t create the free account, it will only partially convert your PDF rather than the entire document. Also, instead of being limited to a 5 MB document, you can upload up to 100 MB per file with an account.
First, choose a language and then pick the type of output formats you would like for the converted file. You have a couple of options and you can choose more than one if you like. Under Multipage document, you can select Page numbers and then choose only the pages that you want to convert. Then you select the file and click Convert!
After conversion, you’ll be brought to the Documents section (if you’re logged in) where you can see how many available free pages you have left and links to download your converted files. It seems like you only have 25 pages for free a day, so if you need more than that, you’ll have to either wait a bit or buy more pages.
Online OCR did an excellent job of converting my PDFs because it was able to maintain the actual layout of the text. In my test, I took a Word doc that used bullets, different font sizes, etc. and converted it to a PDF. Then I used Online OCR to convert it back to Word format and it was about 95% the same as the original. That’s pretty impressive to me.
Plus, if you are looking to convert an image to text, then Online OCR can do that just as easily as extracting text from PDF files.
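If you have a folder full of PDFs, it can save some time to check which of the services above will even accept a given file. The sketch below uses only the upload limits quoted in this article (10 MB for ExtractPDF, 5 MB for Online OCR without an account, 100 MB with one). Those limits may have changed since, and I am assuming 1 MB means 1,048,576 bytes. The function names are made up for illustration.

```python
import os

MB = 1024 * 1024  # assumption: the sites count 1 MB as 1,048,576 bytes

# Upload limits as quoted in this article; check the sites for current values.
LIMITS_MB = {
    "ExtractPDF": 10,
    "Online OCR (no account)": 5,
    "Online OCR (free account)": 100,
}


def services_accepting(size_bytes: int) -> list[str]:
    """Return the services whose quoted upload limit covers a file of this size."""
    return [name for name, limit in LIMITS_MB.items() if size_bytes <= limit * MB]


def services_for_file(path: str) -> list[str]:
    """Same check, but reads the size of a file on disk."""
    return services_accepting(os.path.getsize(path))


if __name__ == "__main__":
    # A 7 MB file is too big for Online OCR without an account.
    print(services_accepting(7 * MB))
```

Anything that comes back with an empty list is a candidate for one of the desktop programs below instead.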
Free Online OCR
Since we’re talking about image-to-text OCR, let me mention another good website that works really well on images. Free Online OCR was very good and very accurate when extracting text from my test images. I took a couple of photos with my iPhone of pages from books, pamphlets, etc. and I was surprised at how well it was able to convert the text.
Choose your file and then click the Upload button. On the next screen, there are a couple of options and a preview of the image. You can crop it if you don’t want to OCR the whole thing. Then just click the OCR button and your converted text will appear below the image preview. It also doesn’t have any limitations, which is really nice.
In addition to the online services, there are two freeware PDF converters I want to mention in case you need software running locally on your computer to perform the conversions. With online services, you’ll always need an Internet connection and that may not be possible for everyone. However, I noticed that the quality of the conversions from the freeware programs was significantly worse than that of the websites.
A-PDF Text Extractor
A-PDF Text Extractor is freeware that does a fairly good job of extracting text from PDF files. Once you download it and install it, click the Open button to choose your PDF file. Then click Extract text to start the process.
It’ll ask you for a location to store the text output file and then it will begin extracting. You can also click on the Option button, which lets you choose only certain pages to extract and the extraction type. The second option is interesting because it extracts the text in different layouts, and it’s worth trying all three to see which one gives you the best output.
PDF2Text Pilot
PDF2Text Pilot does an OK job of extracting text. It doesn’t have any options; you just add files or folders, convert and hope for the best. It worked well on some PDFs, but for the majority of them, there were numerous issues.
Just click Add Files and then click Convert. Once the conversion is complete, click on Browse to open the file. Your mileage will vary using this program, so don’t expect much.
Also, it’s worth mentioning that if you are in a corporate environment or can get your hands on a copy of Adobe Acrobat from work, then you can really get much better results. Acrobat is obviously not free, but it has options to convert PDF to Word, Excel and HTML format. It also does the best job of maintaining the structure of the original document and converting complicated text.