
Converting a PDF document to .DOC format

From ProZ.com Wiki



Introduction

We increasingly receive source documents for translation in PDF format. It is preferable to obtain the source document in an editable format; after all, it existed in one (such as Word, QuarkXPress or FrameMaker) before the PDF was created. Sometimes, however, there is no alternative and we have to work with the PDF itself. This article explains how to deal with PDF source documents.

PDF in image format

A PDF in image format is a PDF that is based on images, such as a scanned document, a fax or other types in which text cannot be selected using the text selection tool in Adobe Acrobat Reader.

Counting words in an image PDF is not feasible unless you first extract the text using optical character recognition (OCR). In fact, OCR is the only way to process the text in such a PDF, short of retyping the entire document manually to obtain an editable copy.
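
Once the text has been extracted, a rough word count is easy to obtain. The following is a minimal Python sketch, not the method of any particular tool: the function name count_words is hypothetical, and it assumes that a "word" is any whitespace-separated token containing at least one letter or digit. CAT tools and word processors apply their own counting rules, so their totals may differ.

```python
def count_words(text: str) -> int:
    """Return a rough word count for extracted text.

    Assumption: a word is a whitespace-separated token that contains
    at least one letter or digit, so stray punctuation left behind by
    OCR (for example a lone dash) is not counted.
    """
    return sum(
        1
        for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


if __name__ == "__main__":
    sample = "Hello, world - this is OCR output."
    print(count_words(sample))
```

Because OCR output often contains recognition errors, a count obtained this way is best treated as an estimate.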

PDF in text format

A PDF in text format is the most common kind of PDF. Text within the document can be selected using the text selection tool in Adobe Acrobat Reader.

Using the free Adobe Acrobat Reader, you can save the contents of a PDF document as plain text. However, none of the formatting is retained in this case. If you have an Adobe Acrobat Standard or Professional license, you can also export to other formats.

In all cases, you can manually select part or all of the text in a PDF document using the text selection tool, and then copy it using the copy command (usually Ctrl + C). However, depending on the original formatting of the PDF document, the copied text may lose some or all of its formatting.

One classic symptom of text copied from a PDF document is that each line ends with a hard line break (carriage return). Once pasted into another editing environment, the text is therefore split into smaller, illogical chunks, and you would have to delete all the unnecessary line breaks manually. However, there is also a free tool you can use for this, called AutoUnbreak by Hollmén Digital. AutoUnbreak removes unnecessary line breaks, retains most basic formatting (boldface, bullets, etc.) and produces an RTF version of the copied text, which you can then paste into Word and other word processors that support Rich Text Format.
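
For plain text, where formatting does not matter, removing the unnecessary line breaks can also be scripted. The following Python sketch only illustrates the general idea and is not how AutoUnbreak works. The function name unbreak is hypothetical, and the sketch assumes that a blank line marks a real paragraph boundary while every other line break is an artifact of copying. It does not rejoin words hyphenated at the end of a line and it does not preserve formatting.

```python
import re


def unbreak(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: paragraphs are separated by at least one blank line;
    every other line break is treated as an unwanted wrap.
    """
    # Normalize Windows (CR LF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split on blank lines, including lines that contain only whitespace.
    paragraphs = re.split(r"\n\s*\n", text.strip())

    joined = []
    for paragraph in paragraphs:
        # Trim each line and drop empty ones, then join with single spaces.
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            joined.append(" ".join(lines))

    # Separate the rebuilt paragraphs with one blank line.
    return "\n\n".join(joined)


if __name__ == "__main__":
    copied = "This is a line\nthat was split.\n\nNew paragraph\r\nhere."
    print(unbreak(copied))
```

If the PDF does not leave blank lines between paragraphs, this approach merges everything into a single paragraph, so the result should always be checked against the original.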
