
Conventionally, there are only three types of PDF files:
PDF Image Only
PDF Image+Hidden Text
PDF Normal
For ease of understanding, we have subdivided these into
the following types.
PDF Image Only
This is an exact replica of the TIFF file and is always better
than the original document, as it goes through the first seven
steps of file preparation. PDF Image Only files are used for
archiving historical documents and are not used regularly.
They are not searchable, and this is the cheapest of all
PDF types, with limited usage.
PDF Image+Hidden Text
- Level 1
In this PDF type the image of the document overlays the
OCRed text. It is in effect a better version of the PDF
Image Only type, with the addition of limited searchability.
The first six steps remain the same; the TIFF file is then
passed through an OCR engine and, using step 13, PDF files
are produced. There is little manual intervention after OCR,
so the textual accuracy is only as good as the quality of
the input document and the choice of OCR engine allow.
Step 14 adds a bonus: it makes the PDF file more compact.
This is the most popular type of PDF, as it offers the page
image together with limited searchability. Textual accuracy
is normally limited to 70-80%, depending on the quality of
the source document. This type is also cost-effective, as
there is little manual intervention.
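The accuracy figure above can be made concrete by comparing
OCR output against a proofread reference. The following is
only a minimal sketch in Python: the function name
word_accuracy and the word-level definition of accuracy are
assumptions made for illustration, not the method used in
the production steps described here.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference words that the OCR output reproduces in order.

    Hypothetical helper: "accuracy" is assumed here to mean matched
    words divided by the number of words in the proofread reference.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        # An empty reference is fully reproduced only by empty output.
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    # Sum the lengths of all in-order runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    reference = "the quick brown fox jumps"
    ocr_text = "the qu1ck brown fox jumps"  # one misread word out of five
    print(f"{word_accuracy(reference, ocr_text):.0%}")  # prints "80%"
```

At this level of accuracy, a search for a misread word (such
as "quick" in the example) would not find it, which is why
the searchability of this type is described as limited.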
- Level 2
In this PDF type the image of the document overlays the
OCRed text, and the OCRed text is near 100% accurate. This
is the best version of PDF Image+Hidden Text, as it gives
full-text searchability.
The first seven steps remain the same; spell checking, 100%
proofreading and corrections then make these files textually
perfect. This is a labor-intensive PDF type: every character
and every word has to be checked to get near 100% accurate
text. That makes it an expensive PDF type, but a very useful
one where textual accuracy matters more than cost and display
of the original document is equally important.
PDF Normal
This is the royal PDF type, with no page image and no hidden
text. Whatever is seen on the original document is reproduced
as PDF. Depending on the contents of the document, all 14
steps mentioned above may need to be executed.
If Adobe Capture is used as the tool to produce PDF Normal,
getting the images is very easy. However, some OCR engines
distort images, which then need to be re-inserted from the
TIFF image. Most OCR engines cannot recognize tables, TOCs
and indexes properly, so these invariably require rebuilding
by hand. This is a purely manual exercise and adds to the
cost of production. Some OCR engines cannot recognize small
fonts and keep the text as a bitmap image. In such cases a
lot of text needs to be added manually.
The best part of PDF Normal is its clean appearance and
its compact size. A good text-only PDF Normal can be as
small as 8 KB, which makes this file format the preferred
type for web publishing. It is said that, on average, a
PDF Normal should be around 11 KB.
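The three conventional types differ in just two properties:
whether the page carries a scanned page image, and whether
it carries text and that text is visible. The Python sketch
below shows this as a small decision rule. The PageFeatures
class, its field names and the classify_pdf_type function
are hypothetical and introduced only for illustration; a
real tool would first have to extract these flags from the
PDF itself (hidden text is typically drawn in the invisible
text rendering mode).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageFeatures:
    """Hypothetical summary of one PDF page (illustrative only)."""
    has_page_image: bool          # full-page scanned image present
    has_text: bool                # a text layer is present
    text_is_hidden: bool = False  # text layer is invisible on screen


def classify_pdf_type(page: PageFeatures) -> str:
    """Map page features to one of the three conventional PDF types."""
    if page.has_page_image and not page.has_text:
        return "PDF Image Only"
    if page.has_page_image and page.has_text and page.text_is_hidden:
        return "PDF Image+Hidden Text"
    if page.has_text and not page.has_page_image and not page.text_is_hidden:
        return "PDF Normal"
    # Mixed or unusual pages fall outside the three conventional types.
    return "Unclassified"


if __name__ == "__main__":
    scan_with_ocr = PageFeatures(has_page_image=True, has_text=True,
                                 text_is_hidden=True)
    print(classify_pdf_type(scan_with_ocr))  # prints "PDF Image+Hidden Text"
```

Level 1 and Level 2 of PDF Image+Hidden Text are not
distinguished here, since they differ only in the accuracy
of the hidden text and not in the structure of the page.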
We at PDF India have mastered the art of producing PDF Normal
from any document, in any language, of any quality. We assure
good appearance, near 100% accuracy and a compact PDF Normal.