Portable Document Files (PDF) - Why is Document Structure and Reading Order Important

Page history last edited by Tammy 1 year, 7 months ago

Better Images yield Better Optical Character Recognition (OCR)

Since OCR software identifies a character by 'optical' means, it makes sense that a clear image of a letter produces a better result. Most software packages that enable scanning include OCR software. Some software that hosts document content, such as CONTENTdm, also has OCR software. Correcting OCR content can be difficult or impossible, depending on the software that you are using.

Most of the time (not all of the time), the higher the resolution, the more accurate the OCR is. However, the higher the resolution, the larger the file. There is usually a compromise between file size and OCR accuracy.
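To get a feel for how quickly resolution drives up file size, here is a rough sketch in Python. It assumes a US Letter page (8.5 by 11 inches) stored uncompressed in 8-bit grayscale; the function name, page size, and bit depth are illustrative assumptions, and real scans are usually compressed, so actual files will be smaller.

```python
def uncompressed_scan_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=8):
    """Estimate the uncompressed size of one scanned page in bytes.

    Page size and bit depth are assumptions made for illustration only.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


for dpi in (200, 300, 600):
    size_mb = uncompressed_scan_bytes(dpi) / 1_000_000
    print(f"{dpi} dpi grayscale page: about {size_mb:.1f} MB uncompressed")
```

Because the pixel count grows with the square of the resolution, going from 200 dpi to 600 dpi makes the uncompressed page nine times larger under these assumptions.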

It is all fun and games until someone loses the letter I

It is quite common for the letter "i" to be lost if the document was of low quality before it was scanned or if the font puts characters very close together. Other characters can be affected too. For example, the first and second pages of a document were scanned at 200 dots per inch (dpi) and read using Job Access with Speech (JAWS). There were several optical character recognition (OCR) errors. The user could guess some of the correct words, but they likely would not guess when a 5 replaced a dollar sign. 538 is much different from $38.

Is it Chicken or Egg? Document Structure versus Reading Order

This reminds me of the riddle, "Which comes first, the chicken or the egg?" Reading order and document structure are closely related. When you create a tag structure in Adobe Acrobat, it will create a reading order. It might not be the one you wanted. Sometimes you can simply swap items around in the reading order panel. Other times, you need to delete entries and redo them, which usually requires swapping around the tags afterward. This software is far from perfect; however, it is the only "game in town" right now. I decided to address reading order first so that my narrative flows better with the videos.

Reading words out of order doesn't make sense

There are two problems with the 600 dpi file: screen reader users cannot scan the headings and the reading order is incorrect. Both can be fixed by adding PDF tags to the document. Our example document has a complicated structure, so Adobe Acrobat did not guess well when it added tags to the document.
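As a rough illustration of why an untagged page can be read out of order, the Python sketch below compares two ways of ordering text blocks on a two-column page. The block coordinates, the `TextBlock` type, and the column split value are all made up for the example. This is a toy model of the problem, not how Adobe Acrobat or any screen reader works internally.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    x: float  # distance from the left edge of the page (hypothetical units)
    y: float  # distance from the top edge of the page (hypothetical units)


# A made-up two-column page with two lines in each column.
blocks = [
    TextBlock("Left column, line 1", x=50, y=100),
    TextBlock("Right column, line 1", x=320, y=100),
    TextBlock("Left column, line 2", x=50, y=120),
    TextBlock("Right column, line 2", x=320, y=120),
]


def line_by_line_order(blocks):
    """Naive order: top to bottom, then left to right, ignoring columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_split=300):
    """Column-aware order: finish the left column before starting the right."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.x >= column_split, b.y))]


print(line_by_line_order(blocks))
print(column_order(blocks))
```

The line-by-line version jumps back and forth between the columns, which is the kind of result that tags and a corrected reading order are meant to prevent.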

Document Structure is Important

A structured document breaks up the document into chunks, which is recommended for some print disabilities and enables screen reader users to browse to the part or chunk of text they need. In actuality, it helps all users scan the document. A logically structured document is absolutely necessary for all users. The author of the document won't be there to provide clarification. To quote a wise man (my academic advisor), "There will not be little Tammys that will come with your thesis to tell them what you meant."

Right now, my thesis is not online. When that changes, I will update this section. To demonstrate PDF tags, I will use an excerpt from my thesis because I can, being the copyright holder (yay). There is some subjective decision making here. I chose the simpler representative structure. Just as a note, I write better literature reviews now. Please don't judge.

PDF tag <H1>

file box MATHEMATICAL BACKGROUND

PDF tag <H2>

file box 2.1 Model Order Reduction Methods

Paragraph symbol <P>

file box The model order reduction methods to be discussed in this section are BT, SPT, and SVRT. Interested readers are referred to [4, 8, 11, 14, 16, 25, 26, 33, 35, 37, 38, 41, 42] for SPT and [12, 30, 32, 33, 36] for BT. In this section, SVD will be briefly discussed because it is used by two of the methods that are reviewed here. Many references discuss SVD, but the information in 2.1.1 was primarily obtained from [18].

PDF tag <H3>

file box 2.1.1 Singular Value Decomposition

Paragraph symbol <P>

file box The SVD of a matrix $A \in \mathbb{R}^{m \times m}$ of rank can be written as ...
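The tag structure above can be thought of as an outline. The Python sketch below is only an illustration: it stores the tags from the excerpt as a simple list, prints the headings as an indented outline, and checks that heading levels never skip (for example, jumping from H1 straight to H3). The function names are hypothetical, and the code does not read or write real PDF files.

```python
# Tags and text taken from the thesis excerpt; paragraph text is shortened.
tags = [
    ("H1", "MATHEMATICAL BACKGROUND"),
    ("H2", "2.1 Model Order Reduction Methods"),
    ("P", "The model order reduction methods to be discussed ..."),
    ("H3", "2.1.1 Singular Value Decomposition"),
    ("P", "The SVD of a matrix ..."),
]


def heading_level(tag):
    """Return 1 for H1, 2 for H2, and so on; None for non-heading tags."""
    if len(tag) == 2 and tag[0] == "H" and tag[1].isdigit():
        return int(tag[1])
    return None


def outline(tags):
    """Return the headings as indented lines, one per heading."""
    return [
        "  " * (heading_level(tag) - 1) + text
        for tag, text in tags
        if heading_level(tag)
    ]


def skipped_levels(tags):
    """Return headings that jump more than one level below the previous heading."""
    problems = []
    previous = 0
    for tag, text in tags:
        level = heading_level(tag)
        if level is None:
            continue
        if level > previous + 1:
            problems.append(text)
        previous = level
    return problems


print("\n".join(outline(tags)))
print("Skipped levels:", skipped_levels(tags))
```

The outline is roughly what a screen reader user gets to browse when headings are tagged; without the heading tags, the list would be empty and the user would have to read the page from the top.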

When considering theses and dissertations, there are good points and bad points. The good points are that they might already have tags and, if they don't, they have a straightforward structure. The bad point is that theses and dissertations are long, so there would be many pages to check. The ugly point is that Adobe makes more mistakes in guessing tags in long documents.

As an example, a page of a thesis was read by NVDA with and without tags. For the version without tags, NVDA read the document from right to left and line by line, which was wrong for the formula. The video shows how a screen reader reads the original downloaded document. Go to PDF read by NVDA without tags (video). The transcript is on YouTube.

Tags were added to the document, and the same page was read by the screen reader in the video PDF OCR, with tags, using NVDA. Since it is a formula, I had to add alternative text so that the screen reader read the mathematics correctly. If it is just normal text, you will likely only have to rearrange lines of text in the reading order panel in Adobe Acrobat Professional. Go to PDF read by NVDA with tags and alternative text (video). The transcript is on YouTube.

 

*tag image provided by CFCF
*file box image adapted from an image created by Nevit.
