Extracting pages from a PDF with Acrobat JavaScript

Learn how to use Acrobat JavaScript to automate splitting apart smaller subsets of pages from large PDF-based documents.

By Thom Parker – February 12, 2009

 

Scope: Acrobat 5.0 and later
Category: Automation
Skill Level: Intermediate and Advanced
Prerequisites: Basic Acrobat JavaScript Programming

Imagine receiving a large, automatically generated report in PDF that needs to be sliced and diced so different parts can be sent to clients or other departments. Not an uncommon activity, and one that’s possible to do manually with Acrobat Professional. Now imagine having to do this every week to a document that needs to be split 100 different ways. That’s a big task, one prone to human error. Fortunately, this can be easily automated with Acrobat JavaScript.

About page extraction

Page extraction is performed with the doc.extractPages() function. This function takes three input arguments: The page numbers for the beginning and end of the extraction, and a path to a PDF file where the extracted pages are saved.

This is a simple function to use, especially since all the input arguments are optional. But it does have a couple of restrictions. First, page extraction cannot be done in the free Adobe Reader; it requires Acrobat Professional or Standard. Second, due to security restrictions in Acrobat scripting, the path input can only be used if this function is called from a privileged context. This means the path input cannot be used if this function is run from a script in a PDF file. Extracting pages is for automation, not document interactivity. Automation scripts include JavaScript code run from the JavaScript Console, a Batch Process, or a Folder Level Script.

All the examples in this article will be run from the Acrobat Console Window, which is a privileged context and also very handy for running quick cut-and-paste automation scripts. I’ve made up an example file for testing. Download this file and save it to a local folder on your system:

Example file
NelsonsInc_Employee1040s.pdf

This file was generated from the accounting mainframe at Nelson’s Buggy Whips. In 1864, Nelson’s provided all its employees with filled-out 1040s to make it easier for them to file taxes. The sample above is a single file with all employees’ 1040s included. It was generated for print, but now needs to be split and e-mailed to the individual employees.

We’ll start off with some simple examples before getting into the full automation script.

Open the example file in Acrobat Professional, then open the JavaScript Console by pressing Ctrl+J on Windows, or Command+J on Mac.

To extract a single page from the document, specify only the nStart input. Run the following code in the JavaScript Console:

this.extractPages({nStart:5});

If your screen isn’t large enough to accommodate both the Console Window and Acrobat, close the Console Window. Notice Acrobat has created a new temporary file with a single page (page six) from the original document. It’s very important to remember that page numbers in JavaScript are zero-based, i.e., page zero in JavaScript is page one in the Acrobat viewer.

Notice also that Acrobat created a temporary document to hold the extracted page. This is because the path input, cPath, was not specified. Look back in the Console Window (Figure 1). The return value from running the code is printed as the text [object Doc]. If you are using Acrobat 7 or earlier, the output will be slightly different. For Acrobat 7, the output will be [object Global].

Figure 1 – Document object returned from extractPages() function.

The extractPages() function returns a pointer to the newly created document object with the extracted pages. If this code were part of a larger script, the document pointer would be critical for actually doing something with the extracted pages. We’ll get to this in a later example.

Delete the temporary PDF. Note: be sure to do this for every example that creates a temporary PDF so you don’t get mixed up about which document you are working on.

Let’s do this again, using a simple path argument:

this.extractPages({nStart:5, cPath: "TestExtract1.pdf"});

This time, the extractPages() function returns null, and no temporary PDF is created. Look in the folder where you saved the example file. There will be a new file in that folder named “TestExtract1.pdf.” Acrobat saved the extracted page, so there was no need to return a document pointer.

Before we move to the next example, it’s worthwhile to point out the notation used to pass the arguments into the function. This “Object Style” notation is an Acrobat DOM feature, not a core JavaScript feature. It only works on functions that are part of the Acrobat JavaScript Model. It’s useful because it eliminates having to specify the other optional arguments, but it’s not necessary. The first example could have been run like this:

this.extractPages(5);

Or the second example like this:

this.extractPages(5, 5, "TestExtract1.pdf");

Which leads into the next example, using the nEnd input. Using nEnd by itself extracts all pages from the beginning of the document to the page value specified by nEnd. Run this code in the Console Window:

this.extractPages({nEnd:5});

This code extracts pages one through six. It is exactly the same as running this code:

this.extractPages(0,5);

To extract the pages from zero-based page 5 (page six in the Acrobat viewer) to the end of the document, use this code:

this.extractPages(5, this.numPages-1 );

where this.numPages is a document property that returns the number of pages in the document. So, (this.numPages-1) is the page number for the last page in the file.
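The zero-based arithmetic above is easy to get wrong by one, so it can help to wrap it in a small helper. The following is a minimal sketch and not part of the original examples: lastPagesRange is a hypothetical name, and it only computes the nStart and nEnd values, leaving the actual extractPages() call to you.

```javascript
// Hypothetical helper (not part of the Acrobat API): compute the zero-based
// nStart/nEnd values for the last `count` pages of a document.
function lastPagesRange(numPages, count) {
    // Never start before page zero, even if the document is shorter than count.
    var nStart = Math.max(0, numPages - count);
    var nEnd = numPages - 1; // zero-based index of the last page
    return { nStart: nStart, nEnd: nEnd };
}

// In the Acrobat console this could be used as:
//   var r = lastPagesRange(this.numPages, 20);
//   this.extractPages(r.nStart, r.nEnd);
```

Keeping the calculation separate from the extraction makes it possible to check the numbers in the console before any pages are actually extracted.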

Creating a cut-and-paste automation script

Now we’re ready to create the script to split all the 1040s and e-mail them to the right people. Let’s start with breaking out the individual forms for the employees.

Each 1040 form has four pages. Forms were simpler in 1864 (although the tax calculations were still incomprehensible), with no schedules or related forms, so a single loop can both extract the pages and e-mail the documents.

for(var i=0; i<this.numPages; i+=4) {
	var oNewDoc = this.extractPages({nStart: i, nEnd: i + 3});
	oNewDoc.mailDoc( … );
	oNewDoc.closeDoc(true);
}

This script walks through the document extracting four-page blocks. The extractPages() function returns a pointer to the newly created document object, which is then used to e-mail the document, and finally to close it before moving on to the next extraction. You can look up the mailDoc() and closeDoc() functions in the Acrobat JavaScript Reference.
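The loop assumes the page count is an exact multiple of four; otherwise the last pass would ask for pages past the end of the document. A minimal sketch of one way to guard against that is to compute the block ranges first and clamp the final one. The helper name blockRanges is hypothetical and not part of the Acrobat API.

```javascript
// Hypothetical helper: split a document of numPages pages into consecutive
// blocks of blockSize pages, returned as zero-based {nStart, nEnd} pairs.
function blockRanges(numPages, blockSize) {
    if (blockSize < 1) {
        throw new Error("blockSize must be at least 1");
    }
    var ranges = [];
    for (var i = 0; i < numPages; i += blockSize) {
        // Clamp the end so a short final block never runs past the last page.
        var nEnd = Math.min(i + blockSize - 1, numPages - 1);
        ranges.push({ nStart: i, nEnd: nEnd });
    }
    return ranges;
}

// In the Acrobat console this could drive the extraction loop:
//   var aRanges = blockRanges(this.numPages, 4);
//   for (var k = 0; k < aRanges.length; k++) {
//       var oNewDoc = this.extractPages(aRanges[k].nStart, aRanges[k].nEnd);
//       ...
//   }
```

Changing the second argument is all it takes to split in two-page or ten-page blocks instead.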


One thing is missing from this script: Where do the e-mail addresses come from? For simplicity, we’ll modify the code to use a list of e-mail addresses.

var aEmailList = ["[email protected]","[email protected]","[email protected]"];
for(var i=0,j=0; i<this.numPages; i+=4,j++) {
	var oNewDoc = this.extractPages({nStart: i, nEnd: i + 3});
	// Build file name and path for new file
	var cFlName = aEmailList[j].split("@").shift() + "_1040.pdf";
	var cPath = oNewDoc.path.replace(oNewDoc.documentFileName,cFlName);
	oNewDoc.saveAs(cPath);
	oNewDoc.mailDoc(false, aEmailList[j]);
	oNewDoc.closeDoc(true);
}

A second variable is added to the for statement for walking through the array of e-mails, and a saveAs command is included. Copy and paste the above code into the Console Window. Make sure to select all lines in the script before running it, so all the code is executed at the same time. Acrobat will go out to lunch for a short time. When it returns, you should have three new e-mails in your outbox, each with a PDF attachment.

Unfortunately, the name of the temporary file created by extracting the pages is a bit cryptic, and it ends with “.tmp” instead of “.pdf.” Files should have sensible names so it’s easier to tell a bit about the contents from the name. But we have a potentially bigger problem because of the “.tmp” extension. It’s possible an e-mail server will block an attachment with this extension. The code for creating a new file name and the doc.saveAs() function were added to the script to fix these issues. It saves the temporary file to a name derived from the e-mail address. For example, the first set of extracted pages will be saved to “HBabner_1040.pdf.” The file is saved to a temporary file folder, so it can be cleaned up easily later.
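The naming logic can be tried out on its own before it is used on real documents. The sketch below is only an illustration: fileNameFromEmail and replaceFileName are hypothetical helper names, and it assumes Acrobat's device-independent paths, which separate folders with forward slashes. It swaps the last path segment rather than using a string replace, so a folder that happens to contain the same text is left alone.

```javascript
// Hypothetical helper: derive an attachment name from an e-mail address by
// keeping the part before the "@" and appending a suffix.
function fileNameFromEmail(cEmail, cSuffix) {
    var cUser = cEmail.split("@").shift();
    return cUser + cSuffix;
}

// Hypothetical helper: swap the file name at the end of a forward-slash
// separated path for a new one, leaving the folder part untouched.
function replaceFileName(cPath, cNewName) {
    var aParts = cPath.split("/");
    aParts[aParts.length - 1] = cNewName;
    return aParts.join("/");
}

// In the extraction loop this could replace the two path-building lines:
//   var cFlName = fileNameFromEmail(aEmailList[j], "_1040.pdf");
//   var cPath = replaceFileName(oNewDoc.path, cFlName);
```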

This is a pretty simple script that can make our job a lot easier. But, what if the individual 1040s varied in page length, or the document was so huge it wasn’t practical to set up the e-mail addresses to match the extraction order? How do we make a more flexible automation script?

All these issues can be handled with Acrobat JavaScript. For example, we could use the doc.getPageNthWord() function to both find the page ranges and extract the employee’s name. This information could then be used to look up the e-mails on a local list, or even the company’s server. But, that is a much more complex script, so it will have to wait for another day.
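As a rough idea of what the page-range part of such a script might look like, here is a minimal sketch. It assumes every section starts on a page where a known marker word sits at a fixed word position; the function name findSectionRanges, the marker, and the word position are all assumptions about your own document, not anything defined by Acrobat. Only numPages and getPageNthWord() come from the Acrobat Doc object.

```javascript
// Sketch: find where each section starts by checking a fixed word position
// on every page. `doc` is anything exposing numPages and getPageNthWord(),
// such as the Acrobat Doc object. Pages before the first marker are ignored.
function findSectionRanges(doc, nWord, cMarker) {
    var aStarts = [];
    for (var p = 0; p < doc.numPages; p++) {
        // Third argument true strips punctuation and whitespace from the word.
        if (doc.getPageNthWord(p, nWord, true) === cMarker) {
            aStarts.push(p);
        }
    }
    var aRanges = [];
    for (var k = 0; k < aStarts.length; k++) {
        // A section ends just before the next one starts, or on the last page.
        var nEnd = (k + 1 < aStarts.length) ? aStarts[k + 1] - 1 : doc.numPages - 1;
        aRanges.push({ nStart: aStarts[k], nEnd: nEnd });
    }
    return aRanges;
}

// In the Acrobat console, each range could then be passed to extractPages():
//   var aRanges = findSectionRanges(this, 60, "Country");
```

Collecting the ranges first, and extracting afterwards, makes it easy to print and inspect them before anything is split.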

Using the example scripts

In this article, we ran the example code by copying and pasting the scripts into the JavaScript Console Window. In fact, for simple automation tasks, it’s a good idea to keep all your favorite scripts in a plain-text document from which you can copy and paste.

To extract pages from a group of files, you would use a Batch Sequence. Batch Sequences are a privileged context, so all the example code can be copied directly into a Batch Sequence.

A more interesting and useful way to run an automation script is with an Acrobat toolbar button or menu item. However, using one of these options requires that the code be enclosed in a trusted function. Code for creating toolbar buttons and trusted functions can be found in this article, Applying PDF security with Acrobat JavaScript.
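For orientation, a trusted wrapper around extractPages() might look roughly like the sketch below. It relies on the Acrobat functions app.trustedFunction(), app.beginPriv(), and app.endPriv(); the wrapper name makeTrustedExtract and its parameter list are hypothetical. The app object is passed in as an argument only so the function can be exercised outside Acrobat; in a folder-level script you would pass the real app object.

```javascript
// Sketch of code for a folder-level script. Returns a trusted function that
// extracts pages nStart..nEnd of oDoc and saves them to cPath.
function makeTrustedExtract(app) {
    return app.trustedFunction(function (oDoc, nStart, nEnd, cPath) {
        app.beginPriv(); // raise privilege so the cPath input may be used
        var oResult = oDoc.extractPages({ nStart: nStart, nEnd: nEnd, cPath: cPath });
        app.endPriv();
        return oResult;
    });
}

// In a folder-level script (assumed usage):
//   var trustedExtract = makeTrustedExtract(app);
// and later, from a toolbar button or menu item:
//   trustedExtract(this, 0, 3, "TestExtract1.pdf");
```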

For more information on functions used in this article, see the Acrobat JavaScript Reference and the Acrobat JavaScript Guide.

http://www.adobe.com/devnet/acrobat/

Click on the Documentation tab and scroll down to the JavaScript section.



19 comments

Comments for this tutorial are now closed.

Lori Kassuba

2015-03-19

Hi Linda Haworth,

Can you post your question here so some of our other experts can assist you (be sure to select the JavaScript category):
https://answers.acrobatusers.com/AskQuestion.aspx

Thanks,
Lori

Linda Haworth

2015-03-17

I have a form that will have changing amount of pages based on user input.  I want to extract and email the last 20 pages.  is there a way to extract counting backwards so one day my form may be a total of 25 the next time a total of 45 but I always extract the last 20

Thom Parker

2014-10-27

Hello Jean, These are both interesting questions, but not related to the article topic. Here is a link to an article on setting email address, subjects, and such
https://acrobatusers.com/tutorials/dynamically-setting-submit-e-mail-address

There is in fact a JavaScript command for importing data from a CSV file. You’ll find an article on the topic at this membership site.

http://www.pdfscripting.com/public/ExcelAndAcrobat.cfm

And here is an article on another technique for acquiring CSV data.

https://acrobatusers.com/tutorials/getting-external-data-into-acrobat-x-javascript

None of these are simple, and all require some knowledge of programming.

Jean

2014-10-25

Is there a sample script to read the email list aEmailList from a csv file?

Jean

2014-10-25

If I wanted to add email subject and message to the script, what do I add in oNewDoc.mailDoc(false, aEmailList[j]);?

Lori Kassuba

2014-05-06

Hi Bob Hurt,

Please see this discussion on extracting metadata:
http://answers.acrobatusers.com/Is-extract-metadata-PDF-file-write-file-association-PDF-q29727.aspx

Thanks,
Lori

Bob Hurt

2014-04-28

How do I extract the pdf document description and author?  I want to display that in a web page beside a list of PDF file names

Ed

2013-10-24

I’m trying to extract pages. When I run the following from the Adobe console the first extract works but the second is not processed. Can anyone help with this.  Thank you.

// Extract pages1
extractPages({nStart: 103, nEnd: 104, cPath: "file1.pdf"});

// Extract pages2
extractPages({nStart: 105, nEnd: 106, cPath: "file2.pdf"});

Ed

2013-10-23

I’m trying to extract separate files. I want to use very basic commands. How do I separate the arguments. When I run this script it only processes the last “extractPages” ? How do I separate these arguments so both will be processed. Thank You.

this.extractPages(29, 31, "Coyotes 10-31-13 210 B 10-12.pdf");
this.extractPages(32, 35, "Coyotes 10-31-13 211 B 5-8.pdf");

Thom Parker

2013-10-17

Milton, Read these two articles to learn about manipulating file paths in Acrobat JavaScript
http://acrobatusers.com/tutorials/file-paths-acrobat-javascript
http://acrobatusers.com/tutorials/splitting-and-rebuilding-strings

A script in Acrobat cannot create new folders, for security reasons. So your target folder must already exist

Cori

2013-10-14

I admit, I have not been on acrobatusers.com in a long time however it was another joy to see It is such an important topic and ignored by so many, even professionals. I thank you to help making people more aware of possible issues.

Milton Fosneca

2013-09-29

I’m looking to extract the “selected” pages and then do a “save as” to a specific folder. The name of the file has to be “‘date’.pdf” on a specific path. Can someone help me with this?

Thom Parker

2012-10-23

Dennis,
  Saving the extracted page in Acrobat X requires privilege. It sounds like you need to place your script into a trusted function.

Dennis

2012-10-19

Has anyone had issues moving to Adobe Pro X? I had a script which extracted a cover page and saved the file with the same filename in a different folder. Now the script creates a temp file instead.

Thom Parker

2012-10-15

Patrick, the extraction loops in the article extract in 4 page blocks. If you want 2 page blocks, then all you have to do is to change the page increment to “2” instead of “4”, (“i” is used as the page increment in all the loops).

And since you are only saving the file, you can include the cPath parameter in the “extractPages” function.

patrick ball

2012-10-11

I’m trying to do this exact extraction except two pages at a time. basically going every 2 pages to in a doc, and splitting a large PDF so that every 2 pages become a new file. what small addition in the script changes this?

Thom Parker

2012-09-13

Nathan,
  Why yes, this is possible. The current page is in the “this.pageNum” document property, so you would set the nStart and nEnd variables like this:


nStart = this.pageNum-2;
nEnd = this.pageNum+2;

Although you also need to add code to check for and correct values that overrun the first and last pages.

Artem Burmakin

2012-09-13

@Nathan Gardner
This should not be too difficult to do, at least if I understood your request correctly.

All you need to know is the current page number.
this.pageNum - does this.
Then simply use it to extract pages:
this.extractPages(this.pageNum,this.pageNum+1);

the above will extract the current page and the next one.

Is it answering your question?

Nathan Gardner

2012-09-12

I am looking to use this script to set up a function to extract 2 pages before and after the displayed page of a large document.  We want to do this to provide context to a particular search result for document review.

Is this possible? 

Any help would be much appreciated.

Artem Burmakin

2012-08-16

Thank you Thom. You are right, it is not possible to find the problem remotely. In any case, thank you for the help; the article above was really useful, and so was your comment.

After struggling a bit with the code I finally created something that works and does what I need to do.
If you do not mind I would like to share it here; maybe someone will find it useful.

So the task was: I have a big report in PDF with employee salaries and other payments. Employees are grouped by country, and the country name, in a format like this - Country: Austria - is stated in the same place, but not on every page. What I need to do is to split this big file into smaller reports by country (all in all there are about 60 countries in the report of 650 pages).
Here is the code that makes this for me (I start this code from Console):

for (var p = this.numPages - 1; p >= 0; p--)
{
    // Word 60 on the first page of each country section is "Country"
    var ckWord = this.getPageNthWord(p, 60, true);
    if (ckWord == "Country")
    {
        console.println(p);
        // CHANGE THE file name IN THE NEXT LINE
        this.extractPages(p, this.numPages - 1, "07-Mnthly_Comp_perAsgne_by_PLS - " + this.getPageNthWord(p, 61, true) + ".pdf");
        // Remove the extracted pages so the next match ends where this one began
        this.deletePages(p, this.numPages - 1);
    }
}

Hope this will help someone.
Thanks again Thom, I could not do this without your help.

Thom Parker

2012-08-08

Artem, The likely problem is that the words are not being found. However, just as a general rule, scripts of any complexity are going to have issues that require debugging and extra code for testing your values. In this case extra code needs to be added to ensure extraction only takes place when the words are found. There may also be other issues. I don’t know, I don’t have your test documents and I have not analyzed or debugged this code. I just wrote it off the top of my head. If you are having issues then you need to learn about debugging or hire a programmer. I would suggest reading this article and then asking questions on the regular forum.

Why Doesn’t my Script Work: http://acrobatusers.com/tutorials/why-doesnt-my-script-work

Artem Burmakin

2012-08-08

I really appreciate your support, but still can’t make it work.
I am a dummy in coding, so if you could help a little bit more it would be really great.

So, I take the code you gave and I add “extract” function, but it says:
TypeError: Invalid argument type.
Doc.extractPages:26:Batch undefined:Exec
===> Parameter nStart.

What am I doing wrong?
Here is the code:

/* Test111 */
var cKeyWord1 = "Austria";
var cKeyWord2 = "Canada";
var bFound1 = false;
var nPage1 = -1;
var nPage2 = -1;
for(var nPg=0;nPg<this.numPages;nPg++)
{
if(!bFound1)
{
if(this.getPageNthWord(61) == cKeyWord1)
{
nPage1 = nPg;
bFound1 = true;
}
  }
  else
  {
if(this.getPageNthWord(61) == cKeyWord2)
{
nPage2 = nPg;
break;
}
  }
this.extractPages({nStart: nPage1,nEnd: nPage2, cPath: "TestExtract1.pdf"});
}


thanks a million in advance

Thom Parker

2012-08-06

The best way to find your words is to use a loop to search for the words on all pages. Use a state variable to control which word is being looked for.

var cKeyWord1 = "Key1";
var cKeyWord2 = "Key2";
var bFound1 = false;
var nPage1 = -1;
var nPage2 = -1;
for(var nPg=0;nPg<this.numPages;nPg++)
{
  if(!bFound1)
  {
    // Still looking for the first keyword (word 65 on page nPg)
    if(this.getPageNthWord(nPg, 65) == cKeyWord1)
    {
      nPage1 = nPg;
      bFound1 = true;
    }
  }
  else
  {
    // First keyword found; now look for the second
    if(this.getPageNthWord(nPg, 65) == cKeyWord2)
    {
      nPage2 = nPg;
      break;
    }
  }
}

And that is how you find the page range.

Artem Burmakin

2012-08-03

Could you please advise how to split a PDF based on content with doc.getPageNthWord()? I have broken my head trying to find the solution.
I have a PDF which I want to split by keywords that are always in the same place (say word 65), but not present on every page. So I need to define the range between two keywords and extract the pages in between.

Thank you in advance,
