diff --git a/.xqdoca b/.xqdoca
index 841b9d6..fd65659 100644
--- a/.xqdoca
+++ b/.xqdoca
@@ -1,5 +1,4 @@
 src/
 docs/xqdoc/
- true
\ No newline at end of file
diff --git a/changelog.md b/changelog.md
index 9ed79c0..217bb34 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,5 +1,3 @@
-# 0.5.0 2025-06-08
-* remove `hasChildren` from outline map
 # 0.4.0 2025-06-04
 * ADD Label access
 * various renames
diff --git a/docs/guide.md b/doc.md
similarity index 84%
rename from docs/guide.md
rename to doc.md
index c47ac4d..1029ff9 100644
--- a/docs/guide.md
+++ b/doc.md
@@ -2,11 +2,7 @@
 
 ## Introduction
 
-The `Pdfbox3.xqm` XQuery library module enables features from **Apache PDFBox 3.0** to be called from `BaseX`.
-
->The [Apache PDFBox®](https://pdfbox.apache.org/) library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.
-
-The library is distributed as a **XAR (XQuery Archive) file**, which includes the necessary PDFBox JAR files, making it easy to install and use in BaseX 10.7+.
+The `Pdfbox3.xqm` library is an XQuery module designed to interface with **Apache PDFBox 3.0**, a powerful Java library for working with PDF documents. This module allows you to perform various operations on PDF files, such as extracting text, rendering pages as images, managing outlines, and more. The library is distributed as a **XAR (XQuery Archive) file**, which includes the necessary PDFBox JAR files, making it easy to install and use in BaseX 10.7+.
 
 ---
 
@@ -16,13 +12,18 @@ The library is distributed as a **XAR (XQuery Archive) file**, which includes th
 The library is distributed as a XAR file that includes the required PDFBox JAR files. You can obtain the XAR file from the distribution source (e.g., a repository or a shared location).
 
 ### 2. Install the XAR File in BaseX
-The latest version is avaiable at https://github.com/expkg-zone58/pdfbox/releases.
-The `XAR` can be installed into the repository. For example:
+To install the XAR file in BaseX, follow these steps:
 
-```
-REPO INSTALL https://github.com/expkg-zone58/pdfbox/releases/download/v0.4.0/pdfbox-0.4.0.xar
-```
-### 3. Verify the installation by listing the installed packages:
+1. Open the BaseX GUI or command-line interface.
+2. Use the `REPO INSTALL` command to install the XAR file:
+
+   ```xquery
+   REPO INSTALL path/to/pdfbox3.xar
+   ```
+
+   Replace `path/to/pdfbox3.xar` with the actual path to the XAR file.
+
+3. Verify the installation by listing the installed packages:
 
 ```xquery
 REPO LIST
diff --git a/docs/xqdoc/annotations.html b/docs/xqdoc/annotations.html
index 8235a0e..f1c7141 100644
--- a/docs/xqdoc/annotations.html
+++ b/docs/xqdoc/annotations.html
@@ -6,6 +6,6 @@ / Annotations imports imports-diag imports-diag.mmd report restxq xqdoc-validate xqdoca.xml

Contents -

  1. Summary
  2. Annotations
    1. 2.1 http://www.w3.org/2012/xquery

Summary

This project uses 1 annotation namespaces.

Related documents
View | Description | Format
report | Index of sources | xhtml
restxq | Summary of REST interface | xhtml
imports | Summary of import usage | xhtml
imports-diag | Project wide module imports as html mermaid class diagram | html5
imports-diag.mmd | Project wide module imports as a mermaid class diagram | text
xqdoca.xml | xqDocA run configuration report (XML) | xml
xqdoc-validate | validate generated xqdoc files | xml

Annotations

2.1 http://www.w3.org/2012/xquery

private
\ No newline at end of file
+   on Wednesday, 4th June 2025

\ No newline at end of file
diff --git a/docs/xqdoc/imports.html b/docs/xqdoc/imports.html
index 38768fa..06a69c6 100644
--- a/docs/xqdoc/imports.html
+++ b/docs/xqdoc/imports.html
@@ -6,4 +6,4 @@ Contents
  1. Summary
  2. Imports

    Summary

    Lists all modules imported.

    Related documents
    View | Description | Format
    report | Index of sources | xhtml
    restxq | Summary of REST interface | xhtml
    imports-diag | Project wide module imports as html mermaid class diagram | html5
    imports-diag.mmd | Project wide module imports as a mermaid class diagram | text
    annotations | Summary of XQuery annotation use | xhtml
    xqdoca.xml | xqDocA run configuration report (XML) | xml
    xqdoc-validate | validate generated xqdoc files | xml

    Imports (0)

\ No newline at end of file
+   on Wednesday, 4th June 2025

\ No newline at end of file
diff --git a/docs/xqdoc/index.html b/docs/xqdoc/index.html
index e1a1ab7..83d6779 100644
--- a/docs/xqdoc/index.html
+++ b/docs/xqdoc/index.html
@@ -6,9 +6,9 @@ 1 XQuery source files, and uses 1 annotation namespaces.

    This document was built from source folder C:/Users/mrwhe/git/expkg-zone58/pdfbox/src/ on - Monday, 9th June 2025.

    Related documents
    View | Description | Format
    report | Index of sources | xhtml
    restxq | Summary of REST interface | xhtml
    imports | Summary of import usage | xhtml
    imports-diag | Project wide module imports as html mermaid class diagram | html5
    imports-diag.mmd | Project wide module imports as a mermaid class diagram | text
    annotations | Summary of XQuery annotation use | xhtml
    xqdoca.xml | xqDocA run configuration report (XML) | xml
    xqdoc-validate | validate generated xqdoc files | xml

    XQuery Main (0)

    None

    XQuery Library (1)

    UriPrefixDescriptionUseAMetrics
    org.expkg_zone58.Pdfbox3pdfbox + Wednesday, 4th June 2025.

    Related documents
    View | Description | Format
    report | Index of sources | xhtml
    restxq | Summary of REST interface | xhtml
    imports | Summary of import usage | xhtml
    imports-diag | Project wide module imports as html mermaid class diagram | html5
    imports-diag.mmd | Project wide module imports as a mermaid class diagram | text
    annotations | Summary of XQuery annotation use | xhtml
    xqdoca.xml | xqDocA run configuration report (XML) | xml
    xqdoc-validate | validate generated xqdoc files | xml

    XQuery Main (0)

    None

    XQuery Library (1)

    UriPrefixDescriptionUseAMetrics
    org.expkg_zone58.Pdfbox3pdfbox -A BaseX 10.7+ interface to pdfbox3 https://...
    0
    Library
    ↖0
    P
    V#1
    F#36

    File view (1)

    Annotation namespaces (1)

    A total of 7 annotations are defined. -

    http://www.w3.org/2012/xquery

    0
    Library
    ↖0
    P
    V#1
    F#37

    File view (1)

    Annotation namespaces (1)

    A total of 8 annotations are defined. +

    http://www.w3.org/2012/xquery

    private8
\ No newline at end of file
+   on Wednesday, 4th June 2025

\ No newline at end of file
diff --git a/docs/xqdoc/modules/F000001/index.html b/docs/xqdoc/modules/F000001/index.html
index f60d577..d8bbd3c 100644
--- a/docs/xqdoc/modules/F000001/index.html
+++ b/docs/xqdoc/modules/F000001/index.html
@@ -1,7 +1,12 @@ src - xqDocA - xqDocA

    org.expkg_zone58.Pdfbox3  library module
    P

    Summary

    See also
    Authors
    • Andy Bunce 2025
    Custom

    Functions

    4.1 pdfbox:binary

    Arities: #1

    Summary
    -Create binary representation (xs:base64Binary) of $pdf object
    Signatures
    pdfbox:binary +}

    Functions

    4.1 pdfbox:binary

    Arities: #1

    Summary
    +Create binary representation of $pdf object as xs:base64Binary
    Signatures
    pdfbox:binary ( $pdf as item() ) as xs:base64Binary
    Parameters
    • pdf as item()
    Return
    • xs:base64Binary
    Referenced by 1 functions from 1 modules
    References 3 functions from 2 modules
    • {java:java.io.ByteArrayOutputStream}new#0
    • {java:java.io.ByteArrayOutputStream}toByteArray#1
    • {java:org.apache.pdfbox.pdmodel.PDDocument}save#2
    Source ( 7 lines)
    function pdfbox:binary($pdf as item())
     as xs:base64Binary{
    @@ -56,30 +61,43 @@ as xs:base64Binary{
        let $_:=PDDocument:save($pdf, $bytes)
        return  Q{java:java.io.ByteArrayOutputStream}toByteArray($bytes)
              =>convert:integers-to-base64()
    -}

    4.2 pdfbox:bookmark-xml

    Arities: #1P

    Summary
    +}

    4.2 pdfbox:bookmark

    Arities: #2P

    Summary
    +Return bookmark info for $bookmark +
    Signatures
    pdfbox:bookmark + ( + $bookmark as item(), $pdf as item() ) as map(*)
    Parameters
    • bookmark as item()
    • pdf as item()
    Return
    • map(*) map{index:..,title:..,hasChildren:..}
    Referenced by 1 functions from 1 modules
    References 3 functions from 1 modules
    • {java:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem}findDestinationPage#2
    • {java:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem}getTitle#1
    • {java:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem}hasChildren#1
    Annotations (1)
    %private()
    Source ( 10 lines)
    function pdfbox:bookmark($bookmark as item(),$pdf as item())
    +as map(*)
    +{
    + map{ 
    +  "index":  PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf),
    +  "title":  (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)}
    +  (:=>translate("�",""), :),
    +  "hasChildren": PDOutlineItem:hasChildren($bookmark)
    +  }
    +}

    4.3 pdfbox:bookmark-xml

    Arities: #1P

    Summary
    Convert outline map to XML
    Signatures
    pdfbox:bookmark-xml ( - $outline as map(*)* ) as element(bookmark)*
    Parameters
    • outline as map(*)*
    Return
    • element(bookmark)*
    Referenced by 2 functions from 1 modules
    References 1 functions from 1 modules
    Annotations (1)
    %private()
    Source ( 8 lines)
    function pdfbox:bookmark-xml($outline as map(*)*)
    +			$outline as map(*)* ) as element(bookmark)*
    Parameters
    • outline as map(*)*
    Return
    • element(bookmark) *
    Referenced by 2 functions from 1 modules
    References 1 functions from 1 modules
    Annotations (1)
    %private()
    Source ( 8 lines)
    function pdfbox:bookmark-xml($outline as map(*)*)
     as element(bookmark)*
     {
       $outline!
       <bookmark title="{?title}" index="{?index}">
         {?children!pdfbox:bookmark-xml(.)}
       </bookmark>
    -}

    4.3 pdfbox:close

    Arities: #1

    Summary
    +}

    4.4 pdfbox:close

    Arities: #1

    Summary
    Release any resources related to $pdf
    Signatures
    pdfbox:close ( - $pdf as item() ) as empty-sequence()
    Parameters
    Return
    Referenced by 3 functions from 1 modules
    References 1 functions from 1 modules
    Source ( 6 lines)
    function pdfbox:close($pdf as item())
    +			$pdf as item() ) as empty-sequence
    Parameters
    Return
    Referenced by 3 functions from 1 modules
    References 1 functions from 1 modules
    Source ( 6 lines)
    function pdfbox:close($pdf as item())
     as empty-sequence(){
       (# db:wrapjava void #) {
          PDDocument:close($pdf)
       }
    -}

    4.4 pdfbox:do-until

    Arities: #3P

    Summary
    +}

    4.5 pdfbox:do-until

    Arities: #3P

    Summary
    fn:do-until shim for BaseX 9+10 if fn:do-until not found use hof:until, note: $pos always zero
    Signatures
    pdfbox:do-until ( - $input as item()*, $action as function(item()*, xs:integer) as item()*, $predicate as function(item()*, xs:integer) as xs:boolean? ) as item()*
    Parameters
    Return
    Referenced by 2 functions from 1 modules
    References 5 functions from 2 modules
    Annotations (1)
    %private()
    Source ( 15 lines)
    function pdfbox:do-until(
    +			$input as item()*, $action as function(item()*, xs:integer) as item()*, $predicate as function(item()*, xs:integer) as xs:boolean? ) as item()*
    Parameters
    Return
    Referenced by 2 functions from 1 modules
    References 5 functions from 2 modules
    Annotations (1)
    %private()
    Source ( 15 lines)
    function pdfbox:do-until(
      $input 	as item()*, 	
      $action 	as function(item()*, xs:integer) as item()*, 	
      $predicate 	as function(item()*, xs:integer) as xs:boolean? 	
    @@ -93,7 +111,7 @@ if  fn:do-until not found use hof:until, note: $pos always zero
                           then $hof($predicate(?,0),$action(?,0),$input)
                           else error(xs:QName('pdfbox:do-until'),"No implementation do-until found")
     
    -}

    4.5 pdfbox:extract-range

    Arities: #3

    Summary
    +}

    4.6 pdfbox:extract-range

    Arities: #3

    Summary
    Return new PDF doc with pages from $start to $end as xs:base64Binary, (1 based)
    Signatures
    pdfbox:extract-range (
    @@ -103,10 +121,10 @@ as xs:base64Binary {
       let $a:=PageExtractor:new($pdf, $start, $end)
               =>PageExtractor:extract()
       return (pdfbox:binary($a),pdfbox:close($a))
    -}

    4.6 pdfbox:find-page

    Arities: #2

    Summary
    +}

    4.7 pdfbox:find-page

    Arities: #2

    Summary
    pageIndex of $page in $pdf
    Signatures
    pdfbox:find-page ( - $page as item()?, $pdf as item() ) as item()?
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Source ( 10 lines)
    function pdfbox:find-page(
    +			$page as item()?, $pdf as item() ) as item()?
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Source ( 10 lines)
    function pdfbox:find-page(
        $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :),
        $pdf as item())
     as item()?
    @@ -115,15 +133,15 @@ as item()?
       then PDDocument:getDocumentCatalog($pdf)
           =>PDDocumentCatalog:getPages()
           =>PDPageTree:indexOf($page)
    -}

    4.7 pdfbox:gregToISO

    Arities: #1P

    Summary
    +}

    4.8 pdfbox:gregToISO

    Arities: #1P

    Summary
    Convert date
    Signatures
    pdfbox:gregToISO ( - $item as item()? ) as xs:string?
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Annotations (1)
    %private()
    Source ( 6 lines)
    function pdfbox:gregToISO($item as item()?)
    +			$item as item()? ) as xs:string?
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Annotations (1)
    %private()
    Source ( 6 lines)
    function pdfbox:gregToISO($item as item()?)
     as xs:string?{
      if(exists($item))
      then Q{java:java.util.GregorianCalendar}toZonedDateTime($item)=>string()
      else ()
    -}

    4.8 pdfbox:label-as-map

    Arities: #2

    Summary
    +}

    4.9 pdfbox:label-as-map

    Arities: #2

    Summary
    label/page-range for $page as map
    Signatures
    pdfbox:label-as-map ( $pagelabels, $page as xs:integer ) as map(*)
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 5 functions from 3 modules
    Source ( 13 lines)
    function pdfbox:label-as-map($pagelabels,$page as  xs:integer)
    @@ -138,10 +156,10 @@ as map(*)
           "start":  PDPageLabelRange:getStart($label),
           "style":  PDPageLabelRange:getStyle($label)
           }
    -}

    4.9 pdfbox:label-as-string

    Arities: #2

    Summary
    +}

    4.10 pdfbox:label-as-string

    Arities: #2

    Summary
    label for $page formated as string, empty if none
    Signatures
    pdfbox:label-as-string ( - $pagelabels, $page as xs:integer ) as xs:string?
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 7 functions from 3 modules
    Source ( 15 lines)
    function pdfbox:label-as-string($pagelabels,$page as  xs:integer)
    +			$pagelabels, $page as xs:integer ) as xs:string?
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 7 functions from 3 modules
    Source ( 15 lines)
    function pdfbox:label-as-string($pagelabels,$page as  xs:integer)
     as xs:string?{
       let $label:=PDPageLabels:getPageLabelRange($pagelabels,$page)
       return  if(empty($label))
    @@ -155,17 +173,17 @@ as xs:string?{
                                     if(($start eq 1)) then "" else $start,
                                     if(exists($prefix)) then '*' || $prefix  (:TODO double " :)
                         ))
    -}

    4.10 pdfbox:labels-as-map

    Arities: #1

    Summary
    +}

    4.11 pdfbox:labels-as-map

    Arities: #1

    Summary
    sequence of maps for each label/page range defined in $pdf
    Signatures
    pdfbox:labels-as-map ( - $pdf as item() ) as map(*)*
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 3 functions from 2 modules
    Source ( 8 lines)
    function pdfbox:labels-as-map($pdf as item())
    +			$pdf as item() ) as map(*)*
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 3 functions from 2 modules
    Source ( 8 lines)
    function pdfbox:labels-as-map($pdf as item())
     as map(*)*{
       let $pagelabels:=PDDocument:getDocumentCatalog($pdf)
                        =>PDDocumentCatalog:getPageLabels()
       return  $pagelabels
               !(0 to pdfbox:number-of-pages($pdf)-1)
               !pdfbox:label-as-map($pagelabels,.)
    -}

    4.11 pdfbox:labels-as-string

    Arities: #1

    Summary
    +}

    4.12 pdfbox:labels-as-string

    Arities: #1

    Summary
    sequence of label ranges defined in PDF as formatted strings
    Signatures
    pdfbox:labels-as-string (
    @@ -177,22 +195,22 @@ as xs:string{
               !(0 to pdfbox:number-of-pages($pdf)-1)
               !pdfbox:label-as-string($pagelabels,.)=>string-join("&#10;")
    -}

    4.12 pdfbox:labels-by-page

    Arities: #1

    Summary
    +}

    4.13 pdfbox:labels-by-page

    Arities: #1

    Summary
    pageLabel for every page from derived from page-ranges The returned sequence will contain at MOST as much entries as the document has pages.
    Signatures
    pdfbox:labels-by-page ( - $pdf as item() ) as xs:string*
    Parameters
    Return
    See also
    Referenced by 0 functions from 0 modules
    References 1 functions from 1 modules
    Source ( 7 lines)
    function pdfbox:labels-by-page($pdf as item())
    +			$pdf as item() ) as xs:string*
    Parameters
    Return
    Tags
    Referenced by 0 functions from 0 modules
    References 1 functions from 1 modules
    Source ( 7 lines)
    function pdfbox:labels-by-page($pdf as item())
     as xs:string*
     {
       PDDocument:getDocumentCatalog($pdf)
       =>PDDocumentCatalog:getPageLabels()
       =>PDPageLabels:getLabelsByPageIndices()
    -}

    4.13 pdfbox:metadata

    Arities: #1

    Summary
    +}

    4.14 pdfbox:metadata

    Arities: #1

    Summary
    XMP metadata as "RDF" document
    Signatures
    pdfbox:metadata ( - $pdf as item() ) as document-node(element(*))?
    Parameters
    Return
    Tags
    • @note: + $pdf as item() ) as document-node(element(*))?
    Parameters
    Return
    Tags
    Referenced by 0 functions from 0 modules
    References 5 functions from 4 modules
    Source ( 17 lines)
    function pdfbox:metadata($pdf as item())
     as document-node(element(*))?
     {
    @@ -209,14 +227,14 @@ as document-node(element(*))?
                             function($output,$pos) { $output?n eq -1 }     
                          )?data=>parse-xml()
               else ()
    -}

    4.14 pdfbox:number-of-bookmarks

    Arities: #1

    Summary
    +}

    4.15 pdfbox:number-of-bookmarks

    Arities: #1

    Summary
    The number of outline items defined in $pdf
    Signatures
    pdfbox:number-of-bookmarks ( $pdf as item() ) as xs:integer
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Source ( 5 lines)
    function pdfbox:number-of-bookmarks($pdf as item())
     as xs:integer{
       let $xml:=pdfbox:outline-xml($pdf)
       return count($xml//bookmark)
    -}

    4.15 pdfbox:number-of-labels

    Arities: #1

    Summary
    +}

    4.16 pdfbox:number-of-labels

    Arities: #1

    Summary
    The number of labels defined in PDF
    Signatures
    pdfbox:number-of-labels ( $pdf as item() ) as xs:integer
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 3 functions from 3 modules
    Source ( 9 lines)
    function pdfbox:number-of-labels($pdf as item())
    @@ -227,15 +245,14 @@ as xs:integer
       return if(exists($labels)) 
              then PDPageLabels:getPageRangeCount($labels)
              else 0
    -}

    4.16 pdfbox:number-of-pages

    Arities: #1

    Summary
    +}

    4.17 pdfbox:number-of-pages

    Arities: #1

    Summary
    Number of pages in PDF
    Signatures
    pdfbox:number-of-pages ( $pdf as item() ) as xs:integer
    Parameters
    Return
    Referenced by 2 functions from 1 modules
    References 1 functions from 1 modules
    Source ( 4 lines)
    function pdfbox:number-of-pages($pdf as item())
     as xs:integer{
       PDDocument:getNumberOfPages($pdf)
    -}

    4.17 pdfbox:open

    Arities: #1#2

    Summary
    -open pdf from file/url/binary, opts may have password , returns pdf object -
    Signatures
    pdfbox:open +}

    4.18 pdfbox:open

    Arities: #1#2

    Summary
    +open pdf using fetch:binary, returns pdf object
    Signatures
    pdfbox:open ( $pdfsrc as item() ) as item()
    pdfbox:open (
    @@ -259,12 +276,12 @@ as item(){
                    else $pdfsrc
          return error(xs:QName("pdfbox:open"),"Failed PDF load " || $loc || " " || $err:description)
       }
    -}

    4.18 pdfbox:outline

    Arities: #1#2P

    Summary
    +}

    4.19 pdfbox:outline

    Arities: #1#2P

    Summary
    Return outline for $pdf as map()*
    Signatures
    pdfbox:outline ( $pdf as item() ) as map(*)*
    pdfbox:outline ( - $pdf as item(), $outlineItem as item()? ) as map(*)*
    Parameters
    Return
    Referenced by 3 functions from 1 modules
    References 6 functions from 5 modules
    Annotations (1)
    %private()
    Source ( 16 lines)
    function pdfbox:outline($pdf as item())
    +			$pdf as item(), $outlineItem as item()? ) as map(*)*
    Parameters
    Return
    Referenced by 3 functions from 1 modules
    References 6 functions from 5 modules
    Annotations (1)
    %private()
    Source ( 16 lines)
    function pdfbox:outline($pdf as item())
     as map(*)*{
       (# db:wrapjava some #) {
       let $outline:=
    @@ -278,33 +295,28 @@ as map(*)*{
     as map(*)*{
       let $find as map(*):=pdfbox:outline_($pdf ,$outlineItem)
       return map:get($find,"list")
    -}

    4.19 pdfbox:outline-xml

    Arities: #1

    Summary
    +}

    4.20 pdfbox:outline-xml

    Arities: #1

    Summary
    PDF outline in xml format
    Signatures
    pdfbox:outline-xml ( - $pdf as item() ) as element(outline)?
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 3 functions from 2 modules
    Source ( 7 lines)
    function pdfbox:outline-xml($pdf as item())
    +			$pdf as item() ) as element(outline)?
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 3 functions from 2 modules
    Source ( 7 lines)
    function pdfbox:outline-xml($pdf as item())
     as element(outline)?{
      let $outline:=pdfbox:outline($pdf)
       return if(exists($outline))
              then <outline>{$outline!pdfbox:bookmark-xml(.)}</outline>
              else ()
    -}

    4.20 pdfbox:outline_

    Arities: #2P

    Summary
    +}

    4.21 pdfbox:outline_

    Arities: #2P

    Summary
    outline helper. BaseX bug 10.7? error if inlined in outline
    Signatures
    pdfbox:outline_ ( - $pdf as item(), $outlineItem as item()? ) as map(*)
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 10 functions from 4 modules
    Annotations (1)
    %private()
    Source ( 25 lines)
    function pdfbox:outline_($pdf as item(),$outlineItem as item()?)
    +			$pdf as item(), $outlineItem as item()? ) as map(*)
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 8 functions from 4 modules
    Annotations (1)
    %private()
    Source ( 20 lines)
    function pdfbox:outline_($pdf as item(),$outlineItem as item()?)
     as map(*){
       pdfbox:do-until(
         
          map{"list":(),"this":$outlineItem},
     
          function($input,$pos ) { 
    -        let $bookmark:=$input?this
    -        let $bk:=map{ 
    -              "index":  PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf),
    -              "title":  (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)}
    -              }
    -
    -        let $bk:= if(PDOutlineItem:hasChildren($bookmark))
    -                  then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($bookmark))
    +        let $bk:= pdfbox:bookmark($input?this,$pdf)
    +        let $bk:= if($bk?hasChildren)
    +                  then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this))
                             return map:merge(($bk,map:entry("children",$kids)))
                       else $bk 
             return map{
    @@ -314,14 +326,14 @@ as map(*){
     
          function($output,$pos) { empty($output?this) }                      
       )
    -}

    4.21 pdfbox:page-labels

    Arities: #1

    Summary
    +}

    4.22 pdfbox:page-labels

    Arities: #1

    Summary
    get pagelabels exist
    Signatures
    pdfbox:page-labels ( $pdf )
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 1 functions from 1 modules
    Source ( 5 lines)
    function pdfbox:page-labels($pdf)
     {
       PDDocument:getDocumentCatalog($pdf)
       =>PDDocumentCatalog:getPageLabels()
    -}

    4.22 pdfbox:page-media-box

    Arities: #2

    Summary
    +}

    4.23 pdfbox:page-media-box

    Arities: #2

    Summary
    Return size of $pageNo (zero based)
    Signatures
    pdfbox:page-media-box (
    @@ -330,7 +342,7 @@ as xs:string{
       PDDocument:getPage($pdf, $pageNo)
       =>PDPage:getMediaBox()
       =>PDRectangle:toString()
    -}

    4.23 pdfbox:page-render

    Arities: #3

    Summary
    +}

    4.24 pdfbox:page-render

    Arities: #3

    Summary
    Pdf page as image (zero is cover) options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi??
    Signatures
    pdfbox:page-render (
    @@ -344,7 +356,7 @@ as xs:base64Binary{
       return Q{java:java.io.ByteArrayOutputStream}toByteArray($bytes)
              =>convert:integers-to-base64()
    -}

    4.24 pdfbox:page-text

    Arities: #2

    Summary
    +}

    4.25 pdfbox:page-text

    Arities: #2

    Summary
    return text on $pageNo
    Signatures
    pdfbox:page-text ( $pdf as item(), $pageNo as xs:integer ) as xs:string
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 1 modules
    Source ( 9 lines)
    function pdfbox:page-text($pdf as item(), $pageNo as xs:integer)
    @@ -355,16 +367,16 @@ as xs:string{
              => PDFTextStripper:setEndPage($pageNo)
            }
       return (# db:checkstrings #) {PDFTextStripper:getText($tStripper,$pdf)}
    -}

    4.25 pdfbox:pdf-save

    Arities: #2

    Summary
    -Save pdf $pdf to filesystem at $savepath , returns $savepath
    Signatures
    pdfbox:pdf-save +}

    4.26 pdfbox:pdf-save

    Arities: #2

    Summary
    +Save pdf $pdf to filesystem at $savepath , returns $savepath
    Signatures
    pdfbox:pdf-save ( $pdf as item(), $savepath as xs:string ) as xs:string
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Source ( 4 lines)
    function pdfbox:pdf-save($pdf as item(),$savepath as xs:string)
     as xs:string{
        PDDocument:save($pdf, File:new($savepath)),$savepath
    -}

    4.26 pdfbox:property

    Arities: #2

    Summary
    +}

    4.27 pdfbox:property

    Arities: #2

    Summary
    Return the value of $property for $pdf
    Signatures
    pdfbox:property ( - $pdf as item(), $property as xs:string ) as item()*
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 5 functions from 2 modules
    Source ( 9 lines)
    function pdfbox:property($pdf as item(),$property as xs:string)
    +			$pdf as item(), $property as xs:string ) as item()*
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 5 functions from 2 modules
    Source ( 9 lines)
    function pdfbox:property($pdf as item(),$property as xs:string)
     as item()*{
       let $fns:= $pdfbox:property-map($property)
       return if(exists($fns))
    @@ -372,13 +384,13 @@ as item()*{
                             $pdf, 
                             function($result,$this as function(*)){$result!$this(.)})
              else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined."))
    -}

    4.27 pdfbox:property-names

    Arities: #0

    Summary
    +}

    4.28 pdfbox:property-names

    Arities: #0

    Summary
    Defined property names, sorted
    Signatures
    pdfbox:property-names ( - ) as xs:string*
    Return
    Referenced by 1 functions from 1 modules
    Source ( 4 lines)
    function pdfbox:property-names() 
    +			) as xs:string*
    Return
    Referenced by 1 functions from 1 modules
    Source ( 4 lines)
    function pdfbox:property-names() 
     as xs:string*{
       $pdfbox:property-map=>map:keys()=>sort()
    -}

    4.28 pdfbox:read-stream

    Arities: #2P

    Summary
    +}

    4.29 pdfbox:read-stream

    Arities: #2P

    Summary
    read next block from XMP stream
    Signatures
    pdfbox:read-stream ( $is, $read as xs:string ) as map(*)
    Parameters
    Return
    Referenced by 1 functions from 1 modules
    References 6 functions from 5 modules
    Annotations (1)
    %private()
    Source ( 8 lines)
    function pdfbox:read-stream($is,$read as xs:string)
    @@ -388,13 +400,13 @@ as map(*){
       let $n:= COSInputStream:read($is,$buff,xs:int(0),xs:int($blen))
       let $data:=convert:integers-to-base64(subsequence($buff,1,$n))=>convert:binary-to-string()
       return map{"n":$n, "data": $read || $data}
    -}

    4.29 pdfbox:report

    Arities: #1#2

    Summary
    -summary CSV style info for named $properties for PDFs in $pdfpaths +}

    4.30 pdfbox:report

    Arities: #1#2

    Summary
    +summary CSV style info for all properties for $pdfpaths
    Signatures
    pdfbox:report ( $pdfpaths as xs:string* ) as map(*)
    pdfbox:report ( - $pdfpaths as item()*, $properties as xs:string* ) as map(*)
    Parameters
    Return
    See also
    Referenced by 1 functions from 1 modules
    References 8 functions from 3 modules
    Source ( 28 lines)
    function pdfbox:report($pdfpaths as xs:string*)
    +			$pdfpaths as item()*, $properties as xs:string* ) as map(*)
    Parameters
    Return
    Tags
    Referenced by 1 functions from 1 modules
    References 8 functions from 3 modules
    Source ( 28 lines)
    function pdfbox:report($pdfpaths as xs:string*)
     as map(*){
      pdfbox:report($pdfpaths,pdfbox:property-names())
     }
    function pdfbox:report($pdfpaths as item()*, $properties as xs:string*)
    @@ -420,14 +432,14 @@ as map(*){
                      }
                    
       }
    -}

    4.30 pdfbox:report-save

    Arities: #2

    Summary
    +}

    4.31 pdfbox:report-save

    Arities: #2

    Summary
    Convenience function to save report() data to file
    Signatures
    pdfbox:report-save ( - $data as map(*), $dest as xs:string ) as empty-sequence()
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Source ( 5 lines)
    function pdfbox:report-save($data as map(*),$dest as xs:string)
    +			$data as map(*), $dest as xs:string ) as empty-sequence
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 2 functions from 2 modules
    Source ( 5 lines)
    function pdfbox:report-save($data as map(*),$dest as xs:string)
     as empty-sequence(){
       let $opts := map {  "format":"xquery", "header":"yes", "separator" : "," }
       return file:write-text($dest,csv:serialize($data,$opts))
    -}

    4.31 pdfbox:specification

    Arities: #1

    Summary
    +}

    4.32 pdfbox:specification

    Arities: #1

    Summary
    The version of the PDF specification used by $pdf e.g "1.4" returned as string to avoid float rounding issues
    Signatures
    pdfbox:specification
    @@ -435,19 +447,19 @@ returned as string to avoid float rounding issues
     $pdf as item() ) as xs:string
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 1 functions from 1 modules
    Source ( 4 lines)
    function pdfbox:specification($pdf as item())
     as xs:string{
      PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string()
    -}

    4.32 pdfbox:version

    Arities: #0

    Summary
    +}

    4.33 pdfbox:version

    Arities: #0

    Summary
    Version of Apache Pdfbox in use e.g. "3.0.4"
    Signatures
    pdfbox:version ( ) as xs:string
    Return
    Referenced by 0 functions from 0 modules
    References 1 functions from 1 modules
    Source ( 4 lines)
    function pdfbox:version()
     as xs:string{
       Q{java:org.apache.pdfbox.util.Version}getVersion()
    -}

    4.33 pdfbox:with-pdf

    Arities: #2

    Summary
    +}

    4.34 pdfbox:with-pdf

    Arities: #2

    Summary
    "With-document" pattern: open pdf,apply $fn function, close pdf creates a local pdfobject and ensures it is closed after use e.g pdfbox:with-pdf("path...",pdfbox:page-text(?,5))
    Signatures
    pdfbox:with-pdf ( - $src as xs:string, $fn as function(item())as item()* ) as item()*
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 3 functions from 2 modules
    Source ( 11 lines)
    function pdfbox:with-pdf($src as xs:string,
    +			$src as xs:string, $fn as function(item())as item()* ) as item()*
    Parameters
    Return
    Referenced by 0 functions from 0 modules
    References 3 functions from 2 modules
    Source ( 11 lines)
    function pdfbox:with-pdf($src as xs:string,
                                     $fn as function(item())as item()*)
     as item()*{
      let $pdf:=pdfbox:open($src)
    @@ -457,7 +469,7 @@ as item()*{
                 pdfbox:close($pdf),fn:error($err:code,$src || " " || $err:description)
             }
     
    -}
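
A short sketch of the "with-document" pattern described in the `pdfbox:with-pdf` summary, assuming the module is importable by the namespace URI `org.expkg_zone58.Pdfbox3`. The path `/tmp/sample.pdf` is a hypothetical placeholder. Only functions that appear in this diff are called (`pdfbox:number-of-pages`, `pdfbox:specification`, `pdfbox:version`).

```xquery
import module namespace pdfbox = "org.expkg_zone58.Pdfbox3";

(: hypothetical path to an existing PDF :)
let $src := "/tmp/sample.pdf"

(: pass a named function reference: the PDF is opened, the function
   applied, and the PDF closed by with-pdf :)
let $pages := pdfbox:with-pdf($src, pdfbox:number-of-pages#1)

(: pass an inline function to collect several values in one open/close :)
let $info := pdfbox:with-pdf($src,
  function($pdf as item()) as item()* {
    map {
      "specification": pdfbox:specification($pdf),
      "pages": pdfbox:number-of-pages($pdf)
    }
  })

return map {
  "pdfbox-version": pdfbox:version(),
  "pages": $pages,
  "info": $info
}
```

Because `pdfbox:with-pdf` closes the document in both the success and the error branch, the callback should return plain values (strings, numbers, maps) rather than Java objects that still depend on the open PDF.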

    Namespaces

    The following namespaces are defined:

    Prefix -Type -Uri -
    arrayxpathhttp://www.w3.org/2005/xpath-functions/array
    convertbasexhttp://basex.org/modules/convert
    COSInputStreamjavajava:org.apache.pdfbox.cos.COSInputStream
    csvbasexhttp://basex.org/modules/csv
    dbbasexhttp://basex.org/modules/db
    errw3chttp://www.w3.org/2005/xqt-errors
    fetchbasexhttp://basex.org/modules/fetch
    Filejavajava:java.io.File
    file-http://expath.org/ns/file
    fnxpathhttp://www.w3.org/2005/xpath-functions
    Loaderjavajava:org.apache.pdfbox.Loader
    mapxpathhttp://www.w3.org/2005/xpath-functions/map
    PageExtractorjavajava:org.apache.pdfbox.multipdf.PageExtractor
    PDDocumentjavajava:org.apache.pdfbox.pdmodel.PDDocument
    PDDocumentCatalogjavajava:org.apache.pdfbox.pdmodel.PDDocumentCatalog
    PDDocumentInformationjavajava:org.apache.pdfbox.pdmodel.PDDocumentInformation
    PDDocumentOutlinejavajava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline
    pdfbox-org.expkg_zone58.Pdfbox3
    PDFRendererjavajava:org.apache.pdfbox.rendering.PDFRenderer
    PDFTextStripperjavajava:org.apache.pdfbox.text.PDFTextStripper
    PDMetadatajavajava:org.apache.pdfbox.pdmodel.common.PDMetadata
    PDOutlineItemjavajava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem
    PDPagejavajava:org.apache.pdfbox.pdmodel.PDPage
    PDPageLabelRangejavajava:org.apache.pdfbox.pdmodel.common.PDPageLabelRange
    PDPageLabelsjavajava:org.apache.pdfbox.pdmodel.common.PDPageLabels
    PDPageTreejavajava:org.apache.pdfbox.pdmodel.PDPageTree
    PDRectanglejavajava:org.apache.pdfbox.pdmodel.common.PDRectangle
    RandomAccessReadBufferjavajava:org.apache.pdfbox.io.RandomAccessReadBuffer
    RandomAccessReadBufferedFilejavajava:org.apache.pdfbox.io.RandomAccessReadBufferedFile
    rdfw3chttp://www.w3.org/1999/02/22-rdf-syntax-ns#
    xsw3chttp://www.w3.org/2001/XMLSchema

    6 RestXQ

    None

    Source Code

    xquery version '3.1';
    +}

    Namespaces

    The following namespaces are defined:

    Prefix -Uri -
    arrayhttp://www.w3.org/2005/xpath-functions/array
    converthttp://basex.org/modules/convert
    COSInputStreamjava:org.apache.pdfbox.cos.COSInputStream
    csvhttp://basex.org/modules/csv
    dbhttp://basex.org/modules/db
    errhttp://www.w3.org/2005/xqt-errors
    fetchhttp://basex.org/modules/fetch
    Filejava:java.io.File
    filehttp://expath.org/ns/file
    fnhttp://www.w3.org/2005/xpath-functions
    Loaderjava:org.apache.pdfbox.Loader
    maphttp://www.w3.org/2005/xpath-functions/map
    PageExtractorjava:org.apache.pdfbox.multipdf.PageExtractor
    PDDocumentjava:org.apache.pdfbox.pdmodel.PDDocument
    PDDocumentCatalogjava:org.apache.pdfbox.pdmodel.PDDocumentCatalog
    PDDocumentInformationjava:org.apache.pdfbox.pdmodel.PDDocumentInformation
    PDDocumentOutlinejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline
    pdfboxorg.expkg_zone58.Pdfbox3
    PDFRendererjava:org.apache.pdfbox.rendering.PDFRenderer
    PDFTextStripperjava:org.apache.pdfbox.text.PDFTextStripper
    PDMetadatajava:org.apache.pdfbox.pdmodel.common.PDMetadata
    PDOutlineItemjava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem
    PDPagejava:org.apache.pdfbox.pdmodel.PDPage
    PDPageLabelRangejava:org.apache.pdfbox.pdmodel.common.PDPageLabelRange
    PDPageLabelsjava:org.apache.pdfbox.pdmodel.common.PDPageLabels
    PDPageTreejava:org.apache.pdfbox.pdmodel.PDPageTree
    PDRectangleorg.apache.pdfbox.pdmodel.common.PDRectangle
    RandomAccessReadBufferjava:org.apache.pdfbox.io.RandomAccessReadBuffer
    RandomAccessReadBufferedFilejava:org.apache.pdfbox.io.RandomAccessReadBufferedFile
    rdfhttp://www.w3.org/1999/02/22-rdf-syntax-ns#
    xshttp://www.w3.org/2001/XMLSchema

    6 RestXQ

    None

    Source Code

    xquery version '3.1';
     (:~ 
     A BaseX 10.7+ interface to pdfbox3 https://pdfbox.apache.org/ , 
     requires pdfbox jars on classpath, in lib/custom or xar
    @@ -494,7 +506,7 @@ declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";
     
     declare namespace RandomAccessReadBuffer="java:org.apache.pdfbox.io.RandomAccessReadBuffer";
     declare namespace RandomAccessReadBufferedFile = "java:org.apache.pdfbox.io.RandomAccessReadBufferedFile";
    -declare namespace PDRectangle="java:org.apache.pdfbox.pdmodel.common.PDRectangle";
    +declare namespace PDRectangle="org.apache.pdfbox.pdmodel.common.PDRectangle";
     
     declare namespace File ="java:java.io.File";
     
    @@ -517,6 +529,11 @@ as item()*{
     };
     
     
    +(:~ open pdf using fetch:binary, returns pdf object :)
    +declare function pdfbox:open($pdfsrc as item())
    +as item(){
    +pdfbox:open($pdfsrc, map{})
    +};
     
     (:~ open pdf from file/url/binary, opts may have password , returns pdf object 
     @param $pdfsrc a fetchable url or filepath, or xs:base64Binary item
    @@ -541,13 +558,6 @@ as item(){
     }
     };
     
    -(:~ open pdf from a location, returns pdf object :)
    -declare function pdfbox:open($pdfsrc as item())
    -as item(){
    -pdfbox:open($pdfsrc, map{})
    -};
    -
    -
     (:~ The version of the PDF specification used by $pdf  e.g "1.4"
     returned as string to avoid float rounding issues
      :)
    @@ -556,13 +566,13 @@ as xs:string{
      PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string()
     };
     
    -(:~ Save pdf <code>$pdf</code> to filesystem at <code>$savepath</code> , returns $savepath :)
    +(:~ Save pdf $pdf to filesystem at $savepath , returns $savepath :)
     declare function pdfbox:pdf-save($pdf as item(),$savepath as xs:string)
     as xs:string{
        PDDocument:save($pdf, File:new($savepath)),$savepath
     };
     
    -(:~ Create binary representation (xs:base64Binary) of <code>$pdf</code> object  :)
    +(:~ Create binary representation of $pdf object as xs:base64Binary :)
     declare function pdfbox:binary($pdf as item())
     as xs:base64Binary{
        let $bytes:=Q{java:java.io.ByteArrayOutputStream}new()
    @@ -659,6 +669,12 @@ as item()*{
              else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined."))
     };
     
    +(:~ summary CSV style info for all properties for $pdfpaths 
    +:)
    +declare function pdfbox:report($pdfpaths as xs:string*)
    +as map(*){
    + pdfbox:report($pdfpaths,pdfbox:property-names())
    +};
     
     (:~ summary CSV style info for named $properties for PDFs in $pdfpaths 
     @see https://docs.basex.org/main/CSV_Functions#xquery
    @@ -688,13 +704,6 @@ as map(*){
       }
     };
     
    -(:~ summary CSV style info for all properties for $pdfpaths 
    -:)
    -declare function pdfbox:report($pdfpaths as xs:string*)
    -as map(*){
    - pdfbox:report($pdfpaths,pdfbox:property-names())
    -};
    -
     (:~ Convenience function to save report() data to file :)
     declare function pdfbox:report-save($data as map(*),$dest as xs:string)
     as empty-sequence(){
    @@ -768,14 +777,9 @@ as map(*){
          map{"list":(),"this":$outlineItem},
     
          function($input,$pos ) { 
    -        let $bookmark:=$input?this
    -        let $bk:=map{ 
    -              "index":  PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf),
    -              "title":  (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)}
    -              }
    -
    -        let $bk:= if(PDOutlineItem:hasChildren($bookmark))
    -                  then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($bookmark))
    +        let $bk:= pdfbox:bookmark($input?this,$pdf)
    +        let $bk:= if($bk?hasChildren)
    +                  then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this))
                             return map:merge(($bk,map:entry("children",$kids)))
                       else $bk 
             return map{
    @@ -806,6 +810,21 @@ as element(bookmark)*
       </bookmark>
     };
     
    +(:~ Return bookmark info for $bookmark
    +@return map{index:..,title:..,hasChildren:..}
    +:)
    +declare %private function pdfbox:bookmark($bookmark as item(),$pdf as item())
    +as map(*)
    +{
    + map{ 
    +  "index":  PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf),
    +  "title":  (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)}
    +  (:=>translate("�",""), :),
    +  "hasChildren": PDOutlineItem:hasChildren($bookmark)
    +  }
    +};
    +
    +
     (:~ pageIndex of $page in $pdf :)
     declare function pdfbox:find-page(
        $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :),
    @@ -974,4 +993,4 @@ declare %private function pdfbox:do-until(
     };
     
    \ No newline at end of file
    +   on Wednesday, 4th June 2025

    \ No newline at end of file diff --git a/docs/xqdoc/modules/F000001/xqdoc.xml b/docs/xqdoc/modules/F000001/xqdoc.xml index f881bc3..1e807fb 100644 --- a/docs/xqdoc/modules/F000001/xqdoc.xml +++ b/docs/xqdoc/modules/F000001/xqdoc.xml @@ -1,4 +1,4 @@ -2025-06-09T21:09:05.833+01:001.1org.expkg_zone58.Pdfbox3pdfbox +2025-06-04T16:17:13.527+01:001.1org.expkg_zone58.Pdfbox3pdfbox A BaseX 10.7+ interface to pdfbox3 https://pdfbox.apache.org/ , requires pdfbox jars on classpath, in lib/custom or xar @@ -40,7 +40,7 @@ declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; declare namespace RandomAccessReadBuffer="java:org.apache.pdfbox.io.RandomAccessReadBuffer"; declare namespace RandomAccessReadBufferedFile = "java:org.apache.pdfbox.io.RandomAccessReadBufferedFile"; -declare namespace PDRectangle="java:org.apache.pdfbox.pdmodel.common.PDRectangle"; +declare namespace PDRectangle="org.apache.pdfbox.pdmodel.common.PDRectangle"; declare namespace File ="java:java.io.File"; @@ -63,6 +63,11 @@ as item()*{ }; +(:~ open pdf using fetch:binary, returns pdf object :) +declare function pdfbox:open($pdfsrc as item()) +as item(){ +pdfbox:open($pdfsrc, map{}) +}; (:~ open pdf from file/url/binary, opts may have password , returns pdf object @param $pdfsrc a fetchable url or filepath, or xs:base64Binary item @@ -87,13 +92,6 @@ as item(){ } }; -(:~ open pdf from a location, returns pdf object :) -declare function pdfbox:open($pdfsrc as item()) -as item(){ -pdfbox:open($pdfsrc, map{}) -}; - - (:~ The version of the PDF specification used by $pdf e.g "1.4" returned as string to avoid float rounding issues :) @@ -102,13 +100,13 @@ as xs:string{ PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string() }; -(:~ Save pdf <code>$pdf</code> to filesystem at <code>$savepath</code> , returns $savepath :) +(:~ Save pdf $pdf to filesystem at $savepath , returns $savepath :) declare function pdfbox:pdf-save($pdf as item(),$savepath as xs:string) as xs:string{ 
PDDocument:save($pdf, File:new($savepath)),$savepath }; -(:~ Create binary representation (xs:base64Binary) of <code>$pdf</code> object :) +(:~ Create binary representation of $pdf object as xs:base64Binary :) declare function pdfbox:binary($pdf as item()) as xs:base64Binary{ let $bytes:=Q{java:java.io.ByteArrayOutputStream}new() @@ -205,6 +203,12 @@ as item()*{ else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined.")) }; +(:~ summary CSV style info for all properties for $pdfpaths +:) +declare function pdfbox:report($pdfpaths as xs:string*) +as map(*){ + pdfbox:report($pdfpaths,pdfbox:property-names()) +}; (:~ summary CSV style info for named $properties for PDFs in $pdfpaths @see https://docs.basex.org/main/CSV_Functions#xquery @@ -234,13 +238,6 @@ as map(*){ } }; -(:~ summary CSV style info for all properties for $pdfpaths -:) -declare function pdfbox:report($pdfpaths as xs:string*) -as map(*){ - pdfbox:report($pdfpaths,pdfbox:property-names()) -}; - (:~ Convenience function to save report() data to file :) declare function pdfbox:report-save($data as map(*),$dest as xs:string) as empty-sequence(){ @@ -314,14 +311,9 @@ as map(*){ map{"list":(),"this":$outlineItem}, function($input,$pos ) { - let $bookmark:=$input?this - let $bk:=map{ - "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), - "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} - } - - let $bk:= if(PDOutlineItem:hasChildren($bookmark)) - then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($bookmark)) + let $bk:= pdfbox:bookmark($input?this,$pdf) + let $bk:= if($bk?hasChildren) + then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this)) return map:merge(($bk,map:entry("children",$kids))) else $bk return map{ @@ -352,6 +344,21 @@ as element(bookmark)* </bookmark> }; +(:~ Return bookmark info for $bookmark +@return map{index:..,title:..,hasChildren:..} +:) +declare %private function 
pdfbox:bookmark($bookmark as item(),$pdf as item()) +as map(*) +{ + map{ + "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), + "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} + (:=>translate("�",""), :), + "hasChildren": PDOutlineItem:hasChildren($bookmark) + } +}; + + (:~ pageIndex of $page in $pdf :) declare function pdfbox:find-page( $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :), @@ -518,7 +525,7 @@ declare %private function pdfbox:do-until( else error(xs:QName('pdfbox:do-until'),"No implementation do-until found") }; -pdfbox:property-map +pdfbox:property-map Defines a map from property names to evaluation method. Keys are property names, values are sequences of functions to get property value starting from a $pdf object. @@ -562,7 +569,7 @@ values are sequences of functions to get property value starting from a $pdf obj "With-document" pattern: open pdf,apply $fn function, close pdf creates a local pdfobject and ensures it is closed after use e.g pdfbox:with-pdf("path...",pdfbox:page-text(?,5)) -pdfbox:with-pdffunction pdfbox:with-pdf ( $src as xs:string, $fn as function(item())as item()* ) as item()*srcxs:stringfnfunction(item())as item()*item()org.expkg_zone58.Pdfbox3openorg.expkg_zone58.Pdfbox3closeorg.expkg_zone58.Pdfbox3closehttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2005/xqt-errorscodehttp://www.w3.org/2005/xqt-errorsdescriptionfunction pdfbox:with-pdf($src as xs:string, +pdfbox:with-pdffunction pdfbox:with-pdf ( $src as xs:string, $fn as function(item())as item()* ) as item()* { let $pdf:=pdfbox:open($src) return try{ $fn($pdf),pdfbox:close($pdf) } catch *{ pdfbox:close($pdf),fn:error($err:code,$src || " " || $err:description) } }srcxs:stringfnfunction(item())as 
item()*item()org.expkg_zone58.Pdfbox3openorg.expkg_zone58.Pdfbox3closeorg.expkg_zone58.Pdfbox3closehttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2005/xqt-errorscodehttp://www.w3.org/2005/xqt-errorsdescriptionfunction pdfbox:with-pdf($src as xs:string, $fn as function(item())as item()*) as item()*{ let $pdf:=pdfbox:open($src) @@ -572,9 +579,13 @@ as item()*{ pdfbox:close($pdf),fn:error($err:code,$src || " " || $err:description) } +} +open pdf using fetch:binary, returns pdf objectpdfbox:openfunction pdfbox:open ( $pdfsrc as item() ) as item() { pdfbox:open($pdfsrc, map{}) }pdfsrcitem()item()org.expkg_zone58.Pdfbox3openfunction pdfbox:open($pdfsrc as item()) +as item(){ +pdfbox:open($pdfsrc, map{}) } open pdf from file/url/binary, opts may have password , returns pdf object -$pdfsrc a fetchable url or filepath, or xs:base64Binary item$opts options options include map {"password":}fetch:binary for https will use a lot of memory herepdfbox:openfunction pdfbox:open ( $pdfsrc as item(), $opts as map(*) ) as item()pdfsrcitem()optsmap(*)item()java:org.apache.pdfbox.LoaderloadPDFhttp://www.w3.org/2005/xpath-functionsstringhttp://www.w3.org/2005/xpath-functionsstarts-withjava:org.apache.pdfbox.LoaderloadPDFhttp://basex.org/modules/fetchbinaryhttp://www.w3.org/2005/xpath-functionsstringjava:org.apache.pdfbox.LoaderloadPDFjava:org.apache.pdfbox.io.RandomAccessReadBufferedFilenewhttp://www.w3.org/2005/xpath-functionsstringhttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2001/XMLSchemaQNamehttp://www.w3.org/2005/xqt-errorsdescriptionfunction pdfbox:open($pdfsrc as item(), $opts as map(*)) +$pdfsrc a fetchable url or filepath, or xs:base64Binary item$opts options options include map {"password":}fetch:binary for https will use a lot of memory herepdfbox:openfunction pdfbox:open ( $pdfsrc as item(), $opts as map(*) ) as item() { try{ if($pdfsrc instance of xs:base64Binary) then Loader:loadPDF( $pdfsrc,string($opts?password)) else 
if(starts-with($pdfsrc,"http")) then Loader:loadPDF( fetch:binary($pdfsrc),string($opts?password)) else Loader:loadPDF(RandomAccessReadBufferedFile:new($pdfsrc),string($opts?password)) } catch *{ let $loc:=if($pdfsrc instance of xs:base64Binary) then "xs:base64Binary" else $pdfsrc return error(xs:QName("pdfbox:open"),"Failed PDF load " || $loc || " " || $err:description) } }pdfsrcitem()optsmap(*)item()java:org.apache.pdfbox.LoaderloadPDFhttp://www.w3.org/2005/xpath-functionsstringhttp://www.w3.org/2005/xpath-functionsstarts-withjava:org.apache.pdfbox.LoaderloadPDFhttp://basex.org/modules/fetchbinaryhttp://www.w3.org/2005/xpath-functionsstringjava:org.apache.pdfbox.LoaderloadPDFjava:org.apache.pdfbox.io.RandomAccessReadBufferedFilenewhttp://www.w3.org/2005/xpath-functionsstringhttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2001/XMLSchemaQNamehttp://www.w3.org/2005/xqt-errorsdescriptionfunction pdfbox:open($pdfsrc as item(), $opts as map(*)) as item(){ try{ @@ -591,39 +602,35 @@ as item(){ return error(xs:QName("pdfbox:open"),"Failed PDF load " || $loc || " " || $err:description) } } -open pdf from a location, returns pdf objectpdfbox:openfunction pdfbox:open ( $pdfsrc as item() ) as item()pdfsrcitem()item()org.expkg_zone58.Pdfbox3openfunction pdfbox:open($pdfsrc as item()) -as item(){ -pdfbox:open($pdfsrc, map{}) -} The version of the PDF specification used by $pdf e.g "1.4" returned as string to avoid float rounding issues -pdfbox:specificationfunction pdfbox:specification ( $pdf as item() ) as xs:stringpdfitem()xs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetVersionfunction pdfbox:specification($pdf as item()) +pdfbox:specificationfunction pdfbox:specification ( $pdf as item() ) as xs:string { PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string() }pdfitem()xs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetVersionfunction pdfbox:specification($pdf as item()) as xs:string{ PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string() -} 
-Save pdf $pdf to filesystem at $savepath , returns $savepathpdfbox:pdf-savefunction pdfbox:pdf-save ( $pdf as item(),$savepath as xs:string ) as xs:stringpdfitem()savepathxs:stringxs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentsavejava:java.io.Filenewfunction pdfbox:pdf-save($pdf as item(),$savepath as xs:string) +} +Save pdf $pdf to filesystem at $savepath , returns $savepathpdfbox:pdf-savefunction pdfbox:pdf-save ( $pdf as item(),$savepath as xs:string ) as xs:string { PDDocument:save($pdf, File:new($savepath)),$savepath }pdfitem()savepathxs:stringxs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentsavejava:java.io.Filenewfunction pdfbox:pdf-save($pdf as item(),$savepath as xs:string) as xs:string{ PDDocument:save($pdf, File:new($savepath)),$savepath -} -Create binary representation (xs:base64Binary) of $pdf objectpdfbox:binaryfunction pdfbox:binary ( $pdf as item() ) as xs:base64Binarypdfitem()xs:base64Binaryjava:java.io.ByteArrayOutputStreamnewjava:org.apache.pdfbox.pdmodel.PDDocumentsavejava:java.io.ByteArrayOutputStreamtoByteArrayfunction pdfbox:binary($pdf as item()) +} +Create binary representation of $pdf object as xs:base64Binarypdfbox:binaryfunction pdfbox:binary ( $pdf as item() ) as xs:base64Binary { let $bytes:=Q{java:java.io.ByteArrayOutputStream}new() let $_:=PDDocument:save($pdf, $bytes) return Q{java:java.io.ByteArrayOutputStream}toByteArray($bytes) =>convert:integers-to-base64() }pdfitem()xs:base64Binaryjava:java.io.ByteArrayOutputStreamnewjava:org.apache.pdfbox.pdmodel.PDDocumentsavejava:java.io.ByteArrayOutputStreamtoByteArrayfunction pdfbox:binary($pdf as item()) as xs:base64Binary{ let $bytes:=Q{java:java.io.ByteArrayOutputStream}new() let $_:=PDDocument:save($pdf, $bytes) return Q{java:java.io.ByteArrayOutputStream}toByteArray($bytes) =>convert:integers-to-base64() } -Release any resources related to $pdfpdfbox:closefunction pdfbox:close ( $pdf as item() ) as 
empty-sequence()pdfitem()empty-sequence()java:org.apache.pdfbox.pdmodel.PDDocumentclosefunction pdfbox:close($pdf as item()) +Release any resources related to $pdfpdfbox:closefunction pdfbox:close ( $pdf as item() ) as empty-sequence() { (# db:wrapjava void #) { PDDocument:close($pdf) } }pdfitem()empty-sequencejava:org.apache.pdfbox.pdmodel.PDDocumentclosefunction pdfbox:close($pdf as item()) as empty-sequence(){ (# db:wrapjava void #) { PDDocument:close($pdf) } } -Number of pages in PDFpdfbox:number-of-pagesfunction pdfbox:number-of-pages ( $pdf as item() ) as xs:integerpdfitem()xs:integerjava:org.apache.pdfbox.pdmodel.PDDocumentgetNumberOfPagesfunction pdfbox:number-of-pages($pdf as item()) +Number of pages in PDFpdfbox:number-of-pagesfunction pdfbox:number-of-pages ( $pdf as item() ) as xs:integer { PDDocument:getNumberOfPages($pdf) }pdfitem()xs:integerjava:org.apache.pdfbox.pdmodel.PDDocumentgetNumberOfPagesfunction pdfbox:number-of-pages($pdf as item()) as xs:integer{ PDDocument:getNumberOfPages($pdf) } Pdf page as image (zero is cover) -options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi??pdfbox:page-renderfunction pdfbox:page-render ( $pdf as item(),$pageNo as xs:integer,$options as map(*) ) as xs:base64Binarypdfitem()pageNoxs:integeroptionsmap(*)xs:base64Binaryhttp://www.w3.org/2005/xpath-functions/mapmergejava:org.apache.pdfbox.rendering.PDFRenderernewjava:java.io.ByteArrayOutputStreamnewjava:javax.imageio.ImageIOwritejava:java.io.ByteArrayOutputStreamtoByteArrayfunction pdfbox:page-render($pdf as item(),$pageNo as xs:integer,$options as map(*)) +options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi??pdfbox:page-renderfunction pdfbox:page-render ( $pdf as item(),$pageNo as xs:integer,$options as map(*) ) as xs:base64Binary { let $options := map:merge(($options,map{"format":"jpg","scale":1})) let $bufferedImage := PDFRenderer:new($pdf) =>PDFRenderer:renderImage($pageNo,$options?scale) let $bytes := 
Q{java:java.io.ByteArrayOutputStream}new() let $_ := Q{java:javax.imageio.ImageIO}write($bufferedImage ,$options?format, $bytes) return Q{java:java.io.ByteArrayOutputStream}toByteArray($bytes) =>convert:integers-to-base64() }pdfitem()pageNoxs:integeroptionsmap(*)xs:base64Binaryhttp://www.w3.org/2005/xpath-functions/mapmergejava:org.apache.pdfbox.rendering.PDFRenderernewjava:java.io.ByteArrayOutputStreamnewjava:javax.imageio.ImageIOwritejava:java.io.ByteArrayOutputStreamtoByteArrayfunction pdfbox:page-render($pdf as item(),$pageNo as xs:integer,$options as map(*)) as xs:base64Binary{ let $options := map:merge(($options,map{"format":"jpg","scale":1})) let $bufferedImage := PDFRenderer:new($pdf) @@ -634,11 +641,11 @@ as xs:base64Binary{ =>convert:integers-to-base64() } -Defined property names, sortedpdfbox:property-namesfunction pdfbox:property-names ( ) as xs:string*xs:stringorg.expkg_zone58.Pdfbox3property-mapfunction pdfbox:property-names() +Defined property names, sortedpdfbox:property-namesfunction pdfbox:property-names ( ) as xs:string* { $pdfbox:property-map=>map:keys()=>sort() }xs:stringorg.expkg_zone58.Pdfbox3property-mapfunction pdfbox:property-names() as xs:string*{ $pdfbox:property-map=>map:keys()=>sort() } -Return the value of $property for $pdfpdfbox:propertyfunction pdfbox:property ( $pdf as item(),$property as xs:string ) as item()*pdfitem()propertyxs:stringitem()http://www.w3.org/2005/xpath-functionsexistshttp://www.w3.org/2005/xpath-functionsfold-lefthttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2001/XMLSchemaQNamehttp://www.w3.org/2005/xpath-functionsconcatorg.expkg_zone58.Pdfbox3property-mapfunction pdfbox:property($pdf as item(),$property as xs:string) +Return the value of $property for $pdfpdfbox:propertyfunction pdfbox:property ( $pdf as item(),$property as xs:string ) as item()* { let $fns:= $pdfbox:property-map($property) return if(exists($fns)) then fold-left($fns, $pdf, function($result,$this as function(*)){$result!$this(.)}) 
else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined.")) }pdfitem()propertyxs:stringitem()http://www.w3.org/2005/xpath-functionsexistshttp://www.w3.org/2005/xpath-functionsfold-lefthttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2001/XMLSchemaQNamehttp://www.w3.org/2005/xpath-functionsconcatorg.expkg_zone58.Pdfbox3property-mapfunction pdfbox:property($pdf as item(),$property as xs:string) as item()*{ let $fns:= $pdfbox:property-map($property) return if(exists($fns)) @@ -646,9 +653,14 @@ as item()*{ $pdf, function($result,$this as function(*)){$result!$this(.)}) else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined.")) +} +summary CSV style info for all properties for $pdfpaths +pdfbox:reportfunction pdfbox:report ( $pdfpaths as xs:string* ) as map(*) { pdfbox:report($pdfpaths,pdfbox:property-names()) }pdfpathsxs:stringmap(*)org.expkg_zone58.Pdfbox3reportorg.expkg_zone58.Pdfbox3property-namesfunction pdfbox:report($pdfpaths as xs:string*) +as map(*){ + pdfbox:report($pdfpaths,pdfbox:property-names()) } summary CSV style info for named $properties for PDFs in $pdfpaths -https://docs.basex.org/main/CSV_Functions#xquerypdfbox:reportfunction pdfbox:report ( $pdfpaths as item()*, $properties as xs:string* ) as map(*)pdfpathsitem()propertiesxs:stringmap(*)org.expkg_zone58.Pdfbox3openhttp://www.w3.org/2005/xpath-functionsfold-lefthttp://www.w3.org/2005/xpath-functions/arrayappendhttp://www.w3.org/2005/xpath-functionsstringorg.expkg_zone58.Pdfbox3propertyorg.expkg_zone58.Pdfbox3closehttp://www.w3.org/2005/xpath-functionsfold-lefthttp://www.w3.org/2005/xpath-functions/arrayappendfunction pdfbox:report($pdfpaths as item()*, $properties as xs:string*) +https://docs.basex.org/main/CSV_Functions#xquerypdfbox:reportfunction pdfbox:report ( $pdfpaths as item()*, $properties as xs:string* ) as map(*) { map{"names": array{"path",$properties}, "records": for $path in $pdfpaths let $name:=if($path instance of 
xs:base64Binary) then "binary" else $path return try{ let $pdf:=pdfbox:open($path) return (fold-left($properties, array{$name}, function($result as array(*),$prop as xs:string){ array:append($result, string(pdfbox:property($pdf, $prop)))} ), pdfbox:close($pdf) ) } catch *{ fold-left($properties, array{$name}, function($result as array(*),$prop as xs:string){ array:append($result, "#ERROR")} ) } } }pdfpathsitem()propertiesxs:stringmap(*)org.expkg_zone58.Pdfbox3openhttp://www.w3.org/2005/xpath-functionsfold-lefthttp://www.w3.org/2005/xpath-functions/arrayappendhttp://www.w3.org/2005/xpath-functionsstringorg.expkg_zone58.Pdfbox3propertyorg.expkg_zone58.Pdfbox3closehttp://www.w3.org/2005/xpath-functionsfold-lefthttp://www.w3.org/2005/xpath-functions/arrayappendfunction pdfbox:report($pdfpaths as item()*, $properties as xs:string*) as map(*){ map{"names": array{"path",$properties}, @@ -671,24 +683,19 @@ as map(*){ } } -} -summary CSV style info for all properties for $pdfpaths -pdfbox:reportfunction pdfbox:report ( $pdfpaths as xs:string* ) as map(*)pdfpathsxs:stringmap(*)org.expkg_zone58.Pdfbox3reportorg.expkg_zone58.Pdfbox3property-namesfunction pdfbox:report($pdfpaths as xs:string*) -as map(*){ - pdfbox:report($pdfpaths,pdfbox:property-names()) } -Convenience function to save report() data to filepdfbox:report-savefunction pdfbox:report-save ( $data as map(*),$dest as xs:string ) as empty-sequence()datamap(*)destxs:stringempty-sequence()http://expath.org/ns/filewrite-texthttp://basex.org/modules/csvserializefunction pdfbox:report-save($data as map(*),$dest as xs:string) +Convenience function to save report() data to filepdfbox:report-savefunction pdfbox:report-save ( $data as map(*),$dest as xs:string ) as empty-sequence() { let $opts := map { "format":"xquery", "header":"yes", "separator" : "," } return file:write-text($dest,csv:serialize($data,$opts)) 
}datamap(*)destxs:stringempty-sequencehttp://expath.org/ns/filewrite-texthttp://basex.org/modules/csvserializefunction pdfbox:report-save($data as map(*),$dest as xs:string) as empty-sequence(){ let $opts := map { "format":"xquery", "header":"yes", "separator" : "," } return file:write-text($dest,csv:serialize($data,$opts)) } -The number of outline items defined in $pdfpdfbox:number-of-bookmarksfunction pdfbox:number-of-bookmarks ( $pdf as item() ) as xs:integerpdfitem()xs:integerorg.expkg_zone58.Pdfbox3outline-xmlhttp://www.w3.org/2005/xpath-functionscountfunction pdfbox:number-of-bookmarks($pdf as item()) +The number of outline items defined in $pdfpdfbox:number-of-bookmarksfunction pdfbox:number-of-bookmarks ( $pdf as item() ) as xs:integer { let $xml:=pdfbox:outline-xml($pdf) return count($xml//bookmark) }pdfitem()xs:integerorg.expkg_zone58.Pdfbox3outline-xmlhttp://www.w3.org/2005/xpath-functionscountfunction pdfbox:number-of-bookmarks($pdf as item()) as xs:integer{ let $xml:=pdfbox:outline-xml($pdf) return count($xml//bookmark) } XMP metadata as "RDF" document -usually rdf:RDF root, but sometimes x:xmpmetapdfbox:metadatafunction pdfbox:metadata ( $pdf as item() ) as document-node(element(*))?pdfitem()document-node(element(*))java:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCataloghttp://www.w3.org/2005/xpath-functionsexistsjava:org.apache.pdfbox.pdmodel.common.PDMetadataexportXMPMetadataorg.expkg_zone58.Pdfbox3do-untilorg.expkg_zone58.Pdfbox3read-streamfunction pdfbox:metadata($pdf as item()) +usually rdf:RDF root, but sometimes x:xmpmetapdfbox:metadatafunction pdfbox:metadata ( $pdf as item() ) as document-node(element(*))? 
{ let $m:=PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getMetadata() return if(exists($m)) then let $is:=PDMetadata:exportXMPMetadata($m) return pdfbox:do-until( map{"n":0,"data":""}, function($input,$pos ) { pdfbox:read-stream($is,$input?data)}, function($output,$pos) { $output?n eq -1 } )?data=>parse-xml() else () }pdfitem()document-node(element(*))java:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCataloghttp://www.w3.org/2005/xpath-functionsexistsjava:org.apache.pdfbox.pdmodel.common.PDMetadataexportXMPMetadataorg.expkg_zone58.Pdfbox3do-untilorg.expkg_zone58.Pdfbox3read-streamfunction pdfbox:metadata($pdf as item()) as document-node(element(*))? { let $m:=PDDocument:getDocumentCatalog($pdf) @@ -705,7 +712,7 @@ as document-node(element(*))? )?data=>parse-xml() else () } -read next block from XMP streampdfbox:read-streamfunction pdfbox:read-stream ( $is,$read as xs:string ) as map(*)isreadxs:stringmap(*)java:java.util.ArrayscopyOfhttp://www.w3.org/2001/XMLSchemabytejava:org.apache.pdfbox.cos.COSInputStreamreadhttp://www.w3.org/2001/XMLSchemainthttp://www.w3.org/2001/XMLSchemainthttp://basex.org/modules/convertintegers-to-base64http://www.w3.org/2005/xpath-functionssubsequencefunction pdfbox:read-stream($is,$read as xs:string) +read next block from XMP streampdfbox:read-streamfunction pdfbox:read-stream ( $is,$read as xs:string ) as map(*) { let $blen:=4096 let $buff:=Q{java:java.util.Arrays}copyOf(array{xs:byte(0)},$blen) let $n:= COSInputStream:read($is,$buff,xs:int(0),xs:int($blen)) let $data:=convert:integers-to-base64(subsequence($buff,1,$n))=>convert:binary-to-string() return map{"n":$n, "data": $read || $data} }isreadxs:stringmap(*)java:java.util.ArrayscopyOfhttp://www.w3.org/2001/XMLSchemabytejava:org.apache.pdfbox.cos.COSInputStreamreadhttp://www.w3.org/2001/XMLSchemainthttp://www.w3.org/2001/XMLSchemainthttp://basex.org/modules/convertintegers-to-base64http://www.w3.org/2005/xpath-functionssubsequencefunction pdfbox:read-stream($is,$read as 
xs:string) as map(*){ let $blen:=4096 let $buff:=Q{java:java.util.Arrays}copyOf(array{xs:byte(0)},$blen) @@ -713,7 +720,7 @@ as map(*){ let $data:=convert:integers-to-base64(subsequence($buff,1,$n))=>convert:binary-to-string() return map{"n":$n, "data": $read || $data} } -Return outline for $pdf as map()*pdfbox:outlinefunction pdfbox:outline ( $pdf as item() ) as map(*)*pdfitem()map(*)java:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCataloghttp://www.w3.org/2005/xpath-functionsexistsorg.expkg_zone58.Pdfbox3outlinejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetFirstChildfunction pdfbox:outline($pdf as item()) +Return outline for $pdf as map()*pdfbox:outlinefunction pdfbox:outline ( $pdf as item() ) as map(*)* { (# db:wrapjava some #) { let $outline:= PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getDocumentOutline() return if(exists($outline)) then pdfbox:outline($pdf,PDOutlineItem:getFirstChild($outline)) } }pdfitem()map(*)java:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCataloghttp://www.w3.org/2005/xpath-functionsexistsorg.expkg_zone58.Pdfbox3outlinejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetFirstChildfunction pdfbox:outline($pdf as item()) as map(*)*{ (# db:wrapjava some #) { let $outline:= @@ -724,26 +731,21 @@ as map(*)*{ then pdfbox:outline($pdf,PDOutlineItem:getFirstChild($outline)) } } -return bookmark info for children of $outlineItem as seq of mapspdfbox:outlinefunction pdfbox:outline ( $pdf as item(),$outlineItem as item()? ) as map(*)*pdfitem()outlineItemitem()map(*)org.expkg_zone58.Pdfbox3outline_http://www.w3.org/2005/xpath-functions/mapgetfunction pdfbox:outline($pdf as item(),$outlineItem as item()?) +return bookmark info for children of $outlineItem as seq of mapspdfbox:outlinefunction pdfbox:outline ( $pdf as item(),$outlineItem as item()? 
) as map(*)* { let $find as map(*):=pdfbox:outline_($pdf ,$outlineItem) return map:get($find,"list") }pdfitem()outlineItemitem()map(*)org.expkg_zone58.Pdfbox3outline_http://www.w3.org/2005/xpath-functions/mapgetfunction pdfbox:outline($pdf as item(),$outlineItem as item()?) as map(*)*{ let $find as map(*):=pdfbox:outline_($pdf ,$outlineItem) return map:get($find,"list") } -outline helper. BaseX bug 10.7? error if inlined in outlinepdfbox:outline_function pdfbox:outline_ ( $pdf as item(),$outlineItem as item()? ) as map(*)pdfitem()outlineItemitem()map(*)org.expkg_zone58.Pdfbox3do-untiljava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemfindDestinationPagejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetTitlejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemhasChildrenorg.expkg_zone58.Pdfbox3outlinejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetFirstChildhttp://www.w3.org/2005/xpath-functions/mapmergehttp://www.w3.org/2005/xpath-functions/mapentryjava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetNextSiblinghttp://www.w3.org/2005/xpath-functionsemptyfunction pdfbox:outline_($pdf as item(),$outlineItem as item()?) +outline helper. BaseX bug 10.7? error if inlined in outlinepdfbox:outline_function pdfbox:outline_ ( $pdf as item(),$outlineItem as item()? 
) as map(*) { pdfbox:do-until( map{"list":(),"this":$outlineItem}, function($input,$pos ) { let $bk:= pdfbox:bookmark($input?this,$pdf) let $bk:= if($bk?hasChildren) then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this)) return map:merge(($bk,map:entry("children",$kids))) else $bk return map{ "list": ($input?list, $bk), "this": PDOutlineItem:getNextSibling($input?this)} }, function($output,$pos) { empty($output?this) } ) }pdfitem()outlineItemitem()map(*)org.expkg_zone58.Pdfbox3do-untilorg.expkg_zone58.Pdfbox3bookmarkorg.expkg_zone58.Pdfbox3outlinejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetFirstChildhttp://www.w3.org/2005/xpath-functions/mapmergehttp://www.w3.org/2005/xpath-functions/mapentryjava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetNextSiblinghttp://www.w3.org/2005/xpath-functionsemptyfunction pdfbox:outline_($pdf as item(),$outlineItem as item()?) as map(*){ pdfbox:do-until( map{"list":(),"this":$outlineItem}, function($input,$pos ) { - let $bookmark:=$input?this - let $bk:=map{ - "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), - "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} - } - - let $bk:= if(PDOutlineItem:hasChildren($bookmark)) - then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($bookmark)) + let $bk:= pdfbox:bookmark($input?this,$pdf) + let $bk:= if($bk?hasChildren) + then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this)) return map:merge(($bk,map:entry("children",$kids))) else $bk return map{ @@ -754,14 +756,14 @@ as map(*){ function($output,$pos) { empty($output?this) } ) } -PDF outline in xml formatpdfbox:outline-xmlfunction pdfbox:outline-xml ( $pdf as item() ) as element(outline)?pdfitem()element(outline)org.expkg_zone58.Pdfbox3outlinehttp://www.w3.org/2005/xpath-functionsexistsorg.expkg_zone58.Pdfbox3bookmark-xmlfunction pdfbox:outline-xml($pdf as 
item()) +PDF outline in xml formatpdfbox:outline-xmlfunction pdfbox:outline-xml ( $pdf as item() ) as element(outline)? { let $outline:=pdfbox:outline($pdf) return if(exists($outline)) then <outline>{$outline!pdfbox:bookmark-xml(.)}</outline> else () }pdfitem()element(outline)org.expkg_zone58.Pdfbox3outlinehttp://www.w3.org/2005/xpath-functionsexistsorg.expkg_zone58.Pdfbox3bookmark-xmlfunction pdfbox:outline-xml($pdf as item()) as element(outline)?{ let $outline:=pdfbox:outline($pdf) return if(exists($outline)) then <outline>{$outline!pdfbox:bookmark-xml(.)}</outline> else () } -Convert outline map to XMLpdfbox:bookmark-xmlfunction pdfbox:bookmark-xml ( $outline as map(*)* ) as element(bookmark)*outlinemap(*)element(bookmark)org.expkg_zone58.Pdfbox3bookmark-xmlfunction pdfbox:bookmark-xml($outline as map(*)*) +Convert outline map to XMLpdfbox:bookmark-xmlfunction pdfbox:bookmark-xml ( $outline as map(*)* ) as element(bookmark)* { $outline! <bookmark title="{?title}" index="{?index}"> {?children!pdfbox:bookmark-xml(.)} </bookmark> }outlinemap(*)element(bookmark)org.expkg_zone58.Pdfbox3bookmark-xmlfunction pdfbox:bookmark-xml($outline as map(*)*) as element(bookmark)* { $outline! @@ -769,7 +771,18 @@ as element(bookmark)* {?children!pdfbox:bookmark-xml(.)} </bookmark> } -pageIndex of $page in $pdfpdfbox:find-pagefunction pdfbox:find-page ( $page as item()? 
(: as java:org.apache.pdfbox.pdmodel.PDPage :), $pdf as item() ) as item()?pageitem()pdfitem()item()http://www.w3.org/2005/xpath-functionsexistsjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogfunction pdfbox:find-page( +Return bookmark info for $bookmark +map{index:..,title:..,hasChildren:..}pdfbox:bookmarkfunction pdfbox:bookmark ( $bookmark as item(),$pdf as item() ) as map(*) { map{ "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} (:=>translate("�",""), :), "hasChildren": PDOutlineItem:hasChildren($bookmark) } }bookmarkitem()pdfitem()map(*)java:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemfindDestinationPagejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemgetTitlejava:org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItemhasChildrenfunction pdfbox:bookmark($bookmark as item(),$pdf as item()) +as map(*) +{ + map{ + "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), + "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} + (:=>translate("�",""), :), + "hasChildren": PDOutlineItem:hasChildren($bookmark) + } +} +pageIndex of $page in $pdfpdfbox:find-pagefunction pdfbox:find-page ( $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :), $pdf as item() ) as item()? { if(exists($page)) then PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPages() =>PDPageTree:indexOf($page) }pageitem()pdfitem()item()http://www.w3.org/2005/xpath-functionsexistsjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogfunction pdfbox:find-page( $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :), $pdf as item()) as item()? @@ -780,14 +793,14 @@ as item()? 
=>PDPageTree:indexOf($page) } Return new PDF doc with pages from $start to $end as xs:base64Binary, (1 based) -$start first page to include$end last page to includepdfbox:extract-rangefunction pdfbox:extract-range ( $pdf as item(), $start as xs:integer,$end as xs:integer ) as xs:base64Binarypdfitem()startxs:integerendxs:integerxs:base64Binaryjava:org.apache.pdfbox.multipdf.PageExtractorneworg.expkg_zone58.Pdfbox3binaryorg.expkg_zone58.Pdfbox3closefunction pdfbox:extract-range($pdf as item(), +$start first page to include$end last page to includepdfbox:extract-rangefunction pdfbox:extract-range ( $pdf as item(), $start as xs:integer,$end as xs:integer ) as xs:base64Binary { let $a:=PageExtractor:new($pdf, $start, $end) =>PageExtractor:extract() return (pdfbox:binary($a),pdfbox:close($a)) }pdfitem()startxs:integerendxs:integerxs:base64Binaryjava:org.apache.pdfbox.multipdf.PageExtractorneworg.expkg_zone58.Pdfbox3binaryorg.expkg_zone58.Pdfbox3closefunction pdfbox:extract-range($pdf as item(), $start as xs:integer,$end as xs:integer) as xs:base64Binary { let $a:=PageExtractor:new($pdf, $start, $end) =>PageExtractor:extract() return (pdfbox:binary($a),pdfbox:close($a)) } -The number of labels defined in PDFpdfbox:number-of-labelsfunction pdfbox:number-of-labels ( $pdf as item() ) as xs:integerpdfitem()xs:integerjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCataloghttp://www.w3.org/2005/xpath-functionsexistsjava:org.apache.pdfbox.pdmodel.common.PDPageLabelsgetPageRangeCountfunction pdfbox:number-of-labels($pdf as item()) +The number of labels defined in PDFpdfbox:number-of-labelsfunction pdfbox:number-of-labels ( $pdf as item() ) as xs:integer { let $labels:=PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() return if(exists($labels)) then PDPageLabels:getPageRangeCount($labels) else 0 
}pdfitem()xs:integerjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCataloghttp://www.w3.org/2005/xpath-functionsexistsjava:org.apache.pdfbox.pdmodel.common.PDPageLabelsgetPageRangeCountfunction pdfbox:number-of-labels($pdf as item()) as xs:integer { let $labels:=PDDocument:getDocumentCatalog($pdf) @@ -798,7 +811,7 @@ as xs:integer } pageLabel for every page from derived from page-ranges The returned sequence will contain at MOST as much entries as the document has pages. -https://www.w3.org/TR/WCAG20-TECHS/PDF17.html#PDF17-exampleshttps://codereview.stackexchange.com/questions/286078/java-code-showing-page-labels-from-pdf-filespdfbox:labels-by-pagefunction pdfbox:labels-by-page ( $pdf as item() ) as xs:string*pdfitem()xs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogfunction pdfbox:labels-by-page($pdf as item()) +https://www.w3.org/TR/WCAG20-TECHS/PDF17.html#PDF17-exampleshttps://codereview.stackexchange.com/questions/286078/java-code-showing-page-labels-from-pdf-filespdfbox:labels-by-pagefunction pdfbox:labels-by-page ( $pdf as item() ) as xs:string* { PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() =>PDPageLabels:getLabelsByPageIndices() }pdfitem()xs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogfunction pdfbox:labels-by-page($pdf as item()) as xs:string* { PDDocument:getDocumentCatalog($pdf) @@ -806,7 +819,7 @@ as xs:string* =>PDPageLabels:getLabelsByPageIndices() } sequence of label ranges defined in PDF as formatted strings -a custom representation of the labels e.g "0-*Cover,1r,11D"pdfbox:labels-as-stringfunction pdfbox:labels-as-string ( $pdf as item() ) as xs:stringpdfitem()xs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogorg.expkg_zone58.Pdfbox3number-of-pagesorg.expkg_zone58.Pdfbox3label-as-stringfunction pdfbox:labels-as-string($pdf as item()) +a custom representation of the labels e.g "0-*Cover,1r,11D"pdfbox:labels-as-stringfunction pdfbox:labels-as-string ( $pdf as 
item() ) as xs:string { let $pagelabels:=PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() return $pagelabels !(0 to pdfbox:number-of-pages($pdf)-1) !pdfbox:label-as-string($pagelabels,.)=>string-join("&#10;") }pdfitem()xs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogorg.expkg_zone58.Pdfbox3number-of-pagesorg.expkg_zone58.Pdfbox3label-as-stringfunction pdfbox:labels-as-string($pdf as item()) as xs:string{ let $pagelabels:=PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() @@ -815,12 +828,12 @@ as xs:string{ !pdfbox:label-as-string($pagelabels,.)=>string-join("&#10;") } -get pagelabels existpdfbox:page-labelsfunction pdfbox:page-labels ( $pdf )pdfjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogfunction pdfbox:page-labels($pdf) +get pagelabels existpdfbox:page-labelsfunction pdfbox:page-labels ( $pdf ) { PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() }pdfjava:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogfunction pdfbox:page-labels($pdf) { PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() } -label for $page formated as string, empty if nonepdfbox:label-as-stringfunction pdfbox:label-as-string ( $pagelabels,$page as xs:integer ) as xs:string?pagelabelspagexs:integerxs:stringjava:org.apache.pdfbox.pdmodel.common.PDPageLabelsgetPageLabelRangehttp://www.w3.org/2005/xpath-functionsemptyjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStartjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStylejava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetPrefixhttp://www.w3.org/2005/xpath-functionsstring-joinhttp://www.w3.org/2005/xpath-functionsemptyhttp://www.w3.org/2005/xpath-functionsexistsfunction pdfbox:label-as-string($pagelabels,$page as xs:integer) +label for $page formated as string, empty if nonepdfbox:label-as-stringfunction pdfbox:label-as-string ( $pagelabels,$page as xs:integer ) as xs:string? 
{ let $label:=PDPageLabels:getPageLabelRange($pagelabels,$page) return if(empty($label)) then () else let $start:= PDPageLabelRange:getStart($label) let $style := PDPageLabelRange:getStyle($label) let $prefix:= PDPageLabelRange:getPrefix($label) return string-join(($page, if(empty($style)) then "-" else $style, if(($start eq 1)) then "" else $start, if(exists($prefix)) then '*' || $prefix (:TODO double " :) )) }pagelabelspagexs:integerxs:stringjava:org.apache.pdfbox.pdmodel.common.PDPageLabelsgetPageLabelRangehttp://www.w3.org/2005/xpath-functionsemptyjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStartjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStylejava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetPrefixhttp://www.w3.org/2005/xpath-functionsstring-joinhttp://www.w3.org/2005/xpath-functionsemptyhttp://www.w3.org/2005/xpath-functionsexistsfunction pdfbox:label-as-string($pagelabels,$page as xs:integer) as xs:string?{ let $label:=PDPageLabels:getPageLabelRange($pagelabels,$page) return if(empty($label)) @@ -835,7 +848,7 @@ as xs:string?{ if(exists($prefix)) then '*' || $prefix (:TODO double " :) )) } -sequence of maps for each label/page range defined in $pdfpdfbox:labels-as-mapfunction pdfbox:labels-as-map ( $pdf as item() ) as map(*)*pdfitem()map(*)java:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogorg.expkg_zone58.Pdfbox3number-of-pagesorg.expkg_zone58.Pdfbox3label-as-mapfunction pdfbox:labels-as-map($pdf as item()) +sequence of maps for each label/page range defined in $pdfpdfbox:labels-as-mapfunction pdfbox:labels-as-map ( $pdf as item() ) as map(*)* { let $pagelabels:=PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() return $pagelabels !(0 to pdfbox:number-of-pages($pdf)-1) !pdfbox:label-as-map($pagelabels,.) 
}pdfitem()map(*)java:org.apache.pdfbox.pdmodel.PDDocumentgetDocumentCatalogorg.expkg_zone58.Pdfbox3number-of-pagesorg.expkg_zone58.Pdfbox3label-as-mapfunction pdfbox:labels-as-map($pdf as item()) as map(*)*{ let $pagelabels:=PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPageLabels() @@ -843,7 +856,7 @@ as map(*)*{ !(0 to pdfbox:number-of-pages($pdf)-1) !pdfbox:label-as-map($pagelabels,.) } -label/page-range for $page as mappdfbox:label-as-mapfunction pdfbox:label-as-map ( $pagelabels,$page as xs:integer ) as map(*)pagelabelspagexs:integermap(*)java:org.apache.pdfbox.pdmodel.common.PDPageLabelsgetPageLabelRangehttp://www.w3.org/2005/xpath-functionsemptyjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetPrefixjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStartjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStylefunction pdfbox:label-as-map($pagelabels,$page as xs:integer) +label/page-range for $page as mappdfbox:label-as-mapfunction pdfbox:label-as-map ( $pagelabels,$page as xs:integer ) as map(*) { let $label:=PDPageLabels:getPageLabelRange($pagelabels,$page) return if(empty($label)) then () else map{ "index": $page, "prefix": PDPageLabelRange:getPrefix($label), "start": PDPageLabelRange:getStart($label), "style": PDPageLabelRange:getStyle($label) } }pagelabelspagexs:integermap(*)java:org.apache.pdfbox.pdmodel.common.PDPageLabelsgetPageLabelRangehttp://www.w3.org/2005/xpath-functionsemptyjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetPrefixjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStartjava:org.apache.pdfbox.pdmodel.common.PDPageLabelRangegetStylefunction pdfbox:label-as-map($pagelabels,$page as xs:integer) as map(*) { let $label:=PDPageLabels:getPageLabelRange($pagelabels,$page) @@ -856,7 +869,7 @@ as map(*) "style": PDPageLabelRange:getStyle($label) } } -return text on $pageNopdfbox:page-textfunction pdfbox:page-text ( $pdf as item(), $pageNo as xs:integer ) as 
xs:stringpdfitem()pageNoxs:integerxs:stringjava:org.apache.pdfbox.text.PDFTextStrippernewjava:org.apache.pdfbox.text.PDFTextStrippergetTextfunction pdfbox:page-text($pdf as item(), $pageNo as xs:integer) +return text on $pageNopdfbox:page-textfunction pdfbox:page-text ( $pdf as item(), $pageNo as xs:integer ) as xs:string { let $tStripper := (# db:wrapjava instance #) { PDFTextStripper:new() => PDFTextStripper:setStartPage($pageNo) => PDFTextStripper:setEndPage($pageNo) } return (# db:checkstrings #) {PDFTextStripper:getText($tStripper,$pdf)} }pdfitem()pageNoxs:integerxs:stringjava:org.apache.pdfbox.text.PDFTextStrippernewjava:org.apache.pdfbox.text.PDFTextStrippergetTextfunction pdfbox:page-text($pdf as item(), $pageNo as xs:integer) as xs:string{ let $tStripper := (# db:wrapjava instance #) { PDFTextStripper:new() @@ -866,17 +879,17 @@ as xs:string{ return (# db:checkstrings #) {PDFTextStripper:getText($tStripper,$pdf)} } Return size of $pageNo (zero based) -e.g. [0.0,0.0,168.0,239.52]pdfbox:page-media-boxfunction pdfbox:page-media-box ( $pdf as item(), $pageNo as xs:integer ) as xs:stringpdfitem()pageNoxs:integerxs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetPagefunction pdfbox:page-media-box($pdf as item(), $pageNo as xs:integer) +e.g. [0.0,0.0,168.0,239.52]pdfbox:page-media-boxfunction pdfbox:page-media-box ( $pdf as item(), $pageNo as xs:integer ) as xs:string { PDDocument:getPage($pdf, $pageNo) =>PDPage:getMediaBox() =>PDRectangle:toString() }pdfitem()pageNoxs:integerxs:stringjava:org.apache.pdfbox.pdmodel.PDDocumentgetPagefunction pdfbox:page-media-box($pdf as item(), $pageNo as xs:integer) as xs:string{ PDDocument:getPage($pdf, $pageNo) =>PDPage:getMediaBox() =>PDRectangle:toString() } -Version of Apache Pdfbox in use e.g. "3.0.4"pdfbox:versionfunction pdfbox:version ( ) as xs:stringxs:stringjava:org.apache.pdfbox.util.VersiongetVersionfunction pdfbox:version() +Version of Apache Pdfbox in use e.g. 
"3.0.4"pdfbox:versionfunction pdfbox:version ( ) as xs:string { Q{java:org.apache.pdfbox.util.Version}getVersion() }xs:stringjava:org.apache.pdfbox.util.VersiongetVersionfunction pdfbox:version() as xs:string{ Q{java:org.apache.pdfbox.util.Version}getVersion() } -Convert datepdfbox:gregToISOfunction pdfbox:gregToISO ( $item as item()? ) as xs:string?itemitem()xs:stringhttp://www.w3.org/2005/xpath-functionsexistsjava:java.util.GregorianCalendartoZonedDateTimefunction pdfbox:gregToISO($item as item()?) +Convert datepdfbox:gregToISOfunction pdfbox:gregToISO ( $item as item()? ) as xs:string? { if(exists($item)) then Q{java:java.util.GregorianCalendar}toZonedDateTime($item)=>string() else () }itemitem()xs:stringhttp://www.w3.org/2005/xpath-functionsexistsjava:java.util.GregorianCalendartoZonedDateTimefunction pdfbox:gregToISO($item as item()?) as xs:string?{ if(exists($item)) then Q{java:java.util.GregorianCalendar}toZonedDateTime($item)=>string() @@ -884,7 +897,7 @@ as xs:string?{ } fn:do-until shim for BaseX 9+10 if fn:do-until not found use hof:until, note: $pos always zero -pdfbox:do-untilfunction pdfbox:do-until ( $input as item()*, $action as function(item()*, xs:integer) as item()*, $predicate as function(item()*, xs:integer) as xs:boolean? ) as item()*inputitem()actionfunction(item()*, xs:integer) as item()*predicatefunction(item()*, xs:integer) as xs:boolean?item()http://www.w3.org/2005/xpath-functionsfunction-lookuphttp://www.w3.org/2005/xpath-functionsQNamehttp://www.w3.org/2005/xpath-functionsexistshttp://www.w3.org/2005/xpath-functionsfunction-lookuphttp://www.w3.org/2005/xpath-functionsQNamehttp://www.w3.org/2005/xpath-functionsexistshttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2001/XMLSchemaQNamefunction pdfbox:do-until( +pdfbox:do-untilfunction pdfbox:do-until ( $input as item()*, $action as function(item()*, xs:integer) as item()*, $predicate as function(item()*, xs:integer) as xs:boolean? 
) as item()* { let $fn:=function-lookup(QName('http://www.w3.org/2005/xpath-functions','do-until'), 3) return if(exists($fn)) then $fn($input,$action,$predicate) else let $hof:=function-lookup(QName('http://basex.org/modules/hof','until'), 3) return if(exists($hof)) then $hof($predicate(?,0),$action(?,0),$input) else error(xs:QName('pdfbox:do-until'),"No implementation do-until found") }inputitem()actionfunction(item()*, xs:integer) as item()*predicatefunction(item()*, xs:integer) as xs:boolean?item()http://www.w3.org/2005/xpath-functionsfunction-lookuphttp://www.w3.org/2005/xpath-functionsQNamehttp://www.w3.org/2005/xpath-functionsexistshttp://www.w3.org/2005/xpath-functionsfunction-lookuphttp://www.w3.org/2005/xpath-functionsQNamehttp://www.w3.org/2005/xpath-functionsexistshttp://www.w3.org/2005/xpath-functionserrorhttp://www.w3.org/2001/XMLSchemaQNamefunction pdfbox:do-until( $input as item()*, $action as function(item()*, xs:integer) as item()*, $predicate as function(item()*, xs:integer) as xs:boolean? diff --git a/docs/xqdoc/modules/F000001/xqparse.xml b/docs/xqdoc/modules/F000001/xqparse.xml index 48057f6..c1564e9 100644 --- a/docs/xqdoc/modules/F000001/xqparse.xml +++ b/docs/xqdoc/modules/F000001/xqparse.xml @@ -35,7 +35,7 @@ refer to the same concept. 
Also label and (page)range are used interchangably&#x declare namespace RandomAccessReadBuffer="java:org.apache.pdfbox.io.RandomAccessReadBuffer"; declare namespace RandomAccessReadBufferedFile = "java:org.apache.pdfbox.io.RandomAccessReadBufferedFile"; -declare namespace PDRectangle="java:org.apache.pdfbox.pdmodel.common.PDRectangle"; +declare namespace PDRectangle="org.apache.pdfbox.pdmodel.common.PDRectangle"; declare namespace File ="java:java.io.File"; @@ -58,6 +58,11 @@ e.g pdfbox:with-pdf("path...",pdfbox:page-text(?,5)) }; +(:~ open pdf using fetch:binary, returns pdf object :) +declare function pdfbox:open($pdfsrc as item()) +as item(){ +pdfbox:open($pdfsrc, map{}) +}; (:~ open pdf from file/url/binary, opts may have password , returns pdf object @param $pdfsrc a fetchable url or filepath, or xs:base64Binary item @@ -82,13 +87,6 @@ e.g pdfbox:with-pdf("path...",pdfbox:page-text(?,5)) } }; -(:~ open pdf from a location, returns pdf object :) -declare function pdfbox:open($pdfsrc as item()) -as item(){ -pdfbox:open($pdfsrc, map{}) -}; - - (:~ The version of the PDF specification used by $pdf e.g "1.4" returned as string to avoid float rounding issues :) @@ -97,13 +95,13 @@ returned as string to avoid float rounding issues PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string() }; -(:~ Save pdf <code>$pdf</code> to filesystem at <code>$savepath</code> , returns $savepath :) +(:~ Save pdf $pdf to filesystem at $savepath , returns $savepath :) declare function pdfbox:pdf-save($pdf as item(),$savepath as xs:string) as xs:string{ PDDocument:save($pdf, File:new($savepath)),$savepath }; -(:~ Create binary representation (xs:base64Binary) of <code>$pdf</code> object :) +(:~ Create binary representation of $pdf object as xs:base64Binary :) declare function pdfbox:binary($pdf as item()) as xs:base64Binary{ let $bytes:=Q{java:java.io.ByteArrayOutputStream}new() @@ -200,6 +198,12 @@ options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi?? 
:) else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined.")) }; +(:~ summary CSV style info for all properties for $pdfpaths +:) +declare function pdfbox:report($pdfpaths as xs:string*) +as map(*){ + pdfbox:report($pdfpaths,pdfbox:property-names()) +}; (:~ summary CSV style info for named $properties for PDFs in $pdfpaths @see https://docs.basex.org/main/CSV_Functions#xquery @@ -229,13 +233,6 @@ options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi?? :) } }; -(:~ summary CSV style info for all properties for $pdfpaths -:) -declare function pdfbox:report($pdfpaths as xs:string*) -as map(*){ - pdfbox:report($pdfpaths,pdfbox:property-names()) -}; - (:~ Convenience function to save report() data to file :) declare function pdfbox:report-save($data as map(*),$dest as xs:string) as empty-sequence(){ @@ -309,14 +306,9 @@ options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi?? :) map{"list":(),"this":$outlineItem}, function($input,$pos ) { - let $bookmark:=$input?this - let $bk:=map{ - "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), - "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} - } - - let $bk:= if(PDOutlineItem:hasChildren($bookmark)) - then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($bookmark)) + let $bk:= pdfbox:bookmark($input?this,$pdf) + let $bk:= if($bk?hasChildren) + then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this)) return map:merge(($bk,map:entry("children",$kids))) else $bk return map{ @@ -347,6 +339,21 @@ options.format="bmp jpg png gif" etc, options.scale= 1 is 72 dpi?? 
:) </bookmark> }; +(:~ Return bookmark info for $bookmark +@return map{index:..,title:..,hasChildren:..} +:) +declare %private function pdfbox:bookmark($bookmark as item(),$pdf as item()) +as map(*) +{ + map{ + "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), + "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} + (:=>translate("�",""), :), + "hasChildren": PDOutlineItem:hasChildren($bookmark) + } +}; + + (:~ pageIndex of $page in $pdf :) declare function pdfbox:find-page( $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :), diff --git a/docs/xqdoc/restxq.html b/docs/xqdoc/restxq.html index 00b1d5f..0584079 100644 --- a/docs/xqdoc/restxq.html +++ b/docs/xqdoc/restxq.html @@ -7,4 +7,4 @@ Contents
    1. Summary
    2. Rest Paths

    Summary

    No RESTXQ usage

    Related documents
    View | Description | Format
    report | Index of sources | xhtml
    imports | Summary of import usage | xhtml
    imports-diag | Project wide module imports as html mermaid class diagram | html5
    imports-diag.mmd | Project wide module imports as a mermaid class diagram | text
    annotations | Summary of XQuery annotation use | xhtml
    xqdoca.xml | xqDocA run configuration report (XML) | xml
    xqdoc-validate | validate generated xqdoc files | xml

    Rest interface paths

    \ No newline at end of file +   on Wednesday, 4th June 2025

    \ No newline at end of file diff --git a/docs/xqdoc/validation-report.xml b/docs/xqdoc/validation-report.xml index af1570e..4ee2c65 100644 --- a/docs/xqdoc/validation-report.xml +++ b/docs/xqdoc/validation-report.xml @@ -1 +1 @@ -valid \ No newline at end of file +valid \ No newline at end of file diff --git a/docs/xqdoc/xqdoca.xml b/docs/xqdoc/xqdoca.xml index 736c3e1..1bacc45 100644 --- a/docs/xqdoc/xqdoca.xml +++ b/docs/xqdoc/xqdoca.xml @@ -1,4 +1,4 @@ -0.9.1docs/xqdoc/ +0.9.1docs/xqdoc/ report restxq imports @@ -10,4 +10,4 @@ module xqdoc xqparse - basextrue*.xqm,*.xq,*.xquerysrcsrc/truetrue1.1true \ No newline at end of file + basex*.xqm,*.xq,*.xquerysrcsrc/truetrue1.1true \ No newline at end of file diff --git a/package.json b/package.json index a8c2def..ccaffd3 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "pdfbox", - "version": "0.5.0", + "version": "0.4.0", "description": "A BaseX interface to Apache Pdfbox version 3", "main": "src/Pdfbox3.xqm", "homepage": "https://github.com/expkg-zone58/pdfbox#readme", @@ -8,9 +8,9 @@ "doc": "docs" }, "scripts": { - "xar build": "%BASEX10%/bin/basex scripts/make-xar.xq", "test": "%BASEX10%/bin/basex -Wt tests", - "docs build": "xqdoca" + "docs": "xqdoca", + "build": "%BASEX10%/bin/basex scripts/make-xar.xq" }, "keywords": [ "pdf", diff --git a/readme.md b/readme.md index d3d53ac..17d1a35 100644 --- a/readme.md +++ b/readme.md @@ -29,7 +29,7 @@ The features focus on extracting information from PDFs rather than creation or e * Form processing ## Documentation -* Function [documentation](docs/guide.md) +* Function [documentation](doc.md) * The Apache Pdfbox 3 [FAQ](https://pdfbox.apache.org/3.0/faq.html) may be useful. 
# Install diff --git a/samples.pdf/readme.md b/samples.pdf/readme.md index 6d57d71..1cc64e0 100644 --- a/samples.pdf/readme.md +++ b/samples.pdf/readme.md @@ -5,8 +5,8 @@ |------|-----------|--------|----------|---| |[BaseX100.pdf](BaseX100.pdf)||✅||https://files.basex.org/releases/10.0/BaseX100.pdf| |[icelandic-dictionary.pdf](icelandic-dictionary.pdf)|✅|| |http://css4.pub/2015/icelandic/dictionary.pdf| -|[page-numbers.pdf](page-numbers.pdf)||✅||https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers.pdf| -|[page-numbers-password.pdf](page-numbers-password.pdf)||✅|✅(password)|https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers.pdf| +|[page-numbers.pdf](page-numbers.pdf)||✅||https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers| +|[page-numbers-password.pdf](page-numbers-password.pdf)||✅|✅(password)|https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers| |[Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans](Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans-Final-Report-November-2021.pdf)|✅|||https://www.lse.ac.uk/News/News-Assets/PDFs/2021/Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans-Final-Report-November-2021.pdf| |[Legal RAG Hallucinations](Legal_RAG_Hallucinations.pdf)|✅|||https://law.stanford.edu/wp-content/uploads/2024/05/Legal_RAG_Hallucinations.pdf| diff --git a/src/Pdfbox3.xqm b/src/Pdfbox3.xqm index dd2217a..572d067 100644 --- a/src/Pdfbox3.xqm +++ b/src/Pdfbox3.xqm @@ -1,21 +1,9 @@ xquery version '3.1'; (:~ -A BaseX 10.7+ interface for Apache PDFBox® - A Java PDF Library, -It requires the Pdfbox jars to be on the classpath, or a EXPath package (xar) installation. -

    Terms

    -The following terms are used:
    -bookmark: A bookmark has a title and a pageindex. It may contain nested bookmarks.
    -outline: The outline is the tree of bookmarks defined in the PDF. It may be empty.
    -page range: A page range defines the page numbering schema in operation from a certain pageIndex until a subsequent range is set.
    -page label: A page label defines style: Roman, Decimal etc, start: the index to start from (default 1) and prefix: an optional string to prefix to the page label e.g "Vol1:"
    - +A BaseX 10.7+ interface to pdfbox3 https://pdfbox.apache.org/ , +requires pdfbox jars on classpath, in lib/custom or xar +@note following the java source the terms outline and bookmark +refer to the same concept. Also label and (page)range are used interchangably @note tested with pdfbox-app-3.0.5.jar @see https://pdfbox.apache.org/download.cgi @javadoc https://javadoc.io/static/org.apache.pdfbox/pdfbox/3.0.5/ @@ -41,16 +29,19 @@ declare namespace PDFRenderer="java:org.apache.pdfbox.rendering.PDFRenderer"; declare namespace PDMetadata="java:org.apache.pdfbox.pdmodel.common.PDMetadata"; declare namespace COSInputStream="java:org.apache.pdfbox.cos.COSInputStream"; -declare namespace RandomAccessReadBuffer="java:org.apache.pdfbox.io.RandomAccessReadBuffer"; -declare namespace RandomAccessReadBufferedFile = "java:org.apache.pdfbox.io.RandomAccessReadBufferedFile"; -declare namespace PDRectangle="java:org.apache.pdfbox.pdmodel.common.PDRectangle"; - -declare namespace File ="java:java.io.File"; declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; -(:~ open $pdf,apply $fn function, close pdf ("With-document" pattern) +declare namespace RandomAccessReadBuffer="java:org.apache.pdfbox.io.RandomAccessReadBuffer"; +declare namespace RandomAccessReadBufferedFile = "java:org.apache.pdfbox.io.RandomAccessReadBufferedFile"; +declare namespace PDRectangle="org.apache.pdfbox.pdmodel.common.PDRectangle"; + +declare namespace File ="java:java.io.File"; + + + +(:~ "With-document" pattern: open pdf,apply $fn function, close pdf creates a local pdfobject and ensures it is closed after use e.g pdfbox:with-pdf("path...",pdfbox:page-text(?,5)) :) @@ -67,6 +58,11 @@ as item()*{ }; +(:~ open pdf using fetch:binary, returns pdf object :) +declare function pdfbox:open($pdfsrc as item()) +as item(){ +pdfbox:open($pdfsrc, map{}) +}; (:~ open pdf from file/url/binary, opts may have password , returns pdf object @param $pdfsrc a fetchable url or filepath, or xs:base64Binary 
item @@ -91,13 +87,6 @@ as item(){ } }; -(:~ open pdf from a location, returns pdf object :) -declare function pdfbox:open($pdfsrc as item()) -as item(){ -pdfbox:open($pdfsrc, map{}) -}; - - (:~ The version of the PDF specification used by $pdf e.g "1.4" returned as string to avoid float rounding issues :) @@ -106,17 +95,13 @@ as xs:string{ PDDocument:getVersion($pdf)=>xs:decimal()=>round(4)=>string() }; -(:~ Save pdf $pdf to filesystem at $savepath , returns $savepath :) +(:~ Save pdf $pdf to filesystem at $savepath , returns $savepath :) declare function pdfbox:pdf-save($pdf as item(),$savepath as xs:string) as xs:string{ PDDocument:save($pdf, File:new($savepath)),$savepath }; -(:~ Create binary representation (xs:base64Binary) of $pdf object -@param $pdf pdf object, created by pdfbox:open -@see #pdfbox:open -@see #pdfbox:with-pdf -:) +(:~ Create binary representation of $pdf object as xs:base64Binary :) declare function pdfbox:binary($pdf as item()) as xs:base64Binary{ let $bytes:=Q{java:java.io.ByteArrayOutputStream}new() @@ -213,6 +198,12 @@ as item()*{ else error(xs:QName('pdfbox:property'),concat("Property '",$property,"' not defined.")) }; +(:~ summary CSV style info for all properties for $pdfpaths +:) +declare function pdfbox:report($pdfpaths as xs:string*) +as map(*){ + pdfbox:report($pdfpaths,pdfbox:property-names()) +}; (:~ summary CSV style info for named $properties for PDFs in $pdfpaths @see https://docs.basex.org/main/CSV_Functions#xquery @@ -242,13 +233,6 @@ as map(*){ } }; -(:~ summary CSV style info for all properties for $pdfpaths -:) -declare function pdfbox:report($pdfpaths as xs:string*) -as map(*){ - pdfbox:report($pdfpaths,pdfbox:property-names()) -}; - (:~ Convenience function to save report() data to file :) declare function pdfbox:report-save($data as map(*),$dest as xs:string) as empty-sequence(){ @@ -303,8 +287,7 @@ as map(*)*{ =>PDDocumentCatalog:getDocumentOutline() return if(exists($outline)) - then 
pdfbox:outline($pdf,PDOutlineItem:getFirstChild($outline)) - else () + then pdfbox:outline($pdf,PDOutlineItem:getFirstChild($outline)) } }; @@ -323,14 +306,9 @@ as map(*){ map{"list":(),"this":$outlineItem}, function($input,$pos ) { - let $bookmark:=$input?this - let $bk:=map{ - "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), - "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} - } - - let $bk:= if(PDOutlineItem:hasChildren($bookmark)) - then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($bookmark)) + let $bk:= pdfbox:bookmark($input?this,$pdf) + let $bk:= if($bk?hasChildren) + then let $kids:=pdfbox:outline($pdf,PDOutlineItem:getFirstChild($input?this)) return map:merge(($bk,map:entry("children",$kids))) else $bk return map{ @@ -361,6 +339,21 @@ as element(bookmark)* }; +(:~ Return bookmark info for $bookmark +@return map{index:..,title:..,hasChildren:..} +:) +declare %private function pdfbox:bookmark($bookmark as item(),$pdf as item()) +as map(*) +{ + map{ + "index": PDOutlineItem:findDestinationPage($bookmark,$pdf)=>pdfbox:find-page($pdf), + "title": (# db:checkstrings #) {PDOutlineItem:getTitle($bookmark)} + (:=>translate("�",""), :), + "hasChildren": PDOutlineItem:hasChildren($bookmark) + } +}; + + (:~ pageIndex of $page in $pdf :) declare function pdfbox:find-page( $page as item()? (: as java:org.apache.pdfbox.pdmodel.PDPage :), @@ -371,7 +364,6 @@ as item()? 
then PDDocument:getDocumentCatalog($pdf) =>PDDocumentCatalog:getPages() =>PDPageTree:indexOf($page) - else () }; (:~ Return new PDF doc with pages from $start to $end as xs:base64Binary, (1 based) @@ -443,7 +435,7 @@ as xs:string?{ return string-join(($page, if(empty($style)) then "-" else $style, if(($start eq 1)) then "" else $start, - if(exists($prefix)) then '*' || $prefix else "" (:TODO double " :) + if(exists($prefix)) then '*' || $prefix (:TODO double " :) )) }; @@ -524,7 +516,7 @@ declare %private function pdfbox:do-until( then $fn($input,$action,$predicate) else let $hof:=function-lookup(QName('http://basex.org/modules/hof','until'), 3) return if(exists($hof)) - then $hof($predicate(?,0),$action(?,0),$input) - else error(xs:QName('pdfbox:do-until'),"No implementation do-until found") + then $hof($predicate(?,0),$action(?,0),$input) + else error(xs:QName('pdfbox:do-until'),"No implementation do-until found") };