diff --git a/doc.md b/doc.md
index d63321e..1029ff9 100644
--- a/doc.md
+++ b/doc.md
@@ -45,18 +45,12 @@ import module namespace pdfbox="org.expkg_zone58.Pdfbox3";
 ---

 ### Opening a PDF Document
-To open a PDF document, use the `pdfbox:open` function. This function can handle local files, URLs, or binary data.
+To open a PDF document, use the `pdfbox:open` function. This function can handle local files, URLs, or binary data.

 ```xquery
 let $pdf := pdfbox:open("path/to/document.pdf")
 ```

-If the PDF is encrypted, you can provide a password:
-
-```xquery
-let $pdf := pdfbox:open("path/to/encrypted.pdf", map{"password": "your_password"})
-```
-
 ---

 ### Closing a PDF Document
@@ -109,15 +103,19 @@ let $author := pdfbox:property($pdf, "author")
 ```

 Supported properties include:
-- `pageCount`: Number of pages.
-- `title`: Document title.
+- `#bookmarks` :Number of bookmarks
+- `#labels` :Number of labels
+- `#pages` : Number of pages
 - `author`: Document author.
-- `creator`: Document creator.
-- `producer`: Document producer.
-- `subject`: Document subject.
-- `keywords`: Document keywords.
 - `creationDate`: Document creation date.
-- `modificationDate`: Document modification date.
+- `creator`: Document creator.
+- `keywords`: Document keywords.
+- `labels`: Document labels formated as a string.
+ `modificationDate`: Document modification date.
+- `producer`: Document producer.
+- `specification` PDF spec version used in the document.
+- `subject`: Document subject.
+- `title`: Document title.

 ---

@@ -133,22 +131,15 @@ The outline is returned as a sequence of maps, where each map represents a bookm
 ---

 ### Saving a PDF Document
-To save a PDF document to the filesystem, use the `pdfbox:save` function.
+To save a PDF document to the filesystem, use the `pdfbox:pdf-save` function.

 ```xquery
-let $savedPath := pdfbox:save($pdf, "path/to/save/document.pdf")
+let $savedPath := pdfbox:pdf-save($pdf, "path/to/save/document.pdf")
 ```

 ---


-## Advanced Usage
-### Handling Encrypted PDFs
-If the PDF is encrypted, you can provide a password when opening the document.
-
-```xquery
-let $pdf := pdfbox:open("path/to/encrypted.pdf", map{"password": "your_password"})
-```

 ---

@@ -165,7 +156,7 @@ let $labels := pdfbox:labels($pdf)
 To get the size of a specific page, use the `pdfbox:page-media-box` function.

 ```xquery
-let $size := pdfbox:page-media-box($pdf, 1) (: Get size of page 1 :)
+let $size := pdfbox:page-media-box($pdf, 1) (: Get size of page 0, the cover :)
 ```

 ---
@@ -177,10 +168,17 @@ You can generate a CSV-style report of properties for multiple PDFs using the `p
 let $report := pdfbox:report(("path/to/doc1.pdf", "path/to/doc2.pdf"))
 ```

-The report includes properties like `title`, `author`, `pageCount`, etc., for each PDF.
+The report includes all properties by default, such as `title`, `author`, `#pages` , etc., for each PDF.

 ---

+## Advanced Usage
+### Handling Encrypted PDFs
+If the PDF is encrypted, you can provide a password when opening the document.
+
+```xquery
+let $pdf := pdfbox:open("path/to/encrypted.pdf", map{"password": "your_password"})
+```
 ## Error Handling
 The library includes error handling to manage issues such as failed PDF loads or unsupported operations. Errors are thrown with descriptive messages to help diagnose problems.

@@ -194,12 +192,3 @@ try {
 ```

 ---
-
-## Conclusion
-The `Pdfbox3.xqm` library, distributed as a XAR file with included PDFBox JAR files, provides a comprehensive interface for working with PDF documents in XQuery. By leveraging Apache PDFBox, it offers powerful features for text extraction, image rendering, and document manipulation. With this guide, you should be able to integrate PDF processing into your XQuery applications effectively.
-
-For more detailed information, refer to the [Apache PDFBox documentation](https://pdfbox.apache.org/docs/3.0.0/javadocs/) and the [BaseX documentation](https://docs.basex.org/).
-
----
-
-This user guide provides a starting point for using the `Pdfbox3.xqm` library. For further assistance, consult the official documentation or reach out to the community for support.
\ No newline at end of file
diff --git a/readme.md b/readme.md
index bb310bb..17d1a35 100644
--- a/readme.md
+++ b/readme.md
@@ -51,6 +51,11 @@ pdfbox:with-pdf("...path/to/pdf.pdf",
 * `scripts/make-xar.xq` packages the required `jar`s and `xqm` files to a `xar` file in the `dist` folder.
 The `package.json` is (ab)used as a configuration source.
 Non standard information is held in the `expkg_zone58` section. This is experimental and may change.
+
+`package.json` contains script to run
+1. The XAR build.
+2. The tests
+3. The documentation
 ### Action support
 The workflow `ci-basex.yaml` builds and tests the package.
 This can be used as an action on [github](https://github.com/features/actions), or on a local [gitea](https://docs.gitea.com/usage/actions/overview) or [forgejo](https://forgejo.org/) installation.
diff --git a/samples.pdf/readme.md b/samples.pdf/readme.md
index 76c6499..02eabfd 100644
--- a/samples.pdf/readme.md
+++ b/samples.pdf/readme.md
@@ -1,9 +1,13 @@
 # Example PDFs with pageLabels and outlines

 ## Sources
-* [BaseX100.pdf](https://files.basex.org/releases/10.0/BaseX100.pdf)
-* [icelandic-dictionary.pdf](http://css4.pub/2015/icelandic/dictionary.pdf)
-* [page-numbers.pdf](https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers).
-* [page-numbers-password.pdf](https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers).
-* [Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans](https://www.lse.ac.uk/News/News-Assets/PDFs/2021/Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans-Final-Report-November-2021.pdf)
-* [Legal RAG Hallucinations](https://law.stanford.edu/wp-content/uploads/2024/05/Legal_RAG_Hallucinations.pdf)
+| Name | bookmarks | labels | password |source |
+|------|-----------|--------|----------|---|
+|[BaseX100.pdf](BaseX100.pdf)||☑||https://files.basex.org/releases/10.0/BaseX100.pdf|
+|[icelandic-dictionary.pdf](icelandic-dictionary.pdf)|☑|| |http://css4.pub/2015/icelandic/dictionary.pdf|
+|[page-numbers.pdf](https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers)||☑||https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers|
+|[page-numbers-password.pdf](page-numbers-password.pdf)||☑|☑(password)|https://www.w3.org/WAI/WCAG22/working-examples/pdf-page-numbers/page-numbers|
+|[Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans](Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans.pdf)|☑|||https://www.lse.ac.uk/News/News-Assets/PDFs/2021/Sentience-in-Cephalopod-Molluscs-and-Decapod-Crustaceans-Final-Report-November-2021.pdf|
+|[Legal RAG Hallucinations](Legal_RAG_Hallucinations.pdf)|☑|||https://law.stanford.edu/wp-content/uploads/2024/05/Legal_RAG_Hallucinations.pdf|
+
+
diff --git a/src/Pdfbox3.xqm b/src/Pdfbox3.xqm
index 7db1add..572d067 100644
--- a/src/Pdfbox3.xqm
+++ b/src/Pdfbox3.xqm
@@ -177,7 +177,8 @@ declare %private variable $pdfbox:property-map:=map{
   "modificationDate": (PDDocument:getDocumentInformation#1,
                        PDDocumentInformation:getModificationDate#1,
                        pdfbox:gregToISO#1),
-  "labels": pdfbox:labels-as-strings#1
+
+  "labels": pdfbox:labels-as-string#1
 };

 (:~ Defined property names, sorted :)