Export command

The export subcommands allow you to convert an sqlite export of an Arkindex project to other formats.

There are only two ways to export data from Arkindex:

  • using the sqlite export, which is the most efficient and recommended approach;
  • making API calls to retrieve the data you want.

Arkindex provides no other export format, which is why the export subcommands exist to transform sqlite exports into other formats.

PDF export

The pdf subcommand creates PDF files from the sqlite export of an Arkindex project.

PDF export specific requirements

For the PDF export to work, the Arkindex CLI must be installed with extra dependencies, using the following command:

pip install arkindex-cli[export]

If you have already installed the Arkindex CLI, you can run this command anyway: it will only install the missing PDF export dependencies.

Basic usage

arkindex export $PATH/TO/database.sqlite pdf --output $PATH/TO/FOLDER

This will export the entire project into PDF files named after each folder element found in the SQLite database. Each PDF will have one page per page element, and the transcription of each text_line element found recursively in the page will be added so that the text becomes searchable.

The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the pdf subcommand and its options. You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.

PDF options

You can change the element types used to build the PDF export using the following arguments:

  • --folder-type: specify the type of the elements (containing pages) for which PDF files will be created; defaults to folder.
  • --page-type: specify the type of the elements from which PDF pages will be created; defaults to page.
  • --line-type: specify the type of the elements containing transcriptions; defaults to text_line.
  • --use-page-transcriptions: only export the transcriptions of the specified page-type elements. Defaults to False.

The --use-page-transcriptions and --line-type arguments are mutually exclusive, as line-level transcriptions are ignored when exporting only page-level transcriptions.

arkindex export $PATH/TO/database.sqlite pdf --folder-type volume --page-type folio
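
For instance, assuming your page-type elements carry their own transcriptions, a sketch of an export restricted to those page-level transcriptions could look like this:

arkindex export $PATH/TO/database.sqlite pdf --page-type folio --use-page-transcriptions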

You can restrict the PDF creation to only part of your export/project using the --folder-ids argument; the command will only create PDF files from the folder elements whose IDs were given using this argument, ignoring the others.

arkindex export $PATH/TO/database.sqlite pdf --folder-ids $FOLDER_ID_1 $FOLDER_ID_2 $FOLDER_ID_3

The --debug flag makes both the transcription text and bounding boxes visible on the PDF pages, which can be useful for testing the export itself or for transcription troubleshooting.
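
For example, a minimal debug run (the output path is illustrative) could be:

arkindex export $PATH/TO/database.sqlite pdf --debug --output $PATH/TO/FOLDER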

Example PDF

Using the following command

arkindex export './demo-book-of-hours-20220524-104657.sqlite' pdf --folder-ids 6661cc31-c437-4a35-8fd5-e34a0d3a638e

generated this PDF from this volume in the sqlite export of the Demo | Book of Hours project on demo.arkindex.org (only the first 10 pages were preserved in the PDF).

ALTO XML export

The alto subcommand creates Alto XML files from the sqlite export of an Arkindex project.

Basic usage

arkindex export $PATH/TO/database.sqlite alto --output $PATH/TO/FOLDER

This command exports the entire project into Alto XML files. One directory in the specified output directory is created for each folder, named after the folder's UUID. One file is created for each page in each folder, and is named after the page's UUID. The files include <TextLine> nodes for each transcription found in a text_line element, and use <Processing> nodes to store the worker versions associated with the elements and transcriptions.

The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the alto subcommand and its options. You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.

Optional METS file

You can generate a METS file alongside your Alto XML files, linking the generated files (and their paths) to their corresponding images (with their IIIF URLs) using the --mets flag. One METS file is generated per exported folder element, and it is saved as mets_entrypoint.xml in the corresponding folder, with the Alto XML files.

arkindex export $PATH/TO/database.sqlite alto --output $PATH/TO/FOLDER --mets
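
As a sketch, assuming the Alto files are written with a .xml extension and using hypothetical UUIDs, the output directory could then look like this:

$PATH/TO/FOLDER/
└── 11111111-2222-3333-4444-555555555555/          one directory per folder element
    ├── aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee.xml    one Alto XML file per page element
    ├── ffffffff-0000-1111-2222-333333333333.xml
    └── mets_entrypoint.xml                         only present with the --mets flag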

Alto XML options

You can change the element types used to build the Alto XML files using the following arguments:

  • --folder-type: specify the type of the elements (containing pages) for which folders will be created in the output directory; defaults to folder.
  • --page-type: specify the type of the elements from which Alto XML files will be created; defaults to page.
  • --line-type: specify the type of the elements containing transcriptions, from which <TextLine> nodes will be created; defaults to text_line.
  • --layout-tag: an optional argument, which allows you to specify one type of elements (without transcriptions) to export along with the line-type elements. The elements are exported as <GraphicalElement> nodes in the Alto XML files.
arkindex export $PATH/TO/database.sqlite alto --line-type print_line --layout-tag barcode

You can restrict the Alto XML conversion to only part of your export/project using the --folder-ids argument; the command will only create Alto XML files from the page elements contained in the folder elements whose IDs were given using this argument, ignoring the others.

arkindex export $PATH/TO/database.sqlite alto --folder-ids $FOLDER_ID_1 $FOLDER_ID_2 $FOLDER_ID_3

CSV export

The csv subcommand creates a CSV file from the sqlite export of an Arkindex project.

Basic usage

arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv

This command creates a CSV file with one line for each element contained in the exported project.

The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the csv subcommand and its options. You can specify the path to the output file using the --output argument; if unspecified, it defaults to an elements.csv file created in the current working directory.

CSV options

The csv export subcommand can take the following arguments:

  • --parent: only export to CSV the elements which are the children of the specified element. If you want to get these elements recursively, and not only the direct children of that parent, you need to use it in conjunction with the --recursive flag.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --parent $ELEMENT_ID
  • --recursive: this flag can only be used in conjunction with the --parent argument. If no parent element ID is specified, the export is recursive by default.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --parent $ELEMENT_ID --recursive
  • --type: restrict the export to elements of a given type.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --type $ELEMENT_TYPE
  • --field: restrict the CSV columns to the given fields (supports Unix shell-style wildcards, see documentation and the example after this list).
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --field $FIELD_1 $FIELD_2
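
For instance, assuming the wildcard pattern is quoted so that your shell does not expand it, the default image_id and image_url columns can both be selected with a single pattern:

arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --field id name "image_*"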

Classifications

  • --with-classes: add the exported elements' classifications to the output CSV. In the CSV file, there will be one column per class, filled with the classification confidence on the corresponding line if the class is set on the element, and left empty if it is not.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-classes
  • --classification-worker-version: only export classifications created by a given worker version. To export classifications created manually, set it to manual instead of a worker version UUID.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-classes --classification-worker-version $UUID

Metadata

  • --with-metadata: add the exported elements' metadata to the output CSV. As the same metadata can be set multiple times with different values on one element, there can be more than one column for each metadata in the CSV file. On each line the metadata column is filled with the corresponding metadata value if that metadata is set on the element, or left blank. See the CSV output subsection for details.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-metadata
  • --with-parent-metadata: complementary to --with-metadata. Load metadata from the element and all of its parents, recursively. One column will be used per metadata, following the same method as explained above for the --with-metadata option.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-metadata --with-parent-metadata

Entities

  • --with-entities: add the entities attached to the exported elements (through their transcriptions) to the output CSV. In the CSV file, there will be one column (or more, see the CSV output subsection) per entity type present in the exported data, filled with the entity name/value on the element's line if it is present on the element, and the cell left blank if it is not.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-entities
  • --entities-worker-version: only export the entities placed on the transcription by a given worker version. If you want to export entities created manually, set this parameter to manual instead of a worker version UUID.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-entities --entities-worker-version $UUID

CSV output

By default, the output CSV contains the following columns/information:

  • the id, name and type of the element (columns id, name and type);
  • the id and url of the element's image (columns image_id and image_url);
  • the element's polygon (polygon);
  • the id of the worker version that created this element, if it was created by a worker; if the element was not created by a worker this column is left blank (worker_version_id);
  • the element's creation date (created).

Classifications

If using the --with-classes flag, one column per classification that is present on at least one of the exported elements is added. The classification columns are prefixed with classification_. These columns contain the classification confidence for these classes on each element (left blank if the class is not present on the element).
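
For example, with hypothetical handwritten and typewritten classes, the relevant part of the CSV could look like this:

id,name,[...],classification_handwritten,classification_typewritten
element_id_1,element_name_1,[...],0.98,
element_id_2,element_name_2,[...],,0.87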

Metadata

If using the --with-metadata flag, at least one column per metadata that is present on at least one of the exported elements is added. The metadata columns in the CSV file are prefixed with metadata_. If some metadata are set multiple times with different values on some elements, then there are multiple columns for these metadata. These columns contain the metadata value(s) for these metadata on each element (left blank if the metadata is not present on the element).

The number of columns for one metadata in the exported CSV is the maximum number of times that metadata appears on a single element, amongst all the exported elements, and according to the export options.

For example, if 3 elements are exported, and their metadata are the following:

  • {"fruit": "apple", "fruit": "apricot"}
  • {"vegetable": "potato"}
  • {"fruit": "apple", "fruit": "apricot", "fruit": "banana", "vegetable": "artichoke"}

then in the output CSV, there will be 3 "metadata_fruit" columns, and 1 "metadata_vegetable" column. The lines for these elements in the CSV will look like this:

id,name,[...],metadata_fruit,metadata_fruit,metadata_fruit,metadata_vegetable
element_id_1,element_name_1,[...],apple,apricot,,
element_id_2,element_name_2,[...],,,,potato
element_id_3,element_name_3,[...],apple,apricot,banana,artichoke

Entities

If using the --with-entities flag, at least one column per entity type for which an entity is present on at least one of the exported elements is added. The entity columns are prefixed with entity_. These columns contain the name/value of the exported entities. If multiple entities of the same type are set on at least one of the exported elements, then there are multiple columns for that entity type.

The number of columns for one entity type in the exported CSV is the maximum number of times that entity type appears on a single element, amongst all the exported elements, and according to the export options.

For example, if 3 elements are exported, and their exported entities are the following:

  • {"city": "new york", "city": "chicago", "year": 1986}
  • {"state": "illinois", "year": 2003, "year": 2005, "year": 2020}
  • {"city": "paris", "state": "texas"}

then in the output CSV, there will be 2 "entity_city" columns, 1 "entity_state" column and 3 "entity_year" columns. The lines for these elements in the CSV will look like this:

id,name,[...],entity_city,entity_city,entity_state,entity_year,entity_year,entity_year
element_id_1,element_name_1,[...],new york,chicago,,1986,,
element_id_2,element_name_2,[...],,,illinois,2003,2005,2020
element_id_3,element_name_3,[...],paris,,texas,,,

Entities export

The entities subcommand creates an export of the transcription entities of an Arkindex project, from its exported sqlite database.

Usage

arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com --output $PATH/TO/FOLDER/transcription_entities.csv

This command generates a CSV file with one line per transcription entity in the project.

  • ⚠️ The path to the sqlite export is a positional argument, and comes before the entities subcommand and its options.
  • --instance-url is a required argument, necessary to build working URLs to the transcription entities' parent elements, as the instance URL information is not present in the exported database. ⚠️ Enter it carefully, to avoid creating invalid URLs.
  • You can specify the path to an output file using the --output argument; if unspecified, the data is written to stdout. This output can also be redirected to a file without using the output argument, like this: arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com > some_file.csv.

CSV output

The output CSV contains the following columns/information:

  • entity_id: the ID of the entity the transcription entity links to;
  • entity_value: the name/value of that entity (for example, a person name);
  • entity_type: this entity's type;
  • entity_metas: additional data from the entity's metas field;
  • offset and length: the position and length of the entity in its parent transcription;
  • transcription_id: the ID of the parent transcription;
  • element_id and element_url: the ID and URL of the transcription's element.
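
As a sketch, one output line could look like the following, using the same placeholder style as the CSV examples above (all values are hypothetical):

entity_id,entity_value,entity_type,entity_metas,offset,length,transcription_id,element_id,element_url
entity_id_1,John Doe,person,,0,8,transcription_id_1,element_id_1,element_url_1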

DOCX export

The docx subcommand creates DOCX files from the sqlite export of an Arkindex project.

Basic usage

arkindex export $PATH/TO/database.sqlite docx --output $PATH/TO/FOLDER

This command creates one DOCX file for each page element in the export, containing this page's transcription. The DOCX files are saved in the specified output folder.

The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the docx subcommand and its options. You can specify the path to the output folder using the --output argument; if unspecified, it defaults to a docx folder created in the current working directory.

DOCX options

Folders to export

You can specify which folders to export using two mutually exclusive arguments:

  • --folder-type: only export folders of the specified type.
arkindex export $PATH/TO/database.sqlite docx --folder-type carton
  • --folder-ids: only export the specified folders (one or more UUIDs).
arkindex export $PATH/TO/database.sqlite docx --folder-ids aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb

If none of these arguments is used, then element transcriptions are exported from all the folders in the export.

Elements to export

  • --element-type: specify the type of the elements to export. Defaults to page.
arkindex export $PATH/TO/database.sqlite docx --element-type single_page

Transcriptions to export

  • --line-type: specify the type of the elements from which to export the transcriptions. If unspecified, then the transcriptions are exported from the elements defined by the element-type parameter. For example, if you want to export single_page elements and there are transcriptions on these elements themselves, then you don't need to specify a line-type. However, if the transcriptions you want to retrieve for the single pages can be found on their children text_line elements, then you should use --line-type text_line: all the transcriptions from the children text_line elements will be concatenated and exported in a file for each parent single_page.
arkindex export $PATH/TO/database.sqlite docx --element-type single_page --line-type text_line

You can also filter the transcriptions to be exported by source using the following mutually exclusive parameters:

  • --worker-run-id: only export transcriptions created by a given worker run (UUID) or manual transcriptions ("manual").
arkindex export $PATH/TO/database.sqlite docx --worker-run-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
  • --worker-version-id: only export transcriptions created by a given worker version (UUID) or manual transcriptions ("manual").
arkindex export $PATH/TO/database.sqlite docx --worker-version-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa

⚠️ If the parameters you have specified return more than one transcription for a given element, the export will fail.

Merge files in folders

  • --merge: use this flag to create one DOCX file per folder, instead of one file per page / element-type element.
arkindex export $PATH/TO/database.sqlite docx --merge

DOCX output

The created DOCX files contain the exported element's name as a header, and the transcription below. If using the --merge flag, then the created DOCX files contain each exported element's name as a header and their transcription below, one after the other.