Export command
The export subcommands allow you to convert an sqlite export of an Arkindex project to other formats.
There are only two ways to export data from Arkindex:
- using the sqlite export, which is the most efficient and recommended approach;
- making API calls to retrieve the data you want.
There are no other formats for Arkindex exports, which is why the export subcommands exist to transform sqlite exports.
PDF export
The pdf subcommand creates PDF files from the sqlite export of an Arkindex project.
PDF export specific requirements
For the PDF export to work, the Arkindex CLI should be installed with some extra dependencies, by running the following command:
pip install arkindex-cli[export]
If you have installed the Arkindex CLI already, you can run this command anyway and it will only install the missing PDF export related dependencies.
Basic usage
arkindex export $PATH/TO/database.sqlite pdf --output $PATH/TO/FOLDER
This exports the entire project into PDF files, one for each folder element found in the SQLite database and named after it. Each PDF has one page for each page element, and the transcription of each text_line element found recursively in the page is added so that the text becomes searchable.
The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the pdf subcommand and its options. You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.
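When the export is scripted, the argument order is easy to get wrong. The sketch below is one possible way to assemble the command line in Python. build_export_command is a hypothetical helper, not part of the Arkindex CLI, and it only uses the arguments described above.

```python
def build_export_command(database, subcommand, output=None, options=()):
    """Assemble an `arkindex export` command line as a list of arguments.

    Hypothetical helper: the database path is placed before the subcommand,
    and the subcommand's own options come last.
    """
    command = ["arkindex", "export", str(database), subcommand]
    if output is not None:
        command += ["--output", str(output)]
    command += [str(option) for option in options]
    return command


if __name__ == "__main__":
    # Placeholder paths; adjust them to your own files.
    print(" ".join(build_export_command("database.sqlite", "pdf", output="pdfs")))
```

The resulting list can be passed to subprocess.run(command, check=True) from the Python standard library to actually run the export.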
PDF options
You can change the element types used to build the PDF export using the following arguments:
- --folder-type: specify the type of the elements, containing pages, for which PDF files will be created; defaults to folder.
- --page-type: specify the type of the elements from which PDF pages will be created; defaults to page.
- --line-type: specify the type of the elements containing transcriptions; defaults to text_line.
- --use-page-transcriptions: only export the transcriptions of the specified page-type elements; defaults to False.
The --use-page-transcriptions and --line-type arguments are mutually exclusive, as line-level transcriptions are ignored when exporting only page-level transcriptions.
arkindex export $PATH/TO/database.sqlite pdf --folder-type volume --page-type folio
You can restrict the PDF creation to only part of your export/project using the --folder-ids argument; the command will only create PDF files from the folder elements whose IDs were given using this argument, ignoring the others.
arkindex export $PATH/TO/database.sqlite pdf --folder-ids $FOLDER_ID_1 $FOLDER_ID_2 $FOLDER_ID_3
The --debug flag makes both the transcription text and the bounding boxes visible on the PDF pages, which can be useful for testing the export itself or for transcription troubleshooting.
Example PDF
The following command
arkindex export './demo-book-of-hours-20220524-104657.sqlite' pdf --folder-ids 6661cc31-c437-4a35-8fd5-e34a0d3a638e
generated this PDF from this volume in the sqlite export of the Demo | Book of Hours project on demo.arkindex.org (only the first 10 pages were preserved in the PDF).
ALTO XML export
The alto subcommand creates Alto XML files from the sqlite export of an Arkindex project.
Basic usage
arkindex export $PATH/TO/database.sqlite alto --output $PATH/TO/FOLDER
This command exports the entire project into Alto XML files. One directory in the specified output directory is created for each folder, named after the folder's UUID. One file is created for each page in each folder, and is named after the page's UUID. The files include <TextLine> nodes for each transcription found in a text_line element, and use <Processing> nodes to store the worker versions associated with the elements and transcriptions.
The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the alto subcommand and its options. You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.
Optional METS file
You can generate a METS file alongside your Alto XML files, linking the generated files (and their paths) to their corresponding images (with their IIIF URLs) using the --mets flag. One METS file is generated per exported folder element, and it is saved as mets_entrypoint.xml in the corresponding folder, with the Alto XML files.
arkindex export $PATH/TO/database.sqlite alto --output $PATH/TO/FOLDER --mets
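To check the result of an export, the output tree described in this section can be walked with the Python standard library. This is a minimal sketch, not part of the Arkindex CLI: the function names are hypothetical, the extension of the page files is not assumed, and elements are matched on their local name because the ALTO namespace version used by the export is not stated here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def count_text_lines(xml_data):
    """Count the <TextLine> nodes in one ALTO XML document (str or bytes)."""
    root = ET.fromstring(xml_data)
    return sum(1 for node in root.iter() if local_name(node.tag) == "TextLine")


def summarize_export(output_dir):
    """Map each folder directory name to {page file name: TextLine count}."""
    summary = {}
    for folder in sorted(Path(output_dir).iterdir()):
        if not folder.is_dir():
            continue
        pages = {}
        for page_file in sorted(folder.iterdir()):
            # Skip the optional METS entry point, which is not an ALTO file.
            if page_file.is_file() and page_file.name != "mets_entrypoint.xml":
                pages[page_file.name] = count_text_lines(page_file.read_bytes())
        summary[folder.name] = pages
    return summary
```

Calling summarize_export on the directory passed to --output gives a quick overview of how many text lines each exported page contains.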
Alto XML options
You can change the element types used to build the Alto XML files using the following arguments:
- --folder-type: specify the type of the elements, containing pages, for which folders will be created in the output directory; defaults to folder.
- --page-type: specify the type of the elements from which Alto XML files will be created; defaults to page.
- --line-type: specify the type of the elements containing transcriptions, from which <TextLine> nodes will be created; defaults to text_line.
- --layout-tag: an optional argument, which allows you to specify one type of elements (without transcriptions) to export along with the line-type elements. The elements are exported as GraphicalElement nodes in the Alto XML files.
arkindex export $PATH/TO/database.sqlite alto --line-type print_line --layout-tag barcode
You can restrict the Alto XML conversion to only part of your export/project using the --folder-ids argument; the command will only create Alto XML files from the page elements contained in the folder elements whose IDs were given using this argument, ignoring the others.
arkindex export $PATH/TO/database.sqlite alto --folder-ids $FOLDER_ID_1 $FOLDER_ID_2 $FOLDER_ID_3
CSV export
The csv subcommand creates a CSV file from the sqlite export of an Arkindex project.
Basic usage
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv
This command creates a CSV file with one line for each element contained in the exported project.
The only required argument is the path to the sqlite export, which is a positional argument. ⚠️ This path comes before the csv subcommand and its options. You can specify the path to the output file using the --output argument; if unspecified, it defaults to an elements.csv file created in the current working directory.
CSV options
The csv export subcommand can take the following arguments:
- --parent: only export to CSV the elements which are the children of the specified element. If you want to get these elements recursively, and not only the direct children of that parent, you need to use it in conjunction with the --recursive flag.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --parent $ELEMENT_ID
- --recursive: this flag can only be used in conjunction with the --parent argument. If no parent element ID is specified, the export is recursive by default.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --parent $ELEMENT_ID --recursive
- --type: restrict the export to elements of a given type.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --type $ELEMENT_TYPE
- --field: restrict the CSV columns to the given fields (regular expressions are supported for field names).
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --field $FIELD_1 $FIELD_2
Classifications
- --with-classes: add the exported elements' classifications to the output CSV. In the CSV file, there will be one column per class, filled with the classification confidence on the corresponding line if the class is set on the element, and left empty if it is not.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-classes
- --classification-worker-version: only export classifications created by a given worker version. To export classifications created manually, set it to manual instead of a worker version UUID.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-classes --classification-worker-version $UUID
Metadata
- --with-metadata: add the exported elements' metadata to the output CSV. As the same metadata can be set multiple times with different values on one element, there can be more than one column for each metadata in the CSV file; in that case, the metadata columns are named like this: {metadata_name}_1, {metadata_name}_2… On each line the metadata column is filled with the corresponding metadata value if that metadata is set on the element, or left blank.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-metadata
- --with-parent-metadata: complementary to --with-metadata. Load metadata from the element and all of its parents, recursively. One column is used per metadata, following the same method as explained above for the --with-metadata option.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-metadata --with-parent-metadata
Entities
- --with-entities: add the entities attached to the exported elements (through their transcriptions) to the output CSV. In the CSV file, there will be one column (or more, see the CSV output subsection) per entity type present in the exported data, filled with the entity name/value on the element's line if it is present on the element, and the cell left blank if it is not. If there are multiple entities attached to an element, they are concatenated together with a ; separator.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-entities
- --entities-worker-version: only export the entities placed on the transcription by a given worker version. If you want to export entities created manually, set this parameter to manual instead of a worker version UUID.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-entities --entities-worker-version $UUID
CSV output
By default, the output CSV contains the following columns/information:
- the id, name and type of the element (columns id, name and type);
- the id and url of the element's image (columns image_id and image_url);
- the element's polygon (polygon);
- the id of the worker version that created this element, if it was created by a worker; if the element was not created by a worker this column is left blank (worker_version_id);
- the element's creation date (created).
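As an illustration of how the default columns can be consumed, here is a minimal Python sketch that counts the exported elements per type. count_by_type is a hypothetical helper; it only relies on the type column listed above and accepts any iterable of CSV lines, such as an open elements.csv file. The sample rows are made up for the demonstration.

```python
import csv
from collections import Counter


def count_by_type(csv_lines):
    """Count the rows of an elements CSV export for each value of `type`."""
    reader = csv.DictReader(csv_lines)
    return Counter(row["type"] for row in reader)


if __name__ == "__main__":
    # Made-up rows using the default columns; real data comes from
    # `arkindex export ... csv`.
    sample = [
        "id,name,type,image_id,image_url,polygon,worker_version_id,created\n",
        "a1,Volume 1,folder,,,,,\n",
        "b2,Page 1,page,i1,https://example.org/i1,,,\n",
        "c3,Page 2,page,i2,https://example.org/i2,,,\n",
    ]
    print(count_by_type(sample))
```

With a real export, open the file with open("elements.csv", newline="", encoding="utf-8") and pass the file object to count_by_type.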
If using the --with-classes flag, one column per classification that is present on at least one of the exported elements is added. These columns contain the classification confidence for these classes on each element (left blank if the class is not present on the element).
If using the --with-metadata flag, at least one column per metadata that is present on at least one of the exported elements is added. If some metadata are set multiple times with different values on some elements, then there are multiple columns for these metadata, named like this: {metadata_name}_1, {metadata_name}_2… These columns contain the metadata value(s) for these metadata on each element (left blank if the metadata is not present on the element).
If using the --with-entities flag, at least one column per entity type for which an entity is present on at least one of the exported elements is added. These columns contain the name/value of the exported entities. If multiple entities of the same type are set on at least one of the exported elements, then there are multiple columns for each of these entities, named like this: entity_{type_name}_1, entity_{type_name}_2…
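Because repeated metadata and entities are spread over numbered columns, reading them back usually means regrouping those columns. The following Python sketch shows one way to do it, assuming the naming pattern described above ({name}_1, {name}_2, and so on). collect_values is a hypothetical helper, and a column whose real name happens to end in an underscore followed by a number would be grouped as well.

```python
import re

# Matches column names such as "date_1" or "entity_person_2".
NUMBERED = re.compile(r"^(?P<base>.+)_(?P<index>\d+)$")


def collect_values(row, base_name):
    """Return the non-empty values of `base_name` and its numbered columns.

    `row` maps column names to cell values, as produced by csv.DictReader.
    Values are returned in column-index order.
    """
    found = []
    for column, value in row.items():
        if not value:
            continue  # blank cell: the metadata or entity is not set
        if column == base_name:
            found.append((0, value))
        else:
            match = NUMBERED.match(column)
            if match and match.group("base") == base_name:
                found.append((int(match.group("index")), value))
    return [value for _, value in sorted(found)]
```

For example, collect_values(row, "entity_person") gathers the values of entity_person_1, entity_person_2 and so on for one exported element.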
Entities export
The entities subcommand creates an export of the transcription entities of an Arkindex project, from its exported sqlite database.
Usage
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com --output $PATH/TO/FOLDER/transcription_entities.csv
This command generates a CSV file with one line per transcription entity in the project.
- ⚠️ The path to the sqlite export is a positional argument, and comes before the entities subcommand and its options.
- --instance-url is a required argument, necessary to build working URLs to the transcription entities' parent elements, as the instance URL information is not present in the exported database. ⚠️ Enter it carefully, to avoid creating invalid URLs.
- You can specify the path to an output file using the --output argument; if unspecified, the data is written to stdout. This output can also be redirected to a file without using the --output argument, like this:
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com > some_file.csv
CSV output
The output CSV contains the following columns/information:
- entity_id: the ID of the entity the transcription entity links to;
- entity_value: the name/value of that entity (for example, a person name);
- entity_type: this entity's type;
- entity_metas: additional data from the entity's metas field;
- offset and length: the position and length of the entity in its parent transcription;
- transcription_id: the ID of the parent transcription;
- element_id and element_url: the ID and URL of the transcription's element.