Export command¶
The export subcommands allow you to convert an sqlite export of an Arkindex project to other formats.
There are only two ways to export data from Arkindex:
- using the sqlite export, which is the most efficient and recommended approach;
- making API calls to retrieve the data you want.
There are no other formats for Arkindex exports, which is why the export subcommands exist to transform sqlite exports.
PDF export¶
The pdf subcommand creates PDF files from the sqlite export of an Arkindex project.
PDF export specific requirements¶
For the PDF export to work, the Arkindex CLI should be installed with some extra dependencies, by running the following command:
pip install arkindex-cli[export]
If you have installed the Arkindex CLI already, you can run this command anyway and it will only install the missing PDF export related dependencies.
Basic usage¶
arkindex export $PATH/TO/database.sqlite pdf --output $PATH/TO/FOLDER
This will export the entire project into PDF files named after each folder element found in the SQLite database. Each PDF will have one page for each page element, and the transcription of each text_line element found recursively in the page will be added so that the text becomes searchable.
The only required argument is the path to the sqlite export, which is a positional argument.
Path to the export
This path comes before the pdf subcommand and its options.
You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.
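When the export is scripted, the argument order described above (database path first, then the subcommand and its options) can be captured in a small helper. The Python sketch below is only an illustration: build_export_command is a hypothetical helper name, not part of the Arkindex CLI, and it assembles the argument list without running anything.

```python
from typing import List, Optional


def build_export_command(
    database: str, subcommand: str, output: Optional[str] = None
) -> List[str]:
    """Assemble an `arkindex export` argument list (hypothetical helper).

    The positional database path goes before the subcommand. --output is
    only appended when given, so the CLI default applies otherwise.
    """
    command = ["arkindex", "export", database, subcommand]
    if output is not None:
        command += ["--output", output]
    return command


pdf_command = build_export_command("database.sqlite", "pdf", output="pdfs")
print(" ".join(pdf_command))
```

Once the CLI is installed, the resulting list could be passed to subprocess.run(pdf_command, check=True).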
PDF options¶
You can change the element types used to build the PDF export using the following arguments:
- --folder-type: specify the type of the elements, containing pages, for which PDF files will be created; defaults to folder.
- --page-type: specify the type of the elements from which PDF pages will be created; defaults to page.
- --line-type: specify the type of the elements containing transcriptions; defaults to text_line.
- --use-page-transcriptions: only export the transcriptions of the specified page-type elements. Defaults to False.
- --order-by-name: order elements to export by their name instead of their internal position on Arkindex. Defaults to False.
- --transcription-worker-version: only export transcriptions created by a given worker version.
- --name-pdf-with-id: name exported PDF files after the folder's Arkindex ID instead of its name. Defaults to False.
The --use-page-transcriptions and --line-type arguments are mutually exclusive, as line-level transcriptions are ignored when exporting only page-level transcriptions.
Info
If multiple folders have the same name in your project (e.g. you have two test folders to export), you should definitely use the --name-pdf-with-id option to prevent output files from overwriting each other.
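To decide whether --name-pdf-with-id is needed, it can help to check for repeated folder names before exporting. The Python sketch below works on a plain list of folder names; how that list is obtained is left out, and find_duplicate_names is a hypothetical helper rather than something the CLI provides.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_names(folder_names: Iterable[str]) -> List[str]:
    """Return the folder names that occur more than once, sorted alphabetically."""
    counts = Counter(folder_names)
    return sorted(name for name, count in counts.items() if count > 1)


# Illustrative folder names, not taken from a real project.
names = ["test", "register_1", "test", "register_2"]
duplicates = find_duplicate_names(names)
if duplicates:
    print("Use --name-pdf-with-id; duplicated folder names:", duplicates)
```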
arkindex export $PATH/TO/database.sqlite pdf --folder-type volume --page-type folio
You can restrict the PDF creation to only part of your export/project using the --folder-ids argument; the command will only create PDF files from the folder elements whose IDs were given using this argument, ignoring the others.
arkindex export $PATH/TO/database.sqlite pdf --folder-ids $FOLDER_ID_1 $FOLDER_ID_2 $FOLDER_ID_3
The --debug flag makes both the transcription text and bounding boxes visible on the PDF pages, which can be useful for testing the export itself or for transcription troubleshooting.
Example PDF¶
The following command
arkindex export './demo-book-of-hours-20220524-104657.sqlite' pdf --folder-ids 6661cc31-c437-4a35-8fd5-e34a0d3a638e
generated this PDF from this volume in the sqlite export of the Demo | Book of Hours project on demo.arkindex.org (only the first 10 pages were preserved in the PDF).
ALTO XML export¶
The alto subcommand creates Alto XML files from the sqlite export of an Arkindex project.
Basic usage¶
arkindex export $PATH/TO/database.sqlite alto --output $PATH/TO/FOLDER
This command exports the entire project into Alto XML files. One directory is created in the specified output directory for each folder, named after the folder’s UUID. One file is created for each page in each folder, and is named after the page’s UUID. The files include <TextLine> nodes for each transcription found in a text_line element, and use <Processing> nodes to store the worker versions associated with the elements and transcriptions.
The only required argument is the path to the sqlite export, which is a positional argument.
Path to the export
This path comes before the alto subcommand and its options.
You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.
Optional METS file¶
You can generate a METS file alongside your Alto XML files, linking the generated files (and their paths) to their corresponding images (with their IIIF URLs) using the --mets flag. One METS file is generated per exported folder element, and it is saved as mets_entrypoint.xml in the corresponding folder, with the Alto XML files.
arkindex export $PATH/TO/database.sqlite alto --output $PATH/TO/FOLDER --mets
Alto XML options¶
You can change the element types used to build the Alto XML files using the following arguments:
- --folder-type: specify the type of the elements, containing pages, for which folders will be created in the output directory; defaults to folder.
- --page-type: specify the type of the elements from which Alto XML files will be created; defaults to page.
- --line-type: specify the type of the elements containing transcriptions, from which <TextLine> nodes will be created; defaults to text_line.
- --layout-tag: an optional argument, which allows you to specify one type of elements (without transcriptions) to export along with the line-type elements. The elements are exported as GraphicalElement nodes in the Alto XML files.
arkindex export $PATH/TO/database.sqlite alto --line-type print_line --layout-tag barcode
You can restrict the Alto XML conversion to only part of your export/project using the --folder-ids argument; the command will only create Alto XML files from the page elements contained in the folder elements whose IDs were given using this argument, ignoring the others.
arkindex export $PATH/TO/database.sqlite alto --folder-ids $FOLDER_ID_1 $FOLDER_ID_2 $FOLDER_ID_3
CSV export¶
The csv subcommand creates a CSV file from the sqlite export of an Arkindex project.
Basic usage¶
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv
This command creates a CSV file with one line for each element contained in the exported project.
The only required argument is the path to the sqlite export, which is a positional argument.
Path to the export
This path comes before the csv subcommand and its options.
You can specify the path to the output file using the --output argument; if unspecified, it defaults to an elements.csv file created in the current working directory.
CSV options¶
The csv export subcommand can take the following arguments:
--parent: only export to CSV the elements which are the children of the specified element. If you want to get these elements recursively, and not only the direct children of that parent, you need to use it in conjunction with the --recursive flag.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --parent $ELEMENT_ID
--recursive: this flag can only be used in conjunction with the --parent argument. If no parent element ID is specified, the export is recursive by default.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --parent $ELEMENT_ID --recursive
--type: restrict the export to elements of a given type.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --type $ELEMENT_TYPE
--field: restrict the CSV columns to the given fields (supports Unix shell-style wildcards, see documentation).
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --field $FIELD_1 $FIELD_2
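As an illustration of what Unix shell-style wildcards select, the Python sketch below filters a list of column names with the standard fnmatch module. It assumes that --field patterns behave like fnmatch patterns, which this page does not state explicitly; the column names are a subset of the default CSV columns, and select_fields is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import List


def select_fields(columns: List[str], patterns: List[str]) -> List[str]:
    """Keep the columns matching at least one shell-style pattern, in original order."""
    return [
        column
        for column in columns
        if any(fnmatchcase(column, pattern) for pattern in patterns)
    ]


columns = ["id", "name", "type", "image_id", "image_url", "polygon", "created"]
# Assumption: this mirrors what `--field id 'image_*'` would keep.
print(select_fields(columns, ["id", "image_*"]))
# prints ['id', 'image_id', 'image_url']
```

When passing wildcards on a real command line, quoting the pattern keeps the shell from expanding it against local file names.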
Classifications¶
--with-classes: add the exported elements’ classifications to the output CSV. In the CSV file, there will be one column per class, filled with the classification confidence on the corresponding line if the class is set on the element, and left empty if it is not.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-classes
--classification-worker-version: only export classifications created by a given worker version. To export classifications created manually, set it to manual instead of a worker version UUID.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-classes --classification-worker-version $UUID
Metadata¶
--with-metadata: add the exported elements’ metadata to the output CSV. As the same metadata can be set multiple times with different values on one element, there can be more than one column for each metadata in the CSV file. On each line the metadata column is filled with the corresponding metadata value if that metadata is set on the element, or left blank. See the CSV output subsection for details.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-metadata
--with-parent-metadata: complementary to --with-metadata. Load metadata from the element and all of its parents, recursively. At least one column is added per metadata, following the same method as explained above for the --with-metadata option.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-metadata --with-parent-metadata
Entities¶
--with-entities: add the entities attached to the exported elements (through their transcriptions) to the output CSV. In the CSV file, there will be one column (or more, see the CSV output subsection) per entity type present in the exported data, filled with the entity name/value on the element’s line if it is present on the element, and the cell left blank if it is not.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-entities
--entities-worker-version: only export the entities placed on the transcription by a given worker version. If you want to export entities created manually, set this parameter to manual instead of a worker version UUID.
arkindex export $PATH/TO/database.sqlite csv --output $PATH/TO/FOLDER/elements.csv --with-entities --entities-worker-version $UUID
CSV output¶
By default, the output CSV contains the following columns/information:
- the id, name and type of the element (columns id, name and type);
- the id and url of the element’s image (columns image_id and image_url);
- the element’s rotation angle and polygon (rotation_angle and polygon);
- the id of the worker version that created this element, if it was created by a worker; if the element was not created by a worker this column is left blank (worker_version_id);
- the element’s creation date (created).
Classifications¶
If using the --with-classes flag, one column per classification that is present on at least one of the exported elements is added. The classification columns are prefixed with classification_. These columns contain the classification confidence for these classes on each element (left blank if the class is not present on the element).
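One possible way to consume these columns, sketched in Python with the standard csv module: for each row, keep the classification_ column with the highest confidence. The sample data, the class names and the best_class helper are all hypothetical, and the sketch assumes confidences are written as decimal numbers.

```python
import csv
import io
from typing import Dict, Optional

PREFIX = "classification_"


def best_class(row: Dict[str, str]) -> Optional[str]:
    """Return the class with the highest confidence on a row, or None if no class is set."""
    confidences = {
        column[len(PREFIX):]: float(value)
        for column, value in row.items()
        if column.startswith(PREFIX) and value
    }
    if not confidences:
        return None
    return max(confidences, key=confidences.get)


# Made-up excerpt of an export created with --with-classes.
sample = io.StringIO(
    "id,name,classification_cat,classification_dog\n"
    "element_id_1,element_name_1,0.9,0.4\n"
    "element_id_2,element_name_2,,\n"
)
for row in csv.DictReader(sample):
    print(row["id"], best_class(row))
```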
Metadata¶
If using the --with-metadata flag, at least one column per metadata that is present on at least one of the exported elements is added. The metadata columns in the CSV file are prefixed with metadata_. If some metadata are set multiple times with different values on some elements, then there are multiple columns for these metadata. These columns contain the metadata value(s) for these metadata on each element (left blank if the metadata is not present on the element).
The number of columns for one metadata in the exported CSV is the maximum number of times that metadata appears on a single element, amongst all the exported elements, and according to the export options.
For example, if 3 elements are exported, and their metadata are the following:
- {“fruit”: “apple”, “fruit”: “apricot”}
- {“vegetable”: “potato”}
- {“fruit”: “apple”, “fruit”: “apricot”, “fruit”: “banana”, “vegetable”: “artichoke”}
then in the output CSV, there will be 3 “metadata_fruit” columns, and 1 “metadata_vegetable” column. The lines for these elements in the CSV will look like this:
id,name,[...],metadata_fruit,metadata_fruit,metadata_fruit,metadata_vegetable
element_id_1,element_name_1,[...],apple,apricot,,
element_id_2,element_name_2,[...],,,,potato
element_id_3,element_name_3,[...],apple,apricot,banana,artichoke
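Because the same column name can be repeated, readers that map each header to a single value (such as Python's csv.DictReader) keep only one of the repeated columns. Below is a minimal sketch of one way around this, using csv.reader and grouping the non-empty values by header. read_grouped is a hypothetical helper, and the data is the example above with the [...] columns left out.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterator, List, TextIO


def read_grouped(handle: TextIO) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per row, mapping each header to the list of its non-empty values."""
    reader = csv.reader(handle)
    header = next(reader)
    for row in reader:
        grouped: Dict[str, List[str]] = defaultdict(list)
        for column, value in zip(header, row):
            if value != "":
                grouped[column].append(value)
        yield dict(grouped)


sample = io.StringIO(
    "id,name,metadata_fruit,metadata_fruit,metadata_fruit,metadata_vegetable\n"
    "element_id_1,element_name_1,apple,apricot,,\n"
    "element_id_2,element_name_2,,,,potato\n"
    "element_id_3,element_name_3,apple,apricot,banana,artichoke\n"
)
for element in read_grouped(sample):
    print(element["id"][0], element.get("metadata_fruit", []))
```

Columns that are blank on a row are simply absent from that row's dict, which is why the loop uses element.get with a default.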
Entities¶
If using the --with-entities flag, at least one column per entity type for which an entity is present on at least one of the exported elements is added. The entity columns are prefixed with entity_. These columns contain the name/value of the exported entities. If multiple entities of the same type are set on at least one of the exported elements, then there are multiple columns for that entity type.
The number of columns for one entity type in the exported CSV is the maximum number of times that entity type appears on a single element, amongst all the exported elements, and according to the export options.
For example, if 3 elements are exported, and their exported entities are the following:
- {“city”: “new york”, “city”: “chicago”, “year”: 1986}
- {“state”: “illinois”, “year”: 2003, “year”: 2005, “year”: 2020}
- {“city”: “paris”, “state”: “texas”}
then in the output CSV, there will be 2 “entity_city” columns, 1 “entity_state” column and 3 “entity_year” columns. The lines for these elements in the CSV will look like this:
id,name,[...],entity_city,entity_city,entity_state,entity_year,entity_year,entity_year
element_id_1,element_name_1,[...],new york,chicago,,1986,,
element_id_2,element_name_2,[...],,,illinois,2003,2005,2020
element_id_3,element_name_3,[...],paris,,texas,,,
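The column-count rule can be reproduced with a few lines of Python. In this sketch each element is a list of (type, value) pairs, which is only a convenient stand-in for the example above and not the CLI's internal representation; column_counts is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple

Element = List[Tuple[str, object]]


def column_counts(elements: List[Element]) -> Dict[str, int]:
    """For each type, return the maximum number of occurrences on a single element."""
    maximums: Dict[str, int] = {}
    for element in elements:
        per_element = Counter(entity_type for entity_type, _ in element)
        for entity_type, count in per_element.items():
            maximums[entity_type] = max(maximums.get(entity_type, 0), count)
    return maximums


elements = [
    [("city", "new york"), ("city", "chicago"), ("year", 1986)],
    [("state", "illinois"), ("year", 2003), ("year", 2005), ("year", 2020)],
    [("city", "paris"), ("state", "texas")],
]
print(column_counts(elements))
```

For the example above this gives 2 for city, 1 for state and 3 for year, matching the number of entity_ columns in the CSV header.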
Entities only export¶
If you want to export only the transcription entities of an Arkindex project from its exported sqlite database, then you must use the entities subcommand instead of the csv one.
Basic usage¶
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com --output $PATH/TO/FOLDER/transcription_entities.csv
This command generates a CSV file with one line per transcription entity in the project.
This command takes two required arguments:
- The path to the sqlite export, which is a positional argument.
Path to the export
This path comes before the entities subcommand and its options.
- --instance-url, which is necessary to build working URLs to the transcription entities’ parent elements, as the instance URL information is not present in the exported database. Enter it carefully, to avoid creating invalid URLs.
Optional arguments¶
- --output: specify the path to an output file. If unspecified, the data is written to stdout. This output can also be redirected to a file without using the --output argument, like this:
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com > some_file.csv
- --type: specify the type of the elements from which to export entities. For example, if you want to export entities only from transcriptions on page type elements, use the following command:
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com --type page
- --worker-version-id: specify one or more worker version UUIDs, so that only the entities produced by these worker versions are exported. If an element has entities produced by more than one of the specified worker versions, the order in which you have specified the worker version UUIDs is used as a preference order. You can also specify manual using this argument, to only export manually produced entities.
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com --worker-version-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb
arkindex export $PATH/TO/database.sqlite entities --instance-url http://arkindex.teklia.com --worker-version-id manual
CSV output¶
The output CSV contains the following columns/information:
- entity_id: the ID of the entity the transcription entity links to;
- entity_value: the name/value of that entity (for example, a person name);
- entity_type: this entity’s type;
- confidence: the transcription entity’s confidence;
- entity_metas: additional data from the entity’s metas field;
- offset and length: the position and length of the entity in its parent transcription;
- transcription_id: the ID of the parent transcription;
- element_id and element_url: the ID and URL of the transcription’s element.
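The offset and length columns can be used to locate an entity inside its parent transcription. The Python sketch below assumes that offset is a zero-based character index into the transcription text, which this page does not specify. The transcription text itself is not part of this CSV and is supplied by hand here, and entity_span is a hypothetical helper.

```python
def entity_span(transcription_text: str, offset: int, length: int) -> str:
    """Return the part of the transcription covered by an entity.

    Assumes a zero-based character offset (an assumption, not confirmed here).
    """
    if offset < 0 or length < 0 or offset + length > len(transcription_text):
        raise ValueError("entity position falls outside the transcription")
    return transcription_text[offset:offset + length]


# Illustrative values only.
text = "Born in Chicago in 1986"
print(entity_span(text, offset=8, length=7))
# prints Chicago
```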
DOCX export¶
The docx subcommand creates DOCX files from the sqlite export of an Arkindex project.
Basic usage¶
arkindex export $PATH/TO/database.sqlite docx --output $PATH/TO/FOLDER
This command creates one DOCX file for each page element in the export, containing this page’s transcription. The DOCX files are saved in the specified output folder.
The only required argument is the path to the sqlite export, which is a positional argument.
Path to the export
This path comes before the docx subcommand and its options.
You can specify the path to the output folder using the --output argument; if unspecified, it defaults to a docx folder created in the current working directory.
DOCX options¶
Folders to export¶
You can specify which folders to export using two mutually exclusive arguments:
--folder-type: only export folders of the specified type.
arkindex export $PATH/TO/database.sqlite docx --folder-type carton
--folder-ids: only export the specified folders (one or more UUIDs).
arkindex export $PATH/TO/database.sqlite docx --folder-ids aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb
If none of these arguments is used, then element transcriptions are exported from all the folders in the export.
Elements to export¶
--element-type: specify the type of the elements to export. Defaults to page.
arkindex export $PATH/TO/database.sqlite docx --element-type single_page
Transcriptions to export¶
--line-type: specify the type of the elements from which to export the transcriptions. If unspecified, then the transcriptions are exported from the elements defined by the element-type parameter. For example, if you want to export single_page elements and there are transcriptions on these elements themselves, then you don’t need to specify a line-type. However, if the transcriptions you want to retrieve for the single pages can be found on their child text_line elements, then you should use --line-type text_line: all the transcriptions from the child text_line elements will be concatenated and exported in a file for each parent single_page.
arkindex export $PATH/TO/database.sqlite docx --element-type single_page --line-type text_line
You can also filter the transcriptions to be exported by source using the following mutually exclusive parameters:
--worker-run-id: only export transcriptions created by a given worker run (UUID) or manual transcriptions (manual).
arkindex export $PATH/TO/database.sqlite docx --worker-run-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
--worker-version-id: only export transcriptions created by a given worker version (UUID) or manual transcriptions (manual).
arkindex export $PATH/TO/database.sqlite docx --worker-version-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
Warning
If the parameters you have specified return more than one transcription for a given element, the export will fail.
Merge files in folders¶
--merge: use this flag to create one DOCX file per folder, instead of one file per page/element-type element.
arkindex export $PATH/TO/database.sqlite docx --merge
DOCX output¶
The created DOCX files contain the exported element’s name as a header, and the transcription below. If using the --merge flag, then the created DOCX files contain each exported element’s name as a header and their transcription below, one after the other.
PageXML export¶
The pagexml subcommand creates PageXML files from the sqlite export of an Arkindex project.
Basic usage¶
arkindex export $PATH/TO/database.sqlite pagexml --output $PATH/TO/FOLDER
This command exports all the page elements in the project into PageXML files. One file is created for each page in the output directory, and is named after the page’s UUID. The files include <TextRegion> nodes filled with transcriptions, which are wrapped in a <TextEquiv> node. Those are from:
- either a single text_line,
- or multiple text_line elements grouped in a paragraph, which are themselves wrapped in <TextLine> nodes.
The only required argument is the path to the sqlite export, which is a positional argument.
Path to the export
This path comes before the pagexml subcommand and its options.
You can specify the path to the output directory using the --output argument; if unspecified, it defaults to the current working directory.
PageXML options¶
You can change the elements types used to build the PageXML files using the following arguments:
- --page-type: specify the type of the elements from which PageXML files will be created; defaults to page.
- --paragraph-type: if set, specifies the type of the elements used to group text_line elements into paragraphs.
- --line-type: specify the type of the elements containing transcriptions, from which <TextEquiv> nodes will be created; defaults to text_line.
- --transcription-source: only export transcriptions created by a given worker run (UUID) or manual transcriptions (manual).
arkindex export $PATH/TO/database.sqlite pagexml \
--line-type line \
--paragraph-type paragraph \
--page-type single_page
You can restrict the PageXML conversion to only part of your export/project using the --parent argument; the command will only create PageXML files from the page elements contained in the parent whose ID was given using this argument, ignoring the others.
arkindex export $PATH/TO/database.sqlite pagexml --parent $FOLDER_ID