Upload commands¶
The upload subcommands allow you to import data to Arkindex.
Images stored on an S3-compatible bucket (MinIO)¶
The minio subcommand generates IIIF image URLs for images that are stored on a given S3-compatible (AWS, MinIO, Ceph…) bucket. These URLs can then be uploaded to Arkindex using the IIIF import subcommand.
arkindex upload minio -b $BUCKET_NAME
If there are multiple folders on the target bucket, the subcommand will output one file per folder, containing the URLs for the images in this folder.
Authentication¶
Before running the minio subcommand, you need to authenticate with credentials for an account that has access to the bucket you’re targeting. This authentication is done through environment variables.
export MINIO_ACCESS_KEY=$YOUR_ACCESS_KEY
export MINIO_SECRET_KEY=$YOUR_SECRET_KEY
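It can help to verify that both variables are actually set before calling the subcommand. The following is a minimal sketch; the check_minio_credentials helper is hypothetical and not part of the arkindex CLI.

```shell
# Hypothetical helper: return 0 only when both MinIO credential
# variables are set and non-empty.
check_minio_credentials() {
    missing=0
    for name in MINIO_ACCESS_KEY MINIO_SECRET_KEY; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_minio_credentials && arkindex upload minio -b "$BUCKET_NAME"
```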
Required arguments¶
The only required argument for the minio upload subcommand is the name of the targeted bucket, provided using -b or --bucket-name.
Optional arguments and default parameters¶
Other arguments are not required, but have default values that are used by the subcommand.
--iiif-server: the URL of the IIIF server through which the images on the bucket are exposed. By default, the IIIF server used is https://europe-gamma.iiif.teklia.com/iiif/2/. Example usage:
arkindex upload minio -b $BUCKET_NAME --iiif-server https://some.iiif-server.com/iiif/
--minio-url: the URL of the server on which the target bucket is located. By default, the server URL is ceph.iiif.teklia.com. Example usage:
arkindex upload minio -b $BUCKET_NAME --minio-url a-storage-server.domain.com
--out-dir: the path to the output directory where the lists of IIIF URLs will be created. By default, an iiif_urls_output folder is created in the current working directory. Example usage:
arkindex upload minio -b $BUCKET_NAME --out-dir $PATH/TO/FOLDER/
--prefix: the path to the location of the files, if you do not wish to list all the files inside a bucket. This path does not include the bucket name, as it is already provided by the --bucket-name argument. In order to list the files located in BUCKET_NAME/folder/subfolder, you must use the following command:
arkindex upload minio -b $BUCKET_NAME --prefix $folder/$subfolder
Shortcut command for importing images from multiple files¶
Before using this command, you might want to read the IIIF import subcommand documentation to be aware of all the available options.
corpus_id="<CORPUS ID>"
cd iiif_urls_output/
for file in *; do
    echo "$file"
    # Name each import folder after its URL list file, without the .txt extension
    folder_name=$(basename "$file" .txt)
    arkindex -p demo upload iiif-images "$file" --corpus-id "$corpus_id" --import-folder-name "$folder_name"
done
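Before launching the loop on a large output directory, you may want to preview what it would import. The sketch below uploads nothing; preview_imports is a hypothetical helper, and it assumes that each file holds one URL per line.

```shell
# Hypothetical helper: for each URL list file in a directory, print the
# import folder name the loop would use and the number of lines (URLs).
preview_imports() {
    for file in "$1"/*; do
        [ -f "$file" ] || continue
        folder_name=$(basename "$file" .txt)
        # grep -c '' counts lines, including a last line without a newline
        url_count=$(grep -c '' "$file" || true)
        printf '%s: %s URLs\n' "$folder_name" "$url_count"
    done
}

# Usage: preview_imports iiif_urls_output
```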
Page XML documents¶
The pagexml subcommand allows you to upload elements and transcriptions from Page XML documents to their corresponding Page elements on Arkindex, as long as the imageFilename attribute in the Page XML files matches the name of the Arkindex Page elements.
arkindex upload pagexml --xml-path $PATH/TO/FOLDER --parent $ARKINDEX_ELEMENT_ID
This command takes three arguments:
- --worker-run-id: the ID of a worker run to publish children elements, transcriptions and metadata using bulk endpoints.
- --xml-path: either the path to a folder containing the Page XML files, or the path to a file containing a list of the paths to the XML files, one path per line.
arkindex upload pagexml --xml-path $PATH/TO/paths_file.txt --parent $ARKINDEX_ELEMENT_ID
Content of paths_file.txt:
/PATH/TO/FOLDER/filename_1.xml
/PATH/TO/ANOTHER_FOLDER/filename_2.xml
/PATH/TO/ANOTHER_FOLDER/filename_3.xml
...
- --parent: the ID of a folder-type Arkindex element, which contains the target Arkindex Page elements onto which the elements and transcriptions will be uploaded.
arkindex upload pagexml --worker-run-id $ARKINDEX_WORKER_RUN_ID --xml-path $PATH/TO/FOLDER --parent $ARKINDEX_ELEMENT_ID
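When the Page XML files are spread over several folders, the list file accepted by --xml-path can be generated rather than written by hand. This is a minimal sketch using find; build_paths_file is a hypothetical helper name, and the paths are absolute only if the folders are passed as absolute paths.

```shell
# Hypothetical helper: write the paths of all .xml files found under the
# given folders to a list file, one path per line.
# Usage: build_paths_file OUTPUT_FILE FOLDER...
build_paths_file() {
    out=$1
    shift
    find "$@" -type f -name '*.xml' | sort > "$out"
}

# Usage: build_paths_file paths_file.txt /PATH/TO/FOLDER /PATH/TO/ANOTHER_FOLDER
```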
Alto XML documents¶
There are two options to upload Alto XML documents:
Alto XML documents with images that are on a normal IIIF server¶
The alto subcommand allows you to upload images, elements, metadata and transcriptions from Alto XML documents to Arkindex, as long as the images are available from a IIIF server and the image filenames on this server match either:
- the content of the Description::sourceImageInformation::fileIdentifier node in the corresponding Alto XML files,
- or the content of the Description::sourceImageInformation::fileName node in the corresponding Alto XML files.
If the images have already been imported into Arkindex, they will be retrieved and used to create new Page elements.
arkindex upload alto --iiif-base-url http://some-server.domain.com/iiif/folder-name/ --parent-id $ARKINDEX_ELEMENT_ID --create-types
A limited subset of ALTO 1.4 documents is officially supported:
- The Alto XML documents must have their MeasurementUnit set to pixel.
- Shapes are not supported; only the HPOS, VPOS, WIDTH and HEIGHT attributes are used to build rectangles.
- String elements within other nodes are only imported as transcriptions for these nodes, not as elements; their HPOS, VPOS, WIDTH and HEIGHT attributes are ignored, and only the CONTENT attribute is used.
Importing Alto XML files describing multiple pages/images is supported, as long as each Page node has a PHYSICAL_IMG_NR attribute that can be used to build a IIIF URL just like the Description::sourceImageInformation::fileName node for single-page documents.
Path to files¶
The alto command takes one optional positional argument: the path to the Alto XML files. If no path is specified, this defaults to the current working directory.
arkindex upload alto $PATH/TO/FOLDER/ --iiif-base-url http://some-server.domain.com/iiif/folder/ --parent-id $ARKINDEX_ELEMENT_ID --create-types
Required arguments¶
The alto command takes two required arguments:
- --iiif-base-url: the base URL on a IIIF image server from which the image URLs are built ($IIIF_BASE_URL{xmlPath}{imageFilename}). It must include both the IIIF server address and the encoded path to the target images (with / encoded as %2F),
- --worker-run-id: the ID of a worker run to publish children elements using bulk endpoints.
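Encoding the path by hand is error-prone when the images sit in nested folders. The sketch below shows one way to build the value; iiif_base_url is a hypothetical helper, and whether your IIIF server expects a trailing encoded separator is an assumption to check against your own setup.

```shell
# Hypothetical helper: join a IIIF server address and an image folder path,
# encoding every "/" of the path as "%2F".
iiif_base_url() {
    server=${1%/}    # drop a trailing slash from the server address, if any
    path=$(printf '%s' "$2" | sed 's|/|%2F|g')
    printf '%s/%s\n' "$server" "$path"
}

# Usage: arkindex upload alto ... --iiif-base-url "$(iiif_base_url http://some-server.domain.com/iiif/ folder/subfolder/)"
```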
There are two ways to import the Alto XML documents. You can choose an option using one of these two (required, mutually exclusive) arguments:
- --parent-id: the ID of an existing folder-type element on Arkindex, into which the images, elements, metadata and transcriptions will be imported. This will use the Arkindex API to upload all information provided by the Alto XML documents.
- --db: the path to a SQLite database into which the images, elements, metadata and transcriptions will be imported. Using a SQLite database will limit the use of the Arkindex API to retrieving the worker run and creating images. This will also improve the upload speed. All other API calls (for the creation of elements, metadata and transcriptions) will be avoided in order to populate the database. This database will respect the Arkindex export format version 9 and can then be imported directly into Arkindex. ⚠ As image creation is always done using the Arkindex API, the database can only be imported on the instance used by the command (see the -p/--profile argument in the dedicated section).
There are two ways the Alto XML import can deal with the elements found in the Alto XML files. You can choose an option using one of these two (required, mutually exclusive) arguments:
- --create-types: the import will create element types in the target Arkindex corpus for each element type found in the XML files (unless a type with that slug already exists, in which case it will use the existing type). For example, if the Alto XML files contain Page, TextBlock, Paragraph and TextLine nodes, and none of these already exist within the target corpus, then the page, textblock, paragraph and textline element types will be created and used for the import. This is the recommended approach, as it ensures that all the information from the Alto XML files will be imported into Arkindex.
- --existing-types: specify a correspondence between the Alto XML nodes and existing element types in the target Arkindex corpus. Any nodes for which no corresponding element type has been set will be ignored and not imported. The type mapping must use the following format: "alto_type:arkindex_type alto_type_2:arkindex_type_2" (within double quotation marks, both Alto XML and Arkindex element types in lowercase).
The full command to upload an ALTO XML file’s data would look like:
arkindex upload alto \
$PATH/TO/FOLDER/ \
--worker-run-id $ARKINDEX_WORKER_RUN_ID \
--iiif-base-url http://some-server.domain.com/iiif/folder/ \
[--parent-id $ARKINDEX_ELEMENT_ID | --db alto_upload.db] \
[--create-types | --existing-types="alto_type:arkindex_type alto_type_2:arkindex_type_2"]
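Since the --existing-types value must be a single space-separated, lowercase string, it can be convenient to assemble it from individual pairs. This is a minimal sketch; build_existing_types is a hypothetical helper, and text_zone and text_line stand in for element types that would have to exist in your corpus.

```shell
# Hypothetical helper: join "alto_type:arkindex_type" pairs with single
# spaces and lowercase the whole mapping.
build_existing_types() {
    printf '%s\n' "$*" | tr '[:upper:]' '[:lower:]'
}

# Usage: arkindex upload alto ... --existing-types="$(build_existing_types TextBlock:text_zone TextLine:text_line)"
```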
Optional arguments¶
There are other optional arguments that will be useful to handle the documents.
- --alto-namespace: sets the XML namespace of the files, useful if it is not already clearly declared, e.g. http://schema.ccs-gmbh.com/docworks/version20/alto-1-4.xsd,
- --dpi-x: the horizontal resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters,
- --dpi-y: the vertical resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters.
To manage Arkindex publication, you can specify three more options:
- --skip-metadata: skips publishing metadata (attributes on XML nodes), which improves the upload speed as metadata must be published element by element.
- --ignore-types: ignores XML nodes of the given types, which improves the upload speed.
- --parent-name: name of a parent folder under which page elements will be created (defaults to ALTO upload from CLI). Only used when using a SQLite database with the --db argument.
Alto XML documents with images that are on a Gallica IIIF server¶
The gallica subcommand allows you to upload Alto XML documents whose images are on the Gallica IIIF server. If that is the case, use the gallica subcommand instead of the alto subcommand, with the same arguments as listed above for the alto subcommand, and with some additional arguments:
- --metadata-file: this required argument provides the path to a CSV file that contains two columns:
  - the first column should be the folder’s name,
  - the second column should contain its corresponding ARK ID.
  This file will be used to build the URLs to get the images from the Gallica IIIF server.
METS XML documents¶
The mets subcommand allows you to upload images, elements, metadata and transcriptions from ALTO XML documents as well as a hierarchy described in a METS XML file. Only ALTO XML files are supported for import. If the ALTO XML files have already been imported into Arkindex and the cache is still present, the associated elements will be reused to add the missing data (parent, children…).
Required arguments¶
The mets command takes two required arguments:
- the path to the METS XML file, as the first positional argument,
- --worker-run-id: the ID of a worker run to publish children elements using bulk endpoints.
There are two ways to import the METS XML file. You can choose an option using one of these two (required, mutually exclusive) arguments:
- --parent-id: the ID of an existing folder-type element on Arkindex, into which the images, elements, metadata and transcriptions will be imported. This will use the Arkindex API to upload all information provided by the METS XML file.
- --db: the path to a SQLite database into which the images, elements, metadata and transcriptions will be imported. Using a SQLite database will limit the use of the Arkindex API to retrieving the worker run and creating images. This will also improve the upload speed. All other API calls (for the creation of elements, metadata and transcriptions) will be avoided in order to populate the database. This database will respect the Arkindex export format version 9 and can then be imported directly into Arkindex. ⚠ As image creation is always done using the Arkindex API, the database can only be imported on the instance used by the command (see the -p/--profile argument in the dedicated section).
Optional arguments¶
There are other optional arguments that will be useful to handle the images.
- --iiif-base-url: the base URL of the IIIF server where the images used by the METS file are exposed (defaults to https://europe-gamma.iiif.teklia.com/iiif/2/),
- --iiif-prefix: the prefix on the IIIF server behind which the images used by the METS file are exposed,
- --dpi-x: the horizontal resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters,
- --dpi-y: the vertical resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters.
Note
The DPI related arguments are ignored for files using coordinates in pixels.
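To see why the resolution matters, recall that one inch is 25.4 mm, that is 254 tenths of a millimeter, so a coordinate in tenths of millimeters maps to pixels as value × DPI / 254. The sketch below only illustrates this arithmetic; tenths_mm_to_px is a hypothetical helper that truncates to an integer, and the exact rounding applied by the CLI is not documented here.

```shell
# Hypothetical helper: convert a length in tenths of a millimeter to pixels
# at a given resolution. 1 inch = 25.4 mm = 254 tenths of a millimeter.
# Shell integer arithmetic truncates the result.
tenths_mm_to_px() {
    echo $(( $1 * $2 / 254 ))
}

# Usage: tenths_mm_to_px 2100 300   (an A4 page width of 210 mm at 300 DPI)
```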
The full command to upload a METS XML file’s data would look like
arkindex upload mets \
$PATH/TO/mets.xml \
$ARKINDEX_CORPUS_ID \
[--parent-id $ARKINDEX_ELEMENT_ID | --db mets_upload.db] \
--worker-run-id $ARKINDEX_WORKER_RUN_ID \
--iiif-prefix http://some-server.domain.com/iiif/folder/ \
--dpi-x 300 \
--dpi-y 300
To manage Arkindex publication, you can specify four more options:
- --alto: publishes the ALTO files in the correct order, according to the METS file’s fileSec::fileGrp node, before uploading the METS file.
- --skip-metadata: skips publishing metadata (attributes on XML nodes), which improves the upload speed as metadata must be published element by element.
- --ignore-types: ignores XML nodes of the given types, which improves the upload speed.
- --parent-name: name of a parent folder under which page elements will be created (defaults to METS upload from CLI). Only used when using a SQLite database with the --db argument.
Limitations and requirements¶
- While this command imports the elements described in each ALTO XML file and all the structural elements described in the METS XML file, it does not link the structural elements to Arkindex pages.
- The images must be available on the IIIF server with the same file hierarchy, before using this command.
- Supported metadata are:
  - the ID of the node in the METS structure (imported as METS ID),
  - any metadata specified through the DMDID attribute that links towards a dmdSec::mdWrap[MDTYPE="DC"] node will be imported on the related element.
- You must have administrator access to the corpus to run the import.
- The command will store the ID of the METS and ALTO files as well as the LANG as element metadata. The following metadata must be allowed in your corpus, before running this command:
  - METS ID, with type reference,
  - Alto ID, with type reference,
  - Lang, with type text.
IIIF images¶
The iiif-images subcommand allows you to create elements on Arkindex from a text file containing a list of IIIF image URIs (such as those generated by the MinIO upload subcommand).
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --corpus-id $ARKINDEX_CORPUS_ID
Required arguments¶
The minimal required arguments for running the iiif-images command are:
- the path to the file containing the list of IIIF image URIs, which is a positional argument.
- the ID of either an Arkindex corpus or an Arkindex folder-type element into which the images will be imported, specified using one of two (mutually exclusive) arguments:
  - --corpus-id
  - --parent-folder
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --parent-folder $ARKINDEX_ELEMENT_ID
Optional arguments¶
Whether you import your images into a corpus or into a folder-type element, the IIIF import command will create a folder for your imported images. You can specify its name and type using the following arguments:
- --import-folder-name: defaults to “IIIF import”.
- --import-folder-type: an existing element type in the target corpus; defaults to folder.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --import-folder-name $FOLDER_NAME --import-folder-type $FOLDER_TYPE --corpus-id $ARKINDEX_CORPUS_ID
- --element-type: you can use this argument to specify the type (an existing element type in the target corpus) of the elements that will be created from your IIIF images; defaults to page.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --element-type $ELEMENT_TYPE --corpus-id $ARKINDEX_CORPUS_ID
- --image-name-delimiter: defines the delimiter for the last part of the image URI which will be used as the element name on Arkindex; defaults to /. For example, if your image’s URI is http://some-server.domain.com/iiif/folder/date%category%filename.jpg and you do not specify a delimiter, the import will use / and the element’s name will be date%category%filename.jpg; if you set the delimiter to %, the created element’s name will be filename.jpg.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --image-name-delimiter % --corpus-id $ARKINDEX_CORPUS_ID
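The effect of the name delimiter can be checked locally with shell parameter expansion before running an import. The sketch below only mirrors the behaviour described above, keeping what follows the last occurrence of the delimiter; element_name is a hypothetical helper and not the CLI's own implementation.

```shell
# Hypothetical helper: print the part of a URI that follows the last
# occurrence of the delimiter (default: "/").
element_name() {
    uri=$1
    delimiter=${2:-/}
    # Remove the longest prefix ending with the delimiter
    name=${uri##*"$delimiter"}
    printf '%s\n' "$name"
}

# Usage: element_name 'http://some-server.domain.com/iiif/folder/date%category%filename.jpg' '%'
```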
Elements hierarchy¶
If you want to import your images to Arkindex with a given hierarchy, not importing all the images in one folder, you can use the following arguments:
- Either one of the following mutually exclusive arguments:
  - --keep-hierarchy: recreate on Arkindex the hierarchy contained in the IIIF image URIs. For example, if your URIs look like http://some-server.domain.com/iiif/FOLDER1/SUBFOLDER1/SUBFOLDER2/filename.jpg, then the import command will create, inside the import folder, a FOLDER1 element, inside it a SUBFOLDER1 element, inside it a SUBFOLDER2 element, and inside it your image.
  - --group-prefix-delimiter: create sub-folders grouping IIIF images by name prefix, splitting file names between group prefix and image names according to the group prefix delimiter. For example, if you have images with URIs that look like http://some-server.domain.com/iiif/folder/subfolder/date%location1%filename1.jpg and others with URIs like http://some-server.domain.com/iiif/folder/subfolder/date%location2%filename2.jpg, you can put filename1.jpg into a date%location1 sub-folder, and filename2.jpg into a date%location2 sub-folder with the following command:
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --group-prefix-delimiter % --corpus-id $ARKINDEX_CORPUS_ID
- --group-folder-type: define the type of the sub-folders that will be created to contain your grouped images; defaults to the type set by --import-folder-type.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --keep-hierarchy --group-folder-type $ELEMENT_TYPE --corpus-id $ARKINDEX_CORPUS_ID
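To preview how a file name would be divided by --group-prefix-delimiter, the split described above can be imitated in the shell: everything before the last delimiter is the group prefix, and what follows it is the image name. This is an illustrative sketch only; split_group is a hypothetical helper, not the CLI's implementation.

```shell
# Hypothetical helper: split a file name into "group prefix<TAB>image name"
# at the last occurrence of the group prefix delimiter.
split_group() {
    file_name=$1
    delimiter=$2
    group=${file_name%"$delimiter"*}      # text before the last delimiter
    image=${file_name##*"$delimiter"}     # text after the last delimiter
    printf '%s\t%s\n' "$group" "$image"
}

# Usage: split_group 'date%location1%filename1.jpg' '%'
```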
Usage examples¶
Grouping images by prefix¶
You want to import the images whose URIs are listed in the following my_iiif_images.txt file:
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page1.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page2.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page3.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%tarn%page1.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%tarn%page2.jpg
You do not care about numgrp10, which corresponds to a digitization campaign, or about the folder hierarchy before it. You want to import these images inside a folder element called occitanie, within a corpus with the ID aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, and you want the images to be grouped inside subfolders of type departement (you have created this element type, ticking the “folder” checkbox, in your corpus from the project administration page) based on the france%aveyron-like prefixes.
You should then run the following command:
arkindex upload iiif-images ./my_iiif_images.txt --corpus-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa --import-folder-name occitanie --image-name-delimiter - --group-prefix-delimiter % --group-folder-type departement
This command results in the creation, in the targeted corpus, of an occitanie folder element, within which are two departement elements, france%aveyron and france%tarn, containing their respective page elements.
Recreating a folder hierarchy¶
You want to import the images whose URIs are listed in the following my_iiif_images.txt file:
http://some-server.domain.com/iiif/cork/cork/file1.jpg
http://some-server.domain.com/iiif/cork/cork/file2.jpg
http://some-server.domain.com/iiif/cork/cork/file3.jpg
http://some-server.domain.com/iiif/cork/mallow/file1.jpg
http://some-server.domain.com/iiif/cork/mallow/file2.jpg
http://some-server.domain.com/iiif/limerick/kilmallock/file1.jpg
http://some-server.domain.com/iiif/limerick/kilmallock/file2.jpg
You want to import these images inside a folder element called Ireland, within a corpus with the ID aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, and you want to reproduce on Arkindex the hierarchy from the URIs. You do not want to import your images as page elements, but as double_page elements (you have created this element type in your corpus from the project administration page).
You should then run the following command:
arkindex upload iiif-images ./my_iiif_images.txt --corpus-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa --import-folder-name Ireland --element-type double_page --keep-hierarchy
This command results in the creation, in the targeted corpus, of an Ireland folder element, within which are two folder elements, cork and limerick. The cork folder contains two more sub-folders, called cork and mallow; the limerick folder contains one kilmallock folder. Inside those sub-folders, your images have been imported as double_page elements.