Upload commands

The upload subcommands allow you to import data to Arkindex.

Images stored on an S3-compatible bucket (MinIO)

The minio subcommand generates IIIF image URLs for images stored on a given S3-compatible (AWS, MinIO, Ceph...) bucket. These URLs can then be uploaded to Arkindex using the IIIF import subcommand.

arkindex upload minio -b $BUCKET_NAME

If there are multiple folders on the target bucket, the subcommand will output one file per folder, containing the URLs for the images in this folder.

Authentication

Before running the minio subcommand, you need to authenticate yourself with credentials for an account that has access to the bucket you're targeting. This authentication is done through environment variables.

export MINIO_ACCESS_KEY=$YOUR_ACCESS_KEY
export MINIO_SECRET_KEY=$YOUR_SECRET_KEY
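
Since the credentials are only read from the environment, it can help to check that both variables are set before launching the subcommand. The following is a minimal sketch; check_minio_credentials is a hypothetical helper name and is not part of the Arkindex CLI.

```shell
# Hypothetical helper (not part of the Arkindex CLI): returns 0 only when
# both MinIO credential variables are set and non-empty.
check_minio_credentials() {
  if [ -z "${MINIO_ACCESS_KEY:-}" ] || [ -z "${MINIO_SECRET_KEY:-}" ]; then
    echo "MINIO_ACCESS_KEY and MINIO_SECRET_KEY must both be set" >&2
    return 1
  fi
  return 0
}
```

It could be chained in front of the documented command, for example: check_minio_credentials && arkindex upload minio -b $BUCKET_NAME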

Required arguments

The only required argument for the minio upload subcommand is the name of the targeted bucket, provided using -b or --bucket-name.

Optional arguments and default parameters

All other arguments are optional. Where an argument has a default value, it is given below.

  • --iiif-server: the URL of the IIIF server through which the images on the bucket are exposed. By default, the IIIF server used is https://europe-gamma.iiif.teklia.com/iiif/2/. Example usage:
arkindex upload minio -b $BUCKET_NAME --iiif-server https://some.iiif-server.com/iiif/
  • --minio-url: the URL of the server on which the target bucket is located. By default, the server URL is ceph.iiif.teklia.com. Example usage:
arkindex upload minio -b $BUCKET_NAME --minio-url a-storage-server.domain.com
  • --out-dir: the path to the output directory where the lists of IIIF URLs will be created. By default, a iiif_urls_output folder is created in the current working directory. Example usage:
arkindex upload minio -b $BUCKET_NAME --out-dir $PATH/TO/FOLDER/
  • --prefix: the path to the location of the files, if you do not wish to list all the files inside a bucket. This path does not include the bucket name, as it is already provided by the --bucket-name argument. In order to list the files located in BUCKET_NAME/folder/subfolder, you must use the following command:
arkindex upload minio -b $BUCKET_NAME --prefix folder/subfolder

Shortcut command for importing images from multiple files

Before using this command, you might want to read the IIIF import subcommand documentation to be aware of all the available options.

corpus_id="<CORPUS ID>"
cd iiif_urls_output/
for file in *; do
  echo "$file"
  # Name the import folder after the file, without its .txt extension
  folder_name=$(basename "$file" .txt)
  arkindex -p demo upload iiif-images "$file" --corpus-id "$corpus_id" --import-folder-name "$folder_name"
done

Page XML documents

The pagexml subcommand allows you to upload elements and transcriptions from Page XML documents to their corresponding Page elements on Arkindex, as long as the imageFilename attribute in the Page XML files matches the name of the Arkindex Page elements.

arkindex upload pagexml --xml-path $PATH/TO/FOLDER --parent $ARKINDEX_ELEMENT_ID

This command takes two arguments:

  • --xml-path: either the path to a folder containing the Page XML files, or the path to a file containing a list of the paths to the XML files, one file per line.
arkindex upload pagexml --xml-path $PATH/TO/paths_file.txt --parent $ARKINDEX_ELEMENT_ID
Example contents of paths_file.txt:

/PATH/TO/FOLDER/filename_1.xml
/PATH/TO/ANOTHER_FOLDER/filename_2.xml
/PATH/TO/ANOTHER_FOLDER/filename_3.xml
...
  • --parent: the ID of a folder-type Arkindex element, which contains the target Arkindex Page elements onto which the elements and transcriptions will be uploaded.
arkindex upload pagexml --xml-path $PATH/TO/FOLDER --parent $ARKINDEX_ELEMENT_ID
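
When the Page XML files are spread over several sub-folders, the list file accepted by --xml-path can be generated rather than written by hand. The sketch below assumes the files use the .xml extension; list_pagexml_files is a hypothetical helper name, and the output file name is up to you.

```shell
# Hypothetical helper: write the absolute path of every .xml file found
# (recursively) under a directory to an output file, one path per line.
list_pagexml_files() {
  src_dir=$1
  out_file=$2
  # Resolve the source directory to an absolute path
  abs_dir=$(cd "$src_dir" && pwd) || return 1
  find "$abs_dir" -type f -name '*.xml' | sort > "$out_file"
}
```

For example, list_pagexml_files $PATH/TO/FOLDER paths_file.txt produces a file that can then be passed to --xml-path.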

Alto XML documents

The alto subcommand allows you to upload images, elements and transcriptions from Alto XML documents to Arkindex, as long as the images are available from a IIIF server and the image filenames on this server match the content of the fileName node in the corresponding Alto XML files. If the images have already been imported into Arkindex, they will be retrieved and used to create new Page elements.

arkindex upload alto --iiif-base-url http://some-server.domain.com/iiif/folder-name/ --parent-id $ARKINDEX_ELEMENT_ID --create-types

A limited subset of ALTO 1.4 documents is officially supported:

  • The Alto XML documents must have their MeasurementUnit set to pixel.
  • Shapes are not supported; only the HPOS, VPOS, WIDTH and HEIGHT attributes are used to build rectangles.
  • String elements within other nodes are only imported as transcriptions for these nodes, not as elements; their HPOS, VPOS, WIDTH and HEIGHT attributes are ignored, and only the CONTENT attribute is used.

Importing Alto XML files describing multiple pages/images is supported, as long as each Page node has a PHYSICAL_IMG_NR attribute that can be used to build a IIIF URL just like the fileName node for single-page documents.

Path to files

The alto command takes one optional positional argument: the path to the Alto XML files. If no path is specified, this defaults to the current working directory.

arkindex upload alto $PATH/TO/FOLDER/ --iiif-base-url http://some-server.domain.com/iiif/folder/ --parent-id $ARKINDEX_ELEMENT_ID --create-types

Required arguments

The alto command takes three required arguments: --iiif-base-url, --parent-id, and one of --create-types or --existing-types.

  • --iiif-base-url: the base URL on a IIIF image server from which the image URLs are built ($IIIF_BASE_URL/{imageFilename}). It must include both the IIIF server address and the encoded path to the target images (with %2F as /).
arkindex upload alto --iiif-base-url https://some-server.domain.com/iiif/public%2Fsomedate%2Ffolder/ --parent-id $ARKINDEX_ELEMENT_ID --create-types
  • --parent-id: the ID of an existing folder-type element on Arkindex, into which the pages, elements and transcriptions will be imported.
arkindex upload alto --parent-id $ARKINDEX_ELEMENT_ID --iiif-base-url http://some-server.domain.com/iiif/folder/ --create-types
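
The encoded path expected by --iiif-base-url can be built with a small helper. This is only a sketch: encode_iiif_path and build_iiif_base_url are hypothetical names, and the encoding is limited to replacing / with %2F as described above; any other character that needs URL-encoding is left untouched.

```shell
# Hypothetical helper: replace every "/" in a path with "%2F".
encode_iiif_path() {
  printf '%s\n' "$1" | sed 's|/|%2F|g'
}

# Hypothetical helper: join a IIIF server address ($1) and an unencoded
# image folder path ($2) into a base URL ending with a slash.
build_iiif_base_url() {
  printf '%s/%s/\n' "${1%/}" "$(encode_iiif_path "$2")"
}
```

Calling build_iiif_base_url https://some-server.domain.com/iiif public/somedate/folder prints the base URL used in the --iiif-base-url example above.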

There are two ways the Alto XML import can deal with the elements found in the Alto XML files. You can choose an option using one of these two (required, mutually exclusive) arguments:

  • --create-types: the import will create element types in the target Arkindex corpus for each element type found in the XML files (unless a type with that slug already exists, in which case it will use the existing type). For example, if the Alto XML files contain Page, TextBlock, Paragraph and TextLine nodes, and none of these already exist within the target corpus, then the page, textblock, paragraph and textline element types will be created and used for the import. This is the recommended approach, as it ensures that all the information from the Alto XML files will be imported into Arkindex.
arkindex upload alto $PATH/TO/FOLDER/ --create-types --iiif-base-url http://some-server.domain.com/iiif/folder/ --parent-id $ARKINDEX_ELEMENT_ID
  • --existing-types: specify a mapping between the Alto XML nodes and existing element types in the target Arkindex corpus. Any node for which no corresponding element type has been set is ignored and not imported. The mapping must use the following format: "alto_type:arkindex_type alto_type_2:arkindex_type_2" (within double quotation marks, with both Alto XML and Arkindex element types in lowercase).
arkindex upload alto --existing-types="alto_type:arkindex_type alto_type_2:arkindex_type_2" --iiif-base-url http://some-server.domain.com/iiif/folder/ --parent-id $ARKINDEX_ELEMENT_ID

IIIF images

The iiif-images subcommand allows you to create elements on Arkindex from a text file containing a list of IIIF image URIs (such as those generated by the MinIO upload subcommand).

arkindex upload iiif-images $PATH/TO/uris_list_file.txt --corpus-id $ARKINDEX_CORPUS_ID

Required arguments

The minimal required arguments for running the iiif-images command are:

  • the path to the file containing the list of IIIF image URIs, which is a positional argument.
  • the ID of either an Arkindex corpus or an Arkindex folder-type element into which the images will be imported, specified using one of two (mutually exclusive) arguments:
    • --corpus-id
    • --parent-folder
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --parent-folder $ARKINDEX_ELEMENT_ID

Optional arguments

Whether you import your images into a corpus or into a folder-type element, the IIIF import command will create a folder for your imported images. You can specify its name and type using the following arguments:

  • --import-folder-name: defaults to "IIIF import".
  • --import-folder-type: an existing element type in the target corpus; defaults to folder.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --import-folder-name $FOLDER_NAME --import-folder-type $FOLDER_TYPE --corpus-id $ARKINDEX_CORPUS_ID
  • --element-type: you can use this argument to specify the type (an existing element type in the target corpus) of the elements that will be created from your IIIF images; defaults to page.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --element-type $ELEMENT_TYPE --corpus-id $ARKINDEX_CORPUS_ID
  • --image-name-delimiter: define the delimiter after which the last part of the image URI is used as the element name on Arkindex; defaults to /. For example, suppose your image's URI is http://some-server.domain.com/iiif/folder/date%category%filename.jpg. If you do not specify a delimiter, the import uses / and the element's name is date%category%filename.jpg. If you set the delimiter to %, the created element's name is filename.jpg.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --image-name-delimiter % --corpus-id $ARKINDEX_CORPUS_ID
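
To preview the element names a given delimiter would produce before running an import, the rule described for --image-name-delimiter can be mimicked in the shell. This is a sketch of the documented behavior (keep what follows the last occurrence of the delimiter), not the CLI's actual implementation; image_name_from_uri is a hypothetical helper name.

```shell
# Hypothetical helper: print the part of a URI ($1) that follows the last
# occurrence of a delimiter ($2, defaults to "/").
image_name_from_uri() {
  uri=$1
  delimiter=${2:-/}
  # Strip the longest prefix ending with the (literal) delimiter
  printf '%s\n' "${uri##*"$delimiter"}"
}
```

With the URI from the example above, the default delimiter yields date%category%filename.jpg, and % yields filename.jpg.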

Elements hierarchy

If you want to import your images to Arkindex with a given hierarchy, rather than placing all the images in a single folder, you can use the following arguments:

  • One of the following two mutually exclusive arguments:
    • --keep-hierarchy: recreate on Arkindex the hierarchy contained in the IIIF image URIs. For example, if your URIs look like this: http://some-server.domain.com/iiif/FOLDER1/SUBFOLDER1/SUBFOLDER2/filename.jpg then the import command will create, inside the import folder, a FOLDER1 element, and inside it a SUBFOLDER1 element, and inside it a SUBFOLDER2 element and inside it your image.
    • --group-prefix-delimiter: create sub-folders grouping IIIF images by name prefix, splitting file names between group prefix and image names according to the group prefix delimiter. For example, if you have images with URIs that look like http://some-server.domain.com/iiif/folder/subfolder/date%location1%filename1.jpg and others with URIs like http://some-server.domain.com/iiif/folder/subfolder/date%location2%filename2.jpg, you can put filename1.jpg into a date%location1 sub-folder, and filename2.jpg into a date%location2 sub-folder with the following command:
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --group-prefix-delimiter % --corpus-id $ARKINDEX_CORPUS_ID
  • --group-folder-type: define the type of the sub-folders that will be created to contain your grouped images; defaults to the type set by --import-folder-type.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --keep-hierarchy --group-folder-type $ELEMENT_TYPE --corpus-id $ARKINDEX_CORPUS_ID
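
The split performed by --group-prefix-delimiter can be illustrated with shell parameter expansion. This sketch assumes, as the example above suggests, that the file name is split at the last occurrence of the delimiter; split_group_prefix is a hypothetical helper name and does not reflect the CLI's internal code.

```shell
# Hypothetical helper: split the file name of a URI ($1) at the last
# occurrence of a delimiter ($2) and print "group<TAB>image name".
split_group_prefix() {
  uri=$1
  delimiter=$2
  file_name=${uri##*/}                 # keep what follows the last "/"
  group=${file_name%"$delimiter"*}     # everything before the last delimiter
  name=${file_name##*"$delimiter"}     # everything after the last delimiter
  printf '%s\t%s\n' "$group" "$name"
}
```

For a URI ending in date%location1%filename1.jpg and the delimiter %, the helper prints date%location1 and filename1.jpg, separated by a tab.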

Usage examples

Grouping images by prefix

You want to import the images whose URIs are listed in the following my_iiif_images.txt file:

http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page1.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page2.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page3.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%tarn%page1.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%tarn%page2.jpg

You do not care about numgrp10, which corresponds to a digitization campaign, or about the folder hierarchy before it. You want to import these images inside a folder element called occitanie, within a corpus with the ID aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, and you want the images to be grouped inside subfolders of type departement (you have created this element type, ticking the "folder" checkbox, in your corpus from the project administration page) based on the france%aveyron-like prefixes.

You should then run the following command:

arkindex upload iiif-images ./my_iiif_images.txt --corpus-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa --import-folder-name occitanie --image-name-delimiter - --group-prefix-delimiter % --group-folder-type departement

This command creates, in the targeted corpus, an occitanie folder element containing two departement elements, france%aveyron and france%tarn, each of which holds its respective page elements.
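
Before importing, you may want to preview which groups a pair of delimiters would produce for a URI list. The sketch below infers the combination of the two delimiters from the result described above (first keep what follows the last image name delimiter, then keep what precedes the last group prefix delimiter); this ordering is an assumption, and preview_groups is a hypothetical helper name.

```shell
# Hypothetical helper: count how many URIs of a list file ($1) fall into
# each group, given an image name delimiter ($2) and a group prefix
# delimiter ($3).
preview_groups() {
  while IFS= read -r uri || [ -n "$uri" ]; do
    [ -n "$uri" ] || continue          # skip empty lines
    name=${uri##*"$2"}                 # e.g. france%aveyron%page1.jpg
    printf '%s\n' "${name%"$3"*}"      # e.g. france%aveyron
  done < "$1" | sort | uniq -c
}
```

Running preview_groups ./my_iiif_images.txt - % on the file above reports three URIs for france%aveyron and two for france%tarn.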

Recreating a folder hierarchy

You want to import the images whose URIs are listed in the following my_iiif_images.txt file:

http://some-server.domain.com/iiif/cork/cork/file1.jpg
http://some-server.domain.com/iiif/cork/cork/file2.jpg
http://some-server.domain.com/iiif/cork/cork/file3.jpg
http://some-server.domain.com/iiif/cork/mallow/file1.jpg
http://some-server.domain.com/iiif/cork/mallow/file2.jpg
http://some-server.domain.com/iiif/limerick/kilmallock/file1.jpg
http://some-server.domain.com/iiif/limerick/kilmallock/file2.jpg

You want to import these images inside a folder element called Ireland, within a corpus with the ID aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, and you want to reproduce on Arkindex the hierarchy from the URIs. You do not want to import your images as page elements, but as double_page elements (you have created this element type in your corpus from the project administration page).

You should then run the following command:

arkindex upload iiif-images ./my_iiif_images.txt --corpus-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa --import-folder-name Ireland --element-type double_page --keep-hierarchy

This command results in the creation, in the targeted corpus, of an Ireland folder element, within which are two cork and limerick folder elements. The cork folder contains two more sub-folders, called cork and mallow; the limerick folder contains one kilmallock folder. Inside those sub-folders, your images have been imported as double_page elements.
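
To check in advance which folders such a hierarchy contains, the folder paths can be extracted from the URI list. In this sketch, the base URL to strip is passed explicitly, which is an assumption: how the CLI itself decides where the hierarchy starts is not described here. list_hierarchy_folders is a hypothetical helper name.

```shell
# Hypothetical helper: print the distinct folder paths found in a URI list
# file ($1) once a base URL ($2) and the file names are removed.
list_hierarchy_folders() {
  base=${2%/}/
  while IFS= read -r uri || [ -n "$uri" ]; do
    [ -n "$uri" ] || continue
    path=${uri#"$base"}                        # e.g. cork/mallow/file1.jpg
    case $path in
      */*) printf '%s\n' "${path%/*}" ;;       # e.g. cork/mallow
    esac
  done < "$1" | sort -u
}
```

For the file above, list_hierarchy_folders ./my_iiif_images.txt http://some-server.domain.com/iiif lists cork/cork, cork/mallow and limerick/kilmallock.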