Upload commands

The upload subcommands allow you to import data to Arkindex.

Images stored on an S3-compatible bucket (MinIO)

The minio subcommand generates IIIF image URLs for images that are stored on a given S3-compatible (AWS, MinIO, Ceph…) bucket, which can then be uploaded to Arkindex using the IIIF import subcommand.

arkindex upload minio -b $BUCKET_NAME

If there are multiple folders on the target bucket, the subcommand will output one file per folder, containing the URLs for the images in this folder.

Authentication

Before running the minio subcommand, you need to authenticate yourself with credentials for an account that has access to the bucket you’re targeting. This authentication is done through environment variables.

export MINIO_ACCESS_KEY=$YOUR_ACCESS_KEY
export MINIO_SECRET_KEY=$YOUR_SECRET_KEY
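
It can be convenient to check that both variables are set before running the subcommand. The following is a minimal sketch of such a check; check_minio_credentials is a hypothetical helper written for this example, not something provided by the Arkindex CLI.

```shell
# Hypothetical helper (not part of the Arkindex CLI): succeed only if both
# credential variables are set to a non-empty value.
check_minio_credentials() {
  for name in MINIO_ACCESS_KEY MINIO_SECRET_KEY; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Only call the subcommand when the credentials are available, for example:
# check_minio_credentials && arkindex upload minio -b "$BUCKET_NAME"
```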

Required arguments

The only required argument for the minio upload subcommand is the name of the targeted bucket, provided using -b or --bucket-name.

Optional arguments and default parameters

Other arguments are not required, but have default values that are used by the subcommand.

  • --iiif-server: the URL of the IIIF server through which the images on the bucket are exposed. By default, the IIIF server used is https://europe-gamma.iiif.teklia.com/iiif/2/. Example usage:
arkindex upload minio -b $BUCKET_NAME --iiif-server https://some.iiif-server.com/iiif/
  • --minio-url: the URL of the server on which the target bucket is located. By default, the server URL is ceph.iiif.teklia.com. Example usage:
arkindex upload minio -b $BUCKET_NAME --minio-url a-storage-server.domain.com
  • --out-dir: the path to the output directory where the lists of IIIF URLs will be created. By default, an iiif_urls_output folder is created in the current working directory. Example usage:
arkindex upload minio -b $BUCKET_NAME --out-dir $PATH/TO/FOLDER/
  • --prefix: the path to the location of the files, if you do not wish to list all the files inside a bucket. This path does not include the bucket name, as it is already provided by the --bucket-name argument. In order to list the files located in BUCKET_NAME/folder/subfolder, you must use the following command:
arkindex upload minio -b $BUCKET_NAME --prefix folder/subfolder

Shortcut command for importing images from multiple files

Before using this command, you might want to read the IIIF import subcommand documentation to be aware of all the available options.

corpus_id=<CORPUS ID>
cd iiif_urls_output/
# Import each URL list into its own folder, named after the list file.
for file in * ; do
  echo "$file"
  folder_name=$(basename "$file" .txt)
  arkindex -p demo upload iiif-images "$file" --corpus-id "$corpus_id" --import-folder-name "$folder_name"
done

Page XML documents

The pagexml subcommand allows you to upload elements and transcriptions from Page XML documents to their corresponding Page elements on Arkindex, as long as the imageFilename attribute in the Page XML files matches the name of the Arkindex Page elements.

arkindex upload pagexml --xml-path $PATH/TO/FOLDER --parent $ARKINDEX_ELEMENT_ID

This command takes three arguments:

  • --worker-run-id: the ID of a worker run to publish children elements, transcriptions and metadata using bulk endpoints.

  • --xml-path: either the path to a folder containing the Page XML files, or the path to a file containing a list of the paths to the XML files, one file per line.

    arkindex upload pagexml --xml-path $PATH/TO/paths_file.txt --parent $ARKINDEX_ELEMENT_ID
    
    <paths_file.txt>
    
    /PATH/TO/FOLDER/filename_1.xml
    /PATH/TO/ANOTHER_FOLDER/filename_2.xml
    /PATH/TO/ANOTHER_FOLDER/filename_3.xml
    ...
    

  • --parent: the ID of a folder-type Arkindex element, which contains the target Arkindex Page elements onto which the elements and transcriptions will be uploaded.

arkindex upload pagexml --worker-run-id $ARKINDEX_WORKER_RUN_ID --xml-path $PATH/TO/FOLDER --parent $ARKINDEX_ELEMENT_ID
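
The list file accepted by --xml-path can be written by hand, but it can also be generated. Below is a minimal sketch using find; build_paths_file is a hypothetical helper name, and the sketch assumes the Page XML files use the .xml extension and that no file name contains a newline.

```shell
# Hypothetical helper: write the path of every .xml file found under the given
# folders to a list file, one path per line.
# Usage: build_paths_file OUTPUT_FILE FOLDER [FOLDER...]
build_paths_file() {
  out_file=$1
  shift
  # Paths are printed as find reports them, so pass absolute folders to get
  # absolute paths in the list.
  find "$@" -type f -name '*.xml' | sort > "$out_file"
}
```

For instance, `build_paths_file paths_file.txt /PATH/TO/FOLDER /PATH/TO/ANOTHER_FOLDER` would produce a file that can then be given to --xml-path.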

Alto XML documents

There are two options to upload Alto XML documents:

Alto XML documents with images that are on a normal IIIF server

The alto subcommand allows you to upload images, elements, metadata and transcriptions from Alto XML documents to Arkindex, as long as the images are available from a IIIF server and the image filenames on this server match either:

  • the content of the Description::sourceImageInformation::fileIdentifier node in the corresponding ALTO XML files,
  • or the content of the Description::sourceImageInformation::fileName node in the corresponding Alto XML files.

If the images have already been imported into Arkindex, they will be retrieved and used to create new Page elements.

arkindex upload alto --iiif-base-url http://some-server.domain.com/iiif/folder-name/ --parent-id $ARKINDEX_ELEMENT_ID --create-types

A limited subset of ALTO 1.4 documents is officially supported:

  • The Alto XML documents must have their MeasurementUnit set to pixel.
  • Shapes are not supported; only the HPOS, VPOS, WIDTH and HEIGHT attributes are used to build rectangles.
  • String elements within other nodes are only imported as transcriptions for these nodes, not as elements; their HPOS, VPOS, WIDTH and HEIGHT attributes are ignored, and only the CONTENT attribute is used.

Importing Alto XML files describing multiple pages/images is supported, as long as each Page node has a PHYSICAL_IMG_NR attribute that can be used to build a IIIF URL just like the Description::sourceImageInformation::fileName node for single-page documents.

Path to files

The alto command takes one optional positional argument: the path to the Alto XML files. If no path is specified, this defaults to the current working directory.

arkindex upload alto $PATH/TO/FOLDER/ --iiif-base-url http://some-server.domain.com/iiif/folder/ --parent-id $ARKINDEX_ELEMENT_ID --create-types

Required arguments

The alto command takes two required arguments.

  • --iiif-base-url: the base URL on a IIIF image server from which the image URLs are built ($IIIF_BASE_URL{xmlPath}{imageFilename}). It must include both the IIIF server address and the encoded path to the target images (with %2F as /),
  • --worker-run-id: the ID of a worker run to publish children elements using bulk endpoints.
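
To illustrate the encoding expected by --iiif-base-url, here is a minimal sketch that replaces the slashes of a folder path with %2F and appends the result to the server address. The encode_iiif_path helper and the example values are assumptions made for this illustration; whether the trailing separator must also be encoded depends on how your IIIF server exposes the images.

```shell
# Hypothetical helper: replace every "/" in a path with "%2F".
encode_iiif_path() {
  printf '%s\n' "$1" | sed 's|/|%2F|g'
}

# Assumed example values: a IIIF server address and a folder path on it.
iiif_server="http://some-server.domain.com/iiif/"
image_folder="folder/subfolder/"
iiif_base_url="${iiif_server}$(encode_iiif_path "$image_folder")"
echo "$iiif_base_url"
```

With these example values, the script prints http://some-server.domain.com/iiif/folder%2Fsubfolder%2F.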

There are two ways to import the Alto XML documents. You can choose an option using one of these two (required, mutually exclusive) arguments:

  • --parent-id: the ID of an existing folder-type element on Arkindex, into which the images, elements, metadata and transcriptions will be imported. This will use the Arkindex API to upload all information provided by the Alto XML documents.
  • --db: the path to a SQLite database into which the images, elements, metadata and transcriptions will be imported. Using a SQLite database will limit the use of the Arkindex API to retrieving the worker run and creating images. This will also improve the upload speed. All other API calls (for the creation of elements, metadata and transcriptions) will be avoided in order to populate the database. This database will respect the Arkindex export format version 9 and can then be imported directly into Arkindex. ⚠ As image creation is always done using the Arkindex API, the database can only be imported on the instance used by the command (see the -p/--profile argument in the dedicated section).

There are two ways the Alto XML import can deal with the elements found in the Alto XML files. You can choose an option using one of these two (required, mutually exclusive) arguments:

  • --create-types: the import will create element types in the target Arkindex corpus for each element type found in the XML files (unless a type with that slug already exists, in which case it will use the existing type). For example, if the Alto XML files contain Page, TextBlock, Paragraph and TextLine nodes, and none of these already exist within the target corpus, then the page, textblock, paragraph and textline element types will be created and used for the import. This is the recommended approach, as it ensures that all the information from the Alto XML files will be imported into Arkindex.
  • --existing-types: specify a correspondence between the Alto XML nodes and existing element types in the target Arkindex corpus. Any nodes for which no corresponding element type has been set will be ignored and not imported. The types matching must follow the following format: "alto_type:arkindex_type alto_type_2:arkindex_type_2" (within double quotation marks, both Alto XML and Arkindex element types in lowercase).
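
The value expected by --existing-types can be assembled from individual pairs. The sketch below is only an illustration: build_type_mapping is a hypothetical helper, and paragraph and text_line are assumed names of element types that would already exist in the target corpus.

```shell
# Hypothetical helper: join "alto_type:arkindex_type" pairs with single spaces
# and lowercase the result, matching the format described above.
# Relies on the default IFS so that "$*" joins the arguments with spaces.
build_type_mapping() {
  printf '%s\n' "$*" | tr '[:upper:]' '[:lower:]'
}

mapping=$(build_type_mapping TextBlock:paragraph TextLine:text_line)
echo "$mapping"
```

The resulting string could then be passed as --existing-types="$mapping".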

The full command to upload an ALTO XML file’s data would look like:

arkindex upload alto \
  $PATH/TO/FOLDER/ \
  --worker-run-id $ARKINDEX_WORKER_RUN_ID \
  --iiif-base-url http://some-server.domain.com/iiif/folder/ \
  [--parent-id $ARKINDEX_ELEMENT_ID | --db alto_upload.db] \
  [--create-types | --existing-types="alto_type:arkindex_type alto_type_2:arkindex_type_2"]

Optional arguments

There are other optional arguments that will be useful to handle the documents.

  • --alto-namespace: this is used to set the XML namespace of the files, useful if it’s not clearly mentioned already, e.g. http://schema.ccs-gmbh.com/docworks/version20/alto-1-4.xsd,
  • --dpi-x: the horizontal resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters,
  • --dpi-y: the vertical resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters.
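
For reference, the arithmetic behind such a conversion: one inch is 25.4 mm, or 254 tenths of a millimeter, so a coordinate expressed in tenths of millimeters corresponds to value × dpi / 254 pixels. The sketch below only illustrates this formula; the tenths_mm_to_pixels name and the rounding to the nearest pixel are assumptions, and the CLI's own handling may differ.

```shell
# Illustrative conversion (not taken from the Arkindex CLI source):
# pixels = tenths_of_mm * dpi / 254, rounded to the nearest integer.
tenths_mm_to_pixels() {
  awk -v value="$1" -v dpi="$2" 'BEGIN { printf "%d\n", value * dpi / 254 + 0.5 }'
}

# A coordinate of 2540 tenths of a millimeter (10 inches) at 300 DPI.
tenths_mm_to_pixels 2540 300
```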

To manage Arkindex publication, you can specify three more options:

  • --skip-metadata will skip publishing metadata (attributes on XML nodes), which improves the upload speed as metadata must be published element by element.
  • --ignore-types will ignore XML nodes of the given types, which improves the upload speed.
  • --parent-name: Name of a parent folder under which page elements will be created (defaults to ALTO upload from CLI). Only used when using a SQLite database with the --db argument.

Alto XML documents with images that are on a Gallica IIIF server

The gallica subcommand allows you to upload Alto XML documents whose images are on the Gallica IIIF server. In that case, use the gallica subcommand instead of the alto subcommand, with the same arguments as listed above for the alto subcommand, plus the following additional argument:

  • --metadata-file: this required argument should be used to provide the path to a CSV file that contains two columns:
    • the first column should contain the folder’s name,
    • the second column should contain its corresponding ARK ID. This file will be used to build the URLs from which the images are fetched on the Gallica IIIF server.
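
Before running the import, the shape of the metadata file can be checked with a short script. This is a minimal sketch under stated assumptions: the file is assumed to be comma-separated, without a header row and without quoted fields containing commas, none of which is specified above; check_metadata_csv is a hypothetical helper.

```shell
# Hypothetical check: succeed only if every line has exactly two
# comma-separated columns (folder name, ARK ID). The comma delimiter and the
# absence of a header row are assumptions.
check_metadata_csv() {
  awk -F',' '
    NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

A call such as `check_metadata_csv metadata.csv` returns a non-zero status and lists the offending lines when a row does not have exactly two columns.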

METS XML documents

The mets subcommand allows you to upload images, elements, metadata and transcriptions from ALTO XML documents as well as a hierarchy described in a METS XML file. Only ALTO XML files are supported for import. If the ALTO XML files have already been imported into Arkindex and the cache is still present, the associated elements will be reused to add the missing data (parent, children…).

Required arguments

The mets command takes two required arguments.

  • the path to the METS XML file, the first positional argument,
  • --worker-run-id, the ID of a worker run to publish children elements using bulk endpoints.

There are two ways to import the METS XML file. You can choose an option using one of these two (required, mutually exclusive) arguments:

  • --parent-id: the ID of an existing folder-type element on Arkindex, into which the images, elements, metadata and transcriptions will be imported. This will use the Arkindex API to upload all information provided by the METS XML file.
  • --db: the path to a SQLite database into which the images, elements, metadata and transcriptions will be imported. Using a SQLite database will limit the use of the Arkindex API to retrieving the worker run and creating images. This will also improve the upload speed. All other API calls (for the creation of elements, metadata and transcriptions) will be avoided in order to populate the database. This database will respect the Arkindex export format version 9 and can then be imported directly into Arkindex. ⚠ As image creation is always done using the Arkindex API, the database can only be imported on the instance used by the command (see the -p/--profile argument in the dedicated section).

Optional arguments

There are other optional arguments that will be useful to handle the images.

  • --iiif-base-url, the base URL of the IIIF server where the images used by the METS file are exposed (defaults to https://europe-gamma.iiif.teklia.com/iiif/2/),
  • --iiif-prefix, the prefix on the IIIF server behind which the images used by the METS file are exposed,
  • --dpi-x, the horizontal resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters,
  • --dpi-y, the vertical resolution of the image, in dots per inch, to be used for ALTO files using coordinates in tenths of millimeters.

Note

The DPI related arguments are ignored for files using coordinates in pixels.

The full command to upload a METS XML file’s data would look like

arkindex upload mets \
  $PATH/TO/mets.xml \
  $ARKINDEX_CORPUS_ID \
  [--parent-id $ARKINDEX_ELEMENT_ID | --db mets_upload.db] \
  --worker-run-id $ARKINDEX_WORKER_RUN_ID \
  --iiif-prefix http://some-server.domain.com/iiif/folder/ \
  --dpi-x 300 \
  --dpi-y 300

To manage Arkindex publication, you can specify four more options:

  • --alto will publish the ALTO files in the correct order, according to the fileSec::fileGrp node of the METS file, before uploading the METS file.
  • --skip-metadata will skip publishing metadata (attributes on XML nodes), which improves the upload speed as metadata must be published element by element.
  • --ignore-types will ignore XML nodes of the given types, which improves the upload speed.
  • --parent-name: Name of a parent folder under which page elements will be created (defaults to METS upload from CLI). Only used when using a SQLite database with the --db argument.

Limitations and requirements

  • While this command imports the elements described in each ALTO XML file and all the structural elements described in the METS XML file, this does not link the structural elements to Arkindex pages.
  • The images must be available on the IIIF server with the same file hierarchy, before using this command.
  • Supported metadata are:
    • the ID of the node in the METS structure (imported as METS ID),
    • any metadata specified through the DMDID attribute that links towards a dmdSec::mdWrap[MDTYPE="DC"] node will be imported on the related element.
  • You must have administrator access to the corpus to run the import.
  • The command will store the ID of the METS and ALTO files as well as the LANG as element metadata. The following metadata must be allowed in your corpus, before running this command:
    • METS ID, with type reference,
    • Alto ID, with type reference,
    • Lang, with type text.

IIIF images

The iiif-images subcommand allows you to create elements on Arkindex from a text file containing a list of IIIF image URIs (such as those generated by the MinIO upload subcommand).

arkindex upload iiif-images $PATH/TO/uris_list_file.txt --corpus-id $ARKINDEX_CORPUS_ID

Required arguments

The minimal required arguments for running the iiif-images command are:

  • the path to the file containing the list of IIIF image URIs, which is a positional argument.
  • the ID of either an Arkindex corpus or an Arkindex folder-type element into which the images will be imported, specified using one of two (mutually exclusive) arguments:
    • --corpus-id
    • --parent-folder
      arkindex upload iiif-images $PATH/TO/uris_list_file.txt --parent-folder $ARKINDEX_ELEMENT_ID
      

Optional arguments

Whether you import your images into a corpus or into a folder-type element, the IIIF import command will create a folder for your imported images. You can specify its name and type using the following arguments:

  • --import-folder-name: defaults to “IIIF import”.
  • --import-folder-type: an existing element type in the target corpus; defaults to folder.

    arkindex upload iiif-images $PATH/TO/uris_list_file.txt --import-folder-name $FOLDER_NAME --import-folder-type $FOLDER_TYPE --corpus-id $ARKINDEX_CORPUS_ID
    

  • --element-type: you can use this argument to specify the type (an existing element type in the target corpus) of the elements that will be created from your IIIF images; defaults to page.

    arkindex upload iiif-images $PATH/TO/uris_list_file.txt --element-type $ELEMENT_TYPE --corpus-id $ARKINDEX_CORPUS_ID
    

  • --image-name-delimiter: define the delimiter for the last part of the image URI which will be used as the element name on Arkindex; defaults to /. For example, if your image’s URI is http://some-server.domain.com/iiif/folder/date%category%filename.jpg, the default delimiter / gives the element the name date%category%filename.jpg, whereas setting the delimiter to % gives it the name filename.jpg.

    arkindex upload iiif-images $PATH/TO/uris_list_file.txt --image-name-delimiter % --corpus-id $ARKINDEX_CORPUS_ID
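
The effect of --image-name-delimiter can be reproduced with plain shell parameter expansion. The sketch below is an illustration of the behavior described above, not the CLI's actual code; element_name is a hypothetical helper.

```shell
# Illustrative sketch (not the CLI's code): keep what follows the last
# occurrence of the delimiter in the URI; the delimiter defaults to "/".
element_name() {
  uri=$1
  delimiter=${2:-/}
  printf '%s\n' "${uri##*"$delimiter"}"
}

uri='http://some-server.domain.com/iiif/folder/date%category%filename.jpg'
element_name "$uri"        # prints date%category%filename.jpg
element_name "$uri" '%'    # prints filename.jpg
```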
    

Elements hierarchy

If you want to import your images to Arkindex with a given hierarchy, not importing all the images in one folder, you can use the following arguments:

  • One of the following mutually exclusive arguments:
    • --keep-hierarchy: recreate on Arkindex the hierarchy contained in the IIIF image URIs. For example, if your URIs look like this: http://some-server.domain.com/iiif/FOLDER1/SUBFOLDER1/SUBFOLDER2/filename.jpg then the import command will create, inside the import folder, a FOLDER1 element, and inside it a SUBFOLDER1 element, and inside it a SUBFOLDER2 element and inside it your image.
    • --group-prefix-delimiter: create sub-folders grouping IIIF images by name prefix, splitting file names between group prefix and image names according to the group prefix delimiter. For example, if you have images with URIs that look like http://some-server.domain.com/iiif/folder/subfolder/date%location1%filename1.jpg and others with URIs like http://some-server.domain.com/iiif/folder/subfolder/date%location2%filename2.jpg, you can put filename1.jpg into a date%location1 sub-folder, and filename2.jpg into a date%location2 sub-folder with the following command:
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --group-prefix-delimiter % --corpus-id $ARKINDEX_CORPUS_ID
  • --group-folder-type: define the type of the sub-folders that will be created to contain your grouped images; defaults to the type set by --import-folder-type.
arkindex upload iiif-images $PATH/TO/uris_list_file.txt --keep-hierarchy --group-folder-type $ELEMENT_TYPE --corpus-id $ARKINDEX_CORPUS_ID
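
To preview how --group-prefix-delimiter would split a file name, the grouping can be imitated in the shell. This is a minimal sketch inferred from the example above, where the split happens at the last occurrence of the delimiter; split_group_prefix is a hypothetical helper and the CLI's real implementation may handle edge cases differently.

```shell
# Illustrative sketch: print the group prefix and the image name, separated by
# a tab, for one IIIF image URI and one group prefix delimiter.
split_group_prefix() {
  file_name=${1##*/}                  # drop everything up to the last "/"
  delimiter=$2
  group=${file_name%"$delimiter"*}    # text before the last delimiter
  name=${file_name##*"$delimiter"}    # text after the last delimiter
  printf '%s\t%s\n' "$group" "$name"
}

split_group_prefix 'http://some-server.domain.com/iiif/folder/subfolder/date%location1%filename1.jpg' '%'
```

For this URI, the sketch prints date%location1 and filename1.jpg, which matches the sub-folder and image names given in the description of the argument.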

Usage examples

Grouping images by prefix

You want to import the images whose URIs are listed in the following my_iiif_images.txt file:

http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page1.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page2.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%aveyron%page3.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%tarn%page1.jpg
http://some-server.domain.com/iiif/folder/subfolder/numgrp10-france%tarn%page2.jpg

You do not care about numgrp10, which corresponds to a digitization campaign, or about the folder hierarchy before it. You want to import these images inside a folder element called occitanie, within a corpus with the ID aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, and you want the images to be grouped inside subfolders of type departement (you have created this element type, ticking the “folder” checkbox, in your corpus from the project administration page) based on the france%aveyron-like prefixes.

You should then run the following command:

arkindex upload iiif-images ./my_iiif_images.txt --corpus-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa --import-folder-name occitanie --image-name-delimiter - --group-prefix-delimiter % --group-folder-type departement

This command results in the creation, in the targeted corpus, of an occitanie folder element, within which are two france%aveyron and france%tarn departement elements, containing their respective page elements.

Recreating a folder hierarchy

You want to import the images whose URIs are listed in the following my_iiif_images.txt file:

http://some-server.domain.com/iiif/cork/cork/file1.jpg
http://some-server.domain.com/iiif/cork/cork/file2.jpg
http://some-server.domain.com/iiif/cork/cork/file3.jpg
http://some-server.domain.com/iiif/cork/mallow/file1.jpg
http://some-server.domain.com/iiif/cork/mallow/file2.jpg
http://some-server.domain.com/iiif/limerick/kilmallock/file1.jpg
http://some-server.domain.com/iiif/limerick/kilmallock/file2.jpg

You want to import these images inside a folder element called Ireland, within a corpus with the ID aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa, and you want to reproduce on Arkindex the hierarchy from the URIs. You do not want to import your images as page elements, but as double_page elements (you have created this element type in your corpus from the project administration page).

You should then run the following command:

arkindex upload iiif-images ./my_iiif_images.txt --corpus-id aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa --import-folder-name Ireland --element-type double_page --keep-hierarchy

This command results in the creation, in the targeted corpus, of an Ireland folder element, within which are two cork and limerick folder elements. The cork folder contains two more sub-folders, called cork and mallow; the limerick folder contains one kilmallock folder. Inside those sub-folders, your images have been imported as double_page elements.