Cookbook

Overview
Batch operations
- Using a non system-user
Locating files
- Finding all the files which have changed recently
Access via the command line
- Getting paths that match a query

Overview

This page presents a number of “recipes”, to show how to achieve simple tasks using the REST API. These are not supported as such, and are provided to give hints and ideas, rather than to be used as is.

Batch operations

Using a non system-user

Whilst many environments have an accepted “robot” user to perform batch operations, typically with an obfuscated password, in some circumstances, this is either not possible, or not desirable.

If you are using the sample authentication server, you can achieve passwordless batch operations by getting a token using a special user which is shared between your script and the authentication server.

First, add a user to the authentication server. You will need to restart the authentication server after doing this. In this example we use the username “specialuser”, but the name should be chosen on a site-specific basis.

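from arcapix.config import config

# Add the special user to the existing password mapping (or start a new one)
passwords = config.get("arcapix.search.server.authserver.passwords", {})
passwords.update({'specialuser': 'specialpassword'})
config['arcapix.search.server.authserver.passwords'] = passwords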

Warning

On a properly configured system, you may not be able to read or write that configuration property without amending the group memberships, filesystem permissions or ACLs to grant your script user access to it.

Then use that user’s credentials in your scripts:

import requests
from arcapix.config import config

pwd = config.get("arcapix.search.server.authserver.passwords")['specialuser']
authserverurl = config.get("arcapix.search.server.authserver.url")

resp = requests.post(
    authserverurl + "oauth2/token",
    data={"grant_type": "password", "username": "specialuser", "password": pwd})
resp.raise_for_status()  # fail early if the authentication server rejects the request

token = resp.json()['access_token']

Note

This approach will only work if the user running the batch operation has access to the arcapix.search.server.authserver.passwords configuration key. In a properly configured system, such access will be limited to a specific group, quite possibly the apsearch user only.

This example is available in the samples/cookbook directory, as batch_token.py and update_authserver.py.
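The token request above can be wrapped in a small helper so that failures surface clearly in a batch script. The following is a minimal sketch rather than part of the shipped samples: the function name `fetch_token` and its `post` parameter are hypothetical, and the only details taken from the recipe are the `oauth2/token` path, the password grant fields and the `access_token` key.

```python
def fetch_token(post, authserver_url, username, password):
    """Request an access token using the password grant shown in the recipe.

    `post` is any callable with the same interface as requests.post
    (hypothetical parameter, injected so the helper is easy to test).
    """
    # Tolerate a base URL configured with or without a trailing slash.
    url = authserver_url.rstrip("/") + "/oauth2/token"
    resp = post(url, data={"grant_type": "password",
                           "username": username,
                           "password": password})
    resp.raise_for_status()  # raise on HTTP 4xx/5xx responses
    payload = resp.json()
    if "access_token" not in payload:
        raise RuntimeError("authentication server response contained no access_token")
    return payload["access_token"]
```

In a script this might be called as `token = fetch_token(requests.post, authserverurl, "specialuser", pwd)`, using the variables defined in the recipe above.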

Locating files

Finding all the files which have changed recently

This is most easily done by searching for all files whose last modification time is later than the time you last checked. In essence:

import requests
from time import time, sleep

def emitter(pathname):
    # Send pathname to your third party watch process.
    # If you need to guarantee that the same pathname is never emitted twice,
    # you can add logic to do that here.
    print(pathname)

token = ...  # Follow method above or use system standardised robot user.
server = ...  # e.g. https://mypixsearchserver/api

since = time()

while True:
    # Get all files which have been modified more recently than (since - 600) seconds ago
    data = requests.get(
        '%s/files/?where={"core.modificationtime": {"gte": %r}}' % (server, since-600),
        auth=(token, '')
    ).json()

    # Parse the Collection+JSON response
    if 'items' in data['collection']:
        for file_ in data['collection']['items']:
            # Locate the pathname of the file in the item's data entries
            pathname = [c['value'] for c in file_['data'] if c['name'] == 'core.pathname'][0]
            # Pass over to third party wrapper function
            emitter(pathname)

    since = time()
    sleep(300)  # Wait for 5 minutes and repeat

Note

There may be some lag between a file being modified and the change reaching the database. The query therefore uses since - 600 to overlap with the previous polling window, which means the same pathname can be emitted more than once.

This example is available in the samples/cookbook directory, as files_since.py.
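Because the polling windows overlap, the emitter may see the same pathname on consecutive polls. One possible way to suppress those repeats is a small time-limited cache, sketched below. The class name `RecentlyEmitted` and its parameters are hypothetical and not part of the product; the default TTL of 900 seconds is an assumption chosen to exceed the 600 second overlap used above.

```python
import time


class RecentlyEmitted(object):
    """Remember pathnames for a limited time so overlapping polls do not re-emit them."""

    def __init__(self, ttl_seconds=900, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._seen = {}  # pathname -> time at which it was emitted

    def should_emit(self, pathname):
        now = self._clock()
        # Forget entries older than the TTL so the cache cannot grow without bound.
        expired = [p for p, t in self._seen.items() if now - t > self.ttl_seconds]
        for path in expired:
            del self._seen[path]
        if pathname in self._seen:
            return False
        self._seen[pathname] = now
        return True
```

Inside `emitter`, the call to the third party process would then be guarded with `if cache.should_emit(pathname):`, where `cache` is a single `RecentlyEmitted` instance created before the loop. Note the trade-off: a file that is modified again within the TTL is also suppressed, so the TTL should be kept close to the overlap window.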

An alternative approach would be to write a custom plugin which will be called at ingest time. This is more complex to achieve, but has the advantage of being more “realtime”.

However, please note the following warning:

Warning

This approach is considered an abuse of the supported interface for a plugin. The code may, at some future time, be modified in ways that break such plugin use.

from arcapix.search.metadata.plugins.base import Plugin

def emitter(filename):
    # Send filename to your third party watch process
    print(filename)

class ThirdPartyEmitterPlugin(Plugin):

    def namespace(self):
        return "_product_by_third_party"

    def schema(self):
        return []

    def handles(self, mimetype, extension):
        return True

    def is_async(self):
        return True

    def process(self, filename, fileinfo):
        emitter(filename)
        # This could also be done asynchronously by calling,
        # self._submit(emitter, args=[filename])
        #
        # However, this is a fairly heavyweight process, so should not be used unless
        # the emitter wrapped function is long running (>1 minute or so)

Having handles and is_async both return True gives the plugin maximum flexibility. For performance reasons, however, you may wish to declare the plugin as handling only certain file types or, if possible, have is_async return False.
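Restricting the plugin to certain file types could look something like the sketch below. The helper name `handles_watched` and the extension list are hypothetical, and it is an assumption that the extension may be passed with or without a leading dot, so the value is normalised before comparison.

```python
# Hypothetical set of extensions to watch; choose these per site.
WATCHED_EXTENSIONS = frozenset([".mov", ".mxf", ".exr"])


def handles_watched(mimetype, extension):
    """Return True only for extensions listed in WATCHED_EXTENSIONS.

    Assumption: the extension may arrive with or without a leading dot
    and in any letter case, so it is normalised before the lookup.
    """
    if not extension:
        return False
    ext = extension.lower()
    if not ext.startswith("."):
        ext = "." + ext
    return ext in WATCHED_EXTENSIONS
```

The plugin's `handles(self, mimetype, extension)` method would then return `handles_watched(mimetype, extension)` instead of `True`.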

Access via the command line

PixStor Search can be queried from the command line using cURL, e.g.

$ curl 'https://mypixsearchserver/api/files/?where={"_all":"jpg"}&max_results=10&pretty'  -H 'authorization: Basic ...'

This will return search results in Collection+JSON (C+J) format. See Using the REST API for more information on the REST interface.
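When the same query is issued from a script, the `where` expression has to be valid JSON and needs percent-encoding in the URL. A minimal sketch of building the query parameters is shown below; the helper name `build_search_params` is hypothetical, while the `where` and `max_results` parameter names are taken from the cURL example above.

```python
import json


def build_search_params(where, max_results=None):
    """Build query parameters for the /files/ endpoint.

    `where` is a plain dict. It is serialised to compact JSON so that an
    HTTP client can percent-encode it when it assembles the URL.
    """
    params = {"where": json.dumps(where, separators=(",", ":"))}
    if max_results is not None:
        params["max_results"] = max_results
    return params


params = build_search_params({"_all": "jpg"}, max_results=10)
```

The resulting dictionary can be passed to an HTTP client, for example as `requests.get(server + "/files/", params=params, ...)`, which takes care of the encoding.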

Getting paths that match a query

To get a list of paths for files which match a certain query, the results from cURL can be processed as follows:

$ curl ... | awk -F': ' '/core\.pathname/ {getline; gsub(/\"/, ""); print $2}'

Or using jq:

$ curl ... | jq -r '.collection.items[].data[] | select(.name=="core.pathname").value'

Or using the pxs_file_list tool from the arcapix-search-client-utils package.

The paths can then be piped to some other utility, e.g.

$ pxs_file_list --filter-by-field core.extension .tmp | xargs -I {} rm {}

In this example we are finding and deleting temporary (.tmp) files.

Note: this particular example only returns the first 10 results. If more than 10 files match a search, you will need to increase max_results, or else iterate over the pages of results.
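The jq filter shown earlier can also be expressed in Python when the results are processed inside a script. The sketch below assumes the response has already been parsed into a dictionary, for example with `response.json()`. The function name `extract_values` is hypothetical and the sample paths are made up for illustration.

```python
def extract_values(document, field="core.pathname"):
    """Yield the value of `field` from each item of a Collection+JSON document."""
    for item in document.get("collection", {}).get("items", []):
        for entry in item.get("data", []):
            if entry.get("name") == field:
                yield entry.get("value")
                break  # take the first matching entry of each item


# Hand-written sample in the same shape the jq filter expects (illustrative only).
sample = {"collection": {"items": [
    {"data": [{"name": "core.pathname", "value": "/mmfs1/data/a.tmp"},
              {"name": "core.extension", "value": ".tmp"}]},
    {"data": [{"name": "core.pathname", "value": "/mmfs1/data/b.tmp"}]},
]}}

paths = list(extract_values(sample))
```

Items that lack the requested field are skipped rather than raising an error, which differs from the list-indexing approach used in the polling recipe above.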
