Cookbook¶
Overview¶
This page presents a number of “recipes”, to show how to achieve simple tasks using the REST API. These are not supported as such, and are provided to give hints and ideas, rather than to be used as is.
Batch operations¶
Using a non-system user¶
Whilst many environments have an accepted “robot” user to perform batch operations, typically with an obfuscated password, in some circumstances, this is either not possible, or not desirable.
If you are using the sample authentication server, you can achieve passwordless batch operations by getting a token using a special user which is shared between your script and the authentication server.
First, add a user to the authentication server; you will need to restart the authentication server after doing this. In this example we use the username “specialuser”, but the name should be chosen on a site-specific basis.
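from arcapix.config import config
# Add (or update) the special user in the authentication server's password table
passwords = config.get("arcapix.search.server.authserver.passwords", {})
passwords.update({'specialuser': 'specialpassword'})
config['arcapix.search.server.authserver.passwords'] = passwords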
Warning
You may not be able to read or write that configuration property on a properly configured system without amending the group memberships, filesystem permissions or ACLs to grant your script user access to it.
Then use that user’s credentials in your scripts:
import requests
from arcapix.config import config

# Look up the shared password and the authentication server URL
pwd = config.get("arcapix.search.server.authserver.passwords")['specialuser']
authserverurl = config.get("arcapix.search.server.authserver.url")

# Request a token using the OAuth2 password grant
resp = requests.post(
    authserverurl + "oauth2/token",
    data={"grant_type": "password", "username": "specialuser", "password": pwd})
resp.raise_for_status()  # fail early if the request was rejected
token = resp.json()['access_token']
Note
This approach will only work if the user running the batch operation has access to the arcapix.search.server.authserver.passwords configuration key. In a properly configured system, such access will be limited to a specific group, quite possibly the apsearch user only.
This example is available in the samples/cookbook directory, as batch_token.py and update_authserver.py.
Locating files¶
Finding all the files which have changed recently¶
This is most easily done by searching for all files whose last modification date is later than the time you last checked. In essence:
import requests
from time import time, sleep


def emitter(pathname):
    # Send pathname to your third party watch process
    # If you are very concerned at never emitting the same pathname twice
    # you can add logic to do that here
    print(pathname)


token = ...  # Follow method above or use system standardised robot user.
server = ...  # e.g. https://mypixsearchserver/api
since = time()
while True:
    # Get all files modified since 600 seconds before the last check
    data = requests.get(
        '%s/files/?where={"core.modificationtime": {"gte": %r}}' % (server, since - 600),
        auth=(token, '')
    ).json()
    # parse collection+json
    if 'items' in data['collection']:
        for file_ in data['collection']['items']:
            # locate the pathname of the file in the item's data fields
            pathname = [c['value'] for c in file_['data'] if c['name'] == 'core.pathname'][0]
            # Pass over to third party wrapper function
            emitter(pathname)
    since = time()
    sleep(300)  # Wait for 5 minutes and repeat
Note
There is the potential for some lag between files being modified and their making it into the database. Hence the since - 600 in the query, which allows for a small overlap.
This example is available in the samples/cookbook directory, as files_since.py.
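Because each poll looks back 600 seconds but runs every 300 seconds, the same pathname will normally be reported more than once. A minimal sketch of the de-duplication logic hinted at in the emitter comment might look like the following. The names make_deduplicating_emitter and max_remembered are hypothetical and not part of the product, the sample paths are made up, and a bounded history is assumed to be acceptable.

```python
from collections import OrderedDict


def make_deduplicating_emitter(emit, max_remembered=100000):
    """Wrap `emit` so each pathname is passed on at most once.

    Hypothetical helper, not part of PixStor Search. Only the most recent
    `max_remembered` pathnames are remembered, so memory use stays bounded.
    """
    seen = OrderedDict()

    def emitter(pathname):
        if pathname in seen:
            return False  # already emitted, skip it
        seen[pathname] = True
        if len(seen) > max_remembered:
            seen.popitem(last=False)  # forget the oldest pathname
        emit(pathname)
        return True

    return emitter


# Made-up paths; the second "/mmfs1/a.txt" is suppressed
emitted = []
emitter = make_deduplicating_emitter(emitted.append)
for path in ["/mmfs1/a.txt", "/mmfs1/b.txt", "/mmfs1/a.txt"]:
    emitter(path)
print(emitted)
```

Note that keying on the pathname alone also suppresses a file that is genuinely modified again later. If that matters, the key could be the pathname together with its modification time.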
An alternative approach would be to write a custom plugin which will be called at ingest time. This is more complex to achieve, but has the advantage of being more “realtime”.
However, please note the following warning:
Warning
This approach is considered an abuse of the supported interface for a plugin. The code may, at some future time, be modified in ways that break such plugin use.
from arcapix.search.metadata.plugins.base import Plugin


def emitter(filename):
    # Send filename to your third party watch process
    print(filename)


class ThirdPartyEmitterPlugin(Plugin):

    def namespace(self):
        return "_product_by_third_party"

    def schema(self):
        return []

    def handles(self, mimetype, extension):
        return True

    def is_async(self):
        return True

    def process(self, filename, fileinfo):
        emitter(filename)
        # This could also be done asynchronously by calling,
        # self._submit(emitter, args=[filename])
        #
        # However, this is a fairly heavyweight process, so should not be used unless
        # the emitter wrapped function is long running (>1 minute or so)
By having handles & is_async both return True, the plugin has maximum flexibility. However, for performance reasons, you may wish to declare the plugin as only handling certain filetypes, or if possible, have is_async return False.
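As a sketch of the narrower declaration described above, the two methods could be written as follows. This assumes that the extension is passed as a string such as "mov" or ".mov" (the exact form is not stated here, so both are normalised), and WATCHED_EXTENSIONS is a hypothetical site-specific set. The class is shown standalone so that it runs without the product installed; in a real plugin these methods would sit on the Plugin subclass.

```python
# Hypothetical, site-specific set of extensions the third party cares about
WATCHED_EXTENSIONS = {"mov", "mxf", "wav"}


class SelectiveEmitterMixin(object):
    """Methods that could be placed on the Plugin subclass (sketch only)."""

    def handles(self, mimetype, extension):
        # Assumption: extension may arrive with or without a leading dot
        ext = (extension or "").lower().lstrip(".")
        return ext in WATCHED_EXTENSIONS

    def is_async(self):
        # Assumption: process() calls the emitter directly rather than
        # submitting it asynchronously
        return False


plugin = SelectiveEmitterMixin()
print(plugin.handles("video/quicktime", ".MOV"))
print(plugin.handles("text/plain", "txt"))
```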
Access via the command line¶
PixStor Search can be queried on the command line using cURL, e.g.
$ curl -g 'https://mypixsearchserver/api/files/?where={"_all":"jpg"}&max_results=10&pretty' -H 'authorization: Basic ...'
This will return search results in Collection+JSON (C+J) format. See Using the REST API for more information on the REST interface.
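The value elided after Basic in the command above is left as is. As a hedged sketch, if the same scheme as the Python polling example is assumed (the token as the username with an empty password, as in auth=(token, '')), the header value would be built in the standard HTTP Basic way. The function name basic_auth_header and the token value "abc" are made up for illustration.

```python
import base64


def basic_auth_header(token):
    """Return an HTTP Basic Authorization value for `token` with an empty password."""
    # Basic credentials are "username:password"; here the password is empty
    credentials = ("%s:" % token).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# "abc" is a placeholder, not a real token
print(basic_auth_header("abc"))
```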
Getting paths that match a query¶
To get a list of paths for files which match a certain query, the results from cURL can be processed like this:
$ curl ... | awk -F': ' '/core\.pathname/ {getline; gsub(/\"/, ""); print $2}'
Or using jq:
$ curl ... | jq -r '.collection.items[].data[] | select(.name=="core.pathname").value'
Or using the pxs_file_list tool from the arcapix-search-client-utils package.
The paths can then be piped to some other utility, e.g.
$ pxs_file_list --filter-by-field core.extension .tmp | xargs -I {} rm {}
In this example we are finding and deleting temporary (.tmp) files.
Note: this particular example only returns the first 10 results. If there are more than 10 files matching a search, you will need to increase max_results, or else iterate over pages of results.
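Iterating over pages could be sketched as below. This assumes that each page advertises the next one through a link with rel "next" in collection.links, as the Collection+JSON format allows; whether this server does so is not confirmed here. fetch_page is a hypothetical callable that returns the decoded JSON for a URL (for example, a thin wrapper around requests.get(...).json()), and the sample pages are made up.

```python
def iter_pathnames(first_url, fetch_page):
    """Yield core.pathname values from every page of a Collection+JSON result."""
    url = first_url
    while url:
        collection = fetch_page(url)["collection"]
        for item in collection.get("items", []):
            for field in item["data"]:
                if field["name"] == "core.pathname":
                    yield field["value"]
        # Assumption: the next page, if any, is linked with rel="next"
        url = next(
            (link["href"] for link in collection.get("links", [])
             if link.get("rel") == "next"),
            None,
        )


# Made-up two-page response used in place of a real server
PAGES = {
    "page1": {"collection": {
        "items": [{"data": [{"name": "core.pathname", "value": "/mmfs1/a.tmp"}]}],
        "links": [{"rel": "next", "href": "page2"}]}},
    "page2": {"collection": {
        "items": [{"data": [{"name": "core.pathname", "value": "/mmfs1/b.tmp"}]}]}},
}

paths = list(iter_pathnames("page1", PAGES.__getitem__))
print(paths)
```

Because the function is a generator, the paths can be consumed one at a time without holding every page in memory.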