Using the REST API¶
This section contains information on how to make requests to PixStor Search by using the REST API.
The API is compliant with the Collection+JSON (C+J) specification, but note that it also makes use of the collection-level properties extension, and a minor extension to allow for templated URLs.
It may be a little different from some other REST approaches, but it is more predictable and easier for machines to consume, since the schema is entirely consistent across any service that uses the C+J approach.
It should be viewed as somewhat like a website, with links that can be followed and forms that can be filled in, rather than as a simple fire-and-forget approach, with the strong caveat that it is designed for machine consumption rather than human.
In addition, it is possible to short-circuit the process, but this requires an understanding of the URL format, which consuming the C+J directly does not. The URL format is considered to be less canonical (i.e. more likely to change between versions).
Note
To keep the documentation short, the following applies throughout unless otherwise specified: success is indicated by a 200, 201, or 202 HTTP status code, a 500 HTTP status code may be returned for a generic server error, and all methods are GET.
Note
To keep the documentation short, the required headers are not repeated for each request. All requests other than authorisation requests must include the following HTTP headers:
Accept: application/vnd.collection+json
WWW-Authenticate: <bearer based token string>
Authentication¶
Before accessing the API, one must first authenticate, using the OAuth2 process for resources defined in RFC 6749.
It should be noted that in this context, the developer's application is the “Client”, and typically utilisation is via grant_type=password.
Arcapix supplies a basic example Authorisation Server, which supports only a grant_type of password and authenticates against the standard PAM system.
Note that the authentication server is a distinct service from the search server, and is accessed on a different port. It must in all cases be accessed via SSL. As standard, the search server runs on port 5000 and the auth server on port 5001. Depending on the client library/language you are using, you may need to accept self-signed certificates or install the Arcapix CA.
URL¶
https://authserver/oauth2/token
Method¶
POST
Request Parameters¶
Parameter | Description | Required |
---|---|---|
grant_type | Always “password” | Yes |
username | username | Yes |
password | password | Yes |
Success Status Code¶
200
Payload¶
{
"access_token": "feoF8MpdWqnUAI3FiMc9v6_PDspAfJPzc_-uwueC9I7IDkxz_hvYITVsNWZ5IOH19nfwIADhIpo9q_GDaCLyUGvA-_RUAEaPcurWFSTX5zClBGZ-I3n2WQbnvLVkvweVWGNilBTdNwdNndmNyqYI-lVt4RO1tIylV29mN7GQOMRXZAWKMXunc_0qpNpJy47M8tPZVVReXREnGd96SovGspKQ-AUAH1IcaD3mqlzrxiNg_j9cRP3KSdhSy_cHSuhN4QdX96jJ5TnsPPHXbFnK26k4jbBPb7sOx39LcXXOOuCjV_RioqaZHe_xt7l3tuuetxlNeU5PhgM2vJsWxBHQrJau9bG0pO24tkMEj5ByUBIH4EiXCyCtx9NbfpB_Hyu0KsHv8IFPcMAZlC7Ijcpg9g2zCa7iGIA_o-uYrHDzxg6sQPQVzgPmJuD1RkFVMXsbiwan7vFCFOscoeCKfcxHW8GTB9SFEZ3aErnGsHMgIRIvBbcH3nyIATcnaTVVZOKYP82851NJgHQUaCmZ1zDkjndbcmiAdvYnOh2EUVVlAoL0UiTLS4qh6EgEF4OIj3_blEn0iSzF5269tiDgaMYtf39839_2eN1zr9Td7BEs9srz5OWQm482Djz04LjL2veYhLOdxVaDYoiRYrvyeDblRPaMu4AWZmjlJEqtDSm664AARCAPIX",
"expires_in": 86400,
"token_type": "Bearer"
}
Note
Tokens by default expire after 24 hours. The sample auth server does not support token refresh - a new token must be requested. Tokens may not persist across server restarts, depending on the configuration of the server.
Error status code¶
400 - Bad request (NB. This is not terribly precise - 403 or 412 might be more useful but are not what’s specified in the RFC)
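The token request described above can be sketched with the Python standard library alone. The host and port authserver:5001 follow the defaults mentioned earlier; the function names and the credentials in the usage note are illustrative assumptions, and certificate handling (self-signed certificates or the Arcapix CA) is left out.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(auth_server, username, password):
    """Build the POST request for the password grant (no network access here)."""
    form = urllib.parse.urlencode({
        "grant_type": "password",   # the only grant type the sample server supports
        "username": username,
        "password": password,
    }).encode("ascii")
    # Supplying a body makes urllib send it form-encoded, as RFC 6749 expects.
    return urllib.request.Request(
        "https://{}/oauth2/token".format(auth_server), data=form, method="POST")


def fetch_access_token(auth_server, username, password):
    """Send the request and return the access_token field of the JSON payload."""
    request = build_token_request(auth_server, username, password)
    with urllib.request.urlopen(request) as response:   # raises HTTPError on a 400
        return json.load(response)["access_token"]
```

A call such as fetch_access_token("authserver:5001", "alice", "s3cret") would then return the token string to pass on to the search server.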
Onward Usage¶
However it is achieved, the end point of a successful authentication is an access token.
This must be passed to the search server via an appropriately encoded Bearer WWW-Authenticate header.
NB. Most libraries will take care of the encoding if you pass the access token as the username, with an empty password, e.g.
import requests
requests.get("https://mypixsearchserver/api/files/", auth=requests.auth.HTTPBasicAuth(access_token, ''))
Billboard URL¶
The C+J exploration starts by retrieving the server’s root URL.
Provided the correct access token is passed via the standard WWW-Authenticate header, we will receive a response containing a list of possible queries.
By filling in the parameters requested, one can craft a suitable query without knowing the URL structure.
URL¶
https://mypixsearchserver/api/files/
Response¶
Consider the following snippet of the response:
{
"data": [
{
"prompt": "Search string",
"name": "where",
"value": ""
}
],
"href": "/files/?where={\"_all\":\"{where}\"}",
"prompt": "Enter a string to search in all fields across all files",
"rel": "search"
}
By replacing the {where} entries with values obtained using the supplied prompts (here, Search string), a suitable query URL can be constructed, e.g. /files/?where={"_all":"promptedvalue"}
A small command line tool might be written as follows:
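import requests
from requests.auth import HTTPBasicAuth

API_ROOT = "https://mypixsearchserver/api"  # assumption: hrefs in responses are relative to this root
auth = HTTPBasicAuth(access_token, '')      # access_token obtained as described under Authentication

r = requests.get(API_ROOT + "/files/", auth=auth)
query = r.json()['collection']['queries'][0]
href = query['href']
print(query['prompt'] + "\n")
# Ask the user for each templated parameter and substitute it into the href
for param in query['data']:
    href = href.replace("{" + param['name'] + "}", input(param['prompt'] + ":\n"))
results = requests.get(API_ROOT + href, auth=auth)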
Rich/Direct query¶
It is possible to directly query without going via the billboard URL, although this may mean your application needs updating should the URL format change.
URL¶
https://mypixsearchserver/api/files/
Request Parameters¶
Parameter | Description | Required | Default |
---|---|---|---|
where | Clause to filter results by | Yes* | NA |
sort | Key to sort by | No | relevance |
page | desired page of results | No | first |
projection | Specify fields to return | No | all properties |
max_results | Maximum number of results per page | No | 25 |
*The where clause isn’t strictly required, but if you do not provide one, no items are returned. This reduces the chances of a malformed query overloading the server.
where¶
filters¶
Filters are applied to specific, named properties. Property names are of the form <namespace>.<property>
The format is where={"property1":"value1", "property2":"value2"}, which will produce an “AND” search.
It is possible to pass multiple values (OR) with an array syntax: where={"property1":["value1","value2"]}
It is possible to apply an AND filter on a single field: where={"property1": {"and": ["value1", "value2"]}}
It is also possible to exclude a particular value: where={"property1": {"not": "value1"}}
Numerical and date-based properties can be searched using ranges, with the keywords gt, gte, lt, and lte, e.g. where={"property1":{"gte":"value1","lt":"value2"}}
Date-range query values can be milliseconds since epoch (not seconds), or ISO 8601 formatted strings, e.g. where={"core.modificationtime":{"gt":"2000-01-01T00:00"}}
These property filters produce exact matches: the whole term must match, including case.
For example, {"location.city":"new"} will not match location.city: New York, nor will {"location.city":"new york"}
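Because the where value is itself JSON, it is usually easier to build it as a native data structure and serialise it than to write it by hand. The sketch below does this in Python; the helper name files_url and the particular filter values are illustrative assumptions, not part of the API.

```python
import json
import urllib.parse

API_ROOT = "https://mypixsearchserver/api"   # example host used throughout this page


def files_url(where, **params):
    """Return a /files/ query URL with the where clause JSON-encoded and URL-escaped."""
    query = {"where": json.dumps(where, separators=(",", ":"))}
    query.update(params)          # e.g. sort, page, max_results
    return API_ROOT + "/files/?" + urllib.parse.urlencode(query)


# AND across properties, OR within one property, an exclusion, and a range
where = {
    "core.mimetype": ["image/jpeg", "image/png"],     # jpeg OR png
    "location.city": {"not": "New York"},             # anything except New York
    "core.size": {"gte": 1024, "lt": 1048576},        # 1 KiB <= size < 1 MiB
}
url = files_url(where, max_results=20)
```

Serialising with json.dumps avoids hand-written quoting mistakes, and urlencode takes care of escaping the braces and quotes for the query string.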
‘_all’ queries¶
There is a special _all field, which performs a search across all properties.
The _all query is tokenised and case-insensitive, meaning where={"_all": "new"} would match New York.
Additionally, the _all query supports a rich query syntax, including boolean operators and wildcards.
Some example queries are:
# files matching either cats or dogs
cats OR dogs
# files matching BOTH cats and dogs
cats AND dogs
# files matching cats and black, or dogs and black
(cats OR dogs) AND black
# files matching cats, but not matching black
cats AND NOT black
# files matching an exact filename
# without quotes, this would be split into three terms: cats, 16, jpg
"cats-16.jpg"
# wildcard query
cats-*
# fuzzy search - files *almost* matching 'cast', such as 'cats'
cast~
# boost - match either cats or dogs, favouring cats
# that is, files matching cats will be preferred over those matching dogs in the results
# this doesn't guarantee that cats will appear first - you may need to use a larger boost
# if 'sort' is used (see below), it takes precedence over boosts
cats^2 OR dogs
# query a specific metadata field
core.directory:cats
# unlike filters (see above), field queries are 'analysed'
# without quotes, the following would be split into: mmfs1, cats
# for an exact match, either use a filter, or quote the query
core.directory:"/mmfs1/cats"
# query by range on a specific field
# less than value
core.size:<1024
# greater or equal to date
core.modificationtime:>=2020-03-01
# value in range (inclusive)
image.width:[800 TO 1920]
These queries would be used as, e.g. where={"_all": "cats OR dogs"}
In the case of exact match, where the search term is quoted, the quotation marks would need to be escaped,
e.g. where={"_all": "\"cats-16.jpg\""}
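When the query is built programmatically, a JSON serialiser takes care of this escaping. A small Python illustration using only the standard library:

```python
import json

# The inner quotes are part of the search term; json.dumps escapes them.
where = json.dumps({"_all": '"cats-16.jpg"'})
print(where)   # {"_all": "\"cats-16.jpg\""}
```

The resulting string still has to be URL-encoded when it is placed in the query string, which most HTTP libraries do when the value is passed as a request parameter.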
Warning
Avoid using queries with leading wildcards, like *.jpg, or worse, *foo*.
Queries with leading wildcards are very slow and resource heavy, and may time out.
To search for files with a particular extension, one can simply search for the extension without the wildcard, e.g. {"_all": "jpg"}
‘_all’ queries can be combined with property filters, e.g. where={"_all": "cats", "core.size": {"gt": 1024}}
Note
Queries don’t match substrings. For example, a query for cat won’t match caterpillar.
To match a substring, you would have to use wildcards: cat*
Similarly, strings aren’t split on underscores, so cat won’t match cat_pictures.
In that case, you would need to search for the full string cat_pictures
sort¶
The sort property specifies a column to sort the data on, with a preceding - used to invert the sort order.
Multiple, comma-separated fields can be specified, e.g. sort=-core.size,core.modificationtime
Note
By default, items are returned in “relevance” order. Unless the filter is very precise, there are likely to be a lot of matches, and sorting them is unlikely to be very useful, as well as incurring a performance hit.
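As a small illustration of the sort syntax, the following Python sketch builds the value from a list of fields. The helper name sort_param is an assumption made for this example.

```python
def sort_param(*keys):
    """Build a sort value from (field, descending) pairs."""
    return ",".join(("-" if descending else "") + field for field, descending in keys)


value = sort_param(("core.size", True), ("core.modificationtime", False))
print(value)   # -core.size,core.modificationtime
```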
projection¶
It is possible to request only a subset of an item's properties. If, for example, a particular metadata field is known to be very large (say 10K or more), it may make sense to exclude it, reducing both network utilisation and JSON parsing overhead.
The syntax is projection={"property1":0} to exclude a field.
Alternatively, you can specify projection={"property1":1} to return only that field.
page¶
The page property indicates where in a paged result set you wish to be. Pages are numbered from 1, so in essence (page - 1) * max_results is the zero-based index of the first result you want.
NB. Using the C+J ‘HATEOAS’ links means you don’t need to do computations to provide “previous”, “next”, “last” type functionality - the required URLs are given to you.
Important
The underlying database, Elasticsearch, has a pagination limit of 10k results.
Pagination links take this limit into account, so if you follow “next” or “last” links you will never exceed the limit.
If you explicitly request a page beyond the pagination limit, a 416 (Range Not Satisfiable) will be returned. This indicates that, while there are more results in the database, the REST server can’t return them.
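Following the supplied links rather than computing page numbers can be sketched as below. This is a minimal Python sketch: fetch is a hypothetical caller-supplied function that takes an href and returns the parsed "collection" object (for instance a thin wrapper around an HTTP GET), so no real transport is shown.

```python
def link_href(collection, rel):
    """Return the href of the first collection-level link with the given rel, or None."""
    for link in collection.get("links", []):
        if link.get("rel") == rel:
            return link["href"]
    return None


def iter_pages(fetch, first_href):
    """Yield each page of results, following "next" links until there are none."""
    href = first_href
    while href is not None:
        collection = fetch(href)      # fetch(href) -> the parsed "collection" object
        yield collection
        href = link_href(collection, "next")
```

Because the loop only ever follows "next" links, it stops at the last page the server offers and never requests a page beyond the pagination limit.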
max_results¶
The maximum number of results to return. This has a default of 25 and an absolute maximum of 1000. Smaller pages give faster results.
Payload¶
(See typical response below)
Error status codes¶
403 - Forbidden - most likely incorrect access token
Example request¶
GET https://mypixsearchserver/api/files/?where={"_all":"jpg"}&sort=core.pathname&projection={"core.size":0}&page=1&max_results=20 HTTP/1.1
Accept: application/vnd.collection+json
WWW-Authenticate: <bearer based token string>
Typical query response¶
The response (in C+J format) will contain 4 major sections
(For a full example, see Example Responses)
Items¶
This is a list of matches, typically the first 25. Each nested item will contain a “data” key, which in turn is a list of triples for the properties name, value, and prompt.
{
"items": [
{
"href": "/files/3735374022151170231",
"data": [
{
"prompt": "File basename (string)",
"name": "core.filename",
"value": "cats-22.jpg"
},
{
"prompt": "File mime-type (string)",
"name": "core.mimetype",
"value": "image/jpeg"
}
]
}
]
}
Properties provided are:
name - name of the field property
value - field value
prompt - human readable description of the field
The href attribute gives a direct link to this item, which will return this item, and only this item, with all properties returned. Thus, detail views can be built when used with projections.
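For most client code it is convenient to flatten the list of triples into a simple mapping. A minimal Python sketch, using the item shown above as sample data (the helper name item_properties is an assumption):

```python
def item_properties(item):
    """Flatten an item's data triples into a {name: value} dict."""
    return {entry["name"]: entry["value"] for entry in item.get("data", [])}


item = {
    "href": "/files/3735374022151170231",
    "data": [
        {"prompt": "File basename (string)", "name": "core.filename", "value": "cats-22.jpg"},
        {"prompt": "File mime-type (string)", "name": "core.mimetype", "value": "image/jpeg"},
    ],
}
props = item_properties(item)
print(props["core.filename"])   # cats-22.jpg
```

The prompts are dropped here; a user interface that needs human readable labels would keep them alongside the values.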
Item Links¶
This contains a list of links to other resources connected with the item, typically the proxies.
The type of proxy is indicated by the “rel” attribute. In particular, the special _thumbnail rel can be assumed to be a small image.
{
"items": [
{
"href": "/files/3735374022151170231",
"data": [
],
"links": [
{
"prompt": "Thumbnail image",
"name": "thumb.png",
"render": "image",
"accept": "image/png",
"href": "/media/090/453/627/9045362721810216358.png",
"rel": "_thumbnail"
},
{
"prompt": "Preview image",
"name": "preview.png",
"render": "image",
"accept": "image/png",
"href": "/media/051/069/261/5106926172767688680.png",
"rel": "image.preview"
}
]
}
]
}
By careful utilisation of the render, accept and href attributes, user interfaces with the correct controls can be produced.
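Picking out a particular proxy from an item might look like the following Python sketch. The helper names are assumptions for illustration; only the rel, href and links keys come from the responses shown above.

```python
def links_by_rel(item):
    """Group an item's links by their rel attribute, keeping the original order."""
    grouped = {}
    for link in item.get("links", []):
        grouped.setdefault(link["rel"], []).append(link)
    return grouped


def thumbnail_href(item):
    """Return the href of the item's _thumbnail link, or None if it has none."""
    thumbs = links_by_rel(item).get("_thumbnail", [])
    return thumbs[0]["href"] if thumbs else None
```

A client could then use the render and accept attributes of the chosen link to decide which control to display and which media type to request.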
Collection links¶
This provides a list of links to other related collections.
Firstly, it contains links to the previous, next, last etc. pages of results, using IANA registered relation types. This enables the crafting of paged result sets without explicit URL calculations.
{
"links": [
{
"render": "link",
"href": "/files/?where={\"_all\":\"cats\"}&page=3",
"prompt": "Last",
"name": "last",
"rel": "last"
},
{
"render": "link",
"href": "/files/?where={\"_all\":\"cats\"}&page=2",
"prompt": "Next",
"name": "next",
"rel": "next"
}
]
}
But more interestingly, it contains the link to kick off the guided dynamic search.
{
"links": [
{
"render": "link",
"href": "/files/?where={\"_all\":\"cats\"}&projection={\"filters\": \".\"}",
"prompt": "Search filters",
"name": "_filters",
"rel": "links"
}
]
}
By following the href in that “link”, you will retrieve a much wider variety of useful collections related to your search, which enables efficient drill down through large result sets.
Collection properties¶
These provide a list of data items indicating the total number of hits.
{
"properties": [
{
"prompt": "Number of matching documents",
"name": "hits",
"value": 73
}
]
}
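The hit count can be combined with the page size to estimate how many pages a result set spans. The Python sketch below assumes the default of 25 results per page, and assumes that the 10k pagination limit caps the number of results that can be reached; the helper names are illustrative.

```python
import math

PAGINATION_LIMIT = 10000   # assumption: the Elasticsearch limit described under "page"


def collection_property(collection, name):
    """Return the value of a collection-level property, or None if absent."""
    for prop in collection.get("properties", []):
        if prop["name"] == name:
            return prop["value"]
    return None


def reachable_pages(hits, max_results=25):
    """Number of pages that can be requested for a given hit count."""
    return math.ceil(min(hits, PAGINATION_LIMIT) / max_results)


collection = {"properties": [
    {"prompt": "Number of matching documents", "name": "hits", "value": 73},
]}
hits = collection_property(collection, "hits")
print(reachable_pages(hits))   # 3
```

In practice the "last" link already encodes this number, so a calculation like this is mainly useful for display purposes such as "page 1 of 3".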
Dynamic guided search¶
If one follows a link of the form
https://mypixsearchserver/api/files/?projection={"filters":"."}&where=...
either from the initial query response or by crafting directly, then a more verbose list of related collections of results is returned. (See Example Responses). The actual results are not returned - this is a secondary operation.
The links are of the form
{
"links": [
{
"href": "\/files/?where={\"_all\":\"cats\"}&projection={\"filters\": \"core.size\"}",
"prompt": "Core - Size (73)",
"name": "core.size",
"rel": "links"
},
{
"href": "\/files/?where={\"_all\": \"cats\", \"core.size\": {\"lt\": 44000, \"gte\": 27400}}",
"prompt": "Core - Size - 27400 - 44000 (14)",
"name": "core.size.27400-44000",
"rel": "collection"
}
]
}
Here, the rel attribute indicates whether following the link will return a new collection of items (bottom example) or a new collection of links (top example).
The collection of items will be a subset of your initial search, restricted by a certain property - in this case, restricted to only those files that have a size between 27400 and 44000 bytes.
The interesting thing about the results is that they are ordered in such a way as to present those which sub-divide the collection most effectively first. For example, if there were approximately equal numbers of True & False values for a given property, this would be a good candidate. If almost all the results were True, it would not be, and would be presented further down the list of links.
With ranged values (e.g. integers or dates), the process is similar in concept, however, in that instance, the system automatically computes the most effective bucket sizes, with the aim of dividing the total into around 5 roughly equally sized sets. So in the above example, of the 73 values which match the initial search, 14 are in the range 27400-44000. This ‘Auto-bucket’, and ‘Most useful’ approach leads to users being able to rapidly reduce a large result set down to a more specific set of results.
The first 5 or so properties return their sub-divisions inline; the remaining (“less useful”) properties require an additional fetch step. These are indicated by links with a rel of links and no matching collection entries.
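Separating the two kinds of link in a guided search response might be sketched as follows in Python. The function name is an assumption; the rel values are those shown in the example above.

```python
def split_filter_links(collection):
    """Separate guided-search links into inline sub-collections and further link lists."""
    sub_collections, more_links = [], []
    for link in collection.get("links", []):
        if link.get("rel") == "collection":
            sub_collections.append(link)      # following this returns items
        elif link.get("rel") == "links":
            more_links.append(link)           # following this returns more links
    return sub_collections, more_links
```

Both lists keep the order in which the server returned the links, so the most effective sub-divisions stay at the front.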
Contrived Example¶
One could envisage a system with a wizard, asking a series of questions based on the “most useful” discriminator, in order to get to one page of results in the fewest number of steps.
In pseudo code, this might look like
href = '/files/?where={"_all":"jpg"}'
while True:
    results = get_href(href)                                  # Get the items matching the search
    if results.properties[name='hits'].value <= size_of_screen:
        print(results.items)                                  # If we have few enough, print and exit
        break
    filters = get_href(results.links[name='_filters'].href)   # Fetch the dynamic guided filters
    print("Which of these best matches your desired item?")
    options = filters.links[rel='collection'][0:5]
    for sub_collection in options:
        print(sub_collection.prompt)                          # Print out the top 5 candidates
    choice = input()                                          # Have the user choose one
    href = options[choice].href                               # Move on to the chosen URL and repeat
Updating metadata¶
Metadata can be updated by performing a PATCH request against a given file.
To update metadata, you will need an auth token, and your user must have the update_search_metadata auth right.
By default, only the special ‘broker’ user (which performs ingest) has full read-write permission.
Additional users can be given the update_search_metadata right via apconfig.
Only a sysadmin will have permission to make this change.
URL¶
https://mypixsearchserver/api/files/<fileid>
Method¶
PATCH
Headers¶
Parameter | Description | Required | Default |
---|---|---|---|
If-Match | etag for the current version of the doc | Yes | NA |
Content-Type | the mimetype of the data being sent | Yes | application/vnd.collection+json |
Payload¶
Collection+JSON template of key-values to update - e.g.
{
"template": {
"data": [
{
"name": "core.creator",
"value": "arcapix"
}
]
}
}
Success Status Code¶
200 (OK)
Response¶
{
"collection": {
"href": "/files/3721279936826738506",
"items": [
{
"href": "/files/3721279936826738506",
"data": [
{
"prompt": "_updated",
"name": "_updated",
"value": "2019-01-16T13:18:33"
},
{
"prompt": "_created",
"name": "_created",
"value": "2018-10-12T09:08:57"
},
{
"prompt": "_status",
"name": "_status",
"value": "OK"
},
{
"prompt": "_id",
"name": "_id",
"value": 3721279936826738506
},
{
"prompt": "_etag",
"name": "_etag",
"value": "bfd9c38ac604b7a86c5b34242c2c940a0f84b9af"
}
],
"links": []
}
],
"version": "1.0",
"links": [
{
"render": "link",
"href": "/files/3721279936826738506",
"prompt": "File",
"name": "self",
"rel": "self"
}
]
}
}
Error Status Code¶
403 (Forbidden) - invalid access token; this might be caused by an incorrect user or password, an expired token, or the user not having the update_search_metadata auth right
412 (Precondition Failed) - ETAG is invalid or outdated
422 (Unprocessable Entity) - the PATCHed metadata failed validation. In this case, the response body should include an explanation of the issue(s) - e.g.
{
"collection": {
"error": {
"title": "Error",
"message": "common : {u'creator': 'must be of string type'}"
}
}
}
428 (Precondition Required) - the If-Match header (ETAG) wasn’t provided
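Putting the headers and payload together, a PATCH request might be prepared as in the Python sketch below. It assumes the current ETag can be read from the _etag entry of an item's data, as it appears in the response above; the helper names are illustrative.

```python
import json

CJ_MIME = "application/vnd.collection+json"


def etag_of(item):
    """Return the _etag value from an item's data list, or None if it is missing."""
    for entry in item.get("data", []):
        if entry["name"] == "_etag":
            return entry["value"]
    return None


def build_patch(etag, updates):
    """Return (headers, body) for a PATCH that sets each field in updates."""
    headers = {"If-Match": etag, "Content-Type": CJ_MIME, "Accept": CJ_MIME}
    template = {"template": {"data": [
        {"name": name, "value": value} for name, value in updates.items()
    ]}}
    return headers, json.dumps(template)
```

With the requests library, the result could be sent as requests.patch(url, headers=headers, data=body, auth=auth), where url is the file's URL and auth carries the access token. A 412 response would mean the document changed in the meantime, in which case the item and its ETag need to be fetched again before retrying.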
What can be updated¶
You can update the value for any field defined in the PixStor Search schema. A field is in the PxS schema if it is defined in one of the installed PxS plugins.
Any metadata field not defined in the schema will be rejected with status 422. Similarly, updated values are validated against the schema - e.g. you can’t update a string field with an integer. Any value that fails validation will be rejected with status 422.
Note that if a given metadata field is populated via a plugin, any user changes made to the value of that field are likely to be replaced during the next ingest.
One possible way around this is to create a ‘schema plugin’ - a plugin which defines a schema, but doesn’t extract any metadata.
class TagSchemaPlugin(Plugin):

    def namespace(self):
        return 'user'

    def handles(self, mimetype, ext):
        # doesn't handle any files
        return False

    def schema(self):
        return [{
            "name": "tags",
            "prompt": "file tags",
            "value": {
                "datatype": "[String]"  # list of strings
            }
        }]

    def process(self, id_, path, fileinfo=None):
        return PluginStatus.SUCCESS
This plugin will add the user.tags field to PxS’s schema, making the field ‘valid’.
But the plugin doesn’t generate any metadata itself, so it won’t override any user-provided values on ingest.
Deleting Documents¶
Documents can be removed from the database by performing a DELETE request against a given file.
It’s not possible to remove individual metadata fields. Only whole documents can be removed.
To delete a document, you will need an auth token, and your user must have the delete_search_metadata auth right.
URL¶
https://mypixsearchserver/api/files/<fileid>
Method¶
DELETE
Headers¶
Parameter | Description | Required | Default |
---|---|---|---|
If-Match | etag for the current version of the doc | Yes | NA |
Success Status Code¶
204 (No Content)
Response¶
<empty>
Error Status Code¶
403 (Forbidden) - invalid access token; this might be caused by an incorrect user or password, an expired token, or the user not having the delete_search_metadata auth right
412 (Precondition Failed) - ETAG is invalid or outdated
428 (Precondition Required) - the If-Match header (ETAG) wasn’t provided
Delete by Query¶
It is also possible to bulk delete multiple documents in one go.
This is done by performing a DELETE request against the /files endpoint with a query, e.g.
DELETE https://mypixsearchserver/api/files/?where={"core.extension":".DS_Store"}
Warning
In general, you should avoid using delete by query, and great care should be taken if you do use it.
There are no special safety checks, so it is very easy to unintentionally delete large numbers of documents.
URL¶
https://mypixsearchserver/api/files/?where=<query>
Method¶
DELETE
Headers¶
Unlike a single file delete, delete by query doesn’t require an If-Match header, since each file matched by the query has its own unique ETag.
Consequently, you don’t have the safety that ETags provide for per-file delete.
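Since the server applies no safety checks, a client may want to add its own. One possible guard, sketched in Python below, counts the matches first and refuses to proceed if there are more than expected. Both count_hits and delete are hypothetical caller-supplied functions (for example, a GET that reads the hits property, and the DELETE request itself); nothing here is part of the API.

```python
def confirm_delete_by_query(count_hits, delete, where, max_expected):
    """Delete by query only if the number of matches is within max_expected.

    count_hits(where) -> int and delete(where) are supplied by the caller.
    """
    hits = count_hits(where)
    if hits > max_expected:
        raise RuntimeError(
            "refusing to delete {} documents (expected at most {})".format(hits, max_expected))
    delete(where)
    return hits
```

This only narrows the risk: documents ingested between the count and the delete would still be removed if they match the query.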
Success Status Code¶
204 (No Content)
Response¶
<empty>
Error Status Code¶
403 (Forbidden) - invalid access token; this might be caused by an incorrect user or password, an expired token, or the user not having the delete_search_metadata auth right