API does not appear to be applying LIDVID-sort by default #250
Comments
Have confirmed this behaviour with a script. Testing starts [0,3] with limits [1,3], the following hashed LIDVIDs are obtained:
[output table not preserved]
More test data from start 0, limits [1,11]:
[output table not preserved]
Completed test script:
import hashlib
from typing import List, Union

import requests


def get_ids(start: int, limit: int, hash_ids=True) -> List[str]:
    collection_urn = 'urn:nasa:pds:mars2020_supercam:data_calibrated_audio'
    url = f'http://localhost:8080/collections/{collection_urn}::4.0/products?start={start}&limit={limit}'
    # print(f'Requesting start={start}, limit={limit}')
    response = requests.get(url)
    try:
        data = response.json()['data']
    except KeyError:
        print(f'Response for start={start}, limit={limit} did not contain a "data" attribute')
        return []
    # Short md5 prefixes are used for ease of visual comparison
    ids = [hashlib.md5(product['id'].encode()).hexdigest()[:6] if hash_ids else product['id'] for product in data]
    return ids


max_start = 0
min_limit = 1
max_limit = 10
result_end_idx = max_start + max_limit

print(f'Hits 0 through {result_end_idx - 1} are as follows:')
# Request the same number of hits for both lists so that zip() pairs every hash with its id
ids = get_ids(0, result_end_idx, hash_ids=False)
hashes = get_ids(0, result_end_idx)
for hashed_id, product_id in zip(hashes, ids):
    print(f'{hashed_id} - {product_id}')

# Each row holds one (start, limit) result, padded with None to a common width
results: List[List[Union[str, None]]] = []
for start in range(0, max_start + 1):
    for limit in range(min_limit, max_limit + 1):
        left_pad = [None] * start
        ids = get_ids(start, limit)
        padded_ids = (left_pad + ids + [None] * result_end_idx)[:result_end_idx]
        results.append(padded_ids)

for result_set in results:
    print(['      ' if val is None else val for val in result_set])

# Every request that covers a given index should return the same id at that index
for idx in range(0, result_end_idx):
    values_at_idx = set()
    for result in results:
        value_at_idx = result[idx]
        if value_at_idx is not None:
            values_at_idx.add(value_at_idx)
    assert len(values_at_idx) == 1
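The consistency check at the end of the script can also be exercised without a running API. The following is a minimal sketch, not part of the original script: `find_inconsistent_positions` is a hypothetical helper, and the ids below are made-up placeholders rather than real LIDVIDs. It takes `(start, ids)` pairs and reports every absolute index at which overlapping slices disagree.

```python
from typing import Dict, List, Sequence, Set, Tuple


def find_inconsistent_positions(
    slices: Sequence[Tuple[int, List[str]]],
) -> Dict[int, Set[str]]:
    """Map each absolute index to the ids seen there, keeping only indexes
    where overlapping slices returned more than one distinct id."""
    seen: Dict[int, Set[str]] = {}
    for start, ids in slices:
        for offset, product_id in enumerate(ids):
            seen.setdefault(start + offset, set()).add(product_id)
    return {idx: ids for idx, ids in seen.items() if len(ids) > 1}


# Placeholder data: each tuple is (start, ids returned for that request)
consistent = [(0, ["a", "b", "c"]), (1, ["b", "c"]), (2, ["c"])]
inconsistent = [(0, ["a", "b", "c"]), (1, ["c", "b"])]

print(find_inconsistent_positions(consistent))
print(find_inconsistent_positions(inconsistent))
```

An empty result means every overlapping slice agreed; any entry points at the index where the ordering differed between requests.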
@alexdunnjpl this is not really a bug but a missing feature from the specification.
@tloubrieu-jpl is it not a bug in the context of registry-api offering pagination, and that pagination (appearing to be) providing incorrect results (missing/duplicated data) because of this issue?
Never mind. Testing with the following script, I wasn't actually able to produce duplicated/missing data, so while the results above are confusing, pagination is not broken:
# Reuses get_ids() from the test script above
page_size = 50
pages_to_fetch = 50
expected_lidvids_count = page_size * pages_to_fetch

results = []
for page_idx in range(0, pages_to_fetch):
    start = page_size * page_idx
    limit = page_size
    ids = get_ids(start, limit, hash_ids=False)
    results.extend(ids)

unique_results = set(results)
print(f'Fetched {len(results)} results, with {len(unique_results)} unique, vs. {expected_lidvids_count} expected')
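Comparing the fetched count with the unique count shows whether duplicates exist, but not which ids are affected, and it can only reveal missing ids if the full set is known independently. A rough sketch of both checks follows; `find_duplicates`, `find_missing`, and `expected_ids` are hypothetical names introduced here, and the ids are placeholders.

```python
from collections import Counter
from typing import Iterable, List, Set


def find_duplicates(ids: Iterable[str]) -> List[str]:
    """Return the ids that occur more than once, sorted for stable output."""
    counts = Counter(ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


def find_missing(ids: Iterable[str], expected_ids: Set[str]) -> List[str]:
    """Return the expected ids that never appeared in the fetched pages."""
    return sorted(expected_ids - set(ids))


# Placeholder data standing in for the concatenated pages
fetched = ["id-a", "id-b", "id-b", "id-d"]
expected = {"id-a", "id-b", "id-c", "id-d"}

print(find_duplicates(fetched))
print(find_missing(fetched, expected))
```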
I will put this ticket in the icebox.
Checked for duplicates
Yes - I've already checked
🐛 Describe the bug
When I used a test script to validate pagination correctness, the results indicated that sorting did not appear to be applied correctly.
If true, this completely breaks the ability to query across multiple pages of data, as some LIDVIDs will be duplicated while others will be missing.
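To illustrate the concern, here is a toy simulation, assuming a backend whose ordering flips between requests. It does not model how registry-api actually orders results; `stable_backend`, `make_unstable_backend`, `fetch_all`, and the ids are all hypothetical.

```python
from typing import Callable, List

PRODUCT_IDS = ["id-a", "id-b", "id-c", "id-d"]  # placeholder identifiers


def stable_backend(start: int, limit: int) -> List[str]:
    """Always slices the same sorted ordering."""
    return sorted(PRODUCT_IDS)[start:start + limit]


def make_unstable_backend() -> Callable[[int, int], List[str]]:
    """Build a fetch function whose ordering reverses on every other request."""
    calls = 0

    def fetch(start: int, limit: int) -> List[str]:
        nonlocal calls
        ordering = sorted(PRODUCT_IDS, reverse=(calls % 2 == 1))
        calls += 1
        return ordering[start:start + limit]

    return fetch


def fetch_all(fetch: Callable[[int, int], List[str]], page_size: int, total: int) -> List[str]:
    """Concatenate consecutive pages of page_size hits."""
    results: List[str] = []
    for start in range(0, total, page_size):
        results.extend(fetch(start, page_size))
    return results


stable = fetch_all(stable_backend, 2, len(PRODUCT_IDS))
unstable = fetch_all(make_unstable_backend(), 2, len(PRODUCT_IDS))
missing = sorted(set(PRODUCT_IDS) - set(unstable))

print(stable)
print(unstable)
print(missing)
```

With a fixed ordering, the pages cover every id exactly once. With the flipping ordering, the second page repeats ids from the first and two ids are never returned.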
🕵️ Expected behavior
Given values of start and limit, I expect them to provide repeatable and consistent slices of data even when an explicit sort query parameter is not provided.
📜 To Reproduce
TBD - need to fix #240 first and then validate that the test script isn't what's borked.
🖥 Environment Info
No response
📚 Version of Software Used
No response
🩺 Test Data / Additional context
No response
🦄 Related requirements
No response
⚙️ Engineering Details
No response