Switch to using new doc_ids IA search parameter; avoid errors! #11303
Conversation
Pull Request Overview
This PR switches to using a new doc_ids parameter for Internet Archive search requests instead of constructing long identifier:(foo OR bar OR ...) queries, which helps avoid hitting maximum query length limits. The change also modernizes the code by using Python's built-in itertools.batched instead of custom batching functions.
- Replaces custom `batch` and `batch_until_len` functions with `itertools.batched`
- Updates IA search to use the `doc_ids` parameter instead of complex query construction
- Removes unnecessary query parameters and adjusts batch sizing
```diff
 import re
 import typing
-from collections.abc import Iterable, Sized
+from collections.abc import Iterable, Sequence
```
Missing import for itertools module which is used on lines 212 and 220.
```diff
 logger.warning(f"Trying to cache invalid OCAIDs: {invalid_ocaids}")
 valid_ocaids = list(set(ocaids) - invalid_ocaids)
-batches = list(batch_until_len(valid_ocaids, 3000))
+batches = list(itertools.batched(valid_ocaids, 250))
```
The batch size has been significantly reduced from 3000 characters to 250 items without explanation. This could lead to many more API requests than necessary, potentially impacting performance. Consider documenting the rationale for this specific batch size or making it configurable.
Instead of a long `identifier:(foo OR bar OR ...)` query, @ximm is working on adding a new `doc_ids` parameter that will let us just specify the direct ids! This will fix an issue we've been having where we hit the maximum query length.

Technical
`itertools.batched`

Testing
Screenshot
Stakeholders