Remove some legacy IA-related solr fields #11586

@hornc

Problem

ia_box_id
ia_loaded_id
ia_collection
ia_count

are all Solr-indexed fields, defined in the schema as:

<field name="ia_box_id" type="string" multiValued="true"/>
<field name="ia_loaded_id" type="string" multiValued="true"/>
<field name="ia_count" type="pint"/>
<field name="ia_collection" type="string" multiValued="true" />

It is not clear why.
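As a starting point for an audit, the declared fields can be listed programmatically. This is only a sketch: it parses the four definitions quoted above as a bare fragment, and the helper name `fields_with_prefix` is made up for illustration. A full schema file would need `ET.parse` on the file itself rather than the wrapper used here.

```python
import xml.etree.ElementTree as ET

# The four field definitions quoted in this issue, as a bare fragment.
SCHEMA_FRAGMENT = """
<field name="ia_box_id" type="string" multiValued="true"/>
<field name="ia_loaded_id" type="string" multiValued="true"/>
<field name="ia_count" type="pint"/>
<field name="ia_collection" type="string" multiValued="true" />
"""


def fields_with_prefix(schema_xml: str, prefix: str) -> dict[str, dict[str, str]]:
    """Return {field name: attributes} for <field> elements whose name starts with prefix."""
    # Wrap the fragment in a root element so it parses as a single document.
    root = ET.fromstring(f"<schema>{schema_xml}</schema>")
    return {
        field.attrib["name"]: dict(field.attrib)
        for field in root.iter("field")
        if field.attrib.get("name", "").startswith(prefix)
    }


ia_fields = fields_with_prefix(SCHEMA_FRAGMENT, "ia_")
for name, attrs in sorted(ia_fields.items()):
    print(name, attrs.get("type"), attrs.get("multiValued", "not set"))
```

The resulting names could then be searched for in the codebase to confirm which ones are still read anywhere.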

I know what ia_box_id is, and it's not appropriate for OL to index due to separation of concerns. It used to be used for something, and that use has been refactored away. Solr is still indexing it.

I don't even know the history of ia_loaded_id and ia_count, but they do not appear to be used.

ia_collection is indexing a ridiculous amount of data for no purpose. The fav_ collections were removed because they were obviously pointless, but the majority of the collections that are still listed are of no use either.

One random example:
https://openlibrary.org/search.json?q=cats+AND+ia:*&mode=everything&fields=title,ia,availability,ia_collection&limit=1

has many -ol entries and a library-of-atlantis.

I know what these are in terms of archive.org, but they have no purpose on OL.
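To put a number on this for a given query, the search response can be tallied. A rough sketch follows, with several assumptions: "-ol entries" is read as values ending in `-ol`, the response is taken to be JSON with a `docs` list, and `ia_collection` is handled both as a list and as a semicolon-joined string because I have not checked which form the API returns. The function names and the sample document are made up for illustration.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://openlibrary.org/search.json"


def build_search_url(query: str, fields: list[str], limit: int = 1) -> str:
    """Build a search.json URL shaped like the example in this issue."""
    params = {"q": query, "mode": "everything", "fields": ",".join(fields), "limit": limit}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def collection_values(doc: dict) -> list[str]:
    """Return a doc's ia_collection values (list or ';'-joined string; both are assumptions)."""
    raw = doc.get("ia_collection", [])
    if isinstance(raw, str):
        raw = raw.split(";")
    return [value for value in raw if value]


def summarise_collections(docs: list[dict]) -> Counter:
    """Count collection values ending in '-ol' versus everything else."""
    counts: Counter = Counter()
    for doc in docs:
        for value in collection_values(doc):
            counts["ol_suffixed" if value.endswith("-ol") else "other"] += 1
    return counts


def fetch_docs(url: str) -> list[dict]:
    """Fetch a search response and return its docs (network call, not run here)."""
    with urlopen(url) as response:
        return json.load(response).get("docs", [])


# Hypothetical document; the '-ol' names are invented placeholders.
sample_docs = [{"ia_collection": ["example-user-ol", "library-of-atlantis", "another-user-ol"]}]
print(summarise_collections(sample_docs))
```

Against the live site this would be `summarise_collections(fetch_docs(build_search_url("cats AND ia:*", ["title", "ia", "ia_collection"])))`.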

They are particularly a problem because they are not even archive.org 'collections' (they are 'simplelists', but that has zero relevance to OL). If there were a clear usecase for why we needed collections, we could adjust, but I don't believe there is one. The original use was to determine lending status for the 'borrow' buttons, which is no longer appropriate. @cdrini mentioned using collections to filter the book explorer, but that just happens to work in some cases as an undocumented feature, and the majority of these collections are actively inappropriate.

I'm concerned about all this because we are having ongoing performance issues with Solr and with connectivity between OL and archive.org. Requesting this data from archive.org on Solr re-index seems completely unnecessary.

It seems like there may be some justification to get and index an item's likely availability, but AFAICT we make requests to the availability API when it's a matter of displaying the actual status anyway. In terms of search filtering, I can't make much sense of what the expectations are, nor what the current system is achieving with the data it stores.

ebook_access:[borrowable TO *] appears to be exactly equivalent to has_fulltext=true but these are acquired and stored differently?
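One way to test that equivalence rather than infer it is a symmetric-difference query: search for records matching exactly one of the two filters and check that nothing comes back. The sketch below only builds the query and URL. It assumes that search.json passes Lucene boolean syntax through in `q` (as the `cats AND ia:*` example suggests) and that `has_fulltext:true` is the field-query form of `has_fulltext=true`; both helper names are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://openlibrary.org/search.json"


def symmetric_difference_query(a: str, b: str) -> str:
    """Query matching records that satisfy exactly one of the filters a and b."""
    return f"(({a}) AND NOT ({b})) OR (({b}) AND NOT ({a}))"


def search_url(query: str, limit: int = 1) -> str:
    """Build a search.json URL for the given query string."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'limit': limit})}"


query = symmetric_difference_query("ebook_access:[borrowable TO *]", "has_fulltext:true")
print(query)
print(search_url(query))
```

If the two filters really are equivalent, the response's `numFound` for this query should be 0. Any non-zero count would identify the records where the two fields disagree.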

NOT public_scan_b:false is used in multiple carousel queries, but I can only see 2 records that have public_scan_b populated: https://openlibrary.org/search.json?q=public_scan_b:true&fields=title,ia,availability,ia_collection,public_scan_b

The data fetched by solr

from typing import Literal, TypedDict

class IALiteMetadata(TypedDict):
    boxid: set[str]
    collection: set[str]
    access_restricted_item: Literal['true', 'false'] | None

also includes access_restricted_item, which might be useful, but I can't see where it is stored or used later. I think this is checked in real time via the availability API?

There is very little documentation around all this, and I'm not really sure how any of the current data models fit expected usecases around item availability. The model is not clear on the differences between an Open Library book record and individual borrowable / readable etc items.

Open Library has book pages, and book pages have a 'call-to-action' button. Most of the magic happens via archive.org's availability API, and I'm not sure where that is documented. Yet we have a lot of extra data indexed in Solr, which is only updated on record edits.

I imagine there are two main use-cases:

  • Discoverability, search / browse, where a patron is trying to locate or browse using 'availability' as part of the criteria, so indexing relevant aspects would be appropriate
  • Utilisation of a particular item -- the patron has identified a record and wants accurate and current information on how to access an item linked with the record.

The first suggests that categorising records by some aspect would be good, but this should be driven by thought-out usecases.

The second is where API calls come in. I'm not sure whether availability API requests are firing for every item listed in a multi-page search query, and I don't know where to look. I hope OL isn't doing that, but I can't be sure.
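If per-item requests do turn out to be happening, one common mitigation is to deduplicate the identifiers on a results page and look them up in batches. The sketch below is generic and makes no claim about how OL or the availability API currently works: `lookup` is a hypothetical callable standing in for whatever performs the real request, and the batch size is arbitrary.

```python
from collections.abc import Callable, Iterable


def batched_availability(
    identifiers: Iterable[str],
    lookup: Callable[[list[str]], dict[str, str]],
    batch_size: int = 50,
) -> dict[str, str]:
    """Look up each distinct identifier once, batch_size identifiers per call."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(identifiers))
    results: dict[str, str] = {}
    for start in range(0, len(unique), batch_size):
        results.update(lookup(unique[start:start + batch_size]))
    return results


# Stand-in lookup that records how often it is called; statuses are placeholders.
calls: list[list[str]] = []


def fake_lookup(batch: list[str]) -> dict[str, str]:
    calls.append(batch)
    return {identifier: "unknown" for identifier in batch}


statuses = batched_availability(["a", "b", "a", "c", "d"], fake_lookup, batch_size=2)
print(len(calls), statuses)
```

With five identifiers (one repeated) and a batch size of 2, the stand-in is called twice instead of five times.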

From what I see, we have multiple redundant layers of historical attempts to satisfy unspecified and conflated usecases using Solr, and in most cases we fall back to sending availability API requests because we know the Solr values can't be trusted.

That's almost the worst of both worlds: redundant indexing that is not useful, and expensive real-time API requests that may or may not be useful.

The current code is poorly documented, isn't traceable back to supporting core usecases, and has clearly out-of-place and dated tech debt like ia_box_id, which makes the rest of the code suspect and much harder to follow.

Reproducing the bug

  1. Go to ...
  2. Do ...
  • Expected behavior:
  • Actual behavior:

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.

Metadata

Assignees

No one assigned

    Labels

    Lead: @cdrini - Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed]
    Module: Solr - Issues related to the configuration or use of the Solr subsystem. [managed]
    Needs: Response - Issues which require feedback from lead
    Priority: 4 - An issue, but should be worked on when no other pressing work can be done. [managed]
    Type: Refactor/Clean-up - Issues related to reorganization/clean-up of data or code (e.g. for maintainability). [managed]

