X Tutup
Skip to content

Add chapters to solr#11369

Merged
mekarpeles merged 1 commit intointernetarchive:masterfrom
cdrini:feature/chapters-in-solr
Dec 22, 2025
Merged

Add chapters to solr#11369
mekarpeles merged 1 commit intointernetarchive:masterfrom
cdrini:feature/chapters-in-solr

Conversation

@cdrini
Copy link
Collaborator

@cdrini cdrini commented Oct 20, 2025

Closes #8756. Experiment with adding chapters to solr. Continuation of #9898 . Depends on #11366 .

Technical

The risk this carries is that it stores the table of contents both on the edition and work to allow for appropriate matching.

The work-level TOC is needed to allow it to take part in boosting work search results. The edition-level TOC is I believe needed to have the correct edition match as well. If there are performance issues we might be able to remove the edition-level TOC.

The new field has been added to our prod solr, so this should be able to be deployed without any intervention.

Testing

Locally import some books with tocs, and search for them. Requires resetting the local solr.

Screenshot

Stakeholders

@cdrini cdrini added the Needs: Special Deploy This PR will need a non-standard deploy to production label Oct 20, 2025
@cdrini cdrini force-pushed the feature/chapters-in-solr branch from daf7c50 to 603f427 Compare October 25, 2025 17:49
@cdrini cdrini force-pushed the feature/chapters-in-solr branch from 603f427 to 2ce9b0e Compare December 5, 2025 18:02
@cdrini cdrini removed the Needs: Special Deploy This PR will need a non-standard deploy to production label Dec 11, 2025
@cdrini cdrini marked this pull request as ready for review December 11, 2025 15:00
Copilot AI review requested due to automatic review settings December 11, 2025 15:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for indexing and searching book chapters (table of contents) in Solr, enabling users to search for books by chapter titles and view matching chapters in search results with highlighted matches.

Key Changes

  • Added a new chapter field to the Solr schema that stores table of contents entries in a pipe-delimited format
  • Implemented chapter data extraction from edition metadata and aggregation at the work level for search result boosting
  • Enhanced search results UI to display matching chapters with Archive.org page links when available

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
conf/solr/conf/managed-schema.xml Added chapter field to Solr schema as multiValued text field
openlibrary/solr/solr_types.py Added chapter type definition to SolrDocument TypedDict
openlibrary/solr/updater/edition.py Implemented chapter property to extract and format TOC entries from editions
openlibrary/solr/updater/work.py Added chapter property to aggregate chapters from all editions of a work
openlibrary/plugins/worksearch/schemes/works.py Integrated chapter field into search query (qf), highlighting, and both work/edition queries
openlibrary/macros/SearchResultsWork.html Added chapter display UI with highlighting, Archive.org links, and read-more functionality
openlibrary/macros/ReadMore.html Updated to accept customizable button text via parameters
static/css/components/search-result-item.less Added styling for chapter display section (.srw__chapters)
static/css/page-user.less Imported read-more.less stylesheet
openlibrary/i18n/messages.pot Added "Show more" and "Show less" translation strings

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

'title': self.title,
'subtitle': self.subtitle,
'alternative_title': list(self.alternative_title),
'chapter': self.chapter,
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential index size and query performance impact from storing chapters at both edition and work level. The chapter data is included in both the edition-level Solr document (line 334) and aggregated at the work level (via work.py line 347). This duplication could significantly increase the Solr index size for books with extensive tables of contents, and may impact query performance. Consider monitoring index size and query latency after deployment, and be prepared to remove edition-level chapter storage if performance issues arise as mentioned in the PR description.

Suggested change
'chapter': self.chapter,

Copilot uses AI. Check for mistakes.
Comment on lines +137 to +156
@property
def chapter(self) -> list[str]:
result = []
olid = self._edition.get('key', '').split('/', 2)[-1]
for chapter in self._edition.get('table_of_contents', []):
# Check if plain string first
if isinstance(chapter, str):
result.append(f'{olid} | {chapter}')
continue

title = chapter.get("title", "")
if chapter.get("subtitle"):
title += f": {chapter['subtitle']}"
if chapter.get("authors"):
title += f" ({', '.join(a['name'] for a in chapter['authors'])})"

result.append(
f'{olid} | {chapter.get("label", "")} | {title} | {chapter.get("pagenum", "")}'
)
return result
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing documentation for the chapter field format and structure. The chapter property constructs a pipe-delimited string with a specific format, but there's no docstring or comment explaining the structure. This makes it difficult for future maintainers to understand the format. Consider adding a docstring that documents the format, such as: 'Returns chapters in the format: "{olid} | {chapter_title}" for plain string chapters, or "{olid} | {label} | {title} | {pagenum}" for structured chapters.'

Copilot uses AI. Check for mistakes.
Comment on lines +141 to +142
$if len(chapter_parts) < 3:
$ chapter_parts = ['', chapter, '']
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect logic when handling chapters with fewer than 3 parts. When len(chapter_parts) is less than 3, you assign chapter_parts to ['', chapter, ''], but 'chapter' here refers to the original string which still contains the edition OLID prefix (e.g., 'OL123M | Chapter Title'). This should use 'rest' instead of 'chapter', or better yet, pad the chapter_parts list appropriately. This will cause the chapter display to show the entire 'ed_olid | rest' string as the title.

Suggested change
$if len(chapter_parts) < 3:
$ chapter_parts = ['', chapter, '']
$ chapter_parts += [''] * (3 - len(chapter_parts))

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@mekarpeles mekarpeles self-assigned this Dec 15, 2025
@mekarpeles mekarpeles merged commit 11caa2a into internetarchive:master Dec 22, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Open Library Tables of Contents → Solr

3 participants

X Tutup