Add chapters to solr#11369
Conversation
daf7c50 to
603f427
Compare
603f427 to
2ce9b0e
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds support for indexing and searching book chapters (table of contents) in Solr, enabling users to search for books by chapter titles and view matching chapters in search results with highlighted matches.
Key Changes
- Added a new
chapterfield to the Solr schema that stores table of contents entries in a pipe-delimited format - Implemented chapter data extraction from edition metadata and aggregation at the work level for search result boosting
- Enhanced search results UI to display matching chapters with Archive.org page links when available
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| conf/solr/conf/managed-schema.xml | Added chapter field to Solr schema as multiValued text field |
| openlibrary/solr/solr_types.py | Added chapter type definition to SolrDocument TypedDict |
| openlibrary/solr/updater/edition.py | Implemented chapter property to extract and format TOC entries from editions |
| openlibrary/solr/updater/work.py | Added chapter property to aggregate chapters from all editions of a work |
| openlibrary/plugins/worksearch/schemes/works.py | Integrated chapter field into search query (qf), highlighting, and both work/edition queries |
| openlibrary/macros/SearchResultsWork.html | Added chapter display UI with highlighting, Archive.org links, and read-more functionality |
| openlibrary/macros/ReadMore.html | Updated to accept customizable button text via parameters |
| static/css/components/search-result-item.less | Added styling for chapter display section (.srw__chapters) |
| static/css/page-user.less | Imported read-more.less stylesheet |
| openlibrary/i18n/messages.pot | Added "Show more" and "Show less" translation strings |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| 'title': self.title, | ||
| 'subtitle': self.subtitle, | ||
| 'alternative_title': list(self.alternative_title), | ||
| 'chapter': self.chapter, |
There was a problem hiding this comment.
Potential index size and query performance impact from storing chapters at both edition and work level. The chapter data is included in both the edition-level Solr document (line 334) and aggregated at the work level (via work.py line 347). This duplication could significantly increase the Solr index size for books with extensive tables of contents, and may impact query performance. Consider monitoring index size and query latency after deployment, and be prepared to remove edition-level chapter storage if performance issues arise as mentioned in the PR description.
| 'chapter': self.chapter, |
| @property | ||
| def chapter(self) -> list[str]: | ||
| result = [] | ||
| olid = self._edition.get('key', '').split('/', 2)[-1] | ||
| for chapter in self._edition.get('table_of_contents', []): | ||
| # Check if plain string first | ||
| if isinstance(chapter, str): | ||
| result.append(f'{olid} | {chapter}') | ||
| continue | ||
|
|
||
| title = chapter.get("title", "") | ||
| if chapter.get("subtitle"): | ||
| title += f": {chapter['subtitle']}" | ||
| if chapter.get("authors"): | ||
| title += f" ({', '.join(a['name'] for a in chapter['authors'])})" | ||
|
|
||
| result.append( | ||
| f'{olid} | {chapter.get("label", "")} | {title} | {chapter.get("pagenum", "")}' | ||
| ) | ||
| return result |
There was a problem hiding this comment.
Missing documentation for the chapter field format and structure. The chapter property constructs a pipe-delimited string with a specific format, but there's no docstring or comment explaining the structure. This makes it difficult for future maintainers to understand the format. Consider adding a docstring that documents the format, such as: 'Returns chapters in the format: "{olid} | {chapter_title}" for plain string chapters, or "{olid} | {label} | {title} | {pagenum}" for structured chapters.'
| $if len(chapter_parts) < 3: | ||
| $ chapter_parts = ['', chapter, ''] |
There was a problem hiding this comment.
Incorrect logic when handling chapters with fewer than 3 parts. When len(chapter_parts) is less than 3, you assign chapter_parts to ['', chapter, ''], but 'chapter' here refers to the original string which still contains the edition OLID prefix (e.g., 'OL123M | Chapter Title'). This should use 'rest' instead of 'chapter', or better yet, pad the chapter_parts list appropriately. This will cause the chapter display to show the entire 'ed_olid | rest' string as the title.
| $if len(chapter_parts) < 3: | |
| $ chapter_parts = ['', chapter, ''] | |
| $ chapter_parts += [''] * (3 - len(chapter_parts)) |
Closes #8756. Experiment with adding chapters to solr. Continuation of #9898 . Depends on #11366 .
Technical
The risk this carries is that it stores the table of contents both on the edition and work to allow for appropriate matching.
The work-level TOC is needed to allow it to take part in boosting work search results. The edition-level TOC is I believe needed to have the correct edition match as well. If there are performance issues we might be able to remove the edition-level TOC.
The new field has been added to our prod solr, so this should be able to be deployed without any intervention.
Testing
Locally import some books with tocs, and search for them. Requires resetting the local solr.
Screenshot
Stakeholders