Selecting Text with Anchorpoint
Anchorpoint is a tool for labeling referenced passages within text documents, in a format that allows the “anchors” to the referenced passages to be stored and transmitted separately from the documents themselves. Anchorpoint has two basic ways of selecting text: as text positions, or as text quotes. Here’s a demonstration of creating a text string in Python and then using both kinds of text selectors.
>>> from anchorpoint import TextPositionSelector, TextQuoteSelector
>>> legal_text = (
... "Copyright protection subsists, in accordance with this title, "
... "in original works of authorship fixed in any tangible medium of expression, "
... "now known or later developed, from which they can be perceived, reproduced, "
... "or otherwise communicated, either directly or with the aid of a machine or device. "
... "Works of authorship include the following categories: "
... "literary works; musical works, including any accompanying words; "
... "dramatic works, including any accompanying music; "
... "pantomimes and choreographic works; "
... "pictorial, graphic, and sculptural works; "
... "motion pictures and other audiovisual works; "
... "sound recordings; and architectural works.")
>>> positions = TextPositionSelector(start=65, end=93)
>>> positions.select_text(legal_text)
'original works of authorship'
>>> quote = TextQuoteSelector(exact="in accordance with this title")
>>> quote.select_text(legal_text)
'in accordance with this title'
A TextPositionSelector works by identifying the positions of the start and end characters within the text string object, while a TextQuoteSelector describes the part of the text that is being selected.
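As a rough illustration of the difference, the plain-Python sketch below does not use Anchorpoint at all. It assumes half-open indexing (the start character is included, the end character is not) and uses ordinary slicing and `str.find` as stand-ins for the two selector types.

```python
sample = (
    "Copyright protection subsists, in accordance with this title, "
    "in original works of authorship"
)

# A position selector behaves like a half-open slice: the character at
# `start` is included and the character at `end` is not.
start, end = 65, 93
by_position = sample[start:end]

# A quote selector carries the passage itself, so its location has to be
# looked up in whatever text it is applied to.
exact = "in accordance with this title"
quote_start = sample.find(exact)
by_quote = sample[quote_start:quote_start + len(exact)]

print(by_position)
print(quote_start, by_quote)
```

One consequence of this difference is that a position only makes sense for one exact version of a text, while a quote can still be located after unrelated parts of the text change.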
Sometimes a selected passage is too long to include in full in the TextQuoteSelector. In that case, you can identify the selection by specifying its prefix and suffix: that is, the text immediately before and immediately after the text you want to select.
>>> quote = TextQuoteSelector(prefix="otherwise communicated, ", suffix=" Works of authorship")
>>> quote.select_text(legal_text)
'either directly or with the aid of a machine or device.'
If you specify just a suffix, then the start of your text selection is the beginning of the text string. If you specify just a prefix, then your text selection continues to the end of the text string.
>>> quote_from_start = TextQuoteSelector(suffix="in accordance with this title")
>>> quote_from_start.select_text(legal_text)
'Copyright protection subsists,'
>>> quote_from_end = TextQuoteSelector(prefix="sound recordings; and")
>>> quote_from_end.select_text(legal_text)
'architectural works.'
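The prefix and suffix rules above can be sketched in plain Python. The `select_between` function below is a hypothetical helper, not part of Anchorpoint, and the whitespace stripping is an assumption inferred from the outputs shown in these examples.

```python
def select_between(text, prefix="", suffix=""):
    """Hypothetical helper: return the text between `prefix` and `suffix`.

    A missing prefix means "start at the beginning of the text";
    a missing suffix means "continue to the end of the text".
    Raises ValueError if a given prefix or suffix is not found.
    """
    start = text.index(prefix) + len(prefix) if prefix else 0
    end = text.index(suffix, start) if suffix else len(text)
    # Assumption: surrounding whitespace is trimmed from the selection.
    return text[start:end].strip()


clause = (
    "or otherwise communicated, either directly or with the aid of "
    "a machine or device. Works of authorship include"
)
middle = select_between(
    clause, prefix="otherwise communicated, ", suffix=" Works of authorship"
)
from_start = select_between(clause, suffix=", either directly")
to_end = select_between(clause, prefix="device.")
```

This sketch always uses the first occurrence of the prefix, so it would not be enough for text where the prefix itself is repeated.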
If you want to use a TextQuoteSelector to select a particular instance of a phrase that appears more than once in the text, then you can add a prefix or suffix in addition to the exact phrase to eliminate the ambiguity. For example, this selector applies to the second instance of the word “authorship” in the text, not the first instance.
>>> authorship_selector = TextQuoteSelector(exact="authorship", suffix="include")
>>> authorship_selector.select_text(legal_text)
'authorship'
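One way to picture how a suffix removes the ambiguity is a regular expression with a lookahead. This is only a sketch under that assumption, using the standard `re` module; `find_exact` is a hypothetical helper and is not claimed to be how Anchorpoint resolves quotes internally.

```python
import re

passage = (
    "original works of authorship fixed in any tangible medium. "
    "Works of authorship include literary works."
)


def find_exact(text, exact, suffix=""):
    """Hypothetical helper: (start, end) of `exact` where `suffix` follows it."""
    pattern = re.escape(exact)
    if suffix:
        # The lookahead requires the suffix to follow, without selecting it.
        pattern += r"(?=\s*" + re.escape(suffix) + ")"
    match = re.search(pattern, text)
    return (match.start(), match.end()) if match else None


first = find_exact(passage, "authorship")
second = find_exact(passage, "authorship", suffix="include")
```

Without the suffix the search stops at the first occurrence; with it, only the occurrence followed by "include" can match.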
Converting Between Selector Types
You can use the as_position() and as_quote() methods to convert between the two types of selector.
>>> authorship_selector.as_position(legal_text)
TextPositionSelector(start=306, end=316)
>>> positions.as_quote(legal_text)
TextQuoteSelector(exact='original works of authorship', prefix='', suffix='')
Combining and Grouping Selectors
Overlapping position selectors can be added together to make a single selector that covers both spans of text.
>>> left = TextPositionSelector(start=5, end=22)
>>> right = TextPositionSelector(start=12, end=27)
>>> left + right
TextPositionSelector(start=5, end=27)
If two position selectors don’t overlap, then adding them returns a different class called a TextPositionSet.
>>> from anchorpoint import TextPositionSet
>>> left = TextPositionSelector(start=65, end=79)
>>> right = TextPositionSelector(start=100, end=136)
>>> selector_set = left + right
>>> selector_set
TextPositionSet(positions=[TextPositionSelector(start=65, end=79), TextPositionSelector(start=100, end=136)], quotes=[])
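The two outcomes of addition can be sketched with plain `(start, end)` tuples. The `combine` function is a hypothetical stand-in for the `+` operator, and treating ranges that merely touch as overlapping is an assumption of this sketch.

```python
def combine(a, b):
    """Hypothetical helper: merge two (start, end) ranges if they overlap.

    Returns one merged tuple when the ranges overlap (or touch, by
    assumption), and otherwise a sorted list of both ranges.
    """
    first, second = sorted([a, b])
    if second[0] <= first[1]:
        return (first[0], max(first[1], second[1]))
    return [first, second]


merged = combine((5, 22), (12, 27))
separate = combine((65, 79), (100, 136))
```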
The TextPositionSet can be used to select nonconsecutive passages of text.
>>> selector_set.select_text(legal_text)
'…original works…in any tangible medium of expression…'
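The ellipses in that output mark places where text was skipped. A minimal sketch of that joining rule, assuming the ranges do not overlap, might look like this; `select_passages` is a hypothetical helper, not Anchorpoint's method.

```python
def select_passages(text, ranges, ellipsis="…"):
    """Hypothetical helper: join selected ranges, marking skipped text."""
    pieces = []
    cursor = 0  # end of the text consumed so far
    for start, end in sorted(ranges):
        if start > cursor:
            pieces.append(ellipsis)  # text was skipped before this range
        pieces.append(text[start:end])
        cursor = end
    if cursor < len(text):
        pieces.append(ellipsis)  # text remains after the last range
    return "".join(pieces)


sample = (
    "Copyright protection subsists, in accordance with this title, "
    "in original works of authorship fixed in any tangible medium of expression, "
)
joined = select_passages(sample, [(65, 79), (100, 136)])
```

Under this rule a selection that begins at position 0 gets no leading ellipsis, and one that reaches the end of the text gets no trailing ellipsis.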
If needed, you can use a TextPositionSet to select text with a combination of both positions and quotes.
>>> text = "red orange yellow green blue indigo violet"
>>> position = TextPositionSelector(start=4, end=17)
>>> quote = TextQuoteSelector(exact="blue indigo")
>>> group = TextPositionSet(positions=[position], quotes=[quote])
>>> group.select_text(text)
'…orange yellow…blue indigo…'
You can also add or subtract an integer to move the text selection left or right, but only the position selectors will be moved, not the quote selectors.
>>> earlier_selectors = group - 7
>>> earlier_selectors.select_text(text)
'red orange…blue indigo…'
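The shifted result above suggests that positions are clamped at the start of the text rather than going negative. The sketch below makes that assumption explicit using plain tuples; `shift` is a hypothetical helper.

```python
def shift(ranges, offset):
    """Hypothetical helper: move each (start, end) range by `offset`.

    Assumption: positions that would fall before the start of the text
    are clamped to 0.
    """
    return [
        (max(0, start + offset), max(0, end + offset))
        for start, end in ranges
    ]


colors = "red orange yellow green blue indigo violet"
moved = shift([(4, 17)], -7)
moved_text = colors[moved[0][0]:moved[0][1]]
```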
Union and intersection operators also work. For example, the & operator returns the span that two selectors share.
>>> left = TextPositionSelector(start=2, end=10)
>>> right = TextPositionSelector(start=5, end=20)
>>> left & right
TextPositionSelector(start=5, end=10)
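Intersection of two ranges reduces to taking the later start and the earlier end. The sketch below shows that arithmetic with plain tuples; `intersect` is a hypothetical helper, and returning `None` for ranges that do not overlap is a choice made for this sketch only.

```python
def intersect(a, b):
    """Hypothetical helper: the (start, end) span shared by two ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    # An empty or inverted span means the ranges share no text.
    return (start, end) if start < end else None


shared = intersect((2, 10), (5, 20))
disjoint = intersect((0, 3), (5, 8))
```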
Comparing Selectors and Sets
The greater than and less than operators can be used to check whether one selector or set covers the entire range of another. This is useful for checking whether one selector contains only text that is already within another selector.
>>> smaller = TextPositionSelector(start=4, end=8)
>>> overlapping = TextPositionSelector(start=6, end=50)
>>> overlapping > smaller
False
>>> superset = TextPositionSelector(start=0, end=10)
>>> superset > smaller
True
TextPositionSets also have a __gt__() method that works in the same way.
>>> selector_set > TextPositionSelector(start=100, end=110)
True
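The coverage test can be sketched as a pair of boundary comparisons. The helpers below are hypothetical and use plain tuples; they treat an identical range as covered, which may not match how Anchorpoint distinguishes `>` from `>=`, and `set_covers` assumes the ranges in the set have already been merged.

```python
def covers(outer, inner):
    """Hypothetical helper: True if `outer` spans all of `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def set_covers(ranges, inner):
    """Hypothetical helper: True if any one range in the set covers `inner`."""
    return any(covers(candidate, inner) for candidate in ranges)


partial = covers((6, 50), (4, 8))   # overlaps, but starts too late
full = covers((0, 10), (4, 8))      # spans the whole inner range
in_set = set_covers([(65, 79), (100, 136)], (100, 110))
```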
Serializing Selectors
Anchorpoint uses Pydantic to serialize selectors either to Python dictionaries or to JSON strings suitable for sending over the internet with APIs.
>>> authorship_selector.json()
'{"exact": "authorship", "prefix": "", "suffix": "include"}'
>>> selector_set.dict()
{'positions': [{'start': 65, 'end': 79}, {'start': 100, 'end': 136}], 'quotes': []}
Pydantic’s data loading methods mean that you can also create the data for an Anchorpoint selector using nested dictionaries, and then load it with the class’s constructor method.
>>> data = {'positions': [{'start': 65, 'end': 79}, {'start': 100, 'end': 136}]}
>>> TextPositionSet(**data)
TextPositionSet(positions=[TextPositionSelector(start=65, end=79), TextPositionSelector(start=100, end=136)], quotes=[])
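To show that the serialized form is ordinary JSON with no dependency on the source document, here is a round trip that uses only the standard library. The `Position` dataclass is a hypothetical stand-in for TextPositionSelector, not the Pydantic model itself.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Position:
    """Hypothetical stand-in for a position selector."""
    start: int
    end: int


originals = [Position(65, 79), Position(100, 136)]

# Serialize to the same nested shape shown above, then send it anywhere.
payload = json.dumps(
    {"positions": [asdict(p) for p in originals], "quotes": []}
)

# On the receiving side, rebuild the objects from the nested dictionaries.
received = json.loads(payload)
restored = [Position(**item) for item in received["positions"]]
```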
You can also get a valid OpenAPI schema for using Anchorpoint selectors in an API that you design.
>>> TextPositionSelector.schema_json()
'{"title": "TextPositionSelector", "description": "Describes a textual segment by start and end positions.\\n\\nBased on the Web Annotation Data Model `Text Position Selector\\n<https://www.w3.org/TR/annotation-model/#text-position-selector>`_ standard\\n\\n:param start:\\n The starting position of the segment of text.\\n The first character in the full text is character position 0,\\n and the character is included within the segment.\\n\\n:param end:\\n The end position of the segment of text.\\n The character is not included within the segment.", "type": "object", "properties": {"start": {"title": "Start", "default": 0, "type": "integer"}, "end": {"title": "End", "type": "integer"}}}'