Anchorpoint API Documentation¶
Release v. 0.7.0.
Anchorpoint is a Python library that supplies text substring selectors for anchoring annotations. It includes helper methods for switching between positional and contextual selectors, and AuthoritySpoke uses it to reference judicial opinions and statutes.
Anchorpoint is licensed under the Atmosphere Software License.
Selecting Text with Anchorpoint¶
Anchorpoint is a tool for labeling referenced passages within text documents, in a format that allows the “anchors” to the referenced passages to be stored and transmitted separately from the documents themselves. Anchorpoint has two basic ways of selecting text: as text positions, or as text quotes. Here’s a demonstration of creating a text string in Python and then using both kinds of text selectors.
>>> from anchorpoint import TextPositionSelector, TextQuoteSelector
>>> legal_text = (
... "Copyright protection subsists, in accordance with this title, "
... "in original works of authorship fixed in any tangible medium of expression, "
... "now known or later developed, from which they can be perceived, reproduced, "
... "or otherwise communicated, either directly or with the aid of a machine or device. "
... "Works of authorship include the following categories: "
... "literary works; musical works, including any accompanying words; "
... "dramatic works, including any accompanying music; "
... "pantomimes and choreographic works; "
... "pictorial, graphic, and sculptural works; "
... "motion pictures and other audiovisual works; "
... "sound recordings; and architectural works.")
>>> positions = TextPositionSelector(start=65, end=93)
>>> positions.select_text(legal_text)
'original works of authorship'
>>> quote = TextQuoteSelector(exact="in accordance with this title")
>>> quote.select_text(legal_text)
'in accordance with this title'
A TextPositionSelector works by identifying the positions of the start and end characters within the text string object, while a TextQuoteSelector describes the part of the text that is being selected.
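The difference between the two approaches can be illustrated in plain Python, without Anchorpoint. This is only a sketch of the underlying idea, not the library's implementation: a position selector behaves like a half-open slice, and a quote selector behaves like a search for the quoted words. The variable names are made up for the illustration.

```python
# Plain-Python illustration only; Anchorpoint's classes are not used here.
text = "Copyright protection subsists, in accordance with this title"

# A position selector amounts to a half-open slice:
# the start character is included, the end character is not.
start, end = 10, 20
by_position = text[start:end]

# A quote selector amounts to searching for the quoted words.
exact = "in accordance"
found_at = text.find(exact)
by_quote = text[found_at:found_at + len(exact)]

print(by_position)  # protection
print(by_quote)     # in accordance
```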
Sometimes a selected passage is too long to include in full in the TextQuoteSelector. In that case, you can identify the selection by specifying its prefix and suffix: that is, the text immediately before and immediately after the text you want to select.
>>> quote = TextQuoteSelector(prefix="otherwise communicated, ", suffix=" Works of authorship")
>>> quote.select_text(legal_text)
'either directly or with the aid of a machine or device.'
If you specify just a suffix, then the start of your text selection is the beginning of the text string. If you specify just a prefix, then your text selection continues to the end of the text string.
>>> quote_from_start = TextQuoteSelector(suffix="in accordance with this title")
>>> quote_from_start.select_text(legal_text)
'Copyright protection subsists,'
>>> quote_from_end = TextQuoteSelector(prefix="sound recordings; and")
>>> quote_from_end.select_text(legal_text)
'architectural works.'
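The prefix and suffix rules above can be mimicked with ordinary string methods. The helper below is a hypothetical stand-in written for this illustration (`select_between` is not an Anchorpoint function), and it assumes the prefix and suffix each occur in the text.

```python
def select_between(text: str, prefix: str = "", suffix: str = "") -> str:
    """Hypothetical helper: return the text after `prefix` and before `suffix`.

    A missing prefix means "from the start of the text"; a missing suffix
    means "to the end of the text". str.index raises ValueError if a given
    prefix or suffix is not found.
    """
    start = text.index(prefix) + len(prefix) if prefix else 0
    end = text.index(suffix, start) if suffix else len(text)
    # Trim whitespace at the ends, as the documented examples do.
    return text[start:end].strip()


sample = "literary works; musical works; sound recordings; and architectural works."

print(select_between(sample, prefix="musical works;", suffix="and"))
print(select_between(sample, suffix="; musical"))
print(select_between(sample, prefix="sound recordings; and"))
```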
If you want to use a TextQuoteSelector to select a particular instance of a phrase that appears more than once in the text, then you can add a prefix or suffix in addition to the exact phrase to eliminate the ambiguity. For example, this selector applies to the second instance of the word “authorship” in the text, not the first instance.
>>> authorship_selector = TextQuoteSelector(exact="authorship", suffix="include")
>>> authorship_selector.select_text(legal_text)
'authorship'
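One way to picture how a suffix removes the ambiguity is a regular expression built from the three fields. The sketch below uses only the standard `re` module; `find_quote` is a hypothetical name, and the pattern is an assumption for illustration rather than the pattern Anchorpoint builds internally.

```python
import re


def find_quote(text, exact, prefix="", suffix=""):
    """Hypothetical helper: return the (start, end) span of `exact`.

    The exact phrase must be preceded by `prefix` and followed by `suffix`,
    with optional whitespace in between. Returns None if nothing matches.
    """
    pattern = (
        re.escape(prefix) + r"\s*(" + re.escape(exact) + r")\s*" + re.escape(suffix)
    )
    match = re.search(pattern, text)
    return match.span(1) if match else None


sample = "works of authorship fixed in a medium. Works of authorship include literary works."

first = find_quote(sample, exact="authorship")
second = find_quote(sample, exact="authorship", suffix="include")
print(first, second)
```

Without a suffix the search stops at the first occurrence; with the suffix, only the second occurrence satisfies the pattern.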
Converting Between Selector Types¶
You can use the as_position() and as_quote() methods to convert between the two types of selector.
>>> authorship_selector.as_position(legal_text)
TextPositionSelector(start=306, end=316)
>>> positions.as_quote(legal_text)
TextQuoteSelector(exact='original works of authorship', prefix='', suffix='')
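The as_quote() method also accepts left_margin and right_margin arguments (see the API reference), which control how much surrounding text becomes the prefix and suffix. A minimal sketch of that idea in plain Python follows; `position_to_quote` is a hypothetical helper, and the way the margins are applied here is an assumption.

```python
def position_to_quote(text, start, end, left_margin=0, right_margin=0):
    """Hypothetical helper: return (prefix, exact, suffix) for a span of text.

    The prefix is up to `left_margin` characters before the span and the
    suffix is up to `right_margin` characters after it.
    """
    prefix = text[max(0, start - left_margin):start]
    exact = text[start:end]
    suffix = text[end:end + right_margin]
    return prefix, exact, suffix


sample = "red orange yellow green"
print(position_to_quote(sample, 4, 10))
print(position_to_quote(sample, 4, 10, left_margin=4, right_margin=7))
```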
Combining and Grouping Selectors¶
Two position selectors that overlap can be added together to make a single selector covering both spans of text.
>>> left = TextPositionSelector(start=5, end=22)
>>> right = TextPositionSelector(start=12, end=27)
>>> left + right
TextPositionSelector(start=5, end=27)
If two position selectors don’t overlap, then adding them returns a different class called a TextPositionSet.
>>> from anchorpoint import TextPositionSet
>>> left = TextPositionSelector(start=65, end=79)
>>> right = TextPositionSelector(start=100, end=136)
>>> selector_set = left + right
>>> selector_set
TextPositionSet(positions=[TextPositionSelector(start=65, end=79), TextPositionSelector(start=100, end=136)], quotes=[])
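The two outcomes of addition can be sketched with plain (start, end) tuples. `add_spans` is a hypothetical helper for illustration; treating spans that merely touch as mergeable is an assumption of this sketch.

```python
def add_spans(left, right):
    """Hypothetical helper: merge two (start, end) spans if they overlap.

    Returns a single merged span when the spans overlap or touch,
    otherwise a list holding both spans in order.
    """
    (a_start, a_end), (b_start, b_end) = sorted([left, right])
    if b_start <= a_end:
        return (a_start, max(a_end, b_end))
    return [(a_start, a_end), (b_start, b_end)]


print(add_spans((5, 22), (12, 27)))     # one merged span
print(add_spans((65, 79), (100, 136)))  # two separate spans
```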
The TextPositionSet can be used to select nonconsecutive passages of text.
>>> selector_set.select_text(legal_text)
'…original works…in any tangible medium of expression…'
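The ellipsis characters in that output mark text that was skipped. A rough sketch of how such a string could be assembled from non-overlapping spans is shown here; `render` is a hypothetical helper, not the library's code, and it ignores the margin handling that the real select_text method offers.

```python
def render(text, spans):
    """Hypothetical helper: join selected spans, marking skipped text with '…'."""
    pieces = []
    cursor = 0  # end of the previous selected span
    for start, end in sorted(spans):
        if start > cursor:
            pieces.append("…")  # some text was skipped before this span
        pieces.append(text[start:end])
        cursor = end
    if cursor < len(text):
        pieces.append("…")  # text remains after the last span
    return "".join(pieces)


colors = "red orange yellow green blue indigo violet"
print(render(colors, [(4, 17), (24, 35)]))
print(render(colors, [(0, 10), (24, 35)]))
```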
If needed, you can use a TextPositionSet to select text with a combination of both positions and quotes.
>>> text = "red orange yellow green blue indigo violet"
>>> position = TextPositionSelector(start=4, end=17)
>>> quote = TextQuoteSelector(exact="blue indigo")
>>> group = TextPositionSet(positions=[position], quotes=[quote])
>>> group.select_text(text)
'…orange yellow…blue indigo…'
You can also add an integer to move the text selection to the right, or subtract an integer to move it to the left. Only the position selectors are moved, not the quote selectors.
>>> earlier_selectors = group - 7
>>> earlier_selectors.select_text(text)
'red orange…blue indigo…'
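In the example above, subtracting 7 from a selector that starts at position 4 leaves it starting at position 0 rather than at a negative position. The sketch below imitates that behavior with a hypothetical `shift_span` helper. Raising IndexError when the whole span would fall before the start of the text is an assumption based on the 0.4.1 changelog entry.

```python
def shift_span(start, end, amount):
    """Hypothetical helper: move a (start, end) span by `amount` characters.

    The start is clamped at 0. If even the end would land at or before 0,
    nothing would be left to select, so an IndexError is raised.
    """
    new_end = end + amount
    if new_end <= 0:
        raise IndexError("span would be moved entirely before the start of the text")
    return max(0, start + amount), new_end


print(shift_span(4, 17, -7))  # (0, 10)
print(shift_span(4, 17, 3))   # (7, 20)
```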
The union (|) and intersection (&) operators also work.
>>> left = TextPositionSelector(start=2, end=10)
>>> right = TextPositionSelector(start=5, end=20)
>>> left & right
TextPositionSelector(start=5, end=10)
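The intersection is the stretch of text that both selectors cover. With plain tuples, and a hypothetical `intersect` helper written only for this illustration, the arithmetic looks like this:

```python
def intersect(left, right):
    """Hypothetical helper: return the overlap of two (start, end) spans, or None."""
    start = max(left[0], right[0])
    end = min(left[1], right[1])
    return (start, end) if start < end else None


print(intersect((2, 10), (5, 20)))   # (5, 10)
print(intersect((2, 10), (15, 20)))  # None, because the spans do not overlap
```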
Comparing Selectors and Sets¶
The greater than and less than operators can be used to check whether one selector or set covers the entire range of another. This is used to check whether one selector only contains text that’s already within another selector.
>>> smaller = TextPositionSelector(start=4, end=8)
>>> overlapping = TextPositionSelector(start=6, end=50)
>>> overlapping > smaller
False
>>> superset = TextPositionSelector(start=0, end=10)
>>> superset > smaller
True
TextPositionSets also have a __gt__() method that works in the same way.
>>> selector_set > TextPositionSelector(start=100, end=110)
True
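These comparisons can be pictured as containment checks on (start, end) tuples. The helpers below are hypothetical and only sketch the idea described above: one selection is "greater" when it covers all of the other without being identical to it.

```python
def contains(outer, inner):
    """True if the span `outer` covers every position in the span `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def span_gt(outer, inner):
    """Sketch of `>` for two spans: full coverage, but not the same span."""
    return contains(outer, inner) and outer != inner


def set_gt(outer_spans, inner):
    """Sketch of `>` for a set of non-overlapping spans against one span."""
    covered = any(contains(span, inner) for span in outer_spans)
    return covered and list(outer_spans) != [inner]


print(span_gt((6, 50), (4, 8)))                    # False
print(span_gt((0, 10), (4, 8)))                    # True
print(set_gt([(65, 79), (100, 136)], (100, 110)))  # True
```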
Serializing Selectors¶
Anchorpoint uses Pydantic to serialize selectors either to Python dictionaries or to JSON strings suitable for sending over the internet with APIs.
>>> authorship_selector.json()
'{"exact": "authorship", "prefix": "", "suffix": "include"}'
>>> selector_set.dict()
{'positions': [{'start': 65, 'end': 79}, {'start': 100, 'end': 136}], 'quotes': []}
Pydantic’s data loading methods mean that you can also create the data for an Anchorpoint selector using nested dictionaries, and then load it with the class’s constructor method.
>>> data = {'positions': [{'start': 65, 'end': 79}, {'start': 100, 'end': 136}]}
>>> TextPositionSet(**data)
TextPositionSet(positions=[TextPositionSelector(start=65, end=79), TextPositionSelector(start=100, end=136)], quotes=[])
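Because the serialized form is ordinary JSON, the round trip can be tried with the standard `json` module alone. The sketch below uses plain dictionaries as a stand-in for the Pydantic models; it shows the shape of the data, not Anchorpoint's own loading code.

```python
import json

# The same nested structure shown above, as a plain dictionary.
data = {
    "positions": [{"start": 65, "end": 79}, {"start": 100, "end": 136}],
    "quotes": [],
}

payload = json.dumps(data)      # a string suitable for sending over an API
restored = json.loads(payload)  # back to nested dictionaries

# Pull the (start, end) pairs back out of the restored data.
spans = [(item["start"], item["end"]) for item in restored["positions"]]
print(spans)
```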
You can also get a valid OpenAPI schema, for using Anchorpoint selectors in an API that you design.
>>> TextPositionSelector.schema_json()
'{"title": "TextPositionSelector", "description": "Describes a textual segment by start and end positions.\\n\\nBased on the Web Annotation Data Model `Text Position Selector\\n<https://www.w3.org/TR/annotation-model/#text-position-selector>`_ standard\\n\\n:param start:\\n The starting position of the segment of text.\\n The first character in the full text is character position 0,\\n and the character is included within the segment.\\n\\n:param end:\\n The end position of the segment of text.\\n The character is not included within the segment.", "type": "object", "properties": {"start": {"title": "Start", "default": 0, "type": "integer"}, "end": {"title": "End", "type": "integer"}}}'
Development Updates¶
Changelog¶
0.7.0 (2021-10-09)¶
passing TextPositionSet to from_selection_sequence doesn’t cause error
fix bug: Range with end “Inf” caused string slicing error
add TextPositionSet.from_quotes
remove TextSelector class
start_less_than_end is no longer a root_validator
0.6.1 (2021-09-23)¶
python-ranges by Superbird11 is imported instead of vendored
0.6.0 (2021-09-19)¶
TextPositionSelector no longer inherits from Range
TextPositionSet no longer inherits from RangeSet
TextPositionSelector no longer has real_start and real_end that can differ from start and end
Selectors and TextPositionSets are Pydantic models
TextSelector is Pydantic model for either Quote or Position Selector
remove Marshmallow schemas
update type annotations for TextPositionSelector.from_range
add TextQuoteSelector.as_unique_position method
TextPositionSet can include TextQuoteSelectors
add convert_quotes_to_positions method to TextPositionSet
replace TextPositionSet.selectors field with positions and quotes
change as_quote_selector method to as_quote
TextPositionSet.add_margin includes quotes
fix bug: subtracting int from selector set caused quotes to be lost
add __ge__ and __gt__ methods for TextPositionSelector
add Selecting Text with Anchorpoint guide
0.5.3 (2021-08-11)¶
change readme to .rst
use setup.py instead of setup.cfg
0.5.2 (2021-08-02)¶
TextPositionSet can be made from list of tuples
long passage in exception is truncated
0.5.1 (2021-05-15)¶
improper shorthand for selector raises TextSelectionError
0.5.0 (2021-05-07)¶
add TextPositionSelector.from_text constructor
Range constructor interprets None as 0
fix bug: union with TextPositionSet should return TextPositionSet
add PositionSelectorSchema, for when a selector can’t be a TextQuoteSelector
0.4.4 (2021-01-25)¶
provide “missing” instead of “optional” argument for marshmallow schema
add TextPositionSetFactory.from_exact_strings
SelectorSchema.expand_anchor_shorthand takes only a string argument
TextPositionSetFactory.from_selection will accept a Sequence of mixed types
0.4.3 (2020-12-11)¶
TextPositionSelector serializer dumps .real_start and .real_end
TextPositionSelector serializer omits “include_start” and “include_end”
TextPositionSelector serializer orders fields so “start” comes before “end”
disallow zero-length TextPositionSelectors
0.4.2 (2020-08-30)¶
create TextPositionSelector .real_start and .real_end
create TextPositionSet.add_margin
0.4.1 (2020-08-29)¶
TextPositionSetFactory will accept list of strings
subtracting more than start value is no longer IndexError, but more than end value is
TextSequence quoting from empty string doesn’t start with None
0.4.0 (2020-08-08)¶
TextPositionSet can output a TextSequence
create TextSequence addition method
0.3.3 (2020-07-28)¶
fix bug: leading whitespace when selecting from prefix
0.3.2 (2020-07-22)¶
fix bug where adding selectors converted them to parent class
add TextSelectionError exception
0.3.1 (2020-07-19)¶
add left and right margin parameters to TextPositionSelector.as_quote_selector
as_quotes method for TextSelectorSet
enable adding int to TextSelectorSet
fix class name in repr for TextSelectorSet
0.3.0 (2020-07-18)¶
add TextQuoteSelector.from_text shortcut
add ability to subtract an integer from all values in a TextPositionSet
include [marshmallow](https://github.com/marshmallow-code/marshmallow) schema for serializing
0.2.1 (2020-05-21)¶
add init file to utils directory
0.2.0 (2020-05-21)¶
Make TextPositionSelector subclass Range from [python-ranges](https://github.com/Superbird11/ranges).
0.1.1 (2019-12-01)¶
add init file to tests directory
0.1.0 (2019-11-30)¶
Create TextPositionSelector and TextQuoteSelector classes
GitHub¶
You can find open issues and current changes to anchorpoint through its GitHub repo.
Twitter¶
On Twitter, you can follow @authorityspoke or @mcareyaus for project updates.
API Reference¶
Text Selectors¶
Text substring selectors for anchoring annotations.
Based on parts of the W3C Web Annotation Data Model.
- class anchorpoint.textselectors.TextPositionSelector(**data)¶
Describes a textual segment by start and end positions.
Based on the Web Annotation Data Model Text Position Selector standard
- Parameters
start – The starting position of the segment of text. The first character in the full text is character position 0, and the character is included within the segment.
end – The end position of the segment of text. The character is not included within the segment.
- __add__(value)¶
Make a new selector covering the combined ranges of self and other.
- Parameters
other – selector for another text interval
margin – allowable distance between two selectors that can still be added together
- Return type
- Returns
a selector reflecting the combined range if possible, otherwise None
- __and__(other)¶
Make a new selector covering the intersection of the ranges of self and other.
- Parameters
other (Union[TextPositionSelector, TextPositionSet, Range, RangeSet]) – selector for another text interval
- Return type
- Returns
a selector reflecting the range of the intersection
- __gt__(other)¶
Check if self is greater than other.
- Parameters
other (Union[TextPositionSelector, TextPositionSet]) – selector for another text interval
- Return type
- Returns
whether self is greater than other
- __hash__ = None¶
- __or__(other)¶
Make a new selector covering the combined ranges of self and other.
- Parameters
other (Union[TextPositionSelector, TextPositionSet, Range, RangeSet]) – selector for another text interval
- Return type
- Returns
a selector reflecting the combined range
- as_quote(text, left_margin=0, right_margin=0)¶
Make a quote selector, creating prefix and suffix from specified lengths of text.
- Parameters
- Return type
- combine(other, text)¶
Make new selector combining ranges of self and other if it will fit in text.
- difference(other)¶
Get selectors in self or other but not both.
Applies the Range difference method, replacing RangeSet with TextPositionSet in the return value.
- Return type
- classmethod from_range(range)¶
Make TextPositionSelector with same extent as a Range object from python-ranges.
- Return type
- classmethod from_text(text, start=0, end=None)¶
Make position selector including the text strings “start” and “end” within “text”.
- Return type
- range()¶
Get the range of the text.
- Return type
Range
- rangeset()¶
Get the range set of the text.
- Return type
RangeSet
- select_text(text)¶
Get the quotation from text identified by start and end positions.
- Return type
- classmethod start_less_than_end(v, values)¶
Verify start position is before the end position.
- Returns
the end position, which is after the start position
- classmethod start_not_negative(v)¶
Verify start position is not negative.
- Return type
- Returns
the start position, which is not negative
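The two validators above enforce simple invariants on a position selector. A comparable check can be sketched with a standard-library dataclass. `Span` is a hypothetical class for illustration; it requires an end value and rejects zero-length spans (the latter following the 0.4.3 changelog entry), which may not match every detail of the Pydantic model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a validated position selector."""

    start: int
    end: int

    def __post_init__(self):
        # Mirrors the idea of start_not_negative.
        if self.start < 0:
            raise ValueError("start position must not be negative")
        # Mirrors the idea of start_less_than_end; zero-length spans are rejected too.
        if self.end <= self.start:
            raise ValueError("end position must come after start position")


valid = Span(start=5, end=10)
print(valid)
```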
- subtract_integer(value)¶
Reduce self’s startpoint and endpoint by an integer.
- Return type
- class anchorpoint.textselectors.TextPositionSet(**data)¶
A set of TextPositionSelectors.
- __add__(value)¶
Increase all startpoints and endpoints by the given amount.
- Parameters
value (Union[int, TextPositionSelector, TextPositionSet]) – selector for another text interval, or integer to add to every start and end value in self’s position selectors
- Return type
- Returns
a selector reflecting the combined range if possible, otherwise None
- __gt__(other)¶
Test if self’s rangeset includes all of other’s rangeset, but is not identical.
- Return type
- __hash__ = None¶
- __str__()¶
Return str(self).
- __sub__(value)¶
Decrease all startpoints and endpoints by the given amount.
- Return type
- add_margin(text, margin_width=3, margin_characters=', ."\\' ;[]()')¶
Expand selected position selectors to include margin of punctuation.
This can cause multiple selections to be merged into a single one.
Ignores quote selectors.
- Parameters
- Return type
- Returns
A new TextPositionSet with the margin added
>>> from anchorpoint.textselectors import TextPositionSetFactory
>>> text = "I predict that the grass is wet. (It rained.)"
>>> factory = TextPositionSetFactory(text=text)
>>> selectors = [TextQuoteSelector(exact="the grass is wet"), TextQuoteSelector(exact="it rained")]
>>> position_set = factory.from_selection(selection=selectors)
>>> len(position_set.ranges())
2
>>> new_position_set = position_set.add_margin(text=text)
>>> len(new_position_set.ranges())
1
>>> new_position_set.ranges()[0].start
15
>>> new_position_set.ranges()[0].end
43
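The merged result in that example (start 15, end 43) is what would be expected if neighboring selections are joined whenever the gap between them is short and made up only of margin characters. The sketch below reproduces that outcome with plain tuples. `merge_across_margins` is a hypothetical helper and the rule it applies is an assumption, so the library's actual algorithm may differ.

```python
def merge_across_margins(text, spans, margin_width=3, margin_characters=", .\"' ;[]()"):
    """Hypothetical helper: join spans separated only by a short run of margin characters."""
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]
            # Merge when the gap is short enough and contains only margin characters.
            if len(gap) <= margin_width and all(ch in margin_characters for ch in gap):
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


text = "I predict that the grass is wet. (It rained.)"
print(merge_across_margins(text, [(15, 31), (34, 43)]))  # [(15, 43)]
print(merge_across_margins(text, [(2, 9), (15, 31)]))    # gap " that " is too long
```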
- as_quotes(text)¶
Copy self’s quote and position selectors, converting all position selectors to quote selectors.
- Return type
- as_string(text)¶
Return a string representing the selected parts of text.
>>> selectors = [TextPositionSelector(start=5, end=10)]
>>> selector_set = TextPositionSet(positions=selectors)
>>> sequence = selector_set.as_text_sequence("Some text.")
>>> selector_set.as_string("Some text.")
'…text.'
- Return type
- as_text_sequence(text, include_nones=True)¶
List the phrases in a text passage selected by this TextPositionSet.
- Parameters
passage – A passage to select text from
include_nones (bool) – Whether the list of phrases should include None to indicate a block of unselected text
- Return type
- Returns
A TextSequence of the phrases in the text
>>> selectors = [TextPositionSelector(start=5, end=10)]
>>> selector_set = TextPositionSet(positions=selectors)
>>> selector_set.as_text_sequence("Some text.")
TextSequence([None, TextPassage("text.")])
- convert_quotes_to_positions(text)¶
Return new TextPositionSet with all quotes replaced by their positions in the given text.
- Return type
- classmethod from_quotes(selection)¶
Construct TextPositionSet from string or TextQuoteSelectors.
If a string is used, it will be converted to a TextQuoteSelector with no prefix or suffix.
- Return type
- classmethod from_ranges(ranges)¶
Make new class instance from Range objects from python-ranges library.
- Return type
- merge_rangeset(rangeset)¶
Merge another RangeSet into this one, returning a new TextPositionSet.
- Parameters
rangeset (RangeSet) – the RangeSet to merge
- Return type
- Returns
a new TextPositionSet representing the combined ranges
- classmethod order_of_selectors(v)¶
Ensure that selectors are in order.
- positions_as_quotes(text)¶
Copy self’s position selectors, converted to quote selectors.
- Return type
- positions_of_quote_selectors(text)¶
Convert self’s quote selectors to position selectors for a given text.
- Return type
- classmethod quote_selectors_are_in_list(selectors)¶
Put single selector in list and convert strings to selectors.
- quotes_rangeset(text)¶
Get ranges where these quotes appear in the provided text.
- Return type
RangeSet
- rangeset()¶
Convert positions into python-ranges Rangeset.
- Return type
RangeSet
- select_text(text, margin_width=3, margin_characters=', ."\\' ;[]()')¶
Return the selected text from text.
- Parameters
- Return type
- Returns
The selected text
>>> from anchorpoint.textselectors import TextPositionSetFactory
>>> text = "I predict that the grass is wet. (It rained.)"
>>> factory = TextPositionSetFactory(text=text)
>>> selectors = [TextQuoteSelector(exact="the grass is wet"), TextQuoteSelector(exact="it rained")]
>>> position_set = factory.from_selection(selection=selectors)
>>> position_set.select_text(text=text)
'…the grass is wet. (It rained…'
- classmethod selectors_are_in_list(selectors)¶
Put single selector in list.
- class anchorpoint.textselectors.TextPositionSetFactory(text)¶
Factory for constructing TextPositionSet from text passages and various kinds of selector.
- __init__(text)¶
Store text passage that will be used to generate text selections.
- __weakref__¶
list of weak references to the object (if defined)
- from_bool(selection)¶
Select either the whole passage or none of it.
- Return type
- from_exact_strings(selection)¶
Construct TextPositionSet from a sequence of strings representing exact quotations.
First converts the sequence to TextQuoteSelectors, and then to TextPositionSelectors.
- Return type
- from_quote_selectors(quotes)¶
Construct TextPositionSet from a sequence of TextQuoteSelectors.
- Return type
- from_selection(selection)¶
Construct TextPositionSet for a provided text passage, from any type of selector.
- Return type
- from_selection_sequence(selections)¶
Construct TextPositionSet from one or more of: strings, Quote Selectors, and Position Selectors.
First converts strings to TextQuoteSelectors, and then to TextPositionSelectors.
- Return type
- class anchorpoint.textselectors.TextQuoteSelector(**data)¶
Describes a textual segment by quoting it, or passages before or after it.
Based on the Web Annotation Data Model Text Quote Selector standard
- Parameters
exact – a copy of the text which is being selected
prefix – a snippet of text that occurs immediately before the text which is being selected.
suffix – the snippet of text that occurs immediately after the text which is being selected.
- __hash__ = None¶
- as_position(text)¶
Get the interval where the selected quote appears in “text”.
- Parameters
text (str) – the passage where an exact quotation needs to be located
- Return type
- Returns
the position selector for the location of the exact quotation
- as_unique_position(text)¶
Get the interval where the selected quote appears in “text”.
- Parameters
text (str) – the passage where an exact quotation needs to be located
- Return type
- Returns
the position selector for the location of the exact quotation
- find_match(text)¶
Get the first match for the selector within a string.
- Parameters
text (str) – text to search for a match to the selector
- Return type
Optional[Match]
- Returns
a regular expression match, or None
>>> text = "process, system, method of operation, concept, principle"
>>> selector = TextQuoteSelector(exact="method of operation")
>>> selector.find_match(text)
<re.Match object; span=(17, 36), match='method of operation'>
- classmethod from_text(text)¶
Create a selector from a text string.
“prefix” and “suffix” fields may be created by separating part of the text with a pipe character (“|”).
- Parameters
text (str) – the passage where an exact quotation needs to be located
- Return type
- Returns
a selector for the location of the exact quotation
>>> text = "process, system,|method of operation|, concept, principle"
>>> selector = TextQuoteSelector.from_text(text)
>>> selector.prefix
'process, system,'
>>> selector.exact
'method of operation'
>>> selector.suffix
', concept, principle'
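The pipe shorthand can be pictured as a three-way split. The helper below is a hypothetical sketch: treating a string with no pipes as the exact phrase is an assumption, and ValueError stands in for the TextSelectionError that the changelog says is raised for improper shorthand.

```python
def split_shorthand(text):
    """Hypothetical helper: split 'prefix|exact|suffix' into its three fields."""
    pipes = text.count("|")
    if pipes == 0:
        # Assumption: with no pipes, the whole string is the exact phrase.
        return "", text, ""
    if pipes != 2:
        raise ValueError("shorthand must contain exactly two pipe characters")
    prefix, exact, suffix = text.split("|")
    return prefix, exact, suffix


shorthand = "process, system,|method of operation|, concept, principle"
print(split_shorthand(shorthand))
```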
- is_unique_in(text)¶
Test if selector refers to exactly one passage in text.
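As a rough illustration of what "exactly one passage" means, the quoted phrase can be counted with the standard `re` module. `occurs_once` is a hypothetical helper that looks only at the exact phrase and ignores any prefix or suffix, so it is simpler than the method it illustrates.

```python
import re


def occurs_once(text: str, exact: str) -> bool:
    """Hypothetical helper: True if `exact` appears exactly one time in `text`."""
    return len(re.findall(re.escape(exact), text)) == 1


passage = "works of authorship fixed in a medium. Works of authorship include literary works."
print(occurs_once(passage, "authorship"))  # False, because it appears twice
print(occurs_once(passage, "include"))     # True
```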
- classmethod no_none_for_prefix(value)¶
Ensure that ‘prefix’, ‘exact’, and ‘suffix’ are not None.
- passage_regex()¶
Get regex to identify the selected text.
- prefix_regex()¶
Get regex for the text before any whitespace and the selection.
- rebuild_from_text(text)¶
Make new selector with the “exact” value found in a given text.
Used for building a complete selector when exact has not been specified.
- Parameters
text (str) – the passage where an exact quotation needs to be located
- Return type
- Returns
a new selector with the “exact” value found in the provided text
- select_text(text)¶
Get the passage matching the selector, minus any whitespace at ends.
- Parameters
text (str) – the passage where an exact quotation needs to be located.
- Return type
- Returns
the passage between prefix and suffix in text.
>>> text = "process, system, method of operation, concept, principle"
>>> selector = TextQuoteSelector(prefix="method of operation,")
>>> selector.select_text(text)
'concept, principle'
- static split_anchor_text(text)¶
Break up shorthand text selector format into three fields.
Tries to break up the string into prefix, exact, and suffix, by splitting on exactly two pipe characters.
- suffix_regex()¶
Get regex for the text following the selection and any whitespace.
Text Sequences¶
- class anchorpoint.textsequences.TextPassage(text)¶
A contiguous passage of text.
Can be used to compare passages while disregarding end punctuation.
- means(other)¶
Test if passages have the same text, disregarding end punctuation.
- Parameters
other (Optional[TextPassage]) – the other passage to test against
- Return type
- Returns
True if the two passages have the same text, False otherwise
>>> TextPassage("Hello, world.").means(TextPassage("Hello, world"))
True
>>> TextPassage("Hello world").means(TextPassage("Hello, world"))
False
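A comparison that disregards end punctuation can be approximated by stripping trailing punctuation before comparing. In the sketch below, `same_passage` is a hypothetical helper and the set of characters stripped is an assumption, so the result may differ from TextPassage.means() on edge cases.

```python
# Assumption: which trailing characters count as "end punctuation".
END_PUNCTUATION = " .,;:!?"


def same_passage(left: str, right: str) -> bool:
    """Hypothetical helper: compare two passages, ignoring punctuation at the end."""
    return left.rstrip(END_PUNCTUATION) == right.rstrip(END_PUNCTUATION)


print(same_passage("Hello, world.", "Hello, world"))  # True
print(same_passage("Hello world", "Hello, world"))    # False: the comma is not at the end
```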
- class anchorpoint.textsequences.TextSequence(passages=None)¶
Sequential passages of text that need not be consecutive.
Unlike a Legislice Enactment, a TextSequence does not preserve the tree structure of the quoted document.
- __add__(other)¶
Combine TextSequences by merging their selected TextPassages.
- Return type
- __ge__(other)¶
Return self>=value.
- __gt__(other)¶
Return self>value.
- __init__(passages=None)¶
Make new TextSequence from TextPassage list.
- Parameters
passages (Optional[List[Optional[TextPassage]]]) – the text passages included in the TextSequence, which should be chosen to express a coherent idea. “None”s in the sequence represent spans of text that exist in the source document, but that haven’t been chosen to be part of the TextSequence.
- __repr__()¶
Return repr(self).
- __str__()¶
Return str(self).
- __weakref__¶
list of weak references to the object (if defined)
- means(other)¶
Test if all the passages in self and other correspond with each other.
- Return type
- strip()¶
Remove symbols representing missing text from the beginning and end.
- Return type