Parsing Documents

Parser Class

class edgar.Parser(metadata: Optional[metadata_manager] = None, data_dir: str = 'edgar_data')

Main class for extracting information from HTML documents.

featurize_file(tikr: str, submission: str, filename: str, force: bool = False, silent: bool = False, remove_raw: bool = False, **kwargs)

Load featurized dataframe with extracted annotations from file.

Parameters

tikr (str) – The company identifier to query
submission (str) – The filing to access the file from
filename (str) – The name of the file to featurize
force (bool, default=False) – If True, ignore locally downloaded files and overwrite them. Otherwise, attempt to detect a previous download and skip the server query.
silent (bool, default=False) – If True, do not print runtime warnings.
remove_raw (bool, default=False) – If True, delete the packed data after extraction.

Return type

pandas.DataFrame

Notes

Documents without annotations still receive entries in the dataframe, with the sentinel column is_annotated set to False.

Each row corresponds to one text field. Rows are not unique per text field: one row is generated for each iXBRL annotation on that text field.
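The row layout described in these notes can be illustrated with a small toy dataframe. This is only a sketch: apart from is_annotated, every column name here (document, text_id, text, annotation) and every tag value is a hypothetical stand-in, and the real featurized dataframe may be organized differently.

```python
import pandas as pd

# Toy stand-in for a featurized dataframe. Only `is_annotated` comes from
# the notes above; all other column names and values are hypothetical.
df = pd.DataFrame(
    {
        "document": ["a.htm", "a.htm", "a.htm", "b.htm"],
        "text_id": [0, 0, 1, 2],
        "text": ["Revenue 100", "Revenue 100", "Net income 20", "See note 3"],
        "annotation": ["tag_a", "tag_b", "tag_c", None],
        "is_annotated": [True, True, True, False],
    }
)

# Keep only rows that carry an annotation.
annotated = df[df["is_annotated"]]

# A text field with several annotations spans several rows, so collect
# the annotations back into one list per text field.
annotations_per_field = annotated.groupby("text_id")["annotation"].apply(list)

# Collapse to one row per text field when the annotations are not needed.
unique_fields = df.drop_duplicates(subset="text_id")
```

In this toy data, text field 0 appears twice because it carries two annotations, while the unannotated document b.htm still contributes a row with is_annotated set to False.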

get_annotated_submissions(tikr, document_type='all', silent: bool = False) → list

Get a list of submission names with annotations.

get_driver_path(tikr, submission, fname, partition='files')

Get absolute path of file.

get_unannotated_submissions(tikr, document_type='all', silent: bool = False) → list

Get a list of submission names without annotations.

static labels_in_table(child_span, parent_span)

Check whether each string in child_span appears in any of the spans in parent_span.

Parameters

child_span (list) – A list of spans
parent_span (list) – A list of spans

Return type

list of bool
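One plausible reading of this check is sketched below. It is not the library's implementation: it assumes that spans are plain strings and that "in" means substring containment, and the function name labels_in_table_sketch is hypothetical.

```python
from typing import List


def labels_in_table_sketch(child_span: List[str], parent_span: List[str]) -> List[bool]:
    """For each child string, report whether it occurs inside any parent string.

    Assumption: spans are plain strings and containment is a substring test.
    """
    return [
        any(child in parent for parent in parent_span)
        for child in child_span
    ]


flags = labels_in_table_sketch(
    ["Revenue", "Goodwill"],
    ["Total Revenue 2021", "Net income"],
)
print(flags)  # [True, False]
```

The result has one boolean per entry of child_span, in the same order, which matches the documented return value of a list of boolean values.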