Parsing Documents
Parser Class
- class edgar.Parser(metadata: Optional[metadata_manager] = None, data_dir: str = 'edgar_data')
Main class for extracting information from HTML documents.
- featurize_file(tikr: str, submission: str, filename: str, force: bool = False, silent: bool = False, remove_raw: bool = False, **kwargs)
Load featurized dataframe with extracted annotations from file.
- Parameters
tikr (str) – a company identifier to query
submission (str) – The filing to access the file from
filename (str) – The name of the file to featurize
force (bool, default=False) – If True, ignore locally downloaded files and overwrite them. Otherwise, attempt to detect a previous download and skip the server query.
silent (bool, default=False) – If True, do not print runtime warnings.
remove_raw (bool, default=False) – If True, delete the packed data after extraction.
- Return type
pandas.DataFrame
Notes
Documents without annotations receive entries in the dataframe with the sentinel column is_annotated set to False. Each row corresponds to one text field. Rows are not unique: one is generated for each iXBRL annotation on that text field.
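Because a text field can appear in several rows, it can help to regroup the rows by text field before further processing. The following is a minimal sketch in plain Python, not part of the library. It assumes the rows are available as dictionaries (for example via DataFrame.to_dict("records")); only is_annotated is a column name taken from this documentation, while text and label are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records shaped like rows of the featurized dataframe.
# `is_annotated` comes from the documentation; `text` and `label`
# are placeholder column names chosen for illustration only.
rows = [
    {"text": "Revenue was $10 million", "label": "us-gaap:Revenues", "is_annotated": True},
    {"text": "Revenue was $10 million", "label": "us-gaap:SalesRevenueNet", "is_annotated": True},
    {"text": "See accompanying notes", "label": None, "is_annotated": False},
]


def group_labels(rows):
    """Collect every annotation label per text field.

    Text fields without annotations map to an empty list.
    """
    grouped = defaultdict(list)
    for row in rows:
        labels = grouped[row["text"]]  # creates the entry even if unannotated
        if row["is_annotated"]:
            labels.append(row["label"])
    return dict(grouped)


labels_by_text = group_labels(rows)
```

Here the first text field appears in two rows, one per annotation, and ends up with two labels, while the unannotated field keeps a single entry with no labels.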
- get_annotated_submissions(tikr, document_type='all', silent: bool = False) -> list
Get a list of the names of submissions that have annotations.
- get_driver_path(tikr, submission, fname, partition='files')
Get absolute path of file.
- Parameters
fname (str) – The file to get the path for.
- get_unannotated_submissions(tikr, document_type='all', silent: bool = False) -> list
Get a list of the names of submissions that have no annotations.
- static labels_in_table(child_span, parent_span)
Check whether the strings in child_span appear in any of the spans in parent_span.
- Parameters
child_span (list) – a list of spans
parent_span (list) – a list of spans
- Return type
list of boolean values
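The membership check described for labels_in_table can be illustrated with a short sketch. This is not the library's implementation: it assumes that spans are plain strings, that "in" means substring containment, and that one boolean is returned per child span. The function name labels_in_table_sketch is hypothetical.

```python
from typing import List


def labels_in_table_sketch(child_span: List[str], parent_span: List[str]) -> List[bool]:
    """Return one flag per child string: True if it occurs inside any parent string.

    Assumption: spans are plain strings and matching is by substring.
    """
    return [
        any(child in parent for parent in parent_span)
        for child in child_span
    ]


flags = labels_in_table_sketch(
    ["Revenue", "Goodwill"],
    ["Total Revenue 2020", "Net income"],
)
```

With these inputs, "Revenue" is found inside the first parent string and "Goodwill" is found in neither, so flags holds one True followed by one False.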