DataLoader Class

class edgar.DataLoader(tikrs: str, document_type: str, data_dir: str = 'data', parser: Callable[[str], str] = clean_text, config: Optional[edgar.DataLoaderConfig] = None, loading_bar: bool = True)

Master class for iterating through parsed text of filing submissions.

__init__(tikrs: str, document_type: str, data_dir: str = 'data', parser: Callable[[str], str] = clean_text, config: Optional[edgar.DataLoaderConfig] = None, loading_bar: bool = True)

Construct the desired data-loading pipeline.

Parameters
  • tikrs (List[str]) – A set of companies to load documents for.

  • document_type (DocumentType or str) – The submission type to load documents for.

  • parser (Callable[[str], str]) – A function that turns a raw document into cleaned output.

  • loading_bar (bool = True) – If True, displays a tqdm progress bar while downloading files.

Notes

If files for the requested company tickers are not available locally, the constructor downloads them during initialization and caches them for future use.
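
The download-and-cache behaviour can be pictured with a minimal sketch. This is not the library's implementation: `ensure_cached`, `fake_download`, and the directory layout are hypothetical names chosen for illustration, and the "download" is a local stand-in so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, fetch: Callable[[], str]) -> str:
    """Return the file's contents, calling `fetch` only if it is not on disk."""
    if not path.exists():
        # First request: create the directory, fetch once, and store the result
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(), encoding="utf-8")
    return path.read_text(encoding="utf-8")


calls = []  # records how many times the stand-in download ran


def fake_download() -> str:
    # Hypothetical stand-in for a network request
    calls.append(1)
    return "filing text"


with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical layout: <data_dir>/<ticker>/<form type>/<file>
    target = Path(data_dir) / "nflx" / "8-K" / "filing.txt"
    first = ensure_cached(target, fake_download)   # fetches and writes to disk
    second = ensure_cached(target, fake_download)  # served from the cached file
```

The second call never reaches `fake_download`, which is the property the note above describes: files already present locally are reused.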

Usage

The DataLoader finds, loads, and returns parsed text for a set of desired documents. The constructor downloads any necessary files; the text itself is extracted just in time, when a document is requested.

The interface supports slice-indexing and iterating.

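from edgar import DataLoader

# Load Netflix 8-K filings, downloading them if they are not cached
data = DataLoader(tikrs=['nflx'], document_type='8-K')
# Index a single document
print(data[14])
# Iterate over every document
for text in data:
    print(text)
# Slice the first three documents
print(data[:3])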
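
To make the just-in-time idea concrete, here is a minimal, self-contained sketch of a container that parses documents only when they are accessed. `LazyTexts` and its `parse_calls` counter are hypothetical and do not reflect how `DataLoader` is written internally; the sketch only assumes the same access patterns (integer index, slice, iteration).

```python
from typing import Callable, Iterator, List, Sequence, Union


class LazyTexts:
    """Hypothetical stand-in for a loader that parses documents on access."""

    def __init__(self, raw_docs: Sequence[str], parser: Callable[[str], str]):
        self.raw_docs = list(raw_docs)
        self.parser = parser
        self.parse_calls = 0  # how many documents have actually been parsed

    def _parse(self, raw: str) -> str:
        self.parse_calls += 1
        return self.parser(raw)

    def __len__(self) -> int:
        return len(self.raw_docs)

    def __getitem__(self, idx: Union[int, slice]) -> Union[str, List[str]]:
        if isinstance(idx, slice):
            # A slice parses only the documents it selects
            return [self._parse(raw) for raw in self.raw_docs[idx]]
        return self._parse(self.raw_docs[idx])

    def __iter__(self) -> Iterator[str]:
        # Documents are parsed one at a time as the loop advances
        for raw in self.raw_docs:
            yield self._parse(raw)


docs = LazyTexts(["<p>a</p>", "<p>b</p>", "<p>c</p>"], parser=str.upper)
first = docs[0]   # parses one document
head = docs[:2]   # parses two more
```

Nothing is parsed at construction time; the work happens per access, so reading one document from a large collection stays cheap.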

Custom Parse Functions

To access the raw data, pass in a 'no operation' cleaning function.

from edgar import DataLoader

def noop(x):
    # Return the raw document unchanged
    return x
data = DataLoader(['nflx'], '8-K', parser=noop)

The default cleaning function is as follows:

import re

from edgar import html

def clean(text):
    # Remove html <> tags
    text = html.remove_tags(text)
    # Remove most malformed characters
    text = html.remove_htmlbytes(text)
    # Remove newlines / tabs
    text = re.sub('[\n\t]', ' ', text)
    # Replace multiple spaces in a row with one
    text = html.compress_spaces(text)

    return text

For custom purposes, a modified cleaning function can be passed to the dataloader.

For example, we may want to remove numerical tables from 10-Q forms.

import re

from edgar import html

def clean_tables(text):
    # Remove numerical tables
    text = html.remove_tables(text)
    # Remove html <> tags
    text = html.remove_tags(text)
    # Remove most malformed characters
    text = html.remove_htmlbytes(text)
    # Remove newlines / tabs
    text = re.sub('[\n\t]', ' ', text)
    # Replace multiple spaces in a row with one
    text = html.compress_spaces(text)

    return text
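
Because a parser only has to satisfy Callable[[str], str], it can also be assembled from small steps without the edgar.html helpers. The sketch below uses only the standard library; `compose`, `strip_tags`, and `collapse_whitespace` are hypothetical names, and the regular expressions are rough approximations rather than the library's own cleaning logic.

```python
import re
from typing import Callable


def strip_tags(text: str) -> str:
    # Replace anything that looks like an HTML tag, e.g. <p> or </div>, with a space
    return re.sub(r"<[^>]+>", " ", text)


def collapse_whitespace(text: str) -> str:
    # Replace runs of spaces, tabs, and newlines with one space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


def compose(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Chain cleaning steps, applied left to right, into a single parser."""
    def parser(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return parser


simple_clean = compose(strip_tags, collapse_whitespace)

sample = "<p>Net  income\n<b>rose</b></p>"
cleaned = simple_clean(sample)
```

A function built this way has the same shape as `clean_tables` above, so it could be supplied through the `parser` argument in the same manner.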