DataLoader Class

class edgar.DataLoader(tikrs: str, document_type: str, data_dir: str = 'data', parser: ~typing.Callable[[str], str] = <function clean_text>, config: ~typing.Optional[~edgar.DataLoaderConfig] = None, loading_bar: bool = True)

Master class for iterating through parsed text of filing submissions.

__init__(tikrs: str, document_type: str, data_dir: str = 'data', parser: ~typing.Callable[[str], str] = <function clean_text>, config: ~typing.Optional[~edgar.DataLoaderConfig] = None, loading_bar: bool = True)

Construct the desired data-loading pipeline.

Parameters

tikrs (List[str]) – A list of company tickers to load documents for.
document_type (DocumentType or str) – The submission type to load documents for.
parser (Callable[[str], str]) – A function that turns a raw document into cleaned output.
loading_bar (bool = True) – If True, displays a tqdm loading bar while downloading files.

Notes

If files for the requested company tickers are not available locally, the constructor downloads and caches them during initialization for future use.
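The download-and-cache behaviour described in this note can be pictured with a small, generic sketch. It is not edgar's implementation: `load_or_fetch` and its `fetch` argument are hypothetical names used only for illustration, and the real on-disk layout under `data_dir` is not specified here.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached text at `path`, calling `fetch` only on a cache miss.

    Hypothetical helper: illustrates a cache-on-first-use pattern, not edgar's code.
    """
    if path.exists():
        # Cache hit: read the local copy and skip the download.
        return path.read_text(encoding="utf-8")
    # Cache miss: fetch the document, then store it for future calls.
    text = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

Under this pattern, only the first call for a given path pays the download cost; later calls read from disk.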

Usage

The DataLoader will find, load, and return parsed text from a set of desired documents. The constructor downloads the necessary files and the text is extracted just-in-time when the user calls for it.

The interface supports slice-indexing and iterating.

from edgar import DataLoader

data = DataLoader(tikrs=['nflx'], document_type='8-K')

print(data[14])

for text in data:
    print(text)

print(data[:3])
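To make the just-in-time behaviour concrete, here is a minimal sketch of a container that supports integer indexing, slicing, and iteration while parsing only on access. `LazyTexts` is a hypothetical stand-in, not the DataLoader source; in particular, returning a list for a slice is an assumption of this sketch.

```python
from typing import Callable, Iterator, List, Union


class LazyTexts:
    """Toy stand-in: holds raw documents and parses them only on access."""

    def __init__(self, raw_docs: List[str], parser: Callable[[str], str]) -> None:
        self._raw = raw_docs
        self._parser = parser

    def __len__(self) -> int:
        return len(self._raw)

    def __getitem__(self, idx: Union[int, slice]) -> Union[str, List[str]]:
        if isinstance(idx, slice):
            # Parse only the documents the slice selects.
            return [self._parser(doc) for doc in self._raw[idx]]
        # Single index: parse exactly one document.
        return self._parser(self._raw[idx])

    def __iter__(self) -> Iterator[str]:
        # Parse one document at a time as the caller iterates.
        for doc in self._raw:
            yield self._parser(doc)


texts = LazyTexts(["  a ", " b", "c  "], parser=str.strip)
first_two = texts[:2]
```

Because parsing happens inside `__getitem__` and `__iter__`, constructing the object is cheap and the parser runs only for documents that are actually requested.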

Custom Parse Functions

To access the raw data, pass in a 'no-operation' cleaning function.

def noop(x):
    # Return the raw document unchanged
    return x

data = DataLoader(['nflx'], '8-K', parser=noop)

The default cleaning function is as follows:

import re

from edgar import html

def clean(text):
    # Remove html <> tags
    text = html.remove_tags(text)
    # Remove most malformed characters
    text = html.remove_htmlbytes(text)
    # Remove newlines / tabs
    text = re.sub('[\n\t]', ' ', text)
    # Replace multiple spaces in a row with one
    text = html.compress_spaces(text)

    return text

For custom purposes, a modified cleaning function can be passed to the dataloader.

For example, we may want to remove numerical tables from 10-Q forms.

import re

from edgar import html

def clean_tables(text):
    # Remove numerical tables
    text = html.remove_tables(text)
    # Remove html <> tags
    text = html.remove_tags(text)
    # Remove most malformed characters
    text = html.remove_htmlbytes(text)
    # Remove newlines / tabs
    text = re.sub('[\n\t]', ' ', text)
    # Replace multiple spaces in a row with one
    text = html.compress_spaces(text)

    return text
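Since the parser is just a callable from str to str, cleaning steps can also be assembled from smaller functions. The sketch below uses only the standard library; `compose`, `strip_tags`, and `squeeze_whitespace` are hypothetical helpers written for illustration, and the regex-based tag stripper is a naive stand-in rather than a replacement for `html.remove_tags`.

```python
import re
from functools import reduce
from typing import Callable

Parser = Callable[[str], str]


def compose(*steps: Parser) -> Parser:
    """Chain cleaning steps left to right into a single parser."""
    def parser(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return parser


def strip_tags(text: str) -> str:
    # Naive stand-in for a tag remover; replaces each <...> span with a space.
    return re.sub(r"<[^>]+>", " ", text)


def squeeze_whitespace(text: str) -> str:
    # Collapse any run of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


clean_simple = compose(strip_tags, squeeze_whitespace)
```

The resulting callable has the same shape as the cleaning functions above, so it could be supplied as parser=clean_simple when constructing the DataLoader.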