David Li
4 min read · Feb 18, 2023

Summary

For the stonk pipeline, I focus on stock tickers drawn from various automatically generated CSV files.

The init_nlp function below initializes spaCy with custom configuration and data, such as company names, stock symbols, and exchange information. It also adds an entity_ruler to the pipeline, which lets spaCy identify specific entities in text, such as stocks and companies, based on predefined patterns. The entity_ruler is populated with patterns built from various CSV files.
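As a rough illustration of the entity_ruler idea on its own, here is a minimal sketch that is separate from the article's function. It assumes spaCy v3, uses a blank English pipeline so no model download is needed, and the labels "STOCK" and "COMPANY" and the sample patterns are hypothetical stand-ins for whatever the CSV files actually produce.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
# (The article itself loads "en_core_web_sm"; a blank pipeline keeps this sketch light.)
nlp = spacy.blank("en")

# In spaCy v3, components are added by their registered factory name.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical patterns: the labels and rows are illustrative only.
# A string pattern matches the exact text; a list pattern matches token by token.
patterns = [
    {"label": "STOCK", "pattern": "AAPL", "id": "AAPL"},
    {
        "label": "COMPANY",
        "pattern": [{"LOWER": "apple"}, {"LOWER": "inc"}],
        "id": "AAPL",
    },
]
ruler.add_patterns(patterns)

doc = nlp("Shares of AAPL rose after Apple Inc reported earnings.")

# Collect the matched text, its label, and the pattern id for each entity.
entities = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(entities)
```

Giving the ticker pattern and the company-name pattern the same "id" is one possible way to map both mentions back to a single symbol; whether the original pipeline does this is not stated in the article.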

import pandas as pd
import spacy

def init_nlp(
    exchange_data_path: str = "https://raw.githubusercontent.com/dli-invest/fin_news_nlp/main/nlp_articles/core/data/exchanges.tsv",
    indicies_data_path: str = "https://raw.githubusercontent.com/dli-invest/fin_news_nlp/main/nlp_articles/core/data/indicies.tsv",
):
    SPLIT_COMPANY_INTO_WORDS = False
    BEAR_MARKET_ADJUSTMENT = True
    nlp = spacy.load("en_core_web_sm")
    # Read the US ticker list (columns include Code and Name).
    ticker_df = pd.read_csv(
        "https://raw.githubusercontent.com/dli-invest/eod_tickers/main/data/us.csv"
    )
    # Keep rows that have both Code and Name, then exclude names containing "Wall Street".
    ticker_df = ticker_df.dropna(subset=['Code', 'Name'])
    ticker_df = ticker_df[~ticker_df.Name.str.contains("Wall Street", na=False)]
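The two filtering steps can be tried without network access on a small in-memory frame. This is only a sketch: the sample rows are made up for illustration and do not come from us.csv.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real ticker file.
sample_df = pd.DataFrame(
    {
        "Code": ["AAPL", None, "WSB", "MSFT"],
        "Name": ["Apple Inc", "No Code Corp", "Wall Street Media Co", None],
    }
)

# Drop rows where either Code or Name is missing.
sample_df = sample_df.dropna(subset=["Code", "Name"])

# Drop rows whose Name contains "Wall Street".
# na=False makes a missing Name count as "no match" instead of propagating NaN.
sample_df = sample_df[~sample_df.Name.str.contains("Wall Street", na=False)]

remaining_codes = sample_df["Code"].tolist()
print(remaining_codes)
```

Only the row with both fields present and no "Wall Street" in its name survives, so the frame that feeds the patterns contains just usable code and name pairs.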

The last lines of init_nlp shown above read a CSV file of stock ticker data with pandas. They then drop rows that are missing a value in either the “Code” or the “Name” column, as well as rows whose “Name” contains the string “Wall Street”. These operations…