Data Science Pipeline

Data Product Pipeline

## Baleen

- There’s plenty of natural language all over the Internet that you can use to build a custom corpus IF ONLY YOU COULD GET IT
- One way to get it is Baleen, an RSS/Atom ingestion service that regularly scrapes the feeds listed in an OPML format file for new content and stores each post as a MongoDB document for your later consumption
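As a rough illustration of the first ingestion step, the sketch below reads feed URLs out of an OPML document using only the Python standard library. The sample OPML, the example.com URLs, and the `feeds_by_category` helper are all hypothetical, and grouping feeds under one top-level `outline` per category is an assumption about the file layout, not a description of Baleen's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML file: one top-level <outline> per category,
# with the individual feeds nested inside it.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Example feeds</title></head>
  <body>
    <outline text="news" title="news">
      <outline type="rss" text="Example News"
               xmlUrl="http://example.com/news/rss.xml"/>
    </outline>
    <outline text="sports" title="sports">
      <outline type="rss" text="Example Sports"
               xmlUrl="http://example.com/sports/atom.xml"/>
    </outline>
  </body>
</opml>
"""


def feeds_by_category(opml_text):
    """Return (category, feed title, feed URL) tuples found in an OPML string."""
    root = ET.fromstring(opml_text)
    feeds = []
    for category in root.findall("./body/outline"):
        label = category.get("title") or category.get("text")
        # iter() also visits the category element itself; it is skipped
        # below because it carries no xmlUrl attribute.
        for outline in category.iter("outline"):
            url = outline.get("xmlUrl")
            if url:
                feeds.append((label, outline.get("text"), url))
    return feeds


for category, title, url in feeds_by_category(SAMPLE_OPML):
    print(category, title, url)
```

Each tuple pairs a feed with the category it was listed under. An ingestion service would then fetch every URL on a schedule and store the new posts.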

Baleen Dashboard

Posts from Baleen feeds are labelled with their feedtype category, which means the data can be used for supervised training of a model that predicts a post’s category from its text features.
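To make the idea of predicting a category from text features concrete, here is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier written with the standard library only. The four training posts, their labels, and the `NaiveBayesText` class are invented for illustration. A real project would more likely train on the stored posts with an established machine learning library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesText:
    """Toy multinomial Naive Bayes over word counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)          # documents per category
        self.word_counts = defaultdict(Counter)    # word counts per category
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled posts standing in for documents from the corpus.
posts = [
    ("The team won the championship game last night", "sports"),
    ("The striker scored two goals in the final match", "sports"),
    ("The senate passed the budget bill after debate", "politics"),
    ("The president signed the new election law", "politics"),
]
model = NaiveBayesText().fit([text for text, _ in posts],
                             [label for _, label in posts])
print(model.predict("The team scored in the final game"))
```

The category label plays the role of the training target, and the word counts are the text features.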
### Formal vs. Natural Languages

#### Formal Languages

- Strict, unchanging rules defined by grammars and parsed by regular expressions
- Generally application specific (chemistry, math)
- Literal: exactly what is said is meant
- No ambiguity
- Parsable by regular expressions
- Inflexible: no new terms or meaning
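A small example of how a formal language yields to regular expressions is a toy grammar for flat chemical formulas such as `H2O`. The grammar and the `parse_formula` helper are hypothetical simplifications: they do not handle parentheses or check that a symbol names a real element.

```python
import re
from collections import Counter

# A formula is one or more terms; a term is an element symbol
# (capital letter, optional lowercase letter) followed by an optional count.
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
TERM = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return a dict of element symbol -> atom count, or raise ValueError."""
    if not FORMULA.fullmatch(formula):
        raise ValueError(f"not a valid formula: {formula!r}")
    counts = Counter()
    for element, digits in TERM.findall(formula):
        counts[element] += int(digits) if digits else 1
    return dict(counts)


print(parse_formula("H2O"))
print(parse_formula("CH3COOH"))
```

Every string either matches the grammar and has exactly one reading, or is rejected outright. Natural language offers no such guarantee.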
### Formal vs. Natural Languages

#### Natural Languages

- Flexible, evolving language that occurs naturally in human communication
- Unspecific and used in many domains and applications
- Redundant and verbose in order to make up for ambiguity
- Expressive
- Difficult to parse
- Very flexible even in narrow contexts
## Lemmatization

- Lemmatizer
- WordNet lexicon
- [class nltk.stem.wordnet.WordNetLemmatizer](http://www.nltk.org/_modules/nltk/stem/wordnet.html#WordNetLemmatizer)
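The sketch below shows one way the `WordNetLemmatizer` might be used. Its `lemmatize(word, pos)` method treats a word as a noun unless told otherwise, so the example maps Penn Treebank part-of-speech tags to WordNet's single-letter codes first. The helper names `penn_to_wordnet`, `lemmatize_tagged`, and `wordnet_lemmatizer` are hypothetical, and the code assumes NLTK is installed and the WordNet corpus can be downloaded.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object exposing lemmatize(word, pos)."""
    return [
        lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]


def wordnet_lemmatizer():
    """Build an NLTK WordNetLemmatizer, fetching the WordNet corpus if needed."""
    # Imported here so the helpers above can be used without NLTK installed.
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)
    return WordNetLemmatizer()
```

With NLTK available, `lemmatize_tagged([("geese", "NNS"), ("running", "VBG")], wordnet_lemmatizer())` looks each token up in the WordNet lexicon under the part of speech implied by its tag.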