October 12, 2020

Problem Statement

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so we can use them to determine which ads should populate on each page. Since this is a classification problem, we'll use Logistic Regression & Bayes models. Misclassifications in this case would be fairly harmless, so I will use the accuracy score and a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new all_text column
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
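The cleaning steps above can be sketched with pandas. This is a minimal illustration, not the notebook code: the three toy rows and the `title_word_count` / `selftext_word_count` column names are made up, and only `title`, `selftext`, and `all_text` come from the description above.

```python
import pandas as pd

# Toy stand-in for the scraped posts (made-up rows, not real data).
posts = pd.DataFrame({
    "title": ["First date tips?", "Need advice", "Is this normal"],
    "selftext": ["Where should we go", None, "We argue a lot lately"],
    "subreddit": ["dating_advice", "dating_advice", "relationship_advice"],
})

# Drop posts that have no body text.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and body into a single all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post, so the two subreddits can be compared.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Average word counts per subreddit.
print(posts.groupby("subreddit")[["title_word_count", "selftext_word_count"]].mean())
```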

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always pick the value that occurs most often, I will be right 63.3% of the time.
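The baseline is just the share of the most frequent class. A minimal sketch, using made-up labels chosen so that the majority class is 19 of 30 posts (about 63.3%); the real class counts are not given here, and `baseline_accuracy` is a hypothetical helper name:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Made-up labels: 19 posts in one class, 11 in the other.
toy_labels = [1] * 19 + [0] * 11
print(round(baseline_accuracy(toy_labels), 3))  # 0.633
```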

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74

Second attempt: tried CountVectorizer with stemming preprocessing on the first set of scrapes; pretty bad score with high variance. Train 99%, test 72%

  • tried to decrease max features and the score got worse
  • tried lemmatizer preprocessing instead and the test score went up to 74%

Simply increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
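A rough sketch of that setup, assuming scikit-learn: a stratified split followed by a CountVectorizer with min_df=3 and ngram_range=(1, 2) feeding a logistic regression. The twelve-post corpus, the label encoding, and random_state are made up for illustration, so the scores it prints say nothing about the real results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; 1 = relationship advice, 0 = dating advice (assumed encoding).
texts = [
    "my boyfriend and i keep arguing about money",
    "my boyfriend never listens to me",
    "my girlfriend and i keep arguing",
    "my husband and i keep arguing about chores",
    "my boyfriend and i broke up",
    "my wife and i keep arguing",
    "first date ideas for this weekend",
    "what to text after a first date",
    "first date went well what now",
    "how to ask for a second date",
    "nervous about my first date",
    "first date outfit ideas",
]
labels = [1] * 6 + [0] * 6

# stratify keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=4, stratify=labels, random_state=42
)

# Unigrams and bigrams that appear in at least 3 training posts.
pipeline = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"train accuracy: {train_score:.2f} | test accuracy: {test_score:.2f}")
```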

I think Tfidf worked the best to decrease my overfitting (a variance problem) because I customized the stop words to remove the ones that were too frequent to be predictive. This was a success; however, with more time I probably could've tweaked them a bit more to increase all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Raising the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to get rid of oversaturated words as well. Finally, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus on just the most frequently used words of what was left.

Conclusion and Recommendations

Although I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the TfidfVectorizer to see if different stop words produce a different or …



Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice & Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.