Abstract
Introduction
Chapter 1. Project statement
1.1 Review of online advertising market
1.2 Quiet Media company and industry information
1.3 QIT GLOBAL PAAS platform description
1.4 The problem of inappropriate site filtering using site classification
1.5 Project specifications
1.6 Project plan
Chapter 2. Analysis of the current state of the market and technologies for inappropriate site filtering
2.1 Scientific approaches to inappropriate site classification
2.2 Neural Network Algorithm
2.3 Commercial software for inappropriate site classification
Chapter 3. Development of a website classification model for Quiet media
3.1 Data collection and preprocessing
3.2 Development of a model for website classification
Chapter 4. Business analysis of the developed model deployment
4.1 Market analysis
4.2 Cost analysis and financial model
4.3 Risk analysis of model deployment
4.4 Business recommendation
Conclusion
References
Appendix
Since the start of the 21st century, the volume of information has exploded, and information overload has become one of the most severe problems of the Internet. According to the real-time website counter of Internet Live Stats, as of April 2022 there were about 1.9 billion websites on the Internet. Invalid, blank, malicious, and inappropriate websites are mixed in among them. They not only increase the load on the Internet but also burden its users, for whom filtering and identifying relevant information has become difficult.
Telecommunications carriers are telecommunications service providers that deliver wireless voice and data communications to their mobile subscribers (Techopedia, 2018). They play an indispensable role in our information-driven lives. However, the rapid growth of information volume has become the biggest threat to the telecom industry, while slow growth in the subscriber base makes it hard for telcos to improve business performance. Keeping traffic stable and service quality high requires not only advanced technology but also prohibitive costs. In this situation, the main opportunity for operators lies in the volume of additional services provided to subscribers and in the optimization of operational activity.
For Internet users, overloaded website information is undoubtedly a heavy burden as well. Information is transmitted indiscriminately and includes inappropriate and malicious websites and commercial advertisements, which are superfluous, useless, or even harmful to adolescents. The abundance of advertising banners slows network transmission and clutters the network environment. Digital advertisers, for their part, can run commercial ads everywhere but still face low business efficiency because of fierce competition. Unable to focus on their target customers, they bear high expenses for digital advertising.
This is how the industry of developers for telecommunication companies came into being. Such developers provide comprehensive, multifaceted services: they profile subscribers on the basis of collected data to enable precisely targeted advertising, offer banner-blocking services that reduce advertising and create a cleaner Internet environment, help filter spurious tracking requests, and turn the telco into a valuable player in the advertising market.
Although developers for telecommunication companies provide a variety of services to diverse clients and the industry is clearly needed, few academic papers analyze or study it.
Our project was initiated by Quiet Media, a developer for telecommunication operators. Our research provides a model that helps the company distinguish sites with inappropriate content and improves its advertising business process. It relies on machine learning and deep learning techniques to build a binary classification model.
The main research goal of this study is to build a high-performance text classification model that allows Quiet Media to automatically identify sites with inappropriate content, and to assess the model from a business perspective on the basis of risk analysis.
The task of our project group is to provide a technical improvement to the services that Quiet Media delivers to telecommunication operators. First, we researched the advertising market and described how Quiet Media operates. Second, focusing on the problem to be solved, we reviewed the related literature to identify the most suitable solution for our project. After this preparatory work, we familiarized ourselves with the data set provided by our project sponsor, collected and preprocessed the data to clean and integrate it, and then developed a model for website classification. Finally, on the basis of model evaluation and risk analysis from a business perspective, we considered model deployment and provided business recommendations for Quiet Media.
To achieve this goal, the project addresses the following tasks:
• collect and preprocess data for further model development by applying data crawling, data cleaning, data construction, feature selection, and data transformation procedures (data preprocessing stage);
• develop a model to classify the content of websites by selecting the optimal classifier using specified evaluation metrics (model development stage);
• apply risk and cost analysis methods to assess the developed model from a business perspective and to consider its risks and challenges in the commercial market (business analysis stage).
In pursuing these tasks, various data preprocessing and model development tools were applied, including the JupyterHub environment with Python libraries (Scikit-Learn).
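As an illustration of the model development stage, the following minimal sketch compares candidate classifiers on TF-IDF features using stratified cross-validation and the metrics named in this work (precision, recall, F1, AUC-ROC). It is not the project's actual code: the toy corpus, the labels, the two candidate classifiers, and the helper name evaluate_candidates are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = inappropriate site, 0 = appropriate site.
texts = [
    "adult explicit videos free adult content",
    "explicit adult chat and explicit photos",
    "free adult webcam explicit content",
    "adult only explicit gallery videos",
    "explicit content adult photos gallery",
    "adult webcam chat free explicit videos",
    "daily news weather and sports results",
    "online shop for books and garden tools",
    "weather forecast and local news today",
    "recipes for cooking healthy family meals",
    "sports results and football news today",
    "garden tools shop and cooking recipes",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Candidate classifiers (an assumption; the thesis compares its own set).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}

METRICS = ["precision", "recall", "f1", "roc_auc"]


def evaluate_candidates(candidates, texts, labels, n_splits=3):
    """Return the mean cross-validated score per metric for each candidate."""
    # Stratified folds keep the class ratio equal in every split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for name, classifier in candidates.items():
        # The vectorizer sits inside the pipeline, so TF-IDF statistics
        # are fitted on the training folds only.
        model = make_pipeline(TfidfVectorizer(), classifier)
        scores = cross_validate(model, texts, labels, cv=cv, scoring=METRICS)
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in METRICS}
    return results


results = evaluate_candidates(candidates, texts, labels)
best_name = max(results, key=lambda name: results[name]["roc_auc"])
print(best_name, results[best_name])
```

Selecting by mean AUC-ROC is only one possible criterion; when missing an inappropriate site costs more than a false alarm, recall may deserve more weight.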
The research has both practical and business value. In practical terms, we create a classification model that analyzes website content to separate toxic websites from useful ones. In business terms, the research applies risk analysis and cost analysis to assess the model's performance in the business process of advertising....
This thesis builds a high-performance text classification model that allows Quiet Media to automatically identify websites with inappropriate content, and evaluates the model through risk analysis and cost analysis in order to improve Quiet Media's advertising business. Drawing on a review of the related literature, a survey of the market environment, industry research, knowledge of machine learning and neural networks, and an understanding of the problem, the work offers a constructive solution to the company's practical problem.
In the data preparation and model building stages, we covered all the steps: data crawling, data preprocessing, text vectorization, and preliminary data analysis and exploration. We used modules such as BeautifulSoup and urllib to extract and collect data from websites, preprocessed the data with regular expressions, and vectorized the text content with TF-IDF. Curve fitting and word clouds were used to extract insights from the data and gave us a better understanding of its structure. In the model building stage, cross-validation was used to limit the potential impact of imbalance between the categories of the data set. Multiple metrics, namely recall, precision, F1-measure, and AUC-ROC, were used to evaluate model performance and to choose the best algorithm, parameters, and model. Although factors such as the structural quality of a site's content or the quality of the data cannot be guaranteed, we carefully compared multiple approaches at each step and chose the one that best fit our model. In the end, we obtained an AUC-ROC score of 0.989 with the LSTM model. In addition, a pipeline was built to test the behavior of the model and to compare it with manual classification. The results show that neural networks have advantages over manual classification and are better suited to classifying large-scale data. At the same time, they save considerable human resources and time, thus increasing corporate profits.
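The crawling and cleaning steps can be sketched as follows. This is a minimal illustration under stated assumptions: Python's standard-library html.parser stands in for BeautifulSoup so that the example has no third-party dependencies, and the names fetch_html, VisibleTextParser, and html_to_clean_text, the sample markup, and the letters-only cleaning rule are hypothetical rather than taken from the project.

```python
import re
import urllib.request
from html.parser import HTMLParser


def fetch_html(url, timeout=10):
    """Download a page with urllib (defined for completeness, not called below)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_clean_text(html):
    """Extract visible text, lowercase it, and keep only letters and spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


# Hypothetical sample page used instead of a live request.
sample_html = (
    "<html><head><title>Shop News</title><style>p{color:red}</style></head>"
    "<body><p>Visit us: 25% OFF today!</p><script>var x = 1;</script></body></html>"
)
cleaned = html_to_clean_text(sample_html)
print(cleaned)  # shop news visit us off today
```

The cleaned strings are the kind of input that a TF-IDF vectorizer would then turn into feature vectors. The letters-only rule assumes English text; multilingual pages would need a different pattern.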
In terms of product promotion, we used the P / MF model to identify the market for a lightweight pornographic website classification tool and focused on this market segment. Through communication with Quiet Media, we narrowed the focus further to the segment of pornographic web page screening for advertising. We also analyzed the cost budget and the risks in the product deployment section. From the financial model built for budget evaluation, we conclude that in the long run the cost of a model built with data science methods is lower than the cost of manual labeling. For the risk assessment, following basic project management theory, we classified the risks by type; overall, the risk level of our classification model is low.
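The long-run cost argument can be made concrete with a simple break-even calculation: a model carries a fixed development cost but a low cost per classified site, whereas manual labeling has no fixed cost but a higher cost per site. The sketch below is illustrative only. All figures are placeholder values, not numbers from the financial model of this thesis, and the function names are hypothetical.

```python
import math


def manual_cost(n_sites, cost_per_site):
    """Cumulative cost of labeling n_sites by hand."""
    return n_sites * cost_per_site


def model_cost(n_sites, fixed_cost, cost_per_site):
    """Cumulative cost of the model: one-off development plus running cost."""
    return fixed_cost + n_sites * cost_per_site


def break_even_sites(fixed_cost, manual_per_site, model_per_site):
    """Smallest number of sites at which the model is no more costly than
    manual labeling, or None if the model never catches up."""
    saving_per_site = manual_per_site - model_per_site
    if saving_per_site <= 0:
        return None
    return math.ceil(fixed_cost / saving_per_site)


# Placeholder inputs (assumptions, in arbitrary currency units).
FIXED_COST = 10_000.0    # development and deployment
MANUAL_PER_SITE = 0.50   # human labeling of one site
MODEL_PER_SITE = 0.01    # compute for classifying one site

n_star = break_even_sites(FIXED_COST, MANUAL_PER_SITE, MODEL_PER_SITE)
print(n_star)  # 20409
```

Below the break-even volume manual labeling remains cheaper, so the conclusion depends on the number of sites the company expects to classify.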
At the end of the paper, we combine Quiet Media's actual situation with the characteristics of our model and provide business recommendations for the company in three aspects, including technology and platform services.
In further research, the model could be extended with multilingual support and adapted to more application scenarios. Training on a larger data set should raise accuracy. In addition, most adult websites will eventually be closed, so the training data set needs to be updated as the tool iterates.
1. Aggarwal, C.C., & Zhai, C. (2012). Mining Text Data. Springer US.
2. Araba, A.M., Memon, Z.A., Alhawat, M., Ali, M., & Milad, A. (2021). Estimation at Completion in Civil Engineering Projects: Review of Regression and Soft Computing Models. Knowledge-Based Engineering and Sciences.
3. Bank of Russia. (2022, April 23). Official exchange rates on selected date. https://www.cbr.ru/eng/currency_base/daily/
4. Brown, J.A., & Wisco, J.J. (2019). The components of the adolescent brain and its unique sensitivity to sexually explicit material. Journal of adolescence, 72, 10-13.
5. Buber, E., & Diri, B. (2019). Web page classification using RNN. Procedia Computer Science, 154, 62-72.
6. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T.P., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide.
7. Chen, H., Ma, Q., Lin, Z., & Yan, J. (2021). Hierarchy-aware Label Semantics Matching Network for Hierarchical Text Classification. ACL.
8. Chinnadurai, J. (2017). A Framework for Detecting Phishing Websites using EDA algorithm and URL based Website Classification. International Research Journal of Innovations in Engineering and Technology, 1(2), 10.
9. Dong, K., Guo, L., & Fu, Q. (2014). An adult image detection algorithm based on Bag-of-Visual-Words and text information. 2014 IEEE 10th International Conference on Natural Computation (ICNC), 556-560.
10. Duncan, W.R. (1996). A Guide to the Project Management Body of Knowledge.
11. Espinosa-Leal, L., Akusok, A., Lendasse, A., & Bjork, K. (2019). Website Classification from Webpage Renders.
12. Fanaei, M.A., Tahmasbi-Sarvestani, A., Fallah, Y.P., Bansal, G., Valenti, M.C., & Kenney, J.B. (2014). Adaptive content control for communication amongst cooperative automated vehicles. 2014 IEEE 6th International Symposium on Wireless Vehicular Communications (WiVeC 2014), 1-7.
13. Glazkova, A., Egorov, Y., & Glazkov, M. (2020). A Comparative Study of Feature Types for Age-Based Text Classification. AIST.
14. Goodfellow, I.J., Bengio, Y., & Courville, A.C. (2015). Deep Learning. Nature, 521, 436-444.
15. Haddadi, H., Hui, P., Henderson, T., & Brown, I. (2011). Targeted Advertising on the Handset: Privacy and Security Challenges. Pervasive Advertising....47