{"id":2641,"date":"2022-10-25T13:44:06","date_gmt":"2022-10-25T13:44:06","guid":{"rendered":"https:\/\/nilg.ai\/?p=2641"},"modified":"2023-09-02T07:48:27","modified_gmt":"2023-09-02T07:48:27","slug":"duplicate-detection-text","status":"publish","type":"post","link":"https:\/\/nilg.ai\/pt\/202210\/duplicate-detection-text\/","title":{"rendered":"Duplicate detection in text data"},"content":{"rendered":"<p>A common use case across several industries is building systems that detect the similarity between pairs of objects, such as images or texts. For example, duplicate detection in marketplaces and recommendation systems that show objects similar to the ones the user has searched for can both rely on such systems. They can also be useful for detecting plagiarism in theses or articles, given the massive number of publications over the years. Since text is such a widespread data modality, duplicate detection in text is a critical task in Machine Learning.<\/p>\n<p>But how can we build these systems? A human can easily perceive the similarity between two sentences that are worded differently. For example, the sentences \u201cShe survived\u201d and \u201cShe did not die\u201d have the same meaning, so a text similarity algorithm is expected to return a very high similarity score for them. 
To do this, Machine Learning is the right path, but it\u2019s not that easy, due to the complexities of Natural Language Processing (NLP).<\/p>\n<p>This article describes two tools I developed during my curricular internship, under the supervision of Pedro Dias, a Data Scientist at NILG.AI.<\/p>\n<h3>Natural Language Processing and text modeling<\/h3>\n<p><span style=\"font-weight: 400;\">NLP is a subfield of artificial intelligence concerned with the interactions between computers and human language, particularly how to program computers to process and analyze large amounts of natural language data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Text modeling was the foundation of my work. It consists of analyzing text data to find the group of words, from a collection of documents, that best represents the information in that collection.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are many ways to extract features from text; the path chosen here was to use Word2Vec and Bag-of-N-gram.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Word2Vec is a method for obtaining word embeddings, that is, representations of words as vectors in a vector space for text analysis. Once trained, it can detect synonymous words or suggest additional words for a partial sentence.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Bag-of-N-gram is a technique that counts how many times each N-gram appears in a document. An N-gram is a sequence of N consecutive words, where N is any positive integer. For example, given the sentence \u201cHello neighbor next door,\u201d both \u201cHello neighbor\u201d and \u201cnext door\u201d are 2-grams, while \u201cHello neighbor next door\u201d is a 4-gram.<\/span><\/p>\n<h3><b>Duplicate Detection in Text Using Machine Learning<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The goal of one of the tools was to compute the <\/span><b>similarity between two texts <\/b><span style=\"font-weight: 400;\">entered by the user. 
To do that, an <\/span><b>abstract representation<\/b><span style=\"font-weight: 400;\"> of each text was created using different methods. The next step was to calculate the distance<\/span><span style=\"font-weight: 400;\">\u00a0between these abstract representations, which yields the <\/span><b>probability of being similar<\/b><span style=\"font-weight: 400;\">. The diagram below illustrates these steps.\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/text-similarity-1.svg\"><img decoding=\"async\" class=\"alignnone wp-image-2642 attachment-svg\" src=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/text-similarity-1.svg\" alt=\"\" width=\"1067\" height=\"287\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">The other tool allowed the user to enter a text and returned the 10 most similar texts from a bank of texts. This bank of texts comes from the <\/span><a href=\"https:\/\/paperswithcode.com\/dataset\/quora-question-pairs\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Quora Question Pairs<\/span><\/a><span style=\"font-weight: 400;\"> dataset.\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/text-similarity-2.svg\"><img decoding=\"async\" class=\"alignnone size-full wp-image-2644 attachment-svg\" src=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/text-similarity-2.svg\" alt=\"\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Three approaches were followed in developing the product: an unsupervised one, using Word2Vec and Bag-of-N-gram for the abstract representation; a supervised one, using Logistic Regression; and a third one that simulates a real-life situation, explained further on.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">All of these methods used <\/span><span style=\"font-weight: 400;\">the dataset mentioned above, shown in the figure below, to train the model.<\/span><\/p>\n<p><img 
decoding=\"async\" class=\"alignnone wp-image-2647 size-full\" src=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/dataset-beatriz-e1666705439217.png\" alt=\"\" width=\"788\" height=\"117\" srcset=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/dataset-beatriz-e1666705439217.png 788w, https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/dataset-beatriz-e1666705439217-300x45.png 300w, https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/dataset-beatriz-e1666705439217-600x89.png 600w, https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/dataset-beatriz-e1666705439217-768x114.png 768w\" sizes=\"(max-width: 788px) 100vw, 788px\" \/><\/p>\n<h3><b>Real-life simulation<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">In this approach, the initial dataset is modified. The second question is replaced by a synonymous phrase, but only in the rows labeled as duplicates. This simulates the case where two phrases use different but similar words and have the same meaning, as well as a scenario where there is no annotated data for the duplicates but we can still train a model.<\/span><\/p>\n<h3><b>Development of a REST API and a Web App<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Furthermore, two services were created: one that computes the similarity between two texts, and one that, given a text, returns the most similar texts from a bank of texts. These services were implemented for every method. The models were integrated into a REST API (backend) and a Dash interface (frontend); the final result is available in this <\/span><a href=\"http:\/\/3.11.75.12:8751\/\"><span style=\"font-weight: 400;\">dashboard<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h4><b>Automation of deployment<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Deployment was automated so that new versions can be released without much effort, giving a much more seamless way of delivering products in the industry. 
To put the above-mentioned services into production, <\/span><a href=\"https:\/\/www.terraform.io\/intro\"><span style=\"font-weight: 400;\">Terraform<\/span><\/a><span style=\"font-weight: 400;\"> was used to create an <\/span><a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/concepts.html\"><span style=\"font-weight: 400;\">EC2<\/span><\/a><span style=\"font-weight: 400;\"> machine and an <\/span><a href=\"https:\/\/docs.aws.amazon.com\/AmazonECR\/latest\/userguide\/what-is-ecr.html\"><span style=\"font-weight: 400;\">ECR<\/span><\/a><span style=\"font-weight: 400;\"> repository that stores the Docker images of each interface.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once the Docker images are built on a local computer and pushed to the ECR repository, all that is left is to access the automatically created EC2 instance and pull these images from the ECR repository. The image below clarifies this process further.<\/span><\/p>\n<p><a href=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/text-similarity-3.svg\"><img decoding=\"async\" class=\"alignnone size-full wp-image-2650 attachment-svg\" src=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/text-similarity-3.svg\" alt=\"\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">These steps are all bundled into a deployment script to make releases easier for a potential client.<\/span><\/p>\n<h3>Conclusions on my internship: Duplicate detection in text data<\/h3>\n<p><span style=\"font-weight: 400;\">I believe that the work on duplicate detection in text was carried out successfully, although, with more time, I could have improved the accuracy and reliability of the tools created.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This experience allowed me to be part of a project with many practical uses for companies or private entities, such as detecting duplicate data in order to delete it, or detecting plagiarism.<\/span><\/p>\n<p><span 
style=\"font-weight: 400;\">Overall, the internship put me more in touch with the business world and taught me a lot about machine learning, especially NLP and supervised and unsupervised learning. It also taught me how to carry out a deployment, so my learning was not limited to the artificial intelligence area.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Finally, I would like to thank my excellent tutor from NILG.AI, who was always willing to help and teach me throughout the semester.<\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>A common use case seen across several industries is the creation of systems capable of detecting the similarity between pairs of objects &#8211; images and texts. For example, duplicate detection in marketplaces, or recommendation systems that show similar objects to the ones the user has searched for, can use such systems. They can also be [&hellip;]<\/p>\n","protected":false},"author":116,"featured_media":2655,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[53],"tags":[48,88,45,168],"class_list":["post-2641","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technical","tag-ai4tech","tag-information-retrieval","tag-machine-learning","tag-nlp"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.8 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Duplicate detection in text data - NILG.AI<\/title>\n<meta name=\"description\" content=\"We explore multiple techniques for duplicate detection in text data using fuzzy similarity techniques based on embeddings.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/nilg.ai\/pt\/202210\/duplicate-detection-text\/\" \/>\n<meta property=\"og:locale\" 
content=\"pt_PT\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Duplicate detection in text data - NILG.AI\" \/>\n<meta property=\"og:description\" content=\"We explore multiple techniques for duplicate detection in text data using fuzzy similarity techniques based on embeddings.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/nilg.ai\/pt\/202210\/duplicate-detection-text\/\" \/>\n<meta property=\"og:site_name\" content=\"NILG.AI\" \/>\n<meta property=\"article:published_time\" content=\"2022-10-25T13:44:06+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-09-02T07:48:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/books-randomly-stacked-shelf.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1897\" \/>\n\t<meta property=\"og:image:height\" content=\"1265\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Beatriz Santos\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@nilg_ai\" \/>\n<meta name=\"twitter:site\" content=\"@nilg_ai\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Beatriz Santos\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutos\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/\"},\"author\":{\"name\":\"Beatriz Santos\",\"@id\":\"https:\/\/nilg.ai\/#\/schema\/person\/340c81f30fb632a2336f1a312ae4514a\"},\"headline\":\"Duplicate detection in text data\",\"datePublished\":\"2022-10-25T13:44:06+00:00\",\"dateModified\":\"2023-09-02T07:48:27+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/\"},\"wordCount\":947,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/nilg.ai\/#organization\"},\"keywords\":[\"AI4tech\",\"Information Retrieval\",\"Machine Learning\",\"NLP\"],\"articleSection\":[\"Technical\"],\"inLanguage\":\"pt-PT\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/\",\"url\":\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/\",\"name\":\"Duplicate detection in text data - NILG.AI\",\"isPartOf\":{\"@id\":\"https:\/\/nilg.ai\/#website\"},\"datePublished\":\"2022-10-25T13:44:06+00:00\",\"dateModified\":\"2023-09-02T07:48:27+00:00\",\"description\":\"We explore multiple techniques for duplicate detection in text data using fuzzy similarity techniques based on embeddings.\",\"inLanguage\":\"pt-PT\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/\"]}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/nilg.ai\/#website\",\"url\":\"https:\/\/nilg.ai\/\",\"name\":\"NILG.AI\",\"description\":\"Create ever-improving businesses with 
AI\",\"publisher\":{\"@id\":\"https:\/\/nilg.ai\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/nilg.ai\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"pt-PT\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/nilg.ai\/#organization\",\"name\":\"NILG.AI\",\"url\":\"https:\/\/nilg.ai\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"pt-PT\",\"@id\":\"https:\/\/nilg.ai\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/03\/logo.svg\",\"contentUrl\":\"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/03\/logo.svg\",\"caption\":\"NILG.AI\"},\"image\":{\"@id\":\"https:\/\/nilg.ai\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/twitter.com\/nilg_ai\",\"https:\/\/youtube.com\/@nilg_ai\",\"https:\/\/www.linkedin.com\/company\/nilg-ai\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/nilg.ai\/#\/schema\/person\/340c81f30fb632a2336f1a312ae4514a\",\"name\":\"Beatriz Santos\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"pt-PT\",\"@id\":\"https:\/\/nilg.ai\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/ccb3514a92bba9a8a10b016c2b78d8c5f5160675036dfa21cd361fad5580167f?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/ccb3514a92bba9a8a10b016c2b78d8c5f5160675036dfa21cd361fad5580167f?s=96&d=mm&r=g\",\"caption\":\"Beatriz Santos\"},\"url\":\"https:\/\/nilg.ai\/pt\/author\/beatriz\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Duplicate detection in text data - NILG.AI","description":"We explore multiple techniques for duplicate detection in text data using fuzzy similarity techniques based on embeddings.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/nilg.ai\/pt\/202210\/duplicate-detection-text\/","og_locale":"pt_PT","og_type":"article","og_title":"Duplicate detection in text data - NILG.AI","og_description":"We explore multiple techniques for duplicate detection in text data using fuzzy similarity techniques based on embeddings.","og_url":"https:\/\/nilg.ai\/pt\/202210\/duplicate-detection-text\/","og_site_name":"NILG.AI","article_published_time":"2022-10-25T13:44:06+00:00","article_modified_time":"2023-09-02T07:48:27+00:00","og_image":[{"width":1897,"height":1265,"url":"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/10\/books-randomly-stacked-shelf.jpg","type":"image\/jpeg"}],"author":"Beatriz Santos","twitter_card":"summary_large_image","twitter_creator":"@nilg_ai","twitter_site":"@nilg_ai","twitter_misc":{"Written by":"Beatriz Santos","Est. 
reading time":"5 minutos"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/#article","isPartOf":{"@id":"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/"},"author":{"name":"Beatriz Santos","@id":"https:\/\/nilg.ai\/#\/schema\/person\/340c81f30fb632a2336f1a312ae4514a"},"headline":"Duplicate detection in text data","datePublished":"2022-10-25T13:44:06+00:00","dateModified":"2023-09-02T07:48:27+00:00","mainEntityOfPage":{"@id":"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/"},"wordCount":947,"commentCount":0,"publisher":{"@id":"https:\/\/nilg.ai\/#organization"},"keywords":["AI4tech","Information Retrieval","Machine Learning","NLP"],"articleSection":["Technical"],"inLanguage":"pt-PT","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/nilg.ai\/202210\/duplicate-detection-text\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/","url":"https:\/\/nilg.ai\/202210\/duplicate-detection-text\/","name":"Duplicate detection in text data - NILG.AI","isPartOf":{"@id":"https:\/\/nilg.ai\/#website"},"datePublished":"2022-10-25T13:44:06+00:00","dateModified":"2023-09-02T07:48:27+00:00","description":"We explore multiple techniques for duplicate detection in text data using fuzzy similarity techniques based on embeddings.","inLanguage":"pt-PT","potentialAction":[{"@type":"ReadAction","target":["https:\/\/nilg.ai\/202210\/duplicate-detection-text\/"]}]},{"@type":"WebSite","@id":"https:\/\/nilg.ai\/#website","url":"https:\/\/nilg.ai\/","name":"NILG.AI","description":"Create ever-improving businesses with AI","publisher":{"@id":"https:\/\/nilg.ai\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/nilg.ai\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"pt-PT"},{"@type":"Organization","@id":"https:\/\/nilg.ai\/#organization","name":"NILG.AI","url":"https:\/\/nilg.ai\/","logo":{"@type":"ImageObject","inLanguage":"pt-PT","@id":"https:\/\/nilg.ai\/#\/schema\/logo\/image\/","url":"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/03\/logo.svg","contentUrl":"https:\/\/nilg.ai\/wp-content\/uploads\/2022\/03\/logo.svg","caption":"NILG.AI"},"image":{"@id":"https:\/\/nilg.ai\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/twitter.com\/nilg_ai","https:\/\/youtube.com\/@nilg_ai","https:\/\/www.linkedin.com\/company\/nilg-ai\/"]},{"@type":"Person","@id":"https:\/\/nilg.ai\/#\/schema\/person\/340c81f30fb632a2336f1a312ae4514a","name":"Beatriz Santos","image":{"@type":"ImageObject","inLanguage":"pt-PT","@id":"https:\/\/nilg.ai\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/ccb3514a92bba9a8a10b016c2b78d8c5f5160675036dfa21cd361fad5580167f?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/ccb3514a92bba9a8a10b016c2b78d8c5f5160675036dfa21cd361fad5580167f?s=96&d=mm&r=g","caption":"Beatriz 
Santos"},"url":"https:\/\/nilg.ai\/pt\/author\/beatriz\/"}]}},"_links":{"self":[{"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/posts\/2641","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/users\/116"}],"replies":[{"embeddable":true,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/comments?post=2641"}],"version-history":[{"count":10,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/posts\/2641\/revisions"}],"predecessor-version":[{"id":3434,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/posts\/2641\/revisions\/3434"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/media\/2655"}],"wp:attachment":[{"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/media?parent=2641"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/categories?post=2641"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nilg.ai\/pt\/wp-json\/wp\/v2\/tags?post=2641"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}