It is often said that software developers should create simple solutions to the problems they are presented with. However, coming up with a simple solution is not easy: it requires time, experience, and a good approach. To make matters worse, a simple solution will rarely impress your co-workers or give your resume a boost. Ironically, the quest for simplicity in software development is often a complex journey. A developer must navigate a labyrinth of technical constraints, user requirements, and evolving technological landscapes. The catch-22 is palpable: a simple solution is desirable, yet it is neither easily attained nor universally appreciated. In a field where complexity often disguises itself as sophistication, simple solutions may not receive the awe and admiration they deserve; they can go unnoticed in a culture that frequently equates complexity with competence. The pursuit of simplicity can therefore be a thankless endeavor. In an environment where complex designs and elaborate architectures are celebrated, a minimalist approach might not captivate colleagues or stand out in a portfolio. This dichotomy presents a unique challenge for software developers who want to balance the art of simplicity with the practicalities of career advancement and peer recognition. Below, I share my personal experience grappling with the "curse of simplicity." It sheds light on the nuanced realities of being a software developer committed to simplicity in a world that often rewards complexity.

The Story

Several years ago, I was part of a Brazilian startup confronted with a daunting issue. The accounting report crucial for tax payments to São Paulo's city administration had been rendered dysfunctional by numerous changes in its data sources, which stemmed from shifts in payment structures with the company's partners. The situation escalated when the sole analyst responsible for manually generating the report went on vacation, leaving the organization vulnerable to substantial fines from the city hall. To solve the problem, the company's CFO convened a small committee to put forward a solution. I argued against revisiting the complex, defunct legacy solution and proposed a simpler approach: "one big table" with all the columns necessary for the report, where each row has the granularity of a single transaction. The report could then be generated by flattening the data with a simple query, and loading the data into the table would be handled by a simple, secure, and replicable process. My team concurred with the proposal and embarked on its implementation, following two fundamental principles:

- The solution had to be altruistic and crafted for others to use and maintain.
- It had to be code-centric, with automated deployment and code reviews through Pull Requests (PRs).

We selected Python as our programming language because the data analysis team was already familiar with it and it has a reputation for being easy to master. In our tool exploration, we came across Airflow, which had been gaining popularity even before its 1.0 release. Airflow uses DAGs (Directed Acyclic Graphs) to construct workflows, where each step is executed by what is termed an "operator."
Our team developed two straightforward operators: one for transferring data between tables in different databases, and another for schema migration. This approach allowed for local testing of DAG changes, with the deployment process consisting of Pull Requests followed by a CI/CD pipeline that pushed changes to production. The schema migration operator closely resembled Ruby on Rails migrations. We hosted Airflow on AWS Elastic Beanstalk, and Jenkins was employed for the deployment pipeline. During this period, Metabase was already operational for querying databases. Within two to three weeks, our solution was up and running. The so-called "one big table" effectively provided the accounting report. It was user-friendly and, most crucially, comprehensible to everyone involved. The data analysis team, thrilled by the architecture, began adopting this infrastructure for all their reporting needs. A year down the line, the landscape had transformed significantly, with dozens of DAGs in place, hundreds of reporting tables created, and thousands of schema migration files in existence.

Synopsis of the Solution

In essence, our simple solution might not have seemed fancy, but it was super effective. It allowed the data analysis team to generate reports more quickly and easily, and it saved the company money on fines. The concept of the "curse of simplicity" in software development is a paradoxical phenomenon. It suggests that solutions that appear simple on the surface are often undervalued, especially when compared to their more complex counterparts, which I like to refer to as "complex megazords." The journey of developing a straightforward yet effective solution was an eye-opener for me, and it altered my perspective on the nature of simplicity in problem-solving. There's a common misconception that simple equates to easy; the reality is quite the contrary. As the example above demonstrates, crafting a solution that is both simple and effective requires a deep understanding of the problem, a sophisticated level of knowledge, and a wealth of experience. It's about distilling complex ideas and processes into their most essential form without losing their effectiveness. What I've come to realize is that simple solutions, though they may seem less impressive at first glance, are often superior. Their simplicity makes them more accessible and easier to understand, maintain, and use. This accessibility is crucial in a world where technology is rapidly evolving and user-friendly, maintainable solutions are needed.
SIEM solutions didn't work perfectly when they were first introduced in the early 2000s, partly because of their architecture and functionality at the time, but also due to faults in the data and data sources that were fed into them. During this period, data inputs were often rudimentary, lacked scalability, and necessitated extensive manual intervention across operational phases. Three of those data sources stood out.

1. Hand-Coded Application Layer Security

Coincidentally, application layer security became a concern around the time SIEM solutions were first introduced. It had become obvious that defending the perimeter, hosts, and endpoints was not sufficient security for applications, so some developers experimented with manually coding application security layers to bolster protection against functionality-specific attacks. While this approach provided an additional security layer, it failed to provide SIEM solutions with accurate data. The developers were accustomed to writing code to handle use cases, not abuse cases, so they lacked the experience and knowledge to anticipate all likely attacks and write the complex code needed to collect, or authorize access to, data related to those attacks. Moreover, many sophisticated attacks necessitated correlating events across multiple applications and data sources, which was beyond what hand-coded security in individual applications could monitor.

2. SPAN and TAP Ports

SPAN ports, also known as mirror ports or monitor ports, were configured on network switches or routers to copy and forward traffic from one or more source ports to a designated monitoring port. They operated within the network infrastructure and allowed admins to monitor network traffic without disrupting the flow of data to the intended destination. TAP ports, on the other hand, were hardware devices that passively captured and transmitted network traffic from one network segment to another. TAPs operated independently of network switches and routers but still provided complete visibility into network traffic regardless of network topology or configuration. Despite offering complete visibility into network traffic, these ports fell out of favor in SIEM integration due to their deficiency in contextual information. The raw packet data that SPAN and TAP ports collected lacked the necessary context for effective threat detection and analysis, alongside challenges such as limited network visibility, complex configuration, and inadequate capture of encrypted traffic.

3. The 2000s REST API

As a successor to the SOAP API, the REST API revolutionized data exchange with its simplicity, speed, efficiency, and statelessness. Aligned with the rise of cloud solutions, REST APIs served as an ideal conduit between SIEM and cloud environments, offering standardized access to diverse data sources. However, they had downsides, one of which was network efficiency. REST APIs sometimes over-fetched or under-fetched data, which resulted in inefficient data transfer between the API and the SIEM solution. There was also the issue of evolving schemas: without a strongly typed schema, SIEM solutions found it difficult to accurately map incoming data fields to the predefined schema, leading to parsing errors or data mismatches. Then there was the issue of complexity and the learning curve.
REST API implementation is known to be complex, especially in managing authentication, pagination, rate limiting, and error handling. Because of this complexity, security analysts and admins responsible for configuring SIEM data sources found it difficult to handle these integrations effectively, or required additional training to do so. This also led to configuration errors, which in turn affected data collection and analysis. While some of the above data sources have not been completely scrapped, their technologies have been greatly improved, and they now offer far more seamless integrations.

Most Recently Used SIEM Data Sources

1. Cloud Logs

The cloud was introduced in 2006 when Amazon launched AWS EC2, followed by Salesforce's Service Cloud solution in 2009. It offers unparalleled scalability, empowering organizations to manage vast volumes of log data effortlessly. Additionally, it provides centralized logging and monitoring capabilities, streamlining data collection and analysis for SIEM solutions. With built-in security features and compliance controls, cloud logs enable SIEM solutions to swiftly detect and respond to security threats. However, challenges accompany these advantages. According to Adam Praksch, a SIEM administrator at IBM, SIEM solutions often struggle to keep pace with the rapid evolution of cloud solutions, resulting in the accumulation of irrelevant events or inaccurate data. Furthermore, integrating SIEM solutions with both on-premises and cloud-based systems increases complexity and cost, as noted by Mohamed El Bagory, a SIEM Technical Instructor at LogRhythm. Notwithstanding, El Bagory acknowledged the vast potential of cloud data for SIEM solutions, emphasizing the need to explore beyond basic information from SSH logins and Chrome tabs to include data from command lines and process statistics.

2. IoT Device Logs

As Praksch rightly said, any IT or OT technology that creates logs or reports about its operation is already used for security purposes. IoT devices are known to generate a wealth of rich data about their operations, interactions, and environments. Renowned for producing diverse data types such as logs, telemetry, and alerts, they are considered a SIEM solution's favorite data source. This data diversity allows SIEM solutions to analyze different aspects of the network and identify anomalies or suspicious behavior.

Conclusion

As Praksch put it, "The more data a SIEM solution can work with, the higher its chances of successfully monitoring an organization's environment against cyber threats." So, while most SIEM data sources date back to the inception of the technology, they have gone through several stages of evolution to make sure they extract accurate and meaningful data for threat detection.
In this post, you will take a closer look at embedding documents to be used for a semantic search. By means of examples, you will learn how embedding influences the search result and how you can improve the results. Enjoy!

Introduction

In a previous post, a chat with documents using LangChain4j and LocalAI was discussed. One of the conclusions was that the document format has a large influence on the results. In this post, you will take a closer look at the influence of the source data and the way it is embedded in order to get a better search result. The source documents are two Wikipedia documents: the discography and the list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts and are mainly in a table format. The same documents were used in the previous post, so it will be interesting to see how the findings from that post compare to the approach used in this post. This blog can be read without reading the previous blogs if you are familiar with the concepts used. If not, it is recommended to read the previous blogs as mentioned in the prerequisites paragraph. The sources used in this blog can be found on GitHub.

Prerequisites

The prerequisites for this blog are:
- Basic knowledge of embeddings and vector stores
- Basic Java knowledge: Java 21 is used
- Basic knowledge of LangChain4j - see the previous blogs: How to Use LangChain4j With LocalAI and LangChain4j: Chat With Documents

You need LocalAI if you want to run the examples at the end of this blog. See a previous blog on how you can make use of LocalAI. Version 2.2.0 is used for this blog.

Embed Whole Document

The easiest way to embed a document is to read the document, split it into chunks, and embed the chunks. Embedding means transforming the text into vectors (numbers). The question you will ask also needs to be embedded. The vectors are stored in a vector store, which is able to find the results that are closest to your question and will respond with these results. The source code consists of the following parts:
- The text needs to be embedded. An embedding model is needed for that; for simplicity, use the AllMiniLmL6V2EmbeddingModel. This model is based on BERT, a popular embedding model.
- The embeddings need to be stored in an embedding store. Often, a vector database is used for this purpose, but in this case, you can use an in-memory embedding store.
- Read the two documents and add them to a DocumentSplitter. Here you define that the documents are split into chunks of 500 characters with no overlap.
- By means of the DocumentSplitter, the documents are split into TextSegments.
- The embedding model is used to embed the TextSegments. The TextSegments and their embedded counterparts are stored in the embedding store.
- The question is also embedded with the same model.
- Ask the embedding store to find embedded segments relevant to the embedded question. You can define how many results the store should retrieve. In this case, only one result is asked for.
If a match is found, the following information is printed to the console:
- The score: A number indicating how well the result corresponds to the question
- The original text: The text of the segment
- The metadata: Shows the document the segment comes from

Java
private static void askQuestion(String question) {
    EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
    EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

    // Read and split the documents into segments of 500 characters
    Document springsteenDiscography = loadDocument(toPath("example-files/Bruce_Springsteen_discography.pdf"));
    Document springsteenSongList = loadDocument(toPath("example-files/List_of_songs_recorded_by_Bruce_Springsteen.pdf"));
    ArrayList<Document> documents = new ArrayList<>();
    documents.add(springsteenDiscography);
    documents.add(springsteenSongList);
    DocumentSplitter documentSplitter = DocumentSplitters.recursive(500, 0);
    List<TextSegment> documentSegments = documentSplitter.splitAll(documents);

    // Embed the segments
    Response<List<Embedding>> embeddings = embeddingModel.embedAll(documentSegments);
    embeddingStore.addAll(embeddings.content(), documentSegments);

    // Embed the question and find relevant segments
    Embedding queryEmbedding = embeddingModel.embed(question).content();
    List<EmbeddingMatch<TextSegment>> embeddingMatch = embeddingStore.findRelevant(queryEmbedding, 1);
    System.out.println(embeddingMatch.get(0).score());
    System.out.println(embeddingMatch.get(0).embedded().text());
    System.out.println(embeddingMatch.get(0).embedded().metadata());
}

The questions are the following, and are some facts that can be found in the documents:

Java
public static void main(String[] args) {
    askQuestion("on which album was \"adam raised a cain\" originally released?");
    askQuestion("what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?");
    askQuestion("what is the highest chart position of the album \"tracks\" in canada?");
    askQuestion("in which year was \"Highway Patrolman\" released?");
    askQuestion("who produced \"all or nothin' at all?\"");
}

Question 1

The following is the result for question 1: "On which album was 'Adam Raised a Cain' originally released?"

Shell
0.6794537224516205
Jim Cretecos 1973 [14] "57 Channels (And Nothin' On)" Bruce Springsteen Human Touch Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan 1992 [15] "7 Rooms of Gloom" (Four Tops cover) Holland–Dozier– Holland † Only the Strong Survive Ron Aniello Bruce Springsteen 2022 [16] "Across the Border" Bruce Springsteen The Ghost of Tom Joad Chuck Plotkin Bruce Springsteen 1995 [17] "Adam Raised a Cain" Bruce Springsteen Darkness on the Edge of Town Jon Landau Bruce Springsteen Steven Van
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=4, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }

What do you see here?
- The score is 0.679…, which indicates how closely the segment corresponds to the question: the higher the score, the better the match.
- The segment itself contains the requested information, and the correct segment is chosen - this is great.
- The metadata shows the document where the segment comes from.
- You also see how the table is transformed into a text segment: it isn't a table anymore. In the source document, this information is formatted as a table.

Another thing to notice is where the text segment is split. If you had asked who produced this song, the answer would be incomplete, because this row is split in column 4.
Question 2 The following is the result for question 2: "What is the highest chart position of 'Greetings from Asbury Park, NJ' in the US?" Shell 0.6892728817378977 29. Greetings from Asbury Park, N.J. (LP liner notes). Bruce Springsteen. US: Columbia Records. 1973. KC 31903. 30. Nebraska (LP liner notes). Bruce Springsteen. US: Columbia Records. 1982. TC 38358. 31. Chapter and Verse (CD booklet). Bruce Springsteen. US: Columbia Records. 2016. 88985 35820 2. 32. Born to Run (LP liner notes). Bruce Springsteen. US: Columbia Records. 1975. PC 33795. 33. Tracks (CD box set liner notes). Bruce Springsteen. Europe: Columbia Records. 1998. COL 492605 2 2. Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=100, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} } The information is found in the correct document, but the wrong text segment is found. This segment comes from the References section and you needed the information from the Songs table, just like for question 1. Question 3 The following is the result for question 3: "What is the highest chart position of the album 'Tracks' in Canada?" Shell 0.807258199400863 56. @billboardcharts (November 29, 2021). "Debuts on this week's #Billboard200 (1/2)..." (https://twitter.com/bil lboardcharts/status/1465346016702566400) (Tweet). Retrieved November 30, 2021 – via Twitter. 57. "ARIA Top 50 Albums Chart" (https://www.aria.com.au/charts/albums-chart/2021-11-29). Australian Recording Industry Association. November 29, 2021. Retrieved November 26, 2021. 58. "Billboard Canadian Albums" (https://www.fyimusicnews.ca/fyi-charts/billboard-canadian-albums). Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=142, file_name=Bruce_Springsteen_discography.pdf, document_type=PDF} } The information is found in the correct document, but also here, the segment comes from the References section, while the answer to the question can be found in the Compilation albums table. This can explain some of the wrong answers that were given in the previous post. Question 4 The following is the result for question 4: "In which year was 'Highway Patrolman' released?" Shell 0.6867325432140559 "Highway 29" Bruce Springsteen The Ghost of Tom Joad Chuck Plotkin Bruce Springsteen 1995 [17] "Highway Patrolman" Bruce Springsteen Nebraska Bruce Springsteen 1982 [30] "Hitch Hikin' " Bruce Springsteen Western Stars Ron Aniello Bruce Springsteen 2019 [53] "The Hitter" Bruce Springsteen Devils & Dust Brendan O'Brien Chuck Plotkin Bruce Springsteen 2005 [24] "The Honeymooners" Bruce Springsteen Tracks Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt 1998 [33] [76] "House of a Thousand Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=31, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} } The information is found in the correct document and the correct segment is found. However, it is difficult to retrieve the correct answer because of the formatting of the text segment, and you do not have any context about what the information represents. The column headers are gone, so how should you know that 1982 is the answer to the question? Question 5 The following is the result for question 5: "Who produced 'All or Nothin’ at All'?" 
Shell
0.7036564758755796
Zandt (assistant) 1978 [18] "Addicted to Romance" Bruce Springsteen She Came to Me (soundtrack) Bryce Dessner 2023 [19] [20] "Ain't Good Enough for You" Bruce Springsteen The Promise Jon Landau Bruce Springsteen 2010 [21] [22] "Ain't Got You" Bruce Springsteen Tunnel of Love Jon Landau Chuck Plotkin Bruce Springsteen 1987 [23] "All I'm Thinkin' About" Bruce Springsteen Devils & Dust Brendan O'Brien Chuck Plotkin Bruce Springsteen 2005 [24] "All or Nothin' at All" Bruce Springsteen Human Touch
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=5, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }

The information is found in the correct document, but again, the segment is split in the row where the answer can be found. This can explain the incomplete answers that were given in the previous post.

Conclusion

Two answers are correct, one is partially correct, and two are wrong.

Embed Markdown Document

What would change when you convert the PDF documents into Markdown files? Tables are easier to recognize in Markdown files than in PDF documents, and they allow you to segment the document at the row level instead of at some arbitrary chunk size. Only the parts of the documents that contain the answers to the questions are converted; this means the Studio albums and Compilation albums from the discography and the List of songs recorded. The segmenting is done as follows:
- Split the document line by line.
- Retrieve the data of the table in the variable dataOnly.
- Save the header of the table in the variable header.
- Create a TextSegment for every row in dataOnly and add the header to the segment.

The source code is as follows:

Java
List<Document> documents = loadDocuments(toPath("markdown-files"));

List<TextSegment> segments = new ArrayList<>();
for (Document document : documents) {
    String[] splittedDocument = document.text().split("\n");
    String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length);
    String header = splittedDocument[0] + "\n" + splittedDocument[1] + "\n";

    for (String splittedLine : dataOnly) {
        segments.add(TextSegment.from(header + splittedLine, document.metadata()));
    }
}

Question 1

The following is the result for question 1: "On which album was 'Adam Raised a Cain' originally released?"

Shell
0.6196628642947255
| Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }

The answer is incorrect.

Question 2

The following is the result for question 2: "What is the highest chart position of 'Greetings from Asbury Park, NJ' in the US?"

Shell
0.8229951885990189
| Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
| Greetings from Asbury Park,N.J.
|60|71|—|—|—|—|—|—|35|41| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_studio_albums.md, document_type=UNKNOWN} } The answer is correct, and the answer can easily be retrieved, as you have the header information for every column. Question 3 The following is the result for question 3: "What is the highest chart position of the album 'Tracks' in Canada?" Shell 0.7646818618182345 | Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK |-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---| |Tracks|27|97|—|63|—|36|—|4|11|50| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} } The answer is correct. Question 4 The following is the result for question 4: "In which year was 'Highway Patrolman' released?" Shell 0.6108392657222184 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } The answer is incorrect. The correct document is found, but the wrong segment is chosen. Question 5 The following is the result for question 5: "Who produced 'All or Nothin’ at All'?" Shell 0.6724577751120745 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| | "All or Nothin' at All" | Bruce Springsteen | Human Touch | Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan |1992 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } The answer is correct and complete this time. Conclusion Three answers are correct and complete. Two answers are incorrect. Note that the incorrect answers are for different questions as before. However, the result is slightly better than with the PDF files. Alternative Questions Let’s build upon this a bit further. You are not using a Large Language Model (LLM) here, which will help you with textual differences between the questions you ask and the interpretation of results. Maybe it helps when you change the question in order to use terminology that is closer to the data in the documents. The source code can be found here. Question 1 Let’s change question 1 from "On which album was 'Adam Raised a Cain' originally released?" to "What is the original release of 'Adam Raised a Cain'?". The column in the table is named original release, so that might make a difference. 
The result is the following: Shell 0.6370094541277747 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| | "Adam Raised a Cain" | Bruce Springsteen | Darkness on the Edge of Town | Jon Landau Bruce Springsteen Steven Van Zandt (assistant) | 1978| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } The answer is correct this time and the score is slightly higher. Question 4: Attempt #1 Question 4 is, "In which year was 'Highway Patrolman' released?" Remember that you only asked for the first relevant result. However, more relevant results can be displayed. Set the maximum number of results to 5. Java List<EmbeddingMatch<TextSegment>> relevantMatches = embeddingStore.findRelevant(queryEmbedding,5); The result is: Shell 0.6108392657222184 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6076896858171996 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Turn! Turn! Turn!" (with Roger McGuinn) | Pete Seeger † | Magic Tour Highlights (EP) | John Cooper | 2008| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6029946650419344 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Darlington County" | Bruce Springsteen | Born in the U.S.A. 
| Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6001672430441461 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Downbound Train" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.5982557901838741 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } As you can see, Highway Patrolman is a result, but only the fifth result. That is a bit strange, though. Question 4: Attempt #2 Let’s change question 4 from, "In which year was 'Highway Patrolman' released?" to, "In which year was the song 'Highway Patrolman' released?" So, you add "the song" to the question. The result is: Shell 0.6506125707025556 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. 
| Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.641000538311824 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Raise Your Hand" (live) (Eddie Floyd cover) | Steve Cropper Eddie Floyd Alvertis Isbell † | Live 1975–85 | Jon Landau Chuck Plotkin Bruce Springsteen |1986 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6402738046796352 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6362427185719677 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.635837703599965 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Wreck on the Highway"| Bruce Springsteen |The River | Jon Landau Bruce Springsteen Steven Van Zandt |1980 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } Now Highway Patrolman is the fourth result. It is getting better. Question 4: Attempt #3 Let’s add the words "of the album Nebraska" to question 4. The question becomes, "In which year was the song 'Highway Patrolman' of the album Nebraska released?" 
The result is: Shell 0.6468954949440158 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6444919056791143 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6376680100362238 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6367565537138745 | Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK |-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---| |The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} } 0.6364950606665447 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Raise Your Hand" (live) (Eddie Floyd cover) | Steve Cropper Eddie Floyd Alvertis Isbell † | Live 1975–85 | Jon Landau Chuck Plotkin Bruce Springsteen |1986 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } Again, an improvement: Highway Patrolman is now listed as the third result. Still, it is strange why it is not listed as the first result. However, by adding more information, it ranks higher in the result list. This is as expected. 
Conclusion

Changing the question to use terminology that is closer to the source data helps to get a better result. Adding more context to the question also helps. Displaying more results gives you more insight and lets you determine the correct answer from the result list.

Markdown Embedding Combined With LLM

The conclusions up until now are:
- The format of the documents and the way the documents are segmented and embedded have a significant influence on the results.
- If the question uses terminology that is close to the data in the documents, better results are achieved.

In the previous blog, the PDF files were used as source data. The first point is now addressed by using Markdown and better segmenting and embedding. The second point can probably be addressed by using an LLM. Let's see what happens when you combine the embedding in this post with an LLM from the previous post. The full source code can be found on GitHub. The questions and results are:

Q: On which album was “Adam Raised a Cain” originally released?
A: The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen album in 1995.
This answer is wrong but corresponds to the first embedded result earlier.

Q: What is the highest chart position of “Greetings from Asbury Park, NJ” in the US?
A: The highest chart position of “Greetings from Asbury Park, NJ” in the US is 60.
This answer is correct.

Q: What is the highest chart position of the album “Tracks” in Canada?
A: The album “Tracks” by Metallica has not reached the highest chart position in Canada. According to the provided information, the highest chart position it reached in Canada was #27.
This answer is wrong. The first embedded result earlier was correct, so it is strange that the wrong answer is given here.

Q: In which year was “Highway Patrolman” released?
A: The song “Highway Patrolman” was released in 1984.
The answer is wrong, but it can be explained. The first embedded result earlier showed the result for Working on the Highway, and this song was released in 1984.

Q: In which year was “Higway Patrolman” released?
A: The song “Highway Patrolman” was released in 1984.
The question here included a typo. The result is similar to the previous one, so this is consistent.

Q: Who produced “All or Nothin’ at All”?
A: The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on Springsteen’s 1992 album “Human Touch.”
This answer is correct and complete. With the PDF as source data, the answer was always incomplete.

The results are worse than when using the PDF files as source documents. However, when a correct answer is given, it is complete. The answers are also consistent; a minor typo does not change the answer like it did before.

Markdown Embedding Combined With System Messages

In the previous blog, the final solution also included adding system messages to the chat memory. Let's see what happens when you combine this with Markdown source documents. The full source code can be found on GitHub. Below is a small extract of the system messages.
Java
ArrayList<ChatMessage> chatMessages = new ArrayList<>();
chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));

The questions and results are:

Q: On which album was “Adam Raised a Cain” originally released?
A: The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen in 1995.
This answer is wrong and similar to the previous paragraph.

Q: What is the highest chart position of “Greetings from Asbury Park, NJ” in the US?
A: Based on the provided information, the highest chart position of “Greetings from Asbury Park, N.J.” in the US is 60.
This answer is correct and similar to the previous paragraph.

Q: What is the highest chart position of the album “Tracks” in Canada?
A: Based on the information provided, the highest chart position of the album “Tracks” in Canada is 64.
This answer is still wrong. However, Metallica is not mentioned anymore, so it is a bit less wrong than in the previous paragraph.

Q: In which year was “Highway Patrolman” released?
A: The song “Highway Patrolman” was released in 1984.
The answer is wrong and identical to the previous paragraph.

Q: In which year was “Higway Patrolman” released?
A: The song “Highway Patrolman” was released in 1984.
The answer is wrong and identical to the previous paragraph.

Q: Who produced “All or Nothin’ at All”?
A: The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on the album “Human Touch” in 1992.
This answer is correct, complete, and similar to the previous paragraph.

Adding system messages did not have any influence on the results.

Overall Conclusion

What did you learn from all of this?
- The way documents are read and embedded seems to have the largest influence on the results.
- An advantage of this approach is that you are able to display a number of results, which allows you to determine which result is the correct one.
- Changing your question to use the terminology used in the text segments helps to get a better result.
- Querying a vector store is very fast. Embedding costs some time, but you only need to do this once.
- Using an LLM takes a lot more time to retrieve a result when you do not use a GPU.

An interesting resource to read is Deconstructing RAG, a blog from LangChain. As improvements are made in this area, better results will follow.
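As a closing reference, here is a minimal sketch of how the pieces above can be wired together in code: retrieve the most relevant Markdown segments from the embedding store, prepend the system messages, and hand everything to a chat model. It is an illustration under assumptions rather than the exact code from the linked repository: the class name MarkdownRagSketch and the answer method are invented for this example, construction of the ChatLanguageModel (the LocalAI model from the previous posts) is left out, and the class and method names follow the LangChain4j version used in this series, so they may differ in newer releases.

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;

import java.util.ArrayList;
import java.util.List;

public class MarkdownRagSketch {

    static String answer(String question,
                         EmbeddingModel embeddingModel,
                         EmbeddingStore<TextSegment> embeddingStore,
                         ChatLanguageModel chatModel) {
        // Embed the question and fetch the most relevant Markdown segments,
        // exactly as in the embedding-only examples above.
        Embedding queryEmbedding = embeddingModel.embed(question).content();
        List<EmbeddingMatch<TextSegment>> matches = embeddingStore.findRelevant(queryEmbedding, 5);

        // Concatenate the retrieved row-level segments so they can be passed to the LLM as context.
        StringBuilder context = new StringBuilder();
        for (EmbeddingMatch<TextSegment> match : matches) {
            context.append(match.embedded().text()).append("\n");
        }

        // The same system messages as in the extract above, followed by the context and the question.
        List<ChatMessage> chatMessages = new ArrayList<>();
        chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
        chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
        chatMessages.add(new UserMessage("Documents:\n" + context + "\nQuestion: " + question));

        AiMessage aiMessage = chatModel.generate(chatMessages).content();
        return aiMessage.text();
    }
}
```

This manual wiring only makes the data flow explicit; the linked repository may organize the same steps differently, for example through LangChain4j's chain or memory abstractions.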
Base64 encoding was originally conceived more than 30 years ago (named in 1992). Back then, the Simple Mail Transfer Protocol (SMTP) forced developers to find a way to encode e-mail attachments in ASCII characters so SMTP servers wouldn't interfere with them. All these years later, Base64 encoding is still widely used for the same purpose: to replace binary data in systems where only ASCII characters are accepted. E-mail file attachments remain the most common example of where we use Base64 encoding, but it’s not the only use case. Whether we’re stashing images or other documents in HTML, CSS, or JavaScript, or including them in JSON objects (e.g., as a payload to certain API endpoints), Base64 simply offers a convenient, accessible solution when our recipient systems say “no” to binary.

The Base64 encoding process starts with a binary encoded file — or encoding a file's content in binary if it isn't in binary already. The binary content is subsequently broken into groups of 6 bits, with each group represented by an ASCII character. There are precisely 64 ASCII characters available for this encoding process — hence the name Base64 — and those characters comprise A-Z (uppercase), a-z (lowercase), 0-9, +, and /. The result of this process is a string of characters; the phrase “hello world”, for example, ends up looking like “aGVsbG8gd29ybGQ=”. The “=” sign at the end is used as padding to ensure the length of the encoded data is a multiple of 4.

The only significant challenge with Base64 encoding in today’s intensely content-saturated digital world is the toll it takes on file size. When we Base64 encode content, we end up with around 1 additional byte of information for every 3 bytes in our original content, increasing the original file size by about 33%. In context, that means an 800 KB image file we’re encoding instantly jumps to over 1 MB, eating up additional costly resources and creating an increasingly cumbersome situation when we share Base64 encoded content at scale.

When our work necessitates a Base64 content conversion, we have a few options at our disposal. First and foremost, many modern programming languages now have built-in classes designed to handle Base64 encoding and decoding locally. Since Java 8 was initially released in 2014, for example, we’ve been able to use java.util.Base64 to handle conversions to and from Base64 with minimal hassle. Similar options exist in Python, C#, and other languages. Depending on the needs of our project, however, we might benefit from making our necessary conversions to and from Base64 encoding with a low-code API solution. This can help take some hands-on coding work off our own plate, offload some of the processing burden from our servers, and in some contexts, deliver more consistent results. In the remainder of this article, I’ll demonstrate a few free APIs we can leverage to streamline our workflows for 1) identifying when content is Base64 encoded and 2) encoding or decoding Base64 content.

Demonstration

Using the ready-to-run Java code examples provided further down the page, we can take advantage of three separate free-to-use APIs designed to help build out our Base64 detection and conversion workflow.
These three APIs serve the following functions (respectively):
- Detect if a text string is Base64 encoded
- Base64 decode content (convert Base64 string to binary content)
- Base64 encode content (convert either binary or file data to a Base64 text string)

A few different text encoding options exist out there, so we can use the first API as a consistent way of identifying (or validating) Base64 encoding when it comes our way. Once we’re sure that we’re dealing with Base64 content, we can use the second API to decode that content back to binary. When it comes time to package our own content for email attachments or relevant systems that require ASCII characters, we can use the third API to convert binary OR file data content directly to Base64. As a quick reminder, if we’re using a file data string, the API will first binary encode that content before Base64 encoding it, so we don’t have to worry about that part. To authorize our API calls, we’ll need a free-tier API key, which will allow us a limit of 800 API calls per month (with no additional commitments — our total will simply reset the following month if/when we reach it).

Before we call the functions for any of the above APIs, our first step is to install the SDK. In our Maven POM file, let’s add a reference to the repository (Jitpack is used to dynamically compile the library):

XML
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

Next, let’s add a reference to the dependency:

XML
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>

Now we can implement ready-to-run code to call each independent API. Let’s start with the base64 detection API. We can use the following code to structure our API call:

Java
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditTextApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditTextApi apiInstance = new EditTextApi();
Base64DetectRequest request = new Base64DetectRequest(); // Base64DetectRequest | Input request
try {
    Base64DetectResponse result = apiInstance.editTextBase64Detect(request);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditTextApi#editTextBase64Detect");
    e.printStackTrace();
}

Next, let’s move on to our base64 decoding API. We can use the following code to structure our API call:

Java
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditTextApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditTextApi apiInstance = new EditTextApi();
Base64DecodeRequest request = new Base64DecodeRequest(); // Base64DecodeRequest | Input request
try {
    Base64DecodeResponse result = apiInstance.editTextBase64Decode(request);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditTextApi#editTextBase64Decode");
    e.printStackTrace();
}

Finally, let’s implement our base64 encoding option (as a reminder, we can use binary OR file data content for this one). We can use the following code to structure our API call:

Java
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditTextApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditTextApi apiInstance = new EditTextApi();
Base64EncodeRequest request = new Base64EncodeRequest(); // Base64EncodeRequest | Input request
try {
    Base64EncodeResponse result = apiInstance.editTextBase64Encode(request);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditTextApi#editTextBase64Encode");
    e.printStackTrace();
}

Now we have a few additional options for identifying, decoding, and/or encoding base64 content in our Java applications.
In an era when data management is critical to business success, exponential data growth presents a number of challenges for technology departments, including DRAM density limitations and strict budget constraints. These issues are driving the adoption of memory tiering, a game-changing approach that alters how data is handled and stored. Non-Volatile Random Access Memory (NVRAM), which is becoming more affordable and popular, is one of the key technologies designed to work within a tiered memory architecture. This article will investigate the fundamentals of NVRAM, compare it to traditional solutions, and provide guidelines for writing efficient NVRAM algorithms.

What Is NVRAM?

Non-Volatile Random Access Memory, or NVRAM, is a type of memory that retains data even when the power is turned off. It combines RAM (Random Access Memory) and ROM (Read Only Memory) properties, allowing data to be read and written quickly like RAM while also being retained without power like ROM. One prominent implementation, available on Intel-based computers, employs 3D XPoint, a memory technology that strikes a balance between the high speed of RAM and the persistence of traditional storage, providing a new solution for high-speed, long-term data storage and processing. NVRAM is typically used for specific purposes such as storing system configuration data rather than as general-purpose memory for running applications. NVRAM is used in a variety of memory modules, including:
- Dual In-line Memory Modules (DIMMs), which use NVRAM to store firmware data or provide persistent memory
- Solid State Drives (SSDs), which use NVRAM to store firmware and wear-leveling data, and sometimes to cache writes
- Motherboard chipsets, which use NVRAM to store BIOS or UEFI settings
- PCIe cards, which use NVRAM for high-speed data storage or caching
- Hybrid memory modules, which use NVRAM in addition to traditional RAM

What Are the Differences Between NVRAM and RAM?

To gain a better understanding of the differences between NVRAM and RAM, it is necessary to review the concepts of both types of memory. RAM, or Random Access Memory, is a type of memory that can be read and written to in any order and is commonly used to store working data and machine code. RAM is "volatile" memory, which means it only holds data while the computer is powered on. Unlike RAM, NVRAM retains stored information, making it ideal for storing critical data that must persist across reboots, such as system configurations, user settings, or application state. Apart from this critical distinction, these types of memory differ in other ways that define their advantages and disadvantages:
- Speed: DRAM (Dynamic RAM) is fast, especially when it comes to accessing and writing data; NVRAM aims to approach DRAM speeds while also offering the durability of traditional non-volatile memory.
- Energy consumption: NVRAM consumes less power than RAM/DRAM because it does not require power to retain data, whereas the latter requires constant refreshing.
- Cost and availability: NVRAM may initially be more expensive and less widely available than established memory technologies such as RAM, which is ubiquitous and comes in a wide range of price points.

What Distinguishes NVRAM Algorithms?

Because NVRAM allows important bits of information (such as program settings) to be stored directly in memory, it becomes a game changer in the industry.
NVRAM algorithms are defined by several key characteristics: NVRAM provides new opportunities for developing recoverable algorithms, allowing a program's state to be restored efficiently after a system or individual process failure. NVRAM frequently offers faster read and write speeds than traditional magnetic disk drives or flash-based SSDs, making it suitable for high-performance computing and real-time tasks that require quick data access. Some types of NVRAM, such as flash memory, are prone to wear from frequent rewrites. This necessitates special wear-leveling algorithms that distribute write operations evenly across the memory to extend its lifespan. Integrating NVRAM into systems requires taking its distinct characteristics, such as access speed and wear management, into account. This may entail modifying existing algorithms and system architectures. How To Write NVRAM Algorithms Mutual Exclusion Algorithms Mutex (mutual exclusion) algorithms are designed to ensure that multiple processes can manage access to shared resources without conflict, even in the event of system crashes or power outages. The key requirements for this type of algorithm are: Mutual exclusion: Only one process or thread can access the critical section at a time, preventing concurrent access to shared resources. Deadlock freedom: Processes never wait indefinitely for each other to release resources, so the program keeps making progress. Starvation freedom: Every process eventually gets access to the critical section, preventing indefinite delays for any process. Peterson's Algorithm for NVRAM Peterson's algorithm is an example of an algorithm that can be adapted for NVRAM. In computer science, it is a concurrency control algorithm used to achieve mutual exclusion in multi-threaded environments. It enables multiple processes to share a single-use resource without conflict while ensuring that only one process has access to the resource at any given time. In an NVRAM environment, Peterson's algorithm, which was originally designed for two processes, can be extended to support multiple processes (from 0 to n-1). Adapting Peterson's algorithm for NVRAM entails not only expanding it to support multiple processes but also incorporating mechanisms for post-failure recoverability. To adapt Peterson's algorithm for recoverability in NVRAM, include specific recovery code that allows a process to re-enter the critical section after a crash. This might involve checking the state of shared variables or locks to determine the last known state before the crash. Writing the algorithm involves the following steps: Initialization: Define the shared variables (the flag array and the turn variable) in NVRAM. Set these variables to their default values, indicating that no process is currently in the critical section. Entry section: Each process that attempts to enter the critical section sets its flag in NVRAM. The process then sets the turn variable to indicate its desire to enter the critical section and examines the other processes' flags and the turn variable to determine whether it can enter. Critical section: Once inside, the process performs its work. Any state changes or operations that must survive a failure are stored in NVRAM. Exit section: When the process completes its operations, it resets its flag in NVRAM, indicating that it has exited the critical section.
Recovery mechanism: Include code that handles crashes during entry to or exit from the critical section. If a process fails in the entry section, it reads the state of its competitors and determines whether to continue. If a process crashes in the exit section, it re-executes the entry section to ensure the shared state is updated correctly. Handling process failures: Use logic to determine whether a failed process completed its operation in the critical section and take appropriate action. Tournament tree for process completion: Create a hierarchical tournament tree structure. Each process traverses this tree, running recovery and entry code at each level. If necessary, include an empty recovery code segment to indicate that the process is aware of its failure state. Nonblocking Algorithms Nonblocking algorithms are a class of concurrent algorithms that enable multiple threads to access and modify shared data without locks or other mutual exclusion mechanisms. They are designed so that the failure or suspension of one thread does not prevent other threads from making progress. The primary progress guarantees of nonblocking algorithms are: Lock-free: At least one thread makes progress in a finite number of steps, even if other threads are delayed indefinitely. Wait-free: A stronger guarantee in which every thread completes its operation in a finite number of steps, regardless of the activity of other threads. Obstruction-free: The weakest guarantee, in which a thread finishes its operation in a finite number of steps provided it eventually runs without interference from other threads. Linearizability is a key concept in concurrent programming that is closely associated with nonblocking algorithms. It ensures that all operations on shared resources (such as reads, writes, or updates) appear to take effect in a single sequential order that is consistent with the real-time order of the operations. Nonblocking Algorithm Example Let's look at a recoverable version of the CAS (compare-and-swap) operation, which is intended to make operations more resilient to failures. A key feature of this implementation is a two-dimensional array that acts as a log, recording which process wrote a value and when. Such logging is essential in a recoverable system, particularly with NVRAM, where data persists across reboots and failures. Linearizability, which ensures that operations appear to occur in a sequential order consistent with their actual execution, is a key property of this algorithm, and the evaluation order inside the CAS.RECOVER function is critical for maintaining it: If process p1 fails after a successful CAS operation and then recovers, evaluating the second part of the expression in CAS.RECOVER first can lead to a non-linearizable execution, because another process, p2, could complete a CAS operation in the meantime and change the state in a way that is not accounted for if p1 only checks the second part of the condition. Therefore, the first part of the condition (checking whether C = <p, new>) must be evaluated before the second part (checking whether new appears in R[p][1] to R[p][N]).
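To make the structure of such an algorithm more concrete, here is a minimal, hypothetical Python sketch of a recoverable CAS built around a per-process announcement log, in the spirit of the scheme described above. The names (RecoverableCAS, R, recover) and the use of ordinary Python objects in place of real persistent memory are assumptions for illustration only; a production version would place C and R in NVRAM and rely on a hardware CAS primitive rather than a lock.
Python
import threading

class RecoverableCAS:
    """Sketch only: C holds a (writer, value) pair; R[p][q] records values
    written by process p that were later overwritten by process q, so p can
    tell after a crash whether its own CAS had already taken effect."""

    def __init__(self, num_procs, initial=None):
        self.C = (None, initial)                       # (writer, value)
        self.R = [[None] * num_procs for _ in range(num_procs)]
        self._lock = threading.Lock()                  # stands in for hardware CAS

    def cas(self, p, old, new):
        with self._lock:                               # emulate atomicity
            writer, value = self.C
            if value != old:
                return False
            if writer is not None:
                # Tell the previous writer that its value was overwritten.
                self.R[writer][p] = value
            self.C = (p, new)
            return True

    def recover(self, p, old, new):
        with self._lock:
            # Check C = <p, new> first -- the evaluation order matters.
            if self.C == (p, new):
                return True
            # Otherwise, did a later writer log that it overwrote p's value?
            if any(slot == new for slot in self.R[p]):
                return True
        # The earlier CAS never took effect, so it is safe to retry it.
        return self.cas(p, old, new)

# Example usage with two hypothetical processes:
shared = RecoverableCAS(num_procs=2, initial=0)
assert shared.cas(p=0, old=0, new=7)        # process 0 succeeds
print(shared.recover(p=0, old=0, new=7))    # True: the earlier CAS is visible
The recover method checks the current value of C before consulting the log, mirroring the evaluation order discussed above; reversing the two checks could report a stale result if another process changed C in the meantime.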
Conclusion This article delves into the fundamental concepts of NVRAM, a new type of memory, compares it to RAM, presents key requirements for mutex and nonblocking algorithms, and offers guidelines for developing efficient NVRAM algorithms.
In the first part of this series, we discussed the importance, ethical considerations, and challenges of data anonymization. Now, let's dive into various data anonymization techniques, their strengths and weaknesses, and their implementation in Python. 1. Data Masking Data masking, or obfuscation, involves hiding original data with random characters or data. This technique protects sensitive information, such as credit card numbers or personal identifiers, in environments where data integrity is not critical but confidentiality is essential, such as development and testing environments. For instance, a developer working on a banking application can use masked account numbers to test the software without accessing real account information. This method ensures that sensitive data remains inaccessible while the overall structure and format are preserved for practical use. Example Use-Case: Data masking is commonly used in software development and testing, where developers must work with realistic data sets without accessing sensitive information. Pros: It maintains the format and type of data. Effective for protecting sensitive information. Cons: Not suitable for complex data analysis. Potential for reverse engineering if the masking algorithm is known. Example Code: Python def data_masking(data, mask_char='*'): return ''.join([mask_char if char.isalnum() else char for char in data]) # Example: data_masking("Sensitive Data") returns "********* ****" 2. Pseudonymization Pseudonymization replaces private identifiers with fictitious names or identifiers. It reduces the risk of identifying data subjects while retaining a certain level of data utility. This technique is helpful in research environments, where researchers must work with individual-level data without the risk of exposing personal identities. For instance, in clinical trials, patient names might be replaced with unique codes, allowing researchers to track individual responses to treatments without knowing the actual identities of the patients. Example Use-Case: Pseudonymization is widely used in clinical research and studies where individual data tracking is necessary without revealing real identities. Pros: Reduces direct linkage to individuals. It is more practical than fully anonymized data for specific analyses. Cons: It is not fully anonymous; it requires secure storage of the pseudonym mapping. Risk of re-identification if additional data is available. Example Code: Python import uuid def pseudonymize(data): pseudonym = str(uuid.uuid4()) # Generates a unique identifier return pseudonym # Example: pseudonymize("John Doe") returns a UUID string. 3. Aggregation Aggregation involves summarizing data into larger groups, categories, or averages to prevent the identification of individuals. This technique is used when the specific data details are not crucial, but the overall trends and patterns are. For example, in demographic studies, individual responses might be aggregated into age ranges, income brackets, or regional statistics to analyze population trends without exposing individual-level data. Example Use-Case: Aggregation is commonly used in demographic analysis, public policy research, and market research, focusing on group trends rather than individual data points. Pros: It reduces the risk of individual identification. Useful for statistical analysis. Cons: It loses detailed information. It is only suitable for some types of analysis.
Example Code: Python def aggregate_data(data, bin_size): return [x // bin_size * bin_size for x in data] # Example: aggregate_data([23, 37, 45], 10) returns [20, 30, 40] 4. Data Perturbation Data perturbation modifies the original data in a controlled manner by adding a small amount of noise or changing some values slightly. This technique protects individual data points from being precisely identified while maintaining the data's overall structure and statistical distribution. It is instrumental in datasets used for machine learning, where the overall patterns and structures are essential but exact values are not. For instance, in a dataset used for traffic pattern analysis, the exact number of cars at a specific time can be slightly altered to prevent tracing back to particular vehicles or individuals. Example Use-Case: Data perturbation is often used in machine learning and statistical analysis, where maintaining the overall distribution and data patterns is essential, but exact values are not critical. Pros: It maintains the statistical properties of the dataset. Effective against certain re-identification attacks. Cons: It can reduce data accuracy. It is challenging to find the right level of perturbation. Example Code: Python import random def perturb_data(data, noise_level=0.01): return [x + random.uniform(-noise_level, noise_level) for x in data] # Example: perturb_data([100, 200, 300], 0.05) adds uniform noise of up to ±0.05 to each value. 5. Differential Privacy Differential privacy is a more advanced technique that adds noise to the data or to the output of queries on data sets, ensuring that removing or adding a single database item does not significantly affect the outcome. This method provides robust, mathematically proven privacy guarantees and is helpful in scenarios where data needs to be shared or published. For example, a statistical database responding to queries about citizen health trends can use differential privacy to ensure that the responses do not inadvertently reveal information about any individual citizen. Example Use-Case: Differential privacy is widely applied in statistical databases, public data releases, and anywhere robust, quantifiable privacy guarantees are required. Pros: It provides a quantifiable privacy guarantee. Suitable for complex statistical analyses. Cons: It is not easy to implement correctly. It may significantly alter data if not carefully managed. Example Code: Python import numpy as np def differential_privacy(data, epsilon): noise = np.random.laplace(0, 1/epsilon, len(data)) return [d + n for d, n in zip(data, noise)] # Example: differential_privacy([10, 20, 30], 0.1) adds Laplace noise scaled by 1/epsilon. Conclusion Data anonymization is a crucial practice in data engineering and privacy. As discussed in this series, various techniques offer different levels of protection while balancing the need for data utility. Data masking, which involves hiding original data with random characters, is effective for scenarios where confidentiality is essential, such as software development and testing environments. Pseudonymization replaces private identifiers with fictitious names or codes, balancing data utility and privacy, making it ideal for research environments like clinical trials. Aggregation is a powerful tool for summarizing data when individual details are less critical, commonly employed in demographic and market research.
Data perturbation is instrumental in maintaining the overall structure and statistical distribution of data used in machine learning and traffic analysis. Lastly, differential privacy, although challenging to implement, provides robust privacy guarantees and is indispensable in scenarios where data sharing or publication is necessary. Choosing the proper anonymization technique for the specific use case and privacy requirements is essential. These techniques empower organizations and data professionals to strike a balance between harnessing the power of data for insights and analytics and respecting the privacy and confidentiality of individuals. As the data landscape evolves, understanding and implementing these anonymization techniques will ensure ethical and responsible data practices. Data privacy is a legal and ethical obligation as well as a critical aspect of building trust with stakeholders and users, making it an integral part of the modern data engineering landscape.
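To see how these pieces fit together, here is a minimal, hypothetical sketch that applies the functions defined in the sections above to a single toy record. The record fields and parameter values are invented for illustration, and the anonymization functions are assumed to be available in the same module.
Python
# Assumes data_masking, pseudonymize, aggregate_data, perturb_data, and
# differential_privacy from the sections above are defined in this module.

record = {
    "name": "Jane Smith",                    # invented example value
    "card_number": "4111-1111-1111-1111",    # invented example value
    "age": 37,
    "monthly_spend": [120.5, 98.0, 143.2],
}

anonymized = {
    # Random UUID; a real system would persist the name-to-pseudonym mapping.
    "name": pseudonymize(record["name"]),
    # Alphanumeric characters masked, separators kept: "****-****-****-****"
    "card_number": data_masking(record["card_number"]),
    # Age bucketed into a 10-year bracket: 30
    "age_bracket": aggregate_data([record["age"]], 10)[0],
    # Small uniform noise added to each value
    "monthly_spend": perturb_data(record["monthly_spend"], noise_level=1.0),
    # Laplace noise for a differentially private release of the same values
    "spend_dp": differential_privacy(record["monthly_spend"], epsilon=0.5),
}

print(anonymized)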
The new SingleStoreDB release v8.5 provides several new vector features. In this article, we'll evaluate ANN Index Search with the new VECTOR data type using the Fashion MNIST dataset from Zalando. The notebook file and SQL code are available on GitHub. Create a SingleStoreDB Cloud Account A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings: Workspace Group Name: ANN Demo Group Cloud Provider: AWS Region: US East 1 (N. Virginia) Workspace Name: ann-demo Size: S-00 Create a Database and Tables In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this fmnist_db, as follows: SQL CREATE DATABASE IF NOT EXISTS fmnist_db; We'll also create several tables using the BLOB data type and new VECTOR data type, as follows: SQL USE fmnist_db; CREATE TABLE IF NOT EXISTS train_data_blob ( idx INT(10) UNSIGNED NOT NULL, label VARCHAR(20), vector BLOB, KEY(idx) ); CREATE TABLE IF NOT EXISTS test_data_blob ( idx INT(10) UNSIGNED NOT NULL, label VARCHAR(20), vector BLOB, KEY(idx) ); CREATE TABLE IF NOT EXISTS train_data_vec ( idx INT(10) UNSIGNED NOT NULL, label VARCHAR(20), vector VECTOR(784) NOT NULL, KEY(idx) ); CREATE TABLE IF NOT EXISTS test_data_vec ( idx INT(10) UNSIGNED NOT NULL, label VARCHAR(20), vector VECTOR(784) NOT NULL, KEY(idx) ); We have train and test tables using both formats. We'll load data into the two different sets of tables. New Notebook We'll follow the instructions to create a new notebook as described in a previous article. We'll call the notebook ann_demo. Fill Out the Notebook First, we'll install some libraries: Shell !pip install tensorflow --quiet !pip install matplotlib --quiet Next, let's set up our environment: Python import os os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" from tensorflow import keras from keras.datasets import fashion_mnist import matplotlib.pyplot as plt import numpy as np Load the Dataset We'll use the Fashion MNIST dataset from Zalando. First, we'll get the train and test data: Python (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() Let's take a look at the shape of the data: Python print("train_images: " + str(train_images.shape)) print("train_labels: " + str(train_labels.shape)) print("test_images: " + str(test_images.shape)) print("test_labels: " + str(test_labels.shape)) The result should be as follows: Plain Text train_images: (60000, 28, 28) train_labels: (60000,) test_images: (10000, 28, 28) test_labels: (10000,) We have 60,000 images for training and 10,000 images for testing. 
The images are greyscaled, 28 pixels by 28 pixels, and we can take a look at one of these: Python print(train_images[0]) The result should be (28 columns by 28 rows): Plain Text [[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 13 73 0 0 1 4 0 0 0 0 1 1 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 3 0 36 136 127 62 54 0 0 0 1 3 4 0 0 3] [ 0 0 0 0 0 0 0 0 0 0 0 0 6 0 102 204 176 134 144 123 23 0 0 0 0 12 10 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 155 236 207 178 107 156 161 109 64 23 77 130 72 15] [ 0 0 0 0 0 0 0 0 0 0 0 1 0 69 207 223 218 216 216 163 127 121 122 146 141 88 172 66] [ 0 0 0 0 0 0 0 0 0 1 1 1 0 200 232 232 233 229 223 223 215 213 164 127 123 196 229 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 183 225 216 223 228 235 227 224 222 224 221 223 245 173 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 193 228 218 213 198 180 212 210 211 213 223 220 243 202 0] [ 0 0 0 0 0 0 0 0 0 1 3 0 12 219 220 212 218 192 169 227 208 218 224 212 226 197 209 52] [ 0 0 0 0 0 0 0 0 0 0 6 0 99 244 222 220 218 203 198 221 215 213 222 220 245 119 167 56] [ 0 0 0 0 0 0 0 0 0 4 0 0 55 236 228 230 228 240 232 213 218 223 234 217 217 209 92 0] [ 0 0 1 4 6 7 2 0 0 0 0 0 237 226 217 223 222 219 222 221 216 223 229 215 218 255 77 0] [ 0 3 0 0 0 0 0 0 0 62 145 204 228 207 213 221 218 208 211 218 224 223 219 215 224 244 159 0] [ 0 0 0 0 18 44 82 107 189 228 220 222 217 226 200 205 211 230 224 234 176 188 250 248 233 238 215 0] [ 0 57 187 208 224 221 224 208 204 214 208 209 200 159 245 193 206 223 255 255 221 234 221 211 220 232 246 0] [ 3 202 228 224 221 211 211 214 205 205 205 220 240 80 150 255 229 221 188 154 191 210 204 209 222 228 225 0] [ 98 233 198 210 222 229 229 234 249 220 194 215 217 241 65 73 106 117 168 219 221 215 217 223 223 224 229 29] [ 75 204 212 204 193 205 211 225 216 185 197 206 198 213 240 195 227 245 239 223 218 212 209 222 220 221 230 67] [ 48 203 183 194 213 197 185 190 194 192 202 214 219 221 220 236 225 216 199 206 186 181 177 172 181 205 206 115] [ 0 122 219 193 179 171 183 196 204 210 213 207 211 210 200 196 194 191 195 191 198 192 176 156 167 177 210 92] [ 0 0 74 189 212 191 175 172 175 181 185 188 189 188 193 198 204 209 210 210 211 188 188 194 192 216 170 0] [ 2 0 0 0 66 200 222 237 239 242 246 243 244 221 220 193 191 179 182 182 181 176 166 168 99 58 0 0] [ 0 0 0 0 0 0 0 40 61 44 72 41 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]] We can check the label associated with this image: Python print(train_labels[0]) The result should be: Plain Text 9 This value represents an Ankle Boot. We can do a quick plot as follows: Python classes = [ "t_shirt_top", "trouser", "pullover", "dress", "coat", "sandal", "shirt", "sneaker", "bag", "ankle_boot" ] num_classes = len(classes) for i in range(num_classes): ax = plt.subplot(2, 5, i + 1) plt.imshow( np.column_stack(train_images[i].reshape(1, 28, 28)), cmap = plt.cm.binary ) plt.axis("off") ax.set_title(classes[train_labels[i]]) The result is shown in Figure 1. Figure 1. Fashion MNIST. 
Prepare Pandas Dataframe We need to reshape our dataset so that we can store it correctly later: Python train_images = train_images.reshape((train_images.shape[0], -1)) test_images = test_images.reshape((test_images.shape[0], -1)) And we can check the shapes: Python print("train_images: " + str(train_images.shape)) print("test_images: " + str(test_images.shape)) The result should be: Plain Text train_images: (60000, 784) test_images: (10000, 784) So, we have flattened the image structure. Now we'll create two Pandas Dataframes, as follows: Python import pandas as pd train_data_df = pd.DataFrame([ (i, image.astype(int).tolist(), classes[int(label)], ) for i, (image, label) in enumerate(zip(train_images, train_labels)) ], columns = ["idx", "img", "label"]) test_data_df = pd.DataFrame([ (i, image.astype(int).tolist(), classes[int(label)], ) for i, (image, label) in enumerate(zip(test_images, test_labels)) ], columns = ["idx", "img", "label"]) We need to convert the values in the img column to a suitable format for SingleStoreDB. We can do this using the following code: Python import struct def data_to_binary(data: list[float]): format_string = "f" * len(data) return struct.pack(format_string, *data) train_data_df["vector"] = train_data_df["img"].apply(data_to_binary) test_data_df["vector"] = test_data_df["img"].apply(data_to_binary) We can now drop the img column: Python train_data_df.drop("img", axis = 1, inplace = True) test_data_df.drop("img", axis = 1, inplace = True) Write Pandas Dataframes to SingleStoreDB We are now ready to write the Dataframes train_data_df and test_data_df to the tables train_data_blob and test_data_blob, respectively. First, we'll set up the connection to SingleStoreDB: Python from sqlalchemy import * db_connection = create_engine(connection_url) Finally, we are ready to write the Dataframes to SingleStoreDB. First, train_data_df: Python train_data_df.to_sql( "train_data_blob", con = db_connection, if_exists = "append", index = False, chunksize = 1000 ) And then test_data_df: Python test_data_df.to_sql( "test_data_blob", con = db_connection, if_exists = "append", index = False, chunksize = 1000 ) Example Queries Now that we have built our system, we can run some queries using the SQL Editor. Using the BLOB Type First, let's create two variables: SQL SET @qv_train_blob = ( SELECT vector FROM train_data_blob WHERE idx = 30000 ); SET @qv_test_blob = ( SELECT vector FROM test_data_blob WHERE idx = 500 ); In the first case, we are selecting an image vector 50% through the train data. In the second case, we are selecting an image vector 5% through the test data. 
Now, let's use EUCLIDEAN_DISTANCE with the train data: SQL SELECT label, EUCLIDEAN_DISTANCE(vector, @qv_train_blob) AS score FROM train_data_blob ORDER BY score LIMIT 5; The result should be: Plain Text +-------+-------------------+ | label | score | +-------+-------------------+ | dress | 0 | | dress | 570.5322076798119 | | dress | 612.5422434412177 | | dress | 653.6390441214478 | | dress | 665.1052548281363 | +-------+-------------------+ Next, let's try the same query but use the test data: SQL SELECT label, EUCLIDEAN_DISTANCE(vector, @qv_test_blob) AS score FROM train_data_blob ORDER BY score LIMIT 5; The result should be: Plain Text +----------+--------------------+ | label | score | +----------+--------------------+ | pullover | 1211.59399140141 | | pullover | 1295.9332544541019 | | pullover | 1316.508640305866 | | pullover | 1320.24278070361 | | pullover | 1346.3539653449236 | +----------+--------------------+ Using the VECTOR Type First, we'll copy the data from the tables using the BLOB type to the tables using the VECTOR type, as follows: SQL INSERT INTO train_data_vec (idx, label, vector) ( SELECT idx, label, vector FROM train_data_blob ); INSERT INTO test_data_vec (idx, label, vector) ( SELECT idx, label, vector FROM test_data_blob ); Next, we'll define an index as follows: SQL ALTER TABLE train_data_vec ADD VECTOR INDEX (vector) INDEX_OPTIONS '{ "index_type":"IVF_FLAT", "nlist":1000, "metric_type":"EUCLIDEAN_DISTANCE" }'; Many vector indexing options are available. Please see the Vector Indexing documentation. First, let's create two variables: SQL SET @qv_train_vec = ( SELECT vector FROM train_data_vec WHERE idx = 30000 ); SET @qv_test_vec = ( SELECT vector FROM test_data_vec WHERE idx = 500 ); In the first case, we are selecting an image vector 50% through the train data. In the second case, we are selecting an image vector of 5% through the test data. Now, let's use the Infix Operator <-> with the train data: SQL SELECT label, vector <-> @qv_train_vec AS score FROM train_data_vec ORDER BY score LIMIT 5; The result should be: Plain Text +-------+-------------------+ | label | score | +-------+-------------------+ | dress | 0 | | dress | 570.5322076798119 | | dress | 612.5422434412177 | | dress | 653.6390441214478 | | dress | 665.1052548281363 | +-------+-------------------+ Next, let's try the same query but use the test data: SQL SELECT label, vector <-> @qv_test_vec AS score FROM train_data_vec ORDER BY score LIMIT 5; The result should be: Plain Text +----------+--------------------+ | label | score | +----------+--------------------+ | pullover | 1211.59399140141 | | pullover | 1295.9332544541019 | | pullover | 1316.508640305866 | | pullover | 1320.24278070361 | | pullover | 1346.3539653449236 | +----------+--------------------+ Comparing the results, we can see that both approaches work well. However, the new ANN Index Search provides many benefits, as discussed in the Vector Indexing documentation. Summary In this short article, we've seen how to create an ANN Index using the new VECTOR data type with a well-known dataset. We've seen that the existing approach to storing vectors in SingleStoreDB using the BLOB type works well, but using the new vector features offers greater flexibility and choices.
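One closing note: the ANN queries above were run in the SQL Editor, but the same search can also be issued directly from the notebook through the SQLAlchemy engine created earlier. The snippet below is a small sketch of that idea; it assumes the db_connection engine and the populated train_data_vec and test_data_vec tables from the previous steps, and that pandas is available in the notebook environment.
Python
import pandas as pd

# Reuse the SQLAlchemy engine (db_connection) created earlier in the notebook.
# A scalar subquery stands in for the @qv_test_vec session variable.
query = """
SELECT label, vector <-> (
    SELECT vector FROM test_data_vec WHERE idx = 500
) AS score
FROM train_data_vec
ORDER BY score
LIMIT 5
"""

top_matches = pd.read_sql(query, con = db_connection)
print(top_matches)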
There is an endless debate about whether a frontend (in my case, mobile) developer needs to know algorithms and data structures only to pass technical interviews at large tech companies, or whether there is real benefit to using them in daily work. I think the truth is somewhere in between, as always. Of course, you will rarely find a case where you need to implement a min heap or use dynamic programming while working on UI and business logic for a service with a REST API, but having a basic understanding of performance and of time and memory complexity can help you make small, simple optimizations in the app that pay off a lot in the long run. I want to give an example of such a small optimization and of the decision-making process that helps us decide whether the extra effort is worth it. Example I'm working on a simple iOS application for my kids that should help them learn foreign languages. One of the basic features is a vocabulary where you can add a word you want to learn. I wanted to add an image for each word to represent it visually. In 2024, the best way might be to call the API of an image generation model, but that is overkill for the simple app I'm trying to make, so I decided to go with emojis. There are over 1,000 emojis available, and most simple words or phrases that kids might try to learn have a visual representation there. Here is a code example that obtains most of the emoji symbols and filters out only those that can be properly rendered. Swift var emojis: [Character: String] = [:] let ranges = [ 0x1F600...0x1F64F, 9100...9300, 0x2600...0x26FF, 0x2700...0x27BF, 0x1F300...0x1F5FF, 0x1F680...0x1F6FF, 0x1F900...0x1F9FF ] for range in ranges { for item in range { guard let scalar = UnicodeScalar(item), scalar.properties.isEmojiPresentation else { continue } let value = Character(scalar) emojis[value] = description(emoji: value) } } With each emoji character, we also store a description that we will use to find the right emoji for our word or phrase. Here are a few examples: Now let's consider how to find the right emoji for a given word or phrase in the most straightforward and simple way. Unfortunately, string comparison would not work here: first, not all emoji descriptions consist of a single word, and second, users may type a different form of the word or even a whole phrase. Fortunately, Apple provides a built-in NaturalLanguage framework that can help us. We will use its sentence embedding functionality to measure the distance between the word or phrase given by the user and the emoji description we are storing. Here is a function for it: Swift func calculateDistance(text: String, items: [Character]) -> (String?, Double?) { guard let embedding = NLEmbedding.sentenceEmbedding(for: .english) else { return (nil, nil) } var minDistance: Double = Double.greatestFiniteMagnitude var emoji = "" for key in items { guard let value = emojis[key] else { continue } let distance = embedding.distance( between: text.lowercased(), and: value.lowercased() ) if distance < minDistance { minDistance = distance emoji = String(key) } } return (emoji, minDistance) } The algorithm is straightforward: we run through all the emoji characters we have, take each description and compare it with the given text, keep track of the minimum distance found, and in the end return the emoji with the minimum distance to our text together with the distance itself for further filtering. This algorithm has linear time complexity, O(n).
Here are some examples of the results: The last one is not what I would expect for a smiling face, but it is smiling, so it works. We can also use the returned distance to filter results out. The distance value lies in the range between 0 and 2 (by default). By running some experiments, I found that 0.85 is a good cut-off for everything that does not represent the meaning of the phrase: everything below 0.85 looks good, and for everything above it I return an empty string so as not to confuse users. We now have a first version of our algorithm, and while it works, it's quite slow. To find a match for any request, it has to go through every emoji and execute a distance measurement for each description individually. This process takes around 3.8 seconds for every request from the user. Now we need to make an important decision: whether to invest time into optimization. To answer this question, let's think about what exactly we want to improve with this extra effort. Even though 3.8 seconds for emoji lookup may seem obviously unacceptable, I still want to use it as an example and question whether optimizing this time is worth it. My use case is the following: The user opens the vocabulary and wants to add a new word or phrase. The user types this word. When typing is finished, I make a network call to a translation API that gives me a translation of the word. Ideally, I want the emoji to appear at the same moment the typing is finished, but I can live with a delay that does not exceed the time of the translation API call, showing the emoji at the same time I get the translation. With this behavior as a requirement, it's clear that 3.8 seconds is far longer than a network call. If it took 0.3-0.5 seconds, I probably wouldn't optimize here, because the user experience would not noticeably suffer. Later, I might need to revisit this topic and improve it, but for now, delivering a working product is better than never delivering perfect code. In my case, I have to optimize, so let's think about how to do it. We're already using a dictionary where emojis are keys and descriptions are values. We'll add an additional dictionary with the keys and values swapped: I'll split each description into separate words and use those words as keys, and for the values I'll use a list of the emojis whose descriptions contain that word. This index lets me find the most relevant emojis for a given word in almost constant time. The main drawback of this approach is that it only works with single words, not phrases. My target users, however, will typically search for a single word. So, I'll use the index for single-word searches and keep the old approach for the rarer phrases, so that they still get a matching emoji instead of an empty string. Let's take a look at a few examples from the index dictionary: And here's a function for creating such an index: Swift var searchIndex: [String: [Character]] = [:] ... func prepareIndex() { for item in emojis { let words = item.value.components(separatedBy: " ") for word in words { var emojiItems: [Character] = [] let lowercasedWord = word.lowercased() if let items = searchIndex[lowercasedWord] { emojiItems = items } emojiItems.append(item.key) searchIndex[lowercasedWord] = emojiItems } } } Now, let's add two more functions for single words and phrases.
Swift func emoji(word: String) -> String { guard let options = searchIndex[word.lowercased()] else { return emoji(text: word) } let result = calculateDistance(text: word, items: options) guard let value = result.0, let distance = result.1 else { return emoji(text: word) } return distance < singleWordAllowedDistance ? value : "" } func emoji(text: String) -> String { let result = calculateDistance(text: text, items: Array(emojis.keys)) guard let value = result.0, let distance = result.1 else { return "" } return distance < allowedDistance ? value : "" } allowedDistance and singleWordAllowedDistance are constants that help me to configure filtering. As you can see, we use the same distance calculation as before, but instead of all emojis, we're injecting a list of emojis that have the given word in their description. And for most cases, it will be just a few or even only one option. This makes the algorithm work in near constant time in most cases. Let's test it and measure the time. This updated algorithm gives a result within 0.04 - 0.08 seconds, which is around 50 times faster than before. However, there's a big issue: the words should be spelled exactly as they are presented in the description. We can fix this by using a Word Embedding with Neighbors function, which will give us a list of similar or close-in-meaning words to the given one. Here's an updated func emoji(word: String) -> String function. Swift func emoji(word: String) -> String { guard let wordEmbedding = NLEmbedding.wordEmbedding(for: .english) else { return "" } let neighbors = wordEmbedding.neighbors(for: word, maximumCount: 2).map({ $0.0 }) let words = [word] + neighbors for word in words { guard let options = searchIndex[word.lowercased()] else { continue } let result = calculateDistance(text: word, items: options) guard let value = result.0, let distance = result.1 else { return emoji(text: word) } return distance < singleWordAllowedDistance ? value : "" } return emoji(text: word) } Now it works very quickly and in most cases. Conclusion Knowing basic algorithms and data structures expands your toolset and helps you find areas in code that can be optimized. Especially when working on a large project with many developers and numerous modules in the application, having optimizations here and there will help the app run faster over time.
The modern data stack has helped democratize the creation, processing, and analysis of data across organizations. However, it has also led to a new set of challenges thanks to the decentralization of the data stack. In this post, we’ll discuss one of the cornerstones of the modern data stack—data catalogs—and why they fall short of overcoming the fragmentation to deliver a fully self-served data discovery experience. If you are the leader of the data team at a company with 200+ employees, there is a high probability that you have: started seeing data discovery issues at your company; tried one of the commercial or open-source data catalogs; or cobbled together an in-house data catalog. If that’s the case, you’ll definitely find this post highly relatable. Pain Points This post is based on our own experience of building DataHub at LinkedIn and the learnings from 100+ interviews with data leaders and practitioners at various companies. There may be many reasons why a company adopts a data catalog, but here are the pain points we most often come across: Your data team is spending a lot of time answering questions about where to find data and what datasets to use. Your company is making bad decisions because data is inconsistent, poor in quality, delayed, or simply unavailable. Your data team can't confidently apply changes to, migrate, or deprecate data because there’s no visibility into how the data is being used. The bottom line is that you want to empower your stakeholders to self-serve the data and, more importantly, the right data. The data team doesn't want to be bogged down by support questions any more than data consumers want to depend on the data team to answer their questions. Both share a common goal—True Self-service Data Discovery™. First Reaction In our research, we saw striking similarities in how companies attempted to solve this problem themselves. The story often goes like this: Create a database to store metadata. Collect important metadata, such as schemas, descriptions, owners, usage, and lineage, from key data systems. Make it searchable through a web app. Voila! You now have a full self-service solution and proudly declare victory over all data discovery problems. Initial Excitement Let’s walk through what typically happened after this shiny new data catalog was introduced. It looked great on first impression. A handful of power users were super excited about the catalog and its potential. They were thrilled about their newfound visibility into the whole data ecosystem and the endless opportunities to explore new data. They were optimistic that this was indeed The Solution they’d been looking for. Reality Sets In A few months after launching, you started noticing that user engagement waned quickly. Customers’ questions in your data team’s Slack channel didn’t seem to go away either. If anything, they became even harder for the team to answer. So what happened? People searched “revenue,” hoping to find the official revenue dataset. Instead, they got hundreds of similarly named results, such as “revenue”, “revenue_new”, ”revenue_latest”, “revenue_final”, “revenue_final_final”, and were at a complete loss. Even if the person knew the exact name of what they were looking for, the data catalog only provided technical information, e.g., the SQL definition, column descriptions, lineage, and data profile, without any explicit instructions on how to use it for a specific use case.
Your data team has painstakingly tagged datasets as "core", "golden", "important", etc., but the customers didn't know what these tags mean or why they matter. Worse yet, they started tagging things randomly and messed up the curation effort. Is it really that hard to find the right data, even with such advanced search capabilities and all the rich metadata? Yes! Because the answer to “what’s the right data” depends on who you are and what use cases you’re trying to solve. Most data catalogs only present the information from the producer’s point of view but fail to cater to the data consumers. The Missing Piece Providing the producer’s point of view through automation and integration of all the technical metadata is definitely a key part of the solution. However, the consumer’s point of view—the tables trusted and used by my organization, the common usage patterns for various business scenarios, the impact upstream changes have on my analyses—is the missing piece that completes the data discovery and understandability puzzle. Most data catalogs don't help users find the data they need; they help users find someone to pester, which is often referred to as a “tap on the shoulder.” This is not true self-service. The Solution We believe that three types of information/metadata are required to make data discovery truly self-serviceable: Technical Metadata This refers to all metadata originating from the data systems, including schemas, lineage, SQL/code, descriptions, data profiles, data quality, etc. Automation and integration keep this information at the user’s fingertips. Challenges There is no standard for metadata across data platforms. Worse yet, many companies build their own custom systems that hold or produce key metadata. Integrating these systems at scale to ingest metadata accurately, reliably, and in a timely manner is an engineering challenge. Business Metadata Each business function operates based on a set of common business definitions, often referred to as “business terms.” Examples include Active Customers, Revenue, Employees, Churn, etc. As a data-driven organization relies heavily on these definitions to make key business decisions, it is paramount for data practitioners to correctly translate between the physical data and business terms. Challenges Many companies lack the tools, processes, and discipline to govern and communicate these business terms. As a result, when serving a business ask, data practitioners often struggle to find the right data for a particular business term or end up producing results that contradict each other. Behavioral Metadata Surfacing the association between people and data is critical to effective data discovery. Users often place their trust in data based on who created or used it. They also prefer to learn how to do their analyses from more experienced “power users.” To that end, we need to encourage the sharing of these data learnings and insights across the company. This would also improve your organization’s data literacy, provide a better understanding of the business, and reduce inconsistencies. Challenges People interact with data in different ways. Some query using the Snowflake console, notebooks, R, or Presto, while others explore using BI tools, dashboards, or even spreadsheets. As a result, learnings and insights are often spread across multiple places, which makes it difficult to associate people with data. It should be fairly clear by now that discovering the right data and understanding what it means is not a mere technical problem.
It requires bringing technical, business, and behavioral metadata together. Doing this without creating an onerous governance process will boost your organization’s data productivity significantly and bring a truly data-driven culture to your company.
Data is the lifeblood of the digital age. Algorithms collect, store, process, and analyze it to create new insights and value. The data life cycle is the process by which data is created, used, and disposed of. It typically includes the following stages: Data collection: Data can be collected from a variety of sources, such as sensors, user input, and public records. Data preparation: Data is often cleaned and processed before it can be analyzed. This may involve removing errors, formatting data consistently, and converting data to a common format. Data analysis: Algorithms are used to analyze data and extract insights. This may involve identifying patterns, trends, and relationships in the data. Data visualization: Data visualization techniques are used to present the results of data analysis in a clear and concise way. Data storage: Data is often stored for future use. This may involve storing data in a database, filesystem, or cloud storage service. Algorithms are used at every stage of the data life cycle. For example, algorithms can be used to: Collect data: Algorithms can be used to filter and collect data from a stream of data, such as sensor data or social media data. Prepare data: Algorithms can be used to clean and process data, such as removing errors, formatting data consistently, and converting data to a common format. Analyze data: Algorithms can be used to analyze data and extract insights, such as identifying patterns, trends, and relationships in the data. Visualize data: Algorithms can be used to create data visualizations, such as charts, graphs, and maps. Store data: Algorithms can be used to compress and encrypt data before storing it. Algorithms play a vital role in the data life cycle. They enable us to collect, store, process, and analyze data efficiently and effectively. Here are some examples of how algorithms are used in the data life cycle: Search engines: Search engines use algorithms to index and rank websites so that users can find the information they are looking for quickly and easily. Social media: Social media platforms use algorithms to recommend content to users based on their interests and past behavior. E-commerce websites: E-commerce websites use algorithms to recommend products to users based on their browsing history and purchase history. Fraud detection: Financial institutions use algorithms to detect fraudulent transactions. Medical diagnosis: Medical professionals use algorithms to diagnose diseases and recommend treatments. Data Data is the lifeblood of the digital age because it powers the technologies and innovations that shape our world. From the social media platforms we use to stay connected to the streaming services we watch to the self-driving cars that are being developed, all of these technologies rely on data to function. Data is collected from various sources, including sensors, devices, and online transactions. Once collected, data is stored and processed using specialized hardware and software. This process involves cleaning, organizing, and transforming the data into a format that can be analyzed. Algorithms Algorithms are used to analyze data and extract insights. Algorithms are mathematical formulas that can be used to perform various tasks, such as identifying patterns, making predictions, and optimizing processes. The insights gained from data analysis can be used to create new products and services, improve existing ones, and make better decisions. 
For example, companies can use data to personalize their marketing campaigns, develop new products that meet customer needs, and improve their supply chains. Data Can Be Collected From a Variety of Sources Sensors: Sensors can be used to collect data about the physical environment, such as temperature, humidity, and movement. For example, smart thermostats use sensors to collect data about the temperature in a room and adjust the thermostat accordingly. User input: Data can also be collected from users, such as through surveys, polls, and website forms. For example, e-commerce websites collect data about customer purchases and preferences in order to improve their product recommendations and marketing campaigns. Public records: Public records, such as census data and government reports, can also be used to collect data. For example, businesses can use census data to identify target markets, and government reports to track industry trends. Here Are Some Additional Examples of Data Collection Sources Social media: Social media platforms collect data about users' activity, such as the posts they like, the people they follow, and the content they share. This data is used to target users with relevant ads and to personalize their user experience. IoT devices: The Internet of Things (IoT) refers to the network of physical objects that are connected to the internet and can collect and transmit data. IoT devices, such as smart home devices and wearables, can be used to collect data about people's daily lives. Business transactions: Businesses collect data about their customers and transactions, such as purchase history and contact information. This data is used to improve customer service, develop new products and services, and target marketing campaigns. Data Can Also Be Collected From a Variety of Different Types of Data Sources Structured data: Structured data is data that is organized in a predefined format, such as a database table. Structured data is easy to store, process, and analyze. Unstructured data: Unstructured data is data that does not have a predefined format, such as text, images, and videos. Unstructured data is more difficult to store, process, and analyze than structured data, but it can contain valuable insights. Data Preparation Data preparation is the process of cleaning and processing data so that it is ready for analysis. This is an important step in any data science project, as it can have a significant impact on the quality of the results. There are a number of different data preparation tasks that may be necessary, depending on the specific data set and the desired outcome. Some common tasks include: Removing errors: Data may contain errors due to human mistakes, technical glitches, or other factors. It is important to identify and remove these errors before proceeding with the analysis. Formatting data consistently: Data may be collected from a variety of sources, and each source may have its own unique format. It is important to format the data consistently so that it can be easily processed and analyzed. Converting data to a common format: Data may be collected in various formats, such as CSV, Excel, and JSON. It is often helpful to convert the data to a common format, such as CSV so that it can be easily processed and analyzed by different tools and software. Handling missing values: Missing values are a common problem in data sets. 
There are a number of different ways to handle missing values, such as removing the rows with missing values, replacing the missing values with a default value, or estimating the missing values using a statistical model. Feature engineering: Feature engineering is the process of creating new features from existing features. This can be done to improve machine learning algorithms' performance or make the data more informative for analysis. Data preparation can be a time-consuming and challenging task, but it is essential for producing high-quality results. By carefully preparing the data, data scientists can increase the accuracy and reliability of their analyses. Here are some additional tips for data preparation: Start by understanding the data: Before you start cleaning and processing the data, it is important to understand what the data represents and how it will be used. This will help you to identify the most important tasks and to make informed decisions about how to handle the data. Use appropriate tools and techniques: There are a number of different data preparation tools and techniques available. Choose the tools and techniques that are most appropriate for your data set and your desired outcome. Document your work: It is important to document your data preparation work so that you can reproduce the results and so that others can understand how the data was prepared. This is especially important if you are working on a team or if you are sharing your data with others. How Algorithm Works An algorithm is a set of instructions that can be used to solve a problem or achieve a goal. Algorithms are used in many different fields, including computer science, mathematics, and engineering. In the context of data, algorithms are used to process and analyze data in order to extract useful information. For example, an algorithm could be used to sort a list of numbers, find the average of a set of values, or identify patterns in a dataset. Algorithms work with data by performing a series of steps on the data. These steps can include arithmetic operations, logical comparisons, and decision-making. The output of an algorithm is typically a new piece of data, such as a sorted list of numbers, a calculated average, or a set of identified patterns. Here is a simple example of an algorithm for calculating the average of a set of numbers: Initialize a variable sum to 0. Iterate over the set of numbers, adding each number to the variable sum. Divide the variable sum by the number of numbers in the set. The result is the average of the set of numbers. This algorithm can be implemented in any programming language and can be used to calculate the average of any set of numbers, regardless of size. More complex algorithms can be used to perform more sophisticated tasks, such as machine learning and natural language processing. These algorithms typically require large datasets to train, and they can be used to make predictions or generate creative text formats. Here are some examples of how algorithms are used with data in the real world: Search engines: Algorithms are used to rank the results of a search query based on the relevance of the results to the query and other factors. Social media: Algorithms are used to filter the content that users see in their feeds based on their interests and past behavior. Recommendation systems: Algorithms are used to recommend products, movies, and other content to users based on their past preferences. 
Fraud detection: Algorithms are used to identify fraudulent transactions and other suspicious activities. Medical diagnosis: Algorithms are used to assist doctors in diagnosing diseases and recommending treatments. These are just a few examples of the many ways that algorithms are used with data in the real world. As the amount of data that we collect and store continues to grow, algorithms will play an increasingly important role in helping us to make sense of that data and to use it to solve problems.
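As a minimal illustration of the averaging algorithm described step by step above, here is a short Python version; the function name and the sample list are invented for the example.
Python
def average(numbers):
    """Compute the average of a non-empty list of numbers."""
    total = 0                    # Step 1: initialize the running sum to 0.
    for value in numbers:        # Step 2: add each number to the sum.
        total += value
    return total / len(numbers)  # Step 3: divide the sum by the count.

# Example usage with an invented sample list:
print(average([4, 8, 15, 16, 23, 42]))  # 18.0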