AI/ML Resources

DZone's Featured AI/ML Resources

Neural Network Representations

By Boluwatife Ben-Adeola

Trained neural networks arrive at solutions that achieve superhuman performance on an increasing number of tasks. It would be at least interesting and probably important to understand these solutions. Interesting, in the spirit of curiosity and getting answers to questions like, “Are there human-understandable algorithms that capture how object-detection nets work?”[a] This would add a new modality of use to our relationship with neural nets from just querying for answers (Oracle Models) or sending on tasks (Agent Models) to acquiring an enriched understanding of our world by studying the interpretable internals of these networks’ solutions (Microscope Models). [1] And important in its use in the pursuit of the kinds of standards that we (should?) demand of increasingly powerful systems, such as operational transparency, and guarantees on behavioral bounds. A common example of an idealized capability we could hope for is “lie detection” by monitoring the model’s internal state. [2] Mechanistic interpretability (mech interp) is a subfield of interpretability research that seeks a granular understanding of these networks. One could describe two categories of mech interp inquiry: Representation interpretability: Understanding what a model sees and how it does; i.e., what information have models found important to look for in their inputs and how is this information represented internally? Algorithmic interpretability: Understanding how this information is used for computation across the model to result in some observed outcome Figure 1: “A Conscious Blackbox," the cover graphic for James C. Scott’s Seeing Like a State (1998) This post is concerned with representation interpretability. Structured as an exposition of neural network representation research [b], it discusses various qualities of model representations which range in epistemic confidence from the obvious to the speculative and the merely desired. Notes: I’ll use "Models/Neural Nets" and "Model Components" interchangeably. A model component can be thought of as a layer or some other conceptually meaningful ensemble of layers in a network. Until properly introduced with a technical definition, I use expressions like “input-properties” and “input-qualities” in place of the more colloquially used “feature.” Now, to some foundational hypotheses about neural network representations. Decomposability The representations of inputs to a model are a composition of encodings of discrete information. That is, when a model looks for different qualities in an input, the representation of the input in some component of the model can be described as a combination of its representations of these qualities. This makes (de)composability a corollary of “encoding discrete information”- the model’s ability to represent a fixed set of different qualities as seen in its inputs. Figure 2: A model layer trained on a task that needs it to care about background colors (trained on only blue and red) and center shapes (only circles and triangles) The component has dedicated a different neuron to the input qualities: "background color is composed of red," "background color is composed of blue," "center object is a circle,” and “center object is a triangle.” Consider the alternative: if a model didn't identify any predictive discrete qualities of inputs in the course of training. To do well on a task, the network would have to work like a lookup table with its keys as the bare input pixels (since it can’t glean any discrete properties more interesting than “the ordered set of input pixels”) pointing to unique identifiers. We have a name for this in practice: memorizing. Therefore, saying, “Model components learn to identify useful discrete qualities of inputs and compose them to get internal representations used for downstream computation,” is not far off from saying “Sometimes, neural nets don’t completely memorize.” Figure 3: An example of how learning discrete input qualities affords generalization or robustness This example test input, not seen in training, has a representation expressed in the learned qualities. While the model might not fully appreciate what “purple” is, it’ll be better off than if it was just trying to do a table lookup for input pixels. Revisiting the hypothesis: "The representations of inputs to a model are a composition of encodings of discrete information." While, as we’ve seen, this verges on the obvious; it provides a template for introducing stricter specifications deserving of study. The first of these specification revisits looks at “…are a composition of encodings…” What is observed, speculated, and hoped for about the nature of these compositions of the encodings? Linearity To recap decomposition, we expect (non-memorizing) neural networks to identify and encode varied information from input qualities/properties. This implies that any activation state is a composition of these encodings. Figure 4: What the decomposability hypothesis suggests What is the nature of this composition? In this context, saying a representation is linear suggests the information of discrete input qualities are encoded as directions in activation space and they are composed into a representation by a vector sum: We’ll investigate both claims. Claim #1: Encoded Qualities Are Directions in Activation Space Composability already suggests that the representation of input in some model components (a vector in activation space) is composed of discrete encodings of input qualities (other vectors in activation space). The additional thing said here is that in a given input-quality encoding, we can think of there being some core essence of the quality which is the vector’s direction. This makes any particular encoding vector just a scaled version of this direction (unit vector.) Figure 5: Various encoding vectors for the red-ness quality in the input They are all just scaled representations of some fundamental red-ness unit vector, which specifies direction. This is simply a generalization of the composability argument that says neural networks can learn to make their encodings of input qualities "intensity"-sensitive by scaling some characteristic unit vector. Alternative Impractical Encoding Regimes Figure 6a An alternative encoding scheme could be that all we can get from models are binary encodings of properties; e.g., “The Red values in this RGB input are Non-zero.” This is clearly not very robust. Figure 6b Another is that we have multiple unique directions for qualities that could be described by mere differences in scale of some more fundamental quality: “One Neuron for "kind-of-red" for 0-127 in the RGB input, another for "really-red" for 128-255 in the RGB input.” We’d run out of directions fairly quickly. Claim #2: These Encodings Are Composed as a Vector Sum Now, this is the stronger of the two claims as it is not necessarily a consequence of anything introduced thus far. Figure 7: An example of 2-property representation Note: We assume independence between properties, ignoring the degenerate case where a size of zero implies the color is not red (nothing). A vector sum might seem like the natural (if not only) thing a network could do to combine these encoding vectors. To appreciate why this claim is worth verifying, it’ll be worth investigating if alternative non-linear functions could also get the job done. Recall that the thing we want is a function that combines these encodings at some component in the model in a way that preserves information for downstream computation. So this is effectively an information compression problem. As discussed in Elhage et al [3a], the following non-linear compression scheme could get the job done: Where we seek to compress values x and y into t. The value of Z is chosen according to the required floating-point precision needed for compressions. Python # A Python Implementation from math import floor def compress_values(x1, x2, precision=1): z = 10 ** precision compressed_val = (floor(z * x1) + x2) / z return round(compressed_val, precision * 2) def recover_values(compressed_val, precision=1): z = 10 ** precision x2_recovered = (compressed_val * z) - floor(compressed_val * z) x1_recovered = compressed_val - (x2_recovered / z) return round(x1_recovered, precision), round(x2_recovered, precision) # Now to compress vectors a and b a = [0.3, 0.6] b = [0.2, 0.8] compressed_a_b = [compress_values(a[0], b[0]), compress_values(a[1], b[1])] # Returned [0.32, 0.68] recovered_a, recovered_b = ( [x, y] for x, y in zip( recover_values(compressed_a_b[0]), recover_values(compressed_a_b[1]) ) ) # Returned ([0.3, 0.6], [0.2, 0.8]) assert all([recovered_a == a, recovered_b == b]) As demonstrated, we’re able to compress and recover vectors a and b just fine, so this is also a viable way of compressing information for later computation using non-linearities like the floor() function that neural networks can approximate. While this seems a little more tedious than just adding vectors, it shows the network does have options. This calls for some evidence and further arguments in support of linearity. Evidence of Linearity The often-cited example of a model component exhibiting strong linearity is the embedding layer in language models [4], where relationships like the following exist between representations of words: This example would hint at the following relationship between the quality of $plurality$ in the input words and the rest of their representation: Okay, so that’s some evidence for one component in a type of neural network having linear representations. The broad outline of arguments for this being prevalent across networks is that linear representations are both the more natural and performant [3b][3a] option for neural networks to settle on. How Important for Interpretability Is It That This Is True? If non-linear compression is prevalent across networks, there are two alternative regimes in which networks could operate: Computation is still mostly done on linear variables: In this regime, while the information is encoded and moved between components non-linearly, the model components would still decompress the representations to run linear computations. From an interpretability standpoint, while this needs some additional work to reverse engineer the decompression operation, this wouldn't pose too high a barrier.Figure 8:Non-linear compression and propagation intervened by linear computation Computation is done in a non-linear state: The model figures out a way to do computations directly on the non-linear representation. This would pose a challenge needing new interpretability methods. However, based on arguments discussed earlier about model architecture affordances this is expected to be unlikely. Figure 9: Direct non-linear computation Features As promised in the introduction, after avoiding the word “feature” this far into the post, we’ll introduce it properly. As a quick aside, I think the engagement of the research community on the topic of defining what we mean when we use the word “feature” is one of the things that makes mech interp, as a pre-paradigmatic science, exciting. While different definitions have been proposed [3c] and the final verdict is by no means out, in this post and others to come on mech interp, I’ll be using the following: "The features of a given neural network constitute a set of all the input qualities the network would dedicate a neuron to if it could." We’ve already discussed the idea of networks necessarily encoding discrete qualities of inputs, so the most interesting part of the definition is, “...would dedicate a neuron to if it could.” What Is Meant by “...Dedicate a Neuron To...”? In a case where all quality-encoding directions are unique one-hot vectors in activation space ([0, 1] and [1, 0], for example) the neurons are said to be basis-aligned; i.e., one neuron’s activation in the network independently represents the intensity of one input quality. Figure 10: Example of a representation with basis-aligned neurons Note that while sufficient, this property is not necessary for lossless compression of encodings with vector addition. The core requirement is that these feature directions be orthogonal. The reason for this is the same as when we explored the non-linear compression method: we want to completely recover each encoded feature downstream. Basis Vectors Following the Linearity hypothesis, we expect the activation vector to be a sum of all the scaled feature directions: Given an activation vector (which is what we can directly observe when our network fires), if we want to know the activation intensity of some feature in the input, all we need is the feature’s unit vector, feature^j_d: (where the character “.” in the following expression is the vector dot product.) If all the feature unit vectors of that network component (making up the set, Features_d) are orthogonal to each other: And, for any vector: These simplify our equation to give an expression for our feature intensity feature^j_i: Allowing us to fully recover our compressed feature: All that was to establish the ideal property of orthogonality between feature directions. This means even though the idea of “one neuron firing by x-much == one feature is present by x-much” is pretty convenient to think about, there are other equally performant feature directions that don’t have their neuron-firing patterns aligning this cleanly with feature patterns. (As an aside, it turns out basis-aligned neurons don’t happen that often. [3d]) Fig 11: Orthogonal feature directions from non-basis-aligned neurons With this context, the request: ”dedicate a neuron to…” might seem arbitrarily specific. Perhaps “dedicate an extra orthogonal direction vector” would be sufficient to accommodate an additional quality. But as you probably already know, orthogonal vectors in a space don’t grow on trees. A 2-dimensional space can only have 2 orthogonal vectors at a time, for example. So to make more room, we might need an extra dimension, i.e [X X] -> [X X X] which is tantamount to having an extra neuron dedicated to this feature. How Are These Features Stored in Neural Networks? To touch grass quickly, what does it mean when a model component has learned 3 orthogonal feature directions {[1 0 0], [0 1 0], [0 0 1]} for compressing an input vector [a b c]? To get the compressed activation vector, we expect a series of dot products with each feature direction to get our feature scale. Now we just have to sum up our scaled-up feature directions to get our “compressed” activation state. In this toy example, the features are just the vector values so lossless decompressing gets us what we started with. The question is: what does this look like in a model? The above sequence of transformations of dot products followed by a sum is equivalent to the operations of the deep learning workhorse: matrix multiplication. The earlier sentence, “…a model component has learned 3 orthogonal feature directions,” should have been a giveaway. Models store their learnings in weights, and so our feature vectors are just the rows of this layer’s learned weight matrix, W. Why didn’t I just say the whole time, “Matrix multiplication. End of section.” Because we don’t always have toy problems in the real world. The learned features aren’t always stored in just one set of weights. It could (and usually does) involve an arbitrarily long sequence of linear and non-linear compositions to arrive at some feature direction (but the key insight of decompositional linearity is that this computation can be summarised by a direction used to compose some activation). The promise of linearity we discussed only has to do with how feature representations are composed. For example, some arbitrary vector is more likely to not be hanging around for discovery by just reading one row of a layer’s weight matrix, but the computation to encode that feature is spread across several weights and model components. So we had to address features as arbitrary strange directions in activation space because they often are. This point brings the proposed dichotomy between representation and algorithmic interpretability into question. Back to our working definition of features: "The features of a given neural network constitute a set of all the input qualities the network would dedicate a neuron to if it could." On the Conditional Clause: “…Would Dedicate a Neuron to if It Could...” You can think of this definition of a feature as a bit of a set-up for an introduction to a hypothesis that addresses its counterfactual: What happens when a neural network cannot provide all its input qualities with dedicated neurons? Superposition Thus far, our model has done fine on the task that required it to compress and propagate 2 learned features — “size” and “red-ness” — through a 2-dimensional layer. What happens when a new task requires the compression and propagation of an additional feature like the x-displacement of the center of the square? Figure 12 This shows our network with a new task, requiring it to propagate one more learned property of the input: center x-displacement. We’ve returned to using neuron-aligned bases for convenience. Before we go further with this toy model, it would be worth thinking through if there are analogs of this in large real-world models. Let’s take the large language model GPT2 small [5]. Do you think, if you had all week, you could think of up to 769 useful features of an arbitrary 700-word query that would help predict the next token (e.g., “is a formal letter," “contains how many verbs," “is about about ‘Chinua Achebe,’” etc.)? Even if we ignored the fact that feature discovery was one of the known superpowers of neural networks [c] and assumed GPT2-small would also end up with only 769 useful input features to encode, we’d have a situation much like our toy problem above. This is because GPT2 has —at the narrowest point in its architecture— only 768 neurons to work with, just like our toy problem has 2 neurons but needs to encode information about 3 features. [d] So this whole “model component encodes more features it has neurons” business should be worth looking into. It probably also needs a shorter name. That name is the Superposition hypothesis. Considering the above thought experiment with GPT2 Small, it would seem this hypothesis is just stating the obvious- that models are somehow able to represent more input qualities (features) than they have dimensions for. What Exactly Is Hypothetical About Superposition? There’s a reason I introduced it this late in the post: it depends on other abstractions that aren't necessarily self-evident. The most important is the prior formulation of features. It assumes linear decomposition- the expression of neural net representations as sums of scaled directions representing discrete qualities of their inputs. These definitions might seem circular, but they’re not if defined sequentially: If you conceive of neural networks as encoding discrete information of inputs called Features as directions in activation space, then when we suspect the model has more of these features than it has neurons, we call this Superposition. A Way Forward As we’ve seen, it would be convenient if the features of a model were aligned with neurons and necessary for them to be orthogonal vectors to allow lossless recovery from compressed representations. So to suggest this isn't happening poses difficulties to interpretation and raises questions on how networks can pull this off anyway. Further development of the hypothesis provides a model for thinking about why and how superposition happens, clearly exposes the phenomenon in toy problems, and develops promising methods for working around barriers to interpretability [6]. More on this in a future post. Footnotes [a] That is, algorithms more descriptive than “Take this Neural Net architecture and fill in its weights with these values, then do a forward pass.” [b] Primarily from ideas introduced in Toy Models of Superposition [c] This refers specifically to the codification of features as their superpower. Humans are pretty good at predicting the next token in human text; we’re just not good at writing programs for extracting and representing this information vector space. All of that is hidden away in the mechanics of our cognition. [d] Technically, the number to compare the 768-dimension residual stream width to is the maximum number of features we think *any* single layer would have to deal with at a time. If we assume equal computational workload between layers and assume each batch of features was built based on computations on the previous, for the 12-layer GPT2 model, this would be 12 * 768 = 9,216 features you’d need to think up. References [1] Chris Olah on Mech Interp - 80000 Hours [2] Interpretability Dreams [3] Toy Models of Superposition [3a] Nonlinear Compression [3b] Features as Directions [3c] What are Features? [3d] Definitions and Motivation [4] Linguistic regularities in continuous space word representations: Mikolov, T., Yih, W. and Zweig, G., 2013. Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, pp. 746--751. [5] Language Models are Unsupervised Multitask Learners: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever [6] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning More

Evolution of Privacy-Preserving AI: From Protocols to Practical Implementations

By Petr Emelianov

Year by year, artificial intelligence evolves and becomes more efficient for solving everyday human tasks. But at the same time, it increases the possibility of personal information misuse, reaching unprecedented levels of power and speed in analyzing and spreading individuals' data. In this article, I would like to take a closer look at the strong connection between artificial intelligence systems and machine learning and their use of increasingly private and sensitive data. Together, we'll explore existing privacy risks, discuss traditional approaches to privacy in machine learning, and analyze ways to overcome security breaches. Importance of Privacy in AI It is no secret that today, AI is extensively used in many spheres, including marketing. NLP, or Natural Language Processing, interprets human language and is used in voice assistants and chatbots, understanding accents and emotions; it links social media content to engagement. Machine learning employs algorithms to analyze data, improve performance, and enable AI to make decisions without human intervention. Deep Learning relies on neural networks and uses extensive datasets for informed choices. These AI types often collaborate, posing challenges to data privacy. AI collects data intentionally, where users provide information, or unintentionally, for instance, through facial recognition. The problem arises when unintentional data collection leads to unexpected uses, compromising privacy. For example, discussing pet food or more intimate purchases around a phone can lead to targeted ads, revealing unintentional data gathering. AI algorithms, while being intelligent, may inadvertently capture information and subject it to unauthorized use. Thus, video doorbells with facial identification intended for family recognition may unintentionally collect data about unrelated individuals, causing neighbors to worry about surveillance and data access. Bearing in mind the above, it is crucially important to establish a framework for ethical decision-making regarding the use of new AI technologies. Addressing privacy challenges and contemplating the ethics of technology is imperative for the enduring success of AI. One of the main reasons for that is that finding a balance between technological innovation and privacy concerns will foster the development of socially responsible AI, contributing to the long-term creation of public value and private security. Traditional Approach Risks Before we proceed with efficient privacy-preserving techniques, let us take a look at traditional approaches and the problems they may face. Traditional approaches to privacy and machine learning are centered mainly around two concepts: user control and data protection. Users want to know who collects their data, for what purpose, and how long it will be stored. Data protection involves anonymized and encrypted data, but even here, the gaps are inevitable, especially in machine learning, where decryption is often necessary. Another issue is that machine learning involves multiple stakeholders, creating a complex web of trust. Trust is crucial when sharing digital assets, such as training data, inference data, and machine learning models across different entities. Just imagine that there is an entity that owns the training data, while another set of entities may own the inference data. The third entity provides a machine learning server running on the inference, performed by a model owned by someone else. Additionally, it operates on infrastructure from an extensive supply chain involving many parties. Due to this, all the entities must demonstrate trust in each other within a complex chain. Managing this web of trust becomes increasingly difficult. Examples of Security Breaches As we rely more on communication technologies using machine learning, the chance of data breaches and unauthorized access goes up. Hackers might try to take advantage of vulnerabilities in these systems to get hold of personal data, such as name, address, and financial information, which can result in fund losses and identity theft. A report on the malicious use of AI outlines three areas of security concern: expansion of existing threats, new attack methods, and changes in the typical character of threats. Examples of malicious AI use include BEC attacks using deepfake technology, contributing to social engineering tactics. AI-assisted cyber-attacks, demonstrated by IBM's DeepLocker, show how AI can enhance ransomware attacks by making decisions based on trends and patterns. Notably, TaskRabbit experienced an AI-assisted cyber-attack, where an AI-enabled botnet executed a DDoS attack, leading to a data breach which affected 3.75 million customers. Moreover, increased online shopping is fueling card-not-present (CNP) fraud, combined with rising synthetic identity and identity theft issues. Predicted losses from it could reach $200 billion by 2024, with transaction volumes rising over 23%. Privacy-Preserving Machine Learning This is when privacy-preserving machine learning comes in with a solution. Among the most effective techniques are federated learning, homomorphic encryption, and differential privacy. Federated learning allows separate entities to collectively train a model without sharing explicit data. In turn, homomorphic encryption enables machine learning on encrypted data throughout the process and differential privacy ensures that calculation outputs cannot be tied to individual data presence. These techniques, combined with trusted execution environments, can effectively address the challenges at the intersection of privacy and machine learning. Privacy Advantages of Federated Learning As you can see, classical machine learning models lack the efficiency to implement AI systems and IoT practices securely when compared to privacy-preserving machine learning techniques, particularly federated learning. Being a decentralized version of machine learning, FL helps make AI security-preserving techniques more reliable. In traditional methods, sensitive user data is sent to centralized servers for training, posing numerous privacy concerns, and federated learning addresses this by allowing models to be trained locally on devices, ensuring user data security. Enhanced Data Privacy and Security Federated learning, with its collaborative nature, treats each IoT device on the edge as a unique client, training models without transmitting raw data. This ensures that during the federated learning process, each IoT device only gathers the necessary information for its task. By keeping raw data on the device and sending only model updates to the central server, federated learning safeguards private information, minimizes the risk of personal data leakage, and ensures secure operations. Improved Data Accuracy and Diversity Another important issue is that centralized data used to train a model may not accurately represent the full spectrum of data that the model will encounter. In contrast, training models on decentralized data from various sources and exposing them to a broader range of information enhances the model's ability to generalize new data, handle variations, and reduce bias. Higher Adaptability One more advantage federated learning models exhibit is a notable capability to adapt to new situations without requiring retraining, which provides extra security and reliability. Using insights from previous experiences, these models can make predictions and apply knowledge gained in one field to another. For instance, if the model becomes more proficient in predicting outcomes in a specific domain, it can seamlessly apply this knowledge to another field, enhancing efficiency, reducing costs, and expediting processes. Encryption Techniques To enhance privacy in FL, even more efficient encryption techniques are often used. Among them are homomorphic encryption and secure multi-party computation. These methods ensure that data stays encrypted and secure during communication and model aggregation. The homomorphic encryption allows computations on encrypted data without decryption. For example, if a user wants to upload data to a cloud-based server, they can encrypt it, turning it into ciphertext, and only after that upload it. The server would then process that data without decrypting it, and then the user would get it back. After that, the user would decrypt it with their secret key. Multi-party computation, or MPC, enables multiple parties, each with their private data, to evaluate a computation without revealing any of the private data held by each party. A multi-party computation protocol ensures both privacy and accuracy. The private information held by the parties cannot be inferred from the execution of the protocol. If any party within the group decides to share information or deviates from the instructions during the protocol execution, the MPC will not allow it to force the other parties to output an incorrect result or leak any private information. Final Considerations Instead of the conclusion, I would like to stress the importance and urgency of embracing advanced security approaches in ML. For effective and long-term outcomes in AI safety and security, there should be coordinated efforts between the AI development community and legal and policy institutions. Building trust and establishing proactive channels for collaboration in developing norms, ethics, standards, and laws is crucial to avoid reactive and potentially ineffective responses from both the technical and policy sectors. I would also like to quote the authors of the report mentioned above, who propose the following recommendations to face security challenges in AI: Policymakers should collaborate closely with technical researchers to explore, prevent, and mitigate potential malicious applications of AI. AI researchers and engineers should recognize the dual-use nature of their work, considering the potential for misuse and allowing such considerations to influence research priorities and norms. They should also proactively engage with relevant stakeholders when harmful applications are foreseeable. Identify best practices from mature research areas, like computer security, and apply them to address dual-use concerns in AI. Actively work towards broadening the involvement of stakeholders and domain experts in discussions addressing these challenges. Hope this article encourages you to investigate the topic on your own, contributing to a more secure digital world. More

NIST AI Risk Management Framework: Developer’s Handbook

By Josephine Eskaline Joyce

CORE

AI and Rules for Agile Microservices in Minutes

By Val Huber

Exploring Text Generation With Python and GPT-4

By Ashok Gorantla

CORE

Advanced Brain-Computer Interfaces With Java

In the first part of this series, we introduced the basics of brain-computer interfaces (BCIs) and how Java can be employed in developing BCI applications. In this second part, let's delve deeper into advanced concepts and explore a real-world example of a BCI application using NeuroSky's MindWave Mobile headset and their Java SDK. Advanced Concepts in BCI Development Motor Imagery Classification: This involves the mental rehearsal of physical actions without actual execution. Advanced machine learning algorithms like deep learning models can significantly improve classification accuracy. Event-Related Potentials (ERPs): ERPs are specific patterns in brain signals that occur in response to particular events or stimuli. Developing BCI applications that exploit ERPs requires sophisticated signal processing techniques and accurate event detection algorithms. Hybrid BCI Systems: Hybrid BCI systems combine multiple signal acquisition methods or integrate BCIs with other physiological signals (like eye tracking or electromyography). Developing such systems requires expertise in multiple signal acquisition and processing techniques, as well as efficient integration of different modalities. Real-World BCI Example Developing a Java Application With NeuroSky's MindWave Mobile NeuroSky's MindWave Mobile is an EEG headset that measures brainwave signals and provides raw EEG data. The company provides a Java-based SDK called ThinkGear Connector (TGC), enabling developers to create custom applications that can receive and process the brainwave data. Step-by-Step Guide to Developing a Basic BCI Application Using the MindWave Mobile and TGC Establish Connection: Use the TGC's API to connect your Java application with the MindWave Mobile device over Bluetooth. The TGC provides straightforward methods for establishing and managing this connection. Java ThinkGearSocket neuroSocket = new ThinkGearSocket(this); neuroSocket.start(); Acquire Data: Once connected, your application will start receiving raw EEG data from the device. This data includes information about different types of brainwaves (e.g., alpha, beta, gamma), as well as attention and meditation levels. Java public void onRawDataReceived(int rawData) { // Process raw data } Process Data: Use signal processing techniques to filter out noise and extract useful features from the raw data. The TGC provides built-in methods for some basic processing tasks, but you may need to implement additional processing depending on your application's needs. Java public void onEEGPowerReceived(EEGPower eegPower) { // Process EEG power data } Interpret Data: Determine the user's mental state or intent based on the processed data. This could involve setting threshold levels for certain values or using machine learning algorithms to classify the data. For example, a high attention level might be interpreted as the user wanting to move a cursor on the screen. Java public void onAttentionReceived(int attention) { // Interpret attention data } Perform Action: Based on the interpretation of the data, have your application perform a specific action. This could be anything from moving a cursor, controlling a game character, or adjusting the difficulty level of a task. Java if (attention > ATTENTION_THRESHOLD) { // Perform action } Improving BCI Performance With Java Optimize Signal Processing: Enhance the quality of acquired brain signals by implementing advanced signal processing techniques, such as adaptive filtering or blind source separation. Employ Advanced Machine Learning Algorithms: Utilize state-of-the-art machine learning models, such as deep neural networks or ensemble methods, to improve classification accuracy and reduce user training time. Libraries like DeepLearning4j or TensorFlow Java can be employed for this purpose. Personalize BCI Models: Customize BCI models for individual users by incorporating user-specific features or adapting the model parameters during operation. This can be achieved using techniques like transfer learning or online learning. Implement Efficient Real-Time Processing: Ensure that your BCI application can process brain signals and generate output commands in real time. Optimize your code, use parallel processing techniques, and leverage Java's concurrency features to achieve low-latency performance. Evaluate and Validate Your BCI Application: Thoroughly test your BCI application on a diverse group of users and under various conditions to ensure its reliability and usability. Employ standard evaluation metrics and follow best practices for BCI validation. Conclusion Advanced BCI applications require a deep understanding of brain signal acquisition, processing, and classification techniques. Java, with its extensive libraries and robust performance, is an excellent choice for implementing such applications. By exploring advanced concepts, developing real-world examples, and continuously improving BCI performance, developers can contribute significantly to this revolutionary field.

By Arun Pandey

CORE

Code Search Using Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is becoming a popular paradigm for bridging the knowledge gap between pre-trained Large Language models and other data sources. For developer productivity, several code copilots help with code completion. Code Search is an age-old problem that can be rethought in the age of RAG. Imagine you are trying to contribute to a new code base (a GitHub repository) for a beginner task. Knowing which file to change and where to make the change can be time-consuming. We've all been there. You're enthusiastic about contributing to a new GitHub repository but overwhelmed. Which file do you modify? Where do you start? For newcomers, the maze of a new codebase can be truly daunting. Retrieval Augmented Generation for Code Search The technical solution consists of 2 parts. 1. Build a vector index generating embedding for every file (eg. .py .java.) 2. Query the vector index and leverage the code interpreter to provide instructions by calling GPT-x. Building the Vector Index Once you have a local copy of the GitHub repo, akin to a crawler of web search index, Traverse every file matching a regex (*.py, *.sh, *.java) Read the content and generate an embedding. Using OpenAI’s Ada embedding or Sentence BERT embedding (or both.) Build a vector store using annoy. Instead of choosing a single embedding, if we build multiple vector stores based on different embeddings, it improves the quality of retrieval. (anecdotally) However, there is a cost of maintaining multiple indices. 1. Prepare Your Requirements.txt To Install Necessary Python Packages pip install -r requirements.txt Python annoy==1.17.3 langchain==0.0.279 sentence-transformers==2.2.2 openai==0.28.0 open-interpreter==0.1.6 2. Walk Through Every File Python ### Traverse through every file in the directory def get_files(path): files = [] for r, d, f in os.walk(path): for file in f: if ".py" in file or ".sh" in file or ".java" in file: files.append(os.path.join(r, file)) return files 3. Get OpenAI Ada Embeddings Python embeddings = OpenAIEmbeddings(openai_api_key=" <Insert your key>") # we are getting embeddings for the contents of the file def get_file_embeddings(path): try: text = get_file_contents(path) ret = embeddings.embed_query(text) return ret except: return None def get_file_contents(path): with open(path, 'r') as f: return f.read() files = get_files(LOCAL_REPO_GITHUB_PATH) embeddings_dict = {} s = set() for file in files: e = get_file_embeddings(file) if (e is None): print ("Error in generating an embedding for the contents of file: ") print (file) s.add(file) else: embeddings_dict[file] = e 4. Generate the Annoy Index In Annoy, the metric can be "angular," "euclidean," "manhattan," "hamming," or "dot." Python annoy_index_t = AnnoyIndex(1536, 'angular') index_map = {} i = 0 for file in embeddings_dict: annoy_index_t.add_item(i, embeddings_dict[file]) index_map[i] = file i+=1 annoy_index_t.build(len(files)) name = "CodeBase" + "_ada.ann" annoy_index_t.save(name) ### Maintains a forward map of id -> file name with open('index_map' + "CodeBase" + '.txt', 'w') as f: for idx, path in index_map.items(): f.write(f'{idx}\t{path}\n') We can see the size of indices is proportional to the number of files in the local repository. Size of annoy index generated for popular GitHub repositories. Repository File Count (approx as its growing) Size Langchain 1983+ 60 MB Llama Index 779 14 MB Apache Solr 5000+ 328 MB Local GPT 8 165 KB Generate Response With Open Interpreter (Calls GPT-4) Once the index is built, a simple command line python script can be implemented to ask questions right from the terminal about your codebase. We can leverage Open Interpreter. One of the reasons to use Open Interpreter instead of us calling GPT-4 or other LLMs directly is because Open-Interpreter allows us to make changes to your file and run commands. It handles interaction with GPT-4. Python embeddings = OpenAIEmbeddings(openai_api_key="Your OPEN AI KEY") query = sys.argv[1] ### Your question depth = int(sys.argv[2]) ## Number of documents to retrieve from Vector SEarch name = sys.argv[3] ## Name of your index ### Get Top K files based on nearest neighbor search def query_top_files(query, top_n=4): # Load annoy index and index map t = AnnoyIndex(EMBEDDING_DIM, 'angular') t.load(name+'_ada.ann') index_map = load_index_map() # Get embeddings for the query query_embedding = get_embeddings_for_text(query) # Search in the Annoy index indices, distances = t.get_nns_by_vector(query_embedding, top_n, include_distances=True) # Fetch file paths for these indices (forward index helps) files = [(index_map[idx], dist) for idx, dist in zip(indices, distances)] return files ### Use Open Interpreter to make the call to GPT-4 import interpreter results = query_top_files(query, depth) file_content = "" s = set() print ("Files you might want to read:") for path, dist in results: content = get_file_contents(path) file_content += "Path : " file_content += path if (path not in s): print (path) s.add(path) file_content += "\n" file_content += content print( "open interpreter's recommendation") message = "Take a deep breath. I have a task to complete. Please help with the task below and answer my question. Task : READ THE FILE content below and their paths and answer " + query + "\n" + file_content interpreter.chat(message) print ("interpreter's recommendation done. (Risk: LLMs are known to hallucinate)") Anecdotal Results Langchain Question: Where should I make changes to add a new summarization prompt? The recommended files to change are; refine_prompts.py stuff_prompt.py map_reduce_prompt.py entity_summarization.py All of these files are indeed related to the summarization prompt in langchain. Local GPT Question: Which files should I change, and how do I add support to the new model Falcon 80B? Open interpreter identifies the files to be changed and gives specific step-by-step instruction for adding Falcon 80 b model to the list of models in constants.py and adding support in the user interface of localGPT_UI.py. For specific prompt templates, it recommends to modify the method get_prompt_template in prompt_template_utils.py. The complete code can be found here. Conclusion The advantages of a simple RAG solution like this will help with: Accelerated Onboarding: New contributors can quickly get up to speed with the codebase, reducing the onboarding time. Reduced Errors: With specific guidance, newcomers are less likely to make mistakes or introduce bugs. Increased Engagement: A supportive tool can encourage more contributions from the community, especially those hesitant due to unfamiliarity with the codebase. Continuous Learning: Even for experienced developers, the tool can be a means to discover and learn about lesser-known parts of the codebase.

By Raghavan Muthuregunathan

From Batch ML To Real-Time ML

Real-time machine learning refers to the application of machine learning algorithms that continuously learn from incoming data and make predictions or decisions in real-time. Unlike batch machine learning, where data is collected over a period and processed in batches offline, real-time ML operates instantaneously on streaming data, allowing for immediate responses to changes or events. Common use cases include fraud detection in financial transactions, predictive maintenance in manufacturing, recommendation systems in e-commerce, and personalized content delivery in media. Challenges in building real-time ML capabilities include managing high volumes of streaming data efficiently, ensuring low latency for timely responses, maintaining model accuracy and performance over time, and addressing privacy and security concerns associated with real-time data processing. This article delves into these concepts and provides insights into how organizations can overcome these challenges to deploy effective real-time ML systems. Use Cases Now we have explained the difference between batch ML and real-time ML, it's worth mentioning that in real-life use cases, you can have batch ML, real-time ML, or in between batch and real-time. For example, you can have scenarios where you have real-time inference with batch features, real-time inference with real-time features, or real-time inference with batch features and real-time features. Continuous machine learning is beyond the scope of this article, but you can apply real-time feature solutions to continuous machine learning (CML) too. Hybrid approaches that combine real-time and batch-learning aspects offer a flexible solution to address various requirements and constraints in different applications. Here are some expanded examples: Uses Cases Batch Real-Time Fraud detection in banking Initially, a fraud detection model can be trained offline using a large historical dataset of transactions. This batch training allows the model to learn complex patterns of fraudulent behavior over time, leveraging the entirety of available historical data. Once the model is deployed, it continues to learn in real-time as new transactions occur. Each transaction is processed in real-time, and the model is updated periodically (e.g., hourly or daily) using batches of recent transaction data. This real-time updating ensures that the model can quickly adapt to emerging fraud patterns without sacrificing computational efficiency. Recommendation systems in e-commerce A recommendation system may be initially trained offline using a batch of historical user interaction data, such as past purchases, clicks, and ratings. This batch training allows the model to learn user preferences and item similarities effectively. Once the model is deployed, it can be fine-tuned in real-time as users interact with the system. For example, when a user makes a purchase or provides feedback on a product, the model can be updated immediately to adjust future recommendations for that user. This real-time personalization enhances user experience and engagement without requiring retraining the entire model with each interaction. Natural Language Processing (NLP) applications NLP models, such as sentiment analysis or language translation models, can be trained offline using large corpora of text data. Batch training allows the model to learn semantic representations and language structures from diverse text sources. Once deployed, the model can be fine-tuned in real-time using user-generated text data, such as customer reviews or live chat interactions. Real-time fine-tuning enables the model to adapt to domain-specific or user-specific language nuances and evolving trends without requiring retraining from scratch. In each of these examples, the hybrid approach combines the depth of analysis provided by batch learning with the adaptability of real-time learning, resulting in more robust and responsive machine learning systems. The choice between real-time and batch-learning elements depends on the specific requirements of the application, such as data volume, latency constraints, and the need for continuous adaptation. What Are the Main Components of a Real-Time ML Pipeline? A real-time machine learning (ML) pipeline typically consists of several components working together to enable the continuous processing of data and the deployment of ML models with minimal latency. Here are the main components of such a pipeline: 1. Data Ingestion This component is responsible for collecting data from various sources in real-time. It could involve streaming data from sensors, databases, APIs, or other sources. 2. Streaming Data Processing and Feature Engineering Once the data is ingested, it needs to be processed in real-time. This component involves streaming data processing frameworks that handle the data streams efficiently. Features extracted from raw data are crucial for building ML models. This component involves transforming the raw data into meaningful features that can be used by the ML models. Feature engineering might include techniques like normalization, encoding categorical variables, and creating new features. 3. Model Training Training typically occurs at regular intervals, with the frequency varying between near-real-time, which involves more frequent time frames than batch training, or online-real-time training. 4. Model Inference This component involves deploying the ML models and making predictions in real time. The deployed models should be optimized for low latency inference, and they need to scale well to handle varying loads. 5. Scalability and Fault Tolerance Real-time ML pipelines must be scalable to handle large volumes of data and fault-tolerant to withstand failures gracefully. This often involves deploying the pipeline across distributed systems and implementing mechanisms for fault recovery and data replication. Challenges for Building Real-Time ML Pipelines Low Latency Requirement Real-time pipelines must process data and make predictions within strict time constraints, often in milliseconds. Achieving low latency requires optimizing every component of the pipeline, including data ingestion, pre-processing, model inference, and output delivery. Scalability Real-time pipelines must handle varying workloads and scale to accommodate increasing data volumes and computational demands. Designing scalable architectures involves choosing appropriate technologies and distributed computing strategies to ensure efficient resource utilization and horizontal scalability. Feature Engineering Generating features in real time from streaming data can be complex and resource-intensive. Designing efficient feature extraction and transformation pipelines that adapt to changing data distributions and maintain model accuracy over time is a key challenge. Security Robust authentication, authorization, and secure communication mechanisms are essential for real-time ML. Having effective incident response and monitoring capabilities enables organizations to detect and respond to security incidents promptly, bolstering the overall resilience of real-time ML pipelines against security threats. By addressing these security considerations comprehensively, organizations can build secure real-time ML pipelines that protect sensitive data and assets effectively. Cost Optimization Building and operating real-time ML pipelines can be costly, especially when using cloud-based infrastructure or third-party services. Optimizing resource utilization, selecting cost-effective technologies, and implementing auto-scaling and resource provisioning strategies are essential for controlling operational expenses. Robustness and Fault Tolerance Real-time pipelines must be resilient to failures and ensure continuous operation under adverse conditions. Implementing fault tolerance mechanisms, such as data replication, checkpointing, and automatic failover, is critical for maintaining system reliability and availability. Integration with Existing Systems Integrating real-time ML pipelines with existing IT infrastructure, data sources, and downstream applications requires careful planning and coordination. Ensuring compatibility, interoperability, and seamless data flow between different components of the system is essential for successful deployment and adoption. Addressing these challenges requires a combination of domain expertise, software engineering skills, and knowledge of distributed systems, machine learning algorithms, and cloud computing technologies. Opting for solutions that streamline operations by minimizing the number of tools involved can be a game-changer. This approach not only slashes integration efforts but also trims down maintenance costs and operational overheads while ushering in lower latency—a crucial factor in real-time ML applications. By consolidating feature processing and storage into a single, high-speed key-value store, with a real-time ML model serving, Hazelcast simplifies the AI landscape, reducing complexity and ensuring seamless data flow. The Future of Real-Time ML The future of real-time machine learning (ML) is closely intertwined with advancements in vector databases and the emergence of Relative Attribute Graphs (RAG). Vector databases provide efficient storage and querying capabilities for high-dimensional data, making them well-suited for managing the large feature spaces common in ML applications. Relative Attribute Graphs, on the other hand, offer a novel approach to representing and reasoning about complex relationships in data, enabling more sophisticated analysis and decision-making in real-time ML pipelines. In the context of finance and fintech, the integration of vector databases and RAGs holds significant promise for enhancing various aspects of real-time ML applications. One example is in fraud detection and prevention. Financial institutions must constantly monitor transactions and identify suspicious activities to mitigate fraud risk. By leveraging vector databases to store and query high-dimensional transaction data efficiently, combined with RAGs to model intricate relationships between transactions, real-time ML algorithms can detect anomalies and fraudulent patterns in real time with greater accuracy and speed. Another application area is in personalized financial recommendations and portfolio management. Traditional recommendation systems often struggle to capture the nuanced preferences and goals of individual users. However, by leveraging vector representations of user preferences and financial assets stored in vector databases, and utilizing RAGs to model the relative attributes and interdependencies between different investment options, real-time ML algorithms can generate personalized recommendations that better align with users' financial objectives and risk profiles. For example, a real-time ML system could analyze a user's financial history, risk tolerance, and market conditions to dynamically adjust their investment portfolio in response to changing market conditions and personal preferences. Furthermore, in algorithmic trading, real-time ML models powered by vector databases and RAGs can enable more sophisticated trading strategies that adapt to evolving market dynamics and exploit complex interrelationships between different financial instruments. By analyzing historical market data stored in vector databases and incorporating real-time market signals represented as RAGs, algorithmic trading systems can make more informed and timely trading decisions, optimizing trading performance and risk management. Overall, the future of real-time ML in finance and fintech is poised to benefit significantly from advancements in vector databases and RAGs. By leveraging these technologies, organizations can build more intelligent, adaptive, and efficient real-time ML pipelines that enable enhanced fraud detection, personalized financial services, and algorithmic trading strategies.

By Fawaz Ghali, PhD

CORE

Empowering ADHD Research With Generative AI: A Developer's Guide to Synthetic Data Generation

Attention Deficit Hyperactivity Disorder (ADHD) presents a complex challenge in the field of neurodevelopmental disorders, characterized by a wide range of symptoms such as inattention, hyperactivity, and impulsivity that significantly affect individuals' daily lives. In the era of digital healthcare transformation, the role of artificial intelligence (AI), and more specifically Generative AI, has become increasingly pivotal. For developers and researchers in the tech and healthcare sectors, this presents a unique opportunity to leverage the power of AI to foster advancements in understanding, diagnosing, and treating ADHD. From a developer's standpoint, the integration of Generative AI into ADHD research is not just about the end goal of improving patient outcomes but also about navigating the intricate process of designing, training, and implementing AI models that can accurately generate synthetic patient data. This data holds the key to unlocking new insights into ADHD without the ethical and privacy concerns associated with using real patient data. The challenge lies in how to effectively capture the complex, multidimensional nature of ADHD symptoms and treatment responses within these models, ensuring they can serve as a reliable foundation for further research and development. Methodology Generative AI refers to a subset of AI algorithms capable of generating new data instances similar but not identical to the training data. This article proposes utilizing Generative Adversarial Networks (GANs) to generate synthetic patient data, aiding in the research and understanding of ADHD without compromising patient privacy. Data Collection and Preprocessing Data will be synthetically generated to resemble real patient data, including symptoms, genetic information, and response to treatment. Preprocessing steps involve normalizing the data and ensuring it is suitable for training the GAN model. Application and Code Sample Model Training The GAN consists of two main components: the Generator, which generates new data instances, and the Discriminator, which evaluates them against real data. The training process involves teaching the Generator to produce increasingly accurate representations of ADHD patient data. Data Generation/Analysis Generated data can be used to identify patterns in ADHD symptoms and responses to treatment, contributing to more personalized and effective treatment strategies. Python from keras.models import Sequential from keras.layers import Dense import numpy as np # Define the generator def create_generator(): model = Sequential() model.add(Dense(units=100, input_dim=100)) model.add(Dense(units=100, activation='relu')) model.add(Dense(units=50, activation='relu')) model.add(Dense(units=5, activation='tanh')) return model # Example synthetic data generation (simplified) generator = create_generator() noise = np.random.normal(0, 1, [100, 100]) synthetic_data = generator.predict(noise) print("Generated Synthetic Data Shape:", synthetic_data.shape) Results The application of Generative AI in ADHD research could lead to significant advancements in personalized medicine, early diagnosis, and the development of new treatment modalities. However, the accuracy of the generated data and the ethical implications of synthetic data use are important considerations. Discussion This exploration opens up possibilities for using Generative AI to understand complex disorders like ADHD more deeply. Future research could focus on refining the models for greater accuracy and exploring other forms of AI to support healthcare professionals in diagnosis and treatment. Conclusion Generative AI has the potential to revolutionize the approach to ADHD by generating new insights and aiding in the development of more effective treatments. While there are challenges to overcome, the benefits to patient care and research could be substantial.

By venkataramaiah gude

Norm of a One-Dimensional Tensor in Python Libraries

The calculation of the norm of vectors is essential in both artificial intelligence and quantum computing for tasks such as feature scaling, regularization, distance metrics, convergence criteria, representing quantum states, ensuring unitarity of operations, error correction, and designing quantum algorithms and circuits. You will learn how to calculate the Euclidean (norm/distance), also known as the L2 norm, of a single-dimensional (1D) tensor in Python libraries like NumPy, SciPy, Scikit-Learn, TensorFlow, and PyTorch. Understand Norm vs Distance Before we begin, let's understand the difference between Euclidean norm vs Euclidean distance. Norm is the distance/length/size of the vector from the origin (0,0). Distance is the distance/length/size between two vectors. Prerequisites Install Jupyter. Run the code below in a Jupyter Notebook to install the prerequisites. Python # Install the prerequisites for you to run the notebook !pip install numpy !pip install scipy %pip install torch !pip install tensorflow You will use Jupyter Notebook to run the Python code cells to calculate the L2 norm in different Python libraries. Let's Get Started Now that you have Jupyter set up on your machine and installed the required Python libraries, let's get started by defining a 1D tensor using NumPy. NumPy NumPy is a Python library used for scientific computing. NumPy provides a multidimensional array and other derived objects. Tensor ranks Python # Define a single dimensional (1D) tensor import numpy as np vector1 = np.array([3,7]) #np.random.randint(1,5,2) vector2 = np.array([5,2]) #np.random.randint(1,5,2) print("Vector 1:",vector1) print("Vector 2:",vector2) print(f"shape & size of Vector1 & Vector2:", vector1.shape, vector1.size) Print the vectors Plain Text Vector 1: [3 7] Vector 2: [5 2] shape & size of Vector1 & Vector2: (2,) 2 Matplotlib Matplotlib is a Python visualization library for creating static, animated, and interactive visualizations. You will use Matplotlib's quiver to plot the vectors. Python # Draw the vectors using MatplotLib import matplotlib.pyplot as plt %matplotlib inline origin = np.array([0,0]) plt.quiver(*origin, vector1[0],vector1[1], angles='xy', color='r', scale_units='xy', scale=1) plt.quiver(*origin, vector2[0],vector2[1], angles='xy', color='b', scale_units='xy', scale=1) plt.plot([vector1[0],vector2[0]], [vector1[1],vector2[1]], 'go', linestyle="--") plt.title('Vector Representation') plt.xlim([0,10]) plt.ylim([0,10]) plt.grid() plt.show() Vector representation using Matplolib Python # L2 (Euclidean) norm of a vector # NumPy norm1 = np.linalg.norm(vector1, ord=2) print("The magnitude / distance from the origin",norm1) norm2 = np.linalg.norm(vector2, ord=2) print("The magnitude / distance from the origin",norm2) The output once you run this in the Jupyter Notebook: Plain Text The magnitude / distance from the origin 7.615773105863909 The magnitude / distance from the origin 5.385164807134504 SciPy SciPy is built on NumPy and is used for mathematical computations. If you observe, SciPy uses the same linalg functions as NumPy. Python # SciPy import scipy norm_vector1 = scipy.linalg.norm(vector1, ord=2) print("L2 norm in scipy for vector1:", norm_vector1) norm_vector2 = scipy.linalg.norm(vector2, ord=2) print("L2 norm in scipy for vector2:", norm_vector2) Output: Plain Text L2 norm in scipy for vector1: 7.615773105863909 L2 norm in scipy for vector2: 5.385164807134504 Scikit-Learn As the Scikit-learn documentation says: Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. We reshape the vector as Scikit-learn expects the vector to be 2-dimensional. Python # Sklearn from sklearn.metrics.pairwise import euclidean_distances vector1_reshape = vector1.reshape(1,-1) ## Scikit-learn expects the vector to be 2-Dimensional euclidean_distances(vector1_reshape, [[0, 0]])[0,0] Output Plain Text 7.615773105863909 TensorFlow TensorFlow is an end-to-end machine learning platform. Python # TensorFlow import os os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' import tensorflow as tf print("TensorFlow version:", tf.__version__) ## Tensorflow expects Tensor of types float32, float64, complex64, complex128 vector1_tf = vector1.astype(np.float64) tf_norm = tf.norm(vector1_tf, ord=2) print("Euclidean(l2) norm in TensorFlow:",tf_norm.numpy()) Output The output prints the version of TensorFlow and the L2 norm: Plain Text TensorFlow version: 2.15.0 Euclidean(l2) norm in TensorFlow: 7.615773105863909 PyTorch PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. Python # PyTorch import torch print("PyTorch version:", torch.__version__) norm_torch = torch.linalg.norm(torch.from_numpy(vector1_tf), ord=2) norm_torch.item() The output prints the PyTorch version and the norm: Plain Text PyTorch version: 2.1.2 7.615773105863909 Euclidean Distance Euclidean distance is calculated in the same way as a norm, except that you calculate the difference between the vectors before passing the difference - vector_diff, in this case, to the respective libraries. Python # Euclidean distance between the vectors import math vector_diff = vector1 - vector2 # Using norm euclidean_distance = np.linalg.norm(vector_diff, ord=2) print(euclidean_distance) # Using dot product norm_dot = math.sqrt(np.dot(vector_diff.T,vector_diff)) print(norm_dot) Output Output using the norm and dot functions of NumPy libraries: Plain Text 5.385164807134504 5.385164807134504 Python # SciPy from scipy.spatial import distance distance.euclidean(vector1,vector2) Output Using SciPy 5.385164807134504 The Jupyter Notebook with the outputs is available on the GitHub repository. You can run the Jupyter Notebook on Colab following the instructions on the GitHub repository.

By Vidyasagar (Sarath Chandra) Machupalli

CORE

How To Embed Documents for Semantic Search

In this post, you will take a closer look at embedding documents to be used for a semantic search. By means of examples, you will learn how embedding influences the search result and how you can improve the results. Enjoy! Introduction In a previous post, a chat with documents using LangChain4j and LocalAI was discussed. One of the conclusions was that the document format has a large influence on the results. In this post, you will take a closer look at the influence of source data and the way it is embedded in order to get a better search result. The source documents are two Wikipedia documents. You will use the discography and list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts and are mainly in a table format. The same documents were used in the previous post, so it will be interesting to see how the findings from that post compare to the approach used in this post. This blog can be read without reading the previous blogs if you are familiar with the concepts used. If not, it is recommended to read the previous blogs as mentioned in the prerequisites paragraph. The sources used in this blog can be found on GitHub. Prerequisites The prerequisites for this blog are: Basic knowledge of embedding and vector stores Basic Java knowledge: Java 21 is used Basic knowledge of LangChain4j - see the previous blogs: How to Use LangChain4j With LocalAI LangChain4j: Chat With Documents You need LocalAI if you want to run the examples at the end of this blog. See a previous blog on how you can make use of LocalAI. Version 2.2.0 is used for this blog. Embed Whole Document The easiest way to embed a document is to read the document, split it into chunks, and embed the chunks. Embedding means transforming the text into vectors (numbers). The question you will ask also needs to be embedded. The vectors are stored in a vector store which is able to find the results that are the closest to your question and will respond with these results. The source code consists of the following parts: The text needs to be embedded. An embedding model is needed for that; for simplicity, use the AllMiniLmL6V2EmbeddingModel. This model uses the BERT model, which is a popular embedding model. The embeddings need to be stored in an embedding store. Often, a vector database is used for this purpose; but in this case, you can use an in-memory embedding store. Read the two documents and add them to a DocumentSplitter. Here you will define to split the documents into chunks of 500 characters with no overlap. By means of the DocumentSplitter, the documents are split into TextSegments. The embedding model is used to embed the TextSegments. The TextSegments and their embedded counterpart are stored in the embedding store. The question is also embedded with the same model. Ask the embedding store to find relevant embedded segments to the embedded question. You can define how many results the store should retrieve. In this case, only one result is asked for. If a match is found, the following information is printed to the console: The score: A number indicating how well the result corresponds to the question The original text: The text of the segment The metadata: Will show you the document the segment comes from Java private static void askQuestion(String question) { EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel(); EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>(); // Read and split the documents in segments of 500 chunks Document springsteenDiscography = loadDocument(toPath("example-files/Bruce_Springsteen_discography.pdf")); Document springsteenSongList = loadDocument(toPath("example-files/List_of_songs_recorded_by_Bruce_Springsteen.pdf")); ArrayList<Document> documents = new ArrayList<>(); documents.add(springsteenDiscography); documents.add(springsteenSongList); DocumentSplitter documentSplitter = DocumentSplitters.recursive(500, 0); List<TextSegment> documentSegments = documentSplitter.splitAll(documents); // Embed the segments Response<List<Embedding>> embeddings = embeddingModel.embedAll(documentSegments); embeddingStore.addAll(embeddings.content(), documentSegments); // Embed the question and find relevant segments Embedding queryEmbedding = embeddingModel.embed(question).content(); List<EmbeddingMatch<TextSegment>> embeddingMatch = embeddingStore.findRelevant(queryEmbedding,1); System.out.println(embeddingMatch.get(0).score()); System.out.println(embeddingMatch.get(0).embedded().text()); System.out.println(embeddingMatch.get(0).embedded().metadata()); } The questions are the following, and are some facts that can be found in the documents: Java public static void main(String[] args) { askQuestion("on which album was \"adam raised a cain\" originally released?"); askQuestion("what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?"); askQuestion("what is the highest chart position of the album \"tracks\" in canada?"); askQuestion("in which year was \"Highway Patrolman\" released?"); askQuestion("who produced \"all or nothin' at all?\""); } Question 1 The following is the result for question 1: "On which album was 'Adam Raised a Cain' originally released?" Shell 0.6794537224516205 Jim Cretecos 1973 [14] "57 Channels (And Nothin' On)" Bruce Springsteen Human Touch Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan 1992 [15] "7 Rooms of Gloom" (Four Tops cover) Holland–Dozier– Holland † Only the Strong Survive Ron Aniello Bruce Springsteen 2022 [16] "Across the Border" Bruce Springsteen The Ghost of Tom Joad Chuck Plotkin Bruce Springsteen 1995 [17] "Adam Raised a Cain" Bruce Springsteen Darkness on the Edge of Town Jon Landau Bruce Springsteen Steven Van Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=4, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} } What do you see here? The score is 0.679…: This means that the segment matches 67.9% of the question. The segment itself contains the specified information at Line 27. The correct segment is chosen - this is great. The metadata shows the document where the segment comes from. You also see how the table is transformed into a text segment: it isn’t a table anymore. In the source document, the information is formatted as follows: Another thing to notice is where the text segment is split. So, if you had asked who produced this song, it would be an incomplete answer, because this row is split in column 4. Question 2 The following is the result for question 2: "What is the highest chart position of 'Greetings from Asbury Park, NJ' in the US?" Shell 0.6892728817378977 29. Greetings from Asbury Park, N.J. (LP liner notes). Bruce Springsteen. US: Columbia Records. 1973. KC 31903. 30. Nebraska (LP liner notes). Bruce Springsteen. US: Columbia Records. 1982. TC 38358. 31. Chapter and Verse (CD booklet). Bruce Springsteen. US: Columbia Records. 2016. 88985 35820 2. 32. Born to Run (LP liner notes). Bruce Springsteen. US: Columbia Records. 1975. PC 33795. 33. Tracks (CD box set liner notes). Bruce Springsteen. Europe: Columbia Records. 1998. COL 492605 2 2. Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=100, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} } The information is found in the correct document, but the wrong text segment is found. This segment comes from the References section and you needed the information from the Songs table, just like for question 1. Question 3 The following is the result for question 3: "What is the highest chart position of the album 'Tracks' in Canada?" Shell 0.807258199400863 56. @billboardcharts (November 29, 2021). "Debuts on this week's #Billboard200 (1/2)..." (https://twitter.com/bil lboardcharts/status/1465346016702566400) (Tweet). Retrieved November 30, 2021 – via Twitter. 57. "ARIA Top 50 Albums Chart" (https://www.aria.com.au/charts/albums-chart/2021-11-29). Australian Recording Industry Association. November 29, 2021. Retrieved November 26, 2021. 58. "Billboard Canadian Albums" (https://www.fyimusicnews.ca/fyi-charts/billboard-canadian-albums). Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=142, file_name=Bruce_Springsteen_discography.pdf, document_type=PDF} } The information is found in the correct document, but also here, the segment comes from the References section, while the answer to the question can be found in the Compilation albums table. This can explain some of the wrong answers that were given in the previous post. Question 4 The following is the result for question 4: "In which year was 'Highway Patrolman' released?" Shell 0.6867325432140559 "Highway 29" Bruce Springsteen The Ghost of Tom Joad Chuck Plotkin Bruce Springsteen 1995 [17] "Highway Patrolman" Bruce Springsteen Nebraska Bruce Springsteen 1982 [30] "Hitch Hikin' " Bruce Springsteen Western Stars Ron Aniello Bruce Springsteen 2019 [53] "The Hitter" Bruce Springsteen Devils & Dust Brendan O'Brien Chuck Plotkin Bruce Springsteen 2005 [24] "The Honeymooners" Bruce Springsteen Tracks Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt 1998 [33] [76] "House of a Thousand Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=31, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} } The information is found in the correct document and the correct segment is found. However, it is difficult to retrieve the correct answer because of the formatting of the text segment, and you do not have any context about what the information represents. The column headers are gone, so how should you know that 1982 is the answer to the question? Question 5 The following is the result for question 5: "Who produced 'All or Nothin’ at All'?" Shell 0.7036564758755796 Zandt (assistant) 1978 [18] "Addicted to Romance" Bruce Springsteen She Came to Me (soundtrack) Bryce Dessner 2023 [19] [20] "Ain't Good Enough for You" Bruce Springsteen The Promise Jon Landau Bruce Springsteen 2010 [21] [22] "Ain't Got You" Bruce Springsteen Tunnel of Love Jon Landau Chuck Plotkin Bruce Springsteen 1987 [23] "All I'm Thinkin' About" Bruce Springsteen Devils & Dust Brendan O'Brien Chuck Plotkin Bruce Springsteen 2005 [24] "All or Nothin' at All" Bruce Springsteen Human Touch Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=5, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} } The information is found in the correct document, but again, the segment is split in the row where the answer can be found. This can explain the incomplete answers that were given in the previous post. Conclusion Two answers are correct, one is partially correct, and two are wrong. Embed Markdown Document What would change when you convert the PDF documents into Markdown files? Tables are probably better to recognize in Markdown files than in PDF documents, and they allow you to segment the document at the row level instead of some arbitrary chunk size. Only the parts of the documents that contain the answers to the questions are converted; this means the Studio albums and Compilation albums from the discography and the List of songs recorded. The segmenting is done as follows: Split the document line per line. Retrieve the data of the table in the variable dataOnly. Save the header of the table in the variable header. Create a TextSegment for every row in dataOnly and add the header to the segment. The source code is as follows: Java List<Document> documents = loadDocuments(toPath("markdown-files")); List<TextSegment> segments = new ArrayList<>(); for (Document document : documents) { String[] splittedDocument = document.text().split("\n"); String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length); String header = splittedDocument[0] + "\n" + splittedDocument[1] + "\n"; for (String splittedLine : dataOnly) { segments.add(TextSegment.from(header + splittedLine, document.metadata())); } } Question 1 The following is the result for question 1: "On which album was 'Adam Raised a Cain' originally released?" Shell 0.6196628642947255 | Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK |-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---| |The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} } The answer is incorrect. Question 2 The following is the result for question 2: "What is the highest chart position of 'Greetings from Asbury Park, NJ' in the US?" Shell 0.8229951885990189 | Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK |-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---| | Greetings from Asbury Park,N.J. |60|71|—|—|—|—|—|—|35|41| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_studio_albums.md, document_type=UNKNOWN} } The answer is correct, and the answer can easily be retrieved, as you have the header information for every column. Question 3 The following is the result for question 3: "What is the highest chart position of the album 'Tracks' in Canada?" Shell 0.7646818618182345 | Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK |-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---| |Tracks|27|97|—|63|—|36|—|4|11|50| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} } The answer is correct. Question 4 The following is the result for question 4: "In which year was 'Highway Patrolman' released?" Shell 0.6108392657222184 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } The answer is incorrect. The correct document is found, but the wrong segment is chosen. Question 5 The following is the result for question 5: "Who produced 'All or Nothin’ at All'?" Shell 0.6724577751120745 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| | "All or Nothin' at All" | Bruce Springsteen | Human Touch | Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan |1992 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } The answer is correct and complete this time. Conclusion Three answers are correct and complete. Two answers are incorrect. Note that the incorrect answers are for different questions as before. However, the result is slightly better than with the PDF files. Alternative Questions Let’s build upon this a bit further. You are not using a Large Language Model (LLM) here, which will help you with textual differences between the questions you ask and the interpretation of results. Maybe it helps when you change the question in order to use terminology that is closer to the data in the documents. The source code can be found here. Question 1 Let’s change question 1 from "On which album was 'Adam Raised a Cain' originally released?" to "What is the original release of 'Adam Raised a Cain'?". The column in the table is named original release, so that might make a difference. The result is the following: Shell 0.6370094541277747 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| | "Adam Raised a Cain" | Bruce Springsteen | Darkness on the Edge of Town | Jon Landau Bruce Springsteen Steven Van Zandt (assistant) | 1978| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } The answer is correct this time and the score is slightly higher. Question 4: Attempt #1 Question 4 is, "In which year was 'Highway Patrolman' released?" Remember that you only asked for the first relevant result. However, more relevant results can be displayed. Set the maximum number of results to 5. Java List<EmbeddingMatch<TextSegment>> relevantMatches = embeddingStore.findRelevant(queryEmbedding,5); The result is: Shell 0.6108392657222184 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6076896858171996 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Turn! Turn! Turn!" (with Roger McGuinn) | Pete Seeger † | Magic Tour Highlights (EP) | John Cooper | 2008| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6029946650419344 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6001672430441461 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Downbound Train" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.5982557901838741 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } As you can see, Highway Patrolman is a result, but only the fifth result. That is a bit strange, though. Question 4: Attempt #2 Let’s change question 4 from, "In which year was 'Highway Patrolman' released?" to, "In which year was the song 'Highway Patrolman' released?" So, you add "the song" to the question. The result is: Shell 0.6506125707025556 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.641000538311824 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Raise Your Hand" (live) (Eddie Floyd cover) | Steve Cropper Eddie Floyd Alvertis Isbell † | Live 1975–85 | Jon Landau Chuck Plotkin Bruce Springsteen |1986 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6402738046796352 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6362427185719677 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.635837703599965 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Wreck on the Highway"| Bruce Springsteen |The River | Jon Landau Bruce Springsteen Steven Van Zandt |1980 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } Now Highway Patrolman is the fourth result. It is getting better. Question 4: Attempt #3 Let’s add the words "of the album Nebraska" to question 4. The question becomes, "In which year was the song 'Highway Patrolman' of the album Nebraska released?" The result is: Shell 0.6468954949440158 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6444919056791143 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6376680100362238 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } 0.6367565537138745 | Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK |-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---| |The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15| Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} } 0.6364950606665447 | song | writer(s) | original release | Producer(s) |year| |-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-| |"Raise Your Hand" (live) (Eddie Floyd cover) | Steve Cropper Eddie Floyd Alvertis Isbell † | Live 1975–85 | Jon Landau Chuck Plotkin Bruce Springsteen |1986 | Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} } Again, an improvement: Highway Patrolman is now listed as the third result. Still, it is strange why it is not listed as the first result. However, by adding more information, it ranks higher in the result list. This is as expected. Conclusion Changing the question with terminology that is closer to the source data helps in order to get a better result. Adding more context to the question also helps. Displaying more results gives you more insight and lets you determine the correct answer from the result list. Markdown Embedding Combined With LLM The conclusions up until now are: The format of the documents and the way the documents are segmented and embedded have a significant influence on the results. If the question uses terminology that is close to the data in the documents, better results are achieved. In the previous blog, the PDF files were used as source data. This is now solved by using Markdown and better segmenting and embedding. The second bullet above can probably be solved by using an LLM. Let’s see what happens when you combine the embedding in this post with an LLM from the previous post. The full source code can be found on GitHub. The questions and results are: Q: On which album was “Adam Raised a Cain” originally released?A: The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen album in 1995. This answer is wrong but corresponds to the first embedded result earlier. Q: What is the highest chart position of “Greetings from Asbury Park, NJ” in the US?A: The highest chart position of “Greetings from Asbury Park, NJ” in the US is 60. This answer is correct. Q: What is the highest chart position of the album “Tracks” in Canada?A: The album “Tracks” by Metallica has not reached the highest chart position in Canada. According to the provided information, the highest chart position it reached in Canada was #27. This answer is wrong. The first embedded result earlier was correct, so it is strange that the wrong answer is given here. Q: In which year was “Highway Patrolman” released?A: The song “Highway Patrolman” was released in 1984. The answer is wrong, but it can be explained. The first embedded result earlier showed the result for Working on the Highway, and this song was released in 1984. Q: In which year was “Higway Patrolman” released?A: The song “Highway Patrolman” was released in 1984. The question here included a typo. The result is similar to the previous one, so this is consistent. Q: Who produced “All or Nothin’ at All”?A: The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on Springsteen’s 1992 album “Human Touch.” This answer is correct and complete. With the PDF as source data, the answer was always incomplete. The results are worse than using the PDF files as source documents. However, when a correct answer is given, it is complete. The answers are also consistent; a minor typo does not change the answer like it did before. Markdown Embedding Combined With System Messages In the previous blog, the final solution also included adding system messages to the chat memory. Let’s see what happens when you combine this with Markdown source documents. The full source code can be found on GitHub. Below is a small extract of the system messages. Java ArrayList<ChatMessage> chatMessages = new ArrayList<>(); chatMessages.add(new SystemMessage("Use the provided documents to answer the questions.")); chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\".")); The questions and results are: Q: On which album was “Adam Raised a Cain” originally released?A: The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen in 1995. This answer is wrong and similar to the previous paragraph. Q: What is the highest chart position of “Greetings from Asbury Park, NJ” in the US?A: Based on the provided information, the highest chart position of “Greetings from Asbury Park, N.J.” in the US is 60. This answer is correct and similar to the previous paragraph. Q: What is the highest chart position of the album “Tracks” in Canada?A: Based on the information provided, the highest chart position of the album “Tracks” in Canada is 64. This answer is still wrong. However, Metallica is not mentioned anymore, so it is a bit less wrong than in the previous paragraph. Q: In which year was “Highway Patrolman” released?A: The song “Highway Patrolman” was released in 1984. The answer is wrong and identical to the previous paragraph. Q: In which year was “Higway Patrolman” released?A: The song “Highway Patrolman” was released in 1984. The answer is wrong and identical to the previous paragraph. Q: Who produced “All or Nothin’ at All”?A: The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on the album “Human Touch” in 1992. This answer is correct, complete, and similar to the previous paragraph. Adding system messages did not have any influence on the results. Overall Conclusion What did you learn from all of this? The way documents are read and embedded seems to have the largest influence on the result. An advantage of this approach is that you are able to display a number of results. This allows you to determine which result is the correct one. Changing your question in order to use the terminology used in the text segments helps to get a better result. Querying a vector store is very fast. Embedding costs some time, but you only need to do this once. Using an LLM takes a lot more time to retrieve a result when you do not use a GPU. An interesting resource to read is Deconstructing RAG, a blog from LangChain. When improvements are made in this area, better results will be the consequence.

By Gunter Rotsaert

CORE

Data Streaming for AI in the Financial Services Industry (Part 1)

In the ever-evolving landscape of the Financial Services Industry (FSI), organizations face a multitude of challenges that hinder their journey toward AI-driven transformation. Legacy systems, stringent regulations, data silos, and a lack of agility have created a chaotic environment in need of a more effective way to use and share data across the organization. In this two-part series, I delve into my own personal observations, break open the prevailing issues within the FSI, and closely inspect the factors holding back progress. I’ll also highlight the need to integrate legacy systems, navigate strict regulatory landscapes, and break down data silos that impede agility and hinder data-driven decision-making. Last but not least, I’ll introduce a proven data strategy approach in which adopting data streaming technologies will help organizations overhaul their data pipelines—enabling real-time data ingestion, efficient processing, and seamless integration of disparate systems. If you're interested in a practical use case that showcases everything in action, keep an eye out for our upcoming report. But before diving into the solution, let’s start by understanding the problem. Complexity, Chaos, and Data Dilemmas Stepping into larger financial institutions often reveals a fascinating insight: a two-decade technology progression unfolding before my eyes. The core business continues to rely on mainframe systems running COBOL, while a secondary layer of services acts as the gateway to access the core and extension of services offerings that can’t be done in the core system. Data is heavily batched and undergoes nightly ETL processes to facilitate transfer between these layers. Real-time data access poses challenges, demanding multiple attempts and queries through the gateway for even a simple status update. Data warehouses are established, serving as data dumping grounds through ETL, where nearly half of the data remains unused. Business Intelligence (BI) tools extract, transform, and analyze the data to provide valuable insights for business decisions and product design. Batch and distributed processing prevail due to the sheer volume of data to be handled, resulting in data silos and delayed reflection of changing trends. In recent years, more agile approaches have emerged, with a data shift towards binary, key-value formats for better scalability on the cloud. However, due to the architectural complexity, data transfers between services have multiplied, leading to challenges in maintaining data integrity. Plus, these innovations primarily cater to new projects, leaving developers and internal users to navigate through multiple hoops within the system to accomplish tasks. Companies also find themselves paying the price for slow innovation and encounter high costs when implementing new changes. This is particularly true when it comes to AI-driven initiatives that demand a significant amount of data and swift action. Consequently, several challenges bubble to the surface and get in the way of progress, making it increasingly difficult for FSIs to adapt and prepare for the future. Here’s a breakdown of these challenges and the ideal state for FSIs. Reality Ideal state Data silos Decentralized nature of financial operations or team’s geographical location. Separate departments or business units maintain their own data and systems that were implemented over the years, resulting in isolated data and making it difficult to collaborate. There were already several attempts to break the silos, and the solutions somehow contributed to one of the many problems below (i.e., data pipeline chaos). A consolidated view of data across the organization. Ability to quickly view and pull data when needed. Legacy systems FSIs often grapple with legacy systems that have been in place for many years. These systems usually lack the agility to adapt to changes quickly. As a result, accessing and actioning data from these legacy systems can be time-consuming, leading to delays and sometimes making it downright impossible to make good use of the latest data. Data synchronization with the old systems, and modernized ETL pipelines. Migrate and retire from the old process. Data overload With vast amounts of data from various sources, including transactions, customer interactions, market data, and more, it can be overwhelming, making it challenging to extract valuable insights and derive actionable intelligence. It often leads to high storage bills and data is not fully used most of the time. Infrastructural change to adopt larger ingestion of data, planned data storage strategy, and a more cost-effective way to safely secure and store data with sufficient failover and recovery plan. Data pipeline chaos Managing data pipelines within FSIs can be a complex endeavor. With numerous data sources, formats, and integration points, the data pipeline can become fragmented and chaotic. Inconsistent data formats, incompatible systems, and manual processes can introduce errors and inefficiencies, making it challenging to ensure smooth data flow and maintain data quality. A data catalog is a centralized repository that serves as a comprehensive inventory and metadata management system for an organization's data assets.Reduced redundancy, improved efficiency, streamlined data flow, and introduce automation, monitoring and regular inspection. Open data initiatives With the increasing need for partner collaboration and government open API projects, the FSI faces the challenge of adapting its data practices. The demand to share data securely and seamlessly with external partners and government entities is growing. FSIs must establish frameworks and processes to facilitate data exchange while ensuring privacy, security, and compliance with regulations. Secure and well-defined APIs for data access that ensure data interoperability through common standards. Plus, version and access control over access points. Clearly, there’s a lot stacked up against FSIs attempting to leap into the world of AI. Now, let’s zoom in on the different data pipelines organizations are using to move their data from point A to B and the challenges many teams are facing with them. Understanding Batch, Micro-Batch, and Real-Time Data Pipelines There are all sorts of ways that move data around. To keep things simple, I’ll distill the most common pipelines today into three categories: Batch Micro-batch Real-time 1. Batch Pipelines These are typically used when processing large volumes of data in scheduled “chunks” at a time—often in overnight processing, periodic data updates, or batch reporting. Batch pipelines are well-suited for scenarios where immediate data processing isn't crucial, and the output is usually a report, like for investment profiles and insurance claims. The main setbacks include processing delays, potentially outdated results, scalability complexities, managing task sequences, resource allocation issues, and limitations in providing real-time data insights. I’ve witnessed an insurance customer running out of windows at night to run batches due to the sheer volume of data that needed processing (updating premiums, investment details, documents, agents’ commissions, etc.). Parallel processing or map-reduce are a few techniques to shorten the time, but they also introduce complexities, as parallel both require the developer to understand the distribution of data, dependency of data, and be able to maneuver between map and reduce functions. 2. Micro-Batch Pipelines Micro batch pipelines are a variation of batch pipelines where data is processed in smaller, more frequent batches at regular intervals for lower latency and fresher results. They’re commonly used for financial trading insights, clickstream analysis, recommendation systems, underwriting, and customer churn predictions. Challenges with micro-batch pipelines include managing the trade-off between processing frequency and resource usage, handling potential data inconsistencies across micro-batches, and addressing the overhead of initiating frequent processing jobs while still maintaining efficiency and reliability. 3. Real-Time Pipelines These pipelines process data as soon as it flows in. They offer minimal latency and are essential for applications requiring instant reactions, such as real-time analytics, transaction fraud detection, monitoring critical systems, interactive user experiences, continuous model training, and real-time predictions. However, real-time pipelines face challenges like handling high throughputs, maintaining consistently low latency, ensuring data correctness and consistency, managing resource scalability to accommodate varying workloads, and dealing with potential data integration complexities—all of which require robust architectural designs and careful implementation to deliver accurate and timely results. To summarize, here’s the important information about all three pipelines in one table. Batch Micro batch Real time Cadence Scheduled longer intervals Scheduled short intervals Real time Data size Large Small defined chunks Large Scaling Vertical Horizontal Horizontal Latency High (hours/days) Medium (seconds) Low (milliseconds) Datastore Data warehouse, Data lake, Databases, Files Distributed files system, Data warehouses, Databases Stream processing Systems, Data lake, Databases Open-source technologies Apache Hadoop, Map-reduce Apache Spark™ Apache Kafka®, Apache Flink® Industry use case examples Moving files (customer signature scans), and transferring data from the mainframe for core banking data or core insurance policy information. Large datasets for ML. Prepare near real-time business reports and needs to consume data from large dataset lookups such as generating risk management reviews for investment. Daily market trend analysis. Real-time transaction/fraud detection, instant claim approval, monitoring critical systems, and customer chatbot service. As a side note, some may categorize pipelines as either ETL or ELT. ETL (Extract, Transform, and Load) transforms data on a separate processing server before moving it to the data warehouse. ELT (Extract, Load, and Transform) transforms the data within the data warehouse first before it hits its destination. Depending on the destination of the data, if it’s going to a data lake, you’ll see most pipelines doing ELT. Whereas with a data source, like a data warehouse or database, since it requires data to be stored in a more structured manner, you will see more ETL. In my opinion, all three pipelines should be using both techniques to convert data into the desired state. Common Challenges of Working With Data Pipelines Pipelines are scattered across departments, and IT teams implement them with various technologies and platforms. From my own experience working with on-site data engineers, here are some common challenges working with data pipelines: Difficulty Accessing Data Unstructured data can be tricky. The lack of metadata makes it difficult to locate the desired data within the repository (like customer correspondence, emails, chat logs, legal documents.) Certain data analytics tools or platforms may have strict requirements regarding the input data format, posing difficulties in converting the data to the required format. So, multiple complex pipelines transform logic (and lots of it). Stringent security measures and regulatory compliance can introduce additional steps and complexities in gaining access to the necessary data. (Personal identifiable data, health record for claims). Noisy, “Dirty” Data Data lakes are prone to issues like duplicated data. Persistence of decayed or outdated data within the system can compromise the accuracy and reliability of AI models and insights. Input errors during data entry were not caught and filtered. (biggest data processing troubleshooting time wasted) Data mismatches between different datasets and inconsistencies in data. (Incorrect report and pipeline errors) Performance Large volumes of data, lack of efficient storage and processing power. Methods of retrieving data, such as APIs in which the request and response aren’t ideal for large volumes of data ingestion. The location of relevant data within the system and where they’re stored heavily impacts the frequency of when to process data, plus the latency and cost of retrieving it. Data Visibility (Data Governance and Metadata) Inadequate metadata results in a lack of clarity regarding the availability, ownership, and usage of data assets. Difficult to determine the existence and availability of specific data, impeding effective data usage and analysis. Troubleshooting Identifying inconsistencies, addressing data quality problems, or troubleshooting data processing failures can be time-consuming and complex. During the process of redesigning the data framework for AI, both predictive and generative, I’ll address the primary pain points for data engineers and also help solve some of the biggest challenges plaguing the FSI today. Taking FSIs From Point A to AI Looking through a data lens, the AI-driven world can be dissected into two primary categories: inference and machine learning. These domains differ in their data requirements and usage. Machine learning needs comprehensive datasets derived from historical, operational, and real-time sources, enabling training more accurate models. Incorporating real-time data into the dataset enhances the model and facilitates agile and intelligent systems. Inference prioritizes real-time focus, leveraging ML-generated models to respond to incoming events, queries, and requests. Building a generative AI model is a major undertaking. For FSI, it makes sense to reuse an existing model (foundation model) with some fine-tuning in specific areas to fit your use case. The “fine-tuning” will require you to provide a high-quality, high-volume dataset. The old saying still holds true: garbage in, garbage out. If the data isn’t reliable, to begin with, you’ll inevitably end up with unreliable AI. In my opinion, to prepare for the best AI outcome possible, it’s crucial to set up the following foundations: Data infrastructure: You need a robust, low latency, high throughput framework to transfer and store vast volumes of financial data for efficient data ingestion, storage, processing, and retrieval. It should support distributed and cloud computing and prioritize network latency, storage costs, and data safety. Data quality: To provide better data for determining the model, it’s best to go through data cleansing, normalization, de-duplication, and validation processes to remove inconsistencies, errors, and redundancies. Now, if I were to say that there’s a simple solution, I would either be an exceptional genius capable of solving world crises or blatantly lying. However, given the complexity we already have, it’s best to focus on generating the datasets required for ML and streamline the data needed for the inference phase to make decisions. Then, you can gradually address the issues caused by the current data being overly disorganized. Taking one domain at a time, solving business users’ problems first, and not being overly ambitious is the fastest path to success. But we’ll leave that for the next post. Summary Implementing a data strategy in the financial services industry can be intricate due to factors such as legacy systems and the consolidation of other businesses. Introducing AI into this mix can pose performance challenges, and some businesses might struggle to prepare data for machine learning applications. In my next post, I’ll walk you through a proven data strategy approach to streamline your troublesome data pipelines for real-time data ingestion, efficient processing, and seamless integration of disparate systems.

By Christina Lin

CORE

Revolutionizing Real-Time Alerts With AI, NATS, and Streamlit

Imagine you have an AI-powered personal alerting chat assistant that interacts using up-to-date data. Whether it’s a big move in the stock market that affects your investments, any significant change on your shared SharePoint documents, or discounts on Amazon you were waiting for, the application is designed to keep you informed and alert you about any significant changes based on the criteria you set in advance using your natural language. In this post, we will learn how to build a full-stack event-driven weather alert chat application in Python using pretty cool tools: Streamlit, NATS, and OpenAI. The app can collect real-time weather information, understand your criteria for alerts using AI, and deliver these alerts to the user interface. This piece of content and code samples can be incredibly helpful for those who love technology or those who are developers to understand how modern real-time alerting systems work with Larger Language Models (LLMs) and how to implement one. You can also quickly jump on the source code hosted on our GitHub and try it yourself. The Power Behind the Scenes Let’s take a closer look at how the AI weather alert chat application works and transforms raw data into actionable alerts, keeping you one step ahead of the weather. At the core of our application lies a responsive backend implemented in Python, powered by NATS to ensure real-time data processing and message management. Integrating OpenAI’s GPT model brings a conversational AI to life, capable of understanding alerts’ nature and responding to user queries. Users can specify their alert criteria in natural language, then the GPT model will interpret them. Image 1: Real-time alert app architecture Real-Time Data Collection The journey begins with the continuous asynchronous collection of weather data from various sources in the backend. Our application now uses the api.weatherapi.com service, fetching real-time weather information every 10 seconds. This data includes temperature, humidity, precipitation, and more, covering locations worldwide. This snippet asynchronously fetches current weather data for Estonia but the app can be improved to set the location from user input dynamically: async def fetch_weather_data(): api_url = f"http://api.weatherapi.com/v1/current.json?key={weather_api_key}&q=estonia" try: async with aiohttp.ClientSession() as session: async with session.get(api_url) as response: if response.status == 200: return await response.json() else: logging.error(f"Error fetching weather data: HTTP {response.status}") return None except Exception as e: logging.error(f"Error fetching weather data: {e}") return None The Role of NATS in Data Streaming The code segment in the main() function in the backend.py file demonstrates the integration of NATS for even-driven messaging, continuous weather monitoring, and alerting. We use the nats.py library to integrate NATS within Python code. First, we establish a connection to the NATs server running in Docker at nats://localhost:4222. nats_client = await nats.connect("nats://localhost:4222") Then, we define an asynchronous message_handler function that subscribes and processes messages received on the chat subject from the NATs server. If a message starts with "Set Alert:" (we append it on the frontend side), it extracts and updates the user's alert criteria. async def message_handler(msg): nonlocal user_alert_criteria data = msg.data.decode() if data.startswith("Set Alert:"): user_alert_criteria = data[len("Set Alert:"):].strip() logging.info(f"User alert criteria updated: {user_alert_criteria}") await nats_client.subscribe("chat", cb=message_handler) The backend service integrates with both external services like Weather API and Open AI Chat Completion API. If both weather data and user alert criteria are present, the app constructs a prompt for OpenAI’s GPT model to determine if the weather meets the user’s criteria. The prompt asks the AI to analyze the current weather against the user’s criteria and respond with “YES” or “NO” and a brief weather summary. Once the AI determines that the incoming weather data matches a user’s alert criteria, it crafts a personalized alert message and publishes a weather alert to the chat_response subject on the NATS server to update the frontend app with the latest changes. This message contains user-friendly notifications designed to inform and advise the user. For example, it might say, "Heads up! Rain is expected in Estonia tomorrow. Don't forget to bring an umbrella!" while True: current_weather = await fetch_weather_data() if current_weather and user_alert_criteria: logging.info(f"Current weather data: {current_weather}") prompt = f"Use the current weather: {current_weather} information and user alert criteria: {user_alert_criteria}. Identify if the weather meets these criteria and return only YES or NO with a short weather temperature info without explaining why." response_text = await get_openai_response(prompt) if response_text and "YES" in response_text: logging.info("Weather conditions met user criteria.") ai_response = f"Weather alert! Your specified conditions have been met. {response_text}" await nats_client.publish("chat_response", payload=ai_response.encode()) else: logging.info("Weather conditions did not meet user criteria.") else: logging.info("No current weather data or user alert criteria set.")await asyncio.sleep(10) Delivering and Receiving Alerts in Real-Time Let’s understand the overall communication flow between the backend and frontend. Through a simple chat interface built using Streamlit (see frontend.py file), the user inputs their weather alert criteria using natural language and submits it. alert_criteria = st.text_input("Set your weather alert criteria", key="alert_criteria", disabled=st.session_state['alert_set']) Below, Streamlit frontend code interacts with a backend service via NATS messaging. It publishes these criteria to the NATS server on the chat subject. def send_message_to_nats_handler(message): with NATSClient() as client: client.connect() client.publish("chat", payload=message.encode()) client.subscribe("chat_response", callback=read_message_from_nats_handler) client.wait() if set_alert_btn: st.session_state['alert_set'] = True st.success('Alert criteria set') send_message_to_nats_handler(f"Set Alert: {alert_criteria}") As we have seen in the previous section, the backend listens to the chat subject, receives the criteria, fetches current weather data, and uses AI to determine if an alert should be triggered. If conditions are met, the backend sends an alert message to the chat_response subject. The front end receives this message and updates the UI to notify the user. def read_message_from_nats_handler(msg): message = msg.payload.decode() st.session_state['conversation'].append(("AI", message)) st.markdown(f"<span style='color: red;'></span> AI: {message}", unsafe_allow_html=True) Try It Out To explore the real-time weather alert chat application in detail and try it out for yourself, please visit our GitHub repository linked earlier. The repository contains all the necessary code, detailed setup instructions, and additional documentation to help you get started. Once the setup is complete, you can start the Streamlit frontend and the Python backend. Set your weather alert criteria, and see how the system processes real-time weather data to keep you informed. Image 2: Streamlit UI for the alert app Building Stream Processing Pipelines Real-time weather alert chat application demonstrated a powerful use case of NATS for real-time messaging in a distributed system, allowing for efficient communication between a user-facing frontend and a data-processing backend. However, you should consider several key steps to ensure that the information presented to the user is relevant, accurate, and actionable. In the app, we are just fetching live raw weather data and sending it straightaway to OpenAI or the front end. Sometimes you need to transform this data to filter, enrich, aggregate, or normalize it in real time before it reaches the external services. You start to think about creating a stream processing pipeline with several stages. For example, not all the data fetched from the API will be relevant to every user and you can filter out unnecessary information at an initial stage. Also, data can come in various formats, especially if you’re sourcing information from multiple APIs for comprehensive alerting and you need to normalize this data. At the next stage, you enrich the data with extra context or information to the raw data to make it more useful. This could include comparing current weather conditions against historical data to identify unusual patterns or adding location-based insights using another external API, such as specific advice for weather conditions in a particular area. At later stages, you might aggregate hourly temperature data to give an average daytime temperature or to highlight the peak temperature reached during the day. Next Steps When it comes to transforming data, deploying, running, and scaling the app in a production environment, you might want to use dedicated frameworks in Python like GlassFlow to build sophisticated stream-processing pipelines. GlassFlow offers a fully managed serverless infrastructure for stream processing, you don’t have to think about setup, or maintenance where the app can handle large volumes of data and user requests with ease. It provides advanced state management capabilities, making it easier to track user alert criteria and other application states. Your application can scale with its user base without compromising performance. Recommended Content Microservices Data Synchronization Using PostgreSQL, Debezium, and NATS Training Fraud Detection ML Models with Real-time Data Streams

By Bobur Umurzokov

Generative AI With Spring Boot and Spring AI

It’s been more than 20 years since Spring Framework appeared in the software development landscape and 10 since Spring Boot version 1.0 was released. By now, nobody should have any doubt that Spring has created a unique style through which developers are freed from repetitive tasks and left to focus on business value delivery. As years passed, Spring’s technical depth has continually increased, covering a wide variety of development areas and technologies. On the other hand, its technical breadth has been continually expanded as more focused solutions have been experimented, proof of concepts created, and ultimately promoted under the projects’ umbrella (towards the technical depth). One such example is the new Spring AI project which, according to its reference documentation, aims to ease the development when a generative artificial intelligence layer is aimed to be incorporated into applications. Once again, developers are freed from repetitive tasks and offered simple interfaces for direct interaction with the pre-trained models that incorporate the actual processing algorithms. By interacting with generative pre-trained transformers (GPTs) directly or via Spring AI programmatically, users (developers) do not need to (although it would be useful) possess extensive machine learning knowledge. As an engineer, I strongly believe that even if such (developer) tools can be rather easily and rapidly used to produce results, it is advisable to temper ourselves to switch to a watchful mode and try to gain a decent understanding of the base concepts first. Moreover, by following this path, the outcome might be even more useful. Purpose This article shows how Spring AI can be integrated into a Spring Boot application and fulfill a programmatic interaction with Open AI. It is assumed that prompt design in general (prompt engineering) is a state-of-the-art activity. Consequently, the prompts used during experimentation are quite didactic, without much applicability. The focus here is on the communication interface, that is, Spring AI API. Before the Implementation First and foremost, one shall clarify the rationale for incorporating and utilizing a GPT solution, in addition to the desire to deliver with greater quality, in less time, and with lower costs. Generative AI is said to be good at doing a great deal of time-consuming tasks, quicker and more efficiently, and outputting the results. Moreover, if these results are further validated by experienced and wise humans, the chances of obtaining something useful increase. Fortunately, people are still part of the scenery. Next, one shall resist the temptation to jump right into the implementation and at least dedicate some time to get a bit familiar with the general concepts. An in-depth exploration of generative AI concepts is way beyond the scope of this article. Nevertheless, the “main actors” that appear in the interaction are briefly outlined below. The Stage – Generative AI is part of machine learning that is part of artificial intelligence Input – The provided data (incoming) Output – The computed results (outgoing) Large Language Model(LLM) – The fine-tuned algorithm based on the interpreted input produces the output Prompt – A state-of-the-art interface through which the input is passed to the model Prompt Template – A component that allows constructing structured parameterized prompts Tokens – The components the algorithm internally translates the input into, then uses to compile the results and ultimately constructs the output from Model’s context window – The threshold the model limits the number of tokens counts per call (usually, the more tokens are used, the more expensive the operation is) Finally, an implementation may be started, but as it progresses, it is advisable to revisit and refine the first two steps. Prompts In this exercise, we ask for the following: Plain Text Write {count = three} reasons why people in {location = Romania} should consider a {job = software architect} job. These reasons need to be short, so they fit on a poster. For instance, "{job} jobs are rewarding." This basically represents the prompt. As advised, a clear topic, a clear meaning of the task, and additional helpful pieces of information should be provided as part of the prompts, in order to increase the results’ accuracy. The prompt contains three parameters, which allow coverage for a wide range of jobs in various locations. count – The number of reasons aimed as part of the output job – The domain, the job interested in location – The country, town, region, etc. the job applicants reside Proof of Concept In this post, the simple proof of concept aims the following: Integrate Spring AI in a Spring Boot application and use it. Allow a client to communicate with Open AI via the application. The client issues a parametrized HTTP request to the application. The application uses a prompt to create the input, sends it to Open AI retrieves the output. The application sends the response to the client. Setup Java 21 Maven 3.9.2 Spring Boot – v. 3.2.2 Spring AI – v. 0.8.0-SNAPSHOT (still developed, experimental) Implementation Spring AI Integration Normally, this is a basic step not necessarily worth mentioning. Nevertheless, since Spring AI is currently released as a snapshot, in order to be able to integrate the Open AI auto-configuration dependency, one shall add a reference to Spring Milestone/Snapshot repositories. XML <repositories> <repository> <id>spring-milestones</id> <name>Spring Milestones</name> <url>https://repo.spring.io/milestone</url> <snapshots> <enabled>false</enabled> </snapshots> </repository> <repository> <id>spring-snapshots</id> <name>Spring Snapshots</name> <url>https://repo.spring.io/snapshot</url> <releases> <enabled>false</enabled> </releases> </repository> </repositories> The next step is to add the spring-ai-openai-spring-boot-starter Maven dependency. XML <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-openai-spring-boot-starter</artifactId> <version>0.8.0-SNAPSHOT</version> </dependency> Open AI ChatClient is now part of the application classpath. It is the component used to send the input to Open AI and retrieve the output. In order to be able to connect to the AI Model, the spring.ai.openai.api-key property needs to be set up in the application.properties file. Properties files spring.ai.openai.api-key = api-key-value Its value represents a valid API Key of the user on behalf of which the communication is made. By accessing the Open AI Platform, one can either sign up or sign in and generate one. Client: Spring Boot Application Communication The first part of the proof of concept is the communication between a client application (e.g., browser, cURL, etc.) and the application developed. This is done via a REST controller, accessible via an HTTP GET request. The URL is /job-reasons together with the three parameters previously outlined when the prompt was defined, which conducts to the following form: Plain Text /job-reasons?count={count}&job={job}&location={location} And the corresponding controller: Java @RestController public class OpenAiController { @GetMapping("/job-reasons") public ResponseEntity<String> jobReasons(@RequestParam(value = "count", required = false, defaultValue = "3") int count, @RequestParam("job") String job, @RequestParam("location") String location) { return ResponseEntity.ok().build(); } } Since the response from Open AI is going to be a String, the controller returns a ResponseEntity that encapsulates a String. If we run the application and issue a request, currently nothing is returned as part of the response body. Client: Open AI Communication Spring AI currently focuses on AI Models that process language and produce language or numbers. Examples of Open AI models in the former category are GPT4-openai or GPT3.5-openai. For fulfilling an interaction with these AI Models, which actually designate Open AI algorithms, Spring AI provides a uniform interface. ChatClient interface currently supports text input and output and has a simple contract. Java @FunctionalInterface public interface ChatClient extends ModelClient<Prompt, ChatResponse> { default String call(String message) { Prompt prompt = new Prompt(new UserMessage(message)); return call(prompt).getResult().getOutput().getContent(); } ChatResponse call(Prompt prompt); } The actual method of the functional interface is the one usually used. In the case of our proof of concept, this is exactly what is needed: a way of calling Open AI and sending the aimed parametrized Prompt as a parameter. The following OpenAiService is defined where an instance of ChatClient is injected. Java @Service public class OpenAiService { private final ChatClient client; public OpenAiService(OpenAiChatClient aiClient) { this.client = aiClient; } public String jobReasons(int count, String domain, String location) { final String promptText = """ Write {count} reasons why people in {location} should consider a {job} job. These reasons need to be short, so they fit on a poster. For instance, "{job} jobs are rewarding." """; final PromptTemplate promptTemplate = new PromptTemplate(promptText); promptTemplate.add("count", count); promptTemplate.add("job", domain); promptTemplate.add("location", location); ChatResponse response = client.call(promptTemplate.create()); return response.getResult().getOutput().getContent(); } } With the application running, if the following request is performed, from the browser: Plain Text http://localhost:8080/gen-ai/job-reasons?count=3&job=software%20architect&location=Romania Then the below result is retrieved: Lucrative career: Software architect jobs offer competitive salaries and excellent growth opportunities, ensuring financial stability and success in Romania. In-demand profession: As the demand for technology continues to grow, software architects are highly sought after in Romania and worldwide, providing abundant job prospects and job security. Creative problem-solving: Software architects play a crucial role in designing and developing innovative software solutions, allowing them to unleash their creativity and make a significant impact on various industries. This is exactly what it was intended – an easy interface through which the Open AI GPT model can be asked to write a couple of reasons why a certain job in a certain location is appealing. Adjustments and Observations The simple proof of concept developed so far mainly uses the default configurations available. The ChatClient instance may be configured according to the desired needs via various properties. As this is beyond the scope of this writing, only two are exemplified here. spring.ai.openai.chat.options.model designates the AI Model to use. By default, it is "gpt-35-turbo," but "gpt-4" and "gpt-4-32k" designate the latest versions. Although available, one may not be able to access these using a pay-as-you-go plan, but there are additional pieces of information available on the Open AI website to accommodate it. Another property worth mentioning is spring.ai.openai.chat.options.temperature. According to the reference documentation, the sampling temperature controls the “creativity of the responses." It is said that higher values make the output “more random," while lower ones are “more focused and deterministic." The default value is 0.8, if we decrease it to 0.3, restart the application, and ask again with the same request parameters, the below result is retrieved. Lucrative career opportunities: Software architect jobs in Romania offer competitive salaries and excellent growth prospects, making it an attractive career choice for individuals seeking financial stability and professional advancement. Challenging and intellectually stimulating work: As a software architect, you will be responsible for designing and implementing complex software systems, solving intricate technical problems, and collaborating with talented teams. This role offers continuous learning opportunities and the chance to work on cutting-edge technologies. High demand and job security: With the increasing reliance on technology and digital transformation across industries, the demand for skilled software architects is on the rise. Choosing a software architect job in Romania ensures job security and a wide range of employment options, both locally and internationally. It is visible that the output is way more descriptive in this case. One last consideration is related to the structure of the output obtained. It would be convenient to have the ability to map the actual payload received to a Java object (class or record, for instance). As of now, the representation is textual and so is the implementation. Output parsers may achieve this, similarly to Spring JDBC’s mapping structures. In this proof of concept, a BeanOutputParser is used, which allows deserializing the result directly in a Java record as below: Java public record JobReasons(String job, String location, List<String> reasons) { } This is done by taking the {format} as part of the prompt text and providing it as an instruction to the AI Model. The OpenAiService method becomes: Java public JobReasons formattedJobReasons(int count, String job, String location) { final String promptText = """ Write {count} reasons why people in {location} should consider a {job} job. These reasons need to be short, so they fit on a poster. For instance, "{job} jobs are rewarding." {format} """; BeanOutputParser<JobReasons> outputParser = new BeanOutputParser<>(JobReasons.class); final PromptTemplate promptTemplate = new PromptTemplate(promptText); promptTemplate.add("count", count); promptTemplate.add("job", job); promptTemplate.add("location", location); promptTemplate.add("format", outputParser.getFormat()); promptTemplate.setOutputParser(outputParser); final Prompt prompt = promptTemplate.create(); ChatResponse response = client.call(prompt); return outputParser.parse(response.getResult().getOutput().getContent()); } When invoking again, the output is as below: JSON { "job":"software architect", "location":"Romania", "reasons":[ "High demand", "Competitive salary", "Opportunities for growth" ] } The format is the expected one, but the reasons appear less explanatory, which means additional adjustments are required in order to achieve better usability. From a proof of concept point of view though, this is acceptable, as the focus was on the form. Conclusions Prompt design is an important part of the task – the better articulated prompts are, the better the input and the higher the output quality. Using Spring AI to integrate with various chat models is quite straightforward – this post showcased an Open AI integration. Nevertheless, in the case of Gen AI in general, just as in the case of almost any technology, it is very important to get familiar at least with the general concepts first. Then, to try to understand the magic behind the way the communication is carried out and only afterward, start writing “production” code. Last but not least, it is advisable to further explore the Spring AI API to understand the implementations and remain up-to-date as it evolves and improves. The code is available here. References Spring AI Reference

By Horatiu Dan

AI for Web Devs: Deploying Your AI App to Production

Welcome back to the series where we have been building an application with Qwik that incorporates AI tooling from OpenAI. So far we’ve created a pretty cool app that uses AI to generate text and images. Intro and Setup Your First AI Prompt Streaming Responses How Does AI Work Prompt Engineering AI-Generated Images Security and Reliability Deploying Now, there’s just one more thing to do. It’s launch time! I’ll be deploying to Akamai‘s cloud computing services (formerly Linode), but these steps should work with any VPS provider. Let’s do this! Setup Runtime Adapter There are a couple of things we need to get out of the way first: deciding where we are going to run our app, what runtime it will run in, and how the deployment pipeline should look. As I mentioned before, I’ll be deploying to a VPS in Akamai’s connected cloud, but any other VPS should work. For the runtime, I’ll be using Node.js, and I’ll keep the deployment simple by using Git. Qwik is cool because it’s designed to run in multiple JavaScript runtimes. That’s handy, but it also means that our code isn’t ready to run in production as is. Qwik needs to be aware of its runtime environment, which we can do with adapters. We can access see and install available adapters with the command, npm run qwik add. This will prompt us with several options for adapters, integrations, and plugins. For my case, I’ll go down and select the Fastify adapter. It works well on a VPS running Node.js. You can select a different target if you prefer. Once you select your integration, the terminal will show you the changes it’s about to make and prompt you to confirm. You’ll see that it wants to modify some files, create some new ones, install dependencies, and add some new npm scripts. Make sure you’re comfortable with these changes before confirming. Once these changes are installed, your app will have what it needs to run in production. You can test this by building the production assets and running the serve command. (Note: For some reason, npm run build always hangs for me, so I run the client and server build scripts separately). npm run build.client & npm run build.server & npm run serve This will build out our production assets and start the production server listening for requests at http://localhost:3000. If all goes well, you should be able to open that URL in your browser and see your app there. It won’t actually work because it’s missing the OpenAI API keys, but we’ll sort that part out on the production server. Push Changes To Git Repo As mentioned above, this deployment process is going to be focused on simplicity, not automation. So rather than introducing more complex tooling like Docker containers or Kubernetes, we’ll stick to a simpler, but more manual process: using Git to deploy our code. I’ll assume you already have some familiarity with Git and a remote repo you can push to. If not, please go make one now. You’ll need to commit your changes and push it to your repo. git commit -am "ready to commit" & git push origin main Prepare Production Server If you already have a VPS ready, feel free to skip this section. I’ll be deploying to an Akamai VPS. I won’t walk through the step-by-step process for setting up a server, but in case you’re interested, I chose the Nanode 1 GB shared CPU plan for $5/month with the following specs: Operating system: Ubuntu 22.04 LTS Location: Seattle, WA CPU: 1 RAM: 1 GB Storage: 25 GB Transfer: 1 TB Choosing different specs shouldn’t make a difference when it comes to running your app, although some of the commands to install any dependencies may be different. If you’ve never done this before, then try to match what I have above. You can even use a different provider, as long as you’re deploying to a server to which you have SSH access. Once you have your server provisioned and running, you should have a public IP address that looks something like 172.100.100.200. You can log into the server from your terminal with the following command: ssh root@172.100.100.200 You’ll have to provide the root password if you have not already set up an authorized key. We’ll use Git as a convenient tool to get our code from our repo into our server, so that will need to be installed. But before we do that, I always recommend updating the existing software. We can do the update and installation with the following command. sudo apt update && sudo apt install git -y Our server also needs Node.js to run our app. We could install the binary directly, but I prefer to use a tool called NVM, which allows us to easily manage Node versions. We can install it with this command: curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash Once NVM is installed, you can install the latest version of Node with: nvm install node Note that the terminal may say that NVM is not installed. If you exit the server and sign back in, it should work. Upload, Build, and Run App With our server set up, it’s time to get our code installed. With Git, it’s relatively easy. We can copy our code into our server using the clone command. You’ll want to use your own repo, but it should look something like this: git clone https://github.com/AustinGil/versus.git Our source code is now on the server, but it’s still not quite ready to run. We still need to install the NPM dependencies, build the production assets, and provide any environment variables. Let’s do it! First, navigate to the folder where you just cloned the project. I used: cd versus The install is easy enough: npm install The build command is: npm run build However, if you have any type-checking or linting errors, it will hang there. You can either fix the errors (which you probably should) or bypass them and build anyway with this: npm run build.client & npm run build.server The latest version of the project source code has working types if you want to check that. The last step is a bit tricky. As we saw above, environment variables will not be injected from the .env file when running the production app. Instead, we can provide them at runtime right before the serve command like this: OPENAI_API_KEY=your_api_key npm run serve You’ll want to provide your own API key there in order for the OpenAI requests to work. Also, for Node.js deployments, there’s an extra, necessary step. You must also set an ORIGIN variable assigned to the full URL where the app will be running. Qwik needs this information to properly configure their CSRF protection. If you don’t know the URL, you can disable this feature in the /src/entry.preview.tsx file by setting the createQwikCity options property checkOrigin to false: export default createQwikCity({ render, qwikCityPlan, checkOrigin: false }); This process is outlined in more detail in the docs, but it’s recommended not to disable, as CSRF can be quite dangerous. And anyway, you’ll need a URL to deploy the app anyway, so better to just set the ORIGIN environment variable. Note that if you make this change, you’ll want to redeploy and rerun the build and serve commands. If everything is configured correctly and running, you should start seeing the logs from Fastify in the terminal, confirming that the app is up and running. {"level":30,"time":1703810454465,"pid":23834,"hostname":"localhost","msg":"Server listening at http://[::1]:3000"} Unfortunately, accessing the app via IP address and port number doesn’t show the app (at least not for me). This is likely a networking issue, but also something that will be solved in the next section, where we run our app at the root domain. The Missing Steps Technically, the app is deployed, built, and running, but in my opinion, there is a lot to be desired before we can call it “production-ready.” Some tutorials would assume you know how to do the rest, but I don’t want to do you like that. We’re going to cover: Running the app in background mode Restarting the app if the server crashes Accessing the app at the root domain Setting up an SSL certificate One thing you will need to do for yourself is buy the domain name. There are lots of good places. I’ve been a fan of Porkbun and Namesilo. I don’t think there’s a huge difference for which registrar you use, but I like these because they offer WHOIS privacy and email forwarding at no extra charge on top of their already low prices. Before we do anything else on the server, it’ll be a good idea to point your domain name’s A record (@) to the server’s IP address. Doing this sooner can help with propagation times. Now, back in the server, there’s one glaring issue we need to deal with first. When we run the npm run serve command, our app will run as long as we keep the terminal open. Obviously, it would be nice to exit out of the server, close our terminal, and walk away from our computer to go eat pizza without the app crashing. So we’ll want to run that command in the background. There are plenty of ways to accomplish this: Docker, Kubernetes, Pulumis, etc., but I don’t like to add too much complexity. So for a basic app, I like to use PM2, a Node.js process manager with great features, including the ability to run our app in the background. From inside your server, run this command to install PM2 as a global NPM module: npm install -g pm2 Once it’s installed, we can tell PM2 what command to run with the “start” command: pm2 start "npm run serve" PM2 has a lot of really nice features in addition to running our apps in the background. One thing you’ll want to be aware of is the command to view logs from your app: pm2 logs In addition to running our app in the background, PM2 can also be configured to start or restart any process if the server crashes. This is super helpful to avoid downtime. You can set that up with this command: pm2 startup Ok, our app is now running and will continue to run after a server restart. Great! But we still can’t get to it. Lol! My preferred solution is using Caddy. This will resolve the networking issues, work as a great reverse proxy, and take care of the whole SSL process for us. We can follow the install instructions from their documentation and run these five commands: sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list sudo apt update sudo apt install caddy Once that’s done, you can go to your server’s IP address and you should see the default Caddy welcome page: Progress! In addition to showing us something is working, this page also gives us some handy information on how to work with Caddy. Ideally, you’ve already pointed your domain name to the server’s IP address. Next, we’ll want to modify the Caddyfile: sudo nano /etc/caddy/Caddyfile As their instructions suggest, we’ll want to replace the :80 line with our domain (or subdomain), but instead of uploading static files or changing the site root, I want to remove (or comment out) the root line and enable the reverse_proxy line, pointing the reverse proxy to my Node.js app running at port 3000. versus.austingil.com { reverse_proxy localhost:3000 } After saving the file and reloading Caddy (systemctl reload caddy), the new Caddyfile changes should take effect. Note that it may take a few moments before the app is fully up and running. This is because one of Caddy’s features is to provision a new SSL certificate for the domain. It also sets up the automatic redirect from HTTP to HTTPS. So now if you go to your domain (or subdomain), you should be redirected to the HTTPS version running a reverse proxy in front of your generative AI application which is resilient to server crashes. How awesome is that!? Using PM2 we can also enable some load-balancing in case you’re running a server with multiple cores. The full PM2 command including environment variables and load-balancing might look something like this: OPENAI_API_KEY=your_api_key ORIGIN=example.com pm2 start "npm run serve" -i max Note that you may need to remove the current instance from PM2 and rerun the start command, you don’t have to restart the Caddy process unless you change the Caddy file, and any changes to the Node.js source code will require a rebuild before running it again. Hell Yeah! We Did It! Alright, that’s it for this blog post and this series. I sincerely hope you enjoyed both and learned some cool things. Today, we covered a lot of things you need to know to deploy an AI-powered application: Runtime adapters Building for production Environment variables Process managers Reverse-proxies SSL certificates If you missed any of the previous posts, be sure to go back and check them out. I’d love to know what you thought about the whole series. If you want, you can play with the app I built. Let me know if you deployed your own app. Also, if you have ideas for topics you’d like me to discuss in the future I’d love to hear them :) UPDATE: If you liked this project and are curious to see what it might look like as a SvelteKit app, check out this blog post by Tim Smith where he converts this existing app over. Thank you so much for reading.

By Austin Gil

CORE

AI/ML

DZone's Featured AI/ML Resources

Top AI/ML Experts

The Latest AI/ML Topics