Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSYM] Callsites: Add data format support and loading from YAML #109781

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

alx32
Copy link
Contributor

@alx32 alx32 commented Sep 24, 2024

This PR adds support in the gSYM format for call site information and adds support for loading call sites from a YAML file. The support for YAML input is mostly for testing purposes - so we have a way to test the functionality.

Note that this data is not currently used in the gSYM tooling - the logic to use call sites will be added in a later PR.

The reason why we need call site information in gSYM files is so that we can support better call stack function disambiguation in the case where multiple functions have been merged due to optimization (linker ICF). When resolving a merged function on the callstack, we can use the call site information of the calling function to narrow down the actual function that is being called, from the set of all merged functions.

See this RFC for more details on this change.

Copy link

github-actions bot commented Sep 24, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@alx32 alx32 marked this pull request as ready for review September 24, 2024 17:21
@llvmbot
Copy link
Collaborator

llvmbot commented Sep 24, 2024

@llvm/pr-subscribers-debuginfo

Author: None (alx32)

Changes

This PR adds support in the gSYM format for call site information and adds support for loading call sites from a YAML file. The support for YAML input is mostly for testing purposes - so we have a way to test the functionality.

Note that this data is not currently used in the gSYM tooling - the logic to use call sites will be added in a later PR.

The reason why we need call site information in gSYM files is so that we can support better call stack function disambiguation in the case where multiple functions have been merged due to optimization (linker ICF). When resolving a merged function on the callstack, we can use the call site information of the calling function to narrow down the actual function that is being called, from the set of all merged functions.

See this RFC for more details on this change.


Patch is 101.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109781.diff

15 Files Affected:

  • (added) llvm/include/llvm/DebugInfo/GSYM/CallSiteInfo.h (+224)
  • (modified) llvm/include/llvm/DebugInfo/GSYM/FunctionInfo.h (+7-3)
  • (modified) llvm/include/llvm/DebugInfo/GSYM/GsymCreator.h (+10-1)
  • (modified) llvm/include/llvm/DebugInfo/GSYM/GsymReader.h (+20)
  • (modified) llvm/lib/DebugInfo/GSYM/CMakeLists.txt (+1)
  • (added) llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp (+293)
  • (modified) llvm/lib/DebugInfo/GSYM/FunctionInfo.cpp (+30-1)
  • (modified) llvm/lib/DebugInfo/GSYM/GsymCreator.cpp (+6)
  • (modified) llvm/lib/DebugInfo/GSYM/GsymReader.cpp (+45)
  • (added) llvm/test/tools/llvm-gsymutil/ARM_AArch64/macho-gsym-callsite-info-dsym.yaml (+950)
  • (added) llvm/test/tools/llvm-gsymutil/ARM_AArch64/macho-gsym-callsite-info-exe.yaml (+558)
  • (added) llvm/test/tools/llvm-gsymutil/ARM_AArch64/macho-gsym-callsite-info-obj.test (+304)
  • (modified) llvm/tools/llvm-gsymutil/Opts.td (+2)
  • (modified) llvm/tools/llvm-gsymutil/llvm-gsymutil.cpp (+19)
  • (modified) llvm/unittests/DebugInfo/GSYM/GSYMTest.cpp (+2-2)
diff --git a/llvm/include/llvm/DebugInfo/GSYM/CallSiteInfo.h b/llvm/include/llvm/DebugInfo/GSYM/CallSiteInfo.h
new file mode 100644
index 00000000000000..45257f0e11578e
--- /dev/null
+++ b/llvm/include/llvm/DebugInfo/GSYM/CallSiteInfo.h
@@ -0,0 +1,224 @@
+//===- CallSiteInfo.h -------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_DEBUGINFO_GSYM_CALLSITEINFO_H
+#define LLVM_DEBUGINFO_GSYM_CALLSITEINFO_H
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/ADT/StringSet.h"
+#include "llvm/DebugInfo/GSYM/ExtractRanges.h"
+#include "llvm/Support/YAMLParser.h"
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace llvm {
+class DataExtractor;
+class raw_ostream;
+class StringTableBuilder;
+class CachedHashStringRef;
+
+namespace yaml {
+struct CallSiteYAML;
+struct FunctionYAML;
+struct FunctionsYAML;
+} // namespace yaml
+
+namespace gsym {
+class FileWriter;
+struct FunctionInfo;
+struct CallSiteInfo {
+public:
+  enum Flags : uint8_t {
+    None = 0,
+    // This flag specifies that the call site can only call a function within
+    // the same link unit as the call site.
+    InternalCall = 1 << 0,
+    // This flag specifies that the call site can only call a function outside
+    // the link unit that the call site is in.
+    ExternalCall = 1 << 1,
+  };
+
+  /// The return address of the call site.
+  uint64_t ReturnAddress;
+
+  /// Offsets into the string table for function names regex patterns.
+  std::vector<uint32_t> MatchRegex;
+
+  /// Bitwise OR of CallSiteInfo::Flags values
+  uint8_t Flags;
+
+  /// Decode a CallSiteInfo object from a binary data stream.
+  ///
+  /// \param Data The binary stream to read the data from.
+  /// \param Offset The current offset within the data stream.
+  /// \param BaseAddr The base address for decoding (unused here but included
+  /// for consistency).
+  ///
+  /// \returns A CallSiteInfo or an error describing the issue.
+  static llvm::Expected<CallSiteInfo>
+  decode(DataExtractor &Data, uint64_t &Offset, uint64_t BaseAddr);
+
+  /// Encode this CallSiteInfo object into a FileWriter stream.
+  ///
+  /// \param O The binary stream to write the data to.
+  /// \returns An error object that indicates success or failure.
+  llvm::Error encode(FileWriter &O) const;
+};
+
+struct CallSiteInfoCollection {
+public:
+  std::vector<CallSiteInfo> CallSites;
+
+  void clear() { CallSites.clear(); }
+
+  /// Query if a CallSiteInfoCollection object is valid.
+  ///
+  /// \returns True if the collection is not empty.
+  bool isValid() const { return !CallSites.empty(); }
+
+  /// Decode a CallSiteInfoCollection object from a binary data stream.
+  ///
+  /// \param Data The binary stream to read the data from.
+  /// \param BaseAddr The base address for decoding (unused here but included
+  /// for consistency).
+  ///
+  /// \returns A CallSiteInfoCollection or an error describing the issue.
+  static llvm::Expected<CallSiteInfoCollection> decode(DataExtractor &Data,
+                                                       uint64_t BaseAddr);
+
+  /// Encode this CallSiteInfoCollection object into a FileWriter stream.
+  ///
+  /// \param O The binary stream to write the data to.
+  /// \returns An error object that indicates success or failure.
+  llvm::Error encode(FileWriter &O) const;
+};
+
+bool operator==(const CallSiteInfoCollection &LHS,
+                const CallSiteInfoCollection &RHS);
+
+bool operator==(const CallSiteInfo &LHS, const CallSiteInfo &RHS);
+
+class CallSiteInfoLoader {
+public:
+  /// Constructor that initializes the CallSiteInfoLoader with necessary data
+  /// structures.
+  ///
+  /// \param StringOffsetMap A reference to a DenseMap that maps existing string
+  /// offsets to CachedHashStringRef. \param StrTab A reference to a
+  /// StringTableBuilder used for managing looking up and creating new strings.
+  /// \param StringStorage A reference to a StringSet for storing the data for
+  /// generated strings.
+  CallSiteInfoLoader(DenseMap<uint64_t, CachedHashStringRef> &StringOffsetMap,
+                     StringTableBuilder &StrTab, StringSet<> &StringStorage)
+      : StringOffsetMap(StringOffsetMap), StrTab(StrTab),
+        StringStorage(StringStorage) {}
+
+  /// Loads call site information from a YAML file and populates the provided
+  /// FunctionInfo vector.
+  ///
+  /// This method reads the specified YAML file, parses its content, and updates
+  /// the `Funcs` vector with call site information based on the YAML data.
+  ///
+  /// \param Funcs A reference to a vector of FunctionInfo objects to be
+  /// populated.
+  /// \param YAMLFile A StringRef representing the path to the YAML
+  /// file to be loaded.
+  ///
+  /// \returns An `llvm::Error` indicating success or describing any issues
+  /// encountered during the loading process.
+  llvm::Error loadYAML(std::vector<FunctionInfo> &Funcs, StringRef YAMLFile);
+
+private:
+  /// Retrieves an existing string from the StringOffsetMap using the provided
+  /// offset.
+  ///
+  /// \param offset A 32-bit unsigned integer representing the offset of the
+  /// string.
+  ///
+  /// \returns A StringRef corresponding to the string for the given offset.
+  ///
+  /// \note This method asserts that the offset exists in the StringOffsetMap.
+  StringRef stringFromOffset(uint32_t offset) const;
+
+  /// Obtains the offset corresponding to a given string in the StrTab. If the
+  /// string does not already exist, it is created.
+  ///
+  /// \param str A StringRef representing the string for which the offset is
+  /// requested.
+  ///
+  /// \returns A 32-bit unsigned integer representing the offset of the string.
+  uint32_t offsetFromString(StringRef str);
+
+  /// Reads the content of the YAML file specified by `YAMLFile` into
+  /// `yamlContent`.
+  ///
+  /// \param YAMLFile A StringRef representing the path to the YAML file.
+  /// \param Buffer The memory buffer containing the YAML content.
+  ///
+  /// \returns An `llvm::Error` indicating success or describing any issues
+  /// encountered while reading the file.
+  llvm::Error readYAMLFile(StringRef YAMLFile,
+                           std::unique_ptr<llvm::MemoryBuffer> &Buffer);
+
+  /// Parses the YAML content and populates `functionsYAML` with the parsed
+  /// data.
+  ///
+  /// \param Buffer The memory buffer containing the YAML content.
+  /// \param functionsYAML A reference to an llvm::yaml::FunctionsYAML object to
+  /// be populated.
+  ///
+  /// \returns An `llvm::Error` indicating success or describing any issues
+  /// encountered during parsing.
+  llvm::Error parseYAML(llvm::MemoryBuffer &Buffer,
+                        llvm::yaml::FunctionsYAML &functionsYAML);
+
+  /// Builds a map from function names to FunctionInfo pointers based on the
+  /// provided `Funcs` vector.
+  ///
+  /// \param Funcs A reference to a vector of FunctionInfo objects.
+  ///
+  /// \returns An unordered_map mapping function names (std::string) to their
+  /// corresponding FunctionInfo pointers.
+  std::unordered_map<std::string, FunctionInfo *>
+  buildFunctionMap(std::vector<FunctionInfo> &Funcs);
+
+  /// Processes the parsed YAML functions and updates the `FuncMap` accordingly.
+  ///
+  /// \param functionsYAML A constant reference to an llvm::yaml::FunctionsYAML
+  /// object containing parsed YAML data.
+  /// \param FuncMap A reference to an unordered_map mapping function names to
+  /// FunctionInfo pointers.
+  /// \param YAMLFile A StringRef representing the name of the YAML file (used
+  /// for error messages).
+  ///
+  /// \returns An `llvm::Error` indicating success or describing any issues
+  /// encountered during processing.
+  llvm::Error
+  processYAMLFunctions(const llvm::yaml::FunctionsYAML &functionsYAML,
+                       std::unordered_map<std::string, FunctionInfo *> &FuncMap,
+                       StringRef YAMLFile);
+
+  /// Map of existing string offsets to CachedHashStringRef.
+  DenseMap<uint64_t, CachedHashStringRef> &StringOffsetMap;
+
+  /// The gSYM string table builder.
+  StringTableBuilder &StrTab;
+
+  /// The gSYM string storage - we store generated strings here.
+  StringSet<> &StringStorage;
+};
+
+raw_ostream &operator<<(raw_ostream &OS, const CallSiteInfo &CSI);
+raw_ostream &operator<<(raw_ostream &OS, const CallSiteInfoCollection &CSIC);
+
+} // namespace gsym
+} // namespace llvm
+
+#endif // LLVM_DEBUGINFO_GSYM_CALLSITEINFO_H
diff --git a/llvm/include/llvm/DebugInfo/GSYM/FunctionInfo.h b/llvm/include/llvm/DebugInfo/GSYM/FunctionInfo.h
index 71209b6b5c9cd1..fd4ac3164c686d 100644
--- a/llvm/include/llvm/DebugInfo/GSYM/FunctionInfo.h
+++ b/llvm/include/llvm/DebugInfo/GSYM/FunctionInfo.h
@@ -10,6 +10,7 @@
 #define LLVM_DEBUGINFO_GSYM_FUNCTIONINFO_H
 
 #include "llvm/ADT/SmallString.h"
+#include "llvm/DebugInfo/GSYM/CallSiteInfo.h"
 #include "llvm/DebugInfo/GSYM/ExtractRanges.h"
 #include "llvm/DebugInfo/GSYM/InlineInfo.h"
 #include "llvm/DebugInfo/GSYM/LineTable.h"
@@ -63,7 +64,9 @@ class GsymReader;
 ///   enum InfoType {
 ///     EndOfList = 0u,
 ///     LineTableInfo = 1u,
-///     InlineInfo = 2u
+///     InlineInfo = 2u,
+///     MergedFunctionsInfo = 3u,
+///     CallSiteInfo = 4u
 ///   };
 ///
 /// This stream of tuples is terminated by a "InfoType" whose value is
@@ -73,7 +76,7 @@ class GsymReader;
 /// clients to still parse the format and skip over any data that they don't
 /// understand or want to parse.
 ///
-/// So the function information encoding essientially looks like:
+/// So the function information encoding essentially looks like:
 ///
 /// struct {
 ///   uint32_t Size;
@@ -92,6 +95,7 @@ struct FunctionInfo {
   std::optional<LineTable> OptLineTable;
   std::optional<InlineInfo> Inline;
   std::optional<MergedFunctionsInfo> MergedFunctions;
+  std::optional<CallSiteInfoCollection> CallSites;
   /// If we encode a FunctionInfo during segmenting so we know its size, we can
   /// cache that encoding here so we don't need to re-encode it when saving the
   /// GSYM file.
@@ -107,7 +111,7 @@ struct FunctionInfo {
   /// debug info, we might end up with multiple FunctionInfo objects for the
   /// same range and we need to be able to tell which one is the better object
   /// to use.
-  bool hasRichInfo() const { return OptLineTable || Inline; }
+  bool hasRichInfo() const { return OptLineTable || Inline || CallSites; }
 
   /// Query if a FunctionInfo object is valid.
   ///
diff --git a/llvm/include/llvm/DebugInfo/GSYM/GsymCreator.h b/llvm/include/llvm/DebugInfo/GSYM/GsymCreator.h
index 48808fb7b71e18..9e5b3c1f8d92de 100644
--- a/llvm/include/llvm/DebugInfo/GSYM/GsymCreator.h
+++ b/llvm/include/llvm/DebugInfo/GSYM/GsymCreator.h
@@ -352,13 +352,22 @@ class GsymCreator {
   /// \param   FI The function info object to emplace into our functions list.
   void addFunctionInfo(FunctionInfo &&FI);
 
+  /// Load call site information from a YAML file.
+  ///
+  /// This function reads call site information from a specified YAML file and
+  /// adds it to the GSYM data.
+  ///
+  /// \param YAMLFile The path to the YAML file containing call site
+  /// information.
+  llvm::Error loadCallSitesFromYAML(StringRef YAMLFile);
+
   /// Organize merged FunctionInfo's
   ///
   /// This method processes the list of function infos (Funcs) to identify and
   /// group functions with overlapping address ranges.
   ///
   /// \param  Out Output stream to report information about how merged
-  /// FunctionInfo's were handeled.
+  /// FunctionInfo's were handled.
   void prepareMergedFunctions(OutputAggregator &Out);
 
   /// Finalize the data in the GSYM creator prior to saving the data out.
diff --git a/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h b/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h
index 89f8c043b91519..72b7f3e7bfc42e 100644
--- a/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h
+++ b/llvm/include/llvm/DebugInfo/GSYM/GsymReader.h
@@ -181,6 +181,26 @@ class GsymReader {
   /// \param MFI The object to dump.
   void dump(raw_ostream &OS, const MergedFunctionsInfo &MFI);
 
+  /// Dump a CallSiteInfo object.
+  ///
+  /// This function will output the details of a CallSiteInfo object in a
+  /// human-readable format.
+  ///
+  /// \param OS The output stream to dump to.
+  ///
+  /// \param CSI The CallSiteInfo object to dump.
+  void dump(raw_ostream &OS, const CallSiteInfo &CSI);
+
+  /// Dump a CallSiteInfoCollection object.
+  ///
+  /// This function will iterate over a collection of CallSiteInfo objects and
+  /// dump each one.
+  ///
+  /// \param OS The output stream to dump to.
+  ///
+  /// \param CSIC The CallSiteInfoCollection object to dump.
+  void dump(raw_ostream &OS, const CallSiteInfoCollection &CSIC);
+
   /// Dump a LineTable object.
   ///
   /// This function will convert any string table indexes and file indexes
diff --git a/llvm/lib/DebugInfo/GSYM/CMakeLists.txt b/llvm/lib/DebugInfo/GSYM/CMakeLists.txt
index be90bfdaa7fd2b..c27d648db62f61 100644
--- a/llvm/lib/DebugInfo/GSYM/CMakeLists.txt
+++ b/llvm/lib/DebugInfo/GSYM/CMakeLists.txt
@@ -8,6 +8,7 @@ add_llvm_component_library(LLVMDebugInfoGSYM
   InlineInfo.cpp
   LineTable.cpp
   LookupResult.cpp
+  CallSiteInfo.cpp
   MergedFunctionsInfo.cpp
   ObjectFileTransformer.cpp
   ExtractRanges.cpp
diff --git a/llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp b/llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp
new file mode 100644
index 00000000000000..4ed3d3f67a44fd
--- /dev/null
+++ b/llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp
@@ -0,0 +1,293 @@
+//===- CallSiteInfo.cpp ----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/DebugInfo/GSYM/CallSiteInfo.h"
+#include "llvm/ADT/CachedHashString.h"
+#include "llvm/DebugInfo/GSYM/FileWriter.h"
+#include "llvm/DebugInfo/GSYM/FunctionInfo.h"
+#include "llvm/MC/StringTableBuilder.h"
+#include "llvm/Support/DataExtractor.h"
+#include "llvm/Support/YAMLParser.h"
+#include "llvm/Support/YAMLTraits.h"
+#include "llvm/Support/raw_ostream.h"
+#include <fstream>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+using namespace llvm;
+using namespace gsym;
+
+llvm::Error CallSiteInfo::encode(FileWriter &O) const {
+  O.writeU64(ReturnAddress);
+  O.writeU8(Flags);
+  O.writeU32(MatchRegex.size());
+  for (uint32_t Entry : MatchRegex)
+    O.writeU32(Entry);
+  return llvm::Error::success();
+}
+
+llvm::Expected<CallSiteInfo>
+CallSiteInfo::decode(DataExtractor &Data, uint64_t &Offset, uint64_t BaseAddr) {
+  CallSiteInfo CSI;
+
+  // Read ReturnAddress
+  if (!Data.isValidOffsetForDataOfSize(Offset, sizeof(uint64_t)))
+    return createStringError(std::errc::io_error,
+                             "0x%8.8" PRIx64 ": missing ReturnAddress", Offset);
+  CSI.ReturnAddress = Data.getU64(&Offset);
+
+  // Read Flags
+  if (!Data.isValidOffsetForDataOfSize(Offset, sizeof(uint8_t)))
+    return createStringError(std::errc::io_error,
+                             "0x%8.8" PRIx64 ": missing Flags", Offset);
+  CSI.Flags = Data.getU8(&Offset);
+
+  // Read number of MatchRegex entries
+  if (!Data.isValidOffsetForDataOfSize(Offset, sizeof(uint32_t)))
+    return createStringError(std::errc::io_error,
+                             "0x%8.8" PRIx64 ": missing MatchRegex count",
+                             Offset);
+  uint32_t NumEntries = Data.getU32(&Offset);
+
+  CSI.MatchRegex.reserve(NumEntries);
+  for (uint32_t i = 0; i < NumEntries; ++i) {
+    if (!Data.isValidOffsetForDataOfSize(Offset, sizeof(uint32_t)))
+      return createStringError(std::errc::io_error,
+                               "0x%8.8" PRIx64 ": missing MatchRegex entry",
+                               Offset);
+    uint32_t Entry = Data.getU32(&Offset);
+    CSI.MatchRegex.push_back(Entry);
+  }
+
+  return CSI;
+}
+
+llvm::Error CallSiteInfoCollection::encode(FileWriter &O) const {
+  O.writeU32(CallSites.size());
+  for (const CallSiteInfo &CSI : CallSites) {
+    if (llvm::Error Err = CSI.encode(O))
+      return Err;
+  }
+  return llvm::Error::success();
+}
+
+llvm::Expected<CallSiteInfoCollection>
+CallSiteInfoCollection::decode(DataExtractor &Data, uint64_t BaseAddr) {
+  CallSiteInfoCollection CSC;
+  uint64_t Offset = 0;
+
+  // Read number of CallSiteInfo entries
+  if (!Data.isValidOffsetForDataOfSize(Offset, sizeof(uint32_t)))
+    return createStringError(std::errc::io_error,
+                             "0x%8.8" PRIx64 ": missing CallSiteInfo count",
+                             Offset);
+  uint32_t NumCallSites = Data.getU32(&Offset);
+
+  CSC.CallSites.reserve(NumCallSites);
+  for (uint32_t i = 0; i < NumCallSites; ++i) {
+    llvm::Expected<CallSiteInfo> ECSI =
+        CallSiteInfo::decode(Data, Offset, BaseAddr);
+    if (!ECSI)
+      return ECSI.takeError();
+    CSC.CallSites.emplace_back(*ECSI);
+  }
+
+  return CSC;
+}
+
+/// Structures necessary for reading CallSiteInfo from YAML.
+namespace llvm {
+namespace yaml {
+
+struct CallSiteYAML {
+  // The offset of the return address of the call site - relative to the start
+  // of the function.
+  llvm::yaml::Hex64 return_offset;
+  std::vector<std::string> match_regex;
+  std::vector<std::string> flags;
+};
+
+struct FunctionYAML {
+  std::string name;
+  std::vector<CallSiteYAML> callsites;
+};
+
+struct FunctionsYAML {
+  std::vector<FunctionYAML> functions;
+};
+
+template <> struct MappingTraits<CallSiteYAML> {
+  static void mapping(IO &io, CallSiteYAML &callsite) {
+    io.mapRequired("return_offset", callsite.return_offset);
+    io.mapRequired("match_regex", callsite.match_regex);
+    io.mapOptional("flags", callsite.flags);
+  }
+};
+
+template <> struct MappingTraits<FunctionYAML> {
+  static void mapping(IO &io, FunctionYAML &func) {
+    io.mapRequired("name", func.name);
+    io.mapOptional("callsites", func.callsites);
+  }
+};
+
+template <> struct MappingTraits<FunctionsYAML> {
+  static void mapping(IO &io, FunctionsYAML &functionsYAML) {
+    io.mapRequired("functions", functionsYAML.functions);
+  }
+};
+
+} // namespace yaml
+} // namespace llvm
+
+LLVM_YAML_IS_SEQUENCE_VECTOR(CallSiteYAML)
+LLVM_YAML_IS_SEQUENCE_VECTOR(FunctionYAML)
+
+// Implementation of CallSiteInfoLoader
+StringRef CallSiteInfoLoader::stringFromOffset(uint32_t offset) const {
+  assert(StringOffsetMap.count(offset) &&
+         "expected function name offset to already be in StringOffsetMap");
+  return StringOffsetMap.find(offset)->second.val();
+}
+
+uint32_t CallSiteInfoLoader::offsetFromString(StringRef str) {
+  return StrTab.add(StringStorage.insert(str).first->getKey());
+}
+
+llvm::Error CallSiteInfoLoader::loadYAML(std::vector<FunctionInfo> &Funcs,
+                                         StringRef YAMLFile) {
+  std::unique_ptr<llvm::MemoryBuffer> Buffer;
+  // Step 1: Read YAML file
+  if (auto Err = readYAMLFile(YAMLFile, Buffer))
+    return Err;
+
+  // Step 2: Parse YAML content
+  llvm::yaml::FunctionsYAML functionsYAML;
+  if (auto Err = parseYAML(*Buffer, functionsYAML))
+    return Err;
+
+  // Step 3: Build function map from Funcs
+  auto FuncMap = buildFunctionMap(Funcs);
+
+  // Step 4: Process parsed YAML functions and update FuncMap
+  return processYAMLFunctions(functionsYAML, FuncMap, YAMLFile);
+}
+
+llvm::Error
+CallSiteInfoLoader::readYAMLFile(StringRef YAMLFile,
+                                 std::unique_ptr<llvm::MemoryBuffer> &Buffer) {
+  auto BufferOrError = llvm::MemoryBuffer::getFile(YAMLFile);
+  if (!BufferOrError)
+    return errorCodeToError(BufferOrError.getError());
+  Buffer = std::move(*BufferOrError);
+  return llvm::Error::success();
+}
+
+llvm::Error
+CallSiteInfoLoader::parseYAML(llvm::MemoryBuffer &Buffer,
+                              llvm::yaml::FunctionsYAML &functionsYAML) {
+  // Use the MemoryBufferRef constructor
+  llvm::yaml::Input yin(Buffer.getMemBufferRef());
+  yin >> functionsYAML...
[truncated]

llvm/tools/llvm-gsymutil/llvm-gsymutil.cpp Outdated Show resolved Hide resolved
llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp Outdated Show resolved Hide resolved
llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp Outdated Show resolved Hide resolved
llvm/lib/DebugInfo/GSYM/CallSiteInfo.cpp Outdated Show resolved Hide resolved
llvm/include/llvm/DebugInfo/GSYM/CallSiteInfo.h Outdated Show resolved Hide resolved
/// \param Data The binary stream to read the data from.
/// \param Offset The current offset within the data stream.
/// \param BaseAddr The base address for decoding (unused here but included
/// for consistency).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a static function, and I'm curious about the rationale for including this unused function for this purpose.
In fact, when examining InlineInfo, which has a similar usage, it does not pass Offset in its static member function. Instead, it encapsulates it within a local static function, as the Offset is actually tied to its context or parent function. You might consider doing something similar here if you want to maintain consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand here:

including this unused function for this purpose
The function is used, do you mean unused parameter ?

The BaseAddr is the one not being used, Offset is being used and is indeed encapsulated in the local static function CallSiteInfoCollection::decode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just delete BaseAddr which is not used.


template <> struct MappingTraits<FunctionsYAML> {
static void mapping(IO &io, FunctionsYAML &functionsYAML) {
io.mapRequired("functions", functionsYAML.functions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FunctionsYAML doesn't seem much useful as it's a merely vector of FunctionYAML.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that this syntax/struct, even though not very useful, is required for the correct interface to the YAML parser. Is there a better syntax for yin >> functionsYAML; that can maintain the current structure of the YAML format ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to std::vector<FunctionInfo> &Funcs that is passed in, I think you can declare std::vector<yaml::FunctionYAML> FuncYAMLs; and read it without having this redundant definition at your Step 2 in loadYAML like:

  std::vector<yaml::FunctionYAML> FuncYAMLs;
  yaml::Input Yin(Buffer->getMemBufferRef());
  Yin >> FuncYAMLs;

}

uint32_t CallSiteInfoLoader::offsetFromString(StringRef str) {
return StrTab.add(StringStorage.insert(str).first->getKey());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding this, but why do we need a separate StringStorage? Doesn't StrTab already ensure the uniqueness of strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringStorage is used for actual storage of the bytes of the string. Here the str comes from the YAML parser. This means that after parsing is done and the YAML parser is destructed, the str will point to invalid memory. So we copy the bytes of str into StringStorage to ensure they will be around for the lifetime of StrTab. This is how/why StringStorage exists in llvm/lib/DebugInfo/GSYM/GsymCreator.cpp also.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Then why not using CachedHashStringRef before being added like in GsymCreator.cpp? And also why not updating StringOffsetMap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored thins a bit for better integration.

FW.writeU32(0); // LineTableInfo InfoType data length.
TestFunctionInfoDecodeError(ByteOrder, OutStrm.str(), BaseAddr,
"0x00000008: unsupported InfoType 4");
"0x00000008: unsupported InfoType 7");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 7 instead of 5?

Copy link
Contributor Author

@alx32 alx32 Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this was set to '4', '2' was the highest existent value, so a bit of "headroom" was offered.
I went with the original intent, offering extra headroom for new types without breaking the test. Should I put in 5 ?

llvm/tools/llvm-gsymutil/llvm-gsymutil.cpp Show resolved Hide resolved
llvm/lib/DebugInfo/GSYM/GsymReader.cpp Show resolved Hide resolved
llvm/tools/llvm-gsymutil/llvm-gsymutil.cpp Outdated Show resolved Hide resolved
llvm/tools/llvm-gsymutil/llvm-gsymutil.cpp Show resolved Hide resolved
};

/// The return address of the call site.
uint64_t ReturnAddress;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just initialize it with a default value for the safety.

Suggested change
uint64_t ReturnAddress;
uint64_t ReturnAddress = 0;

/// \param Data The binary stream to read the data from.
/// \param Offset The current offset within the data stream.
/// \param BaseAddr The base address for decoding (unused here but included
/// for consistency).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just delete BaseAddr which is not used.

@@ -96,6 +97,7 @@ static bool Quiet;
static std::vector<uint64_t> LookupAddresses;
static bool LookupAddressesFromStdin;
static bool StoreMergedFunctionInfo = false;
static std::vector<std::string> CallSiteYamlPaths;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you use the last arg only.

Suggested change
static std::vector<std::string> CallSiteYamlPaths;
static std::string CallSiteYamlPath;

bool operator==(const CallSiteInfoCollection &LHS,
const CallSiteInfoCollection &RHS);

bool operator==(const CallSiteInfo &LHS, const CallSiteInfo &RHS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are actually defined? Do you need to override these operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, no need for them.

/// structures.
///
/// \param StringOffsetMap A reference to a DenseMap that maps existing string
/// offsets to CachedHashStringRef. \param StrTab A reference to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the format consistent with others?


/// Loads call site information from a YAML file and populates the provided
/// FunctionInfo vector.
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content above and below seems similar, so you might just combine them into a single sentence concisely.

/// populated.
/// \param YAMLFile A StringRef representing the path to the YAML
/// file to be loaded.
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at GsymCreator.h , there is no space between \param and \returns. you could delete this empty line for all changes you made.

/// FunctionInfo vector.
///
/// offsets to CachedHashStringRef.
/// \param StrTab A reference to a StringTableBuilder used for managing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you update the comment without these parameters?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, remove these.

// Step 2: Parse YAML content
llvm::yaml::FunctionsYAML functionsYAML;
llvm::yaml::Input yin(Buffer->getMemBufferRef());
yin >> functionsYAML;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think the single field functions: in the YAML seems a waste unless you plan to extend other fields to deal with more than the function scope like module, etc. Normally with this function level entity, it'd be easier to deal with merging as then can be easily appended.

And also please use a capital letter for the first character of the local variables. I know it's different than LLD convention, but it's actually common in LLVM.

@@ -385,6 +385,12 @@ uint32_t GsymCreator::insertString(StringRef S, bool Copy) {
return StrOff;
}

StringRef GsymCreator::getString(uint32_t offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return an optional value like std::optional? The release build ignores asserts. I'd make it explicit for its use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would mean that the GSYM is corrupt - do you mean we should return optional and emit an error ?

Copy link
Collaborator

@clayborg clayborg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main concern is adding data that contains full addresses anywhere in the FunctionInfo encodings. If we can avoid that and encode offsets, that will be good, but it only works is addresses are relative to the start address of the function's address range. We have no relocations in FunctionInfo prior to merge function info or call site info and I would like to keep it that way if possible.

Comment on lines 28 to 29
struct CallSiteYAML;
struct FunctionYAML;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't needed in the header file. Just FunctionsYAML. Can you move these forward decls to the .cpp file?

/// FunctionInfo vector.
///
/// offsets to CachedHashStringRef.
/// \param StrTab A reference to a StringTableBuilder used for managing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, remove these.

namespace llvm {
class DataExtractor;
class raw_ostream;
class StringTableBuilder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this, not needed in this header.

class DataExtractor;
class raw_ostream;
class StringTableBuilder;
class CachedHashStringRef;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this forward decl too.

#ifndef LLVM_DEBUGINFO_GSYM_CALLSITEINFO_H
#define LLVM_DEBUGINFO_GSYM_CALLSITEINFO_H

#include "llvm/ADT/DenseMap.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSet.h"
#include "llvm/DebugInfo/GSYM/ExtractRanges.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with

#include "llvm/Support/Error.h"

#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSet.h"
#include "llvm/DebugInfo/GSYM/ExtractRanges.h"
#include "llvm/Support/YAMLParser.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Comment on lines 17 to 18
#include <string>
#include <unordered_map>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

if (!Data.isValidOffsetForDataOfSize(Offset, sizeof(uint64_t)))
return createStringError(std::errc::io_error,
"0x%8.8" PRIx64 ": missing ReturnAddress", Offset);
CSI.ReturnAddress = Data.getU64(&Offset);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to the merged functions and this CallSiteInfo stuff, there is no data that requires relocations in the FunctionInfo objects. Everything was relative to the function start address from the address offsets table. I assume this address isn't relative to the start address of the function info's address range right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it's absolute - see below:

CSI.ReturnAddress = FuncInfo->Range.start() + CallSiteYAML.return_offset;

Should we make it relative to start ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants