Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add user data collections w/ basic types #213

Merged
merged 19 commits into from
Sep 21, 2021
Merged

Conversation

gaede
Copy link
Contributor

@gaede gaede commented Sep 1, 2021

BEGINRELEASENOTES

  • add possibility to store additional user data as collections of fundamental types in PODIO files
    • uses std::vector<basic_type>
    • stored in simple branch in root (and simple block in SIO)
    • all fundamental types supported in PODIO (except bool) can be written
  • example code:
  auto& usrInts = store.create<podio::UserDataCollection<uint64_t> >("userInts");
  auto& usrDoubles = store.create<podio::UserDataCollection<double> >("userDoubles");
  // ...

  // add some unsigned ints
  usrInts.resize( i + 1 ) ;
  int myInt = 0 ;
  for( auto& iu : usrInts ){
    iu = myInt++  ;
  }
  // and some user double values
  unsigned nd = 100 ;
  usrDoubles.resize( nd ) ;
  for(unsigned id=0 ; id<nd ; ++id){
    usrDoubles[id] = 42. ;
  }

ENDRELEASENOTES

This should address most needs for storing user defined data in PODIO in an efficient way without any overhead. In ROOT every user data collection is stored directly in a branch of the given type and can be accessed directly.
No relations to or from the user defined collections are foreseen. Users need to keep track if indices, e.g. by running collections parallel to existing PODIO (EDM4hep) collections.

This PR should work now as an example for possible treatment of additional user data. Open questions:

  • should the set of basic_types be restricted ?
  • if so, to which types ?
  • some code cleanup/improvement needed (see suggestions below)
  • ...

Comment on lines 162 to 165
typedef int intData ;
typedef long longData ;
typedef float floatData ;
typedef double doubleData ;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
typedef int intData ;
typedef long longData ;
typedef float floatData ;
typedef double doubleData ;

These seem to be unnecessary and cause the root problems in the CI (don't ask me why though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Thomas. This seems to do the trick (at least on my centos7 test system) However I do not understand why and more importantly how this can even work, as the branch for the user collection is defined (by the ROOTReader) as vector<doubleData> and doubleData is a non existing type w/o these typedefs !???

Comment on lines 94 to 95
&m_refCollections, // only need to store the ObjectIDs of the referenced objects
&m_vecmem_info } ;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
&m_refCollections, // only need to store the ObjectIDs of the referenced objects
&m_vecmem_info } ;
nullptr,
nullptr } ;

Would allow it to not have to keep some unnecessary members for these parts in the class. This should be easily possible now, but I am not entirely sure whether everything that expects this already has the necessary nullptr check in place already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I had in initially - but it crashes in the SIOWriter as the SIOBlocks have:

  //---- read ref collections -----
  auto* refCols = collBuffers.references;
  for( auto* refC : *refCols ){
    unsigned size{0};
    device.data( size ) ;
    refC->resize(size) ;
    podio::handlePODDataSIO( device ,  &((*refC)[0]), size ) ;
  }

Should add a check for null pointer here ...

Comment on lines 17 to 18
class UserDataTypes{
private:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to put all of this into a function instead of a singleton? In principle everything that is needed is a "map" of typeids to easier to read type names and that could also just be constructed statically (or potentially even static constexpr).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. As this is still WIP I wanted to see how far we can take this approach and wanted to have a potentially extendible map (w/ adding a registerTypeName() method or similar...

Comment on lines 24 to 30
const std::map< const std::type_index, const std::string >_typeMap =
{
{ std::type_index( typeid(int) ), "podio::int" },
{ std::type_index( typeid(long) ), "podio::long" },
{ std::type_index( typeid(float) ), "podio::float" },
{ std::type_index( typeid(double) ),"podio::double"}
} ;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How restrictive do we want to be with this map? In principle I think it is a good idea to not be overly permissive, but I also think that a few more fundamental types might be useful (even if the listed ones probably cover > 90% of the use cases).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open for discussion - this is just an initial set that probably would have to be extended by at least some short and unsigned, maybe fixed_size types. This depends if we can make the mechanism extendible, i.e. such that users can add additional types if needed or not. For SIO this should work easily - with ROOT I am not so sure due to the discionary, etc...



/// Templated base class for storing std::vectors of basic types as user data in PODIO
template <typename BasicType>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This template should be constrained to only accept those types that we actually support via the mapping above. Currently it is possible to do something like

auto& coll = store.create<podio::UserDataCollection<unsigned>>("unsignedColl");
writer.registerForWrite("unsignedColl");

which compiles fine but crashes at some point at run-time. It should be possible to catch this at compile time already either via some static_assert or enable_if trickery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agreed, if we decide to only provide a fixed set of types - if not, then of course it cannot be done and we leave failure to runtime...


virtual void read(sio::read_device& device, sio::version_type /*version*/) override {
auto collBuffers = _col->getBuffers();
if (not _col->isSubsetCollection()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check could be removed for the UserDataCollection, since it will always return true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.


auto& usrInts = store.get<podio::UserDataCollection<int> >("userInts");

#if 0 //access via vector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these left over from testing, or are they actually required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of: testing - and as example how the underlying vector could be accessed, if needed. Can also remove eventually.
All this is still very much WIP ...

Comment on lines 59 to 67
std::string valTypeName = coll->getValueTypeName() ;

auto collClassName = "vector<" + valTypeName + "Data>";

auto it = valTypeName.find("podio::User_") ;
if(it == 0 ){
valTypeName = valTypeName.substr( 12 , 1024 ) ; // rm 'podio_user_' prefix
collClassName = "vector<" + valTypeName +">";
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider adding a getDataTypeName function (or similar) to the podio::CollectionBase? It would make some of the string juggling easier to do for general use, and give us a customization point where these user classes could be handled more nicely. As far as I can tell the major use cases for getValueTypeName currently are:

  • Keys for the map inside the SIOBlockFactory
  • Creating the vector<XYZData> for the root writer and reader

So maybe it could even be a getDataVectorTypeName alongside the getValueTypeName (?)

@vvolkl
Copy link
Contributor

vvolkl commented Sep 7, 2021

Hi Frank,

I think the access to the underlying vector here is very nice! In my opinion the available data types are fine; they can always be expanded if necessary. Thomas' point for a compile time check is still a good one though.

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already discussed during the meeting yesterday, I think this is a very nice addition and I like the implementation. I have left one comment that is not only about implementation details regarding the return value of getDataTypeName. I am not sure what others think there.

From a technical point of view I would then simply try to push as many failures to compile time rather than runtime, since we have a closed set of types that we want to support with this (the BUILTIN_TYPES sans std::string and the ALLOWED_FIXED_WIDTH_TYPES).

Comment on lines +68 to +69
/// fully qualified type name of stored POD elements - with namespace
std::string getDataTypeName() const override { return std::string("{{ (class | string ).strip(':')+"Data" }}"); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this will always be used like this

const auto bufferClassName = "std::vector<" + coll->getDataTypeName() + ">";

Should we simply make this return the full thing (and potentially give it a name that indicates that)? Or do we foresee any potential use cases where the Data name alone would be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a sense the fact that a vector<> is used here is an implementation detail of the ROOTReader/Writer and other future implementations (HDF5 ?) might make a different use of the underlying POD data type.
This is why I thought it would be more generic this way...

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To implement a compile-time constraint on the usable types it would be possible to do something along the lines of the following: Define a std::tuple containing all supported types.

using SupportedUserDataTypes = std::tuple<
    int,
    float,
    long,
    double
    >;

Around that it is then rather straight forward to implement the necessary enable_if machinery to make compilation fail in case of unsupported types

namespace detail {
  /**
   * Helper function to check whether a type T is in a std::tuple<Ts...>
   */
  template<typename T, typename ...Ts>
  constexpr bool inTuple(std::tuple<Ts...>) {
    return (std::is_same_v<T, Ts> || ...);
  }

  /**
   * Compile time helper function to check whether the given type is in the list
   * of supported types
   */
  template<typename T>
  constexpr bool isSupported() {
    return inTuple<T>(SupportedUserDataTypes{});
  }
}

/**
 * Alias template to be used to enable template specializations only for the types listed in the
 * SupportedUserDataTypes list
 */
template<typename T>
using EnableIfSupportedUserType = std::enable_if_t<detail::isSupported<T>()>;

Usage:

template<typename T
         typename = EnableIfSupportedUserType<T>>
// class or function that should only be available if T is in SupportedUserDataTypes

Comment on lines 15 to 22
template <typename BasicType>
std::string userDataTypeName() { throw std::runtime_error( std::string(" unsupported type for UserDataCollection: ")
+ typeid(BasicType).name() ) ; }

template <> std::string userDataTypeName<int>() {return "int" ; }
template <> std::string userDataTypeName<float>() {return "float" ; }
template <> std::string userDataTypeName<long>() {return "long" ; }
template <> std::string userDataTypeName<double>(){return "double" ; }
Copy link
Collaborator

@tmadlener tmadlener Sep 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be made into a

template<typename BasicType, 
         typename = EnableIfSupportedUserType<BasicType>> 
constexpr const char* userDataTypeName();

template<> constexpr const char* userDataTypeName<int>() { return "int"; }
// ... etc.

without any changes necessary to the rest of the code. To ease the adding of additional types one could introduce a small macro, but still with that we would have to spell out the list of supported times a second time (after the initial std::tuple).

#define ADD_USER_TYPE(type) template<> constexpr const char* userDataTypeName<type>( return #type; }

@gaede gaede changed the title [WIP] Add user data collections w/ basic types Add user data collections w/ basic types Sep 13, 2021
Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I just have a few comments that are all more or less about the same thing: Trying to cut down the list of supported types to the minimum while keeping all the currently present functionality.

However, they can also be addressed later IMHO.

*/
using SupportedUserDataTypes = std::tuple<
int, long, float, double,
unsigned int, unsigned, unsigned long,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unsigned int, unsigned, unsigned long,
unsigned, unsigned long,

Since unsigned int == unsigned, I think it can be removed here (which would then also make it slightly more consistent with short, long, etc...). However, we are listing more duplicate types here with the fixed width types in any case. Maybe we should find the smallest list necessary and put some static_asserts in place here to make sure that the expectations are met.

Comment on lines 67 to 83
PODIO_ADD_USER_TYPE(int)
PODIO_ADD_USER_TYPE(long)
PODIO_ADD_USER_TYPE(float)
PODIO_ADD_USER_TYPE(double)
PODIO_ADD_USER_TYPE(unsigned)
// PODIO_ADD_USER_TYPE(unsigned int)
PODIO_ADD_USER_TYPE(unsigned long)
PODIO_ADD_USER_TYPE(char)
PODIO_ADD_USER_TYPE(short)
PODIO_ADD_USER_TYPE(long long)
PODIO_ADD_USER_TYPE(unsigned long long)
// PODIO_ADD_USER_TYPE(int16_t)
// PODIO_ADD_USER_TYPE(int32_t)
// PODIO_ADD_USER_TYPE(int64_t)
PODIO_ADD_USER_TYPE(uint16_t)
// PODIO_ADD_USER_TYPE(uint32_t)
// PODIO_ADD_USER_TYPE(uint64_t)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider putting the fixed-width integers into the definitions here, because that would (almost) always give us a name without spaces (this seems to be a requirement for sio? at least there is the sio_name function that just replaces spaces with _).

The only exception is unsigned long long for which

static_assert(std::is_same_v<uint64_t, unsigned long long>);

fails, even though it seems to be 64 bits wide on most architectures. I am not entirely sure how to best handle that in case sio really does not handle spaces in its block names.

#include <vector>
#include <typeindex>

#define PODIO_ADD_USER_TYPE(type) template<> constexpr const char* userDataTypeName<type>(){ return #type ; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should #undef this after we have used it to not give the impression that this is an extension point for users. (Even if a user tries to use it they will only run into a compilation error in any case)

Comment on lines 7 to 24
static SIOBlockUserData<int> _defaultintCollcetionSIOBlock ;
static SIOBlockUserData<long> _defaultlongCollcetionSIOBlock ;
static SIOBlockUserData<float> _defaultfloatCollcetionSIOBlock ;
static SIOBlockUserData<double> _defaultdoubleCollcetionSIOBlock ;
static SIOBlockUserData<unsigned> _defaultunsignedCollcetionSIOBlock ;
static SIOBlockUserData<unsigned int> _defaultunsignedintCollcetionSIOBlock ;
static SIOBlockUserData<unsigned long> _defaultunsignedlongCollcetionSIOBlock ;
static SIOBlockUserData<char> _defaultcharCollcetionSIOBlock ;
static SIOBlockUserData<short> _defaultshortCollcetionSIOBlock ;
static SIOBlockUserData<long long> _defaultlonglongCollcetionSIOBlock ;
static SIOBlockUserData<unsigned long long> _defaultunsignedlonglongCollcetionSIOBlock ;
static SIOBlockUserData<int16_t> _defaultint16_tCollcetionSIOBlock ;
static SIOBlockUserData<int32_t> _defaultint32_tCollcetionSIOBlock ;
static SIOBlockUserData<int64_t> _defaultint64_tCollcetionSIOBlock ;
static SIOBlockUserData<uint16_t> _defaultuint16_tCollcetionSIOBlock ;
static SIOBlockUserData<uint32_t> _defaultuint32_tCollcetionSIOBlock ;
static SIOBlockUserData<uint64_t> _defaultuint64_tCollcetionSIOBlock ;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the macro defining the userDataTypeName, I think we should be able to restrict ourselves to a smaller list (potentially focusing on the fixed width integer types). Since under the hood they will in any case call the the same userDataTypeName and register as the same type in the factory, e.g.

static_assert(std::is_same_v<podio::SIOBlockUserData<unsigned>, podio::SIOBlockUserData<uint32_t>>);

works fine. So even if users specify unsigned it should be enough if we have an uint32_t specialization.

 - leave comments on potentiall eddressing this
   in more generic way in PODIO
@gaede
Copy link
Contributor Author

gaede commented Sep 14, 2021

Thanks @tmadlener - added your suggestions. As discussed privately, we might need a more generic treatment of ensuring consistency and size of fundamental types in PODIO eventually ...

using SupportedUserDataTypes = std::tuple<
int, long, float, double,
unsigned int, unsigned, unsigned long,
char, short, long long,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you quickly clean up the non-fixed width types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually intentionally left them in, as this is what we currently support in PODIO. I think we should address the issue of 'set of supported types' in PODIO consistently in a dedicate PR (with sufficient documentation). What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for a new feature we should already start restricted. We can open up easily later in case. The other way round is more complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there is no restriction really due to the equivalence of the ususal named integer types and the fixed size ones. But I agree that this makes this feature much clearer and it already does what we probably want to do in all of PODIO ...

  - this should be done for PODIO user data as well
    as discusse in EDM4hep meeting 21.09.21
@gaede
Copy link
Contributor Author

gaede commented Sep 21, 2021

Sure. Please merge after CI passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants